
An Integrated Data Analysis Suite and Programming Framework
for High-Throughput DNA Sequencing

Dissertation

submitted to the Faculty of Mathematics and Natural Sciences
of the Eberhard Karls Universität Tübingen
in fulfillment of the requirements for the degree of
Doctor of Natural Sciences (Dr. rer. nat.)

presented by
Dipl.-Inform. Felix Ott
from Basel

Tübingen
2013


Date of the oral examination: 18.12.2013
Dean: Prof. Dr. Wolfgang Rosenstiel
First reviewer: Prof. Dr. Detlef Weigel
Second reviewer: Prof. Dr. Daniel Huson


Declaration

I hereby declare that I have produced this work independently and only with the resources indicated, and that all passages taken from other works, in wording or in substance, have been identified by citation of their sources. Contributions by third parties that are discussed in the text are listed on the Contributions page. Furthermore, thematic overlaps with publications to which I have contributed are explicitly pointed out in the text. Further overlaps with future publications are possible.

Tübingen, 19 December 2013

Felix Ott


Contributions

The high-throughput DNA sequencing data analysis pipeline SHORE that this work is built around is the work of a variety of contributors. The foundations of the software were laid by Stephan Ossowski and Korbinian Schneeberger, and are meticulously described in Stephan Ossowski's Ph. D. thesis Computational Approaches for Next Generation Sequencing Analysis and MiRNA Target Search [1]. Sections 2.1, 2.3 and 2.6 are concerned with a direct continuation of this work. The prerequisites are, however, discussed explicitly, and further delineation is provided in section 1.7. Jörg Hagmann has supported the development with ideas and has contributed a multitude of improvements as well as modules for RNA editing analysis, bisulfite sequencing data analysis and multi-reference variation calling, which are, however, not the subject of this work. Jonas Müller has contributed RAD-Seq motivated extensions to the read demultiplexing code, which were incorporated into the methods described in section 2.4. Alf Scotland has supported parts of the coding effort, and Sebastian Bender has contributed further extensions to the software pipeline not discussed in this work.

In accordance with standard scientific protocol, first-person plural pronouns will be used to refer to the reader and the writer, or to my scientific collaborators and myself.


Acknowledgment

First of all I would like to thank my supervisors Detlef Weigel and Daniel Huson for supporting my thesis and for the opportunity to work on countless interesting and challenging questions at the Max-Planck-Institut Tübingen. Further, I am grateful to Steffen Schmidt and Karsten Borgwardt for helpful and encouraging advice given over the years in their role as members of my Thesis Advisory Committee.

Jörg Hagmann, Stephan Ossowski and Korbinian Schneeberger must be singled out, but I am grateful for the excellent collaboration with all other contributors to the SHORE software. Norman Warthmann, Jun Cao, Sang-Tae Kim and Stefan Henz, among many others, have been of great help by using, testing and suggesting improvements to the application in various areas. Markus Schmid, Levi Yant, David Posé Padilla, François Parcy, Edwige Moyroud, Stephan Wenkel, Ronny Brandt, Richard Immink and others deserve my gratitude for excellent ChIP-Seq collaborations, discussions and explanations.

There are plenty of other people I would like to thank for outstanding professional collaborations, advice or interesting discussions and input. Among them are Lisa Smith, Brandon Gaut, Jesse Hollister, Xi Wang, Marco Todesco, Felipe Fenselau de Felippes, Pablo Manavella, Ignacio Rubio Somoza and Christa Lanz. To Rebecca Schwab, Jörg Hagmann and Jonas Müller I am immensely grateful for doing a great job proofreading this thesis.

Finally, I apologize to those I forgot.


Zusammenfassung

Since 2005, the dideoxynucleotide chain-termination method has been complemented by a variety of parallel DNA sequencing methods. These have proved to be a key technology with a large number of applications in genetics, but at the same time pose a challenge through the flood of data they generate. A multitude of algorithms addressing different aspects of sequencing data processing and analysis have already been devised and implemented. Routine analysis of these data, however, additionally requires a well-coordinated collection of analysis tools covering all necessary steps from initial raw data processing up to the primary analysis results.

The software SHORE provides a modular, versatile data analysis solution for parallel DNA sequencing. In the present work, the concepts underlying SHORE are extended and generalized in order to accommodate the developments of the new sequencing methods. These developments include, among others, a markedly increased data yield as well as the spread of protocols that enable the multiplexing of samples by means of special barcode sequences. To enable broad support of fundamental features such as data compression and indexed query algorithms, a general-purpose C++ programming framework, libshore, was developed, which serves as the common basis of all data processing modules in SHORE. A further goal was to provide programming interfaces that support modular design as well as the parallelization of sequencing data processing algorithms.

Another focus of this work was the development of analysis algorithms for data obtained by chromatin immunoprecipitation. A key role in the regulation of DNA transcription is occupied by a class of DNA-binding proteins, the transcription factors. ChIP-Seq studies allow transcription factor binding sites to be mapped across the entire genome and therefore hold great potential to contribute to our understanding of the functioning of complex regulatory networks. In this work, an analysis module, SHORE peak, is presented that serves to process transcription factor immunoprecipitation data. The program combines statistical detection of enriched sequence regions with empirical rules for filtering out artifacts in order to ensure reliable identification of the decisive regions of the genome. Since replicate experiments, which are increasingly made feasible by falling sequencing costs, are an important means for identifying biologically relevant sequence regions, particular emphasis was furthermore placed on the simultaneous evaluation of such data sets.

The algorithms developed were additionally transferred to further analysis modules that were designed with the goal of maximally flexible configurability. In combination with the already included functionality for the detection of genomic variants, the software SHORE should be of use for a large part of the range of applications of the new DNA sequencing technologies. The general-purpose functionality of the libshore programming framework should moreover ensure straightforward extensibility with regard to future data analysis approaches.


Abstract

The various parallel DNA sequencing methods that have been introduced since 2005 to complement the established dideoxynucleotide chain-termination method have proved a key technology with a wide range of applications in genetic research, but also pose a challenge with a constant flood of data. A multitude of algorithms have been proposed and implemented addressing different aspects of sequencing data processing and data analysis for a variety of high-throughput sequencing applications. Routine data analysis, however, requires a consistently assembled tool chain to ensure smooth end-to-end processing, starting from raw sequencing read data up to primary analysis results.

The software SHORE aims to be a modular, general-purpose data analysis suite for high-throughput DNA sequencing. With this work, we extend and generalize the concepts of the SHORE analysis pipeline to accommodate the increased throughput of sequencing devices and the development of novel sequencing protocols such as bar-coded sample multiplexing. To provide universal support of basic features such as data set compression or indexing and query mechanisms, we develop a generic C++ programming framework, libshore, to form the foundation of all of SHORE's modules. Furthermore, we aim to provide a generic application programming interface facilitating modular design as well as parallelization of high-throughput sequencing data processing and analysis algorithms.

A further focus of this work was on the development of data analysis algorithms for chromatin immunoprecipitation and other sequence enrichment and expression profiling studies. Through genome-wide binding-site profiling for transcription factors, DNA-binding proteins occupying a key role in transcription regulation, ChIP-Seq studies have the potential to greatly improve our ability to understand the functioning of complex regulatory networks. We present an analysis module, SHORE peak, oriented towards the processing of transcription factor immunoprecipitation data. Our program combines statistical enrichment detection with empirical artifact removal rules to ensure robust identification of the most relevant sites. As replicate experiments are an important factor for the identification of biologically relevant sites, and are rendered more feasible by dropping sequencing costs, emphasis was put on the simultaneous processing of such data sets.

By transferring enrichment detection and expression analysis algorithms to further auxiliary modules emphasizing flexibility of configuration, as well as maintaining previously available variation analysis functionality, SHORE should represent a valuable resource for the majority of high-throughput sequencing applications. Furthermore, in combination with the generic functionality of the libshore framework, we hope to ensure extensibility to readily accommodate future analysis strategies.


Contents

1 Introduction 1
1.1 High-Throughput DNA Sequencing 1
1.2 Approaches to Parallel DNA Sequencing 2
1.3 Applications of High-Throughput Sequencing 3
1.3.1 Genome and Transcriptome Sequencing 4
1.3.2 ChIP-Seq and further Immunoprecipitation Protocols 5
1.4 Properties of Sequencing Data 6
1.5 Sequencing Data Analysis 8
1.5.1 Base Calling and Read Quality Assessment 8
1.5.2 Assembly 9
1.5.3 Short Read Alignment 9
1.5.4 Variation and Genotype Calling 10
1.5.5 Quantification by Deep Sequencing 11
1.5.6 ChIP-Seq Analysis 12
1.6 Storage and Representation of Sequencing Data 13
1.6.1 Storage Formats for Next-Generation Sequencing Data 13
1.6.2 Accelerated Queries on Sequencing Data 15
1.7 Contributions of this Work 15

2 A High-Throughput DNA Sequencing Data Analysis Suite 19
2.1 Overview 19
2.2 Efficient Storage of High-Throughput Sequencing Data Using Text-Based File Formats 21
2.2.1 Overview 21
2.2.2 Widely Compatible Indexed Block-Wise Compression 23
2.2.3 Efficient Queries on Text Files 25
2.2.4 Improved Compression of Read Mapping Data 27
2.2.5 Compression Results 29
2.2.6 Data Storage Considerations 32
2.3 A Non-Destructive Read Filtering and Partitioning Framework 33
2.4 A Flexible Sequencing Read Demultiplexing System 34
2.4.1 Overview 35
2.4.2 A Format for Description of Multiplexing Setups 36
2.4.3 Barcode Recognition and Sample Resolution 38
2.5 Versatile Oligomer Detection and Read Clipping 38
2.5.1 Overview 39
2.5.2 Dynamic Programming Alignment and Backtracing Pipeline 41
2.6 A Parallelization Front-End for Short Read Alignment Tools 43
2.7 Robust Detection of ChIP-Seq Enrichment 45
2.7.1 Overview 45
2.7.2 Correcting for Duplicated Sequences 45
2.7.3 Detection Phase 47
2.7.4 Recognition of Read Mapping Artifacts 47
2.7.5 Ranking and Assessment of Peak Significance 49
2.7.6 Experimental Relevance of Peak Significance 51
2.8 Visualization of Sequencing Read and Alignment Data 52
2.8.1 Gathering Run Quality and Sequence Composition Statistics 52
2.8.2 Visualization of Read Mapping Data 55
2.8.3 Visualization of Local or Genome-Scale Depth of Coverage 56
2.8.4 Quantile Correction 57

3 A C++ Framework for High-Throughput DNA Sequencing 61
3.1 Overview 61
3.2 A Modular Signal-Slot Processing Framework 64
3.2.1 Reader and Writer Concepts 64
3.2.2 Definition of Processing Network Topology 64
3.2.3 Module Implementation 67
3.2.4 In-Place Data Manipulation 69
3.3 A Simplified Parallelization Interface 71
3.3.1 Parallelization Modules 72
3.3.2 Parallel Pipeline Architecture 73

4 Closing Remarks 77

Bibliography 81


1 Introduction

Next-Generation Sequencing has evolved into a powerful tool for many areas of biological science, but has also introduced many new challenges related to data analysis and computing infrastructure into the field. Following a brief recapitulation of high-throughput DNA sequencing technology, this chapter highlights important areas of application while discussing previous and related work. We outline general sequencing data analysis methodology and conclude by establishing the motivation for this work.

1.1 High-Throughput DNA Sequencing

Although widely deployed for less than a decade, parallel next-generation sequencing technology has already had a considerable impact on genetic research. The novel high-throughput sequencing approaches are characterized by massive parallelism implemented in a single device. This design results in the key advantage of a significantly reduced cost per sequenced base, which has allowed it to quickly displace the previously predominant Sanger sequencing method in many areas of application.

Sanger sequencing, the prevalent sequencing method for the majority of the time since its introduction in 1977 [2], needed to rely on extensive automation in dedicated sequencing centers to achieve time- and cost-effective readout of large amounts of sequence. Most prominently, this was put into effect during the effort of generating the first draft sequence of the human genome [3-6]. By contrast, high-throughput sequencing instruments are designed to analyze thousands to millions of DNA molecules simultaneously, and thus enable even smaller institutions to produce large quantities of sequence data on site.

Secondary to the elucidation of DNA primary structure, sequencing utilized as a random sampling device delivers quantitative clues on sample composition. In this capacity, it has been used for diverse purposes such as inference of DNA methylation levels [7] or the analysis of environmental samples [8]. In concert with the economic advantage over the dideoxynucleotide chain-termination method, the wide accessibility of parallel DNA sequencing to researchers has served to promote deep sequencing approaches that open up whole new areas of application beyond the domain previously occupied by Sanger sequencing.

As an alternative to microarray technologies, for example in gene expression profiling or chromatin immunoprecipitation assays, deep sequencing overcomes fundamental technological restrictions such as probe resolution and probe saturation. Furthermore, without the inherent requirement of a priori knowledge of the sequences to be detected or quantified, the development of novel application protocols has been furthered, transforming high-throughput sequencing methods into a versatile tool for research.


1.2 Approaches to Parallel DNA Sequencing

An early parallel sequencing method, massively parallel signature sequencing (MPSS), was published in the year 2000 [9] and had already been available as a centralized, commercial service. However, as its short read length of only 17 to 20 nucleotides proved to be prohibitive in most potential areas of application, MPSS mainly found use as a gene expression profiling assay.

A technological and commercial breakthrough was marked by the deployment of integrated benchtop devices by a variety of different providers. The first high-throughput sequencing instrument brought to the market was the 454 Life Sciences Genome Sequencer FLX in 2005 [10]. Soon after, in 2006, followed Illumina's Genome Analyzer [11] instrument, and the Life Technologies SOLiD system [12] in 2008.

Over a short period of time, considerable technological maturation has ensued, and vendors have started to broaden their portfolios by tailoring their respective solutions to different operational scenarios, exemplified by the Illumina HiSeq and MiSeq devices or the 454 Life Sciences benchtop model GS Junior. On the other hand, diverse alternative approaches have been proposed and introduced into the market, notably a semiconductor-based solution marketed as Ion Torrent by Life Technologies [13] and the Pacific Biosciences single-molecule sequencing system PacBio RS [14].

As their defining feature, all high-throughput sequencing systems share the parallel interrogation of large numbers of DNA molecules. Sample DNA to be analyzed is typically applied to an expendable flow cell or flow chip, a specialized glass slide or microtiter plate particular to the respective technology. Aside from such fundamental analogies, the respective methods of sequence interrogation differ in key aspects. The basis of most current technologies is formed by the sequencing-by-synthesis (SBS) principle, with the notable exception of the SOLiD sequencing-by-ligation method. Sequencing-by-synthesis decodes the sequence of DNA molecules by keeping track of nucleotides incorporated by a DNA polymerase during complementary strand synthesis, with the distinguishing feature of the different technological implementations being the manifold methods employed for detecting events of nucleotide incorporation.

The 454 family of sequencers realizes a pyrosequencing approach. The four deoxyribonucleoside triphosphates (dNTPs) adenine, cytosine, guanine and thymine are flowed cyclically over microtiter wells holding clonally amplified DNA templates, where each flow of reagents delivers a specific type of dNTP. The amount of pyrophosphate released during nucleotide incorporation, which is determined by the number of nucleotides incorporated in each flow, is converted into light intensity through an enzymatic reaction and recorded by a CCD camera [10].
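
To make the flow-based readout concrete, the following C++ sketch converts a flowgram into a nucleotide sequence by rounding each flow's intensity to a homopolymer length. The flow order, the simple rounding rule and the example values are assumptions chosen for illustration; they do not reproduce the vendor's actual signal processing.

#include <cmath>
#include <iostream>
#include <string>
#include <vector>

// Decode a pyrosequencing-style flowgram: each flow delivers one dNTP type,
// and the (rounded) signal intensity gives the homopolymer length.
std::string decode_flowgram(const std::vector<double>& intensities,
                            const std::string& flow_order = "TACG")
{
    std::string sequence;
    for (std::size_t i = 0; i < intensities.size(); ++i) {
        int homopolymer_length = static_cast<int>(std::lround(intensities[i]));
        char base = flow_order[i % flow_order.size()];
        sequence.append(homopolymer_length, base);
    }
    return sequence;
}

int main()
{
    // Hypothetical flowgram values for the flow order TACG, repeated twice.
    std::vector<double> flows = {1.02, 0.03, 2.10, 0.97, 0.08, 1.05, 0.02, 3.01};
    std::cout << decode_flowgram(flows) << '\n';  // prints TCCGAGGG
    return 0;
}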

The Illumina systems are based on reversible chain termination chemistry. A mixture of all four dNTPs is flowed over a slide with randomly distributed clusters of clonally amplified templates. The dNTPs are modified by a chain terminator, limiting elongation of the synthesized strand to a single nucleotide. Furthermore, each type of dNTP is distinguished by fusion to a specific fluorophore, allowing for identification of the respective incorporated base by means of laser excitation and a CCD detector. Reagents subsequently applied displace the reversible terminator and dye to enable further strand elongation in the following cycle [11].

The Ion Torrent design adopts the homopolymer detection approach introduced by pyrosequencing. However, instead of pyrophosphate, the release of hydrogen ions serves as a proxy for strand elongation. The individual pH measurements are realized non-optically on an expendable semiconductor chip [13].

PacBio RS sequencers implement a technology termed single-molecule real-time sequencing (SMRT). The method is enabled by zero-mode waveguides (ZMW), nanostructures that enable probing of volumes smaller than the wavelength of light [15]. In contrast to the other current technologies described, the SMRT approach therefore does not need to rely on clonally amplified DNA templates, and instead single DNA molecules and polymerases suffice for obtaining a recognizable signal. DNA polymerases are immobilized in ZMW detection volumes, creating a slide suitable for the parallel observation of polymerase activity. Fluorescently labeled dNTPs are employed as the substrate, where nucleotides being incorporated are detected via laser excitation and a CCD detector. The fluorophore itself is displaced upon nucleotide incorporation [14].

As opposed to sequencing-by-synthesis reliant on DNA polymerase, SOLiD sequencing constitutes a sequencing-by-ligation technique that exploits DNA ligase enzymes for sequence decoding. Probe oligomers partially consisting of degenerate and universal bases [16] are ligated to interrogate di-nucleotides at a spacing of three base pairs. When reaching the full read length, the template is reset and the ligation process is repeated five times at different offsets, and thus each of the template's bases is measured twice through different probe oligomers. Four different fluorophores serve to label the 16 types of probe to obtain an ambiguous color space code. The properties of the dibase color encoding are such that for any base, the ambiguity may be resolved as long as the identity of the preceding base is known [12].
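
To illustrate how this ambiguity is resolved, the sketch below decodes a color-space read into bases given a known leading base, using the common XOR-style assignment of the four colors to dinucleotide transitions (with A=0, C=1, G=2, T=3). This is a toy illustration under those assumptions, not vendor code.

#include <iostream>
#include <string>

// Decode a SOLiD-style color-space read. Colors 0-3 encode dinucleotide
// transitions; with A=0, C=1, G=2, T=3 the color is the XOR of the two base
// codes, so knowing the preceding base resolves the ambiguity for the next one.
std::string decode_color_space(char leading_base, const std::string& colors)
{
    const std::string bases = "ACGT";
    int prev = static_cast<int>(bases.find(leading_base));
    std::string decoded;
    for (char c : colors) {
        int color = c - '0';   // color call given as digit '0'..'3'
        prev ^= color;         // next base code = previous base code XOR color
        decoded += bases[prev];
    }
    return decoded;
}

int main()
{
    // A 'T' primer base followed by six color calls; a single wrong color would
    // shift all downstream bases, which is why the dibase encoding reads each
    // base twice at different offsets.
    std::cout << decode_color_space('T', "320010") << '\n';  // prints AGGGTT
    return 0;
}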

The following discussion centers on Illumina sequencing technology as the primary focus of the present work, although it applies to varying extents to multiple or all of the currently available high-throughput sequencing devices.

1.3 Applications of High-Throughput Sequencing

While high-throughput sequencing devices in general excel in the cost-effective generation of massive amounts of sequence data, the new technologies are still associated with weaknesses regarding error rates and the maximal achievable sequencing read length (section 1.4). These weaknesses limit the capacity to generate base-by-base readouts of entire genome sequences. Nonetheless, the concerted application of complementary technologies including Sanger sequencing has served to greatly accelerate the pace at which novel reference assemblies for manifold model organisms are being generated.

Certainly, despite the reduction in cost-per-base that has been achieved, decoding larger, repeat-rich genome sequences remains a massive task with current DNA sequencing technology. Genomic inter-individual and inter-species variation may, however, be assessed while circumventing the issue of global-scale sequence topology (section 1.3.1), which has thus evolved as one of the primary applications of high-throughput sequencing. In turn, interpretation of data generated with the aim of such consensus-based characterization of sequence variation is greatly facilitated by the availability of a suitable reference genome assembly.

Similarly, the resequencing scenario facilitates data interpretation for an extensive family of deep sequencing and genome-wide screening applications. As the throughput of sequencing devices has grown to enable routine sequencing even of large genomes to multiple depths of coverage, its potential for quantitative measurement has been leveraged for the assessment of diverse genomic properties. With stock genome sequencing protocols, deep sequencing allows estimation of quantifiable statistics such as gene copy number [17-19], application to forward genetic screens [20, 21] or dissection of environmental samples [22-24]. To exploit its potential for quantification in a wide range of further research settings, high-throughput sequencing is complemented by a host of different sample and library preparation protocols acting as adapters between the property of interest and DNA sequencing technology.

RNA conversion protocols facilitate investigation of transcript structure and abundance [25-29] and address the structure, expression and distribution of specific types of transcript such as long non-coding RNA or small RNA [27, 30-33]. In ChIP-Seq, MeDIP-Seq or HITS-CLIP, immunoprecipitation-based sequence enrichment protocols provide insight into diverse genetic and epigenetic mechanisms including transcription factor binding events, histone and DNA modifications or the specificity of RNA binding proteins [17, 34-38].

Conversely, depletion protocols such as MNase-Seq are for example applicable to the investigation of nucleosome positioning [39]. An additional option for the study of epigenetic features by high-throughput sequencing is available in sequence conversion protocols utilizing bisulfite treatment of the sample DNA (BS-Seq) [27, 40, 41].

Further protocols aiming at a reduction of library complexity can supply significantly reduced cost of sequencing for certain applications, such as retrieval of genetic markers through sequencing of restriction site associated DNA (RAD-Seq) [42, 43].

By standard sample preparation protocols, a molecular mixture retrieved from millions of cells is assessed. This property is exploited deliberately for example by genomic enrichment or epigenetic sampling protocols such as ChIP-Seq as well as BS-Seq. On the other hand, single-cell sequencing protocols are being developed to enable the maximally fine-grained assessment of cellular populations [44].

While each sequencing protocol has on its own furthered our understanding of the molecular makeup of cells, the integration and correlation of the different types and sources of data now available shows great promise to provide detailed insight into the complex interplay of the various mechanisms governing the cellular machinery.

1.3.1 Genome and Transcriptome Sequencing

Molecular genetics studies how an organism's genome interacts with its cellular environment, ultimately shaping the organism's phenotype inside its habitat. Learning how this interaction can be influenced may ultimately help to develop new capabilities to counteract diseases or enhance traits like disease resistance or crop yield. Analysis of the DNA-level differences between organisms, regardless of whether they are of the same species or not, presents an entry point to obtaining clues on this interaction. With the emergence of next-generation sequencing, for the first time researchers are able to employ this approach on the very large scale, exemplified by a variety of major sequencing projects both on the intra- and inter-species level like the 1000 genomes project [45, 46] for the human genome, the 1001 genomes project [19] for the Arabidopsis thaliana genome, the Genome 10K project [47] with the goal of analyzing 10000 vertebrate genomes or the metagenomic human microbiome project [48, 49].

Due to the short lengths of the sequence reads recovered by most high-throughput sequencers, most studies circumvent the complex task of whole genome assembly, focusing on small-scale sequence variants or localized larger-scale rearrangements (section 1.5.4). The genetic markers thus recovered can be processed further, allowing application to a wide range of scientific questions. Recovered on the population scale, they provide insight into a species' population structure and the evolution of its genomes and of its subpopulations. When integrated with phenotype data, functional clues may further be obtained by means of genome-wide association studies (GWAS) (e. g. [50]).

Knowledge of gene expression profiles proves valuable in many different contexts, and may provide a means for the classification of entire cells on the one hand, as well as of certain genes or groups of genes with respect to their involvement in specific cellular pathways or functions on the other. While gene expression has been routinely assessed on the large scale using microarray technology, the method suffers from non-linear measurement, probe saturation and a suboptimal signal-to-noise ratio.

High-throughput sequencing, by contrast, allows an essentially unlimited dynamic range, bounded only by allocatable effort and amounts of cellular material. Unlike microarrays, sequencing further enables recovery and quantification of unexpected transcripts and splice forms.

A variety of different transcriptome sequencing protocols are available, distinguished by their suitability for assessment of transcript structure and quantification as well as their capabilities to detect certain types of transcript and to recover transcript directionality. As with all examples of quantification by sequencing, data interpretation may be intricate (section 1.5.5).

A special class of transcript is given by small RNA, short cellular RNA molecules ~20-25 nucleotides in length that are ubiquitous in almost all well-studied eukaryotes. These have been identified as another vital factor for transcript regulation, and are further implicated in various important cellular processes such as DNA methylation and pathogen response [51, 52]. Transcript levels are affected by small RNA through RNA interference (RNAi). As the specificity-determining component of RNA-induced silencing complexes (RISC), they direct the degrading enzymatic activity of ARGONAUTE proteins at the targeted, roughly complementary transcripts.

Small RNA molecules are attributed to multiple cellular pathways in plants and animals, generally involving double-stranded RNA or RNA stem-loop structures. According to the generating biochemical pathway, small RNA can be categorized into different classes, notably microRNA (miRNA) and small interfering RNA (siRNA), linked with overlapping, but distinct cellular roles.

High-throughput small RNA profiling by sequencing facilitates genome-wide discovery, categorization and relative quantification for the various small RNA species. However, small RNA quantification is sensitive to sequence-specific bias [53], with the short length of small RNA likely increasing the variance of base composition bias (section 1.4) and emphasizing the importance of ligation biases. Furthermore, relative quantification with reference to library size is often deemed unsuitable for an investigation, and consensus on appropriate normalizing transformations is currently lacking (cf. [54]). Nonetheless, small RNA sequencing data prove a valuable quantitative resource on the genome scale, e. g. as a proxy for RNA-directed DNA methylation [55] or in correlation with further transcriptome or epigenetic data.
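
As a concrete example of the library-size normalization referred to above, the minimal C++ sketch below scales raw counts to reads per million mapped reads. The locus names and counts are made up for illustration, and the adequacy of this simple transformation for small RNA data is exactly what is debated in the cited literature.

#include <iostream>
#include <map>
#include <string>

// Scale raw counts to "reads per million" (RPM): count / library size * 1e6.
// This is the simplest library-size normalization; more elaborate
// transformations exist but no consensus has been reached for small RNA data.
std::map<std::string, double>
reads_per_million(const std::map<std::string, long>& raw_counts)
{
    long library_size = 0;
    for (const auto& entry : raw_counts)
        library_size += entry.second;

    std::map<std::string, double> rpm;
    for (const auto& entry : raw_counts)
        rpm[entry.first] = 1e6 * entry.second / library_size;
    return rpm;
}

int main()
{
    // Hypothetical read counts for three small RNA loci.
    std::map<std::string, long> counts = {
        {"miR156", 1200}, {"miR172", 300}, {"siRNA_cluster_1", 8500}};
    for (const auto& entry : reads_per_million(counts))
        std::cout << entry.first << '\t' << entry.second << '\n';
    return 0;
}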

1.3.2 ChIP-Seq and further Immunoprecipitation Protocols

Expression of genes is thought to be connected in complex regulatory networks of positive and negative feedback. A key role in these networks is occupied by transcription factors, DNA-binding proteins involved with the initiation of gene transcription at promoter sequences. ChIP-Seq, the successor of the microarray-based ChIP-chip technology featuring an improved signal-to-noise ratio and sharper signal bounds, facilitates the identification of putative transcription factor targets in a relevant in-vivo context. By elucidation of individual transcription factors' targets, our understanding of the complex transcriptional interplay can be gradually improved.

Immunoprecipitation (IP) methods exploit antibody-antigen affinity to enrich specific molecules in a solution. In the course of the procedure, antibodies for the antigen to be concentrated are incubated with the solution to promote the formation of antibody-antigen complexes. Antibodies are captured on solid-phase beads, thus enabling precipitation to enrich for the formed complexes.

Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) employs the IP technique for in-vivo detection of regions of co-localization of a specific molecule, usually a protein, with the genomic DNA. During the procedure, live tissue is prepared with formaldehyde or an appropriate alternative cross-linking reagent. Formaldehyde undergoes a reaction that induces cross-linking of amino groups in close vicinity, thereby promoting the formation of DNA-protein complexes (DPC) (cf. Barker et al. [56]). Following lysis of the tissue, the DNA is sonicated, yielding DPC associated with only short fragments of genomic DNA. The DPC are immunoprecipitated and, by purification and amplification of the contained DNA, prepared as a sequencing library. Following deep sequencing, the precipitated tags facilitate identification and localization of putative protein binding sites.

Instrument                         Max. Average Read Length   Read Pairs / Flowcell   Run Time
Illumina MiSeq                     2x250bp                    16M                     39h
Illumina Genome Analyzer IIx       2x150bp                    300M                    14 days
Illumina HiSeq                     2x100bp                    1400M                   11 days
454 Life Sciences GS Junior        ~400bp                     100000                  10h
454 Life Sciences GS FLX+          ~700bp                     1M                      23h
Life Technologies Ion PGM          up to 400bp                100000-5M               3.7h-7.3h
Life Technologies Ion Proton       up to 200bp                ~70M                    2h-4h
Life Technologies 5500W (SOLiD)    1x75bp / 2x50bp            ~1600M                  6 days
Pacific Biosciences PacBio RS      ~4.5kbp                    ~22000                  2h

Table 1.1: DNA Sequencer Specifications According to Device Vendors (3/2013)

Immunoprecipitation is complemented by techniques that seek to deplete segments of the DNA that are not protein-bound, e. g. using nucleases like micrococcal nuclease (MNase) to digest the unbound parts of the DNA fragments following cross-linking and sonication. MNase-Seq is used on its own to investigate nucleosome positioning [39, 57], but can also be combined with ChIP-Seq for improved binding site resolution.

When available, immunoprecipitation of the DPC is achieved using a specific antibody raised against the protein of interest. Alternatively, the experiment is performed using transgenic organisms where the protein is tagged by an appropriate adapter protein, e. g. GFP [58], for which a high-quality antibody is available. However, this fusion protein approach comes with the drawback of possible tag-induced alteration of protein function or binding affinity.

Apart from its application to the identification of transcription factor targets [34, 35], ChIP-Seq is most prominently applied for genome-wide screening of histone modifications [36]. Using 5-methylcytosine-specific antibodies, immunoprecipitation is further applicable for generating genome-wide maps of DNA methylation (MeDIP-Seq) [37].

RNA analogues of the ChIP-Seq strategy, high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation (HITS-CLIP) [38] and photoactivatable-ribonucleoside-enhanced crosslinking and immunoprecipitation (PAR-CLIP) [59], are used in a similar fashion to investigate the in-vivo binding of RNA-associated proteins like e. g. the miRNA binding protein ARGONAUTE.

1.4 Properties of Sequencing Data

Modifications to both the instruments themselves as well as to the sequencing chemistry utilized have led to vast improvements in per-run output since the first iteration of next-generation sequencers, which has frequently drawn comparisons to the famous Moore's Law of the development of integrated circuits [60, 61].

A further reduction of overall sequencing cost remains a primary objective in the further development of the technologies, often illustrated by the catch phrase of the "1000 dollar [human] genome" [13, 62]. Efforts directed at this goal include the ongoing pursuit of new sequencing technologies as well as streamlining key operational aspects such as ease of library preparation, run time and quantization of throughput, i. e. the minimal amount of sequence that must be generated per sequencing run to achieve acceptable cost-per-base. However, with dropping cost-per-base, further differentiated considerations including data handling and analysis cost as well as the unique capabilities of the respective technologies are increasingly moving into focus.

Differences between the various technologies also imply different behavior regarding important parameters such as read length, the frequency, type and randomness of sequencing errors or sequence-dependent depth-of-coverage biases. The impact of each property varies with the targeted field of application. Table 1.1 provides a comparison of read length, throughput and run time of sequencing instruments. Despite the specifications of the young technologies being in constant flux, the overview highlights the technologies' strengths and weaknesses discussed in the following.

Reversible chain terminator as well as sequencing-by-ligation approaches enforce in-phase measurement of all DNA templates being sequenced. Therefore, these devices provide a fixed, user-selectable read length. Albeit currently providing the highest level of parallelism and of per-run throughput in terms of sequenced bases, they offer a significantly shorter maximal read length compared to the other methods discussed.

Sequencing-by-synthesis methods that, as in the case of 454 and Ion Torrent instruments, imply measurement of the homopolymer sequence of the respective DNA templates generate sequence reads of variable length, determined by the actual nucleotide composition of the molecules. Whereas the minimum read length equals the number of sequencing cycles, i. e. one quarter of the number of dNTP flows, the average read length thus depends on the type of DNA being sequenced.

By contrast, no phasing of strand elongation is enforced by the single-molecule real-time sequencing approach, neither on the single-base nor on the homopolymer level. The method instead attempts near-continuous tracking of the strand elongation process. SMRT devices produce variable-length reads, with the read length largely dependent on the binding affinity of the polymerase enzyme utilized by the technology. The method currently achieves the longest average read length, albeit with a broad spectrum of read lengths and a below-average level of parallelism.

The dideoxynucleotide chain-termination sequencing method has undergone a considerable time span of continued refinement. Therefore, although next-generation instruments are improving steadily with more and more sophisticated chemistry and signal processing, Sanger sequencing technology still defines the gold standard in terms of single-read accuracy. Apart from stochastic measurement fluctuations, all present technologies feature an error profile with an increase in error frequency towards the end of the read. Typical sources of systematic decline in sequence quality include deterioration of the sequencing reagents over time or loss of phasing for methods relying on clonally amplified DNA. Orthogonally, overall read quality is influenced by factors like cross-talk between adjacent spots, mixed clusters or optical distortion [63].

Due to phased probing of the template DNA, measurement error with the Illumina and SOLiD instruments mostly induces base substitution errors. Conversely, 454 and Ion Torrent instruments are primarily susceptible to over- or underestimation of homopolymer lengths, which results in an elevated frequency of base insertion and deletion errors. SMRT suffers from an elevated frequency of nucleotide deletion errors.

All platforms including Sanger sequencing are furthermore subject to sequence-specific biases, where both the probability of sampling specific sequences as well as single-base accuracy can be strongly influenced by the local sequence context. Besides the implications for sequencing data analysis and interpretation, uneven error rates and sampling probabilities result in an increase of the required average sequence coverage and thus elevated cost-per-base.

Sampling bias may be introduced not only during the actual steps of the sequencing procedure, but also during the weakly standardized sequencing library preparation, with e. g. PCR amplification being an important and frequently cited source of bias [64]. Sequencing bias is classified either according to the introducing reaction or step of the library preparation, such as ligation or fragmentation bias, or according to the sequence properties it is attributed to, such as base composition or end biases.

Since the quality and strength of these biases depend on an only vaguely understood interplay of variable factors like fragmentation size and exact PCR conditions, they are not easily corrected for and must be considered during data analysis and interpretation of results.

1.5 Sequencing Data Analysis

While data analysis strategies for different applications of high-throughput sequencing naturally diverge at some point following the initial base calling of the read sequences, they frequently share common data processing tasks or generic approaches such as reference genome-based resequencing. In addition to software solutions offered by device vendors or other commercial providers, countless computational utilities are developed by the scientific community implementing specific analysis strategies or prerequisites of generic analysis approaches such as read mapping.

Following the base calling step, in which nucleotide sequences are deduced from signal intensities delivered by the sequencing device, raw sequence data undergo quality control and further processing according to parameters dictated by the respective application and analysis strategy. Subsequent strategies diverge into different classes such as reference sequence-guided resequencing approaches, de-novo sequence analysis such as genome assembly, as well as reference-free multi-sample comparison for diverse purposes such as e. g. assessment of variation or microRNA content. Primarily motivated by the relatively short read lengths offered by current sequencing technology, reference sequence-guided resequencing remains the most widely used and practical method (section 1.5.3).

1.5.1 Base Calling and Read Quality Assessment

Base calling is the initial step of deducing an actual nucleotide sequence from the raw signal measurements delivered by sequencing machines, e. g. the brightness levels measured by the CCD detector for fluorescence-based techniques. Parallel sequencers operate by sampling the entire measurement area supporting large numbers of DNA molecules (e. g. the flow cell) at certain intervals (section 1.2). Instrument measurements are initially stored as a series of images, where each image represents the signal intensities across the measurement area at a single sampling interval. The stack of images generated is subsequently processed in software.

In short, base calling software is required to locate spots within the measurement area that represent valid DNA templates, and further analyze the run of signal intensities at each spot to provide a nucleotide sequence that likely explains the observed intensities. In addition to this maximum likelihood estimate of a sequencing read's nucleotide sequence, a measure of reliability for the issued base calls is desirable. Due to their conceptual simplicity, PHRED-like per-base quality scores [65, 66] have become the de facto standard across all platforms. Additional measures that capture the peculiarities of the respective sequencing process, like e. g. individual base likelihood in the case of reversible chain terminator sequencing or homopolymer accuracy in the case of pyrosequencing, can be calculated by specific base callers but are widely disregarded by downstream analysis software.
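
A minimal sketch of these two outputs is given below: it calls the brightest of four channel intensities per cycle and attaches a PHRED-like score Q = -10 log10(p), with the error probability p crudely estimated from the intensity ratio. The four-channel readout and the error estimate are simplifying assumptions for illustration and do not correspond to any vendor's actual base caller.

#include <algorithm>
#include <array>
#include <cmath>
#include <iostream>
#include <vector>

struct BaseCall {
    char base;
    int  phred;   // PHRED-like quality: Q = -10 * log10(error probability)
};

// Call one base from four channel intensities (order A, C, G, T) and derive a
// crude error probability from the relative weight of the competing channels.
BaseCall call_base(const std::array<double, 4>& intensity)
{
    static const char bases[4] = {'A', 'C', 'G', 'T'};
    std::size_t best = std::max_element(intensity.begin(), intensity.end())
                       - intensity.begin();
    double total = intensity[0] + intensity[1] + intensity[2] + intensity[3];
    double p_error = 1.0 - intensity[best] / total;   // naive estimate
    p_error = std::max(p_error, 1e-6);                // cap the maximal score
    int phred = static_cast<int>(std::lround(-10.0 * std::log10(p_error)));
    return {bases[best], phred};
}

int main()
{
    // Hypothetical intensities for three sequencing cycles of one cluster.
    std::vector<std::array<double, 4>> cycles = {
        {{1200, 30, 25, 40}}, {{50, 45, 900, 60}}, {{300, 280, 20, 30}}};
    for (const auto& c : cycles) {
        BaseCall b = call_base(c);
        std::cout << b.base << " Q" << b.phred << '\n';
    }
    return 0;
}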

Base calling software is usually highly device-specific and thus provided by the instrument vendors; alternative software solutions exist in some cases [67-71], but have not become widely adopted.

Quality of high-throughput sequencing reads can be highly variable and is dependent on many different factors (section 1.4). Therefore, prior to further analysis it is often desirable to assess the quality of the generated sequencing reads, and to pre-filter the raw read data accordingly. Furthermore, depending on the sequencing protocol and application, the relevant sequences are often fused to utility oligomers like barcodes, sequencing adapters or linker sequences that are still present in the sequencer's output and interfere with the direct application of the appropriate analysis algorithm. Countless tools and scripts have been developed for these purposes and are in some cases made freely available.

FastQC [72], for example, is a simple Java application providing visualization of average read quality and various other properties of the data set. The FASTX Toolkit [73] provides tools for creating sequence quality charts, adapter removal and demultiplexing operations. TagDust [74] is a tool that identifies library artifacts such as sequencing primer dimers. Similar capabilities are offered by the tool TagCleaner [75], with additional read clipping, trimming and splitting options. A collection of Perl scripts for read filtering, quality-based trimming and creating read quality charts is provided by the NGS QC Toolkit [76].
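
The kind of pre-filtering performed by such tools can be illustrated with a small sliding-window quality trimmer. The sketch below is a generic stand-in rather than code from any of the cited tools, and the window size and quality threshold are arbitrary choices.

#include <iostream>
#include <string>
#include <vector>

// Trim the 3' end of a read at the first sliding window whose mean PHRED
// quality falls below a threshold; returns the length of the retained prefix.
std::size_t quality_trim_length(const std::vector<int>& qualities,
                                std::size_t window = 4, int min_mean_q = 20)
{
    if (qualities.size() < window)
        return qualities.size();
    for (std::size_t start = 0; start + window <= qualities.size(); ++start) {
        int sum = 0;
        for (std::size_t i = start; i < start + window; ++i)
            sum += qualities[i];
        if (sum < min_mean_q * static_cast<int>(window))
            return start;                 // cut before the low-quality window
    }
    return qualities.size();
}

int main()
{
    // A made-up read whose quality decays towards the 3' end.
    std::string read = "ACGTACGTACGT";
    std::vector<int> quals = {38, 37, 36, 35, 34, 30, 28, 20, 12, 8, 5, 2};
    std::size_t keep = quality_trim_length(quals);
    std::cout << read.substr(0, keep) << '\n';   // prints the trimmed prefix ACGTAC
    return 0;
}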

1.5.2 Assembly

While the classical domain of large-scale DNA sequencing is the assembly of entire genomes, due to their short read length most high-throughput sequencing technologies do not produce data optimally suited to this approach. Computational tools based on alternative assembly algorithms and data structures such as De Bruijn graphs have been developed that can better cope with large amounts of short reads, for example the Velvet [77], Oases [78], SOAPdenovo [79], ALLPATHS [80, 81] and ALLPATHS-LG [82, 83] assemblers. Despite these improvements, stock short-read assemblies still do not reach the quality of typical laboriously generated reference genomes [84]. Generating high-quality reference assemblies continues to require enormous effort, often involving the construction of BAC libraries and large amounts of man-hours both on the wet lab and data analysis sides for closing gaps in the assembly, ordering the assembled scaffolds or identifying mis-assembled regions.
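
As a sketch of the De Bruijn graph structure mentioned above (a toy construction, not taken from any of the cited assemblers), the following code links the (k-1)-mer prefix of every read k-mer to its (k-1)-mer suffix; real assemblers add error correction, graph simplification and scaffolding on top of such a graph.

#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Build a De Bruijn graph: nodes are (k-1)-mers, and every k-mer of every
// read contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
std::map<std::string, std::set<std::string>>
build_de_bruijn_graph(const std::vector<std::string>& reads, std::size_t k)
{
    std::map<std::string, std::set<std::string>> graph;
    for (const std::string& read : reads) {
        if (read.size() < k)
            continue;
        for (std::size_t i = 0; i + k <= read.size(); ++i) {
            std::string kmer = read.substr(i, k);
            graph[kmer.substr(0, k - 1)].insert(kmer.substr(1));
        }
    }
    return graph;
}

int main()
{
    // Two overlapping toy reads; k = 4 gives 3-mer nodes.
    std::vector<std::string> reads = {"ACGTACG", "GTACGTT"};
    for (const auto& node : build_de_bruijn_graph(reads, 4)) {
        std::cout << node.first << " ->";
        for (const std::string& successor : node.second)
            std::cout << ' ' << successor;
        std::cout << '\n';
    }
    return 0;
}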

These remaining issues prevent large-scale multiple whole-genome comparison approaches, and as a consequence, solutions to data representation and methodology for this type of analysis are not yet well established. In the medium term, sequencing technology can be expected to reach a sufficient mixture of sequencing cost, throughput, read length and accuracy to render whole-genome assembly a rather routine task. For genome-wide variation analysis, this should bring about a shift towards such large-scale assembly approaches.

1.5.3 Short Read Alignment

While direct genome assembly remains impractical, most applications of next-generation sequencing are analyzed through resequencing approaches that rely on the availability of a suitable reference genome sequence. Relating all data to a single assembled genome circumvents the issue of assembly, facilitates comparisons by establishing a common coordinate system and allows analysis results to be examined in the context of a fully refined genome annotation.

Mapping millions of short reads to reference genomes up to multiple gigabases in size constitutes one of the computationally most demanding tasks of the resequencing strategy. In the presence of sequencing errors, erroneous reference sequences and possibly genomic variation such as SNPs, indels and rearrangements, perfect match queries between read and reference sequence cannot in general deliver satisfactory results.

The read alignment problem presents a string matching task that has quickly been met by a variety of proposed algorithms and software solutions, confronting the scientific community with a multitude of freely available alignment tool options such as BWA [85], Bowtie [86], Bowtie2 [87], SOAP [88], SOAP2 [89], SHRiMP [90], SHRiMP2 [91], MAQ [92], GenomeMapper [93], PALMapper [94] or CloudBurst [95].

To achieve satisfactory performance, alignment tools rely upon precomputed, static index data structures usually generated for the reference genome, ranging from simple k-mer hashes or spaced-seed strategies to suffix trees, suffix arrays or equivalent compressed full-text indexes such as the Burrows-Wheeler transform (BWT) based FM-index [96]. The genome indexing approach has implications regarding both performance and hardware requirements of an alignment algorithm. Disk space and index memory required for k-mer indexes, suffix trees and plain suffix arrays amount to a multiple of the size of the indexed nucleotide sequence. In contrast, compressed BWT-based indexes have the advantage of an inherent representation of the genome sequence, with their size approaching that of the compressed sequence.

Different index data structures aside, read mapping algorithms are set apart by their choice of trade-offs between mapping speed, accuracy, sensitivity and supported alignment parameters. At the same time, mapping accuracy is partially influenced by mapping sensitivity, since a missed mapping location for a read may lead to spurious mappings to a different location of similar sequence. Read mapping algorithms commence by identifying potential origins of seeds, short substrings of the sequencing read matching the reference genome sequence at high sequence identity. The trade-off between mapping speed on the one side and accuracy and sensitivity on the other is greatly influenced by seed selection, seed length as well as the required level of sequence identity. Tools may decide to ignore seeds that occur with very high frequency in the reference genome, further accelerating the mapping process at the cost of sensitivity.
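The seeding step can be pictured with a plain k-mer hash index, one of the indexing schemes named above: the reference is indexed once, each read then looks up a few of its k-mers to collect candidate mapping positions, and over-represented seeds are skipped. The seed length, the non-overlapping seed placement and the repeat cutoff are illustrative assumptions, not the design of any specific mapper.

```python
from collections import defaultdict

def index_reference(reference, k=11):
    """Map every k-mer of the reference to the list of positions at which it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def seed_candidates(read, index, k=11, max_hits=100):
    """Collect candidate alignment start positions from exact k-mer seed matches.

    Seeds occurring more than max_hits times in the reference are ignored,
    trading sensitivity in repetitive regions for speed.
    """
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):       # non-overlapping seeds
        hits = index.get(read[offset:offset + k], [])
        if 0 < len(hits) <= max_hits:
            candidates.update(hit - offset for hit in hits)
    return sorted(c for c in candidates if c >= 0)

reference = "AGGCTTACGGATCCGTTAGCAAGCTTACGGATCCGT"
index = index_reference(reference, k=11)
print(seed_candidates("TTACGGATCCGTTAGC", index, k=11))   # two candidate loci
```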

Following seed placement, follow-up alignment algorithms establish the actual base pairings between sequencing read and reference sequence, in the process identifying the best possible match for the entire read according to the scoring measure defined by the mapping tool. Furthermore, many mapping utilities also allow the explicit or implicit retrieval of the next-best mapping location, permitting the reliability of the assigned location (mapping quality) to be inferred under the assumption of completeness and correctness of the provided reference genome sequence.
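Mapping quality is conventionally reported on the same PHRED-style logarithmic scale as base quality; assuming that convention (the scale is not spelled out above), the two quantities can be written as

```latex
Q_{\mathrm{base}} = -10\,\log_{10} P(\text{base call incorrect}), \qquad
Q_{\mathrm{map}}  = -10\,\log_{10} P(\text{assigned mapping location incorrect}),
```

so that, for example, a reported mapping quality of 30 corresponds to an estimated probability of about 10^-3 that the read was placed at the wrong location.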

The set of features supported by mapping and alignment algorithms determines their suitability for different kinds of sequencing data or applications. However, omission of certain features may help with optimization of the algorithm. Gapped alignments, for example, come with a significant performance hit, but are indispensable for variation detection or for aligning sequencing reads obtained by pyrosequencing, Ion Torrent or SMRT technology. Further gap-related features such as affine gap costs or split-read approaches permit correct placement of a read in the presence of longer insertions and deletions or intron-exon boundaries.

Sequencing reads with special properties, such as color space or bisulfite-converted data, furthermore require fully customized mapping and alignment strategies.

1.5.4 Variation and Genotype Calling

Reversing the assembly and multi-genome comparison approach (section 1.5.2), resequencing has become the current method of choice for assessing genomic variation. In the resequencing setting, all read data for certain subspecies, strains or mutants are mapped to a single reference sequence of a closely related organism. The complexity of the analysis task is therefore shifted from establishing and examining complex relations between multiple assembled genomes towards obtaining a comprehensive and reliable readout of the structural information contained in the short read alignments.

Basic consensus calling is limited to the detection of conserved regions and the small-scale variation contained therein, such as single nucleotide polymorphisms and small insertions and deletions. Such regions of evident sequence conservation are directly suggested by the short read mappings. The task for analysis algorithms is therefore to assess each of the supported positions to identify the most likely combination of alleles to give rise to the observed data, along with a measure of confidence for the provided solution.

Identification of this most likely combination of alleles at each reference genome position is therefore the main problem to be solved by analysis algorithms. A major issue to be overcome by these algorithms is the integration of models for sequencing errors, read mapping errors and base alignment errors. While the probability of sequencing errors may be systematically influenced by the local sequence context, individual error events in different reads may be considered largely independent. Sources of mapping and base alignment error are however mostly systematic in nature. Reads and aligned bases therefore do not represent independent evidence for a certain allele configuration, preventing these types of error from being as readily captured by statistical modeling.

The majority of consensus and genotype calling algorithms commence by examining the pileup of base calls and base qualities provided by the mapped reads overlapping a position. Given sufficient depth of coverage, occasional sequencing errors can thereby be identified at high confidence.
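A minimal version of such a pileup-based caller, assuming a diploid sample, independent sequencing errors given by the PHRED base qualities and a uniform prior over genotypes (simplifications not taken from the text above), computes a likelihood for each candidate genotype and reports the most likely one together with its margin over the runner-up:

```python
import math
from itertools import combinations_with_replacement

def genotype_likelihood(genotype, pileup):
    """Log-likelihood of a diploid genotype given (base, phred_quality) observations.

    Each observed base is assumed to stem from either allele with probability 1/2,
    and to be miscalled with probability 10^(-Q/10), spread evenly over the
    three other nucleotides.
    """
    loglik = 0.0
    for base, qual in pileup:
        p_err = 10 ** (-qual / 10.0)
        p = 0.0
        for allele in genotype:
            p += 0.5 * ((1.0 - p_err) if base == allele else p_err / 3.0)
        loglik += math.log(p)
    return loglik

def call_genotype(pileup, alleles="ACGT"):
    """Return the genotype with the highest likelihood under a uniform prior."""
    genotypes = list(combinations_with_replacement(alleles, 2))
    scored = [(genotype_likelihood(g, pileup), g) for g in genotypes]
    scored.sort(reverse=True)
    best, second = scored[0], scored[1]
    confidence = best[0] - second[0]   # log-likelihood margin over the runner-up
    return "".join(best[1]), confidence

# A position covered by 6 reads: 4 support 'A', 2 support 'G' -> a heterozygous call is likely.
pileup = [("A", 35), ("A", 30), ("G", 32), ("A", 38), ("G", 30), ("A", 25)]
print(call_genotype(pileup))
```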

However, sequencing errors may not only cause invalid readout of single positions, but also spurious cross-mappings to loci with sequence similar to that of the true origin of a read. The likelihood of such spurious mappings may be assessed using measures like the mapping quality provided by some alignment tools (section 1.5.3). Furthermore, heuristic parts of the mapping algorithms used to achieve improved mapping performance may also cause systematic cross-mappings that can be indistinguishable from true variants or give rise to invalid multi-allelic calls. Such pseudo multi-allelic regions may further arise due to copy number variants, under-representation of repetitive sequence in the reference assembly or sample contamination.

Another source of systematic miscalls is base alignment error. Base alignment error is introduced by reads that, albeit mapped to the locus on the reference genome from which they originate, are assigned an alignment by the read mapping tool that does not suggest the correct set of base pairings. This type of error mainly occurs at transition points to insertions, deletions, copy number variants or other types of rearrangement, or at exon-intron boundaries in mRNA sequencing. Reads overlapping these transition points with only a small number of positions do not contain enough information to support the correct type of variant, causing spurious alignment of terminal nucleotides.

For these reasons, current consensus and genotype calling algorithms employ combinations of statistical models as well as alignment correction algorithms and heuristics for removing or penalizing read alignment issues.

The Genome Analysis Toolkit (GATK) [97] combines Bayesian heterozygous genotype calling based on quality score recalibration with indel realignment and various heuristics for penalizing mapping artifacts [98].

SHORE [18] implements a flexibly calibratable scoring matrix approach [1, 99] that calculates SNP, indel and reference call scores based on a user-definable weighting of various properties of the read alignments overlapping a position. The method allows a number of terminal nucleotides towards the ends of each alignment to be penalized or ignored, ruling out many cases of base alignment error at the cost of an effective loss of sequencing depth.

The SAMtools package [100] implements a profile HMM for estimation of a Base Alignment Quality (BAQ) for the assessment of potential mapping artifacts [101].

1.5.5 Quantification by Deep Sequencing

In principle, deep sequencing constitutes statistical sampling of the population of molecules that make up the sequencing library. Therefore, the obtained sequencing data may be transformed into quantities that represent a frequency distribution, which may be used to infer relative quantities with reference to this population.

Depending on the application, these measurements may be processed further for mere classification purposes (e. g. enriched versus not enriched in chromatin immunoprecipitation (ChIP) assays) or transformed into quantities appropriate for multi-sample comparison (e. g. normalized expression levels in RNA sequencing).

For routinely used, but weakly defined concepts such as gene expression, definition of an appropriate reference for obtaining biologically meaningful quantities can be intricate. Depending on the application, the appropriate unit of measurement might be absolute, i. e. the number of copies of a molecule in the cell or in the sample, a cellular or compartmental concentration, or relative to the total population of molecules of the same family. To a certain degree, relative measurements can be obtained directly from sequencing data, whereas calculation of absolute values is in general not possible. For identification of differentially expressed loci in a multi-sample comparison, it may however be sufficient to represent all measurements in relation to a common reference value. Given that many loci contribute to the sequencing library, with a majority following the same distribution in all conditions examined, the problem of calculating a common reference value can be reduced to the identification of outliers in a multi-sample comparison.

Next to basic compensation for sequencing library size, further specialized normalization procedures are therefore often applied to the raw counts of sequenced reads. Some of these procedures attempt to infer quantities that are better suited to the comparison of the biological conditions in question, such as estimates of absolute quantities. Further aims of normalization include stabilization of measurement variance or removal of sequence-specific biases. As an example, for whole transcriptome sequencing a normalization procedure termed trimmed mean of M-values (TMM) [102] has been shown to compare favorably with plain relative quantification. For data with different properties such as microRNA profiles, consensus regarding data normalization is yet to be reached [103].
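The difference between plain relative quantification and a robust scaling factor can be sketched as follows. The trimmed-mean factor below is only loosely inspired by TMM: it discards the most extreme per-gene log ratios and averages the rest, but omits the expression-dependent trimming and precision weighting of the published method, and the counts are invented for the example.

```python
import math

def cpm(counts, library_size):
    """Plain relative quantification: counts per million mapped reads."""
    return {gene: 1e6 * c / library_size for gene, c in counts.items()}

def trimmed_mean_factor(sample, reference, trim=0.3):
    """Simplified trimmed-mean scaling factor between two libraries.

    Per-gene log2 ratios of library-size-scaled counts are computed, the most
    extreme ratios are discarded, and the remainder is averaged. The published
    TMM method additionally trims by absolute expression and applies precision
    weights, which are omitted here.
    """
    n_s, n_r = sum(sample.values()), sum(reference.values())
    ratios = sorted(
        math.log2((c / n_s) / (reference[g] / n_r))
        for g, c in sample.items()
        if c > 0 and reference.get(g, 0) > 0
    )
    if not ratios:
        return 1.0
    k = int(len(ratios) * trim)
    kept = ratios[k:len(ratios) - k] or ratios
    return 2 ** (sum(kept) / len(kept))

sample    = {"geneA": 500, "geneB": 100, "geneC": 2000, "geneD": 50}
reference = {"geneA": 250, "geneB":  50, "geneC":  100, "geneD": 25}
print(cpm(sample, sum(sample.values())))
print(trimmed_mean_factor(sample, reference))   # geneC, the outlier, is trimmed away
```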

1.5.6 ChIP-Seq Analysis

While chromatin IP (section 1.3.2) enriches for sequence fragments that contain a binding site for the protein of interest, no perfect purification is achieved. The obtained sequencing data therefore tend to contain significant amounts of background noise in the form of reads that represent DNA fragments that were sequenced purely by chance rather than due to their affinity to the DNA-binding protein. Primary data analysis involves the identification of those genomic regions that potentially contain one or more binding sites, recognizable in the mapped read data by characteristic peaks in sequencing depth at the respective sites and their close vicinity.

Single transcription factors are often thought to contribute to the regulation of hundreds or thousands of different genes, and to take on different roles in various regulatory networks. This creates the need for peak calling algorithms that are able to process the ChIP-Seq data to identify all putative binding sites in a genome-wide scan. Multiple freely available computational tools including MACS [104], QuEST [105], SISSRs [106], PeakSeq [107], cisgenome [108] or FindPeaks [109] implement different algorithmic approaches to the peak calling problem.

MACS employs estimation of the local depth-of-coverage background using a sliding window. From this background estimate, p-values for enrichment are calculated based on the Poisson distribution. The program is able to operate with or without control libraries and calculates false discovery rates through a sample-control swap.
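The core of such a Poisson-based test can be sketched in a few lines: given a local background rate estimated from the surrounding windows (or from a scaled control), the p-value of a candidate window is the Poisson upper tail of its observed read count. The window counts, the flank size and the background floor used here are illustrative assumptions, not the MACS implementation.

```python
import math

def poisson_sf(k, lam):
    """Upper-tail probability P(X >= k) for X ~ Poisson(lam).

    The first term P(X = k) is computed in log space for numerical stability;
    subsequent terms follow the recurrence P(X = i+1) = P(X = i) * lam / (i+1).
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total, i = 0.0, k
    while True:
        total += term
        i += 1
        term *= lam / i
        if term <= total * 1e-15:
            break
    return min(1.0, total)

def window_enrichment(counts, flank=5, min_background=1.0):
    """Assign each window a Poisson p-value against its local background.

    The background rate is the mean count of up to `flank` windows on either side,
    floored at min_background to avoid zero-rate artifacts in sparse regions.
    """
    results = []
    for i, observed in enumerate(counts):
        neighbours = counts[max(0, i - flank):i] + counts[i + 1:i + 1 + flank]
        background = max(min_background, sum(neighbours) / len(neighbours))
        results.append((i, observed, poisson_sf(observed, background)))
    return results

# Read counts in consecutive genomic windows; window 5 stands out as a candidate peak.
counts = [3, 2, 4, 3, 2, 40, 3, 4, 2, 3, 2]
for idx, observed, p in window_enrichment(counts):
    if p < 1e-6:
        print(f"window {idx}: count={observed}, p={p:.2e}")
```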

SISSRs calculates a net tag count as the difference between forward and reverse strand read mappings in short jumping windows. Base line transitions of the net tag count are retained as sites of putative enrichment, which are further compared to Poisson-motivated read support thresholds.

The PeakSeq program identifies candidate enriched regions through segment-wise calculation of a depth-of-coverage threshold obtained by simulating a random read distribution, taking into account a pre-computed mappability map to capture repetitive parts of the genome. The candidates thus obtained are assessed further using a binomial test comparing sample and scaled control data.

Particular to the QuEST algorithm, peak detection relies on score profiles generated by kernel density estimation over the reads' 5′ mapping coordinates rather than on depth of coverage. Peak calling is achieved by identification of local score maxima and a combination of score and fold change thresholds.

While a multitude of different analysis algorithms thus exist, due to the difficulty of assessing the quality of the provided solutions there is generally no consensus on which of the available algorithms performs best in a given scenario.

1.6 Storage and Representation of Sequencing Data

The high and constantly improving throughput of sequencing instruments implies routine production of uncommonly large data sets. The amount of data produced poses challenges not only regarding data transfer, processing and analysis, but also in terms of data storage. In the desire to guarantee the best possible traceability of scientific results and all potential sources of error, the benefit of long-term storage of all raw data must be balanced against storage cost, further considering the likelihood of again requiring certain types of data and the cost of eventually regenerating experimental data. Under these considerations it has become accepted practice to require long-term storage of at least the nucleotide sequences and PHRED qualities of sequencing reads, whereas the preservation of raw signal measurements or further quality measures (section 1.5.1) is deemed optional.

Generally, the large storage size of the data sets emphasizes the importance of the trade-off between space and access efficiency. While off-line long-term storage solutions can focus mainly on space efficiency, with the only restriction being that the computational effort required to convert to and from the compressed representation of the data should remain within tolerable bounds, on-line representation of the data during the data analysis phase has to provide additional guarantees regarding data access.

1.6.1 Storage Formats for Next-Generation Sequencing Data

High-throughput sequencing data sets store elements like sequencing reads or read alignments, each of which defines a tuple of attributes, such as a read identifier, nucleotide sequence and per-base qualities. Additional meta-data items like sequencing instrument, run or processing information may need to be kept alongside the data set. On-line sequencing data representation has to at least guarantee efficient sequential traversal of a data set's elements, i. e. enable efficient retrieval of all of the respective next element's attributes. Furthermore, it is typically required to allow storage of the data set elements in a specific order not dictated by space efficiency considerations. Finally, accelerated queries for data set elements with specific properties, such as range queries for retrieval of elements overlapping a given region of the genome, require a certain degree of random access support from the storage solution.

While generic solutions such as relational database management systems (RDBMS) address many of these requirements, they come with an excess of further features while impeding data set distribution. Therefore, most sequencing data analysis solutions rely on custom file-based storage solutions. File formats in the bioinformatics field have traditionally been ASCII text files following simple formatting rules. Examples include the de facto standard formats for representation of biological sequences, FASTA and FASTQ [110]. For description of large numbers of features that are relative to a reference sequence, simple tab-delimited tables have emerged as the predominant storage formats. For example, SAM [100, 111] is a standard file format used for storing sequencing reads and read alignments. GFF [112] is a generic format for description of genomic features that is primarily used for, but not restricted to, the representation of genome annotation data, and its subset specification GVF [113] is targeted at the description of genomic variation like SNPs or structural variants. VCF [114] provides a specification for storage of multi-sample variation and genotype information. SAM, GFF and VCF are structured similarly, with each data set element being represented on a single line. Common or mandatory element attributes like reference genome coordinates are stored as mandatory columns, whereas optional attributes are added in arbitrary order in the form of key-value pairs (tags). All three formats are generic formats that allow users to specify arbitrary additional attributes, and allow inclusion of meta-data in the file in the form of lines starting with a certain character sequence serving as meta-data indicator. The SAM format comes with an equivalent binary companion format, BAM, with the aim of reduced file parsing overhead.
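The shared line-oriented structure of these formats means that a basic reader reduces to splitting on tabs and collecting trailing key-value attributes. The sketch below assumes GFF-style ';'-separated key=value attributes and a '#' meta-data prefix; it is a generic illustration, not a complete parser for any of the three specifications.

```python
def parse_feature_line(line, n_fixed=8):
    """Split one tab-delimited record into fixed columns and a dict of key=value tags.

    Assumes GFF-like conventions: the first n_fixed columns are mandatory and the
    final column holds ';'-separated key=value attributes.
    """
    fields = line.rstrip("\n").split("\t")
    fixed, tags = fields[:n_fixed], {}
    if len(fields) > n_fixed:
        for item in fields[n_fixed].split(";"):
            if "=" in item:
                key, value = item.strip().split("=", 1)
                tags[key] = value
    return fixed, tags

def read_features(lines, meta_prefix="#"):
    """Yield parsed records, skipping meta-data lines marked by the given prefix."""
    for line in lines:
        if line.startswith(meta_prefix) or not line.strip():
            continue
        yield parse_feature_line(line)

example = [
    "##gff-version 3",
    "Chr1\tTAIR10\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010;Name=NAC001",
]
for fixed, tags in read_features(example):
    print(fixed[0], fixed[3], fixed[4], tags.get("ID"))
```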

While processed data like variant calls are usually moderate in size, space efficiency is a concern for raw read and read alignment data, suggesting the application of data compression algorithms. The generic compression algorithm DEFLATE [115] is one of the most widely used compression methods due to a trade-off between compression efficiency and compression and decompression speed that is acceptable in many settings. Generic compression algorithms are also applied to sequencing data; e. g. the BAM format is routinely encoded as BGZF [111], a simple subset specification of the GZIP [116] format for DEFLATE-compressed data.

However, compression efficiency can be improved through the use of sequencing data specific encoding methods or preprocessing. To some extent, high-throughput sequencing data compression is related to the issue of DNA sequence compression. Complete genome sequences are known to be hardly compressible beyond the typical 2-bit encoding of the four bases using common generic compression algorithms [117]. Most efforts preceding the emergence of high-throughput sequencing have thus been directed at generating efficient representations of such large biological sequences. However, algorithms that have been made available for this purpose are not readily applicable to short read sequencing data.

Common approaches used by most DNA compression algorithms in the literature rely on suffix arrays or similar index data structures for detecting and collapsing perfectly or imperfectly repetitive regions in a small set of large sequences. As such, these implementations are not suited to the task of compressing the huge sets of very short sequences generated by a high-throughput sequencing run. Analogous approaches are however available for short read sequences. Imposing a specific ordering on the elements of a data set potentially leads to an increase in entropy, and thereby reduced compression efficiency. The utility SCALCE implements a read reordering method using a similarity-based clustering approach that serves as a compression booster in combination with generic compression algorithms and formats [118]. However, as a specific element order is often a requirement of analysis algorithms, SCALCE and similar reordering approaches are most suitable for off-line long-term storage and data distribution. A different approach is taken by the tool Quip, which implements an assembly-based short read compression scheme where reads are represented by their position on specifically assembled contigs [119]. Assembled contigs must be stored along with the data set elements, but allow efficient compression of data sequenced at multi-fold depth of coverage.

While these approaches focus on the compression of short nucleotide sequences, those represent just one of the attributes to be stored in next-generation sequencing data sets, along with read identifiers, per-base PHRED qualities and possibly read alignments. Various methods therefore attempt to improve the compression ratio for these heterogeneous collections of data. Notably, the CRAM format [120] aims at being a drop-in replacement for the SAM/BAM format, to a large extent supporting the same logical structure and set of features. The main features of CRAM are reduced representations of aligned read sequences and per-base qualities. For aligned sequences, the format implements reference compression: in reference-compressed alignments, only the lengths of read subsequences that perfectly match the reference genome, or an equivalent representation, are stored. While in principle a lossy operation, all discarded information may subsequently be recovered as long as the appropriate reference sequence is available. For PHRED quality data, CRAM offers a choice of various lossy binning schemes aimed at reducing the entropy introduced with the read qualities. Following such preprocessing operations, CRAM encodes data into a custom binary format utilizing well-known encoding schemes such as Golomb [121] and Huffman coding [122].
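The reference-compression idea can be made concrete with a toy encoder that replaces an aligned read by its mapping position plus a list of substitutions relative to the reference, so that a perfectly matching read costs little more than its coordinate; decoding reconstructs the read from the reference. Insertions, deletions, clipping and quality values are ignored, so this illustrates the principle rather than the CRAM encoding itself.

```python
def reference_compress(read, reference, position):
    """Encode an aligned read as (position, length, [(offset, base), ...]).

    Only substitutions relative to the reference slice are recorded; a read that
    matches perfectly is reduced to its coordinate and length.
    """
    window = reference[position:position + len(read)]
    mismatches = [(i, b) for i, (b, r) in enumerate(zip(read, window)) if b != r]
    return position, len(read), mismatches

def reference_decompress(record, reference):
    """Rebuild the original read sequence from the reference and the stored differences."""
    position, length, mismatches = record
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)

reference = "ACGTACGTTTGACCAGTACGGATCCA"
read = "TTGACCAGTTCGGA"   # matches the reference at position 8 except for one substitution
record = reference_compress(read, reference, 8)
print(record)
print(reference_decompress(record, reference) == read)
```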

1.6.2 Accelerated Queries on Sequencing Data

Support for accelerated queries on a data set is required to enable rapid data visualization [123, 124] and data set exploration. Indexed range queries are e. g. supported by the BigWig and BigBed formats [123], the BAM format or the utility Tabix [125] using R-tree [126] based indexing [124].

While stock generic compression algorithms permit streaming encoding and decoding and are thus compatible with the goal of sequential, ordered traversal of data sets, they do not satisfy random access requirements. This disadvantage is commonly mitigated by enabling seeking to defined positions in the stream. These decoding offsets are realized by periodically either saving the complete state of the encoder, or causing an encoder reset. In both cases, the cost is an effective decrease in compression efficiency, either because of the overhead of saving the series of encoder states along with the compressed data, or because the encoder needs to be retrained following each reset. Additionally, an index data structure must be integrated to allow conversion between compressed and uncompressed stream offsets. As a result, there is a trade-off between compression ratio and random access granularity, and thus efficiency.

1.7 Contributions of this Work

The software SHORE is a processing and analysis pipeline for high-throughput DNA sequencing data [1, 18]. The pipeline is targeted at a reference sequence-based resequencing data analysis approach, combining raw read manipulation and filtering, read mapping and several analysis modules specific to different sequencing applications.

With this work, large parts of our pipeline were reworked to cope with increasingly large amounts of data, to optimize workflows and to meet extended analysis requirements. In the process, SHORE was embedded into a generic programming framework to provide universal support of introduced features across all processing modules (section 3.1). The increasing storage sizes of data sets produced by high-throughput sequencing devices have been addressed by implementation of a variety of data compression options. At the same time, increased processing times were countered by extended parallel processing facilities, and specifically by distributed memory parallelization of the read mapping process. Furthermore, with greater data set size, efficient retrieval of specific data set elements or subsets has gained importance for data set exploration and visualization. To meet these requirements, appropriate data indexing and query algorithms were incorporated. Exploiting the data set indexing facilities, within several analysis modules we address basic visualization of sequencing read alignments and of the depth-of-coverage distribution to mitigate the need for time-consuming data set copying and conversion for use with external visualization solutions.



Our application programming framework libshore was implemented as a common foundation for all SHORE processing modules. In addition to elementary functionality such as automatic input and output decoding and encoding, algorithms for indexing sequencing-related data and for alignment of biological sequences are provided. Furthermore, we developed a generic processing framework with the aim of facilitating the breakdown and implementation of rather complex sequencing data analysis algorithms as collections of manageable and self-contained processing modules. We further extended this framework for straightforward parallelization of basic sequencing data processing.

The fundamental data analysis workflow introduced with the initial version of SHORE consists of the subsequent application of read filtering, read mapping and application-specific analysis modules. While this basic approach has been maintained, it has been extended in many areas to simplify a variety of routine tasks. An overview of the modified workflow is therefore given in section 2.1.

Significant modifications to core pipeline modules concern raw sequencing read import and filtering as well as read mapping. Data import was modified to enable recovery of raw data or iterative data set re-filtering (section 2.3), and complemented with extended functionality to address e. g. the increasingly widespread use of a variety of library multiplexing and further adapter ligation protocols. For this purpose, flexible sequence-specific read partitioning and clipping facilities have been added (sections 2.4, 2.5). SHORE's read mapping module initially constituted a basic parallelization wrapper providing a unified interface to different short read alignment tools. To take advantage of an increasing diversity of read mapping algorithms, features have been added to enable integration of the results obtained using different alignment algorithms or parameters (section 2.6).

Application-specific data analysis modules initially implemented in SHORE mainly addressed the assessment of genomic variation. This functionality has largely been maintained, with only slight modifications to take advantage of the libshore infrastructure. A focus of this work, however, was on quantitative applications of high-throughput sequencing, and primarily on ChIP-Seq data analysis. Within the SHORE analysis environment, we have implemented an enrichment detection module targeted at the analysis of transcription factor immunoprecipitation data (section 2.7). In comparison to other approaches, our method emphasizes robustness towards read mapping artifacts and simplifies the handling of replicate experiments. Our ChIP-Seq enrichment detection module is complemented by configurable auxiliary utilities allowing the composition of custom workflows applicable to various expression profiling or enrichment detection applications.

In conclusion, with this work the SHORE software has been developed from an analysis pipeline consisting of a set of largely independent submodules targeted primarily at genomic variation assessment into a tightly integrated, general-purpose sequencing data analysis suite supporting a set of coordinated, interdependent features. The SHORE sequencing data analysis suite is open source software freely available under the GNU General Public License (GPL) version 3 at http://shore.sf.net.


2 A High-Throughput DNA Sequencing Data Analysis Suite

The constant pace of development of high-throughput sequencing technology and protocols is a force driving continued adaptation and improvement of related software solutions. The present chapter is concerned with the data storage solutions and retrieval algorithms, high-throughput sequencing data analysis algorithms and software utilities that were developed in the course of this work for use with the SHORE DNA sequencing data processing and analysis pipeline.

2.1 Overview

The amounts of data produced by high-throughput sequencing devices have from the start posed a challenge with respect to data analysis. Since their introduction, countless computational utilities have been developed to satisfy different data processing and analysis needs in all areas of application. Large-scale high-throughput sequencing projects however emphasize the importance of standardized workflows and data storage. With data analysis extending over multiple phases as well as a significant time span, and overall man-hours distributed among multiple individuals, coordinated efforts to ensure coherence and reproducibility of processing and analysis results become imperative.

A basic level of standardization is readily achieved by reliance on wholesale software solutions, with all required data processing and analysis algorithms based on a common framework. However, apart from commercial offers, few freely available high-throughput sequencing applications provide integration, to varying degrees, of more than just isolated processing and analysis steps, e. g. SOAP, SAMtools [100] or GATK [97].

With the SHORE software, we provide an integrated high-throughput sequencing read processing and analysis framework based on consistent data storage concepts. SHORE offers an end-to-end pipeline including all relevant steps, starting from initial raw read data filtering and partitioning and proceeding up to primary analysis results. An overview of SHORE's general data analysis workflow is provided in figure 2.1. While in principle consistent with what has been described [1] (cf. section 1.7), we implement a variety of extensions including optional recovery of raw data, on-the-fly merging, subset selection and filtering as well as data visualization, as discussed in the following.

In a reference sequence-guided resequencing approach, primary data analysis proceeds in three main stages. First, raw sequencing read data obtained from the base calling software are subjected to quality control and demultiplexing algorithms (figure 2.1(a)). The processed reads are subsequently mapped to the appropriate reference genome sequence (b). Application-specific algorithms then interpret the read alignment data as the final step of the primary analysis (d), such as genotype calling, assessment of structural variation, enrichment detection or differential expression analysis. Approaches to follow-up analysis of the generated primary results are manifold and highly divergent, and are therefore not currently integrated with the pipeline.


Figure 2.1: The SHORE Workflow. (Schematic: base calling produces unprocessed reads; (a) import, filtering and demultiplexing yields processed reads, with (e) recovery back to unprocessed reads and (g) visualization and quality assessment; (b) read mapping yields read alignments, which may pass through (c) merging, selection and filtering before (d) analysis produces results for external analysis or visualization; (f, h) export of reads and alignments supports archiving or distribution, and (i) visualization supports alignment exploration.)

Our analysis pipeline is realized as a command line application organized as a collection of submodules. A module SHORE import provides initial processing of raw read data and logically organizes its output in a predefined directory hierarchy, which forms the basis for the operation of further modules (section 2.3). The module provides a comprehensive set of commonly required filtering operations such as quality-based filtering and read trimming, removal of sequencing adapters or demultiplexing of bar-coded samples (section 2.4). To facilitate iterative adjustment of filtering parameters or distribution of raw read data to third parties, the module further provides functionality to fully restore its original input data (e).

Analysis of SHORE-processed data by means of third-party analysis utilities is possible through format conversion. A module SHORE convert provides read and alignment data export to a variety of commonly supported data exchange formats such as FASTA, FASTQ, SAM, BED or GFF (f, h). To assess overall run quality, nucleotide composition and base quality distributions may optionally be visualized with a module SHORE tagstats (g, section 2.8).

Data analysis methods implemented in SHORE primarily follow a reference sequence-oriented resequencing approach (section 1.5). The provided module SHORE mapflowcell is responsible for the required mapping of the processed read data to a reference genome sequence. The mapping module is realized as a high-level command line interface and parallelization front-end to a variety of widely used read mapping tools such as GenomeMapper, BWA or Bowtie2 (section 2.6).



Read mapping data stored as a result of the SHORE mapflowcell module constitute the input for data analysis modules such as variant calling or enrichment detection (d). Available analysis modules cover SNP and indel detection (SHORE qVar, consensus), ChIP-Seq analysis (SHORE peak, section 2.7), reference-based small RNA analysis as well as the versatile modules SHORE coverage and SHORE count, applicable to a variety of expression analysis and enrichment sequencing approaches. A further module SHORE mapdisplay as well as SHORE coverage and count support data set exploration through visualization of read alignments or depth of coverage, respectively (i, section 2.8).

Alignment input data to analysis and visualization modules can be selected on-the-fly by region of interest, merged with data of e. g. further sequencing runs, and passed through a variety of filtering operations such as duplicate sequence removal (c, sections 2.2.3, 2.7.2).

The SHORE data analysis suite is designed for seamless integration into established Unix command line workflows. Where appropriate, modules support streaming input from and output to other commands. SHORE's data are stored in simple line-based text file formats utilizing standards-compliant compression formats, facilitating data manipulation with stock text processing utilities. To support routine application in a specific scenario, all of SHORE's defaults may be pre-configured through user-specific configuration files.
may be pre-congured through user-specic conguration les.<br />

2.2 Efficient Storage of High-Throughput Sequencing Data Using Text-Based File Formats

For large data sets as generated by high-throughput sequencing machines, space-efficient on-disk representation is imperative. At the same time, during active analysis phases data must remain easily retrievable, requiring an appropriate compromise between compression ratio and access efficiency (section 1.6). This section presents the data storage formats, indexing and retrieval mechanisms implemented in the SHORE pipeline.

Typical binary file formats for sequencing data come with both advantages and disadvantages compared to traditional text-based alternatives. SHORE makes use of compressed text file formats for read alignment data that achieve similar or better compression ratios compared to compressed binary formats such as CRAM. SHORE's file compression formats all support streaming input and output as well as near-random read access to decompressed file offsets. Users may flexibly choose the desired trade-off between compression ratio, decoding and encoding speed and random access guarantees. Specialized utilities offer binary-search-like queries and range queries on compressed or uncompressed data sets for many types of text-based file formats popular with the bioinformatics community.

2.2.1 Overview

Driven by the amounts of short read and alignment data generated during HTS data analysis, specialized binary storage formats have been introduced, notably the BAM format [100] and the CRAM format [120], both designed for storage of aligned as well as unaligned short read data. BAM employs binary encoding of values with the aim of reduced file parsing overhead, i. e. to improve the efficiency of conversion between the disk and the in-memory representations of data. On the other hand, CRAM defines a specialized format with the aim of enabling increased compression ratios by supporting custom encodings for certain attributes of the data set elements.

In spite of the reasons in favor of specialized binary data encoding, there are also drawbacks compared to traditional ASCII text-based storage formats. Parsing custom binary formats requires careful consideration of technical details like machine endianness and the representation of floating point numbers. Additionally, in-depth knowledge of the relevant data encoding and compression algorithms is inevitable. Unless a parsing library is provided for the programming language of choice, readout of binary formats is therefore often too laborious or beyond basic programming skills, effectively limiting users of the format to choose from a narrow set of supported programming languages.

In contrast, basic parsers are straightforward to implement for simple tab-delimited table formats. Requirements such as string tokenization and conversion between character strings and numeric types are well supported by high-level scripting languages popular in the bioinformatics field such as Perl or Python, and are considered a basic skill in next to any programming language. The line-based formats are furthermore accessible to shell scripting and readily filtered through widely used Unix text processing utilities such as grep, awk, sed or sort. While conversion between file and in-memory representation is less efficient compared to optimized binary formats, the introduced overhead is marginal in relation to the overall CPU consumption of data analysis algorithms.

Furthermore, delimited text tables can serve as a basic framework for file format definitions such as SAM, GFF or VCF. This common foundation enables the implementation of generic algorithms applicable to a large variety of different file formats, as demonstrated for example by the tool Tabix [125] for range indexing of genomic feature file formats. Adapting similar algorithms to different specialized binary formats is laborious, requiring reimplementation or the implementation of sophisticated adapter mechanisms.

Format issues are easily identifiable with text-based file formats, without the requirement of expert knowledge or the implementation of specialized format verification routines. Software dealing with data produced by a rapidly evolving technology such as high-throughput sequencing instrumentation is likely required to undergo constant adaptation. Besides, at high rates of data production, storage systems may operate close to the limits of their capacity. In such potentially unstable settings, the simplicity of file formats constitutes a significant advantage.

On grounds of these considerations, the SHORE analysis suite relies on tab-delimited table formats for storing reads, read mapping data and analysis results. Users may further choose to store files either directly as plain text, or in one of two widely used compression formats, GZIP [116] and XZ [127]. While such generic compression file formats themselves represent binary formats, decoding commands and routines are widely available across platforms and programming languages and are employed merely as a filtering layer, thus retaining most of the advantages of a text-based format. With compression, SHORE implements block-wise encoding for both the GZIP and the XZ format to enable near-random read access at any decompressed file offset. Our implementation enables free adjustment of the random access and space efficiency trade-offs (section 2.2.2).

SHORE's text-based read alignment file format was extended to support reference compression as introduced by the binary CRAM format (section 1.6), as well as further simple preprocessing filters serving to boost compressibility in combination with generic compression algorithms. Depending on the choice of preprocessing filters, compression algorithm and random access resolution, our compressed text file formats enable similar or better read alignment compression ratios compared to CRAM (section 2.2.5).

Another common requirement in sequencing data analysis is the retrieval of data set elements with a certain attribute value, e. g. extraction of the set of single nucleotide polymorphisms included in a certain range of the reference genome, or retrieval of a read with a specific identifier from a short read data set. However, standard linear search is slow when processing large data sets. Given that the data set is in sorted order with respect to the key attribute, accelerated queries are possible. SHORE sort offers text file sorting functionality similar to the standard sort command line utility, optimized for use with typical high-throughput sequencing file formats. Additionally, the tool is capable of performing binary accelerated queries on arbitrary sorted tab-delimited text files.



Figure 2.2: Basic Structure of the GZIP Format and its Sub-Formats. (Schematic comparing (a) GZIP, (b) BGZF, (c) dictzip and (d) SHORE-GZIP, each composed of GZIP headers, DEFLATE compressed data and GZIP footers, with indexing information embedded in GZIP header fields where applicable.)

Binary search is however not applicable to elements like read mappings, gene models or variant records that represent two-dimensional objects with a start and an end coordinate. Fast access to those data set elements overlapping a specific region of interest is crucial for data visualization and manual data set exploration. A text file indexing approach capable of answering various types of range queries is implemented as a utility SHORE 2dex, providing functionality similar to Tabix [125] or the BAM [100] and BigBed [123] indexing schemes. In comparison to the also-generic Tabix indexer, SHORE provides additional functionality such as a flexible choice of compression format and random access resolution, indexing of files not sorted by start coordinate, and fast queries on data sets with deep sequence coverage.

Both utilities SHORE sort and SHORE 2dex make use of SHORE's generic random read access back end and can thereby seamlessly be applied to both compressed and uncompressed data sets.

2.2.2 Widely Compatible Indexed Block-Wise Compression

GZIP [116] is one of the most widely used generic compression formats, with a high level of support across operating systems, programming languages, command line utilities and many further software applications. The employed DEFLATE compression algorithm [115] offers a reasonable compromise between encoding and decoding speed and compression ratio. Therefore, the format is an obvious choice for the compression of sequencing-related data. However, it lacks random read access support, forcing decompression of all data preceding a required section of the compressed file. As this interferes with the ability to perform accelerated queries on the compressed data sets, subset specifications have been developed to add this feature without breaking compatibility.

The GZIP file format embeds DEFLATE compressed data between a header and a footer sequence (figure 2.2(a)). Encoding parameters are specified within the header, whereas the footer consists solely of eight bytes specifying a checksum as well as the decompressed file size modulo 2^32. Header, compressed data and footer constitute a GZIP stream, and a GZIP-compliant file consists of one or more concatenated streams. Upon decompression, the data of concatenated GZIP streams are concatenated as well.
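This stream concatenation behavior can be verified with any standard GZIP implementation; the snippet below, using Python's gzip module purely as an example decoder, compresses two chunks independently, concatenates the results and decompresses them back into the concatenated payload.

```python
import gzip

# Two independently compressed GZIP streams (each with its own header and footer).
stream_a = gzip.compress(b"first block of records\n")
stream_b = gzip.compress(b"second block of records\n")

# Their concatenation is itself a valid GZIP file; decompression yields the
# concatenation of the original payloads, which is the property that BGZF and
# similar block-wise sub-formats rely on.
combined = stream_a + stream_b
print(gzip.decompress(combined).decode())
```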


24 CHAPTER 2. A HIGH-THROUGHPUT DNA SEQUENCING DATA ANALYSIS SUITE<br />

The BGZF format developed for the BAM format <strong>and</strong> the Tabix utility [100, 111, 125] exploits<br />

the concatenation feature to implement independently compressed blocks within the GZIP speci-<br />

cation. Uncompressed data are split into 64 kilobyte (kB) blocks, compressed independently as<br />

GZIP streams <strong>and</strong> eventually concatenated (gure 2.2(b) ). Although the resulting compressed<br />

size of a block is unpredictable <strong>and</strong> depends on the compressibility of the data, with the knowledge<br />

of the start osets of all blocks in a compressed le it is possible to quickly locate the block<br />

containing a specic uncompressed le oset. Thereby excess data that have to be decompressed<br />

to access any specic position of the le are limited to at most the block size of 64 kB. BGZF<br />

however does not specify a means of storing the start oset of each of the compressed blocks. To<br />

realize accelerated queries on top of BGZF, block oset information must therefore be incorporated<br />

in external index les. This represents a source of potential error, as the contents of the<br />

compressed le may diverge from the stored index data. Furthermore, various software applications<br />

do not correctly implement the stream concatenation feature of the GZIP specication,<br />

resulting in BGZF compressed data being truncated after decompression of the rst block.<br />

The open source command line tool dictzip comes with a specification for an indexed GZIP-compliant format. The GZIP format specification allows up to 64 kB of arbitrary extra data to be embedded within the header sequence. This extra data is simply ignored by basic GZIP decoders. The dictzip format takes advantage of the extra data field for storing block offset information (figure 2.2(c)). During the encoding process, the DEFLATE algorithm offers the possibility of manually inserting reset points from which decompression may be started without processing the entire file. By exploiting this feature to compress blocks of input data independently, the dictzip tool avoids reliance on the stream concatenation feature as in the case of BGZF. However, with the extra data field in the GZIP header being limited to 64 kB by specification, the number of blocks that can be stored is limited as well, effectively prohibiting storage of data with an uncompressed size of more than 1.8 gigabytes (GB). Furthermore, the entire index data to be stored in the GZIP header only become available after the entire file has been encoded. Streaming output is therefore incompatible with the format, which is hence suitable for medium-size, static data sets, but is not well matched to the compression of large-scale sequencing data.

Due to the shortcomings of available formats, we introduce a further subset specification. The SHORE-GZIP format compresses the entire file as a single GZIP stream to provide optimal compatibility. To enable streaming output, all indexing information is stored at the end of the file. On decompression using either SHORE or arbitrary third party tools, block offsets stored in the index are discarded. Thereby, index information is guaranteed to remain consistent with the file contents. To allow fine-tuning the tradeoff between compression ratio and random access overhead, the format enables free choice of the encoding block size.

SHORE-GZIP realizes block-wise encoding similar to the dictzip approach by requesting periodic resets of the DEFLATE encoder (figure 2.2(d)). During encoding, the method keeps track of the compressed block offsets. After encoding all input data, the recorded index information is split into blocks of approximately 64 kB. Each block is embedded as extra data field into a GZIP header, which is appended to the compressed file, followed by two bytes signaling empty compressed data to DEFLATE decoders as well as eight further bytes forming the appropriate GZIP footer. The indexing information is hence stored as consecutive empty GZIP streams that will be ignored by standard decoding of the data set. The SHORE-GZIP implementation is however able to recognize, collect and decode the trailing index blocks on demand to enable random read access.
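The exact SHORE-GZIP byte layout is not reproduced here, but the underlying GZIP trick, appending an empty member whose FEXTRA header field carries arbitrary index bytes, can be sketched in a few lines of Python (the subfield identifier 'X0' and the function name are illustrative placeholders, not part of any specification):

import struct, zlib

def empty_gzip_member_with_extra(payload: bytes) -> bytes:
    # Build a GZIP member that carries `payload` in its FEXTRA header field and
    # contains an empty DEFLATE stream, so standard decoders emit no output for it.
    extra = b'X0' + struct.pack('<H', len(payload)) + payload   # RFC 1952 subfield
    header = (b'\x1f\x8b'           # GZIP magic
              b'\x08'               # compression method: DEFLATE
              b'\x04'               # flags: FEXTRA set
              b'\x00\x00\x00\x00'   # modification time
              b'\x00'               # extra flags
              b'\xff'               # operating system: unknown
              + struct.pack('<H', len(extra)) + extra)
    deflater = zlib.compressobj(9, zlib.DEFLATED, -15)          # raw DEFLATE stream
    empty_stream = deflater.compress(b'') + deflater.flush()    # two bytes for empty data
    footer = struct.pack('<II', 0, 0)                           # CRC32 and ISIZE of nothing
    return header + empty_stream + footer

# Appending such members to a *.gz file keeps it decodable: multi-member aware tools
# (e.g. gzip.decompress in Python) return only the original data and skip the index.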

An important feature of plain GZIP is the ability for users to concatenate compressed files without breaking format compliance. With SHORE-GZIP, concatenation results in index records being interspersed among the data streams. Through correct recognition of such interspersed index elements by SHORE's index parsing algorithm, random access functionality is preserved in concatenated files.

Apart from GZIP, SHORE offers file compression using the XZ [127] format. XZ employs the Lempel-Ziv-Markov chain algorithm (LZMA) to achieve a considerably improved average compression ratio compared to DEFLATE. LZMA offers decompression performance close to that of DEFLATE, whereas compression requires significantly more computational resources. XZ is modeled after the GZIP format by specifying a similar structure of concatenated streams of encoded data. The file format however natively includes terminal index records that store all positions of an eventual encoder reset, and can therefore be employed within SHORE without further extensions.

2.2.3 Efficient Queries on Text Files

Queries for specific elements in a sorted sequence can be answered quickly by binary search in O(log n) time, where n is the length of the input sequence. Stock binary search algorithms are however not directly applicable to text-formatted data sets, where the total number of elements and their exact start offsets in the file are not known. We therefore implement a modified binary search for text file queries. Our algorithm bisects the data set by seeking to the central byte and subsequently seeking past the next newline marker following that position. Except for border cases that must be handled separately, the following line is compared to the user-provided search key and the algorithm is applied recursively to the appropriate subset of the file. In general, the worst case time complexity of the modified binary search algorithm is O(m) on arbitrary text files with a total size of m bytes, which however reduces to O(log n) for an n-element data set if a static limit for the maximal byte size per element is assumed to exist.
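As an illustration of this strategy, not of the SHORE implementation itself, the following Python sketch performs the offset-based bisection on a sorted, tab-delimited text file; the first line of the file is the border case that has to be checked separately:

import os

def textfile_bisect(path, key, field=0):
    # Search a sorted, tab-delimited text file for a line whose `field` equals `key`,
    # bisecting on byte offsets: seek to the middle byte, skip past the next newline
    # and compare the following line to the search key.
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        def line_after(offset):
            f.seek(offset)
            f.readline()               # discard the (possibly partial) current line
            return f.readline()        # first complete line starting after `offset`

        lo, hi = 0, size               # smallest offset whose following line is >= key
        while lo < hi:
            mid = (lo + hi) // 2
            line = line_after(mid)
            if line and line.split(b'\t')[field].decode() < key:
                lo = mid + 1
            else:
                hi = mid

        hit = line_after(lo)
        if hit and hit.split(b'\t')[field].decode() == key:
            return hit.decode().rstrip('\n')
        f.seek(0)                      # border case: the first line is never inspected above
        first = f.readline()
        if first and first.split(b'\t')[field].decode() == key:
            return first.decode().rstrip('\n')
    return None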

Generic text file search functionality is made available through the SHORE sort utility. The program is applicable to arbitrary sorted tab-delimited text tables, offering the capability to retrieve specific data set elements as well as to bisect the data set using a search key.

Automated data analysis can often narrow down the data relevant to the investigation to a small set of regions on the reference genome. Close examination of such individual genomic regions necessitates rapid retrieval of the data associated with the appropriate range of positions. Unless the data set consists entirely of non-overlapping entries, range queries can however not be answered by binary search.

Objects associated with a specific range are two-dimensional entities that can be interpreted as points in a plane defined by their start and end coordinate. A query range q = (x_q, y_q) then is itself a point in the plane, which together with its reflection about the 45 degree line, q′ = (y_q, x_q), partitions the search space into six different quadrants (figure 2.3(a)). Area A contains all objects fully included in the query range, defined by the constraints x ≥ x_q and y ≤ y_q. Subdivision B represents all elements whose range itself includes the entire query range, with the constraints x ≤ x_q and y ≥ y_q. All objects intersecting the query are represented by the set union of areas A to D, defined by y > x_q and x < y_q, whereas elements completely disjoint from the query range fall into quadrants E and F (y ≤ x_q or x ≥ y_q).
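Translated literally into code, these constraints yield a small classification function (a sketch, not SHORE code):

def classify(x, y, xq, yq):
    # Classify the element (start x, end y) relative to the query range (xq, yq),
    # following the quadrant constraints stated above.
    if y <= xq or x >= yq:
        return 'E/F'          # disjoint from the query range
    if x >= xq and y <= yq:
        return 'A'            # fully included in the query range
    if x <= xq and y >= yq:
        return 'B'            # includes the entire query range
    return 'C/D'              # partial overlap

# e.g. classify(150, 250, xq=100, yq=200) -> 'C/D'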

One of many ways to impose a spatial partitioning on multi-dimensional data are k-d trees [128], which recursively bisect the data set with respect to alternating dimensions (figure 2.3(b)). In a perfectly balanced k-d tree, each recursion splits the current subset of the data at the median of the respective dimension's values. In that case, the tree may be represented implicitly by an array of its elements with a particular ordering [129]. k-d trees can answer region searches with a worst case time complexity of O(k · n^(1-1/k) + r), i.e. O(√n + r) in the two-dimensional case, with n the total number of data set elements and r the number of elements to be retrieved [130].
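The following Python sketch shows how such an implicit 2-d tree over (start, end) points can be built and queried for intersecting elements; it assumes a median-first preorder array layout and the intersection predicate of figure 2.3, and SHORE's actual array ordering may differ:

def build_2d_tree(points, depth=0):
    # Order (start, end) points so that the median with respect to the alternating
    # split dimension comes first, followed by the recursively ordered halves.
    if not points:
        return []
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    m = len(pts) // 2
    return ([pts[m]]
            + build_2d_tree(pts[:m], depth + 1)
            + build_2d_tree(pts[m + 1:], depth + 1))

def query_intersecting(tree, xq, yq, depth=0, out=None):
    # Collect all elements intersecting the query, i.e. with end > xq and start < yq.
    if out is None:
        out = []
    if not tree:
        return out
    x, y = tree[0]
    if y > xq and x < yq:
        out.append(tree[0])
    m = len(tree) // 2                      # size of the left subtree
    left, right = tree[1:1 + m], tree[1 + m:]
    axis = depth % 2
    split = tree[0][axis]
    if axis == 0:                           # subtrees split by start coordinate
        query_intersecting(left, xq, yq, depth + 1, out)
        if split < yq:                      # right subtree only holds start >= split
            query_intersecting(right, xq, yq, depth + 1, out)
    else:                                   # subtrees split by end coordinate
        if split > xq:                      # left subtree only holds end <= split
            query_intersecting(left, xq, yq, depth + 1, out)
        query_intersecting(right, xq, yq, depth + 1, out)
    return out

# tree = build_2d_tree([(1, 5), (3, 9), (6, 8), (10, 12)])
# query_intersecting(tree, 4, 7) returns the three elements overlapping the range (4, 7).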



Figure 2.3: Range Search Using k-d Trees. (a) Range Query (axes: Start and Start + Size): the query point q = (x_q, y_q) and its reflection q′ = (y_q, x_q) divide the plane into the areas A to F. (b) Search Space Partitioning.

SHORE implements a generic algorithm to impose 2-d tree ordering on array-like objects. Given a 2-d tree ordered array object, further algorithms facilitate retrieval of elements intersecting (A to D), included by (A) or including (B) a query range, or entirely located upstream (E) or downstream (F) of a query position.

Based on the 2-d tree algorithms, we developed a persistent indexing scheme for text-formatted range data files, made available through the SHORE 2dex utility. The indexing method processes input files in blocks of a fixed, user-provided byte size. For each block, we record a sequential identifier as well as the sequence range spanned by the combined elements whose storage entry starts inside the respective block. If a block is not associated with any storage entry, no information is saved. Block records are arranged in 2-d tree order and saved to disk as the concatenation of three arrays providing the reordered identifiers, genomic start positions and sizes. For answering range queries, the array information is mapped into memory and supplied to the appropriate 2-d tree query algorithm, thereby providing the identifiers of all blocks that potentially contain elements relevant to the query. In combination with the index block size, block identifiers allow calculation of the uncompressed start offsets of the blocks in the file. A subsequent linear search phase scans the blocks indicated by the 2-d tree query to assemble the final search result.

Through the block size parameter, users can adjust the tradeoff between index file size and the maximum length of the linear search phase. In sparse range data sets or data sets not ordered by sequence coordinate, a query range may be spanned by the combined elements associated with a block although none of the individual elements is of relevance to the respective query. In such cases, irrelevant blocks must frequently be subjected to linear search unless the index block size is sufficiently small. To overcome this slight deficiency, we introduce a second user parameter, maxgap. If subsequent data set elements in a block are farther apart on the genomic sequence than the provided maxgap size, then the respective block is stored multiple times, associated with different ranges that do not include any coverage gap larger than maxgap.
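One possible reading of this rule as Python (a sketch; the element layout and record fields are simplified assumptions, not the 2dex on-disk format):

def block_records(block_id, elements, maxgap):
    # Derive index records (block_id, start, size) for one file block such that no
    # record's spanned range contains a coverage gap larger than `maxgap`.
    # `elements` is a non-empty list of (start, end) pairs.
    intervals = sorted(elements)
    records = []
    cur_start, cur_end = intervals[0]
    for start, end in intervals[1:]:
        if start - cur_end > maxgap:                 # gap too large: emit another record
            records.append((block_id, cur_start, cur_end - cur_start))
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    records.append((block_id, cur_start, cur_end - cur_start))
    return records

# block_records(7, [(100, 150), (160, 180), (5000, 5100)], maxgap=1000)
#   -> [(7, 100, 80), (7, 5000, 100)]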



(a)  CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTA
     ||||||||||**|||||||||||||****|||||||||||||
     CCCTAAACCCGTAACCCTAAACCCT---TCTCTGAATCCTTA

(b)  CCCTAAACCC[TG][AT]AACCCTAAACCCT[A-][A-][A-][CT]CTCTGAATCCTTA

(c)  10[TA,GT]13[AAA,---][CT]13

(d)  Query: CCCTAAACCCGTAACCCTAAACCCTTCTCTGAATCCTTA
     CIGAR: 10=2X13=3D1X13=
     MD:    10T0A13^AAA0C13

Figure 2.4: Example Read Alignment and its Representations in the MapList and SAM Formats

BAM indexing as well as the Tabix utility are two incarnations of a similar indexing scheme based on R-trees which was first introduced for the UCSC Genome Browser [124]. Utilities like the BAM indexer are however tied to their specific associated file format. The more versatile Tabix utility supports many text-based file formats, but requires BGZF compressed data. By separating the random access and range indexing concepts, the 2-d tree index is universally applicable to all storage formats supported by the random access back end, currently including plain text, XZ and SHORE-GZIP. Moreover, the ability to index data sets that are not strictly required to be ordered by sequence coordinate is unique among the stated indexing methods. As described, our method indexes chunks of the data set having a fixed byte size. The UCSC R-tree family of indexes differs in this respect by grouping data set elements into tiling windows on the genomic sequence at a maximum resolution of 16 kilobases. This binning approach implies the disadvantage of potentially uncontrollable growth of the linear search phase for deeply covered regions of the genome. We further regard the 2-d tree index's homogeneous array layout as a substantial implementation advantage.

2.2.4 Improved Compression of Read Mapping Data

To cope with increasingly large amounts of read mapping data, we implement reference-based compression, quality reduction, read relabeling as well as a simple transposition algorithm capable of further boosting the compressibility of SHORE alignment files and other types of tab-delimited tables. Conversion from standard to condensed representations is implemented as the tool SHORE coal.

A read mapping by our definition is a multi-attribute entity composed of its mapping coordinates on the reference genome as well as a read alignment that specifies the exact set of base pairings, traditionally represented in a multi-line format (figure 2.4(a)). The line-based, tab-delimited MapList format [1] defined by SHORE stores mapping coordinates and read alignments along with the original read information and attributes describing the reliability of the mapping as well as read pairing information. Alignments are stored in an alignment string format resembling that originally introduced by Vmatch [131], a sequence matching tool based on enhanced suffix arrays [132]. In the Vmatch format, matching positions are represented literally by the matching nucleotide symbol, whereas edit operations are described by the pair of mismatching symbols enclosed in square brackets (figure 2.4(b)).

To adjust to the development of sequencing technology and read mapping tools, we have carefully amended the original alignment representation with a small set of additional features while maintaining full backwards compatibility. With the constant increase of achievable read length, it becomes possible for read mapping tools to align reads across longer stretches of inserted, deleted or polymorphic sequence. For a more concise and readable representation of such alignments, we additionally allow consecutive edit operations of the same type to be indicated by two comma-separated strings representing the sequence of the reference and the read, respectively. Reference-based compression as introduced by CRAM is supported in our format by simple substitution of all substrings representing exact matches between read and reference sequence by their length (figure 2.4(c)).
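To make the notation concrete, the following toy Python encoder (an illustration only, not SHORE's implementation; soft clipping and further MapList features are ignored) reproduces the reference-compressed alignment string of figure 2.4(c) from a gapped alignment:

def ref_compressed_alignment(ref, qry):
    # ref and qry are gapped, equal-length alignment rows ('-' marks gaps).
    # Runs of exact matches are replaced by their length; runs of identical edit
    # operations are written as [reference characters,read characters].
    def op(r, q):
        return 'M' if r == q else ('D' if q == '-' else ('I' if r == '-' else 'X'))
    out, i = [], 0
    while i < len(ref):
        j = i
        while j < len(ref) and op(ref[j], qry[j]) == op(ref[i], qry[i]):
            j += 1
        r, q = ref[i:j], qry[i:j]
        if op(ref[i], qry[i]) == 'M':
            out.append(str(j - i))                  # exact match run: store the length
        elif j - i == 1:
            out.append('[' + r + q + ']')           # single edit operation
        else:
            out.append('[' + r + ',' + q + ']')     # run of identical edit operations
        i = j
    return ''.join(out)

ref = 'CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTA'
qry = 'CCCTAAACCCGTAACCCTAAACCCT---TCTCTGAATCCTTA'
assert ref_compressed_alignment(ref, qry) == '10[TA,GT]13[AAA,---][CT]13'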

The standard SAM/BAM format splits read alignment information across three attributes (figure 2.4(d)). In this format, the nucleotide sequence of a read (query) is stored unaltered. The actual alignment information is separated into a CIGAR string as well as an optional MD tag required to specify reference sequence information for mismatch and deletion positions. Omission of read information redundant with the reference sequence via reference-based compression is not supported by the format specification. In the opposite direction, the SHORE alignment string format specifies further enhancements, such as the inclusion of soft-clipped sequence enclosed in angle brackets (not shown), to enable full feature compatibility with the CIGAR alignment representation.

Apart from read alignments, read qualities and identifiers can make up a considerable fraction of the storage requirements of compressed read data (cf. section 2.2.5). SHORE offers a simple algorithm capable of lowering base quality resolution, and thereby read quality entropy, similar to the quality binning approach taken by cramtools [120]. We take a user-specified resolution threshold k to partition each read quality string into sections in which the lowest and highest quality value differ by at most k. Each nucleotide is then assigned the lowest quality value found in the respective section. Information on edit operations is prioritized by always storing the exact quality value for read bases representing mismatches or insertions relative to the reference sequence.
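A sketch of such a reduction in Python, assuming greedy left-to-right sections (the exact partitioning strategy of the SHORE implementation is not detailed here):

def reduce_quality(quals, k, keep=frozenset()):
    # Partition the quality values into maximal left-to-right sections in which the
    # highest and lowest value differ by at most k, and assign each base the section
    # minimum.  Positions in `keep` (mismatches, insertions) retain their exact value.
    out, i = list(quals), 0
    while i < len(quals):
        lo = hi = quals[i]
        j = i
        while j < len(quals) and max(hi, quals[j]) - min(lo, quals[j]) <= k:
            lo, hi = min(lo, quals[j]), max(hi, quals[j])
            j += 1
        for p in range(i, j):
            out[p] = quals[p] if p in keep else lo
        i = j
    return out

# reduce_quality([30, 32, 31, 12, 14, 40], k=3, keep={4}) -> [30, 30, 30, 12, 14, 40]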

Simplification of read identifiers can amount to a significant further reduction of storage requirements. Standard read identifiers are strings automatically generated by the base calling software. For example, Illumina software builds read labels from a sequencing run identifier as well as the reads' coordinates on the flow cell. Due to the randomness of the coordinates as well as the random order of occurrence of the reads in a typical mapping data set ordered by alignment coordinate, the sequence of identifiers to be stored is of high entropy and poor compressibility. Post-mapping systematic relabeling of reads can therefore drastically improve the compression ratio. Following relabeling, the information content of the sequence of stored identifiers is primarily determined by their actual function, which is the identification of associated data set entries such as paired reads or repetitive read mappings. SHORE coal offers such relabeling, combining simple enumeration by alignment coordinate with a static data set identifier.
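A minimal sketch of this relabeling idea (the field names are illustrative assumptions; memory handling of the identifier map is ignored):

def relabel(mappings, dataset_id):
    # Replace instrument-generated read identifiers by a static data set tag plus an
    # enumeration in alignment-coordinate order; mappings sharing the same original
    # identifier (read pairs, repetitive placements) share the new label.
    new_label = {}
    for m in sorted(mappings, key=lambda m: (m['chrom'], m['position'])):
        original = m['read_id']
        if original not in new_label:
            new_label[original] = '%s.%d' % (dataset_id, len(new_label) + 1)
        m['read_id'] = new_label[original]
    return mappings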

In general, generic compression algorithms like DEFLATE or LZMA pick up redundancy better in homogeneously structured data. This suggests that, for example, read quality data might be compressed more effectively when grouped together rather than stored interleaved with the further attributes of the data set elements such as identifiers and read alignments. Isolated storage of these attributes however prohibits data streaming. Furthermore, given the presence of variable-length attributes that cannot reasonably be stored in a fixed, predetermined amount of space, random access to the full set of attributes of a certain data set element requires additional index data structures that counter the aim of optimizing compression ratio. A compromise between such an attribute centric and the regular element centric representation is block-wise coding of the data in an attribute centric representation, with the block size chosen small enough to transform a block into an element centric representation on-the-fly.

To explore the effectiveness of such a mixed representation for boosting compressibility, we implemented a simple transposition operation applicable to tab-delimited file formats (algorithm 2.1). Our algorithm takes as parameter a fixed block size k for transformation of the input data.

input  : Text T, block size k
output : Transposed text T′

foreach block B in T of size k do
    first_row ← first row of B; remove the first row from B;
    last_row  ← last row of B;  remove the last row from B;
    append first_row to T′;
    if B is a table then
        foreach column C in B do
            append C as row to T′;
        end
    else
        append B to T′;
    end
    append last_row to T′;
end

Algorithm 2.1: Block-Wise Text Transposition

Input is subdivided into blocks of k bytes. The first and last row of a block are defined as the characters up to and including the first newline, and the characters following the last newline marker found in the block, respectively. For each block, the first and last row are output unaltered, preserving lines spanning block boundaries. The data in between are checked for whether they represent a table with a fixed number of columns, and if so are transposed, swapping rows for columns. When applied to data represented as a tab-delimited table with a fixed number of columns, the transposition algorithm succeeds in grouping together all attribute values of the same type in a block, lines at block boundaries excluded. For tables with a variable number of columns or other data, the input is left unaltered. Applied twice with the same block size parameter value, the original input is guaranteed to be restored.
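A direct Python rendering of algorithm 2.1 (a sketch operating on strings; the actual implementation works on byte streams):

def transpose_blockwise(text, k):
    # Block-wise transposition of tab-delimited text; applying the function twice
    # with the same block size k restores the original input.
    out = []
    for offset in range(0, len(text), k):
        block = text[offset:offset + k]
        first_nl = block.find('\n')
        if first_nl < 0:                      # no complete line in the block: pass through
            out.append(block)
            continue
        first, rest = block[:first_nl + 1], block[first_nl + 1:]
        last_nl = rest.rfind('\n')
        middle, last = rest[:last_nl + 1], rest[last_nl + 1:]
        out.append(first)                     # first row, output unaltered
        rows = [line.split('\t') for line in middle.splitlines()]
        if rows and len(set(len(r) for r in rows)) == 1:
            # fixed number of columns: swap rows for columns (length-preserving)
            out.append(''.join('\t'.join(col) + '\n' for col in zip(*rows)))
        else:
            out.append(middle)                # irregular data: leave unaltered
        out.append(last)                      # last row, output unaltered
    return ''.join(out)

# A tab-delimited table round-trips:
# t = 'a\t1\nb\t2\nc\t3\nd\t4\n'
# transpose_blockwise(transpose_blockwise(t, 64), 64) == t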

SHORE supports a transposed format by addition of a special header marking transposed data and specifying the transformation's block size. On file access, this header is recognized and the input is passed through a filter applying the appropriate block-wise back-transposition. Random access is supported by reading the appropriate block of data, back-transposing and subsequently seeking to the correct offset inside the block.

2.2.5 Compression Results

We used a data set consisting of approximately 6.8 million aligned 40 base pair Arabidopsis thaliana reads obtained by Illumina sequencing for evaluation of the effectiveness of the measures implemented to improve space efficiency.

In plain MapList format, the read mapping data set occupied about 1.1 GB of disk space. To assess the potential for reduction of the total data set size by optimizing the encoding and representation of certain attributes, we extracted individual data set columns (table 2.1).

Attribute                      Plain Size        Compressed Size   Compression Ratio
read alignments (ref. comp.)   349 MB (33 MB)    19 MB (3.4 MB)    5.7% (10.3%)
read IDs (relabeled)           227 MB (117 MB)   39 MB (3.2 MB)    17.5% (2.7%)
qualities (reduced)            341 MB (341 MB)   76 MB (30 MB)     22.3% (8.8%)
other fields                   163 MB            5.4 MB            3.3%

Table 2.1: Compressibility of Individual Read Mapping Attributes

Read alignments and qualities accounted for 349 MB and 341 MB, each corresponding to approximately 32% of the total storage size, followed by standard read identifiers with 227 MB (21%) and the collective size of all further attributes (163 MB, 15%).

Column data were compressed in XZ format at 128 kB block size. Read alignments were thereby reduced to 5.7% (19 MB) of their original size and further fields to 3.3% (5.4 MB), whereas read identifiers and quality strings could not be compressed as effectively, with compression ratios of 17.5% (39 MB) and 22.3% (76 MB), respectively.

By application of reference-based compression, the storage size of read alignments was reduced to less than ten percent (33 MB) in plain text format. This gain was however somewhat diminished in compressed format due to the worse compressibility of the data, resulting in a compressed size of 3.4 MB at a compression ratio of 10.3%.

Relabeling of the reads resulted in a storage gain of roughly fifty percent relative to the original representation, owing to the shorter read identifiers. Vastly improved, however, was the compressibility of the identifier sequence, resulting in a reduction to just 2.7% (3.2 MB) relative to plain text format.

The effect of reduced quality value resolution was assessed using the cramtools quality binning configuration N40-R8 to allow a direct comparison to the CRAM format, although the algorithm implemented in SHORE achieved similar results. The N40-R8 setting preserves the quality values of all reference sequence mismatches, while assigning the quality values of sequence matches to one of eight possible bins. Application of quality resolution reduction improved the compression ratio of the quality data from 22.3% to 8.8%.

We further assessed the effect of various combinations of compression and data reduction configurations on total data set size by applying the SHORE compress and SHORE coal utilities to the data set in MapList format, and compare the results with storage of SAM/BAM and CRAM formatted data (table 2.2).

The effect of compression block size was evaluated using SHORE's random access granularity presets fast, corresponding to 128 kB block size, medium, corresponding to 2 MB, and slow, corresponding to 64 MB. Compression of the 1.1 GB data set in SHORE-GZIP format at 128 kB block size reduced space requirements to 205 MB, or 19% of the uncompressed file size. Increasing the block size to 2 MB further improved space efficiency by just 1.5% to 202 MB storage size, and use of 64 MB blocks did not have a significant advantage over 2 MB.

Selecting the XZ compression format at 128 kB block size resulted in a storage size of 163 MB, 15% of the plain data set size and a 20% improvement over SHORE-GZIP at the same block size. The effect of increased compression block size was more pronounced for XZ than for SHORE-GZIP. At 2 MB and 64 MB block size, disk usage was reduced to 140 MB and 116 MB, respectively, improvements of 14% and 17% over the respective previous block size.

Block-wise transposition was applied using the block size parameter matching the respective compression block size. At the smallest block size, the size of the transformed data in SHORE-GZIP format was 181 MB, a 12% improvement over untransposed storage. Block-wise transposition also augmented the effect of increased block sizes, achieving file sizes of 160 MB at 2 MB and of 157 MB at the 64 MB configuration. This corresponds to advantages of 21% and 22% over the matching untransformed representations.



File Format   Preprocessing     Compression Format   Block Size   Compressed File Size
SAM           -                 -                    -            1.3 GB
MapList       -                 -                    -            1.1 GB
BAM           -                 BGZF                 default      237 MB
SAM           -                 GZIP                 128 kB       216 MB
MapList       -                 GZIP                 128 kB       205 MB
MapList       -                 GZIP                 2 MB         202 MB
MapList       -                 GZIP                 64 MB        202 MB
MapList       BT                GZIP                 128 kB       181 MB
MapList       -                 XZ                   128 kB       163 MB
MapList       BT                GZIP                 2 MB         160 MB
MapList       BT                GZIP                 64 MB        157 MB
MapList       BT                XZ                   128 kB       150 MB
MapList       BT, RC            GZIP                 2 MB         143 MB
MapList       -                 XZ                   2 MB         140 MB
CRAM          RC                -                    default      135 MB
MapList       BT, RC            XZ                   128 kB       132 MB
MapList       BT                XZ                   2 MB         123 MB
MapList       -                 XZ                   64 MB        116 MB
MapList       BT, RC, RL        GZIP                 2 MB         114 MB
MapList       BT, RC, QR        GZIP                 2 MB         109 MB
MapList       BT, RC            XZ                   2 MB         108 MB
CRAM          RC, QR            -                    default      103 MB
MapList       BT, RC, QR        XZ                   128 kB       100 MB
MapList       BT                XZ                   64 MB        96 MB
MapList       BT, RC, RL        XZ                   128 kB       95 MB
MapList       BT, RC            XZ                   64 MB        84 MB
MapList       BT, RC, RL, QR    GZIP                 2 MB         81 MB
MapList       BT, RC, QR        XZ                   2 MB         80 MB
MapList       BT, RC, RL        XZ                   2 MB         77 MB
MapList       BT, RC, QR        XZ                   64 MB        62 MB
MapList       BT, RC, RL        XZ                   64 MB        60 MB
MapList       BT, RC, RL, QR    XZ                   64 MB        39 MB

BT: block-wise transposition
RC: reference-based compression
RL: read relabeling
QR: lossy quality resolution reduction (cramtools [120] preset N40-R8)

Table 2.2: Comparison of Alignment File Compression Efficiency


With XZ/LZMA compression, compression ratios benefited slightly less from block-wise transposition, at the three different block sizes gaining 8%, 12% and 17% over untransposed data to obtain file sizes of 150 MB, 123 MB and 96 MB, respectively.

We further list the effect of additionally applying various combinations of reference-based compression, read relabeling and quality resolution reduction for transposed data encoded in GZIP format with 2 MB block size as well as encoded in XZ format at the three different block sizes. In all compression formats and block sizes assessed, reference-based compression accounted for 11% to 12% of additional storage space savings. Added relabeling of reads resulted in file sizes reduced by a further 20% in SHORE-GZIP format and by over 28% in the three XZ encoded files. Reduction of quality resolution gave similar improvements, producing slightly higher savings than read relabeling in GZIP compressed data, but had a somewhat lesser impact compared to relabeling in XZ encoding.

For comparison of formats, table 2.2 also includes disk usage for the same data set converted to the SAM, BAM and CRAM file formats. The SAM format representation of the data set entries was slightly more verbose compared to the MapList formatting, resulting in an uncompressed file size of approximately 1.3 GB. The BAM encoded data set was about 14% larger compared to MapList, and about 9% larger than the SAM formatted data compressed as SHORE-GZIP at 128 kB block size. When applying reference-based compression only, the size of the CRAM encoded data ranked slightly better than the MapList file compressed in XZ format at 2 MB block size or compressed as GZIP following block-wise transposition and reference-based compression, and slightly worse than the 128 kB block size XZ-encoded file with block-wise transposition and reference-based compression enabled. With additional reduction of the quality value resolution, CRAM ranked somewhat worse than the XZ compressed MapList file for the same data encoded in 128 kB chunks with block-wise transposition.

2.2.6 Data Storage Considerations

The choice of sequencing data storage formats depends on three areas of application with overlapping, but distinct design goals. Off-line data archiving primarily requires a space-efficient representation. Data exchange formats must conform to a highly standardized and stable format specification and allow combined distribution of all data and meta-data in a single file, while directly supporting simple analysis and processing tasks. Finally, on-line storage solutions should be optimized regarding the trade-off between space efficiency and certain access guarantees to provide optimal support for the respective approaches to analysis.

With the implementation of the SHORE storage formats, we demonstrate indexable, space-efficient on-line data storage for high-throughput sequencing on the basis of simple text-based file format definitions and stock compression formats. While binary formats dedicated to sequencing data storage such as CRAM should permit further optimization for improved performance and space efficiency, the simple formats and pre-processing methods outlined here should provide a valuable benchmark for future efforts.

The implemented indexing solution for compressed sequencing data files forms the basis for rapid data retrieval for iterative analysis and data set exploration. Furthermore, our indexing approach should facilitate the implementation of caching strategies, e.g. for fully interactive visualization of large sequencing data sets.



2.3 A Non-Destructive Read Filtering and Partitioning Framework

To streamline further processing and analysis, it is in general desirable to initially prune high-throughput sequencing data sets of sequence with below-par quality. Sequencing of short polymers such as small RNA, bar-coded multiplexing and diverse other protocols additionally require sequence context sensitive read clipping and data partitioning.

With SHORE import, we provide a utility to apply such initial read filtering operations, in the process converting sequencing reads from various supported input formats into SHORE's FlatRead format and partitioning the data according to the predefined SHORE directory hierarchy as described [1]. This work further develops the application into a modular and extensible read filtering framework.

Data analysis frequently is an iterative process, where at advanced stages it may become clear that an initial choice of filtering parameters for a data set was suboptimal. However, retrieval of the original source data may be hampered by archiving in backup solutions, unless large amounts of duplicate data are to be supported and managed on disk. Our utility was therefore reworked to perform all sequencing read editing and filtering in a non-destructive, reversible manner, exploiting the predefined SHORE directory hierarchy to prevent loss of information. It is thus able to seamlessly recover the raw unprocessed sequencing data sets for straightforward re-filtering or distribution to third parties. Processing facilities provided by the application include custom quality and chastity filtering, low complexity read removal, quality-based trimming of read ends and small RNA adapter clipping, as described [1, 18]. We have further added filters for sequence context-dependent read splitting as well as a versatile demultiplexing system (section 2.4). Defaults of the program are configured such that no filtering is applied automatically, but quality filters indicated by the input format are respected. Supported read filtering operations must be selectively enabled by the user and are applied in a defined order.

SHORE import further provides optional functionality for partitioning its output into batches of fixed, user-definable size, as well as additionally by the length of the processed reads. As sequence length is a highly significant property, e.g. for small RNA data, length partitioning facilitates selection of the subsets relevant to the respective analysis. Data partitioning may furthermore offer technical advantages. For example, large sequencing data sets may require a long period of time to be written to disk, with a correspondingly high possibility of encountering system failures or other issues in the process. Reduction of the amount of data stored in a single file represents a trivial yet effective measure for limiting the significance of such failures for processing operations that are applied on a per-file basis. As SHORE is capable of merging sorted data sets on-the-fly with minor processing overhead, potential downsides of data set partitioning are minimized.

The original definition of the SHORE run directory hierarchy was amended to support non-destructive read filtering, sample demultiplexing and raw data recovery. The directory hierarchy by default includes four levels (figure 2.5). The root of the hierarchy represents a single run of a sequencing instrument. This directory is named according to the respective output directory parameter, an arbitrary user-provided string. Subdirectory names have defined meanings. The first level below the root directory consists of solely numeric names, representing the individual physically separated lanes of the sequencing run. The second level includes two different types of directory. Cleaned and demultiplexed read data are stored in directories corresponding to the appropriate sample identifier prefixed by the string sample_. On the same level, a directory filtered stores all data required to restore the original unedited data set.

A further sub-level resides within the sample directories. Valid paired-end sequencing reads are partitioned into directories named according to an integer enumeration of the paired-end read indexes. Reads obtained by single-end sequencing or paired-end reads losing their partner due to read filtering are stored in a separate directory single. Data files are stored either directly in the single and paired-end directories, or below a further sub-level for batching or read length-dependent partitioning.

Figure 2.5: SHORE Run Directory Hierarchy. The run directory (here Run-EAS67.136) contains the numeric lane directories 1 and 2, each holding a filtered directory and sample_* directories (sample_Ler-1, sample_Col-0, sample_Kro-0 and sample_Bur-0 in this example); each sample directory in turn contains the paired-end directories 1 and 2 as well as a single directory.

On completion of a batch of reads, the associated file is sorted to enable accelerated queries on the data set (section 2.2.3) and compressed in a supported indexed compression format (section 2.2.4). Output is formatted in SHORE's FlatRead format; results of read trimming and clipping are on request reported via a set of optional tags rather than being applied directly, to make filtering results available for further customized processing. Optionally, SHORE directory hierarchy output can be disabled to transform the program into a Unix pipeline-compliant standalone read filtering utility. The application reports basic read trimming and filtering statistics, whereas quality and nucleotide content statistics are implemented as a separate module, SHORE tagstats (section 2.8).

SHORE import defines various different importers to be capable of dealing with a variety of input formats. For correct processing of read data, the program must be able to determine all reads belonging to the same read pairing. For pairing recognition to be possible without disproportionate time or memory overhead, input data must be ordered according to adequate criteria. Currently available importers allow input where read pairs are provided as separate, correspondingly ordered files such as Illumina GAPipeline directories, sets of FASTQ files or SOLiD color space FASTA files, as well as sorted input in FlatRead, FASTQ, Illumina QSEQ or 454 SFF format. Unordered input data therefore require preprocessing, but can be conveniently passed to the application via Unix pipes using the utilities SHORE convert and SHORE sort.

Our utility automatically recognizes any filtered directories residing within SHORE run or lane directories that are passed to it as input. The contents of these directories are automatically parsed to restore the original state of the data prior to any SHORE-specific filtering, optionally re-filtering the data set with altered settings. Filtering defined by the original input format is by default restored as well, but can be explicitly ignored. By passing the restored output to the tool SHORE convert, raw data can be obtained for distribution in a variety of widely supported formats.

2.4 A Flexible Sequencing Read Demultiplexing System

To minimize cost per sequenced base, next-generation sequencing instruments must be uncompromisingly optimized for maximum throughput, at the expense of generating more data in a single run than would be required for many types of research project. Therefore, most current sequencers feature physically separated sequencing lanes or different types of flow cells or flow chips to allow overall throughput to be distributed to multiple samples, or to enable reduced-output modes of operation. While physical compartmentalization allows simple sample multiplexing without any requirement for adaptation of work flows, it is tied to the sequencing hardware and therefore offers very limited capabilities.

Bar coded multiplexing strategies have thus quickly become the method of choice to achieve greater operational flexibility with the primarily throughput-optimized high-end sequencer models. Bar coding protocols offer near-perfect scalability by allowing per-run sequencing depth to be distributed to an almost arbitrary number of samples.

Bar coded sequencing data require that, prior to further analysis, demultiplexing is performed in software to separate the respective samples as well as the sample sequences from the artificial bar code oligomers. The SHORE import utility integrates a flexible demultiplexer capable of dealing with a wide variety of standard and custom bar coding protocols. The respective bar coding setup is provided by the user in a simple tab-delimited sample sheet format. Our tool supports all combinations of 5' bar code ligation and index read protocols as well as variable length bar codes, and is able to take advantage of additional common subsequences such as restriction sites in RAD-Seq applications (section 1.3).

2.4.1 Overview

Bar coding is realized by fusing sample DNA with index oligomers that serve as an unambiguous sample identifier. Ligation of these bar codes can be achieved in many different ways, each coming with a different set of advantages and disadvantages regarding library preparation, cost and flexibility. Therefore, various bar coding protocols as well as different bar coding kits provided by sequencing instrument vendors are available.

Using the Illumina platform, index sequences may be associated with particular sequencing primers to be read as a separate index read. Other protocols ligate bar codes to be read in one go with the sample DNA, resulting in index oligomer and sample DNA being fused into a single sequence read. Moreover, paired-end sequencing protocols imply various possible bar coding configurations, with oligomer labels either attached to the start of just one of the sequencing reads, identical labels attached to either read, or different labels inserted at the start of the first and the second read. Labeling reads individually, or even combining 5' and index read bar codes, enables large numbers of samples to be discriminated using only few different oligomer tags, at the cost of a more elaborate sample preparation procedure and an increased overhead in sequencing cycles.

Often, multiple different labels are chosen to identify each respective sample. For example, strong bias in the sequenced sample may in certain cases interfere with sequencing instrument calibration, and therefore the set of bar code oligomers must be carefully compiled to achieve a largely balanced nucleotide composition in the multiplexed DNA.

Demultiplexing via tools provided by the sequencing instrument vendors is usually only possible for the vendor-specific indexing protocols. With the increasing variety of different bar coding protocols, we have gradually developed the bar code matching integrated with the SHORE import utility into a flexible demultiplexing solution that is configurable for all compositions of index read and 5' ligated bar coding via a simple sample sheet format. We employ a read oriented format to avoid manual enumeration of all valid bar code combinations for multi bar code configurations. Prior to resolving tuples of bar codes to the appropriate sample identifiers, bar code matching is performed individually for each read for improved performance and detection of potential mislabeling. The current bar code matching implementation is able to tolerate a fixed, user-specified number of mismatches and no gaps.



#?sample    read    barcode
Col-0       2       TTCACG
Col-0       2       GGATGT
Ler-1       2       CTAGGC
Ler-1       2       AGACCA

Listing 2.1: Example of a Simple Demultiplexing Sheet

2.4.2 A Format for Description of Multiplexing Setups

A SHORE demultiplexing sample sheet is a simple tab-delimited text table with named columns. The recognized column names are lane, sample, barcode_group, read, barcode, extbarcode and barcode_type.

The sample sheet is parsed utilizing a generic table input API provided by the SHORE library. Column names are defined in the header line: the first non-empty line of the file that is not a line comment, or a line introduced with the special comment tag #?. Columns are recognized by name and may occur in arbitrary order; further columns may be present, but are ignored.

The only mandatory column, named sample, defines the sample identifiers that bar codes are to be translated to. All further columns may be added in arbitrary combinations and order. A typical simple index read demultiplexing configuration is described by additionally specifying the read and barcode columns (listing 2.1), with read specifying the index of the read that contains the bar code tag and barcode the actual sequence of the bar code oligomer. The read index column may be omitted, indicating that all sequence reads of a pair are attached to identical bar codes.

Index read and 5' bar coding require suitable treatment of the respective bar code oligomers. While 5' bar codes must be clipped, with the remaining part of the read sequence to be retained, index reads must be removed completely from the data set. The default implemented in SHORE is to apply bar code clipping if the sample sheet entry refers to the first read or if the read index was omitted, and to filter bar-coded reads where the read index is greater than one. The barcode_type column allows this behavior to be controlled explicitly. SHORE accepts the bar code types read, 5prime and none, where the type is associated with the read index, i.e. all of a lane's entries with the same read index must also specify the same bar code type. Bar codes of type read will trigger filtering of the entire bar code associated read, whereas 5prime bar codes will be clipped.

For configurations with bar code information distributed across multiple sequence reads, several entries with different read index values are added to the definition for each sample (listing 2.2, lines 4 to 7). The sample sheet entries will then be grouped on the sample identifier, with each possible combination of bar codes with differing read indexes considered a valid bar code tuple for the respective sample. Sample sheet entries with the special sample identifier * are interpreted as valid for each sample specified for the respective sequencing lane (e.g. lines 16 and 17). For example, listing 2.2 defines two valid bar codes for the Ler-1 sample for each of the three sequencing reads, and thus 2³ = 8 valid 3-tuples of bar codes that can be resolved to that sample.

Automatic combination of all entries listed for a sample can be controlled by the user via specification of the barcode_group column. Sample sheet entries with different bar code group values are not able to form a valid bar code tuple (e.g. listing 2.2, lines 10 to 13). For example, the first read bar code specified in line 10 for the Col-0 sample may only be combined with the second read entry from line 11 and not with the one from line 13. Furthermore, a combination of lines 10 and 13 would collide with the bar code tuples including the entries from lines 4 and 7, which are already specified to resolve to the Ler-1 sample. As with the sample column, the bar code group may be set to * to specify an entry that refers to all groups of the respective sample in the respective lane (e.g. lines 16 and 17).

 1   #?lane  sample  barcode_group  read  barcode  extbarcode  barcode_type

     # Bar codes for the Ler-1 sample.
     1       Ler-1   0              1     AACT     TGCAG       5prime
 5   1       Ler-1   0              1     TAGC     TGCAG       5prime
     1       Ler-1   0              2     AACT     *           read
     1       Ler-1   0              2     CCCT     *           read

     # Bar codes for the Col-0 sample.
10   1       Col-0   0              1     AACTT    GCAG        5prime
     1       Col-0   0              2     GGAC     *           read
     1       Col-0   1              1     TTGCT    GCAG        5prime
     1       Col-0   1              2     CCCT     *           read

15   # Third read has the same bar codes for all valid samples.
     1       *       *              3     GGAC     *           5prime
     1       *       *              3     TTGC     *           5prime

     # Lane 2 is not multiplexed, discard the index read.
20   2       Bur-0   0              2     *        *           read

Listing 2.2: Example of a Full Demultiplexing Sheet

For certain applications, the identity of several nucleotides immediately following 5' bar code sequences is known, such as the restriction site sequence in RAD-Seq (section 1.3). While such sequences can be exploited to correctly assign each read to the appropriate sample, they usually should, in contrast to the bar code oligomers, not be removed from the output. Parts of the recognition sequence that should not be clipped from the read can be specified in the extended bar code (extbarcode) field of the table. Internally, the sequence to be recognized is constructed by concatenation of the barcode and extbarcode fields, while the division of sequence among both fields is translated into a bar code cut position. For bar code types other than 5prime, the split into bar code and extended bar code has no effect. The bar code cut position is determined in the context of the respective bar code tuple, i.e. for the same recognition sequence different samples may define a different split between bar code and extended bar code, as demonstrated by lines 4 and 10 of listing 2.2. If no part of the recognition sequence is to be removed from the output, then the entire oligomer can be provided as extended bar code, with the column barcode either omitted or set to *.

If neither bar code nor extended bar code are provided with a value other than *, then the respective sample sheet entry will match any read sequence. With bar code type either none or 5prime, such an entry can be utilized for assigning a certain sample identifier to an entire sequencing lane. On the other hand, this property may be exploited to completely remove all reads with a certain read index from the output (e.g. listing 2.2, line 20).

The sample sheet column lane serves to allow independent demultiplexing specifications for different sequencing lanes in a single sample sheet file. Rows with differing sequencing lane fields are completely independent of each other. If the sequencing lane column is omitted, then the entire demultiplexing specification is considered valid for all lanes of the instrument run.
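To illustrate how such a sheet expands into bar code tuples, here is a simplified Python sketch (assumptions: the #? header tag is present, wildcard samples and groups as well as extended bar codes are not handled, and whitespace is accepted in place of tabs):

from collections import defaultdict
from itertools import product

def barcode_tuples(path, lane='1'):
    # Expand a demultiplexing sample sheet into the valid bar code tuples of one lane:
    # rows are grouped by (sample, barcode_group) and, within a group, every combination
    # of bar codes with differing read indexes forms one valid tuple.
    header, rows = None, []
    for line in open(path):
        line = line.strip()
        if not line:
            continue
        if line.startswith('#?'):                     # header line
            header = line[2:].split()
        elif not line.startswith('#'):                # skip plain comment lines
            rows.append(dict(zip(header, line.split())))
    groups = defaultdict(lambda: defaultdict(list))   # (sample, group) -> read -> codes
    for r in rows:
        if r.get('lane', lane) != lane or r['sample'] == '*':
            continue                                  # wildcard samples not handled here
        key = (r['sample'], r.get('barcode_group', '0'))
        groups[key][r.get('read', '*')].append(r.get('barcode', '*'))
    tuples = {}
    for (sample, _), by_read in groups.items():
        reads = sorted(by_read)
        for combo in product(*(by_read[r] for r in reads)):
            tuples[tuple(zip(reads, combo))] = sample
    return tuples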



2.4.3 Barcode Recognition and Sample Resolution

Despite current sequencing technologies having reached rather high levels of stability, single cycle quality dropouts cannot be completely ruled out. Therefore, algorithms matching reads and bar code oligomers should be able to tolerate at least a limited Hamming or edit distance. Rather than directly mapping each tuple of reads to one of the potentially huge number of possible bar code tuples with a certain cumulative edit distance, our algorithm proceeds by individually assigning each sequence read to a specific bar code.

Sample demultiplexing is thus broken down into three successive operations. First, each sequencing read is matched against the set of allowed bar code sequences associated with the respective read index. Subsequently, given that all sequences for a flow chamber spot could be matched to one of the oligomers, the resulting tuple of bar codes is mapped to the appropriate sample identifier. Following successful bar code resolution, reads are then pruned of the index oligomers as required.

The initial bar code recognition step tolerates up to a fixed, user-specified Hamming distance between the prefix of the sequencing read and the bar code sequence. We perform a staged prefix matching against precomputed sorted sets of sequences, where each stage corresponds to a certain Hamming distance to the exact tag sequences. The set of sequences for each stage is generated from the set of valid tags for the respective read index by recursively mutating the sequences of the respective previous set, with the mutated sequences carrying a pointer to the original unmodified bar code. Potential collisions in bar code resolution are automatically detected and the corresponding entries pruned. At each stage, the read being assessed is used to query the set of tag sequences utilizing a binary relation that defines a pair of sequences as equivalent if one is a prefix of the other. If the query returns no result, the algorithm proceeds to the next stage, until the user-specified number of mismatches is reached and the read is discarded as unresolvable.
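A simplified Python rendering of this staged matching (a sketch: the actual implementation queries sorted sequence sets through a prefix-equivalence relation rather than scanning a dictionary, and handles further corner cases):

def mutation_stages(barcodes, max_mismatches, alphabet='ACGT'):
    # Stage d maps every sequence at Hamming distance d from a valid bar code back
    # to that bar code; sequences that become ambiguous at a stage are pruned.
    stages = [{b: b for b in barcodes}]
    for _ in range(max_mismatches):
        nxt, ambiguous = {}, set()
        for seq, origin in stages[-1].items():
            for i, c in enumerate(seq):
                for a in alphabet:
                    if a == c:
                        continue
                    mutated = seq[:i] + a + seq[i + 1:]
                    if nxt.get(mutated, origin) != origin:
                        ambiguous.add(mutated)       # collides between bar codes
                    nxt[mutated] = origin
        stages.append({s: o for s, o in nxt.items() if s not in ambiguous})
    return stages

def match_read(read, stages):
    # Try each stage in order of increasing mismatch count; a bar code matches if
    # it (or one of its mutations) is a prefix of the read.
    for stage in stages:
        for seq, origin in stage.items():
            if read.startswith(seq):
                return origin
    return None                                      # discard as unresolvable

# stages = mutation_stages(['AACT', 'TAGC'], max_mismatches=1)
# match_read('AACTTGCAG', stages) -> 'AACT'
# match_read('ATCTTGCAG', stages) -> 'AACT'   (one mismatch tolerated)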

Given that a tuple of reads could thus be successfully mapped to a tuple of valid index sequences, the set of valid bar code tuples is interrogated to resolve the bar code combination to the appropriate sample identifier. Each of the reads is then either pruned of its 5' end or flagged as filtered according to the sample sheet's bar code type specification. Since the bar code cut position can in general only be resolved in the context of the complete bar code tuple, this clipping operation is postponed until sample identifier and bar code group have been resolved.

Due to its restriction to Hamming distance rather than edit distance, our demultiplexer is of limited applicability to indel-prone sequencing technologies such as 454, Ion Torrent or SMRT sequencing. However, in certain cases perfect matching may be sufficient, or probable indels might be represented explicitly in the sample sheet. In setups requiring very long bar code oligomers and a high mismatch tolerance, recursive generation of the mutated bar code sets threatens to become overly resource intensive. However, bar code matching is implemented as one of several isolated modules with a simple interface, such that additional matching options might easily be added in the future as required, e.g. utilizing the dynamic programming alignment algorithms provided by the SHORE library. For highly customized context sensitive read clipping and splitting allowing for gaps and user-defined scoring schemes, we provide a separate utility, SHORE oligo-match (section 2.5).

2.5 Versatile Oligomer Detection and Read Clipping

In addition to sample demultiplexing, a wealth of different library preparation protocols and sequencing applications exist that involve ligation of certain adapter or linker sequences, or otherwise require sequence context aware clipping or splitting of the unprocessed read sequences. Depending on the protocol or application, the respective recognition sequences may be degenerate or truncated in various ways. Therefore, computational tools are required that are capable of detecting optimal sequence matches given defined constraints, and can utilize detected matches for pruning, splitting or otherwise manipulating sequencing read data according to user requirements.

In this work, a highly configurable utility SHORE oligo-match for pair-wise sequence matching and match-based read manipulation was implemented. The program utilizes dynamic programming alignment algorithms with customized alignment matrix and backtrace initialization, modes of backtracing and match filtering. Various modes of sequencing read manipulation or annotation are provided, and full match information may be retrieved for advanced customization needs. Our application supports both threaded as well as distributed memory modes of parallel operation. It is applicable to matching, manipulating and filtering even large data sets of sequencing reads against a small set of provided oligomer sequences.

2.5.1 Overview

Optimal pairwise sequence alignments can be computed using well known dynamic programming algorithms, given that no larger-scale sequence rearrangements need to be accounted for. Pairwise alignment algorithms are grouped into two classes: global alignment methods like the Needleman-Wunsch algorithm [133] and local alignment methods such as the Smith-Waterman algorithm [134]. However, matching pairs of short sequences such as high-throughput sequencing reads and adapter oligomers usually implies specific sequence overlap configurations, and therefore represents neither a global nor a local alignment use case.

Mate pair sequencing libraries for Roche 454 Genome Sequencer instruments are constructed through the Cre/lox system. Cre is an enzyme that catalyzes site specific recombination at pairs of lox sequences. The protocol exploits this affinity for circularizing multiple-kilobase DNA fragments. Circularized DNA is then cut in relative proximity to the recombination site to obtain a shorter linear fragment consisting of the inverted ends of the original fragment joined by a linker oligomer that includes a lox sequence. To obtain mate pair reads, the compound insert is sequenced and then split computationally adjacent to the detected location of the linker oligomer.

Cre/lox mate pair sequencing protocols have also been introduced to Illumina technology [135]. Due to the comparatively short read lengths, both ends of the circularized fragment cannot in general be retrieved as a single sequence read, and thus the insert is sequenced from both ends. The linker oligomer may be detectable in none, only one, or both sequences of a read pair, depending on read length, linker position and size of the insert.

Recognition and removal of adapter oligomers is further required in all situations where there is an overlap between the distributions of insert size and sequencing read length. In small RNA sequencing, all relevant sequencing reads will contain the 3′ sequencing adapter due to the short length of the sequenced polymer. This property enables alternative multiplexing protocols in which the bar code oligomer is ligated to the 3′ library adapter. In total RNA libraries, by contrast, adapter sequences are only picked up in a minority of reads.

Adapter and linker recognition require finding the optimal pair-wise alignment under the constraint of specific sequence overlap configurations. We define five different keywords, global, local, dangling_qry, dangling_ref and dangling_any, for the description of pair-wise sequence overlap configurations. At either end of the sequence alignment there are four different possible overlap configurations: overhang of the reference sequence (dangling_ref), overhang of the query sequence (dangling_qry), unaligned sequence in both reference and query (local), as well as no sequence overhang (global), and therefore 16 different alignment configurations in total (table 2.3).



Alignment   Matrix initialization (left end)   Backtracing (right end)
 1          global                             global
 2          global                             dangling_qry, dangling_any
 3          global                             dangling_ref, dangling_any
 4          global                             local
 5          dangling_qry, dangling_any         global
 6          dangling_qry, dangling_any         dangling_qry, dangling_any
 7          dangling_qry, dangling_any         dangling_ref, dangling_any
 8          dangling_qry, dangling_any         local
 9          dangling_ref, dangling_any         global
10          dangling_ref, dangling_any         dangling_qry, dangling_any
11          dangling_ref, dangling_any         dangling_ref, dangling_any
12          dangling_ref, dangling_any         local
13          local                              global
14          local                              dangling_qry, dangling_any
15          local                              dangling_ref, dangling_any
16          local                              local

Table 2.3: Dynamic Programming Alignment Modes



The configuration depicted in the first column of table 2.3 indicates the type of end overhang that should not be penalized by the alignment algorithm's scoring method; in these depictions, the lower line is defined as the reference (ref) and the upper line as the query (qry). The different end alignment modes form a hierarchy of constraints, with dangling_any a subset of local, dangling_qry and dangling_ref subsets of dangling_any, and global a common subset of dangling_qry and dangling_ref.

The term query will be used in the following to always indicate the sequencing read, and reference may thus refer to a possibly much shorter oligomer sequence. Alignment of a short read to a part of a genomic reference sequence constitutes an example of overlap configuration 11, defined by the keyword pair (dangling_ref; dangling_ref). Detection of sequencing adapters or 3′ bar codes in small RNA sequencing data corresponds to configurations 6 or 7, depending on whether the read contains the entire or only a partial adapter sequence. This subset of configurations is defined by the pair (dangling_qry; dangling_any). Detection of bar codes in 5′ bar-coded sequences is represented by case 2 (global; dangling_qry).

For Cre/lox-type recombinant mate pair sequencing, varied overlap constraints may be appropriate depending on the desired sensitivity-specificity tradeoff. Conservative detection of linker sequences in valid 454-type mate pairs corresponds to configuration 6. Illumina Cre/lox mate pair protocols obtain read pairs where the linker sequence is expected towards the end of either read (configuration 6 or 7). Occasionally, one of the enzymatic cuts of the circular DNA may also occur inside the linker DNA. Such cases are described by configuration 10. The subset of configurations 6, 7 and 10 constitutes a case of mutually dependent end alignment configurations. It can therefore not be accounted for by a single keyword pair, but requires the two pairs (dangling_qry; dangling_any) and (dangling_ref; dangling_qry) (or alternatively, (dangling_any; dangling_qry) and (dangling_qry; dangling_ref)).

The SHORE oligo-match utility is capable of selecting, for each sequencing read, the optimal alignment or alignments out of multiple pairs of end alignment modes, multiple reference oligomers and optionally their reverse complemented sequences.

By utilizing a fully customizable 16x16 scoring matrix, the program enables adjustable handling of ambiguous IUPAC nucleotide codes as well as asymmetric base mismatch penalties with respect to the direction of the match.

A pair-wise sequence mapping is composed of two pairs of end coordinates as well as the alignment describing the actual base pairings and sequence gaps. To increase the specificity of read clipping and splitting operations, it is desirable to ensure that optimal pair-wise mappings feature a unique pair of end coordinates, whereas potential alternative alignments can be considered irrelevant. Our alignment algorithm is capable of generating either an exhaustive list of all possible alignments, a list featuring a representative of all alignments with a different pair of end coordinates, or just a single representative for each pair-wise mapping, which is optionally assessed for uniqueness of end coordinates.

The utility provides filters for pair-wise mappings with respect to uniqueness of oligomer selection and end coordinates. Alignments that remain valid after filtering may be comprehensively reported and applied to clipping or splitting sequencing reads at either, or at both, ends of the detected oligomer.

2.5.2 Dynamic Programming Alignment and Backtracing Pipeline

Our alignment pipeline proceeds in three subsequent passes. Initially, dynamic programming alignment of the sequencing read against each oligomer is performed for each pair of provided end alignment modes. The optimal alignment or alignments are passed to the backtracing module. Finally, generated traces are filtered and may subsequently be applied to read manipulation. While alignment scores become available after the initial stage, filtering is delayed until after backtracing for reasons of threshold calculation.

The end alignment mode for the left end of the alignment determines the mode of alignment matrix initialization and alignment score calculation. With left end alignment mode local, initialization and score calculation proceed according to standard local alignment, with the first row and column initialized to zero and the alignment score at each field of the matrix truncated at zero from below. Left end mode dangling_any utilizes the same matrix initialization, but does not clip alignment scores at zero. With dangling_ref and dangling_qry, only the first row or column is zero-initialized, respectively, given that the width of the matrix is determined by the size of the reference sequence. Sequence overhangs of the respective other sequence are valued with the regular gap penalty. Mode global does not perform such zero-initialization of the matrix, as in standard global alignment.
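As a compact illustration of these initialization rules, a Python sketch follows; it is not the SHORE code, and the gap penalty, the function names and the separate clipping helper are assumptions made for the example.

```python
def init_matrix(n_ref, n_qry, left_mode, gap=-2):
    """Score matrix with the first row spanning the reference and the first
    column spanning the query; a zero initialization leaves the corresponding
    leading overhang unpenalized."""
    H = [[0] * (n_ref + 1) for _ in range(n_qry + 1)]
    free_ref = left_mode in ("local", "dangling_any", "dangling_ref")
    free_qry = left_mode in ("local", "dangling_any", "dangling_qry")
    if not free_ref:                    # leading reference overhang costs gaps
        for j in range(1, n_ref + 1):
            H[0][j] = j * gap
    if not free_qry:                    # leading query overhang costs gaps
        for i in range(1, n_qry + 1):
            H[i][0] = i * gap
    return H

def clip_cell(score, left_mode):
    """Only mode 'local' truncates cell scores at zero from below."""
    return max(score, 0) if left_mode == "local" else score

print(init_matrix(3, 2, "dangling_ref")[0], init_matrix(3, 2, "global")[0])
```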

For selection of the best out of multiple permitted pairs of end alignment modes for a reference oligomer, multiple passes of dynamic programming alignment must be performed due to the distinct requirements of matrix initialization. For each distinct left end alignment mode, a corresponding right end alignment mode may be defined. However, alignment is only performed if the configuration is not already included by a different pair of modes specified, i.e. if the constraint on the right end of the alignment is weaker than that specified by the next left end mode that includes the current one. Due to the hierarchical nature of tolerated sequence overhang configurations, there is nonetheless a chance that the same alignment will be produced multiple times by different keyword pairs. For example, (dangling_ref; dangling_qry) and (dangling_qry; dangling_ref) both define supersets of the (global; global) configuration. Such redundancies can be eliminated prior to backtracing by determining the subset of the respective end alignment mode that has already been covered by previous alignments and excluding it by assigning the corresponding fields of the alignment matrix the maximum penalty.

For backtracing initialization and alignment score calculation, the alignment algorithm keeps track of the values and locations of the last row, last column and global score maxima of the matrix. Mirroring matrix initialization, backtracing starts at the global score maxima for right end alignment mode local, at the last row and last column maxima for modes dangling_ref and dangling_qry, respectively, at the combined last row and last column maxima for dangling_any, and at the lower right corner for global.

Whenever the backtracing algorithm encounters multiple possible fields of origin for an alignment score, alternative paths are stacked for later completion, unless retrieval of only a single representative alignment was requested. While exhaustive tracing of all possible alignment paths can be combinatorially unfavorable, generating a representative for all distinct pairs of mapping end coordinates can be performed efficiently. For this purpose, each field of the matrix that has been reached by the backtracing algorithm from the same right end coordinates is marked as visited. If a field marked as visited is reached from one of the stacked alternative paths, the respective path is aborted and the algorithm proceeds to assess the next path.
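The following Python sketch illustrates the stack-and-visited-set idea on an abstract trace graph; the predecessor function, the toy graph and all names are invented for the example, and the sketch is not the SHORE backtracer.

```python
def trace_representatives(start, predecessors, is_origin):
    """Enumerate one backtrace per distinct pair of end coordinates.

    'start' is the right-end cell the trace begins at, 'predecessors(cell)'
    yields possible cells of origin for the score at 'cell', and
    'is_origin(cell)' marks cells where a trace terminates (the left end).
    Cells reached once from this right end are marked as visited; branches
    running into a visited cell are abandoned, so each distinct left-end
    coordinate is reported only once.
    """
    visited = set()
    stack = [(start, (start,))]
    while stack:
        cell, path = stack.pop()
        if cell in visited:
            continue                      # branch already covered
        visited.add(cell)
        if is_origin(cell):
            yield path                    # one representative trace
            continue
        for prev in predecessors(cell):
            stack.append((prev, path + (prev,)))

# Toy usage on a hand-made trace graph (purely illustrative):
preds = {(2, 2): [(1, 1), (1, 2)], (1, 2): [(0, 1)], (1, 1): [(0, 0)]}
paths = list(trace_representatives((2, 2), lambda c: preds.get(c, []),
                                   lambda c: c[0] == 0))
print([p[-1] for p in paths])             # distinct left-end cells
```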

For read clipping purposes it is typically only relevant whether an oligomer can be assigned to unique coordinates on the sequencing read. For assessment of coordinate uniqueness, the backtracing algorithm proceeds only until alternative end coordinates have been encountered, and the alternative trace is not emitted. If the backtrace completes without encountering coordinates different from the primary trace, its end coordinates are marked as unique.

The reliability of a sequence match is determined by both the length of the match and the relative amount and type of its edit operations. Alignment score filtering is therefore configured by setting the slope and offset of a linear function. By default, the function's parameter is the length within the shorter of reference and query that is spanned by the match. Alignments with a score below the threshold thus calculated are not propagated to the list of results and to sequencing read manipulation. Alternatively, the score threshold may be set up to be calculated using the total length of the selected sequence as the function parameter.
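A minimal sketch of this length-dependent cutoff is shown below; the slope and offset values are arbitrary illustrations rather than SHORE defaults.

```python
def passes_score_filter(score, match_span, slope=1.0, offset=0.0):
    """Keep an alignment only if its score reaches a linear, length dependent
    threshold; 'match_span' is by default the number of bases the match
    covers on the shorter of reference and query."""
    return score >= offset + slope * match_span

# e.g. requiring roughly one score unit per matched base after the first
# five bases (hypothetical parameter choice):
print(passes_score_filter(score=18, match_span=20, slope=1.0, offset=-5.0))
```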

2.6 A Parallelization Front-End for Short Read Alignment Tools

A great variety of sequencing read mapping and alignment programs is nowadays available to users. While most of these applications are similar in operation, each comes with its own specific user interface and usage peculiarities. Despite the huge selection of different implementations and extensive optimization of the underlying mapping and alignment algorithms, read mapping furthermore remains one of the computationally most expensive steps of resequencing applications (section 1.5.3). While parallel processing can often solve the issue of extensive run times, parallelization support is not uniform among the available solutions. Therefore, as each read alignment tool comes with its own set of trade-offs, strengths and weaknesses, users are often left to decide between performance and required or desired features, and frequently need to adapt to different interfaces.

The SHORE mapflowcell application is a parallelization, pre- and post-processing wrapper for a variety of widely used and freely available short read mappers [1]. In addition to shared memory parallelization, the application has been reimplemented to support networked distributed memory architectures through the Message Passing Interface (MPI) standard. Further added capabilities include automatic adjustment of parameters for handling data sets with heterogeneous read lengths, as well as straightforward integration of mapping results obtained using different aligner back-ends or sets of mapping parameters. The program thus constitutes a meta-alignment tool able to amend mapped sequencing read data sets using different aligners, parameters or additional reference sequence information. Such multi-pass read mapping may especially benefit computationally expensive approaches such as spliced read mapping.

SHORE mapflowcell currently provides a common user interface to the tools GenomeMapper, BWA, Bowtie, Bowtie2, ELAND and BLAT. User parameters are automatically transformed into the appropriate equivalent for the respective mapping back-end, and results are consistently reported in the compressed, seekable SHORE MapList format, which is easily convertible to various data exchange formats such as SAM.

Ion or pyrosequencing read data sets, as well as other types of sequencing data sets following quality-based read trimming (section 2.3) or sequence context specific read clipping (section 2.5), feature read length distributions with a broad support range. However, many read mapping tools only support a set of static alignment parameters provided at program startup, resulting in below-par mapping specificity for reads on the lower tail of the read length distribution and insufficient sensitivity for reads at the upper end of the length spectrum.

The SHORE mapflowcell application automates read length-dependent selection of all relevant alignment parameters such as maximum tolerated edit distance, gap extension lengths or seed sizes. These user parameters may be provided as either absolute values or percentages of read length; additionally, edit distance may be bounded according to the alignment seed lemma. Given a certain read length l and edit distance d, a read alignment is guaranteed to contain a perfectly matching seed of size s = ⌊l/(d + 1)⌋, and therefore the maximum edit distance that guarantees finding a seed for each valid mapping is d = ⌊l/s⌋ − 1. We provide two modes of reducing the initially specified edit distance. A strict mode limits the maximum edit distance to d_max = ⌊l/s⌋ − 1, thereby ensuring that either all valid alignments with optimal edit distance are found for a read, or it is left unmapped (further seed selection heuristics implemented by the mapping tool excluded). An alternative setting imposes a weaker limit d_max = ⌊l/s⌋, such that some of multiple mappings with the same edit distance may be missed, but reads are left unmapped if valid alignments at a lesser edit distance cannot be excluded.
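The bound can be sketched in a few lines; the function and parameter names are hypothetical, and the example merely reproduces the ⌊l/s⌋ − 1 limit of the strict mode and the relaxed ⌊l/s⌋ alternative for a 100 bp read.

```python
def max_edit_distance(read_length, seed_size, requested, strict=True):
    """Bound the requested edit distance so that every valid alignment still
    contains a perfect seed of 'seed_size' bases (seed lemma)."""
    bound = read_length // seed_size - 1
    if not strict:
        bound += 1          # weaker limit: some co-optimal hits may be lost
    return min(requested, max(bound, 0))

# A 100 bp read with a 12 bp seed: at most 7 edits still guarantee a perfect
# 12-mer seed; the relaxed mode allows 8.
print(max_edit_distance(100, 12, requested=10, strict=True))   # 7
print(max_edit_distance(100, 12, requested=10, strict=False))  # 8
```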



Sequencing of DNA that is highly diverged from available reference genome sequences, as well as transcriptome sequencing due to intron-exon junctions, implies that many sequencing reads cannot be mapped to the reference as a contiguous unit. Such data must be mapped using specialized spliced sequencing read mapping programs like PALMapper [94] or generic matching tools such as BLAT [136]. Specialized spliced mapping algorithms may however represent a suboptimal choice for mapping the entire data set, either for reasons of performance or specificity. A different, yet related, problem is posed by cases where the respective reference sequence is too large to be handled with a certain combination of mapping tool and available hardware. Our utility is able to mitigate these issues through automatic combination of the results of multiple read mapping passes using different alignment tools, parameters or reference sequence information. Two different re-mapping modes offer either full reassessment of the entire data set utilizing the information of prior read mapping passes, or processing and integrating additional mapping results for reads previously not mappable to the reference genome.

For robust handling of the variable reference sequence information provided to the program, SHORE mapflowcell maintains a sequence dictionary used to record reference sequence identifiers, metadata and checksums along with the alignment result file. With each read mapping pass the reference dictionary is updated appropriately, and possible additional sequences are assigned new unique sequence identifiers as required.

For reassessment of reads previously not mappable to the reference genome, mapping proceeds as in single-pass application, and the results are the union of all newly discovered read mappings with the results of the previous pass. For full refinement of entire data sets, both previously mapped and unmapped reads become subject to realignment. Previous mapping results are however factored into the calculation of alignment parameters and the determination of batch identity. On completion, the results of the current and previous mapping run are merged, while at the same time read alignments rendered suboptimal by the newly obtained results are pruned from the mapping data set.

Read mapping tools incorporating computationally more expensive spliced alignment algorithms further emphasize the need for efficient and capable parallelization of the mapping process. While many aligners implement simple threaded parallelization, support is not universal, and only a few mapping tool-specific solutions exist for distributed memory settings or cloud environments, e.g. CloudBurst [95]. SHORE mapflowcell implements a simple batch-based parallelization approach, in which automatically generated batches of input data may be distributed to either an alignment tool running in multi-threaded mode, to multiple aligner processes instantiated on the same machine, to alignment tools running within MPI processes distributed over a network, or to a hybrid combination of all of the above.

SHORE mapflowcell defines a binary relation on the read data by which reads are equivalent if they are set to be aligned using the same combination of static mapping parameters, and divides the data set into batches accordingly. Reads are partitioned into batch-specific files in SHORE's temporary file directory and, on reaching a user-specified maximum batch size, submitted for parallel processing. Parallel operations include the actual read mapping as well as format conversions and sorting where required. With fine tuning of the maximum batch size parameter, a simple measure is provided for adjusting the tradeoff between minimal parallelization overhead and improved load balancing. For distributed memory parallelization, the specified temporary file location is required to be shared over the relevant network, with only meta information on each batch being transmitted via MPI channels for simplicity of implementation. On completion of all batches related to a specific input file, mapping results are automatically merged back into the permanent storage location.
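The batching rule can be pictured with a short Python sketch; the grouping key, the submit callback and the toy reads are invented for illustration and do not reflect SHORE's actual file handling or parameter keys.

```python
from collections import defaultdict

def batch_reads(reads, static_params, max_batch_size, submit):
    """Group reads that share the same static alignment parameters and hand
    each group to 'submit' once it reaches 'max_batch_size' reads."""
    batches = defaultdict(list)
    for read in reads:
        key = static_params(read)            # e.g. an edit distance bound
        batches[key].append(read)
        if len(batches[key]) >= max_batch_size:
            submit(key, batches.pop(key))
    for key, leftover in batches.items():
        submit(key, leftover)                # flush incomplete batches

# Hypothetical usage: reads batched by a length-derived parameter key.
reads = ["A" * 36, "C" * 36, "G" * 100, "T" * 100, "A" * 100]
batch_reads(reads,
            static_params=lambda r: ("max_edits", len(r) // 25),
            max_batch_size=2,
            submit=lambda key, batch: print(key, len(batch)))
```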



2.7 Robust Detection of ChIP-Seq Enrichment

Investigation of protein-DNA interaction by ChIP-Seq can help to further our understanding of cellular regulatory networks (section 1.3.2). Primary data analysis for ChIP-Seq experiments requires software tools that can pinpoint hundreds to thousands of putative protein binding sites from data sets of high-throughput sequencing read alignments (section 1.5.6).

This section discusses the peak calling pipeline SHORE peak integrated into the SHORE analysis suite. The utility is able to discriminate true enrichment from common artifact patterns using various robust heuristics, calculates false discovery rates (FDR) by interrogation of control experiments, and implements a unique joint detection, separate evaluation approach to the handling of replicate experiments. The program outputs the locations of possible binding regions together with their FDR and various additional features potentially useful for ranking, filtering or fine-tuning the detection. It further enables users to generate graphical representations of peak regions, to put detected loci into the context of available genome annotation, or to extract binding region sequences for subsequent motif sampling and analysis.

Our pipeline is pre-configured for straightforward detection of defined loci of ChIP-Seq enrichment as produced by transcription factor (TF) binding sites. Settings may however be adjusted to support the detection of clusters of enrichment that occur e.g. in data representing states of histone modification. Individual algorithms are implemented as pluggable modules and have been brought over to the multi-purpose tools SHORE coverage and SHORE count, allowing fine-grained control and flexible applicability of enrichment detection in diverse scenarios such as MNase-Seq assays or assessment of copy number variation (CNV).

The SHORE peak pipeline has been outlined [58, 137] and further employed [138-141] in multiple biological publications.

2.7.1 Overview

Were sequencing reads randomly sampled from the genome, sequencing depth should be distributed according to a Poisson distribution whose parameter equals the average depth [142]. Although high-throughput sequencing is known to be highly non-random, with biases that bring about considerable deviation from the Poisson model (section 1.4), the distribution can still present a powerful tool for the detection of irregularities in sequencing depth at high sensitivity.

Our peak calling algorithm consists of two major steps. The first pass is a detection phase that applies a relaxed Poisson assumption to the read alignment data generated through the ChIP-Seq experiment to identify regions of putative enrichment at high sensitivity. In the subsequent second pass, implemented to improve the specificity of the prediction, these candidate regions are subjected to several robust empirical rules for artifact removal and further tested against control experiment data to assess the significance of the enrichment.

2.7.2 Correcting for Duplicated Sequences

The ChIP-Seq procedure yields quantities of DNA that are typically orders of magnitude below the amounts retrieved by other methods of DNA sampling. The excess PCR rounds required to obtain a compliant sequencing library often bring about significant amounts of duplicated sequences, reducing library complexity. PCR may entail significant sequence bias (section 1.4) with erratic rates of fragment duplication.

Duplicate sequences represent the same fragment of DNA and therefore violate the assumption that each sequencing read is sampled independently from the set of sequences under examination. During detection of sequence enrichment, the presence of duplicate sequences must thus be explicitly accounted for by the predictive model, or these sequences must be pruned from the data set prior to application of the actual detection algorithm. Most current ChIP-Seq analysis algorithms, including SHORE peak, implement a pruning approach.

In the presence of sequencing errors or variable read length, PCR duplicates cannot be recognized by simply collapsing all reads having the same nucleotide sequence. Furthermore, at enriched loci identical sequences may be sampled independently. In the sequencer output, these identical sequences are thus indistinguishable from PCR artifacts.

Duplicate recognition is therefore implemented as a filtering step that is applied after mapping the reads to the reference genome. ChIP-Seq algorithms typically restrict the usage of duplicate reads by applying a static limit to the number of reads having their 5′ end mapped to the same position of the genome. While reliably removing duplicated sequences, this approach does not distinguish between duplication events and independent sampling, and thus suffers from saturation once all reference positions become occupied.

We therefore adaptively remove PCR duplicates, using a heuristic based on the assumption that the number of non-duplicated reads n*_x that have their 5′ end at any position x of the genome is sampled from a Poisson distribution with unknown distribution parameter λ_x, and that each of these reads is represented on average by an unknown number of copies α_x. Given the assumption holds, α_x equals the dispersion index D_x = σ²_x/μ_x of the read count distribution including duplicates. We further assume that all λ_i and all α_i for positions in close vicinity take similar values, and that the read count distribution for any position x can therefore be approximated using a short range of positions adjacent to x. Based on these assumptions, we estimate α for each position using a sliding window. In a window, the dispersion index is estimated as the ratio of the empirical variance and mean of the read counts. To increase the adaptiveness of the sliding window calculation, the ratio is calculated independently for a left window that includes all positions from x − k to x, and a right window including positions x to x + k:

\hat{D}_{\text{left}} = \frac{s^2_{(x-k)..x}}{\bar{n}_{(x-k)..x}}, \qquad \hat{D}_{\text{right}} = \frac{s^2_{x..(x+k)}}{\bar{n}_{x..(x+k)}}

with s² and n̄ the empirical variance and mean over the positions indicated, respectively. The left and right results are then averaged using the root mean square (RMS) of both values to obtain an estimate α̂_x of α_x:

\hat{\alpha}_x = \sqrt{\tfrac{1}{2}\left(\hat{D}_{\text{left}}^2 + \hat{D}_{\text{right}}^2\right)}

The division into left and right windows should allow for sharper boundaries between enriched and background sites. The RMS average over the two windows was chosen to ensure that re-evaluation using a scaled read count vector n′ = n · 1/α̂_x as input will always yield α̂′_x = 1.0.

In scenarios where it is desirable to retain the complete set of sequences with their associated base qualities, the estimated multiplicity factor α̂_x may be incorporated into the analysis by assigning each read starting at x a weight of 1/α̂_x during calculation of sequencing depth. If retaining all sequences is not a concern, the maximal number of reads may alternatively be limited to ⌊n_x/α̂_x⌋, with n_x the observed read count at position x. The limiting approach is currently implemented in SHORE, discarding all excess alignments once the read count limit for the respective position has been reached.
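A numerical sketch of the estimate follows; the window size, the toy counts and the flooring of α̂ at one are assumptions of the example rather than SHORE defaults.

```python
from statistics import mean, pvariance

def alpha_hat(counts, x, k):
    """RMS of the left and right window dispersion indices around x;
    flooring at 1.0 is an assumption of this sketch, not part of the text."""
    left = counts[max(0, x - k): x + 1]
    right = counts[x: x + k + 1]
    d_left = pvariance(left) / mean(left) if mean(left) > 0 else 1.0
    d_right = pvariance(right) / mean(right) if mean(right) > 0 else 1.0
    return max(1.0, (0.5 * (d_left ** 2 + d_right ** 2)) ** 0.5)

# Hypothetical 5' read start counts; position 5 carries a duplication spike.
counts = [2, 3, 2, 2, 3, 30, 3, 2, 2, 3, 2]
a = alpha_hat(counts, x=5, k=4)
cap = int(counts[5] // a)                # retain at most floor(n_x / alpha)
print(round(a, 2), cap)
```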

The duplicate correction algorithm does not provide any guarantees regarding the correctness of α̂_x. However, given that the variance of the read counts of adjacent positions is likely to exceed the Poisson assumption, the average multiplicity should be overestimated. The algorithm is therefore conservative in the sense that the true number of non-duplicated reads per position will be underestimated. In contrast to imposing a static limit on the number of reads accepted at each position, the adaptive heuristic still allows discrimination of different levels of extreme enrichment, as well as peak calling in the presence of deep background coverage.
enrichment as well as peak calling in presence of deep background coverage.<br />

2.7.3 Detection Phase

The ChIP-Seq procedure enriches for short DNA fragments with above-average affinity to the protein of interest. One or both ends of these fragments are subsequently sequenced, allowing detection of their origin on the reference genome assembly. In the following, the enriched DNA fragments subjected to sequencing will be referred to as inserts, whereas the sequenced ends of the fragments will be referred to as reads. The number of inserts putatively overlapping a position will be termed insert depth, and the number of reads overlapping the position read depth.

The primary peak detection phase, resembling that of MACS [104] (section 1.5.6), is realized as a sliding window analysis of insert depth. A window of fixed, user-selectable width is shifted along the reference sequence in single base steps. In each step, the algorithm calculates the average depth of insert coverage over the current window. In the case of paired-end sequencing, insert depth is calculated from the ranges defined by read pairs. For single-end sequencing, each read is extended in 3′ direction to match the estimated average insert size. To exclude regions not accessible to the sequencing experiment, positions with sequencing depth zero are ignored, i.e. the average depth is calculated as

\bar{d}'_W = \frac{\sum_{x \in W} d(x)}{\left|\{x \in W : d(x) > 0\}\right|}

where W is the set of reference sequence positions included by the sliding window and d(x) signifies the depth at a position x. Since the sliding window may or may not contain enriched sites, d̄′_W may be regarded as a conservative estimate of the minimum average background signal over the window. d̄′_W is used as the average coverage depth parameter to evaluate the potential enrichment at the central position x_c of the sliding window by a one-sided Poisson test. The test calculates the p-value for the respective coverage depth from the cumulative distribution function for the upper tail of the Poisson distribution:

p(x_c) = 1 - e^{-\bar{d}'_W} \cdot \sum_{i=0}^{\lfloor d(x_c) \rfloor} \frac{(\bar{d}'_W)^i}{i!}

Consecutive positions falling below a certain significance threshold, by default set to 0.05, are joined to form a candidate region.
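The detection step can be paraphrased in a few lines of Python; the depth track and window width are invented, and the sketch omits paired-end handling and the auxiliary thresholds discussed below, so it is an illustration rather than the SHORE implementation.

```python
from math import exp

def window_background(depth, centre, half_width):
    """Mean depth over the window, ignoring zero-coverage positions."""
    window = depth[max(0, centre - half_width): centre + half_width + 1]
    covered = [d for d in window if d > 0]
    return sum(covered) / len(covered) if covered else 0.0

def poisson_upper_tail(observed, lam):
    """p = 1 - exp(-lam) * sum_{i=0}^{floor(observed)} lam^i / i!"""
    if lam <= 0.0:
        return 1.0
    term, partial_sum = 1.0, 1.0             # i = 0 term of the sum
    for i in range(1, int(observed) + 1):
        term *= lam / i
        partial_sum += term
    return max(0.0, 1.0 - exp(-lam) * partial_sum)

# Toy insert depth track with an enriched stretch around position 10.
depth = [3, 4, 3, 0, 4, 3, 4, 5, 18, 22, 25, 21, 17, 5, 4, 3, 4, 0, 3, 4]
significant = [x for x in range(len(depth))
               if poisson_upper_tail(depth[x],
                                     window_background(depth, x, 5)) < 0.05]
print(significant)   # consecutive significant positions form a candidate region
```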

Detection is fine-tunable using two auxiliary significance thresholds. A relaxed probing significance threshold is accepted to automatically join candidate regions separated only by short drops in depth of coverage into a contiguous region. Furthermore, candidate regions may be pre-filtered using an acceptance threshold, discarding all regions that do not include at least one position that satisfies this more stringent significance criterion.

2.7.4 Recognition of Read Mapping Artifacts

Final significance scores for enrichment are calculated based on the number of read mappings providing evidence for a respective site (section 2.7.5). For multiple reasons, significance of enrichment is on its own of limited value for singling out relevant protein binding sites (section 2.7.6). Peak significance is therefore primarily used as a means of imposing a ranking on the detected loci.



Strong oversampling of a region of the reference sequence, for example due to the collapse of repetitive sequence onto few representations in a reference (section 1.5.3), may result in high levels of significance for such loci even at minimal strengths of enrichment. Oversampling thus entails high-ranking loci that are of limited relevance to the analysis when compared to significant loci sampled at regular depth.

SHORE peak offers various filtering heuristics that allow for reliable removal of typical oversampling artifacts. Oversampled regions are marked by elevated depth of coverage in both the experiment and the control sample. By specifying a maximum coverage depth tolerated in the control for valid candidate regions, many oversampled positions are ruled out. SHORE filters candidate regions based on their average depth of control sample coverage, discarding regions whose depth lies above a certain threshold. The rejection threshold is calculated adaptively in terms of a user-specified multiple of standard deviations distance from the median chromosomal depth of coverage. When analyzing multiple replicate experiments, such suspiciously high depth of coverage in any of the specified controls triggers removal of the proposed candidate region by our algorithm. By default, regions with an average control sample read depth more than six standard deviations above the median depth are ignored.

Fold change, calculated as the ratio of read counts or average depths of experiment and control sample coverage at a candidate locus, in some cases presents a more sensitive criterion for recognition of loci that satisfy a strict significance cutoff due to oversampling, but are of limited relevance to further analysis. Oversampling is then marked by large amounts of supporting evidence in the form of reads mapped to the region, but a comparably weakly pronounced fold change. SHORE calculates normalized fold change values for a region as

f = \frac{\bar{d}_e}{k_{e,c} \cdot \bar{d}_c} \qquad (2.1)

where d̄_e and d̄_c represent the mean depth of coverage for experiment and control, respectively, and k_{e,c} a scaling factor calculated for experiment and control over the respective chromosome to accommodate differences in overall sequencing depth (section 2.7.5). Regions with a normalized fold change f < 2 are not considered by default. For experiments comprising multiple replicates, the user-specified fold change criterion must be satisfied by at least one replicate.
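The two filters just described can be combined in a short sketch; which summary statistics are used for the chromosomal control depth (per-bin values here) and all numbers are hypothetical choices made for illustration.

```python
from statistics import median, pstdev

def passes_artifact_filters(region_exp_depth, region_ctrl_depth,
                            ctrl_bin_depths, scaling,
                            max_sd=6.0, min_fold=2.0):
    """Reject putative oversampling: control coverage in the region must not
    exceed median + max_sd * SD of the chromosome-wide control depth, and the
    depth-normalized fold change (eq. 2.1) must reach min_fold."""
    ctrl_cutoff = median(ctrl_bin_depths) + max_sd * pstdev(ctrl_bin_depths)
    fold = region_exp_depth / (scaling * region_ctrl_depth)
    return region_ctrl_depth <= ctrl_cutoff and fold >= min_fold

# Hypothetical region; chromosome-wide control depth summarized in bins.
ctrl_bins = [8, 9, 10, 9, 8, 11, 10, 9, 12, 9]
print(passes_artifact_filters(region_exp_depth=48.0, region_ctrl_depth=9.5,
                              ctrl_bin_depths=ctrl_bins, scaling=2.0))
```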

For true sequence enrichment, mapped reads represent precipitated fragments of DNA typically larger than the sequenced read length. By contrast, oversampling artifacts arise due to over-representation solely of the part of the sequence utilized for read mapping. Given that this mapping length is shorter than the actual insert length, a characteristic difference in the strand-specific read depth curves ensues for enrichment compared to oversampling. In the case of enrichment, the read depth curves for the forward and the reverse strand are shifted, with the shift offset representing the missing 3′ parts of the inserts that are not taken into account for the read depth calculation (figure 2.6). Contrarily, for oversampled regions not comprising enriched sequence, no such shift in strand-specific read depth is observed. The peak shift of single-end read depth may therefore be exploited as an acceptance criterion for enriched regions. SHORE peak estimates a peak shift value for a region based on the two areas under the read depth curves for each strand. For both of these areas, the center of gravity (COG) is calculated. The estimated peak shift then equals the difference between the x-coordinate of the COG for the reverse strand and the x-coordinate of the forward strand COG. In multi-replicate experiments, the user-specified peak shift threshold must be satisfied by at least one replicate.
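A sketch of the COG-based estimate on made-up strand-specific depth curves could look as follows; the curves and function names are invented.

```python
def centre_of_gravity(depth):
    """x-coordinate of the centre of gravity of a depth curve."""
    total = sum(depth)
    return sum(x * d for x, d in enumerate(depth)) / total if total else 0.0

def peak_shift(forward_depth, reverse_depth):
    """Estimated shift: COG of the reverse-strand curve minus COG of the
    forward-strand curve; close to zero for oversampling artifacts."""
    return centre_of_gravity(reverse_depth) - centre_of_gravity(forward_depth)

# Toy strand-specific read depth over a candidate region: the reverse strand
# peak trails the forward strand peak by roughly the insert length offset.
fwd = [0, 2, 6, 9, 7, 3, 1, 0, 0, 0, 0, 0]
rev = [0, 0, 0, 0, 0, 1, 3, 7, 9, 6, 2, 0]
print(round(peak_shift(fwd, rev), 1))
```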

COG-based shift estimation may be considered robust for defined loci such as transcription factor binding sites. For larger stretches of enriched sequence representing clustered binding sites or histone methylation, the COG is easily skewed by local biases or random fluctuations in sequencing depth, and therefore does not provide a reliable estimation of the peak shift.



[Figure 2.6: panels labeled "SOC1" and "Control" show strand-specific depth of coverage [# reads] against position [bp] on chromosome 2, positions 18,810,400 to 18,811,800.]

Figure 2.6: ChIP-seq Depth of Coverage in the Vicinity of a Transcription Factor Binding Site

2.7.5 Ranking and Assessment of Peak Significance

In a final step, the peak calling algorithm assesses the detected candidate regions to establish a ranking of putative peak sites and determine their level of significance.

For any candidate region, given that the read count for the experiment is a random variable X and the read count for the control is a random variable Y, their joint read count Z = X + Y is distributed according to the convolution of the two distributions.

Assuming the joint read count in a region takes an observed value Z = z, the experiment read count is distributed according to a probability mass function P(X = x | Z = z). Using Bayes' theorem, it is

P(X = x \mid Z = z) = \frac{P(Z = z \mid X = x) \cdot P(X = x)}{P(Z = z)} = \frac{P(Y = z - x) \cdot P(X = x)}{P(Z = z)} \qquad (2.2)

= \frac{P(Y = z - x) \cdot P(X = x)}{\sum_{k=0}^{z} P(Y = z - k) \cdot P(X = k)} \qquad (2.3)

If the experiment and control read counts followed Poisson distributions, X ∼ Pois(λ_1) and Y ∼ Pois(λ_2), then their convolution would be Poisson distributed as well, Z ∼ Pois(λ_3), with λ_3 = λ_1 + λ_2. Applying the Poisson assumption to equation 2.2, it follows that

P(X = x \mid Z = z) = \frac{\dfrac{\lambda_2^{z-x} e^{-\lambda_2}}{(z-x)!} \cdot \dfrac{\lambda_1^{x} e^{-\lambda_1}}{x!}}{\dfrac{\lambda_3^{z} e^{-\lambda_3}}{z!}}
= \frac{z!}{x!\,(z-x)!} \cdot \frac{\lambda_1^{x} \lambda_2^{z-x} e^{-(\lambda_1 + \lambda_2)}}{\lambda_3^{z} e^{-\lambda_3}}
= \binom{z}{x} \cdot \frac{\lambda_1^{x} \cdot \lambda_2^{z-x}}{(\lambda_1 + \lambda_2)^{z}}
= \binom{z}{x} \cdot \left(\frac{\lambda_1}{\lambda_1 + \lambda_2}\right)^{x} \cdot \left(1 - \frac{\lambda_1}{\lambda_1 + \lambda_2}\right)^{z-x} \qquad (2.4)

Therefore, given Poisson distributed read counts and an observed joint read count Z = z for a region, the read count of the experiment follows a binomial distribution, X ∼ B(z, p), with the probability parameter p as implied by equation 2.4:

p = \frac{\lambda_1}{\lambda_1 + \lambda_2} \qquad (2.5)

Our algorithm relies on the Poisson assumption to assess the significance level of candidate peaks using a one-sided binomial test, i.e. by evaluating the upper tail of the cumulative distribution function (CDF) of the binomial distribution for the experiment read count x and joint read count z:

P(X \geq \lfloor x \rfloor) = \sum_{k=\lfloor x \rfloor}^{z} \binom{z}{k} \cdot p^{k} \cdot (1 - p)^{z-k}

The floor function ⌊ ⌋ is applied to x to accommodate, in a conservative way, weighted read counts as e.g. used for reads mapping to multiple positions of the reference genome.
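In code, the test amounts to an upper-tail sum of the binomial probability mass function; the read counts and the probability parameter below are made up for illustration.

```python
from math import comb

def binomial_upper_tail(x_exp, z_joint, p):
    """P(X >= floor(x_exp)) for X ~ B(z_joint, p): one-sided test of the
    experiment read count against the joint experiment+control count."""
    x = int(x_exp)
    return sum(comb(z_joint, k) * p ** k * (1 - p) ** (z_joint - k)
               for k in range(x, z_joint + 1))

# Hypothetical candidate region: 40 experiment reads, 15 control reads,
# equal effective sequencing depth (p = 0.5).
print(binomial_upper_tail(40, 55, 0.5))
```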

Normalization of experiment and control is therefore defined by the calculation of an appropriate value for the probability parameter p of the binomial distribution. To obtain an estimate p̂ of the probability parameter, we define a sequencing depth scaling factor δ = λ_1/λ_2. By equation 2.5, the relation between the depth scaling factor and the probability parameter is

p = \frac{\delta}{\delta + 1}, \qquad \delta = \frac{p}{1 - p} \qquad (2.6)

Chromosomal, mitochondrial and plastid sequences may be present at considerably different numbers of copies in the cell, implying potentially different depths of coverage for these sequences. The probability parameter is therefore estimated independently for each chromosome or organelle sequence. The sequence is subdivided into fixed-size non-overlapping bins. The normalization algorithm then chooses δ̂ such that the median of the read counts associated with the bins for the experiment equals the median of the control sample bins multiplied by δ̂. This estimate is then used to replace δ in equation 2.6 to obtain a probability parameter for the binomial tests on the respective chromosome or organelle sequence.
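A compact sketch of this per-chromosome normalization, with invented bin counts, is shown below; how bins and counts are represented in SHORE is not implied by the example.

```python
from statistics import median

def estimate_p(exp_bin_counts, ctrl_bin_counts):
    """Per-chromosome normalization: choose delta so that the median
    experiment bin count equals delta times the median control bin count,
    then convert to the binomial probability parameter (eq. 2.6)."""
    delta = median(exp_bin_counts) / median(ctrl_bin_counts)
    return delta / (delta + 1.0)

# Hypothetical fixed-size bin counts along one chromosome; the median is
# robust against the single enriched bin.
exp_bins = [30, 28, 35, 31, 290, 29, 33, 30]
ctrl_bins = [14, 15, 16, 15, 17, 14, 16, 15]
print(round(estimate_p(exp_bins, ctrl_bins), 3))
```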

While detection of candidate regions is applied to the joint data of all experiments, the final assessment of peak significance is carried out separately for each replicate data set. This joint detection, separate evaluation approach ensures that replicate results are immediately comparable without requiring peak overlap calculations involving arbitrary thresholds. At the same time, the significance levels obtained still represent independent entities.

To account for the potentially large number of binomial tests performed, we finally process the calculated p-values to obtain false discovery rates (FDR) using Benjamini-Hochberg estimation [143].

Finally, to allow fine-grained custom filtering of peak lists, a variety of further peak properties is reported by our program, such as peak shift, peak height and fold enrichment ratio. Fold change values between sample and control are reported as a score F on the interval [−1, 1], using the transformation F = (4/π) · arctan(f) − 1 applied to equation 2.1.
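Both post-processing steps can be sketched as follows; the step-up adjustment shown is the standard Benjamini-Hochberg procedure (the text above only cites it), and the p-values and fold change value are invented.

```python
from math import atan, pi

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR: q_(i) = min over j >= i of p_(j) * n / j."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    q = [0.0] * n
    running = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running = min(running, pvalues[i] * n / rank)
        q[i] = running
    return q

def fold_change_score(f):
    """Map a fold change f >= 0 onto [-1, 1) via F = (4/pi)*arctan(f) - 1."""
    return 4.0 / pi * atan(f) - 1.0

print([round(v, 3) for v in benjamini_hochberg([0.001, 0.01, 0.02, 0.8])])
print(round(fold_change_score(2.5), 2))
```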

2.7.6 Experimental Relevance of Peak Significance

Besides true DNA binding affinity of the protein of interest and the artifact patterns described above, further sources of bias may contribute to sequence enrichment that are in some cases indistinguishable in the ChIP-Seq results.

For example, between-sample variation in the strength of sequence specific sampling bias might potentially be introduced by slight differences in sample preparation parameters (section 1.4). Modeling varying sequencing depth as a scaling factor that globally applies to entire chromosomes then has the consequence that the definition of enrichment of a sequence in a two-sample comparison may include positions with above-average sampling bias, if the specific bias is pronounced more strongly in the experiment than in the control, as well as positions with below-average bias where the bias in the experiment is weaker than in the control. Such differences in sampling bias strength can likely not be estimated globally, as the effects of multiple sources of bias might be superimposed. Adequate sampling bias estimation and correction can therefore be expected to require complex modeling based on specific sequence features, and likely further investigation into sequencing and sample preparation processes. Strand-specific effects may however be ruled out by assessment of properties like peak shape, peak shift or forward-reverse fold change. Furthermore, such technical sources of enrichment can sometimes be excluded by downstream binding motif analysis.

Moreover, correctly identified sites with above-average protein-DNA affinity might be biologically irrelevant. With multiple co-factors contributing to transcription initiation, evolutionary forces restricting the genome-wide development of individual binding sites are likely weak. Relevant sites should therefore be defined by the presence of further promoter elements. Furthermore, both transcription factor binding and the ChIP procedure itself are governed by complex physico-chemical dynamics. DNA fragments with above-average affinity to the protein of interest might become detectable in ChIP-Seq even where the induced binding is not sufficiently stable to be of transcription-regulatory relevance. The significance of such sites is however solely determined by sufficient depth of sequencing.

Further biological factors such as chromatin state and accessibility may contribute to significant enrichment not directly related to protein binding events. For these reasons, statistical significance of enrichment can on its own not provide a criterion for the definition of biologically relevant binding sites.

Furthermore, the interpretation of the ranking imposed by binding site significance involves intricate considerations. Signal resulting from several different influences becomes convoluted through the ChIP procedure. The quality of the employed antibody, the overall quality of the precipitation procedure, the prevalence of the DNA-protein interaction across tissues as well as the actual affinity of the protein of interest to its target motifs all contribute to the overall signal observed. The first two effects globally affect the ratio of enriched sites to the magnitude of background coverage, and therefore impair the ability to observe global discrepancies in binding affinity between samples. By contrast, tissue composition and binding affinity are both locus specific effects that are indistinguishable during analysis, i.e. the procedure does not allow distinguishing between the strength of the protein-DNA interaction and its predominance across the cells and tissue types from which the sample was retrieved. The large amount of tissue required for immunoprecipitation protocols likely contributes to a lack of controllability of the analyzed tissue states and composition, which may constitute a factor impeding experimental reproducibility.

2.8 Visualization of Sequencing Read and Alignment Data

Exploration and iterative analysis of high-throughput sequencing data require computational tools able to rapidly retrieve and visualize certain subsets or features of a data set. After obtaining raw or filtered read data, assessment of base call quality, nucleotide composition and read length distribution is desirable to verify overall data quality prior to further analysis (section 1.5.1). At later stages of data analysis, visual inspection of read alignment data can be a valuable tool for the verification of loci of identified mutation candidates (section 1.3), or for understanding the structure of complex genomic regions of clustered polymorphisms or rearrangements. For other applications such as mRNA-, small RNA- or ChIP-Seq, examination of local depth of coverage curves is an adequate approach to assess the structure or expression of individual transcripts, or to rule out a potential mapping artifact. Finally, different approaches to data visualization are required for gaining cues towards potential megabase scale or genome-wide characteristics of the sequencing depth distribution in applications targeting e.g. epigenomic modifications.

While web browser or graphical user interface driven applications allow fully interactive navigation of data sets, they may require considerable effort in data preparation or may not be easily operated remotely over a network.

The SHORE pipeline implements a variety of means for ad-hoc sequencing data visualization. The SHORE tagstats utility offers a summarized view of base quality distributions, per-base nucleotide composition as well as read length distribution. We further provide an application SHORE mapdisplay capable of rapidly generating comprehensible views of read alignment data, either as a text-based representation viewable in a terminal window, or in the form of static scalable vector graphics (SVG) images readily displayed in modern web browsers. In addition, the versatile utilities SHORE coverage and SHORE count provide options for visualizing local depth of coverage as well as the larger-scale genomic distribution of read mappings, respectively.

2.8.1 Gathering Run Quality and Sequence Composition Statistics

The SHORE tagstats utility summarizes the content of unique read sequences in a data set, and additionally gathers base quality and per-base nucleotide statistics. Statistics are collected in various tables and may additionally be visualized in the form of an all-in-one figure generated as a static Portable Document Format (PDF) image (figure 2.7).

Read quality distribution is represented as a per-base box plot, with indicators for median values, second and third quartile as well as the range included by a single standard deviation distance from the mean. Median and quartile values are corrected for quantization bias (section 2.8.4).

The total number of reads extending beyond a certain length is indicated by the support curve. The figure further includes the frequency of each nucleotide at each cycle of the sequencing run. Nucleotide frequencies are displayed multiplied by four to allow lineup with the support curve. In the presence of sequencing errors and varying read length, uniqueness of read sequences cannot serve as a means of removing duplicate sequences arising due to PCR amplification. Moreover, the likelihood of reads originating from duplicates having the same nucleotide sequence decreases with increasing read length (sections 1.4, 2.7.2). Frequency and nucleotide composition of unique read sequences can however still provide valid clues on ChIP-Seq library quality and complexity, and reveal important sample properties in small RNA sequencing applications. Thus, support and nucleotide composition after pruning of redundant read sequences are indicated by a corresponding dashed line for each property.

[Figure 2.7 plots quality score and nucleotide count against sequencing cycle [bp], showing median base quality, inner quality quartiles, average quality ±SD, the support curve, the per-cycle counts of A, C, G and T (multiplied by four) and the read length distribution, with dashed lines indicating unique sequences.]

Figure 2.7: Read and Quality Statistics for a Paired-End Sequencing Run Following Quality-Based Read Trimming

Finally, distribution of read lengths is dicult to pick up visually from support distributions<br />

alone. For improved appraisal, the distributions are therefore additionally indicated by short<br />

vertical bars at the bottom of the gure.<br />

While the presented superimposition of quality <strong>and</strong> nucleotide composition curves may in<br />

some cases serve to obfuscate the interpretability of either individual property, it renders multivariate<br />

entities such as correlations between read quality issues <strong>and</strong> skews in nucleotide composition<br />

or the eectiveness of read trimming algorithms visually immediately appraisable.
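The per-cycle bookkeeping behind such a figure can be sketched as follows. This is an illustrative sketch only, not the tagstats implementation; the record layout, the assumption of Phred+33 quality encoding and the function name are choices made for the example.

    #include <array>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct cycle_stats
    {
        std::array<std::uint64_t, 4> nt_count{};  // A, C, G, T counts at this cycle
        std::vector<std::uint8_t>    qualities;   // kept for median/quartile estimation
        std::uint64_t                support = 0; // number of reads reaching this cycle
    };

    void accumulate(std::vector<cycle_stats>& per_cycle,
                    const std::string& seq, const std::string& qual)
    {
        if(per_cycle.size() < seq.size())
            per_cycle.resize(seq.size());
        for(std::size_t cycle = 0; cycle < seq.size(); ++cycle)
        {
            cycle_stats& c = per_cycle[cycle];
            ++c.support;
            switch(seq[cycle])
            {
                case 'A': ++c.nt_count[0]; break;
                case 'C': ++c.nt_count[1]; break;
                case 'G': ++c.nt_count[2]; break;
                case 'T': ++c.nt_count[3]; break;
                default:  break;                  // N and other IUPAC codes ignored here
            }
            c.qualities.push_back(static_cast<std::uint8_t>(qual[cycle]) - 33); // Phred+33 assumed
        }
    }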



[Figure 2.8: Vertical Scrolling Text Mode Alignment View of SHORE mapdisplay. The view lists reference position and base (pos|refbase) alongside the aligned read bases for an approximately 80 bp window; lower case marks reverse strand alignments, gaps appear as dashes, and 5'/3' markers indicate read ends.]



[Figure 2.9: Vector Graphics Alignment View of SHORE mapdisplay. The reference sequence for chromosome 2, positions 1,018,241–1,018,461, is shown above the read mappings; only mismatching, inserted or deleted bases are printed.]

2.8.2 Visualization of Read Mapping Data

By utilizing SHORE's generic indexed query mechanisms (section 2.2.3), the SHORE mapdisplay program is able to rapidly retrieve all read mappings associated with a user-specified genomic region. By default, the program generates a comprehensible, colored ASCII/ANSI text representation of the read alignments, which is presented as a scrollable terminal view using the Unix file pager utility less. For technical reasons, read mappings are displayed in a vertical layout (figure 2.8).

Insertions are represented as additional positions in the alignment view. Analogous to the SHORE alignment string format (section 2.2.4), mismatching positions in the alignment are displayed with the reference base on the left and the query base on the right side. When a more compact view is required, display of the reference nucleotide is omitted. To allow distinguishing between reads mapping to the forward or to the reverse strand of the reference sequence, reverse strand alignments are printed in lower case and forward strand sequences in upper case letters, and 3′ and 5′ ends are indicated by corresponding colored end markers. To allow identification of individual reads in the alignment view, inclusion of further information such as read identifiers and paired-end sequencing information may be requested on program startup.

The text mode mapping viewer provides a rapid, detailed view of read mapping data without interrupting workflows. It is, however, less suited to obtaining a medium- to larger-scale overview of complex genomic regions. Therefore, we additionally provide the possibility of generating an alternative alignment view as an SVG image (figure 2.9).

The generated SVG files are easily and efficiently rendered in a web browser. In contrast to the text view, read sequences are omitted in the SVG view except for mismatching alignment positions, to enable improved rendering performance. Mismatching alignment positions as well as insertions and deletions are color coded, and their sequence is included for direct examination at close zoom levels. Directionality of read mappings is indicated both by the shape and the color of the corresponding horizontal bars.



[Figure 2.10: Segment-Wise Genome-Wide Box Plot. Depth of coverage (0–25 reads) is plotted against genomic position across chromosomes 2 and 3; shown are the median depth of coverage, the inner depth quartiles and the average depth ±SD per segment.]

As layout locations for the read mappings are generally assigned by a greedy algorithm that picks the lowest available y coordinate as soon as an alignment is encountered, depth of coverage can be hard to judge from the read alignments alone. Therefore, the SHORE mapdisplay program generates an overlay of depth of coverage, represented as vertical bars, on top of the horizontal bars of the read mappings. In addition, the SVG view allows retrieval of exact depth of coverage information for each position, and of detailed alignment information for each of the reads displayed, in the form of mouse-activated tool tips.
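The greedy row assignment just described can be pictured with the following sketch; it is illustrative only and not taken from the SHORE source, with struct and function names invented for the example.

    #include <cstdint>
    #include <vector>

    struct mapping { std::int64_t start, end; };   // half-open interval [start, end)

    // Returns one row index (y coordinate) per mapping; input must be sorted by start.
    std::vector<std::size_t> assign_rows(const std::vector<mapping>& mappings)
    {
        std::vector<std::size_t> rows;
        std::vector<std::int64_t> row_end;          // rightmost occupied coordinate per row
        for(const mapping& m : mappings)
        {
            std::size_t y = 0;
            while(y < row_end.size() && row_end[y] > m.start)
                ++y;                                // first row that is free at m.start
            if(y == row_end.size())
                row_end.push_back(m.end);           // no free row: open a new one
            else
                row_end[y] = m.end;
            rows.push_back(y);
        }
        return rows;
    }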

2.8.3 Visualization of Local or Genome-Scale Depth of Coverage

SHORE coverage is a multi-purpose tool for depth of coverage calculation and depth-based segmentation (section 2.7). The complementary SHORE count utility gathers a wide variety of segment-related statistics. Like most SHORE modules, both programs are capable of rapidly retrieving and processing alignment information associated with defined regions of the genome.

SHORE coverage can be utilized to obtain visual representations of local depth of coverage (figure 2.6) or of kmer-phasing for small RNA data analysis (not shown). Per-base rendering of depth of coverage is, however, inappropriate for the examination of larger-scale distributions of sequencing depth. Visualization capability for such types of analysis is integrated with the SHORE count program. The software collects segment-associated statistics for either user-provided or fixed-size jumping window segments. The segment data collected may be incorporated into a continuous genome- or chromosome-wide box plot (figure 2.10).

In this representation, the median line connects the median depth of coverage values of the segments provided, and is enclosed by boxes indicating the second and third quartiles as well as the single standard deviation range surrounding the average depth of coverage of each segment. Red vertical bars indicate chromosome boundaries. As read depths mostly represent integer values, median and quartile values are adjusted to compensate for quantization (section 2.8.4).
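The fixed-size jumping-window summary can be sketched as follows; the window size and the restriction to the median are illustrative choices made for the example rather than SHORE count defaults.

    #include <algorithm>
    #include <vector>

    struct window_summary { std::size_t begin, end; double median; };

    // Summarize per-position depth values in consecutive, non-overlapping windows.
    std::vector<window_summary> jumping_window_median(const std::vector<int>& depth,
                                                      std::size_t window = 10000)
    {
        std::vector<window_summary> result;
        for(std::size_t begin = 0; begin < depth.size(); begin += window)
        {
            const std::size_t end = std::min(begin + window, depth.size());
            std::vector<int> values(depth.begin() + begin, depth.begin() + end);
            std::nth_element(values.begin(), values.begin() + values.size() / 2, values.end());
            result.push_back({begin, end, static_cast<double>(values[values.size() / 2])});
        }
        return result;
    }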



[Figure 2.11: Quantile Interpolation Applied to Base Qualities and to Depth of Coverage. Panels: quality score (16–40) versus cycle [bp] with median base quality and inner quality quartiles, and depth of coverage (0–10 reads) versus position [bp] on chromosome 3 with median depth of coverage and inner depth quartiles.]

2.8.4 Quantile Correction

As in the case of base quality data or positional depth of coverage, quantile values of integer data distributed over a narrow range only coarsely represent the distribution of interest, due to the small number of different values that such quantiles can assume (figure 2.11, left side).

Given that the integer data in fact represent real-valued entities that have at some point been rounded, it can be assumed that the original value of an integer s follows an unknown distribution supported on the interval [s − 0.5, s + 0.5). For an integer data set of size N, this implies for an arbitrary integer j-th c-quantile q_i that the expected value for the quantile q_r of the original, non-rounded data equals the minimal value to be considered for q_r plus the k-th order statistic of n values distributed according to the unknown distribution. Here, n is the number of values in the integer data set equal to q_i, and k = ⌈(j · N)/c⌉ − u, where u denotes the number of integer data points smaller than the quantile q_i.

Given n i.i.d. samples X_1, ..., X_n, their k-th order statistic is a random variable W_{(n,k)} with probability density function

f_{W_{(n,k)}}(x) = n \cdot \binom{n-1}{k-1} \cdot f_X(x) \cdot F_X(x)^{k-1} \cdot \left(1 - F_X(x)\right)^{n-k} \quad (2.7)



Assuming X is distributed uniformly over the unit interval, the k-th order statistic follows a beta distribution, W_{(n,k)} \sim \mathrm{Beta}(k, n - k + 1), with expected value and variance

E\left(W_{(n,k)}\right) = \frac{k}{n+1} \quad (2.8)

\mathrm{Var}\left(W_{(n,k)}\right) = \frac{k}{(n+1)(n+2)} \cdot \left(1 - \frac{k}{n+1}\right) \quad (2.9)

The assumption of rounded data is reasonable for base quality values and arguable also for observed read count data. Therefore, it can be considered valid to estimate corrected quantile values as

\hat{q}_r = q_i - 0.5 + \frac{k}{n+1} \quad (2.10)

While the assumption of uniformity for the distribution of the original data may be questionable, especially for read depth data with value zero, the adjusted quantiles still provide a considerable improvement for appraisal of the distribution compared to their integer counterparts. Equation 2.7 might be exploited where a better approximation of the actual distribution is available. For simplicity of implementation, we do not calculate the exact values of the \hat{q}_r, but employ a probabilistic algorithm that adds a random real value uniformly sampled from the interval [−0.5, 0.5) to each integer value prior to quantile calculation (figure 2.11, right side).
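A minimal sketch of this probabilistic correction is shown below; it is not the SHORE implementation, and the function name and the nearest-rank quantile convention are assumptions made for the example.

    #include <algorithm>
    #include <random>
    #include <vector>

    // Dequantize integer observations by adding uniform noise in [-0.5, 0.5) and
    // return the empirical p-quantile of the jittered values (data assumed non-empty).
    double corrected_quantile(const std::vector<int>& data, double p, std::mt19937& rng)
    {
        std::uniform_real_distribution<double> jitter(-0.5, 0.5);
        std::vector<double> dequantized;
        dequantized.reserve(data.size());
        for(int s : data)
            dequantized.push_back(s + jitter(rng));

        const std::size_t k = static_cast<std::size_t>(p * (dequantized.size() - 1));
        std::nth_element(dequantized.begin(), dequantized.begin() + k, dequantized.end());
        return dequantized[k];
    }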


3 A C++ Framework for High-Throughput DNA Sequencing

In a large, multi-purpose data analysis software solution, there are inevitably many recurring processing tasks and algorithmic challenges. To avoid large amounts of complex, redundant and error-prone code, the common subproblems must be isolated as reusable modules. With the SHORE DNA sequencing data analysis suite, we have striven to implement frequently required segments of code using consistent design patterns and interfaces. The resulting C++ programming framework libshore constitutes the subject of this chapter.

3.1 Overview

High-throughput sequencing has been adopted as a versatile tool applicable to a wide range of different scientific questions (section 1.3). While data analysis for different types of application must be approached from unique angles (section 1.5), in their elementary building blocks the methods and algorithms often share considerable overlap on many levels.

Reading, parsing and formatted output of sequencing data in standard storage formats is a near universal requirement. Additionally, subsetting data sets by defined properties, achievable through different combinations of simple accept-reject filtering operations, is an often powerful tool for adjusting an analysis' sensitivity-specificity tradeoff. More complex operations rely on the relationship between multiple data set elements, e.g. removal of PCR duplicate sequences from read alignment data, or alter certain properties of the data set elements themselves, and are therefore sensitive to, and potentially useful in, not only various combinations but also various orders of application. Finally, entire analysis processes can be modeled as modules or series of modules transforming the type of the data, for example read alignments into depth of coverage, or read alignments into positional pileups into variant calls.

While isolation and encapsulation of subproblems facilitate correct implementation and comprehensibility of each individual step, the logic required to tie components into end-to-end processing pipelines can itself become extensive. For versatile modes of application, modularized components must therefore be embedded into a more generic framework capable of providing the glue between input, output and data filtering, manipulation and transformation.

The libshore C++ framework code is loosely categorized into ten different packages comprising application framework functionality, generic data processing infrastructure, and sequencing-specific processing modules and data structures. With figure 3.1 we present a package overview in a simplified UML-like representation, illustrating exemplary classes and associations.

The base package comprises elementary low-level functionality and utilities as well as machine- or system-dependent blocks of code. While its functionality is required by most other packages, it is self-contained and does not itself depend on other parts of the library.

The class program from the package of the same name constitutes the base class of all SHORE command line utilities.




[Figure 3.1: libshore Overview. Simplified UML-like package diagram of the framework, showing the base, program, stream, algo, fmtio, container, datatype, processing, parallel and statistics packages with exemplary classes (e.g. av_parser, uncaught_handler, istreams/ostreams, d2tree, twodex, suffix_index, intpack, the format Reader and Writer classes, the source/pipe/sink templates and the binomial and poisson distributions) and their associations.]



The base class combines command line interface definition, documentation and command line parsing through an auxiliary class av_parser, and incorporates an object of class uncaught_handler to provide handling and reporting of fatal error conditions encountered by the application.

Classes istreams and ostreams of the stream package simplify handling of files or other input or output destinations. Interpretation of special location strings or prefixes facilitates specification of special files or other sources or destinations of input and output as option arguments or option default values, e.g. standard input, standard output or Unix pipes. Input or output locations compressed in GZIP, BAM or XZ format are delegated to appropriate helper stream classes for automatic decoding, encoding or handling of random access requests (section 2.2.2). Both classes additionally provide input/output error checking and further stream-related functionality. Class ostreams furthermore implements simplified handling of temporary file destinations as well as optional temporary file compression for improved handling of large amounts of temporary data.

Package algo comprises dynamic programming alignment (section 2.5.2) as well as indexing, sorting and query algorithms and associated persistent data structure implementations (section 2.2.3). The platform for range indexing and queries is provided by a generic 2-d tree algorithm template d2tree, which is further utilized by the class twodex that supplies a persistent index data structure configurable for a variety of genomic data formats. Multiple algorithm templates with the prefix suffix_ implement generic suffix array construction and queries [144, 145]. The suffix array algorithms are employed by objects of class suffix_index, which provides persistent disk indexes for genomic sequences, configurable and capable of answering various types of sequence match queries. To support up to 64-bit data at reasonable space efficiency, persistent suffix_index and twodex index data are encoded and decoded non-byte aligned with the respective exact required bit width, utilizing class intpack of the container package.

The libshore data processing infrastructure is defined and implemented as a set of templates forming the processing package (section 3.2). Package parallel extends this infrastructure to build an abstracted parallel processing interface (section 3.3).

Despite growing acceptance of data exchange specifications such as SAM/BAM for short read mapping data, data format issues can present a major impediment to the application of algorithms to input data from diverse sources. Therefore, the library includes support for a broad variety of sequencing read, mapping and further sequencing- and DNA-related input and output file formats, provided through the fmtio package. Format support is implemented as Reader and Writer classes with a consistent interface modeled after the requirements of the data processing infrastructure. To simplify working with various formats, the library provides the multi-format readers read_reader and alignment_reader for sequencing read and alignment data, which automatically incorporate the adequate Reader objects for each respective input file.

File format parsing and decoding requires an adequate in-memory representation for the elements of the various types of data set, which are defined in the datatype package. Plain data structures read and alignment provide the standard representation of short sequencing read data, whereas large genomic sequences are stored as sequence_record objects with more sophisticated internal memory management. Additionally, processing modules and utilities specific to a certain type of data set element, or to certain properties of data set elements, are also grouped under this package. For example, the class alignment_filter_pipe combines the application of a variety of frequently required filtering and editing modules to read mapping data. Parsing and decoding of the SHORE alignment string pair-wise alignment representation (section 2.2.4) into CIGAR string-equivalent tokens is provided as a class alignment_tokenizer. Class nucleotide provides decoding, encoding and efficient manipulation of IUPAC encoded DNA bases.

The additional statistics package supplies the basis for statistical hypothesis tests based on a variety of distributions. The complementary class mtc re-implements various multiple hypothesis test correction procedures in C++ following the R multtest package implementation [146].



Supported procedures include false discovery rate (FDR) calculation following Benjamini and Hochberg [143] as well as Benjamini and Yekutieli [147], and further family-wise error rate (FWER) control via the Bonferroni [148], Hochberg [149], Holm [150] and Sidak [151] correction algorithms.
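As an illustration of the kind of procedure involved, the Benjamini-Hochberg adjustment can be sketched as follows; this is a generic sketch, not the actual mtc interface, and the function name is an assumption.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Benjamini-Hochberg adjusted p-values: p_(i) * m / i for ascending rank i,
    // with monotonicity enforced from the largest rank downwards.
    std::vector<double> fdr_benjamini_hochberg(const std::vector<double>& p)
    {
        const std::size_t m = p.size();
        std::vector<std::size_t> order(m);
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](std::size_t a, std::size_t b) { return p[a] > p[b]; });

        std::vector<double> adjusted(m);
        double running_min = 1.0;
        for(std::size_t r = 0; r < m; ++r)
        {
            const std::size_t i = order[r];      // r == 0 is the largest p-value
            const std::size_t rank = m - r;      // ascending rank of p[i]
            running_min = std::min(running_min, p[i] * m / rank);
            adjusted[i] = running_min;
        }
        return adjusted;
    }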

3.2 A Modular Signal-Slot Processing Framework

3.2.1 Reader and Writer Concepts

The requirement to support large numbers of different input file formats emphasizes the advantage of consistent interfaces providing access to the data. The libshore framework therefore defines the concept of a Reader as a class providing three methods named has_data, current and next, which allow linear iteration over a data set's elements.

The has_data method returns a boolean value indicating whether further data set elements are available to the respective Reader object. The method current provides access to the element of the data set at the current position of the iteration, and next indicates that the current element has been processed and may be discarded. Initialization of the following element for retrieval by current may be handled by either the next or the has_data method.

Conversely, we define the concept of a Writer as a class supporting the methods append and flush. The method append is used to pass the object a data set element for processing, whereas flush serves as a notification that no more data are available for processing, triggering finalizing actions such as the addition of footer sequences for compressed data sets.

Intermediary processing and filtering modules can thus be realized as classes conforming to both the Writer and Reader concepts. The method flush in this case adopts the role of finishing data processing and propagating all remaining data to the object's Reader interface. For example, in a sliding window analysis, output of data will be delayed until enough input data become available to cover the entire width of the window. On receiving a flush request, processing of the remaining elements must however be completed without waiting for further data associated with the sliding window.
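A minimal sketch of a module conforming to both concepts is shown below; the record type and the pass-through filter logic are assumptions made for the example and are not part of libshore.

    #include <deque>
    #include <string>

    struct record { std::string seq; };

    // A trivial intermediary module: Writer interface on the input side,
    // Reader interface on the output side.
    class pass_filter
    {
    public:
        // Writer concept
        void append(const record& r) { if(!r.seq.empty()) buffer_.push_back(r); }
        void flush() { }                      // nothing held back in this simple sketch

        // Reader concept
        bool has_data() const { return !buffer_.empty(); }
        const record& current() const { return buffer_.front(); }
        void next() { buffer_.pop_front(); }

    private:
        std::deque<record> buffer_;
    };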

3.2.2 Definition of Processing Network Topology

With the Reader and Writer concepts, propagation of data through a network of processing modules might be accomplished through a series of nested processing loops (listing 3.1). The example illustrates propagation of data read from a single source of input through two cascaded filtering operations. Processing results are to be written to the application's standard output stream, whereas intermediate results following the first step of filtering are additionally recorded as a separate output file. The first set of loops (lines 8–27) constitutes the actual propagation of input data. Read mapping data provided by the reader object are appended to the first filtering module (line 10). As a result, the filter object may or may not produce an arbitrary amount of data (e.g. due to sliding window based removal of PCR duplicated sequences), which is passed to the intermediate file writer as well as to the second filter object (lines 12–24). Data produced by the second filter object must accordingly be transmitted to the final output through a further nested loop (lines 17–21). Additional sets of nested loops correspond to the propagation of remaining data following appropriately stacked flush requests to each of the modules (lines 30–56).



 1  alignment_reader r("in.bam");
    alignment_filter f1(filter1_config);
    alignment_filter f2(filter2_config);
    maplist_writer w1("filtered1.maplist");
 5  maplist_writer w2("stdout");

    // Process data.
    while(r.has_data())
    {
10      f1.append(r.current());

        while(f1.has_data())
        {
            w1.append(f1.current());
15          f2.append(f1.current());

            while(f2.has_data())
            {
                w2.append(f2.current());
20              f2.next();
            }

            f1.next();
        }
25
        r.next();
    }

    // Finish.
30  f1.flush();

    while(f1.has_data())
    {
        w1.append(f1.current());
35      f2.append(f1.current());

        while(f2.has_data())
        {
            w2.append(f2.current());
40          f2.next();
        }

        f1.next();
    }
45
    w1.flush();

    f2.flush();

50  while(f2.has_data())
    {
        w2.append(f2.current());
        f2.next();
    }
55
    w2.flush();

Listing 3.1: Using Processing Modules with Nested Loops



Figure 3.2: Pipeline Connections. Block diagram of a processing pipeline consisting of a Source, a Pipe and a Sink: the data and flush signals of each module are connected to the corresponding slots of the next module downstream, while the freeze and thaw connections run in the opposite direction.

 1  alignment_source r("in.bam");
    alignment_filter_pipe f1(filter1_config);
    alignment_filter_pipe f2(filter2_config);
    maplist_sink w1("filtered1.maplist");
 5  maplist_sink w2("stdout");

    // Connect modules.
    r | f1 | f2 | w2;

10  f1 | w1;

    // Process data and finish.
    r.dump();

Listing 3.2: Using Processing Modules with the libshore Pipeline API

A signal constitutes a function object that may be connected to an arbitrary number of function objects supporting the same types of function parameters (slots). Calling a signal function passes the respective set of function parameters to all connected slots in an unspecified order of invocation.

We thereby define the concept of a Source as a class providing signals named data as well as flush, and further slots labeled freeze and thaw. A Sink concept is conversely defined as a class providing slots named data and flush, as well as signals freeze and thaw. Furthermore, a Pipe amalgamates both the Source and the Sink concept. Pipe processing modules defined in this way immediately forward all elements of output data produced in response to flush requests or to data appended to the data slot to their respective data signal. A processing pipeline can thus be defined as a series of connected Pipe elements terminated by a Source and a Sink at either end, respectively (figure 3.2).
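The signal-slot mechanism itself can be reduced to a few lines of standard C++. The following sketch merely illustrates the principle and is not the libshore implementation; the class name basic_signal and the use of std::function are assumptions made for this example only.

    #include <functional>
    #include <iostream>
    #include <vector>

    // Minimal signal: a callable object that forwards its argument to every
    // connected slot. The order of invocation is left unspecified on purpose.
    template<typename T>
    class basic_signal
    {
    public:
        void connect(std::function<void(const T &)> slot)
        {
            m_slots.push_back(std::move(slot));
        }

        void operator()(const T & value) const
        {
            for(const auto & slot : m_slots)
                slot(value);              // nested, synchronous call into each slot
        }

    private:
        std::vector<std::function<void(const T &)>> m_slots;
    };

    int main()
    {
        basic_signal<int> data;           // plays the role of a module's data signal

        data.connect([](const int & v){ std::cout << "sink 1: " << v << '\n'; });
        data.connect([](const int & v){ std::cout << "sink 2: " << v << '\n'; });

        data(42);                         // both connected slots receive the element
    }

Because each slot is invoked as a nested, synchronous function call, all downstream processing triggered by a signal has completed by the time the signal call returns; the parallelization layer described in section 3.3.2 relies on exactly this property.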

A Source or Sink can be realized using the respective templates source and sink, which are capable of appropriately transforming Reader and Writer objects. A further template pipe is provided for the transformation of objects conforming to both the Reader and Writer concepts. For each file format Reader and Writer class, the library defines corresponding Source and Sink classes by means of the respective template; these are marked by the suffixes _source and _sink.

Source, Pipe and Sink objects can be combined into extensive processing networks through connection of the appropriate signals and slots. To improve the clarity of the definition of such networks, an overload of the pipe operator (|) is provided for establishing pair-wise module connections. The code of the processing example can thereby be simplified to a large extent (listing 3.2). Lines 8 and 10 define the topology of the processing network.



The source template further defines a method dump, which causes all data provided by the incorporated Reader to be passed to the data signal before triggering the flush signal to finalize processing (line 13).
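A possible shape of such a wrapper is sketched below. The names reader_source and word_reader are invented for this illustration and do not correspond to libshore classes; the example only shows how a dump method can drain a Reader into a data callback before signalling flush.

    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // Invented Reader-style class used only for this illustration.
    class word_reader
    {
    public:
        explicit word_reader(std::vector<std::string> words) : m_words(std::move(words)) {}
        bool has_data() const { return m_pos < m_words.size(); }
        const std::string & current() const { return m_words[m_pos]; }
        void next() { ++m_pos; }
    private:
        std::vector<std::string> m_words;
        std::size_t m_pos = 0;
    };

    // Sketch of a Source-like wrapper: dump() drains the Reader into the data
    // callback and finally triggers the flush callback, as described in the text.
    template<typename Reader, typename T>
    class reader_source
    {
    public:
        explicit reader_source(Reader reader) : m_reader(std::move(reader)) {}

        std::function<void(const T &)> data;   // stands in for the data signal
        std::function<void()> flush;           // stands in for the flush signal

        void dump()
        {
            while(m_reader.has_data())
            {
                data(m_reader.current());
                m_reader.next();
            }
            flush();                           // finalize processing downstream
        }

    private:
        Reader m_reader;
    };

    int main()
    {
        reader_source<word_reader, std::string> src(word_reader({"ACGT", "TTGA"}));
        src.data  = [](const std::string & s){ std::cout << s << '\n'; };
        src.flush = []{ std::cout << "(flush)\n"; };
        src.dump();
    }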

For cases where application to entire data sets is not the intended use, modules must provide a means of suspending the flow of processing. Classes implementing the Reader concept may for example offer configurable standard filtering facilities comprising multiple cascaded processing modules. However, while the Reader concept implies that each data set element is retrievable individually, Pipe elements may produce an arbitrary amount of output in response to input data processing, which would render such modules unsuitable for this type of reuse. We therefore introduce the additional module connections freeze and thaw, serving to suspend and resume processing pipeline data flow, respectively (figure 3.2). These connections run anti-parallel to the data and flush signals. As signal invocation corresponds to a series of nested function calls transmitted along the connected modules, the source and pipe templates are able to directly detect suspend and resume requests sent by downstream elements. A provided class template extractor utilizes the implemented suspend facilities to act as a bridge between the Sink and Reader concepts. Thus, extractor objects may be inserted at the end of a line of processing modules to retrieve generated output data individually. A complementary template feed provides bridging between the Writer and Source concepts.
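The bridging idea can be pictured with the following simplified, self-contained sketch. The name pull_buffer is invented for the example; the actual extractor additionally relies on the freeze and thaw connections to suspend upstream processing while buffered elements are being consumed, which is omitted here.

    #include <deque>
    #include <iostream>

    // Sink-like on one side (append), Reader-like on the other
    // (has_data / current / next): a minimal bridge between push- and
    // pull-style processing.
    template<typename T>
    class pull_buffer
    {
    public:
        void append(const T & value) { m_queue.push_back(value); }   // Sink side

        bool has_data() const { return !m_queue.empty(); }           // Reader side
        const T & current() const { return m_queue.front(); }
        void next() { m_queue.pop_front(); }

    private:
        std::deque<T> m_queue;
    };

    int main()
    {
        pull_buffer<int> bridge;
        bridge.append(7);              // pushed by an upstream pipeline
        bridge.append(11);

        while(bridge.has_data())       // consumed individually, Reader-style
        {
            std::cout << bridge.current() << '\n';
            bridge.next();
        }
    }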

3.2.3 Module Implementation

Encapsulation of subproblems as modules implementing both the Writer and Reader concepts comes with the disadvantage that all data generated in response to a single invocation of one of the Writer methods must be kept in an internal buffer for subsequent retrieval via the Reader interface. In the context of a processing pipeline, however, each element of output data can in principle be transmitted directly to downstream modules for further processing, often avoiding the need to implement internal buffering. Direct propagation can be achieved by implementing the Pipe concept directly, thus exposing direct access to the module's data and flush signals. The implementation logic of unbuffered, interruptible processing is however more complex than the buffered equivalent. We therefore provide a base class template pipe_facade enforcing a restricted programming model to support the correct implementation of such modules.

The facade template requires the processing implementation to be distributed across four functions: prepare, append, next and flush. A function emit is further made accessible to subclasses to propagate generated output. The template enforces a single-emit rule by which invocation of the emit base class function is permissible exactly once during each call to any of the four implementation functions. The purpose of each of the four functions is illustrated by example (listing 3.3).

Raw sequencing read input data will typically be ordered by an identifier specific to the respective insert, whereas the order of the individual reads retrieved from the insert is arbitrary. For some purposes, further ordering of the sequencing reads of the same insert by their paired-end sequencing index may however be convenient. The implementation of an appropriate reordering module demonstrates the purpose of each of the four facade functions.

The reordering module collects all reads belonging to the same insert in an internal buffer, and subsequently orders them by the value of their paired-end index. The prepare method serves as an initial notification about the next element that will be passed to the object's append method. In the case of the example module, identifiers are compared to determine whether further reads are available for the insert currently being processed. If the next read does not belong to the current insert, reordering of the reads by paired-end index is triggered. Reads are then transmitted for downstream processing by invocation of emit on the first of the collected reads.



class order_paired_reads_pipe
    :public pipe_facade
{
private:
    // Grants pipe_facade access to private member functions.
    friend class pipeline_core_access;

    // Stores all reads with the same identifier.
    std::deque<read> m_buffer;

    // Orders all reads collected in the buffer by their paired-end index.
    void order_reads_by_index();

    // Tests if a read has no mates.
    static bool is_single_read(const read &);

    void next()
    {
        if(!m_buffer.empty())
        {
            m_buffer.pop_front();

            if(!m_buffer.empty())
                emit(m_buffer.front());
        }
    }

    void prepare(const read & r)
    {
        if(!m_buffer.empty() && (r.id != m_buffer.front().id))
        {
            order_reads_by_index();
            emit(m_buffer.front());
        }
    }

    void append(const read & r)
    {
        if(is_single_read(r))
            emit(r);
        else
            m_buffer.push_back(r);
    }

    void flush()
    {
        if(!m_buffer.empty())
        {
            order_reads_by_index();
            emit(m_buffer.front());
        }
    }
};

Listing 3.3: Usage Example for the pipe_facade Template



In concert with the method next described below, the internal sequencing read buffer thereby gets successively freed for processing of the following insert.

Processing of new input data must be handled by the method append. In the context of the reordering example, single-end and orphan reads, i.e. all reads that for any reason are the only ones for their respective insert, can be immediately transmitted downstream, whereas paired reads are appended to the internal sequencing read buffer for subsequent sorting.

The method next is automatically invoked through the facade class on completing propagation of an element of output data. Subclasses may implement the method with the purpose of discarding data that are no longer required. The emit method may be re-invoked inside the method body. In the example module, this property is exploited to iteratively emit all reads of the current insert and to free the buffer for processing of the following insert.

By means of the division of the implementation among the four processing functions, in concert with the single-emit rule, the facade template is able to ensure correct data propagation and handling of suspend and resume requests triggered through the freeze and thaw slots. Subclasses may however omit the implementation of individual methods not required for the respective processing task.

3.2.4 In-Place Data Manipulation

In the pipeline model, processing modules are presented with individual data set elements that are invalidated as the following element becomes available. As demonstrated by the example (listing 3.3), this implies that operations that rely on a relation between multiple elements must maintain an internal buffer to accumulate the relevant data. Furthermore, to avoid unintended side effects between different branches of the downstream processing setup, modules and data sources must prohibit direct manipulation of transmitted data. Copying of data is therefore also a prerequisite to implementing manipulation of certain properties of the data set's elements. Both restrictions may result in a considerable amount of unnecessary data copying overhead.

To mitigate this issue within the explicit model of data propagation (section 3.2.2), Reader classes are initially implemented following a lower-level concept Basic_Reader, which is defined as a class with a single method next. The method can be used to attempt reading a single data set element into an externally provided buffer, and returns a boolean flag to indicate whether an element could be successfully retrieved. Data read into the provided buffer can then be freely manipulated prior to further processing. The Basic_Reader variant of a class is identified by the prefix basic_. The corresponding Reader classes compliant with the processing infrastructure are generated automatically via a class template monolithic (figure 3.1).
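The following self-contained sketch mimics this calling convention with a trivial in-memory reader; the names are invented and do not correspond to libshore classes. The next method fills an externally provided buffer and returns false once the input is exhausted, so the caller may modify the buffered element freely before passing it on.

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Minimal stand-in for a Basic_Reader-style class: a single next() method
    // that fills an externally provided buffer and reports success.
    class basic_line_reader
    {
    public:
        explicit basic_line_reader(std::vector<std::string> lines)
            : m_lines(std::move(lines)) {}

        bool next(std::string & buffer)
        {
            if(m_pos == m_lines.size())
                return false;               // no further element available
            buffer = m_lines[m_pos++];      // copy element into the caller's buffer
            return true;
        }

    private:
        std::vector<std::string> m_lines;
        std::size_t m_pos = 0;
    };

    int main()
    {
        basic_line_reader reader({"chr1\t100", "chr1\t250"});
        std::string buffer;

        while(reader.next(buffer))          // Basic_Reader calling convention
        {
            buffer += "\t+";                // in-place manipulation before further use
            std::cout << buffer << '\n';
        }
    }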

The processing framework provides a more generic mechanism to avoid data copying overhead. By implementation as a plug-in class, operations with identical input and output data types can be realized using in-place manipulation of an arbitrary number of data set elements, as determined by the class.

For plug-in module support, Reader, Source or Pipe classes inherit from a class template buffer_chain. The template maintains an extensible list of re-assignable buffers for the respective type of data set element associated with the data source. Prior to downstream propagation of data, objects of classes inheriting from a class plugin of appropriate type may be granted access to the list of maintained buffers for manipulating, removing or requesting further buffering of data set elements.

As a simple example, we present the implementation of a plug-in for sort order verification (listing 3.4). Prior to the propagation of the first of the buffered elements, plug-in modules are presented with the internal list of buffers, as well as with a list of unused buffers to which discarded elements may be added.



 1  class sequence_sorting_check
    :public plugin
    {
    private:
 5
        // Generates an exception to indicate data are not sorted
        // according to expectation.
        void sorting_error();

10  public:
        virtual int apply(std::deque & buffers,
                          std::vector & unused,
                          const bool flush)

15      {
            if(buffers.size() < 2)
                return flush ? (PLUGIN_SUCCESS) : (PLUGIN_STARVED);

            if(buffers[1]->sequence < buffers[0]->sequence)
20              sorting_error();

            return PLUGIN_SUCCESS;
        }
    };

Listing 3.4: Plugin Implementation Example

The module's apply method may freely manipulate or discard buffered elements, re-use buffers from the unused buffers list, or allocate further buffers for its data manipulation task. Given that enough data set elements have been buffered for the module to carry out its operation, the status value PLUGIN_SUCCESS is returned; otherwise, PLUGIN_STARVED is returned to cause the buffer_chain object to delay data propagation and collect further data. In the case of the sort order example, the method indicates that at least two elements are required to complete the operation (line 16). Given that this prerequisite is fulfilled, the module proceeds to verify the ordering of the first two elements found in the buffer (line 19).

Support for plug-in modules can readily be combined with the pipe_facade approach (listing 3.5). The source code example presents a simple conversion module for sequencing read data types. To allow a more direct representation of the file structure, sequences read from 454 Standard Flowgram Format (SFF) files are initially stored in a data structure distinct from the regular libshore sequencing read representation. To be used with generic read processing modules, a conversion of the read data is thus required. The append method of the conversion module requests an unused buffer and uses it to perform the data type conversion (lines 21, 23). Given that the plug-in modules succeed, a data set element is transmitted downstream (lines 25, 26). The module's next method invalidates the first of the buffered elements and transmits further data when appropriate. When combined with the buffer_chain class, the method flush of the pipe_facade template must be implemented to finalize the processing of elements that were possibly delayed at the request of plug-in modules (lines 29–33).
request of plug-in modules (lines 2933).



 1  class sff_flatten_pipe
    :public shore::pipe_facade,
     public buffer_chain
    {
 5  private:

        // Grants pipe_facade access to private member functions.
        friend class shore::pipeline_core_access;

10      // Performs the conversion from sff_read to read type.
        static void init_read_from_sff(read & r, const sff_read & sff);

        void next()
        {
15          if(buffer_chain_pop())
                emit(buffer_chain_front());
        }

        void append(const shore::sff_read & r)
20      {
            shore::read & current = buffer_chain_push();

            init_read_from_sff(current, r);

25          if(buffer_chain_prepare())
                emit(buffer_chain_front());
        }

        void flush()
30      {
            if(buffer_chain_flush(false))
                emit(buffer_chain_front());
        }
    };

Listing 3.5: Usage Example for the buffer_chain Template

3.3 A Simplified Parallelization Interface

While the run time of simple high-throughput sequencing data processing tasks is often primarily determined by the available input and output capacities, computationally more expensive operations such as mapping of reads to a reference genome sequence may require heavy parallelization to complete in an acceptable time span. The correctness of the generally error-prone implementation of parallel algorithms is difficult to assess. We build on the libshore pipeline infrastructure (section 3.2) to realize a parallelization framework that offers a simplified parallel programming model sufficient for typical processing of sequencing data. The framework provides an interface that abstracts from parallelization-related implementation details and is able to seamlessly support both shared and distributed memory modes of operation.



 1  const int NUM_THREADS = 16;
    const int BATCH_SIZE = 100;

    parallelizer par(NUM_THREADS, BATCH_SIZE);
 5  sync syn1;
    sync syn2;

    serial r("in.bam");
    parallel f1(filter1_config);
10  parallel f2(filter2_config);
    serial w1("filtered1.maplist");
    serial w2("stdout");

    // Connect modules.
15  r | par | f1 | f2 | syn2 | w2;
    f1 | syn1 | w1;


    // Process data and finish.
20  dump(r);

Listing 3.6: Parallel Processing Example

3.3.1 Parallelization Modules

Our parallelization library is based primarily on five different C++ class templates (figure 3.1). The class templates parallelizer, sync and desync can be instantiated to provide special parallelization modules for incorporation into a processing pipeline. Instantiated objects constitute special processing modules with identical input and output data type. The respective templates are parametrized on the type of data that is processable by the module. The template parallelizer is responsible for thread creation and management as well as for dispatching data to different threads or processes. Objects instantiated from the class template sync can be incorporated into a parallel pipeline to protect the following pipeline modules from concurrent modification. The third type of parallelization module, desync, allows the transition from pipeline modules protected by a sync object back to further concurrently usable processing modules. For the creation of parallel processing pipelines, the actual processing modules must further be incorporated by means of the class templates parallel or serial to indicate whether the respective module may be used concurrently.

While the basic processing infrastructure only enforces compatibility of the input and output data types of connected modules, the parallelization templates impose additional restrictions on which classes of objects can be connected. Outputs of serial objects may be connected either to other serial objects or to parallelizer or desync objects. Objects instantiated from the parallel template are the only objects capable of receiving input from parallelizer objects. The parallel objects can be connected downstream either to further objects instantiated from the parallel template, or to appropriate sync objects. Finally, desync outputs can only be connected to parallel objects.

Using these templates, the original data processing example (listing 3.2) can be reformulated to support parallel modes of operation (listing 3.6). As indicated by the example, the desired level of parallelism is requested through a parallelizer object (line 4). The parallelizer distributes batches of data collected from the connected reader object to the worker threads associated with it. The two subsequent filtering operations are carried out concurrently, whereas the output operations must be protected by the respective sync objects to prevent garbled output.



Our API abstracts from the actual implementation by which parallelization and synchronization of the program flow is accomplished. The underlying framework currently supports multi-threaded operation as well as distributed memory configurations such as compute clusters utilizing the Message Passing Interface (MPI) standard. To avoid a hard dependency on MPI implementations, support for distributed operation is disabled by default, but may be enabled at compile time without further source code modification.

3.3.2 Parallel Pipeline Architecture

Internally, the automation of parallel processing is accomplished by the introduction of further signal-slot connections between the parallelization modules. The framework's mode of operation is illustrated with a block diagram depicting all possible module connections (figure 3.3(a)). The data Source for the processing pipeline is incorporated by the leftmost serial object. The data and flush signals of the source are connected to a multiplexer object internal to the parallelizer. The parallelizer further incorporates multiple thread objects (multiplicity is indicated by three dots in the lower compartment of the respective block diagram element). Data collected by the multiplexer are dispatched to the threads by means of a condition variable and a simple exchange of buffers.
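The dispatch mechanism can be illustrated schematically with standard C++ threading primitives; the code below is a deliberately simplified stand-in, not the libshore multiplexer, and all names are invented for the example. A producer hands complete batches to a pool of workers through a queue guarded by a mutex and a condition variable, and a done flag plays the role of the final flush.

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Toy batch dispatcher illustrating dispatch via a condition variable.
    int main()
    {
        std::mutex mtx;
        std::condition_variable cv;
        std::queue<std::vector<int>> batches;
        bool done = false;

        auto worker = [&](int id)
        {
            for(;;)
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [&]{ return !batches.empty() || done; });
                if(batches.empty())
                    return;                   // no more work will arrive
                std::vector<int> batch = std::move(batches.front());
                batches.pop();
                lock.unlock();                // process outside the lock

                long sum = 0;
                for(int v : batch)
                    sum += v;
                std::lock_guard<std::mutex> out(mtx);
                std::cout << "worker " << id << " processed batch sum " << sum << '\n';
            }
        };

        std::vector<std::thread> pool;
        for(int i = 0; i < 4; ++i)
            pool.emplace_back(worker, i);

        for(int b = 0; b < 8; ++b)            // producer side: dispatch batches
        {
            {
                std::lock_guard<std::mutex> lock(mtx);
                batches.push(std::vector<int>(100, b));
            }
            cv.notify_one();
        }

        {
            std::lock_guard<std::mutex> lock(mtx);
            done = true;                      // comparable to a flush request
        }
        cv.notify_all();

        for(std::thread & t : pool)
            t.join();
    }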

Pipeline initialization is controlled by an auxiliary connection numthreads offered by all parallelization classes (not shown). On initialization, objects of class parallel create multiple separate processing modules. Each of these modules is associated with one specific thread of the parallelizer, to which it is connected through its data and flush slots.

On the output side, each module incorporated by the parallel object is further connected to a unique track provided by the sync object. All of the tracks' data signals are connected to the same downstream serial object, whereas received flush requests must be integrated by the parent sync object to ensure that the downstream section of the pipeline gets flushed only after processing has completed on all threads.

A sync incorporates a mutex variable which is used to protect the downstream parts of the pipeline from concurrent use. On activation of a track object's data or flush slot, this mutex is acquired by the respective track before conveying the signal to the connected serial object. By the nested function call property of signals passed to connected pipeline elements, it is ensured that all contiguous serial downstream operations are completed by the time the signal call returns. The mutex can thus be released immediately to free the protected modules for use by other threads.
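The essence of this locking discipline can be shown in a few lines of standard C++ (a schematic illustration, not the sync implementation; all names are invented): because the downstream call chain is synchronous, holding a scoped lock for the duration of the call is sufficient.

    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex downstream_mutex;              // protects the serial section

    // Stands in for the serial part of the pipeline; the nested, synchronous
    // call guarantees all downstream work is finished when it returns.
    void serial_downstream(int value)
    {
        std::cout << "writing " << value << '\n';
    }

    // Schematic "track": acquire the mutex, convey the signal, release on return.
    void track_data_slot(int value)
    {
        std::lock_guard<std::mutex> lock(downstream_mutex);
        serial_downstream(value);             // completes before the lock is released
    }

    int main()
    {
        std::vector<std::thread> threads;
        for(int t = 0; t < 4; ++t)
            threads.emplace_back([t]{ track_data_slot(t); });
        for(std::thread & th : threads)
            th.join();
    }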

The desync template is intended to enable the correct resumption of the parallel mode of operation downstream of a section of the processing pipeline controlled by a sync object. To accomplish this, desync objects must provide a decoupling of the downstream processing modules from the upstream serial modules. For this purpose, the module receives locking information from the controlling sync object by means of four additional signal-slot connections. These connections between the sync and the desync object are established through lock_link objects provided by the serial modules composing the enclosed section of the pipeline. A signal postlock is transmitted by the sync object after the mutex variable has been acquired by a thread. In response, the desync object starts to accumulate all data generated by the upstream pipeline in a thread-specific buffer. The signal preunlock is received immediately before the mutex is unlocked by the upstream sync, and causes the thread-specific buffer to be detached from the upstream pipeline input. Finally, a third signal postunlock indicates that the respective thread has just relinquished the sync's mutex variable, and causes the desync to dump all data collected in the buffer to the channel of the downstream pipeline that is associated with the respective thread.



Figure 3.3: Connections between Parallel Processing Objects. (a) Shared-memory operation: a serial Source feeds the parallelizer's internal multiplexer, which dispatches data to per-thread Pipe instances wrapped by parallel objects; per-thread sync tracks serialize access to the downstream serial Sink, and lock_link connections carrying the postlock, preunlock, postunlock and join signals allow a desync object to resume parallel operation with further parallel modules. (b) The corresponding configuration for distributed memory operation, with data and flush signals exchanged between the root process and worker processes.

An auxiliary flush signal in addition conveys the thread-specific flush state across synchronized sections of the processing pipeline.

Finally, a connection join is supported by all parallel processing modules to ensure the correct termination of processing across all threads.

For distributed memory architectures, data are distributed from a single root process to several worker processes. Worker processes must receive the data to be processed concurrently from the root process, and must subsequently return their output data for processing by serial modules (figure 3.3(b)).

Objects of the parallelizer, sync and desync templates initialized in worker processes automatically establish connections for the exchange of data with their respective counterpart in the root process, where each connection is associated with an additional thread in the root process.

To prevent initialization conflicts and overhead, serial objects additionally prevent the creation of their respective incorporated processing module when not running in the root process.



Threads of the worker processes' parallelizer objects operate by requesting data from their respective counterpart in the root process. Failure to retrieve data indicates that processing is complete, and triggers invocation of the thread's flush signal and subsequently thread termination. Otherwise, the received data are transmitted to the subsequent parallel object, identically to threaded operation.

Subsequent sync objects submit received data and flush signals to the corresponding object of the root process. Sections formed by serial objects, however, cause an interruption in the pipelines of the worker processes. Therefore, sync objects must exploit the postlock signal transmitted through the serial modules to trigger a downstream desync object to request the data generated by serial processing from its counterpart in the root process.


4 Closing Remarks

Routine analysis of high-throughput sequencing data not only requires the development of an appropriate analysis strategy for the respective application as well as solving the associated algorithmic challenges, but is further greatly facilitated by embedding the approach into an underlying framework optimized for handling large sequence data sets. With SHORE, we have implemented an end-to-end pipeline providing a powerful and modular analysis environment. Through the tight integration of all required analysis modules, data flow is optimized, minimizing overhead and incompatibilities at the transition points between the various steps of an analysis. Through embedding into the pipeline environment, even supplementary utility algorithms are augmented by readily available complementary functionality such as capable read data filtering.

A variety of analysis modules dedicated to specific high-throughput sequencing applications are already made available with SHORE. With the SHORE peak module, this work introduced support for ChIP-Seq enrichment profiling, offering highly configurable peak detection and filtering mechanisms. With an approach that renders the analysis results of simultaneously processed data sets immediately comparable, drawing conclusions from replicate experiments is in our view greatly facilitated. With the further general-purpose modules SHORE coverage and SHORE count providing flexible depth of coverage-based sequence segmentation and comprehensive gathering of segment-associated statistics, SHORE should prove applicable to a wide range of expression profiling, enrichment or depletion protocols without further adaptation.

With the integrated approach, we were able to introduce a common foundation in the form of the libshore framework. This has allowed us to implement a common data storage solution providing universal support for flexible and effective data compression and accelerated data set queries across all analysis modules. Thereby, we maintain the usefulness of the analysis suite in the face of the vastly increased quantities of data produced by sequencing instruments. With general-purpose modules such as SHORE oligo-match, we further demonstrate that the usability of stand-alone utilities also benefits from the common framework in terms of file format independence, familiar command line interfaces, automatic support for compressed or streaming input and output, and capable parallelization. With the libshore processing framework, we provide a means for designing complex analysis algorithms as a collection of manageable, self-contained modules. The ability to parallelize many such algorithms without further modification has, from our point of view, significant value given the still increasing rate of sequencing data production, and might serve as a basis for the development of further abstracted parallel processing approaches targeted at specific subclasses of sequencing applications. For example, parallel implementations of reference-guided variant calling algorithms require appropriate data partitioning and result merging mechanisms, distributing read mapping data to different processing threads according to their location on the reference sequence. Such mechanisms are potentially intricate if multi-position traits such as the allelic phasing of single nucleotide polymorphisms are to be assessed.

Despite next-generation sequencing technology now having been available for several years, surprisingly little consensus has been reached for most applications regarding optimal analysis algorithms.




With the sample preparation procedure leading up to the sequencing device being application-specific, weakly standardized, involving significant amounts of manual work and thus ultimately highly variable, the development of robust statistical models has proved intricate. Furthermore, strategies for result validation are usually laborious, cost-intensive or have yet to be developed. In many cases, the specificity or sensitivity of predictions can be improved by intersection or union, respectively, of results obtained through partially complementary algorithms. For example, with the SHORE mapflowcell utility, we provide a unified interface that allows the results of a variety of available read mapping algorithms to be readily obtained and combined in consistent and immediately comparable output formats. Implementing further wrapper infrastructure to similarly integrate third-party analysis algorithms for applications such as variation calling or enrichment profiling might be a future direction towards a valuable resource for readily determining individual algorithms' strengths and weaknesses through large-scale comparison. However, the various algorithms available for a specific high-throughput sequencing application often share similar fundamental ideas, and thus suffer from corresponding systematic issues.

The initially rapid pace at which innovations improving throughput, read lengths or error rates were achieved now seems to be diminishing for the current generation of sequencing technologies. However, conceptually different high-throughput sequencing approaches such as nanopore sequencing are being investigated, and might in the medium term account for further drastic changes in the field. With costs reduced by a further order of magnitude and improved reliability and automation of sample preparation, DNA sequencing could find its way into routine medical application. However, other than cost per base, changes in the properties of the data should prove most relevant for scientific application and data analysis.

Sequencing read length has long been perceived as a critical factor for genome assembly and the assessment of genome structure. It seems likely that a technology able to deliver read lengths of multiple tens of kilobases at competitive accuracy and price would bring about a major shift towards multiple whole-genome comparison strategies. In addition to whole-genome topology, such approaches will implicitly yield traits like copy number variation as well as single nucleotide polymorphisms and all other types of localized, small-scale variation, and may possibly render the corresponding reference-based resequencing approaches irrelevant. However, to fully leverage the advantages compared to reference sequence-guided analysis, significant further challenges with respect to data analysis and representation will have to be overcome. Reasonable relations must be devised to capture homology, replacing the linear reference genome coordinate system. Providing a manageable breakdown and visualization of such data will be essential, and will likely require a variety of different approaches depending on the respective focus of the investigation. To assess the functional implications of primary analysis results, gene annotation and annotation transfer algorithms will have to be simplified and improved. Finally, ensuring the immediate comparability of results across different studies should prove challenging.

Technological innovations that could potentially replace or drastically improve the power of current deep sequencing approaches such as whole-genome enrichment or expression profiling are more difficult to envision. Single-molecule sequencing methods, however, may in the future permit amplification- and ligation-free sequencing of samples from small amounts of starting material. Such an elimination of steps of the sample preparation procedure might be followed by a significant reduction of sequence-specific biases. Furthermore, this simplification should present opportunities for the standardization and automation of sequencing library production to further reduce technical variability. With the minimization of the amount of required DNA and tissue, sample composition should correspondingly be rendered more controllable. Spike-in of standard quantities of DNA oligomers seems to have been abandoned by the scientific community, but might in future refined and standardized incarnations provide an additional resource for data normalization and the estimation of measurement variance. Dropping sequencing costs and the simplification of library preparation should further increase the willingness to employ sequencing less sparingly.



While most current studies, due to economic considerations, rely on no more than two replicate experiments, gathering extensive experimental evidence in the form of many replicate data sets could become standard in the future. Combined, these factors may entail improved statistical power from quantitative sequencing experiments, while at the same time facilitating the statistical modeling of the sequencing and sample preparation process.

To this effect, the fundamental approaches to quantitative sequencing data analysis should require rather gradual modification compared to genome structure and variation analysis. Reference sequence-based analysis as implemented in our modules SHORE peak or SHORE count will probably remain the preferred strategy in the near future, although it may become more commonplace to generate appropriate reference genomes de novo along with the quantitative data. A medium-term goal might be to complement the available algorithms for quantitative data with methods that better leverage the statistical power acquired by increasing the number of replicate experiments. Short-term improvements could possibly be achieved by investing in more sophisticated models and algorithms to capture and estimate an increasing number of parameters of library preparation, sequencing and data analysis. However, the characteristics of current data sets may be the result of the superimposition and complex interplay of a variety of effects, and are furthermore subject to constant change as protocols and sequencing chemistry evolve.

With progress with respect to sequencing read length, sequence assembly and automated genome annotation algorithms, reference-based variation detection as implemented by the SHORE consensus or SHORE qVar modules could be superseded by fundamentally different strategies. However, SHORE and libshore may still constitute a valuable framework for the implementation of such future methods. While increasing read length should automatically resolve many of the ambiguities that current analysis algorithms are confronted with, the biological questions to be addressed by future algorithms are likely to gain complexity. They should therefore pose computationally intensive tasks that benefit from application programming frameworks oriented towards the efficient processing of large amounts of biological sequence data.

While the focus of this work was on the technical and algorithmic aspects of obtaining primary analysis results from various types of high-throughput DNA sequencing data sets, a future challenge will lie in moving from the elucidation of isolated properties of the cellular machinery to building consistent theoretical frameworks through the integration of the data made accessible by different sequencing applications and further complementary technologies.


Bibliography<br />

[1] S. Ossowski. Computational Approaches for Next Generation Sequencing <strong>An</strong>alysis <strong>and</strong><br />

MiRNA Target Search. PhD thesis. Wilhelmstr. 32, 72074 Tübingen: Universität Tübingen,<br />

2010.<br />

[2] F. Sanger, S. Nicklen, <strong>and</strong> A. R. Coulson. DNA sequencing with chain-terminating inhibitors.<br />

In: Proc. Natl. Acad. Sci. U.S.A. 74.12 (Dec. 1977), pp. 54635467. doi: 10.<br />

1073/pnas.74.12.5463. pmid: 271968.<br />

[3] J. C. Venter, M. D. Adams, G. G. Sutton, A. R. Kerlavage, H. O. Smith, <strong>and</strong> M.<br />

Hunkapiller. Shotgun sequencing of the human genome. In: Science 280.5369 (June<br />

1998), pp. 15401542. doi: 10.1126/science.280.5369.1540. pmid: 9644018.<br />

[4] T. S. Centre <strong>and</strong> T. W. U. G. S. Center. Toward a complete human genome sequence.<br />

In: Genome Res. 8.11 (Nov. 1998), pp. 10971108. doi: 10.1101/gr.8.11.1097. pmid:<br />

9847074.<br />

[5] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G. Sutton, H. O.<br />

Smith, M. Y<strong>and</strong>ell, C. A. Evans, R. A. Holt, J. D. Gocayne, P. Amanatides, R. M. Ballew,<br />

D. H. Huson, J. R. Wortman, Q. Zhang, C. D. Kodira, X. H. Zheng, L. Chen, M. Skupski,<br />

G. Subramanian, P. D. Thomas, J. Zhang, G. L. Gabor Miklos, C. Nelson, S. Broder, A. G.<br />

Clark, J. Nadeau, V. A. McKusick, N. Zinder, et al. The sequence of the human genome.<br />

In: Science 291.5507 (Feb. 2001), pp. 13041351. doi: 10.1126/science.1058040. pmid:<br />

11181995.<br />

[6] E. S. L<strong>and</strong>er, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon,<br />

K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J. Howl<strong>and</strong>,<br />

L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J. P. Mesirov,<br />

C. Mir<strong>and</strong>a, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C.<br />

Sougnez, et al. Initial sequencing <strong>and</strong> analysis of the human genome. In: Nature 409.6822<br />

(Feb. 2001), pp. 860921. doi: 10.1038/35057062. pmid: 11237011.<br />

[7] M. Frommer, L. E. McDonald, D. S. Millar, C. M. Collis, F. Watt, G. W. Grigg, P. L.<br />

Molloy, <strong>and</strong> C. L. Paul. A genomic sequencing protocol that yields a positive display of<br />

5-methylcytosine residues in individual DNA str<strong>and</strong>s. In: Proc. Natl. Acad. Sci. U.S.A.<br />

89.5 (Mar. 1992), pp. 18271831. doi: 10.1073/pnas.89.5.1827. pmid: 1542678.<br />

[8] J. C. Venter, K. Remington, J. F. Heidelberg, A. L. Halpern, D. Rusch, J. A. Eisen,<br />

D. Wu, I. Paulsen, K. E. Nelson, W. Nelson, D. E. Fouts, S. Levy, A. H. Knap, M. W.<br />

Lomas, K. Nealson, O. White, J. Peterson, J. Homan, R. Parsons, H. Baden-Tillson, C.<br />

Pfannkoch, Y. H. Rogers, <strong>and</strong> H. O. Smith. Environmental genome shotgun sequencing<br />

of the Sargasso Sea. In: Science 304.5667 (Apr. 2004), pp. 6674. doi: 10.1126/science.<br />

1093857. pmid: 15001713.<br />

81


82 BIBLIOGRAPHY<br />

[9] S. Brenner, M. Johnson, J. Bridgham, G. Golda, D. H. Lloyd, D. Johnson, S. Luo, S.<br />

McCurdy, M. Foy, M. Ewan, R. Roth, D. George, S. Eletr, G. Albrecht, E. Vermaas, S. R.<br />

Williams, K. Moon, T. Burcham, M. Pallas, R. B. DuBridge, J. Kirchner, K. Fearon,<br />

J. Mao, <strong>and</strong> K. Corcoran. Gene expression analysis by massively parallel signature sequencing<br />

(MPSS) on microbead arrays. In: Nat. Biotechnol. 18.6 (June 2000), pp. 630<br />

634. doi: 10.1038/76469. pmid: 10835600.<br />

[10] M. Margulies, M. Egholm, W. E. Altman, S. Attiya, J. S. Bader, L. A. Bemben, J.<br />

Berka, M. S. Braverman, Y. J. Chen, Z. Chen, S. B. Dewell, L. Du, J. M. Fierro, X. V.<br />

Gomes, B. C. Godwin, W. He, S. Helgesen, C. H. Ho, C. H. Ho, G. P. Irzyk, S. C. J<strong>and</strong>o,<br />

M. L. Alenquer, T. P. Jarvie, K. B. Jirage, J. B. Kim, J. R. Knight, J. R. Lanza, J. H.<br />

Leamon, S. M. Lefkowitz, M. Lei, et al. Genome sequencing in microfabricated highdensity<br />

picolitre reactors. In: Nature 437.7057 (Sept. 2005), pp. 376380. doi: 10.1038/<br />

nature03959. pmid: 16056220.<br />

[11] D. R. Bentley, S. Balasubramanian, H. P. Swerdlow, G. P. Smith, J. Milton, C. G. Brown,<br />

K. P. Hall, D. J. Evers, C. L. Barnes, H. R. Bignell, J. M. Boutell, J. Bryant, R. J.<br />

Carter, R. Keira Cheetham, A. J. Cox, D. J. Ellis, M. R. Flatbush, N. A. Gormley, S. J.<br />

Humphray, L. J. Irving, M. S. Karbelashvili, S. M. Kirk, H. Li, X. Liu, K. S. Maisinger,<br />

L. J. Murray, B. Obradovic, T. Ost, M. L. Parkinson, M. R. Pratt, et al. Accurate whole<br />

human genome sequencing using reversible terminator chemistry. In: Nature 456.7218<br />

(Nov. 2008), pp. 5359. doi: 10.1038/nature07517. pmid: 18987734.<br />

[12] K. J. McKernan, H. E. Peckham, G. L. Costa, S. F. McLaughlin, Y. Fu, E. F. Tsung, C. R.<br />

Clouser, C. Duncan, J. K. Ichikawa, C. C. Lee, Z. Zhang, S. S. Ranade, E. T. Dimalanta,<br />

F. C. Hyl<strong>and</strong>, T. D. Sokolsky, L. Zhang, A. Sheridan, H. Fu, C. L. Hendrickson, B. Li, L.<br />

Kotler, J. R. Stuart, J. A. Malek, J. M. Manning, A. A. <strong>An</strong>tipova, D. S. Perez, M. P. Moore,<br />

K. C. Hayashibara, M. R. Lyons, R. E. Beaudoin, et al. Sequence <strong>and</strong> structural variation<br />

in a human genome uncovered by short-read, massively parallel ligation sequencing using<br />

two-base encoding. In: Genome Res. 19.9 (Sept. 2009), pp. 15271541. doi: 10.1101/gr.<br />

091868.109. pmid: 19546169.<br />

[13] J. M. Rothberg, W. Hinz, T. M. Rearick, J. Schultz, W. Mileski, M. Davey, J. H. Leamon,<br />

K. Johnson, M. J. Milgrew, M. Edwards, J. Hoon, J. F. Simons, D. Marran, J. W. Myers,<br />

J. F. Davidson, A. Branting, J. R. Nobile, B. P. Puc, D. Light, T. A. Clark, M. Huber,<br />

J. T. Branciforte, I. B. Stoner, S. E. Cawley, M. Lyons, Y. Fu, N. Homer, M. Sedova, X.<br />

Miao, B. Reed, et al. <strong>An</strong> integrated semiconductor device enabling non-optical genome<br />

sequencing. In: Nature 475.7356 (July 2011), pp. 348352. doi: 10.1038/nature10242.<br />

pmid: 21776081.<br />

[14] J. Eid, A. Fehr, J. Gray, K. Luong, J. Lyle, G. Otto, P. Peluso, D. Rank, P. Baybayan,<br />

B. Bettman, A. Bibillo, K. Bjornson, B. Chaudhuri, F. Christians, R. Cicero, S. Clark,<br />

R. Dalal, A. Dewinter, J. Dixon, M. Foquet, A. Gaertner, P. Hardenbol, C. Heiner, K.<br />

Hester, D. Holden, G. Kearns, X. Kong, R. Kuse, Y. Lacroix, S. Lin, et al. Real-time<br />

DNA sequencing from single polymerase molecules. In: Science 323.5910 (Jan. 2009),<br />

pp. 133138. doi: 10.1126/science.1162986. pmid: 19023044.<br />

[15] M. J. Levene, J. Korlach, S. W. Turner, M. Foquet, H. G. Craighead, <strong>and</strong> W. W. Webb.<br />

Zero-mode waveguides for single-molecule analysis at high concentrations. In: Science<br />

299.5607 (Jan. 2003), pp. 682686. doi: 10.1126/science.1079700. pmid: 12560545.<br />

[16] D. Loakes. Survey <strong>and</strong> summary: The applications of universal DNA base analogues.<br />

In: Nucleic Acids Res. 29.12 (June 2001), pp. 24372447. doi: 10.1093/nar/29.12.2437.<br />

pmid: 11410649.


BIBLIOGRAPHY 83<br />

[17] X. Zhang, J. Yazaki, A. Sundaresan, S. Cokus, S. W. Chan, H. Chen, I. R. Henderson,<br />

P. Shinn, M. Pellegrini, S. E. Jacobsen, <strong>and</strong> J. R. Ecker. Genome-wide high-resolution<br />

mapping <strong>and</strong> functional analysis of DNA methylation in arabidopsis. In: Cell 126.6 (Sept.<br />

2006), pp. 11891201. doi: 10.1016/j.cell.2006.08.003. pmid: 16949657.<br />

[18] S. Ossowski, K. Schneeberger, R. M. Clark, C. Lanz, N. Warthmann, <strong>and</strong> D. Weigel.<br />

Sequencing of natural strains of Arabidopsis thaliana with short reads. In: Genome Res.<br />

18.12 (Dec. 2008), pp. 20242033. doi: 10.1101/gr.080200.108. pmid: 18818371.<br />

[19] J. Cao, K. Schneeberger, S. Ossowski, T. Gunther, S. Bender, J. Fitz, D. Koenig, C.<br />

Lanz, O. Stegle, C. Lippert, X. Wang, F. Ott, J. Muller, C. Alonso-Blanco, K. Borgwardt,<br />

K. J. Schmid, <strong>and</strong> D. Weigel. Whole-genome sequencing of multiple Arabidopsis thaliana<br />

populations. In: Nat. Genet. 43.10 (Oct. 2011), pp. 956963. doi: 10.1038/ng.911.<br />

[20] K. Schneeberger, S. Ossowski, C. Lanz, T. Juul, A. H. Petersen, K. L. Nielsen, J. E.<br />

J?rgensen, D. Weigel, <strong>and</strong> S. U. <strong>An</strong>dersen. SHOREmap: simultaneous mapping <strong>and</strong> mutation<br />

identication by deep sequencing. In: Nat. Methods 6.8 (Aug. 2009), pp. 550551.<br />

doi: 10.1038/nmeth0809-550. pmid: 19644454.<br />

[21] K. Schneeberger. Whole genome analysis of Arabidopsis thaliana using Next Generation<br />

Sequencing. PhD thesis. Wilhelmstr. 32, 72074 Tübingen: Universität Tübingen, 2010.<br />

urn: urn:nbn:de:bsz:21-opus-56685.<br />

[22] E. A. Dinsdale, R. A. Edwards, D. Hall, F. <strong>An</strong>gly, M. Breitbart, J. M. Brulc, M. Furlan,<br />

C. Desnues, M. Haynes, L. Li, L. McDaniel, M. A. Moran, K. E. Nelson, C. Nilsson, R.<br />

Olson, J. Paul, B. R. Brito, Y. Ruan, B. K. Swan, R. Stevens, D. L. Valentine, R. V.<br />

Thurber, L. Wegley, B. A. White, <strong>and</strong> F. Rohwer. Functional metagenomic proling of<br />

nine biomes. In: Nature 452.7187 (Apr. 2008), pp. 629632. doi: 10.1038/nature06810.<br />

[23] J. Frias-Lopez, Y. Shi, G. W. Tyson, M. L. Coleman, S. C. Schuster, S. W. Chisholm,<br />

<strong>and</strong> E. F. Delong. Microbial community gene expression in ocean surface waters. In:<br />

Proc. Natl. Acad. Sci. U.S.A. 105.10 (Mar. 2008), pp. 38053810. doi: 10.1073/pnas.<br />

0708897105. pmid: 18316740.<br />

[24] X. Mou, S. Sun, R. A. Edwards, R. E. Hodson, <strong>and</strong> M. A. Moran. Bacterial carbon<br />

processing by generalist species in the coastal ocean. In: Nature 451.7179 (Feb. 2008),<br />

pp. 708711. doi: 10.1038/nature06513. pmid: 18223640.<br />

[25] A. Mortazavi, B. A. Williams, K. McCue, L. Schaeer, <strong>and</strong> B. Wold. Mapping <strong>and</strong><br />

quantifying mammalian transcriptomes by RNA-Seq. In: Nat. Methods 5.7 (July 2008),<br />

pp. 621628. doi: 10.1038/nmeth.1226. pmid: 18516045.<br />

[26] B. T. Wilhelm, S. Marguerat, S. Watt, F. Schubert, V. Wood, I. Goodhead, C. J. Penkett,<br />

J. Rogers, <strong>and</strong> J. Bahler. Dynamic repertoire of a eukaryotic transcriptome surveyed<br />

at single-nucleotide resolution. In: Nature 453.7199 (June 2008), pp. 12391243. doi:<br />

10.1038/nature07002. pmid: 18488015.<br />

[27] R. Lister, R. C. O'Malley, J. Tonti-Filippini, B. D. Gregory, C. C. Berry, A. H. Millar,<br />

<strong>and</strong> J. R. Ecker. Highly integrated single-base resolution maps of the epigenome in Arabidopsis.<br />

In: Cell 133.3 (May 2008), pp. 523536. doi: 10.1016/j.cell.2008.03.029.<br />

pmid: 18423832.<br />

[28] U. Nagalakshmi, Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein, <strong>and</strong> M. Snyder. The<br />

transcriptional l<strong>and</strong>scape of the yeast genome dened by RNA sequencing. In: Science<br />

320.5881 (June 2008), pp. 13441349. doi: 10.1126/science.1158441. pmid: 18451266.


84 BIBLIOGRAPHY<br />

[29] N. Cloonan, A. R. Forrest, G. Kolle, B. B. Gardiner, G. J. Faulkner, M. K. Brown, D. F.<br />

Taylor, A. L. Steptoe, S. Wani, G. Bethel, A. J. Robertson, A. C. Perkins, S. J. Bruce,<br />

C. C. Lee, S. S. Ranade, H. E. Peckham, J. M. Manning, K. J. McKernan, and S. M. Grimmond. Stem cell transcriptome profiling via massive-scale mRNA sequencing. In: Nat. Methods 5.7 (July 2008), pp. 613–619. doi: 10.1038/nmeth.1223. pmid: 18516046.

[30] C. Lu, K. Kulkarni, F. F. Souret, R. MuthuValliappan, S. S. Tej, R. S. Poethig, I. R. Henderson, S. E. Jacobsen, W. Wang, P. J. Green, and B. C. Meyers. MicroRNAs and other small RNAs enriched in the Arabidopsis RNA-dependent RNA polymerase-2 mutant. In: Genome Res. 16.10 (Oct. 2006), pp. 1276–1288. doi: 10.1101/gr.5530106. pmid: 16954541.

[31] J. G. Ruby, C. Jan, C. Player, M. J. Axtell, W. Lee, C. Nusbaum, H. Ge, and D. P. Bartel. Large-scale sequencing reveals 21U-RNAs and additional microRNAs and endogenous siRNAs in C. elegans. In: Cell 127.6 (Dec. 2006), pp. 1193–1207. doi: 10.1016/j.cell.2006.10.040. pmid: 17174894.

[32] R. D. Morin, M. D. O'Connor, M. Griffith, F. Kuchenbauer, A. Delaney, A. L. Prabhu, Y. Zhao, H. McDonald, T. Zeng, M. Hirst, C. J. Eaves, and M. A. Marra. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. In: Genome Res. 18.4 (Apr. 2008), pp. 610–621. doi: 10.1101/gr.7179508. pmid: 18285502.

[33] P. Landgraf, M. Rusu, R. Sheridan, A. Sewer, N. Iovino, A. Aravin, S. Pfeffer, A. Rice, A. O. Kamphorst, M. Landthaler, C. Lin, N. D. Socci, L. Hermida, V. Fulci, S. Chiaretti, R. Foa, J. Schliwka, U. Fuchs, A. Novosel, R. U. Muller, B. Schermer, U. Bissels, J. Inman, Q. Phan, M. Chien, D. B. Weir, R. Choksi, G. De Vita, D. Frezzetti, H. I. Trompeter, et al. A mammalian microRNA expression atlas based on small RNA library sequencing. In: Cell 129.7 (June 2007), pp. 1401–1414. doi: 10.1016/j.cell.2007.04.040. pmid: 17604727.

[34] D. S. Johnson, A. Mortazavi, R. M. Myers, and B. Wold. Genome-wide mapping of in vivo protein-DNA interactions. In: Science 316.5830 (June 2007), pp. 1497–1502. doi: 10.1126/science.1141319. pmid: 17540862.

[35] G. Robertson, M. Hirst, M. Bainbridge, M. Bilenky, Y. Zhao, T. Zeng, G. Euskirchen, B. Bernier, R. Varhol, A. Delaney, N. Thiessen, O. L. Griffith, A. He, M. Marra, M. Snyder, and S. Jones. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. In: Nat. Methods 4.8 (Aug. 2007), pp. 651–657. doi: 10.1038/nmeth1068. pmid: 17558387.

[36] A. Barski, S. Cuddapah, K. Cui, T. Y. Roh, D. E. Schones, Z. Wang, G. Wei, I. Chepelev, and K. Zhao. High-resolution profiling of histone methylations in the human genome. In: Cell 129.4 (May 2007), pp. 823–837. doi: 10.1016/j.cell.2007.05.009. pmid: 17512414.

[37] T. A. Down, V. K. Rakyan, D. J. Turner, P. Flicek, H. Li, E. Kulesha, S. Graf, N. Johnson, J. Herrero, E. M. Tomazou, N. P. Thorne, L. Backdahl, M. Herberth, K. L. Howe, D. K. Jackson, M. M. Miretti, J. C. Marioni, E. Birney, T. J. Hubbard, R. Durbin, S. Tavare, and S. Beck. A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis. In: Nat. Biotechnol. 26.7 (July 2008), pp. 779–785. doi: 10.1038/nbt1414. pmid: 18612301.


[38] D. D. Licatalosi, A. Mele, J. J. Fak, J. Ule, M. Kayikci, S. W. Chi, T. A. Clark, A. C. Schweitzer, J. E. Blume, X. Wang, J. C. Darnell, and R. B. Darnell. HITS-CLIP yields genome-wide insights into brain alternative RNA processing. In: Nature 456.7221 (Nov. 2008), pp. 464–469. doi: 10.1038/nature07488. pmid: 18978773.

[39] D. E. Schones, K. Cui, S. Cuddapah, T. Y. Roh, A. Barski, Z. Wang, G. Wei, and K. Zhao. Dynamic regulation of nucleosome positioning in the human genome. In: Cell 132.5 (Mar. 2008), pp. 887–898. doi: 10.1016/j.cell.2008.02.022. pmid: 18329373.

[40] S. J. Cokus, S. Feng, X. Zhang, Z. Chen, B. Merriman, C. D. Haudenschild, S. Pradhan, S. F. Nelson, M. Pellegrini, and S. E. Jacobsen. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. In: Nature 452.7184 (Mar. 2008), pp. 215–219. doi: 10.1038/nature06745. pmid: 18278030.

[41] A. Meissner, T. S. Mikkelsen, H. Gu, M. Wernig, J. Hanna, A. Sivachenko, X. Zhang, B. E. Bernstein, C. Nusbaum, D. B. Jaffe, A. Gnirke, R. Jaenisch, and E. S. Lander. Genome-scale DNA methylation maps of pluripotent and differentiated cells. In: Nature 454.7205 (Aug. 2008), pp. 766–770. doi: 10.1038/nature07107. pmid: 18600261.

[42] N. A. Baird, P. D. Etter, T. S. Atwood, M. C. Currey, A. L. Shiver, Z. A. Lewis, E. U. Selker, W. A. Cresko, and E. A. Johnson. Rapid SNP discovery and genetic mapping using sequenced RAD markers. In: PLoS ONE 3.10 (2008), e3376. doi: 10.1371/journal.pone.0003376. pmid: 18852878.

[43] E. M. Willing, M. Hoffmann, J. D. Klein, D. Weigel, and C. Dreyer. Paired-end RAD-seq for de novo assembly and marker design without available reference. In: Bioinformatics 27.16 (Aug. 2011), pp. 2187–2193. doi: 10.1093/bioinformatics/btr346. pmid: 21712251.

[44] C. Zong, S. Lu, A. R. Chapman, and X. S. Xie. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. In: Science 338.6114 (Dec. 2012), pp. 1622–1626. doi: 10.1126/science.1229164. pmid: 23258894.

[45] G. R. Abecasis, D. Altshuler, A. Auton, L. D. Brooks, R. M. Durbin, R. A. Gibbs, M. E. Hurles, G. A. McVean, D. Altshuler, R. M. Durbin, G. R. Abecasis, D. R. Bentley, A. Chakravarti, A. G. Clark, F. S. Collins, F. M. De La Vega, P. Donnelly, M. Egholm, P. Flicek, S. B. Gabriel, R. A. Gibbs, B. M. Knoppers, E. S. Lander, H. Lehrach, E. R. Mardis, G. A. McVean, D. A. Nickerson, L. Peltonen, A. J. Schafer, S. T. Sherry, et al. A map of human genome variation from population-scale sequencing. In: Nature 467.7319 (Oct. 2010), pp. 1061–1073. doi: 10.1038/nature09534. pmid: 20981092.

[46] G. R. Abecasis, A. Auton, L. D. Brooks, M. A. DePristo, R. M. Durbin, R. E. Handsaker, H. M. Kang, G. T. Marth, G. A. McVean, D. M. Altshuler, R. M. Durbin, G. R. Abecasis, D. R. Bentley, A. Chakravarti, A. G. Clark, P. Donnelly, E. E. Eichler, P. Flicek, S. B. Gabriel, R. A. Gibbs, E. D. Green, M. E. Hurles, B. M. Knoppers, J. O. Korbel, E. S. Lander, C. Lee, H. Lehrach, E. R. Mardis, G. T. Marth, G. A. McVean, et al. An integrated map of genetic variation from 1,092 human genomes. In: Nature 491.7422 (Nov. 2012), pp. 56–65. doi: 10.1038/nature11632. pmid: 23128226.

[47] D. Haussler, S. J. O'Brien, O. A. Ryder, F. K. Barker, M. Clamp, A. J. Crawford, R. Hanner, O. Hanotte, W. E. Johnson, J. A. McGuire, W. Miller, R. W. Murphy, W. J. Murphy, F. H. Sheldon, B. Sinervo, B. Venkatesh, E. O. Wiley, F. W. Allendorf, G. Amato, C. S. Baker, A. Bauer, A. Beja-Pereira, E. Bermingham, G. Bernardi, C. R. Bonvicino, S. Brenner, T. Burke, J. Cracraft, M. Diekhans, S. Edwards, et al. Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. In: J. Hered. 100.6 (2009), pp. 659–674. doi: 10.1093/jhered/esp086. pmid: 19892720.


[48] B. A. Methe, K. E. Nelson, M. Pop, H. H. Creasy, M. G. Giglio, C. Huttenhower, D. Gevers, J. F. Petrosino, S. Abubucker, J. H. Badger, A. T. Chinwalla, A. M. Earl, M. G. FitzGerald, R. S. Fulton, K. Hallsworth-Pepin, E. A. Lobos, R. Madupu, V. Magrini, J. C. Martin, M. Mitreva, D. M. Muzny, E. J. Sodergren, J. Versalovic, A. M. Wollam, K. C. Worley, J. R. Wortman, S. K. Young, Q. Zeng, K. M. Aagaard, O. O. Abolude, et al. A framework for human microbiome research. In: Nature 486.7402 (June 2012), pp. 215–221. doi: 10.1038/nature11209. pmid: 22699610.

[49] C. Huttenhower, D. Gevers, R. Knight, S. Abubucker, J. H. Badger, A. T. Chinwalla, H. H. Creasy, A. M. Earl, M. G. FitzGerald, R. S. Fulton, M. G. Giglio, K. Hallsworth-Pepin, E. A. Lobos, R. Madupu, V. Magrini, J. C. Martin, M. Mitreva, D. M. Muzny, E. J. Sodergren, J. Versalovic, A. M. Wollam, K. C. Worley, J. R. Wortman, S. K. Young, Q. Zeng, K. M. Aagaard, O. O. Abolude, E. Allen-Vercoe, E. J. Alm, L. Alvarado, et al. Structure, function and diversity of the healthy human microbiome. In: Nature 486.7402 (June 2012), pp. 207–214. doi: 10.1038/nature11234. pmid: 22699609.

[50] P. R. Burton, D. G. Clayton, L. R. Cardon, N. Craddock, P. Deloukas, A. Duncanson, D. P. Kwiatkowski, M. I. McCarthy, W. H. Ouwehand, N. J. Samani, J. A. Todd, P. Donnelly, J. C. Barrett, P. R. Burton, D. Davison, P. Donnelly, D. Easton, D. Evans, H. T. Leung, J. L. Marchini, A. P. Morris, C. C. Spencer, M. D. Tobin, L. R. Cardon, D. G. Clayton, A. P. Attwood, J. P. Boorman, B. Cant, U. Everson, J. M. Hussey, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. In: Nature 447.7145 (June 2007), pp. 661–678. doi: 10.1038/nature05911. pmid: 17554300.

[51] R. W. Carthew and E. J. Sontheimer. Origins and Mechanisms of miRNAs and siRNAs. In: Cell 136.4 (Feb. 2009), pp. 642–655. doi: 10.1016/j.cell.2009.01.035. pmid: 19239886.

[52] O. Voinnet. Origin, biogenesis, and activity of plant microRNAs. In: Cell 136.4 (Feb. 2009), pp. 669–687. doi: 10.1016/j.cell.2009.01.046. pmid: 19239888.

[53] S. E. Linsen, E. de Wit, G. Janssens, S. Heater, L. Chapman, R. K. Parkin, B. Fritz, S. K. Wyman, E. de Bruijn, E. E. Voest, S. Kuersten, M. Tewari, and E. Cuppen. Limitations and possibilities of small RNA digital gene expression profiling. In: Nat. Methods 6.7 (July 2009), pp. 474–476. doi: 10.1038/nmeth0709-474. pmid: 19564845.

[54] S. U. Meyer, M. W. Pfaffl, and S. E. Ulbrich. Normalization strategies for microRNA profiling experiments: a 'normal' way to a hidden layer of complexity? In: Biotechnol. Lett. 32.12 (Dec. 2010), pp. 1777–1788. doi: 10.1007/s10529-010-0380-z. pmid: 20703800.

[55] J. D. Hollister, L. M. Smith, Y. L. Guo, F. Ott, D. Weigel, and B. S. Gaut. Transposable elements and small RNAs contribute to gene expression divergence between Arabidopsis thaliana and Arabidopsis lyrata. In: Proc. Natl. Acad. Sci. U.S.A. 108.6 (Feb. 2011), pp. 2322–2327. doi: 10.1073/pnas.1018222108.

[56] S. Barker, M. Weinfeld, and D. Murray. DNA-protein crosslinks: their induction, repair, and biological consequences. In: Mutat. Res. 589.2 (Mar. 2005), pp. 111–135. doi: 10.1016/j.mrrev.2004.11.003. pmid: 15795165.

[57] A. Valouev, J. Ichikawa, T. Tonthat, J. Stuart, S. Ranade, H. Peckham, K. Zeng, J. A. Malek, G. Costa, K. McKernan, A. Sidow, A. Fire, and S. M. Johnson. A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. In: Genome Res. 18.7 (July 2008), pp. 1051–1063. doi: 10.1101/gr.076463.108. pmid: 18477713.


[58] L. Yant, J. Mathieu, T. T. Dinh, F. Ott, C. Lanz, H. Wollmann, X. Chen, and M. Schmid. Orchestration of the floral transition and floral development in Arabidopsis by the bifunctional transcription factor APETALA2. In: Plant Cell 22.7 (July 2010), pp. 2156–2170. doi: 10.1105/tpc.110.075606.

[59] M. Hafner, M. Landthaler, L. Burger, M. Khorshid, J. Hausser, P. Berninger, A. Rothballer, M. Ascano, A. C. Jungkamp, M. Munschauer, A. Ulrich, G. S. Wardle, S. Dewell, M. Zavolan, and T. Tuschl. Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. In: Cell 141.1 (Apr. 2010), pp. 129–141. doi: 10.1016/j.cell.2010.03.009. pmid: 20371350.

[60] M. Muers. Technology: Getting Moore from DNA sequencing. In: Nat. Rev. Genet. 12.9 (Sept. 2011), p. 586. doi: 10.1038/nrg3059. pmid: 21808262.

[61] G. E. Moore. Cramming More Components onto Integrated Circuits. In: Electronics 38.8 (Apr. 19, 1965), pp. 114–117. doi: 10.1109/JPROC.1998.658762.

[62] R. F. Service. Gene sequencing. The race for the $1000 genome. In: Science 311.5767 (Mar. 2006), pp. 1544–1546. doi: 10.1126/science.311.5767.1544. pmid: 16543431.

[63] C. Ledergerber and C. Dessimoz. Base-calling for next-generation sequencing platforms. In: Brief. Bioinformatics 12.5 (Sept. 2011), pp. 489–497. doi: 10.1093/bib/bbq077. pmid: 21245079.

[64] D. Aird, M. G. Ross, W. S. Chen, M. Danielsson, T. Fennell, C. Russ, D. B. Jaffe, C. Nusbaum, and A. Gnirke. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. In: Genome Biol. 12.2 (2011), R18. doi: 10.1186/gb-2011-12-2-r18. pmid: 21338519.

[65] B. Ewing, L. Hillier, M. C. Wendl, and P. Green. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. In: Genome Res. 8.3 (Mar. 1998), pp. 175–185. doi: 10.1101/gr.8.3.175. pmid: 9521921.

[66] B. Ewing and P. Green. Base-calling of automated sequencer traces using phred. II. Error probabilities. In: Genome Res. 8.3 (Mar. 1998), pp. 186–194. doi: 10.1101/gr.8.3.186. pmid: 9521922.

[67] M. Kircher, U. Stenzel, and J. Kelso. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. In: Genome Biol. 10.8 (2009), R83. doi: 10.1186/gb-2009-10-8-r83. pmid: 19682367.

[68] W. C. Kao, K. Stevens, and Y. S. Song. BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. In: Genome Res. 19.10 (Oct. 2009), pp. 1884–1895. doi: 10.1101/gr.095299.109. pmid: 19661376.

[69] J. Rougemont, A. Amzallag, C. Iseli, L. Farinelli, I. Xenarios, and F. Naef. Probabilistic base calling of Solexa sequencing data. In: BMC Bioinformatics 9 (2008), p. 431. doi: 10.1186/1471-2105-9-431. pmid: 18851737.

[70] Y. Erlich, P. P. Mitra, M. delaBastide, W. R. McCombie, and G. J. Hannon. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. In: Nat. Methods 5.8 (Aug. 2008), pp. 679–682. doi: 10.1038/nmeth.1230. pmid: 18604217.

[71] A. R. Quinlan, D. A. Stewart, M. P. Stromberg, and G. T. Marth. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. In: Nat. Methods 5.2 (Feb. 2008), pp. 179–181. doi: 10.1038/nmeth.1172. pmid: 18193056.


[72] S. Andrews. FastQC: A quality control tool for high throughput sequence data. May 2012. WebCite: 68W7OHlEL. url: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (visited Feb. 27, 2013).

[73] A. Gordon. FASTX-Toolkit: FASTQ/A short-reads pre-processing tools. Feb. 2010. WebCite: 5zWQ6Eqh6. url: http://hannonlab.cshl.edu/fastx_toolkit/index.html (visited Feb. 27, 2013).

[74] T. Lassmann, Y. Hayashizaki, and C. O. Daub. TagDust – a program to eliminate artifacts from next generation sequencing data. In: Bioinformatics 25.21 (Nov. 2009), pp. 2839–2840. doi: 10.1093/bioinformatics/btp527. pmid: 19737799.

[75] R. Schmieder, Y. W. Lim, F. Rohwer, and R. Edwards. TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets. In: BMC Bioinformatics 11 (2010), p. 341. doi: 10.1186/1471-2105-11-341. pmid: 20573248.

[76] R. K. Patel and M. Jain. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. In: PLoS ONE 7.2 (2012), e30619. doi: 10.1371/journal.pone.0030619. pmid: 22312429.

[77] D. R. Zerbino and E. Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. In: Genome Res. 18.5 (May 2008), pp. 821–829. doi: 10.1101/gr.074492.107. pmid: 18349386.

[78] M. H. Schulz, D. R. Zerbino, M. Vingron, and E. Birney. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. In: Bioinformatics 28.8 (Apr. 2012), pp. 1086–1092. doi: 10.1093/bioinformatics/bts094. pmid: 22368243.

[79] R. Luo, B. Liu, Y. Xie, Z. Li, W. Huang, J. Yuan, G. He, Y. Chen, Q. Pan, Y. Liu, J. Tang, G. Wu, H. Zhang, Y. Shi, Y. Liu, C. Yu, B. Wang, Y. Lu, C. Han, D. Cheung, S.-M. Yiu, S. Peng, Z. Xiaoqian, G. Liu, X. Liao, Y. Li, H. Yang, J. Wang, T.-W. Lam, and J. Wang. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. In: GigaScience 1.1 (2012), p. 18. doi: 10.1186/2047-217X-1-18.

[80] J. Butler, I. MacCallum, M. Kleber, I. A. Shlyakhter, M. K. Belmonte, E. S. Lander, C. Nusbaum, and D. B. Jaffe. ALLPATHS: de novo assembly of whole-genome shotgun microreads. In: Genome Res. 18.5 (May 2008), pp. 810–820. doi: 10.1101/gr.7337908. pmid: 18340039.

[81] I. Maccallum, D. Przybylski, S. Gnerre, J. Burton, I. Shlyakhter, A. Gnirke, J. Malek, K. McKernan, S. Ranade, T. P. Shea, L. Williams, S. Young, C. Nusbaum, and D. B. Jaffe. ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads. In: Genome Biol. 10.10 (2009), R103. doi: 10.1186/gb-2009-10-10-r103. pmid: 19796385.

[82] F. J. Ribeiro, D. Przybylski, S. Yin, T. Sharpe, S. Gnerre, A. Abouelleil, A. M. Berlin, A. Montmayeur, T. P. Shea, B. J. Walker, S. K. Young, C. Russ, C. Nusbaum, I. MacCallum, and D. B. Jaffe. Finished bacterial genomes from shotgun sequence data. In: Genome Res. 22.11 (Nov. 2012), pp. 2270–2277. doi: 10.1101/gr.141515.112. pmid: 22829535.

[83] S. Gnerre, I. Maccallum, D. Przybylski, F. J. Ribeiro, J. N. Burton, B. J. Walker, T. Sharpe, G. Hall, T. P. Shea, S. Sykes, A. M. Berlin, D. Aird, M. Costello, R. Daza, L. Williams, R. Nicol, A. Gnirke, C. Nusbaum, E. S. Lander, and D. B. Jaffe. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. In: Proc. Natl. Acad. Sci. U.S.A. 108.4 (Jan. 2011), pp. 1513–1518. doi: 10.1073/pnas.1017351108. pmid: 21187386.


[84] K. Schneeberger, S. Ossowski, F. Ott, J. D. Klein, X. Wang, C. Lanz, L. M. Smith, J. Cao, J. Fitz, N. Warthmann, S. R. Henz, D. H. Huson, and D. Weigel. Reference-guided assembly of four diverse Arabidopsis thaliana genomes. In: Proc. Natl. Acad. Sci. U.S.A. 108.25 (June 2011), pp. 10249–10254. doi: 10.1073/pnas.1107739108.

[85] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. In: Bioinformatics 25.14 (July 2009), pp. 1754–1760. doi: 10.1093/bioinformatics/btp324. pmid: 19451168.

[86] B. Langmead. Aligning short sequencing reads with Bowtie. In: Curr Protoc Bioinformatics Chapter 11 (Dec. 2010), Unit 11.7. doi: 10.1002/0471250953.bi1107s32. pmid: 21154709.

[87] B. Langmead and S. L. Salzberg. Fast gapped-read alignment with Bowtie 2. In: Nat. Methods 9.4 (Apr. 2012), pp. 357–359. doi: 10.1038/nmeth.1923. pmid: 22388286.

[88] R. Li, Y. Li, K. Kristiansen, and J. Wang. SOAP: short oligonucleotide alignment program. In: Bioinformatics 24.5 (Mar. 2008), pp. 713–714. doi: 10.1093/bioinformatics/btn025. pmid: 18227114.

[89] R. Li, C. Yu, Y. Li, T. W. Lam, S. M. Yiu, K. Kristiansen, and J. Wang. SOAP2: an improved ultrafast tool for short read alignment. In: Bioinformatics 25.15 (Aug. 2009), pp. 1966–1967. doi: 10.1093/bioinformatics/btp336. pmid: 19497933.

[90] S. M. Rumble, P. Lacroute, A. V. Dalca, M. Fiume, A. Sidow, and M. Brudno. SHRiMP: accurate mapping of short color-space reads. In: PLoS Comput. Biol. 5.5 (May 2009), e1000386. doi: 10.1371/journal.pcbi.1000386. pmid: 19461883.

[91] M. David, M. Dzamba, D. Lister, L. Ilie, and M. Brudno. SHRiMP2: sensitive yet practical SHort Read Mapping. In: Bioinformatics 27.7 (Apr. 2011), pp. 1011–1012. doi: 10.1093/bioinformatics/btr046. pmid: 21278192.

[92] H. Li, J. Ruan, and R. Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. In: Genome Res. 18.11 (Nov. 2008), pp. 1851–1858. doi: 10.1101/gr.078212.108. pmid: 18714091.

[93] K. Schneeberger, J. Hagmann, S. Ossowski, N. Warthmann, S. Gesing, O. Kohlbacher, and D. Weigel. Simultaneous alignment of short reads against multiple genomes. In: Genome Biol. 10.9 (2009), R98. doi: 10.1186/gb-2009-10-9-r98. pmid: 19761611.

[94] G. Jean, A. Kahles, V. T. Sreedharan, F. De Bona, and G. Ratsch. RNA-Seq read alignments with PALMapper. In: Curr Protoc Bioinformatics Chapter 11 (Dec. 2010), Unit 11.6. doi: 10.1002/0471250953.bi1106s32. pmid: 21154708.

[95] M. C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. In: Bioinformatics 25.11 (June 2009), pp. 1363–1369. doi: 10.1093/bioinformatics/btp236. pmid: 19357099.

[96] P. Ferragina and G. Manzini. Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science. FOCS '00. Washington, DC, USA: IEEE Computer Society, 2000, pp. 390. isbn: 0-7695-0850-2. doi: 10.1109/SFCS.2000.892127.

[97] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, and M. A. DePristo. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. In: Genome Res. 20.9 (Sept. 2010), pp. 1297–1303. doi: 10.1101/gr.107524.110. pmid: 20644199.


[98] M. A. DePristo, E. Banks, R. Poplin, K. V. Garimella, J. R. Maguire, C. Hartl, A. A. Philippakis, G. del Angel, M. A. Rivas, M. Hanna, A. McKenna, T. J. Fennell, A. M. Kernytsky, A. Y. Sivachenko, K. Cibulskis, S. B. Gabriel, D. Altshuler, and M. J. Daly. A framework for variation discovery and genotyping using next-generation DNA sequencing data. In: Nat. Genet. 43.5 (May 2011), pp. 491–498. doi: 10.1038/ng.806. pmid: 21478889.

[99] S. Ossowski, K. Schneeberger, J. I. Lucas-Lledo, N. Warthmann, R. M. Clark, R. G. Shaw, D. Weigel, and M. Lynch. The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. In: Science 327.5961 (Jan. 2010), pp. 92–94. doi: 10.1126/science.1180677. pmid: 20044577.

[100] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin. The Sequence Alignment/Map format and SAMtools. In: Bioinformatics 25.16 (Aug. 2009), pp. 2078–2079. doi: 10.1093/bioinformatics/btp352. pmid: 19505943.

[101] H. Li. Improving SNP discovery by base alignment quality. In: Bioinformatics 27.8 (Apr. 2011), pp. 1157–1158. doi: 10.1093/bioinformatics/btr076. pmid: 21320865.

[102] M. D. Robinson and A. Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. In: Genome Biol. 11.3 (2010), R25. doi: 10.1186/gb-2010-11-3-r25. pmid: 20196867.

[103] L. X. Garmire and S. Subramaniam. Evaluation of normalization methods in mammalian microRNA-Seq data. In: RNA 18.6 (June 2012), pp. 1279–1288. doi: 10.1261/rna.030916.111. pmid: 22532701.

[104] Y. Zhang, T. Liu, C. A. Meyer, J. Eeckhoute, D. S. Johnson, B. E. Bernstein, C. Nusbaum, R. M. Myers, M. Brown, W. Li, and X. S. Liu. Model-based analysis of ChIP-Seq (MACS). In: Genome Biol. 9.9 (2008), R137. doi: 10.1186/gb-2008-9-9-r137. pmid: 18798982.

[105] A. Valouev, D. S. Johnson, A. Sundquist, C. Medina, E. Anton, S. Batzoglou, R. M. Myers, and A. Sidow. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. In: Nat. Methods 5.9 (Sept. 2008), pp. 829–834. doi: 10.1038/nmeth.1246. pmid: 19160518.

[106] R. Jothi, S. Cuddapah, A. Barski, K. Cui, and K. Zhao. Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. In: Nucleic Acids Res. 36.16 (Sept. 2008), pp. 5221–5231. doi: 10.1093/nar/gkn488. pmid: 18684996.

[107] J. Rozowsky, G. Euskirchen, R. K. Auerbach, Z. D. Zhang, T. Gibson, R. Bjornson, N. Carriero, M. Snyder, and M. B. Gerstein. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. In: Nat. Biotechnol. 27.1 (Jan. 2009), pp. 66–75. doi: 10.1038/nbt.1518. pmid: 19122651.

[108] H. Ji, H. Jiang, W. Ma, D. S. Johnson, R. M. Myers, and W. H. Wong. An integrated software system for analyzing ChIP-chip and ChIP-seq data. In: Nat. Biotechnol. 26.11 (Nov. 2008), pp. 1293–1300. doi: 10.1038/nbt.1505. pmid: 18978777.

[109] A. P. Fejes, G. Robertson, M. Bilenky, R. Varhol, M. Bainbridge, and S. J. Jones. FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. In: Bioinformatics 24.15 (Aug. 2008), pp. 1729–1730. doi: 10.1093/bioinformatics/btn305. pmid: 18599518.


[110] P. J. Cock, C. J. Fields, N. Goto, M. L. Heuer, and P. M. Rice. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. In: Nucleic Acids Res. 38.6 (Apr. 2010), pp. 1767–1771. doi: 10.1093/nar/gkp1137. pmid: 20015970.

[111] T. S. F. S. W. Group. The SAM Format Specification. Sept. 2011. WebCite: 6FqvzGZJ8. url: http://samtools.sourceforge.net/SAM1.pdf (visited Feb. 27, 2013).

[112] L. Stein. Generic Feature Format Version 3. Feb. 2013. WebCite: 5R00Wxobq. url: http://www.sequenceontology.org/gff3.shtml (visited Mar. 4, 2013).

[113] M. G. Reese, B. Moore, C. Batchelor, F. Salas, F. Cunningham, G. T. Marth, L. Stein, P. Flicek, M. Yandell, and K. Eilbeck. A standard variation file format for human genome sequences. In: Genome Biol. 11.8 (2010), R88. doi: 10.1186/gb-2010-11-8-r88. pmid: 20796305.

[114] P. Danecek, A. Auton, G. Abecasis, C. A. Albers, E. Banks, M. A. DePristo, R. E. Handsaker, G. Lunter, G. T. Marth, S. T. Sherry, G. McVean, R. Durbin, R. Durbin, D. Altshuler, G. Abecasis, D. Bentley, A. Chakravarti, A. Clark, F. De La Vega, P. Donnelly, M. Dunn, P. Flicek, S. Gabriel, E. Green, R. Gibbs, B. Knoppers, E. Lander, H. Lehrach, E. Mardis, G. Marth, et al. The variant call format and VCFtools. In: Bioinformatics 27.15 (Aug. 2011), pp. 2156–2158. doi: 10.1093/bioinformatics/btr330. pmid: 21653522.

[115] P. Deutsch. DEFLATE Compressed Data Format Specification version 1.3. RFC 1951 (Informational). Internet Engineering Task Force, May 1996. WebCite: 6Fr6mTscx. url: http://www.ietf.org/rfc/rfc1951.txt.

[116] P. Deutsch. GZIP file format specification version 4.3. RFC 1952 (Informational). Internet Engineering Task Force, May 1996. WebCite: 6Fr6csgTT. url: http://www.ietf.org/rfc/rfc1952.txt.

[117] X. Chen, M. Li, B. Ma, and J. Tromp. DNACompress: fast and effective DNA sequence compression. In: Bioinformatics 18.12 (Dec. 2002), pp. 1696–1698. doi: 10.1093/bioinformatics/18.12.1696. pmid: 12490460.

[118] F. Hach, I. Numanagic, C. Alkan, and S. C. Sahinalp. SCALCE: boosting sequence compression algorithms using locally consistent encoding. In: Bioinformatics 28.23 (Dec. 2012), pp. 3051–3057. doi: 10.1093/bioinformatics/bts593. pmid: 23047557.

[119] D. C. Jones, W. L. Ruzzo, X. Peng, and M. G. Katze. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. In: Nucleic Acids Res. 40.22 (Dec. 2012), e171. doi: 10.1093/nar/gks754. pmid: 22904078.

[120] M. Hsi-Yang Fritz, R. Leinonen, G. Cochrane, and E. Birney. Efficient storage of high throughput DNA sequencing data using reference-based compression. In: Genome Res. 21.5 (May 2011), pp. 734–740. doi: 10.1101/gr.114819.110. pmid: 21245279.

[121] S. Golomb. Run-length encodings (Corresp.) In: Information Theory, IEEE Transactions on 12.3 (1966), pp. 399–401. issn: 0018-9448. doi: 10.1109/TIT.1966.1053907.

[122] D. Huffman. A Method for the Construction of Minimum-Redundancy Codes. In: Proceedings of the IRE 40.9 (1952), pp. 1098–1101. issn: 0096-8390. doi: 10.1109/JRPROC.1952.273898.

[123] W. J. Kent, A. S. Zweig, G. Barber, A. S. Hinrichs, and D. Karolchik. BigWig and BigBed: enabling browsing of large distributed datasets. In: Bioinformatics 26.17 (Sept. 2010), pp. 2204–2207. doi: 10.1093/bioinformatics/btq351. pmid: 20639541.


[124] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The human genome browser at UCSC. In: Genome Res. 12.6 (June 2002), pp. 996–1006. doi: 10.1101/gr.229102. pmid: 12045153.

[125] H. Li. Tabix: fast retrieval of sequence features from generic TAB-delimited files. In: Bioinformatics 27.5 (Mar. 2011), pp. 718–719. doi: 10.1093/bioinformatics/btq671. pmid: 21208982.

[126] A. Guttman. R-trees: a dynamic index structure for spatial searching. In: SIGMOD Rec. 14.2 (June 1984), pp. 47–57. issn: 0163-5808. doi: 10.1145/971697.602266.

[127] L. Collin and I. Pavlov. The .xz File Format version 1.0.4. Aug. 2009. WebCite: 6FqpSZFsW. url: http://tukaani.org/xz/xz-file-format.txt (visited Nov. 4, 2012).

[128] J. L. Bentley. Multidimensional binary search trees used for associative searching. In: Commun. ACM 18.9 (Sept. 1975), pp. 509–517. issn: 0001-0782. doi: 10.1145/361002.361007.

[129] Y. Livnat, H.-W. Shen, and C. R. Johnson. A Near Optimal Isosurface Extraction Algorithm Using the Span Space. In: IEEE Transactions on Visualization and Computer Graphics 2.1 (Mar. 1996), pp. 73–84. issn: 1077-2626. doi: 10.1109/2945.489388.

[130] D. Lee and C. Wong. Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees. English. In: Acta Informatica 9.1 (1977), pp. 23–29. issn: 0001-5903. doi: 10.1007/BF00263763.

[131] S. Kurtz. The Vmatch large scale sequence analysis software. Dec. 2011. WebCite: 5EyNRAwVK. url: http://www.vmatch.de (visited Feb. 27, 2013).

[132] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. In: Journal of Discrete Algorithms 2.1 (2004), pp. 53–86. doi: 10.1016/S1570-8667(03)00065-0.

[133] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. In: J. Mol. Biol. 48.3 (Mar. 1970), pp. 443–453. doi: 10.1016/0022-2836(70)90057-4. pmid: 5420325.

[134] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. In: J. Mol. Biol. 147.1 (Mar. 1981), pp. 195–197. doi: 10.1016/0022-2836(81)90087-5. pmid: 7265238.

[135] F. Van Nieuwerburgh, R. C. Thompson, J. Ledesma, D. Deforce, T. Gaasterland, P. Ordoukhanian, and S. R. Head. Illumina mate-paired DNA sequencing-library preparation using Cre-Lox recombination. In: Nucleic Acids Res. 40.3 (Feb. 2012), e24. doi: 10.1093/nar/gkr1000. pmid: 22127871.

[136] W. J. Kent. BLAT – the BLAST-like alignment tool. In: Genome Res. 12.4 (Apr. 2002), pp. 656–664. doi: 10.1101/gr.229202. pmid: 11932250.

[137] E. Moyroud, E. G. Minguet, F. Ott, L. Yant, D. Pose, M. Monniaux, S. Blanchet, O. Bastien, E. Thevenon, D. Weigel, M. Schmid, and F. Parcy. Prediction of regulatory interactions from genome sequences using a biophysical model for the Arabidopsis LEAFY transcription factor. In: Plant Cell 23.4 (Apr. 2011), pp. 1293–1306. doi: 10.1105/tpc.111.083329.


[138] R. Brandt, M. Salla-Martret, J. Bou-Torrent, T. Musielak, M. Stahl, C. Lanz, F. Ott, M. Schmid, T. Greb, M. Schwarz, S. B. Choi, M. K. Barton, B. J. Reinhart, T. Liu, M. Quint, J. C. Palauqui, J. F. Martinez-Garcia, and S. Wenkel. Genome-wide binding-site analysis of REVOLUTA reveals a link between leaf patterning and light-mediated growth responses. In: Plant J. 72.1 (Oct. 2012), pp. 31–42. doi: 10.1111/j.1365-313X.2012.05049.x.

[139] R. G. Immink, D. Pose, S. Ferrario, F. Ott, K. Kaufmann, F. L. Valentim, S. de Folter, F. van der Wal, A. D. van Dijk, M. Schmid, and G. C. Angenent. Characterization of SOC1's central role in flowering by the identification of its upstream and downstream regulators. In: Plant Physiol. 160.1 (Sept. 2012), pp. 433–449. doi: 10.1104/pp.112.202614.

[140] D. Pose, L. Verhage, F. Ott, L. Yant, J. Mathieu, G. C. Angenent, R. G. Immink, and M. Schmid. Temperature-dependent regulation of flowering by antagonistic FLM variants. In: Nature (Sept. 2013). doi: 10.1038/nature12633.

[141] P. Merelo, Y. Xie, L. Brand, F. Ott, D. Weigel, J. L. Bowman, M. G. Heisler, and S. Wenkel. Genome-Wide Identification of KANADI1 Target Genes. In: PLoS ONE 8.10 (2013), e77341. doi: 10.1371/journal.pone.0077341.

[142] E. S. Lander and M. S. Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. In: Genomics 2.3 (Apr. 1988), pp. 231–239. doi: 10.1016/0888-7543(88)90007-9. pmid: 3294162.

[143] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. In: Journal of the Royal Statistical Society. Series B (Methodological) (1995), pp. 289–300.

[144] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. In: Proceedings of the first annual ACM-SIAM symposium on Discrete algorithms. SODA '90. San Francisco, California, USA: Society for Industrial and Applied Mathematics, 1990, pp. 319–327. isbn: 0-89871-251-3.

[145] G. Nong, S. Zhang, and W. H. Chan. Two Efficient Algorithms for Linear Time Suffix Array Construction. In: IEEE Transactions on Computers 60.10 (2011), pp. 1471–1484. issn: 0018-9340. doi: 10.1109/TC.2010.188.

[146] K. S. Pollard, H. N. Gilbert, Y. Ge, S. Taylor, and S. Dudoit. multtest: Resampling-based multiple hypothesis testing. R package version 2.4.0.

[147] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. In: Annals of Statistics (2001), pp. 1165–1188.

[148] O. J. Dunn. Multiple comparisons among means. In: Journal of the American Statistical Association 56.293 (1961), pp. 52–64.

[149] Y. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. In: Biometrika 75.4 (1988), pp. 800–802. doi: 10.1093/biomet/75.4.800.

[150] S. Holm. A simple sequentially rejective multiple test procedure. In: Scandinavian Journal of Statistics (1979), pp. 65–70.

[151] P. H. Westfall and S. Young. Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. A Wiley-Interscience publication. Wiley, 1993. isbn: 9780471557616.
