The Pennsylvania State University
The Graduate School
Capital College

An Improved Genetic Algorithm Solving the DNA Sequencing Problem with Errors

A Master's Paper in Computer Science
by Waleed Youssef

© 2004 Waleed Youssef

Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science

March 2004



Abstract

Genetic algorithms have proved very effective for the NP-hard problem of DNA sequencing. In practice, genetic algorithms often produce optimal or close-to-optimal solutions in polynomial time.

In this research, we describe a new genetic algorithm for solving the DNA sequencing problem. The algorithm allows the input spectrum to contain both positive and negative errors, as can be expected from a hybridization experiment. Its main features are a preprocessing step that reduces the size of the input spectrum using a dynamic programming technique, and an efficient local optimization process. In experimental tests, the algorithm outperformed existing algorithms in terms of match percentage. Its running time was also better than that of existing algorithms, but we do not emphasize this because different machines were used in testing. The algorithm also performed very well on large data sets generated from real genome data.


Table of Contents

Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
2 Preliminaries
    2.1 Definitions
    2.2 Basic Information
    2.3 Sequencing by Hybridization
    2.4 Fragments Assembly
    2.5 Genetic Algorithms
3 Problem Formulation
    3.1 Describing the Problem
    3.2 Error Model
    3.3 Input and Output
    3.4 Other Related Information
        3.4.1 The Hamiltonian and Eulerian Paths
        3.4.2 Sequence Alignment
4 Algorithm
    4.1 Preprocessing
    4.2 Encoding and Initialization
    4.3 Fitness
    4.4 Parent Selection
    4.5 Crossover
    4.6 Mutation
    4.7 Local Optimization
    4.8 Replacement Scheme
    4.9 Stopping Condition
5 Standardizing the Data
6 Experimental Results
7 Conclusion


Acknowledgements

Dr. T. Bui has been a valued advisor. He provided me with excellent insights and knowledge, and his expertise has made this project a wonderful experience for me. I owe a good deal to him for making this project a success. My special thanks go to the committee members, Dr. Q. Ding, Dr. P. Naumov, and Dr. L. Null, for their positive comments and feedback. Thanks also to everyone in the Computer Science department for the greatest learning experience I have ever had.

I would also like to thank Dr. J. Blazewicz and Dr. M. Kasprzak, Institute of Computing Science, Poznan University of Technology, for providing me with the data used in [7]; Dr. S. Skiena, Department of Computer Science, State University of New York, for useful discussions; and Jie Li, Iowa State University, for providing me with an implementation of the Smith-Waterman algorithm.

A version of this paper appears in [9].


List of Figures

1 A Typical Structure of Steady-State Genetic Algorithm
2 A Typical Structure of Hybrid Genetic Algorithm
3 Reconstructing the sequence in case of no errors
4 Reconstructing the sequence in case of errors in the spectrum
5 Recurrence Formula for Sequence Alignment
6 The Enhanced Genetic Algorithm for DNA sequencing
7 The Preprocessing Algorithm
8 The Algorithm to repair the Chromosome
9 The Algorithm for the Local Optimization Process
10 The Spectrum Generator Algorithm
11 Comparison between our algorithm and others
12 Plot of Match Percentage against Error Combinations (l=10)
13 Plot of Running Time against Error Combinations (l=10)
14 Plot of Match Percentage against Error Combinations (l=20)
15 Plot of Running Time against Error Combinations (l=20)
16 Plot of Match Percentage against Negative Errors
17 Plot of Running Time against Negative Errors
18 Plot of Match Percentage against Positive Errors
19 Plot of Running Time against Positive Errors
20 Plot of Match Percentage against Optimal Length
21 Plot of Running Time against Optimal Length
22 Plot of Match Percentage against Fragment Length
23 Plot of Running Time against Fragment Length


List of Tables

1 The Dynamic Programming Matrix representation for Example 4
2 Genomes obtained from the GenBank
3 Positive and Negative error combinations used in testing the algorithm
4 Summary of results from our algorithm and others


1 Introduction

Determining the genome of living organisms has been a major research initiative worldwide in the last few years. One of the principal steps in this endeavor is the sequencing of DNA. Informally, DNA sequencing is the process of determining the correct order of nucleotides in a DNA segment. Many techniques have been developed for DNA sequencing. DNA sequencing experiments are typically performed in two stages: shotgun sequencing and walking. In shotgun sequencing, many short, randomly selected fragments of a DNA segment are sequenced. Due to the stochastic nature of this process, parts of the DNA segment are left unsequenced or insufficiently covered. These parts are then covered by a deterministic finishing process called walking [22].

The two most popular methods for DNA sequencing are the Sanger method and the Sequencing by Hybridization (SBH) method [23]. In this research we consider only the SBH method. SBH follows the methodology known as "break, read, and assemble". In this methodology a DNA sequence is partitioned into smaller fragments. The fragments are then read using fluorescent light. The assemble phase tries to recover the original sequence from the shorter fragments, i.e., to determine the exact sequence of nucleotides of the DNA molecule.

From an algorithmic point of view, the DNA sequencing problem is the problem of constructing a chromosome that most likely contains all the DNA fragments in a given input set, called the spectrum. The spectrum is usually obtained through an experiment such as a hybridization experiment. Note that the fragments in a spectrum overlap and all have the same length. If a spectrum contains all possible fragments of length l of a DNA sequence and there are no errors in the fragments, then there exist efficient algorithms for reconstructing the original DNA sequence from the spectrum. In general, however, there are errors in the spectrum, e.g., missing fragments or erroneous fragments, which make the problem of reconstructing the original DNA sequence NP-hard [4].


In this study we present a genetic algorithm for the DNA sequencing problem. This algorithm differs from others in that it can efficiently handle different types of errors in the input. Additionally, the algorithm includes a preprocessing step that effectively reduces the size of the input, thereby reducing the running time of the algorithm. The preprocessing step also improves the chance of obtaining the optimal answer. The idea of the preprocessing step can be extended to create a hierarchical structure that enables the algorithm to deal with much longer sequences. Experimental results show that this new algorithm outperformed other algorithms from [1] and [7]. We also performed extensive tests of the algorithm on data that we generated systematically from genomes obtained from GenBank [13]. The data was generated using another algorithm that we developed to simulate the Sequencing by Hybridization experiment. To determine the quality of the results obtained, the Smith-Waterman algorithm for sequence alignment was used [27]. The results of these experiments show that our algorithm is very robust over a large range of data generated with different types of errors.

The rest of this paper is organized as follows. Section 2 describes some common terminology and background information. Section 3 defines the problem formally, gives background information about the algorithmic methods and techniques used in developing the algorithm, and describes some techniques that were previously used to solve the same problem. The algorithm is described in Section 4. Techniques used to obtain and standardize the data are described in Section 5. Experimental results comparing the performance of our algorithm against others are given in Section 6, which also includes results showing the performance of our algorithm on large data sets that we generated. Conclusions and future directions are given in Section 7.

2 Preliminaries

DNA contains the genetic code that is passed from generation to generation. It determines many facets of how organisms develop. Over many years, biologists have tried to determine and analyze sequences of genomes that hold characteristics of all living things. Many mysteries have been identified and many secrets have been revealed. One of the most challenging technical projects of recent times is the Human Genome Project. Sequencing the whole human genome will help reveal the estimated 30,000 to 35,000 human genes within our DNA as well as the regions controlling them. The resulting DNA sequence maps will be used by 21st century scientists to explore human biology and other complex phenomena [17].

In the next few subsections, we describe some of the terminology needed for the rest of this paper. We also give brief preliminary information about DNA sequencing in general and the Sequencing by Hybridization experiment in particular. Additional background information is also discussed.

2.1 Definitions

DNA can be defined as a string of symbols drawn from the set Σ = {A, C, T, G}, where A represents Adenine, C represents Cytosine, G represents Guanine, and T represents Thymine. Each symbol is known as a nucleotide. An oligonucleotide is a short sequence of nucleotides. It is also known as a fragment. A sequence or strand is a string of a larger number of nucleotides. In the genetic algorithm literature, the strand or sequence is usually referred to as a chromosome. A hybridization experiment is an experiment that takes a DNA strand and produces fragments of that strand. These fragments usually overlap. The set of all fragments that results from a hybridization experiment is known as a spectrum. All fragments in a spectrum have the same length.

Example 1.

• A fragment of length 10: ACTGCTGGTT

• A chromosome C: ACTGCTGGTGTCTGTACGAGGTACGTAGCA

• Spectrum S of cardinality 21, obtained from chromosome C with l = 10:

{ACTGCTGGTG, CTGCTGGTGT, TGCTGGTGTC, GCTGGTGTCT,
CTGGTGTCTG, TGGTGTCTGT, GGTGTCTGTA, GTGTCTGTAC,
TGTCTGTACG, GTCTGTACGA, TCTGTACGAG, CTGTACGAGG,
TGTACGAGGT, GTACGAGGTA, TACGAGGTAC, ACGAGGTACG,
CGAGGTACGT, GAGGTACGTA, AGGTACGTAG, GGTACGTAGC,
GTACGTAGCA}
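The spectrum in Example 1 can be reproduced with a short sliding-window sketch. The function name `spectrum` is our own choice for illustration and is not taken from the paper; the sketch models an error-free experiment only.

```python
def spectrum(sequence: str, l: int) -> set:
    """Return the error-free spectrum: every length-l substring of sequence."""
    return {sequence[i:i + l] for i in range(len(sequence) - l + 1)}


# Chromosome C from Example 1, with fragment length l = 10.
C = "ACTGCTGGTGTCTGTACGAGGTACGTAGCA"
S = spectrum(C, 10)
print(len(S))  # cardinality of the spectrum
```

A chromosome of length 30 has 30 − 10 + 1 = 21 windows of length 10, which matches the cardinality stated in the example because no window occurs twice in C.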

2.2 Basic Information

DNA (deoxyribonucleic acid) consists of two strands, each of which contains nucleotides drawn from the set Σ = {A, C, G, T}. (Technically, there are other components in a DNA strand, such as phosphates.) The nucleotides in each strand are connected together in series. The two strands of the DNA are twisted together into the famous double helix structure. Furthermore, each nucleotide in a strand is connected to a complementary nucleotide in the other strand, where A is paired with T and C is paired with G. Thus, each strand of a DNA molecule completely determines the other. Sequencing techniques make use of this important characteristic to determine the original sequence of oligonucleotides by adding labeled complementary solutions to the experiment and determining which fragments paired up with the labeled ones.
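As a small illustration of how one strand determines the other, the pairing rule (A with T, C with G) can be written as a lookup table. This is a minimal sketch; the names `COMPLEMENT` and `complement` are ours, and the sketch ignores strand orientation.

```python
# Base-pairing rule from the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the position-by-position complementary strand."""
    return "".join(COMPLEMENT[base] for base in strand)


print(complement("ACTGCTGGTT"))
```

Applying `complement` twice returns the original strand, which is the sense in which each strand completely determines the other.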

Three main areas of interest can be distinguished in the study of DNA: DNA sequencing, DNA assembly, and DNA mapping. DNA sequencing is the process of determining the original sequence of nucleotides from a set of DNA fragments. DNA assembly is the process of assembling the sequenced fragments into longer contigs. Finally, DNA mapping deals with whole chromosomes and tries to place marked DNA fragments (usually genes) on certain chromosome regions [5]. Our research concentrated mainly on DNA sequencing. We were also able to extend the algorithm presented here to perform some DNA assembly work, as will be shown in later sections.


2.3 Sequencing by Hybridization

Hybridization is a parallel experiment with high throughput that works well for limited regions of a genome. It is a procedure in which two single-stranded DNA molecules containing complementary sequences of nucleotides bond together. A probe is a DNA molecule that is fluorescently labeled. By testing whether a probe hybridizes to a given sequence, it is possible to determine whether the sequence contains a piece that is complementary to the probe. Techniques have been devised that make it possible to test the hybridization of a single probe to hundreds of different sequences in a single automated experiment. In the hybridization experiment, DNA arrays (also known as DNA chips), which contain thousands of short fragments of length l (for example, l = 10) attached to a surface, are exposed to a solution containing the unknown fluorescently labeled DNA fragment. After the reaction, one can obtain a set of oligonucleotides that are fragments of the examined DNA sequence by reading a fluorescent image of the chip. Those fragments or oligonucleotides constitute the spectrum. The spectrum is read by exposing the chip to light. The original sequence is then reconstructed using combinatorial algorithms that take the generated spectrum as input and return a sequence that best represents the spectrum.

The reliability of a fragment depends on its binding energy to the target. This energy is a function of many factors, such as the length of the probe, the oligonucleotide content of the probe, and similar sequences in the target. The probes should not be too short if we want to be able to reconstruct the sequence efficiently. Overlaps between probes are necessarily shorter than the probe length, so if the probes are too short the overlaps carry too little information and it is almost impossible to reconstruct the original sequence. Long probes increase the hybridization reliability. However, fragments longer than a few hundred nucleotides cannot be sequenced reliably by current methods, so the fragments will typically be rather short. In addition, long fragments are uninformative at the single-nucleotide level. The fragment length also influences thermal stability.

Sequencing by hybridization (SBH) is an elegant and efficient sequencing method in the case of error-free data. SBH is fast and convenient. However, it is limited to extremely short sequences, as even sequencing an 8-base sequence implies an array with 4^8 = 65,536 elements. Sequences of a hundred bases would require a currently infeasible array size. In general, sequencing might be accomplished with the entire set of 4^N probes of length N. However, in a real experiment, errors exist. These errors are correlated more with differences in melting temperature than with purely random effects.
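The growth of the array size with probe length can be checked with a few lines of arithmetic. This is only an illustrative sketch; `probe_count` is a name we introduce here.

```python
def probe_count(n: int) -> int:
    """Number of distinct probes of length n over the alphabet {A, C, G, T}."""
    return 4 ** n


# Array sizes for a few probe lengths; the count quadruples with every extra base.
for n in (8, 10, 12):
    print(n, probe_count(n))
```

For n = 8 this gives the 65,536 elements quoted above.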

2.4 Fragments Assembly

Fragment assembly is the process of constructing a longer sequence from the input spectrum. Shorter sequences can be assembled using the DNA sequencing process, but DNA sequencing is not very efficient for longer sequences. In general, the process of assembling large sequences can be divided into two main steps. The first step is to reconstruct longer sequences from input spectra; this can be done efficiently using DNA sequencing algorithms. The second step is to align the sequences obtained from the DNA sequencing process using sequence alignment algorithms. The process can be viewed as a tree in which the short fragments are the leaves, and the goal is to move up the tree, obtaining longer and longer sequences, until the root is reached and the optimal sequence is obtained.

2.5 Genetic Algorithms

Before describing the problem more formally, we provide some background information on how genetic algorithms can be used to obtain good solutions to many NP-hard problems efficiently. Genetic algorithms were formally introduced in 1975 by John Holland at the University of Michigan [15]. In 1992 John Koza used genetic algorithms to evolve programs to perform certain tasks. He called his method genetic programming. Since then, genetic algorithms have become more and more popular. The continuing price and performance improvements of computational systems have made them attractive for many types of optimization problems. In particular, genetic algorithms work very well on mixed (continuous and discrete) combinatorial problems. They are less susceptible to getting 'stuck' at local optima than gradient search methods, but they tend to be computationally expensive.

To use a genetic algorithm, a solution to the problem must be represented as a genome (or chromosome). The genetic algorithm then creates a population of solutions and applies genetic operators such as mutation and crossover to evolve the solutions in order to find the best one(s). Many variations of genetic algorithms exist, depending on the structure of the algorithm. A genetic algorithm that generates only one offspring per generation is called a steady-state genetic algorithm, as opposed to a generational genetic algorithm, which replaces the whole population, or a large subset of it, in each generation. A typical structure of a steady-state genetic algorithm is given in Figure 1. If a local optimization step is added to the steady-state genetic algorithm, then the algorithm is said to be hybridized and the scheme is called a hybrid genetic algorithm. A typical structure for the hybrid GA is shown in Figure 2. We provide a hybrid steady-state GA for the DNA sequencing problem.

Genetic Algorithm
    generate a random initial population P
    repeat
        select two parents p1 and p2 from the population
        offspring ← crossover(p1, p2)
        mutate(offspring)
        if suited(offspring) then
            replace(P, offspring)
    until (there is no improvement)
    return the best member of P

Figure 1: A Typical Structure of Steady-State Genetic Algorithm

Below, we introduce some common and important terminology of genetic algorithms, as well as some recommendations that should be taken into consideration when designing a new genetic algorithm. These recommendations can improve the evolution of the algorithm and lead to better results.

Genetic Algorithm
    generate a random initial population P
    repeat
        select two parents p1 and p2 from the population
        offspring ← crossover(p1, p2)
        mutate(offspring)
        localOptimize(offspring)
        if suited(offspring) then
            replace(P, offspring)
    until (there is no improvement)
    return the best member of P

Figure 2: A Typical Structure of Hybrid Genetic Algorithm
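The structure in Figure 2 can be sketched in Python on a toy problem. Everything concrete below is an assumption made for illustration and is not the paper's DNA sequencing algorithm: the chromosome is a bit list, the fitness (`one_max`) counts 1s, the crossover is one-point, "suited" is read as "better than the worst member", and "no improvement" is read as `patience` consecutive offspring that fail to raise the best fitness.

```python
import random


def one_max(bits):
    """Toy fitness (assumption): the number of 1s in the chromosome."""
    return sum(bits)


def crossover(p1, p2, rng):
    """One-point crossover: prefix of p1 followed by suffix of p2."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]


def mutate(bits, rng, rate=0.02):
    """Flip each bit independently with a small probability."""
    return [b ^ 1 if rng.random() < rate else b for b in bits]


def local_optimize(bits, fitness):
    """Hill climbing: keep every single-bit flip that improves fitness."""
    best = list(bits)
    for i in range(len(best)):
        trial = best[:i] + [best[i] ^ 1] + best[i + 1:]
        if fitness(trial) > fitness(best):
            best = trial
    return best


def hybrid_steady_state_ga(fitness, length=20, pop_size=50, patience=200, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    stale = 0  # offspring produced since the best fitness last improved
    while stale < patience:
        p1, p2 = rng.sample(population, 2)
        child = local_optimize(mutate(crossover(p1, p2, rng), rng), fitness)
        worst = min(range(pop_size), key=lambda i: fitness(population[i]))
        improved = fitness(child) > max(map(fitness, population))
        if fitness(child) > fitness(population[worst]):
            population[worst] = child  # one replacement per step: steady state
        stale = 0 if improved else stale + 1
    return max(population, key=fitness)


best = hybrid_steady_state_ga(one_max)
print(one_max(best), best)
```

On this toy fitness the hill-climbing step alone already reaches the optimum, so the example only shows how the pieces of Figure 2 fit together; removing the `local_optimize` call gives the plain steady-state scheme of Figure 1.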

Population: A genetic algorithm starts with a set of initial solutions (chromosomes) called a population. The size of the population depends on the problem and on the encoding of the chromosomes. Larger populations usually need more time to evolve, whereas smaller populations might not be able to reach the optimal solution because they are more prone to getting stuck at local optima. A population size of about 50 to 100 is often a good choice. The population size has a direct effect on performance: the larger the population, the slower the algorithm.

Encoding: The encoding depends on the problem and also on the size of the problem instance. There is no general-purpose way of encoding the chromosome. An encoding that is best for one problem might not be suitable for other problems.

Crossover: Crossover is the process of combining two parents to obtain a new member of the population called an offspring. Crossover depends on the chosen encoding and on the problem. One possible crossover operator is the k-point crossover.

In the k-point crossover, two parent chromosomes are cut in exactly k positions, resulting in k + 1 segments for each chromosome. Segments are then concatenated alternately, resulting in two new offspring. The two new offspring are compared and the better one is returned.
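A minimal sketch of the k-point crossover just described follows. It assumes equal-length parents cut at the same k randomly chosen positions and a caller-supplied `fitness` function; the function and parameter names are ours.

```python
import random


def k_point_crossover(p1, p2, k, fitness, rng=random):
    """Cut both parents at the same k points, alternate the k + 1 segments,
    and return the fitter of the two resulting offspring (as a list)."""
    if len(p1) != len(p2) or not 1 <= k < len(p1):
        raise ValueError("parents must have equal length greater than k")
    cuts = sorted(rng.sample(range(1, len(p1)), k))  # k distinct cut points
    bounds = [0] + cuts + [len(p1)]
    child_a, child_b = [], []
    for seg, (start, end) in enumerate(zip(bounds, bounds[1:])):
        # Even-numbered segments come from one parent, odd from the other.
        src_a, src_b = (p1, p2) if seg % 2 == 0 else (p2, p1)
        child_a.extend(src_a[start:end])
        child_b.extend(src_b[start:end])
    return max(child_a, child_b, key=fitness)


child = k_point_crossover([0] * 8, [1] * 8, 3, sum, random.Random(1))
print(child)
```

With an all-zeros and an all-ones parent the two offspring are bitwise complements of each other, so the returned offspring always has at least half of its bits set under the `sum` fitness.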

Mutation: The goal of mutation is to enhance the diversity of the population by introducing new members that are mutated from their parents. The mutation rate should be very low. The best rates seem to be about 0.5% to 5%. Increasing the mutation rate creates more members in the population with new characteristics.

Selection: Many schemes exist for the selection process. One of the simplest is the basic selection method, in which a random number identifying a chromosome in the population is drawn, so that each member of the population has the same probability of being selected. Another commonly used method is roulette wheel selection, in which members of the population with higher fitness have a higher chance of being selected. There are also more sophisticated methods that change the parameters of selection during the run of the genetic algorithm; these behave very similarly to simulated annealing. The right method to use is problem dependent.
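Roulette wheel selection can be sketched as follows. The sketch assumes non-negative fitness values with a positive total; the function name and signature are illustrative and not taken from the paper.

```python
import random


def roulette_wheel_select(population, fitnesses, rng=random):
    """Pick one member with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.random() * total  # a point on the wheel, in [0, total)
    running = 0.0
    for member, fit in zip(population, fitnesses):
        running += fit
        if pick < running:
            return member
    return population[-1]  # guard against floating-point round-off


rng = random.Random(42)
picks = [roulette_wheel_select(["a", "b"], [1, 9], rng) for _ in range(1000)]
print(picks.count("a"), picks.count("b"))
```

The basic selection method from the text corresponds to the special case in which all fitness values are equal, or simply to `rng.choice(population)`.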

Local Optimization: The goal of the local optimization step is to improve the current chromosome and obtain a better one that is either a locally or globally optimal solution to the original problem. Local optimization is problem dependent, and there is no common technique for performing it. An optimization step that is good for one problem might not be good for another.

3 Problem Formulation

In this section we give some basic algorithmic background information as well as a formal description of the DNA sequencing problem.

3.1 Describing the Problem

The DNA sequencing problem is the problem of determining a DNA strand based on a given spectrum. The problem can be modeled as follows. Let Σ = {A, C, T, G} be an alphabet. We consider a spectrum to be a set of strings of length l over Σ; that is, a spectrum consists of fragments of equal length. A spectrum is said to be ideal if the following condition is true for all but one fragment in the spectrum: the suffix of length l − 1 of the fragment is a prefix of exactly one other fragment in the spectrum. The DNA sequencing problem can then be stated as the problem of constructing a string over Σ from a given spectrum (not necessarily an ideal spectrum), so that the resulting string is the shortest string that contains as many of the fragments in the spectrum as possible.
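The definition of an ideal spectrum translates directly into a check. The sketch below is a literal reading of the definition, with "all but one" taken to mean exactly one exception; the name `is_ideal` is ours, and the test spectrum is that of the sequence ACAGTGACTG with l = 5.

```python
def is_ideal(spectrum) -> bool:
    """True if exactly one fragment lacks a unique (l - 1)-overlap successor."""
    frags = set(spectrum)
    exceptions = 0
    for frag in frags:
        suffix = frag[1:]  # suffix of length l - 1
        successors = [other for other in frags
                      if other != frag and other.startswith(suffix)]
        if len(successors) != 1:
            exceptions += 1
    return exceptions == 1


ideal = {"ACAGT", "CAGTG", "AGTGA", "GTGAC", "TGACT", "GACTG"}
print(is_ideal(ideal))              # error-free spectrum
print(is_ideal(ideal - {"AGTGA"}))  # one fragment missing
```

Removing a fragment from the middle of the chain leaves two fragments without a successor, so the second spectrum is no longer ideal.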

3.2 Error Model

Many variations of the sequencing problem exist, depending on the error model used and on other factors in the experiment. Errors occurring during the hybridization experiment play an important role in reconstructing the original sequence. Algorithms for the DNA sequencing problem usually consider one of two cases. The first case assumes that the input spectrum is ideal, meaning that it has no errors. The other case deals with errors in the spectrum.

In general the input spectrum is not ideal. The errors appearing in a spectrum are usually due to errors in the hybridization experiment. Errors can be classified as positive or negative. The spectrum has positive errors when it contains fragments that are not part of the original sequence. It has negative errors when it fails to contain some oligonucleotides that are part of the original sequence. Certain errors are random, meaning that they may disappear when the experiment is repeated. However, many hybridization errors are systematic, meaning that they are likely to recur each time the experiment is run [25][26].
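When the original sequence is known, as in a simulation, positive and negative errors are simply the two set differences between the observed and the error-free spectrum. The following sketch uses names of our own choosing and a small made-up observed spectrum for illustration.

```python
def classify_errors(original: str, observed, l: int):
    """Return (positive, negative) errors of an observed spectrum."""
    ideal = {original[i:i + l] for i in range(len(original) - l + 1)}
    observed = set(observed)
    positive = observed - ideal  # fragments that are not part of the sequence
    negative = ideal - observed  # fragments of the sequence that are missing
    return positive, negative


# Hypothetical observation: AGTGA was lost and TTTTT was picked up spuriously.
observed = {"ACAGT", "CAGTG", "GTGAC", "TGACT", "GACTG", "TTTTT"}
positive, negative = classify_errors("ACAGTGACTG", observed, 5)
print(positive, negative)
```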

If there are no errors, the problem of DNA sequencing is similar to the Shortest Superstring problem [23], which is defined as the problem of reconstructing a string given a collection of overlapping substrings. In this error-free setting the problem is much easier than DNA sequencing with errors, and there exist efficient algorithms for it [24]. For the general Shortest Superstring problem there also exists an approximation algorithm with an approximation factor of three, i.e., the superstring it produces is at most three times as long as the optimal shortest superstring [8]. To help illustrate the idea of DNA sequencing when the input spectra contain no errors, let us consider the following example.

Example 2. Let the original sequence to be found be ACAGTGACTG.<br />
Let the fragment length be 5, i.e., l = 5. Assume that the hybridization experiment<br />
has 0% positive error and 0% negative error. Then, the output from<br />
the experiment would be the set S, where S = {ACAGT, CAGTG, AGTGA,<br />
GTGAC, TGACT, GACTG}. The cardinality of S is n = 6. In the case of no<br />
errors in the spectrum, each fragment overlaps the next one in exactly l − 1<br />
positions, so each fragment after the first contributes one new character.<br />
Thus, the total length can be calculated as n + l − 1. Hence, in<br />
this example, the optimal length is 6 + 5 − 1 = 10, and the overlap occurs in exactly<br />
4 positions. The final sequence can be determined as shown in Figure 3 as<br />
ACAGTGACTG.<br />

ACAGT.....<br />

.CAGTG....<br />

..AGTGA...<br />

...GTGAC..<br />

....TGACT.<br />

.....GACTG<br />

----------<br />

ACAGTGACTG<br />

Figure 3: Reconstructing the sequence in case of no errors<br />
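
The relationship in Example 2 between a sequence, its ideal spectrum, and<br />
the length n + l − 1 can be checked with a short Python sketch. The function<br />
name ideal_spectrum is hypothetical and is used here only for illustration.<br />

```python
def ideal_spectrum(sequence, l):
    """Return every length-l substring of sequence, in order of occurrence."""
    return [sequence[i:i + l] for i in range(len(sequence) - l + 1)]


spectrum = ideal_spectrum("ACAGTGACTG", 5)
n, l = len(spectrum), 5
print(spectrum)    # the six fragments of Example 2
print(n + l - 1)   # length of the sequence spelled by an error-free spectrum
```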

The existence of errors in the input spectrum makes the problem of reconstructing<br />
the original sequence an NP-hard problem [4]. Missing fragments<br />
from the experiment turn the problem into the problem of finding the most<br />
likely sequence [4]. The most likely sequence is the shortest one containing<br />
almost all the fragments as substrings. Some fragments might be excluded<br />
from the final result. Those excluded fragments are the ones that represent<br />
the positive errors in the experiment. Also, fragments might not<br />
overlap completely. Under normal conditions, two consecutive fragments of length l<br />
overlap in l − 1 positions. However, because of negative errors, the longest<br />
overlap might be shorter than l − 1. Another source of difficulty exists when<br />
the spectrum contains repeated fragments. Most existing algorithms that<br />
allow for errors in the input spectrum put restrictions on the error model<br />
[5][11][12]. There are few algorithms that have no restriction on the input<br />
error model. Two such algorithms are in [1] and [7]. Our algorithm also puts<br />
no restrictions on the error model. We require only an upper bound for both<br />
the negative and positive errors in the spectrum. A comparison between our<br />
algorithm and those two algorithms is presented in Section 6.<br />

Example 3. Using the same DNA sequence as in Example 2, let us now assume<br />
there are errors in the input spectrum. Assume that fragment AGTGA<br />
is a negative error fragment, i.e., it is missing from the input spectrum S. Instead,<br />
fragment AGTCA appears in the set S as a positive error. That is, it<br />
is not part of the final optimal sequence. Then S = {ACAGT, CAGTG, GTGAC,<br />
TGACT, GACTG, AGTCA}. The optimal length is calculated using<br />
the same formula as for the ideal spectrum, except that the number of negative<br />
errors is added to the formula and the number of positive errors is subtracted<br />
from it. Then, the optimal length would be n + l − 1 + 1 − 1 = 6 + 5 − 1 + 1 − 1,<br />
or 10. An algorithm solving the DNA<br />
sequencing problem with errors should detect the positive error fragment and<br />
try not to include it in the final solution. One possible solution is shown in<br />
Figure 4.<br />

ACAGT.....<br />
.CAGTG....<br />
...GTGAC..<br />
....TGACT.<br />
.....GACTG<br />
----------<br />
ACAGTGACTG<br />

Figure 4: Reconstructing the sequence in case of errors in the spectrum<br />

3.3 Input and Output<br />

Algorithms solving the DNA Sequencing problem take as input the spectrum<br />
of all fragments. The output is a DNA sequence that is the most likely one<br />
that includes all fragments. In the case of an ideal spectrum, the result<br />
sequence would be of length n + l − 1, where n is the cardinality of the<br />
spectrum and l is the length of the fragments. However, because of errors, this<br />
may not always be the case. The algorithm presented here deals with errors,<br />
so the output sequence would not necessarily be of length n + l − 1. Negative<br />
errors may cause the output sequence to be shorter in length. Positive errors<br />
may cause it to be longer. In some other cases, those types of errors would<br />
mistakenly cause the algorithm to converge to a sequence that does not<br />
necessarily represent the optimal solution.<br />

3.4 Other Related Information<br />

The next few subsections cover additional topics that help the reader build<br />
basic background in computational biology in general and in DNA sequencing<br />
in particular.<br />

3.4.1 The Hamiltonian and Eulerian Paths<br />

The Hamiltonian and Eulerian path approaches are widely used methods to<br />
solve the DNA sequencing problem when the input spectrum has no errors.<br />
A Hamiltonian path in a graph is defined as a path that visits every vertex in<br />
the graph exactly once. To illustrate the similarity between finding a Hamiltonian<br />
path in a graph and finding the DNA sequence containing all fragments in S when<br />
there are no errors in the spectrum, let us define a graph G = (V,E), where<br />
V is the vertex set and E is the edge set, representing a spectrum S as<br />
follows. The vertices are elements of the spectrum S. An edge between two<br />
vertices exists if and only if the corresponding fragments have an overlap of<br />
length l − 1. A Hamiltonian path in G therefore lists the fragments in the order<br />
in which they occur in the sequence. Since the problem of finding a Hamiltonian<br />
path is known to be NP-hard, it is unlikely to admit polynomial time algorithms.<br />
Researchers have therefore tried to transform the problem into another problem<br />
that can be solved in polynomial time [6]. Pavel Pevzner proposed a transformation<br />
of the graph into a new graph where the problem is equivalent to finding an<br />
Eulerian path [23].<br />

An Eulerian path in a graph G is a path that traverses every edge of G exactly<br />
once. The idea is to reduce the fragment assembly problem to a variation of<br />
the classical Eulerian path problem. The Eulerian path problem differs from the<br />
Hamiltonian path problem in that there exist polynomial time algorithms for<br />
finding an Eulerian path in a graph if one exists. This approach proved to be<br />
very effective when assembling fragments that contain no errors [24]. In this<br />
paper, we use a similar approach in reconstructing newer fragments of longer<br />
length, as will be shown in Section 4.<br />
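
As an illustration of the Eulerian path idea for an error-free spectrum, the<br />
sketch below assumes the common construction in which every (l − 1)-character<br />
prefix or suffix is a vertex and every fragment is an edge from its prefix to its<br />
suffix. It uses Hierholzer's algorithm, which is a standard choice and not<br />
necessarily the method of [23]; the name eulerian_sequence is hypothetical.<br />

```python
from collections import defaultdict


def eulerian_sequence(spectrum):
    """Spell the sequence given by an Eulerian path over (l-1)-mer vertices."""
    successors = defaultdict(list)   # vertex -> vertices reached by one fragment
    balance = defaultdict(int)       # out-degree minus in-degree per vertex
    for fragment in spectrum:
        head, tail = fragment[:-1], fragment[1:]
        successors[head].append(tail)
        balance[head] += 1
        balance[tail] -= 1

    # An Eulerian path starts at the vertex with one surplus outgoing edge;
    # if every vertex is balanced (a circuit), any vertex with an edge will do.
    start = next((v for v, b in balance.items() if b == 1), spectrum[0][:-1])

    # Hierholzer's algorithm, iterative version.
    stack, path = [start], []
    while stack:
        vertex = stack[-1]
        if successors[vertex]:
            stack.append(successors[vertex].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(spectrum) + 1:
        raise ValueError("the spectrum does not admit an Eulerian path")
    return path[0] + "".join(v[-1] for v in path[1:])


fragments = ["GACTG", "TGACT", "GTGAC", "AGTGA", "CAGTG", "ACAGT"]
print(eulerian_sequence(fragments))   # ACAGTGACTG, the sequence of Example 2
```

The sketch only covers the ideal case; with positive or negative errors the<br />
graph generally no longer contains an Eulerian path through all fragments.<br />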

3.4.2 Sequence Alignment<br />

Once results are obtained from the DNA Sequencing algorithm, a reliable<br />
method for determining the quality of the solution obtained is needed. One<br />
way of doing this is by aligning the result sequence with the optimal sequence,<br />
i.e., sequence alignment. The idea of sequence alignment is very simple. First,<br />
it compares each character in the result sequence with the corresponding<br />
character in the optimal sequence, i.e., the sequence from which the spectrum<br />
was obtained. This seems to be an easy task, but the challenge is to do it<br />
efficiently and quickly. Aligning sequences of different lengths is an issue in<br />
the sequence alignment process.<br />
Second, a scheme for determining the quality of the alignment is needed.<br />
This is done by using a scoring system. The scoring system assigns a bonus<br />
or a match value if there is a match between a character in a sequence and the<br />
corresponding character in the other sequence. A penalty or mismatch value<br />
is added if there is a mismatch between a character and its corresponding<br />
character. If the corresponding character is a gap, then the gap cost is added.<br />
Gap insertions occur when inserting a space in the original sequence causes<br />
the cumulative score to increase. Gap extensions occur when one sequence<br />
is shorter than the other sequence. The shorter sequence is extended with<br />
gaps. Extension will not occur if it reduces the score of the alignment. As a<br />
result of the scoring system, each alignment gets a cumulative score which<br />
decreases in poorly matched regions and increases in highly matched<br />
regions.<br />

One algorithm that is worth mentioning in the area of sequence alignment<br />
is the Smith-Waterman algorithm [27]. The Smith-Waterman alignment<br />
algorithm uses dynamic programming. Dynamic programming<br />
views a problem as a set of sub-problems. It solves the smaller sub-problems<br />
first and uses their results to solve larger sub-problems until the solution to<br />
the original problem is obtained. Figure 5 shows the recurrence formulation<br />
of dynamic programming for sequence alignment.<br />

\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j), \\
F(i-1,\,j) + d, \\
F(i,\,j-1) + d.
\end{cases}
\]

Figure 5: Recurrence Formula for Sequence Alignment<br />

This recurrence relates the two sequences to be aligned. F(i,j) is<br />
defined as the score of aligning the first i characters from one sequence with<br />
the first j characters from the other sequence. The initial values are trivial and<br />
can be determined quickly. The recurrence equation is applied repeatedly<br />
to fill the matrix of F(i,j). Each F(i,j) is the maximum of three candidates<br />
derived from values that are already computed: F(i − 1,j − 1), F(i − 1,j), and<br />
F(i,j − 1). Here s(x_i, y_j) is the score for aligning character x_i with<br />
character y_j, and d is the penalty for gap insertion or<br />
extension. Once the matrix is built, the result alignment can be reconstructed<br />
from it in reverse order, using an additional data structure to store the path<br />
that was used when building the matrix.<br />

The main characteristics of the Smith-Waterman algorithm are as follows. The<br />
result alignment can start and end anywhere in the original sequence.<br />
That is, the new aligned sequence can start or end with either the start or end<br />
character from the original sequence or a gap, i.e., it can begin and end internally.<br />
This feature is important especially when the start and end points<br />
of the sequence are not known. The algorithm produces an optimal local<br />
alignment with the highest score. The next example will clarify how the<br />
Smith-Waterman sequence alignment algorithm works.<br />

Example 4. Consider the following two sequences, s and t, where s = TTCC<br />
and t = AATT. The dynamic programming matrix representing the alignment of<br />
those two sequences is shown in Table 1. The matrix was obtained using the<br />
recurrence formula given in Figure 5, where s(x_i, y_j) is 0 if x_i matches y_j,<br />
and 1 otherwise. The gap insertion or extension penalty, d, is 1. In this case, the<br />
matrix represents the penalty for aligning those two sequences, so the minimum<br />
rather than the maximum of the three terms is taken for each entry. To obtain<br />
the aligned sequences, one should start from the bottom right corner of the<br />
matrix and follow the path until it reaches the top left corner. From<br />
Table 1, one possible alignment of those two sequences uses<br />
four replacements. Another possible alignment uses two insertions and two<br />
deletions. Both have a total penalty of 4, the value in the bottom right corner.<br />

Table 1: The Dynamic Programming Matrix representation for Example 4<br />

        A   A   T   T<br />
    0   1   2   3   4<br />
T   1   1   2   2   3<br />
T   2   2   2   2   2<br />
C   3   3   3   3   3<br />
C   4   4   4   4   4<br />

s=TTCC<br />
t=AATT<br />
... with four proper replaces<br />

s=--TTCC<br />
t=AATT--<br />
... with two inserts and two deletes
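
The matrix of Example 4 can be reproduced with the following sketch. It<br />
fills the table with the penalty form of the recurrence used in that example<br />
(match 0, mismatch 1, gap 1, minimum of the three terms), which is a global,<br />
cost-minimizing variant and not a full Smith-Waterman implementation. The<br />
function name alignment_cost_matrix is hypothetical.<br />

```python
def alignment_cost_matrix(s, t, mismatch=1, gap=1):
    """Fill F so that F[i][j] is the penalty of aligning s[:i] with t[:j]."""
    rows, cols = len(s) + 1, len(t) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap            # s[:i] aligned against gaps only
    for j in range(1, cols):
        F[0][j] = j * gap            # t[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = min(F[i - 1][j - 1] + substitution,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    return F


for row in alignment_cost_matrix("TTCC", "AATT"):
    print(row)                       # the five rows of Table 1
```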


4 ALGORITHM 17<br />

4 Algorithm<br />

In this section we describe a genetic algorithm for solving the DNA sequencing<br />
problem when the input may have both positive and negative errors. We<br />
do not require that the starting fragment of the sequence be known, as is<br />
done in [6]. We use a steady-state genetic algorithm, together with a local<br />
optimization procedure, to help improve the performance of the algorithm.<br />
Additionally, we have a preprocessing step that improves the algorithm even<br />
further. The overall algorithm is given in Figure 6. In the following subsections<br />
we give more details of the algorithm.<br />

Sequence(S) // S is a spectrum<br />
    preprocess(S)<br />
    generate a random initial population P<br />
    for each a ∈ P<br />
        LocalOptimize(a)<br />
    endfor<br />
    repeat<br />
        Select two parents p1 and p2<br />
        u ← crossover(p1, p2)<br />
        LocalOptimize(u)<br />
        mutate(u)<br />
        replace(u, p1, p2, P)<br />
    until (there is no improvement)<br />
    return the best member of P<br />
    Align output sequence;<br />

Figure 6: The Enhanced Genetic Algorithm for DNA sequencing<br />

4.1 Preprocessing<br />

In general, the fewer and longer the fragments in the spectrum,<br />
the easier the problem is. The idea of preprocessing is to merge certain<br />
fragments together, thereby creating a new spectrum that has fewer<br />
and longer fragments. In this step we create long chains of fragments of the<br />
form F_1 ... F_k, where the F_i are fragments, and the last l − 1 elements of F_i<br />
match the first l − 1 elements of F_(i+1). Our objective is to make k as large as<br />
possible. Each such chain of fragments is merged into one fragment in the<br />
new spectrum. The algorithm then works with this spectrum, which has<br />
variable-length fragments and a smaller number of fragments than the original<br />
spectrum.<br />

The preprocessing algorithm creates a chain by selecting an unused fragment<br />
in the spectrum and adding it to the chain. The fragment is then<br />
marked used. The algorithm extends the chain by selecting an unused fragment<br />
that has an overlap of l − 1 with the last fragment in the chain. If<br />
such a fragment exists, it is added to the chain and marked used, and the<br />
process is repeated. If there is no such fragment, the chain is terminated. The<br />
algorithm then starts a new chain. The algorithm terminates when all fragments<br />
in the original spectrum have been used. The algorithm then merges<br />
the fragments in each chain to create a fragment for the new spectrum. The<br />
algorithm is basically trying to perform a modified topological sort on the<br />
graph induced by the input spectrum. In this graph, the fragments are the<br />
vertices and an edge exists between two fragments if and only if they overlap<br />
in l − 1 positions. This algorithm can be efficiently implemented using<br />
a dynamic programming technique. The algorithm for the preprocessing step<br />
is shown in Figure 7.<br />

Preprocessing(S) // S is a spectrum<br />
    loop<br />
        Start a new chain C with an unused fragment and mark it as used<br />
        repeat<br />
            for each fragment f ∈ S<br />
                if (f is not used) and (right(C, l − 1) = left(f, l − 1)) then<br />
                    Append f to the end of C<br />
                    Mark f as used<br />
                endif<br />
            endfor<br />
        until no fragment was appended in the last pass<br />
        exit when all fragments are used<br />
    endloop<br />

Figure 7: The Preprocessing Algorithm.<br />

Example 5. Suppose we have a DNA sequence CTAGACGTTC of length<br />
10. An ideal spectrum would consist of the following six fragments: CTAGA,<br />
TAGAC, AGACG, GACGT, ACGTT, and CGTTC, where we have assumed<br />
that the fragment length is 5. However, because of errors from the hybridization<br />
experiment, an input spectrum in this case may consist of the following<br />
fragments: {CTAGA, TAGAC, AGACG, TATCC, ACGTT, CGTTC},<br />
where the cardinality of the spectrum is 6. This spectrum differs from the<br />
ideal spectrum in that it does not contain the fragment GACGT (a negative<br />
error); instead, it contains the fragment TATCC, which is not a substring<br />
of the original DNA sequence (a positive error). Thus, this spectrum has<br />
one negative error and one positive error. Using this spectrum as the input,<br />
the preprocessing algorithm would produce the chains [CTAGA,<br />
TAGAC, AGACG], [TATCC], and [ACGTT, CGTTC], which yield the spectrum<br />
consisting of the three fragments f_1, f_2, and f_3, where f_1 = CTAGACG,<br />
f_2 = TATCC, and f_3 = ACGTTC. The genetic algorithm then takes f_1, f_2, and<br />
f_3 as input instead of the original spectrum, thus improving the probability<br />
of finding the optimal answer and improving the running time by working with<br />
fewer fragments.<br />
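
A minimal sketch of the chain-building step, checked against Example 5, is<br />
given below. It is a plain greedy scan that only extends a chain to the right<br />
and takes fragments in input order; the function name preprocess and this<br />
scanning order are assumptions for illustration, not the dynamic programming<br />
implementation mentioned above.<br />

```python
def preprocess(spectrum, l):
    """Merge fragments into chains whose neighbours overlap in l - 1 characters."""
    used = [False] * len(spectrum)
    merged = []
    for start, fragment in enumerate(spectrum):
        if used[start]:
            continue
        used[start] = True
        chain = fragment                     # merged text of the current chain
        extended = True
        while extended:                      # keep extending until nothing fits
            extended = False
            for i, candidate in enumerate(spectrum):
                if not used[i] and chain[-(l - 1):] == candidate[:l - 1]:
                    chain += candidate[l - 1:]   # append only the new character(s)
                    used[i] = True
                    extended = True
                    break
        merged.append(chain)
    return merged


spectrum = ["CTAGA", "TAGAC", "AGACG", "TATCC", "ACGTT", "CGTTC"]
print(preprocess(spectrum, 5))   # ['CTAGACG', 'TATCC', 'ACGTTC']
```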

4.2 Encoding and Initialization<br />

After the preprocessing step is completed, we work only with the new spectrum<br />
from the preprocessing step, which has lower cardinality and longer<br />
fragments of variable length. We assume that fragments in this spectrum<br />
are indexed in some order. Each member of the population is a vector of<br />
fragment indexes representing a possible solution sequence to the problem.<br />
Given a vector u[1...m] of fragment indexes, the corresponding sequence is<br />
obtained by merging the fragments F_u[1], F_u[2], ..., F_u[m] in that order, where<br />
F_i is the ith fragment in the spectrum. Here, two adjacent fragments are<br />
put together by overlapping them as much as possible. We also maintain the<br />
constraint that each fragment index appears at most once in the encoding<br />
of a sequence. An algorithm to repair the chromosome is given in Figure 8.<br />
The vectors in the population can be of variable length. In what follows, we<br />
refer to each member of the population as a vector or a sequence.<br />

An initial population of size 120 is generated at random. The size of the<br />

population remains constant throughout the algorithm. All the vectors in the<br />

initial population have the same size, i.e., each vector has the same number<br />

of fragment indexes. However, since the fragments are of variable length after<br />

the preprocessing step, the corresponding sequences have different lengths.<br />

Example 6. Suppose the spectrum obtained after preprocessing contains<br />

the fragments CTAGACG, TATCC, ACGTT, and the fragments are indexed<br />

in that order from 1 to 3. Then, the vector (1, 3) yields the sequence<br />

CTAGACGTT, and the vector (2, 1) yields the sequence TATCCTAGACG.<br />
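
The decoding of a vector into a sequence, as in Example 6, can be sketched as<br />
follows. The helper names overlap and decode are hypothetical, and the sketch<br />
assumes that the overlap is always measured between the two adjacent fragments<br />
themselves.<br />

```python
def overlap(left, right):
    """Length of the longest suffix of left that is also a prefix of right."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def decode(vector, fragments):
    """Merge the fragments named by the 1-based indexes in vector, in order."""
    chosen = [fragments[i - 1] for i in vector]
    sequence = chosen[0]
    for previous, current in zip(chosen, chosen[1:]):
        sequence += current[overlap(previous, current):]
    return sequence


fragments = ["CTAGACG", "TATCC", "ACGTT"]
print(decode((1, 3), fragments))   # CTAGACGTT
print(decode((2, 1), fragments))   # TATCCTAGACG
```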

4.3 Fitness<br />

The fitness of each sequence is calculated based on two factors: (i) the amount<br />

of overlap between adjacent fragments in the sequence, and (ii) the length<br />

of the sequence. The idea is that more overlap between adjacent fragments<br />

results in a shorter sequence. Also, if the length of the sequence is equal to<br />

n + l − 1, where n is the cardinality of the spectrum and l is the length of<br />

each fragment, a bonus value is added to the value of the fitness. Note that<br />

n + l − 1 is the optimal length for a sequence that includes all fragments in the<br />
spectrum. More formally, let u[1...k] be a vector representing a sequence U<br />
in the population. The fitness of U is defined as follows.<br />

\[
f(U) = c \times \sum_{i=1}^{k-1} \left| F_{u[i]} \cap F_{u[i+1]} \right| + z
\]

where z is the bonus value defined by<br />

\[
z =
\begin{cases}
s, & \text{if } |U| = n + l - 1, \\
s \,/\, \bigl|\, |U| - (n + l - 1) \,\bigr|, & \text{otherwise.}
\end{cases}
\]



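Here |F_u[i] ∩ F_u[i+1]| denotes the length of the overlap between the adjacent<br />
fragments F_u[i] and F_u[i+1], and |U| denotes the length of the sequence U.<br />
For our experiment, c was set to 10 and s was 100. The fitness can be<br />
computed efficiently using a dynamic programming technique.<br />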

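Example 7. Consider the previous example, with n + l − 1 = 10. The vector<br />
(1, 3) has an overlap of 3 and yields a sequence of length 9; the vector (2, 1)<br />
has an overlap of 1 and yields a sequence of length 11. Then, using the formula for<br />
calculating the fitness would result in the following:<br />

• Fitness(chromosome (1, 3)) = [10 × 3] + 100/|9 − 10| = 130<br />

• Fitness(chromosome (2, 1)) = [10 × 1] + 100/|11 − 10| = 110<br />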
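
The fitness values of Example 7 can be reproduced with the sketch below. It<br />
follows the fitness formula with c = 10 and s = 100; the helper names and the<br />
choice to pass n and l as arguments are assumptions made for illustration.<br />

```python
def overlap(left, right):
    """Length of the longest suffix of left that is also a prefix of right."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def fitness(vector, fragments, n, l, c=10, s=100):
    """c times the summed overlap of adjacent fragments, plus the length bonus z."""
    chosen = [fragments[i - 1] for i in vector]      # 1-based fragment indexes
    total_overlap = 0
    sequence = chosen[0]
    for previous, current in zip(chosen, chosen[1:]):
        k = overlap(previous, current)
        total_overlap += k
        sequence += current[k:]
    deviation = abs(len(sequence) - (n + l - 1))
    bonus = s if deviation == 0 else s / deviation
    return c * total_overlap + bonus


fragments = ["CTAGACG", "TATCC", "ACGTT"]
print(fitness((1, 3), fragments, n=6, l=5))   # 130.0
print(fitness((2, 1), fragments, n=6, l=5))   # 110.0
```

Here n = 6 and l = 5 are taken from the original spectrum of Example 5, which<br />
gives the target length of 10 used in Example 7.<br />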

4.4 Parent Selection<br />

The parents are selected using the standard proportional selection method<br />

where sequences that have higher fitness have a better chance of being selected.<br />

The standard roulette wheel scheme is used in our algorithm [14].<br />
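
Roulette wheel selection can be sketched as follows: each member receives a<br />
slice of the wheel proportional to its fitness, and a random point on the wheel<br />
picks the parent. The function name roulette_select is hypothetical, and the<br />
sketch assumes non-negative fitness values with a positive total.<br />

```python
import random


def roulette_select(fitnesses, rng):
    """Return an index chosen with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.random() * total          # a point on the wheel, in [0, total)
    running = 0.0
    for index, value in enumerate(fitnesses):
        running += value
        if pick < running:
            return index
    return len(fitnesses) - 1            # guard against rounding at the far edge


rng = random.Random(2004)                # fixed seed only to make the demo repeatable
population_fitness = [130.0, 110.0, 40.0]
parents = [roulette_select(population_fitness, rng) for _ in range(2)]
print(parents)                           # two indexes into the population
```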

4.5 Crossover<br />

An offspring is constructed by selecting alternately from each parent after k<br />

cutpoints have been determined. Note that the members of the population<br />

are vectors of fragment indexes. This process may create offspring that contain<br />

duplicated fragment indexes. A repair algorithm, shown in Figure 8, is<br />

used to get rid of any repeated fragments and to ensure all fragments are represented<br />

within the offspring. The repair algorithm works by replacing the<br />

repeated fragments with fragments from the spectrum that are not currently<br />

in use by the sequence. We used a 3-point crossover in our testing.<br />
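
A k-point crossover on vectors of fragment indexes can be sketched as below.<br />
The sketch assumes parents of equal length and cutpoints given as sorted<br />
positions; both function names are hypothetical. The example output contains<br />
duplicated indexes, which is the situation the repair algorithm handles.<br />

```python
import random


def k_point_crossover(parent1, parent2, cutpoints):
    """Copy segments alternately from the two parents, switching at each cutpoint."""
    parents = (parent1, parent2)
    child, start = [], 0
    for turn, cut in enumerate(list(cutpoints) + [len(parent1)]):
        child.extend(parents[turn % 2][start:cut])
        start = cut
    return child


def random_cutpoints(length, k, rng):
    """Choose k distinct, sorted cutpoints strictly inside a vector of this length."""
    return sorted(rng.sample(range(1, length), k))


p1, p2 = [1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]
print(k_point_crossover(p1, p2, [1, 3, 5]))   # [1, 5, 4, 4, 5, 1]
print(k_point_crossover(p1, p2, random_cutpoints(len(p1), 3, random.Random(0))))
```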

4.6 Mutation<br />

A sequence is mutated with a pre-determined probability. The mutation<br />

is done by randomly selecting two fragments in the sequence and swapping<br />

them. In our experiments, the mutation probability is set at 10%. We also<br />

used other values for testing the algorithm.<br />
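
The swap mutation described above can be sketched in a few lines. The function<br />
name mutate and the use of a seeded random.Random object are assumptions for<br />
illustration; the default probability of 0.10 is the value reported above.<br />

```python
import random


def mutate(vector, rng, probability=0.10):
    """With the given probability, swap two randomly chosen positions."""
    mutated = list(vector)
    if len(mutated) >= 2 and rng.random() < probability:
        i, j = rng.sample(range(len(mutated)), 2)    # two distinct positions
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


rng = random.Random(7)
print(mutate([1, 2, 3, 4, 5], rng, probability=1.0))   # same indexes, two swapped
```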



Repair(C) // C is a chromosome<br />
    Create an array a with size |C|<br />
    Create two empty queues Q1 and Q2<br />
    for each fragment f ∈ C<br />
        Check the current place p for fragment in array a<br />
        if a[p] is marked then<br />
            add p to Q1[top]<br />
        else<br />
            mark a[p] as used<br />
        endif<br />
    endfor<br />
    for all fragments f ∈ a where f is not used<br />
        Q2[top] ← f<br />
    endfor<br />
    update C: replace fragments f_i ∈ (Q1) with fragments f_j ∈ (Q2)<br />

Figure 8: The Algorithm to repair the Chromosome<br />
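
One way to realize the repair step of Figure 8 is sketched below. It assumes<br />
fragment indexes run from 1 to num_fragments (a hypothetical parameter) and<br />
that a duplicate with no unused fragment left to replace it is simply dropped;<br />
the second assumption is not stated in the figure.<br />

```python
from collections import deque


def repair(chromosome, num_fragments):
    """Replace repeated fragment indexes with indexes not yet in the chromosome."""
    seen = set()
    duplicate_positions = deque()            # plays the role of Q1
    for position, index in enumerate(chromosome):
        if index in seen:
            duplicate_positions.append(position)
        else:
            seen.add(index)
    unused = deque(i for i in range(1, num_fragments + 1) if i not in seen)  # Q2

    repaired = list(chromosome)
    while duplicate_positions and unused:
        repaired[duplicate_positions.popleft()] = unused.popleft()

    leftover = set(duplicate_positions)      # duplicates that could not be replaced
    return [index for position, index in enumerate(repaired)
            if position not in leftover]


print(repair([1, 5, 4, 4, 5, 1], num_fragments=6))   # [1, 5, 4, 2, 3, 6]
```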

4.7 Local Optimization<br />

The local optimization algorithm has two steps. The first step is to scan the<br />

sequence sequentially and identify a pair of adjacent fragments, say x and<br />

y, with the smallest overlap. Then, we find the fragment, say z, which has<br />

the highest overlap with x. Then, we replace y with z. The vector is then<br />

repaired, if needed, to eliminate duplicated fragments.<br />

The second step is to rearrange the fragments in the sequence in the<br />
hope of improving its fitness value. This is done by first finding the two<br />
pairs of adjacent fragments that have the two smallest overlaps. Let s,t<br />
be the first pair and x,y be the second pair. That is, assume that the<br />
vector u = (a,...,s,t,...,x,y,...,z). We then construct a new vector u′ by<br />
swapping the segment of fragments from t through x with the segment from y<br />
to the end of u. Thus, u′ = (a,...,s,y,...,z,t,...,x). If the fitness of u′ is better<br />
than that of u, we replace u by u′. Otherwise, we keep u and discard u′.<br />
By using a dynamic programming technique, the local optimization algorithm<br />
and the repair algorithm can be efficiently implemented.<br />

LocalOptimize(C) // C is a chromosome<br />
    for each fragment f ∈ C<br />
        find two fragments, x and y, with min intersection between them<br />
        scan matrix M<br />
        find fragment z with max intersection with x<br />
        replace y with z<br />
        repair(C)<br />
    endfor<br />
    for each fragment f ∈ C<br />
        C1 ← C<br />
        find two fragments, x and y, with smallest intersection between them<br />
        find another two fragments, s and t, with second smallest<br />
            intersection between them<br />
        C2 ← swap([t ... x],[y ... end])<br />
        if fitness(C2) > fitness(C1)<br />
            return (C2)<br />
        else<br />
            return (C1)<br />
        endif<br />
    endfor<br />

Figure 9: The Algorithm for the Local Optimization Process<br />
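
The second step of the local optimization, the segment swap, can be sketched<br />
as follows. The sketch uses the summed overlap as a stand-in fitness so that it<br />
stays self-contained; the function names and the example fragments are<br />
hypothetical and are not taken from the experiments.<br />

```python
def overlap(left, right):
    """Length of the longest suffix of left that is also a prefix of right."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def total_overlap(vector, fragments):
    """Stand-in fitness: summed overlap of adjacent fragments (1-based indexes)."""
    return sum(overlap(fragments[a - 1], fragments[b - 1])
               for a, b in zip(vector, vector[1:]))


def rearrange(vector, fragments, fitness=total_overlap):
    """Swap the segment t..x with the tail y..end if that improves the fitness."""
    vector = list(vector)
    if len(vector) < 3:                      # fewer than two junctions: nothing to do
        return vector
    junctions = [overlap(fragments[a - 1], fragments[b - 1])
                 for a, b in zip(vector, vector[1:])]
    weakest = sorted(range(len(junctions)), key=junctions.__getitem__)[:2]
    p, q = sorted(weakest)                   # p is the position of s, q of x
    candidate = vector[:p + 1] + vector[q + 1:] + vector[p + 1:q + 1]
    if fitness(candidate, fragments) > fitness(vector, fragments):
        return candidate
    return vector


fragments = ["AACCG", "CCGTT", "GTTAC"]
print(rearrange([1, 3, 2], fragments))       # [1, 2, 3]
```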

4.8 Replacement Scheme<br />

The following replacement scheme is used. If the fitness of the new offspring<br />

is larger than the fitness of the poorer of the two parents, then we replace<br />

that parent with the new offspring. Otherwise, we discard the new offspring.<br />
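
The replacement rule can be written as a small function. This is a sketch<br />
under the assumption that the population is a list and the parents are<br />
identified by their positions in it; the name replace_worse_parent is<br />
hypothetical.<br />

```python
def replace_worse_parent(population, fitness, child, i1, i2):
    """Replace the less fit of the two parents if the child is fitter than it."""
    worse = i1 if fitness(population[i1]) <= fitness(population[i2]) else i2
    if fitness(child) > fitness(population[worse]):
        population[worse] = child
        return True                  # the child entered the population
    return False                     # the child was discarded


population = [[1], [5], [3]]
print(replace_worse_parent(population, sum, [4], 0, 1), population)
```

Because a parent is only ever replaced by a fitter offspring, the total fitness<br />
of the population never decreases under this rule.<br />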

4.9 Stopping Condition<br />

The algorithm terminates if there is no improvement in the total fitness of the<br />
population in 400 consecutive generations, or if the number of generations<br />
exceeds 50,000. These values were chosen because they<br />
gave the best compromise between the quality of the solution and the<br />
performance of the algorithm.<br />
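
The two stopping conditions can be tracked with a small helper such as the one<br />
sketched below. The class name StoppingRule and its interface are hypothetical;<br />
only the default limits of 400 and 50,000 come from the text.<br />

```python
class StoppingRule:
    """Stop after `patience` generations without improvement or a generation cap."""

    def __init__(self, patience=400, max_generations=50_000):
        self.patience = patience
        self.max_generations = max_generations
        self.generation = 0
        self.stale = 0                       # generations since the last improvement
        self.best_total = float("-inf")

    def should_stop(self, total_fitness):
        """Record one generation and report whether the run should end."""
        self.generation += 1
        if total_fitness > self.best_total:
            self.best_total = total_fitness
            self.stale = 0
        else:
            self.stale += 1
        return (self.stale >= self.patience
                or self.generation > self.max_generations)


rule = StoppingRule(patience=3, max_generations=100)
print([rule.should_stop(total) for total in [1, 2, 2, 2, 2]])
# [False, False, False, False, True]
```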

5 Standardizing the Data<br />

Data play an important role in the computational biology field. Most algorithms<br />
in this field deal with large amounts of data. So, obtaining the right<br />
set of data is crucial to developing good algorithms. The data have to be<br />
diverse and have different characteristics. In the area of DNA sequencing,<br />
we identified the following important characteristics when obtaining the test<br />
data. All data sets should come from different genomes. Each genome should<br />
have its own features. Spectra used in testing should have a variety of error<br />
percentages, both negative and positive, as well as different fragment<br />
lengths.<br />

In order to obtain data with the characteristics mentioned above, many<br />

steps have to be taken. The first step is to obtain genomes that have those<br />

characteristics. We selected three different genomes from the GenBank [13].<br />

Table 2 shows the details of the genomes obtained and used in testing our<br />

new algorithm.<br />

Table 2: Genomes obtained from the GenBank<br />

Sequence                                                    Length (BP)<br />
Human immunodeficiency virus 2 (HIV)                        10,359<br />
Drosophila melanogaster DNA sequence of white locus (Fly)   14,245<br />
Canis familiaris clone RP81-60B6 (Complete Dog genome)      165,116<br />

The second step is to develop an algorithm that simulates the hybridization<br />
experiment and generates spectra. It generates spectra in two<br />
different ways. The first is to simulate the hybridization experiment<br />
by generating a spectrum with a specific upper bound on the error<br />
percentages. The second is to generate a complete set of data with<br />
different parameters, such as different fragment lengths, different<br />
error percentages, and different sequence lengths. Using these variations<br />
of test data in testing our new enhanced genetic algorithm ensured that<br />
it works well for different genomes with different characteristics and with data<br />
that are more practical and more diverse. The spectrum generator algorithm<br />
is briefly described in Figure 10.<br />

Generate(Genome)<br />
    Read length of output spectrum<br />
    for each error combination e1 in {0,5,8,10,20}<br />
        for each fragment length l2 in {10,20,50}<br />
            for i = 1...10<br />
                Generate output file name<br />
                Randomly select a start point in the input sequence<br />
                Generate all fragments of length l2 in the spectrum<br />
                Introduce e1% positive errors<br />
                Introduce e1% negative errors<br />
            endfor<br />
        endfor<br />
    endfor<br />

Figure 10: The Spectrum Generator Algorithm.<br />

The spectrum generator algorithm first determines the length of the sequence<br />
to be generated. Then, using the Mersenne Twister (MT) random<br />
number generator [20], it picks a random starting point and then reads a<br />
sequence of the desired length. It ensures that all generated sequences are<br />
different. The second step is to generate fragments with zero positive and<br />
negative errors. It then generates data with 5%, 8%, 10%, and 20% errors<br />
for negative, positive, and both types of errors. Table 3 shows the different<br />
combinations of errors generated using the spectrum generator algorithm. The<br />
total number of error combinations generated by the algorithm is 13. The<br />
algorithm also generates data with different fragment lengths to study the<br />
effect of using longer fragments on the quality of the output solution. The<br />
data generator is able to generate fragments of any length and size. We<br />
tested the genetic algorithm with fragments of length 10, 20, and 50 nucleotides.<br />
For each combination mentioned above, 10 different sequences, obtained from<br />
different positions within the parent genome, have been generated.<br />

Table 3: Positive and Negative error combinations used in testing the algorithm<br />

         +0%    +5%    +8%    +10%   +20%<br />
-0%       x      x      x      x      x<br />
-5%       x      x<br />
-8%       x             x<br />
-10%      x                    x<br />
-20%      x                           x<br />

The technique used in generating fragments with errors divides the process<br />
into two steps. The first step is to produce the positive errors. The<br />
second step is to produce the negative errors. The numbers of positive and<br />
negative error fragments are calculated based on the percentage of errors<br />
in the output fragments. For positive errors, a random number that represents<br />
a fragment index in the fragment array is selected. Then, a random<br />
position within that fragment is selected, at which the correct nucleotide is<br />
replaced with another, randomly chosen nucleotide to introduce the positive error.<br />
The negative error is much simpler because only a random fragment index is<br />
selected and then the fragment is removed from the final spectrum.<br />
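
The error injection described above can be sketched as follows. Several details<br />
are assumptions made for illustration: the function name noisy_spectrum, the<br />
choice to add each mutated fragment as an extra element rather than overwrite<br />
the fragment it was derived from, and the rejection of mutated fragments that<br />
happen to be genuine. Python's random module is itself based on the Mersenne<br />
Twister generator.<br />

```python
import random


def noisy_spectrum(sequence, l, positive_pct, negative_pct, rng):
    """Return a spectrum of sequence with the requested share of errors."""
    ideal = sorted({sequence[i:i + l] for i in range(len(sequence) - l + 1)})
    ideal_set = set(ideal)
    n_positive = len(ideal) * positive_pct // 100
    n_negative = len(ideal) * negative_pct // 100

    # Positive errors: copy a random fragment and change one random position.
    false_fragments = set()
    while len(false_fragments) < n_positive:
        fragment = rng.choice(ideal)
        position = rng.randrange(l)
        base = rng.choice([b for b in "ACGT" if b != fragment[position]])
        mutated = fragment[:position] + base + fragment[position + 1:]
        if mutated not in ideal_set:         # keep only fragments that are truly false
            false_fragments.add(mutated)

    # Negative errors: remove randomly chosen genuine fragments.
    removed = set(rng.sample(ideal, n_negative))
    return (ideal_set - removed) | false_fragments


rng = random.Random(2004)
spectrum = noisy_spectrum("CTAGACGTTCACAGTGACTGGATCAATG", 5, 20, 20, rng)
print(len(spectrum))
```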

Another source of data that we used when testing our new enhanced<br />
algorithm is from [7]. This data set has the following characteristics. It<br />
has a fragment length of 10, 20% negative errors, and 20% positive errors.<br />
Usually, in the hybridization experiment, the practical percentages of errors<br />
are in the range from 1% to 3% [23]. Nonetheless, our algorithm performed<br />
very well when tested using this set of data, as will be seen in the next section.<br />


6 EXPERIMENTAL RESULTS 27<br />

6 Experimental Results<br />

In this section we first describe the performance of our algorithm in comparison<br />
with some existing algorithms for the DNA sequencing problem. We<br />
then show the result of testing our algorithm on an extensive set of data that<br />
we generated as mentioned before. Our algorithm was implemented in C++<br />

and was run on a PC with Pentium IV 2.4GHz Intel processor with 512MB<br />

of RAM.<br />

Our first set of test data is from [7][18]. We used it to compare our algorithm<br />

to the Tabu Search algorithm in [1] and the Hybrid Genetic Algorithm<br />
in [7]. The data from this set consist of spectra having 100, 200, 300, 400,<br />
and 500 fragments. There are 40 spectra for each size, for a total of 200<br />

instances. The fragment length in all of these instances is 10. Each instance<br />

has 20% positive errors and 20% negative errors.<br />

Figure 11: Comparison between our algorithm and others, in panels (a), (b), and (c)<br />

To determine the quality of our solutions, we follow [7] and use the classical<br />

pairwise Smith-Waterman sequence alignment algorithm described previously<br />

to compare the output of the algorithm with the original sequences<br />

from which the spectra were generated. We use two values from the output<br />

of the Smith-Waterman algorithm: the match percentage and the similarity<br />

score. In addition, as in [7], for each instance tested we include the number<br />

of times the algorithm finds the optimal answer. This number is called<br />

the optimum number. More formally, the optimum number is the number<br />

of times the algorithm is able to reach the optimal answer within the input<br />

data set. Table 4 and Figure 11 summarize the results of the comparison<br />

and show that our algorithm performs significantly better than the other two<br />

algorithms as the sequence length gets longer. More details can be seen in<br />

Figures 11(a), (b), and (c) (see footnote 1).<br />
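
For readers unfamiliar with the similarity score, the following is a minimal sketch of the Smith-Waterman local alignment score. It is not the implementation from [19]; the scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for illustration, since the parameters used in the experiments are not restated here.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b.

    Standard dynamic program: each cell holds the best score of a local
    alignment ending at that pair of positions, floored at zero so that
    an alignment may start anywhere. Only two rows are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

if __name__ == "__main__":
    original = "ACGTACGT"
    reconstructed = "ACGTTACGT"  # one extra base inserted
    print(smith_waterman_score(original, reconstructed))  # 7
```

Under these assumed scores, a perfect reconstruction of a sequence of length n scores n, and each inserted, deleted, or substituted base lowers the score.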

Even though the running times are available for all algorithms, we cannot<br />

compare them directly, since different machines were used. For our<br />

algorithm, the average running time ranged from 0.6 seconds for the smallest<br />

problem size to 15.1 seconds for the largest problem size tested. The Hybrid<br />

GA and Tabu Search algorithms were run on a PC with a Pentium II 300MHz<br />

processor and 256MB of RAM. Their average running times range from 13.5<br />

seconds to 437.9 seconds for the Hybrid GA and from 14.1 seconds to 471.5<br />

seconds for the Tabu Search algorithm. More details can be found in Table 4.<br />

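Table 4: Summary of results from our algorithm and others<br />

Algorithm (machine) | Measure | 100 | 200 | 300 | 400 | 500<br />
Enhanced GA (Pentium IV, 2.4GHz, 512MB RAM) | Average Similarity Score (pt) | 106.7 | 200.6 | 291.6 | 352.2 | 451.6<br />
Enhanced GA (Pentium IV, 2.4GHz, 512MB RAM) | Average Similarity Score (%) | 98.9 | 97.6 | 96.7 | 92.9 | 92.0<br />
Enhanced GA (Pentium IV, 2.4GHz, 512MB RAM) | Running time (sec) | 0.6 | 1.5 | 3.6 | 8.6 | 15.1<br />
Enhanced GA (Pentium IV, 2.4GHz, 512MB RAM) | Optimum no. | 29 | 26 | 22 | 13 | 13<br />
HGA (Pentium II, 300MHz, 256MB RAM) | Average Similarity Score (pt) | 108.4 | 199.3 | 274.1 | 301.7 | 326.0<br />
HGA (Pentium II, 300MHz, 256MB RAM) | Average Similarity Score (%) | 99.7 | 97.7 | 94.3 | 86.9 | 82.0<br />
HGA (Pentium II, 300MHz, 256MB RAM) | Running time (sec) | 13.5 | 63.4 | 154.9 | 263.4 | 437.9<br />
HGA (Pentium II, 300MHz, 256MB RAM) | Optimum no. | 40 | 31 | 20 | 9 | 5<br />
Tabu Search (Pentium II, 300MHz, 256MB RAM) | Average Similarity Score (pt) | 108.4 | 184.1 | 196.6 | 229.5 | 235.1<br />
Tabu Search (Pentium II, 300MHz, 256MB RAM) | Average Similarity Score (%) | 99.7 | 94.0 | 81.8 | 78.1 | 73.1<br />
Tabu Search (Pentium II, 300MHz, 256MB RAM) | Running time (sec) | 14.1 | 60.8 | 177.7 | 258.3 | 471.5<br />
Tabu Search (Pentium II, 300MHz, 256MB RAM) | Optimum no. | 40 | 24 | 11 | 6 | 2<br />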

1 The algorithms included in the comparison in Figure 11 are the following: our new Enhanced GA,<br />

the Hybrid GA, and the Tabu Search. Figure 11(a) shows the Match Percentage, Figure 11(b) shows the<br />

Optimum Number, and Figure 11(c) shows the Similarity Score.



The second set of test data was generated by the spectrum generator<br />

algorithm using the three genomes obtained from GenBank [13] that are<br />

referenced in Table 2. Table 3 lists all spectrum error combinations, positive<br />

and negative, that were used to test the algorithm. For each of the three<br />

genomes we tested the GA using 10 different sequences of each length in the<br />

set {100, 200, 300, 400, 500, 1000, 2000}. That is, for each genome we tested<br />

the algorithm using 70 different sequences. From each of these sequences we<br />

used spectra with fragments of length 10 and 20. For each fragment length,<br />

13 different combinations of positive and negative errors were used, ranging<br />

from 0 to 20% errors as shown in Table 3. Hence, we tested all three genomes<br />

using a total of 3 × 70 × 2 × 13 = 5,460 spectra.<br />
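
The count of test spectra can be checked by enumerating the experimental grid. This is only a bookkeeping sketch: the variable names are illustrative, and the 13 error combinations are represented by placeholder indices rather than the actual rows of Table 3.

```python
from itertools import product

genomes = range(3)                    # the three GenBank genomes (Table 2)
sequence_lengths = [100, 200, 300, 400, 500, 1000, 2000]
sequences_per_length = range(10)      # 10 sequences of each length
fragment_lengths = [10, 20]
error_combinations = range(13)        # placeholders for the rows of Table 3

# One configuration per generated spectrum.
configs = list(product(genomes, sequence_lengths, sequences_per_length,
                       fragment_lengths, error_combinations))

print(len(configs))  # 5460
```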

We also generated a similar set of spectra with fragments of length<br />

50, so the algorithm was tested on spectra with three different fragment<br />

lengths: 10, 20, and 50. We observe that the longer the fragments are, the<br />

better the results are. In fact, with a fragment length of 50, our algorithm<br />

almost always found the optimal answers, and thus we do not include the<br />

data for fragments of length 50 here. We used different fragment lengths<br />

since in practice different hybridization techniques may require different<br />

fragment lengths. Normally, the hybridization rate is better if the fragment<br />

length is longer. However, for in situ hybridization, a small fragment<br />

length is required [21].<br />

As in the case of the first data set, we use the Smith-Waterman sequence<br />

alignment algorithm to determine the quality of the solutions returned by the<br />

algorithm. We used an implementation of the Smith-Waterman algorithm<br />

provided by Jie Li of Iowa State University [19]. Figures 12 and 13 show the<br />

performance and running time of our algorithm on the second set of data. In<br />

Figure 12, the graph shows the match percentage for spectra with fragment<br />

length 10. The x-axis shows the various error combinations in the input spectrum.<br />

The notation -a+b indicates spectra with a% negative errors and b%<br />

positive errors. Thirteen different error combinations were used to verify the<br />

performance of the new algorithm. These error combinations were selected<br />

for several reasons. Previous research in the same area used similar<br />

values for testing. We also wanted to provide error combinations that are<br />

uniformly distributed over the range from 0 to 20%. We found experimentally<br />

that these values are adequate to illustrate the strength of<br />

our new algorithm compared to other existing algorithms. Figure 13 shows<br />

the running time for spectra with a fragment length of 10.<br />

Figure 12: Plot of Match Percentage against Error Combinations (l=10)<br />

Figure 13: Plot of Running Time against Error Combinations (l=10)



Figure 14 is the graph for spectra with a fragment length of 20. For each<br />

error combination, the match percentage shown is the average over all spectra<br />

generated from all three genomes. In all cases, the match percentage is over<br />

90%. It can be observed that spectra with a longer fragment length appear<br />

to be easier to solve than the ones with a smaller fragment length. The same<br />

machine that we used to test the algorithm on the first data set was used<br />

for the second data set. Figure 15, with a fragment length of 20, shows that<br />

spectra with a higher error percentage seem to take longer than spectra with<br />

a smaller error percentage.<br />

Figure 14: Plot of Match Percentage against Error Combinations (l=20)<br />

Figure 15: Plot of Running Time against Error Combinations (l=20)<br />

The next few figures provide more information about the performance<br />

of our new algorithm with respect to different types of input. Figure 16<br />

shows the match percentage against only the negative errors for all fragment<br />

lengths. It can be seen that the match percentages are in the range from<br />

96% up to 99%. The match percentage is considered very good given the<br />

cardinality of the input spectrum. The graph shows that in some cases the<br />

match percentage is lower even when the error percentage decreases. This<br />

is because the spectrum generator algorithm is a random algorithm, so a<br />

spectrum with fewer errors might be harder to reconstruct than another<br />

spectrum with more errors. Other types of errors, such as the repeated<br />

fragment error, also affect the match percentage.<br />

Figure 16: Plot of Match Percentage against Negative Errors<br />

Figure 17 shows that the running time increases with the percentage of<br />

negative errors: the higher the negative error percentage, the longer the<br />

time needed to obtain the result. The plot in Figure 17 summarizes results<br />

for all fragment lengths.<br />

Figure 17: Plot of Running Time against Negative Errors<br />

The match percentage for positive errors, as shown in Figure 18, is between<br />

96% and 98%, which is considered very good given that the cardinality<br />

of the spectrum is approximately 500 fragments. The plot shows the match<br />

percentage against the percentage of positive errors in the spectrum, over<br />

all fragment lengths, all negative error percentages, and all spectrum sizes.<br />

In general, the running time increases as the percentage of positive errors<br />

increases. However, Figure 19 shows that the running time for 0% positive<br />

errors was worse than the running time for 5% or 8%. The reason is that<br />

with short fragments another type of error exists in the spectrum, known<br />

as the repeated fragments error. The repeated fragments error can prevent<br />

the genetic algorithm from obtaining optimal answers even when no other<br />

types of errors are included. Another factor that affects the quality of the<br />

solution obtained is the nature of the input sequences and spectra: some<br />

sequences are harder to reconstruct than others.<br />
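
The repeated fragments error can be illustrated with a small sketch. It assumes, as is usual for sequencing by hybridization, that the spectrum is a set, so a fragment that occurs twice in the sequence appears only once in the spectrum. The function names and the toy sequence are hypothetical.

```python
from collections import Counter

def spectrum(sequence, l):
    """Error-free spectrum: the set of distinct length-l fragments."""
    return {sequence[i:i + l] for i in range(len(sequence) - l + 1)}

def repeated_fragments(sequence, l):
    """Fragments of length l that occur more than once in the sequence."""
    counts = Counter(sequence[i:i + l] for i in range(len(sequence) - l + 1))
    return {frag for frag, c in counts.items() if c > 1}

if __name__ == "__main__":
    seq = "ACGTACGA"
    # With l = 3 the fragment ACG occurs twice, so the spectrum has only
    # 5 elements although the sequence has 6 fragment positions.
    print(len(spectrum(seq, 3)), repeated_fragments(seq, 3))
    # With l = 5 no fragment repeats and all 4 positions are represented.
    print(len(spectrum(seq, 5)), repeated_fragments(seq, 5))
```

The same toy sequence has no repeats at the longer fragment length, which matches the observation that this kind of error becomes rarer as fragments get longer.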

Figure 18: Plot of Match Percentage against Positive Errors<br />

Figure 19: Plot of Running Time against Positive Errors<br />

Figure 20 shows how the match percentage decreases as the length of<br />

the chromosome increases. Longer chromosomes have spectra with larger<br />

cardinality. As the cardinality of the input spectrum increases, the chance<br />

of obtaining the optimal answer decreases.<br />

Figure 20: Plot of Match Percentage against Optimal Length<br />

It is clear from Figure 21 that the longer the chromosome, the<br />

longer the time needed to find the result. This is because longer chromosomes<br />

consist of more fragments, so the cardinality of the spectrum is larger.<br />

Larger spectra thus increase the running time needed to reconstruct<br />

the sequence and obtain the result.<br />

Figure 21: Plot of Running Time against Optimal Length


7 CONCLUSION 36<br />

Figure 22 shows that the longer the fragment, the better the match percentage.<br />

The match percentages are improved by increasing the fragment<br />

length because longer fragments have less chance of being repeated. Also,<br />

longer fragments decrease the effect of errors in the spectrum.<br />

Figure 22: Plot of Match Percentage against Fragment Length<br />

Figure 23 shows that the running time of the algorithm is improved by<br />

increasing the fragment length. This is obvious from the fact that increasing<br />

the fragment length decreases the cardinality of the spectrum. It also reduces<br />

the number of fragments each chromosome consists of, which causes the result<br />

to be obtained more quickly.<br />
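
The effect on cardinality follows from a standard counting argument, sketched here for an error-free spectrum S. The symbol n for the sequence length is introduced only for this illustration, and l is the fragment length as in the figure captions. A sequence of length n has n - l + 1 starting positions for a fragment of length l, so:

```latex
|S| \le n - l + 1,
\qquad \text{e.g. for } n = 1000:\quad
l = 10 \Rightarrow |S| \le 991,\quad
l = 20 \Rightarrow |S| \le 981 .
```

Equality holds when no fragment is repeated in the sequence.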

Figure 23: Plot of Running Time against Fragment Length<br />

Experimental results from the two data sets suggest that our algorithm<br />

performs very well against existing algorithms. It is also very robust against<br />

different combinations of errors.<br />

7 Conclusion<br />

This study has introduced a new enhanced genetic algorithm for the DNA<br />

Sequencing problem. The results produced by the algorithm were very good;<br />

in many cases they were optimal or close to optimal, and they were frequently<br />

better than those of existing algorithms. Taking into account the difference<br />

in speed of the machines on which the various algorithms were run, our<br />

algorithm seems to be comparable to, if not faster than, existing algorithms.<br />

One area we did not cover in this research is the repeated fragments error.<br />

Repeated fragments can prevent the algorithm from finding optimal answers<br />

even when there are no other types of errors in the spectrum. This type of<br />

error diminishes as the fragment length increases. We plan to perform<br />

further investigation into the problem of repeated fragments.<br />

References<br />

[1] Blazewicz, J., P. Formanowicz, F. Glover, M. Kasprzak, and J. Weglarz, “An Improved Tabu Search Algorithm for DNA Sequencing with Errors,” Proceedings of the III Metaheuristics International Conference (MIC), 1999, pp. 69–75.<br />

[2] Blazewicz, J., P. Formanowicz, M. Kasprzak, W. T. Markiewicz, and J. Weglarz, “DNA Sequencing with Positive and Negative Errors,” Journal of Computational Biology 6, 1999, pp. 113–123.<br />

[3] Blazewicz, J., P. Formanowicz, M. Kasprzak, W. T. Markiewicz, and J. Weglarz, “Tabu Search for DNA Sequencing with False Negatives and False Positives,” European Journal of Operational Research 125, 2000, pp. 257–265.<br />

[4] Blazewicz, J. and M. Kasprzak, “Complexity of DNA Sequencing by Hybridization,” Theoretical Computer Science, 290, 2003, pp. 1459–1473.<br />

[5] Blazewicz, J., A. Kaczmarek, M. Kasprzak, W. T. Markiewicz and J. Weglarz, “Sequential and Parallel Algorithms for DNA Sequencing,” Computer Applications in the Biosciences 13, 1997, pp. 151–158.<br />

[6] Blazewicz, J., J. Kaczmarek, M. Kasprzak, J. Weglarz and W. T. Markiewicz, “Sequential Algorithms for DNA Sequencing,” Computational Methods in Science and Technology, 1, 1996, pp. 31–42.<br />

[7] Blazewicz, J., M. Kasprzak and W. Kuroczycki, “Hybrid Genetic Algorithm for DNA Sequencing with Errors,” Journal of Heuristics, 8, 2002, pp. 495–502.<br />

[8] Blum, A., T. Jiang, M. Li, J. Tromp and M. Yannakakis, “Linear Approximation of Shortest Superstrings,” Journal of the ACM, 41(4), 1994, pp. 630–647.<br />

[9] Bui, T. and W. Youssef, “An Enhanced Genetic Algorithm for DNA Sequencing by Hybridization with Positive and Negative Errors,” to appear in Genetic and Evolutionary Computation Conference (GECCO), June 26–30, 2004.<br />

[10] Cummings, M. R., Human Heredity: Principles and Issues, West Publishing Company, 1991.<br />

[11] Fogel, G. B. and K. Chellapilla, “Simulated Sequencing by Hybridization Using Evolutionary Programming,” Proc. of the IEEE Congress on Evolutionary Computation, CEC’99, 1999, pp. 445–452.<br />


[12] Fogel, G. B., K. Chellapilla and D. B. Fogel, “Reconstruction of DNA Sequence Information From a Simulated DNA Chip Using Evolutionary Programming,” Lecture Notes in Computer Science, edited by V. W. Porto, N. Saravanan, D. Waagen and A. E. Eiben, Vol. 1447, 1998, pp. 429–436.<br />

[13] GenBank, http://www.ncbi.nlm.nih.gov/Genbank<br />

[14] Goldberg, D. E., Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, 1989.<br />

[15] Holland, J., Adaptation in Natural and Artificial Systems, Ann Arbor: University of Michigan Press, 1975.<br />

[16] Haan, N. M. and S. J. Godsill, “Sequential Methods For DNA Sequencing,” Department of Engineering, University of Cambridge, U.K., 2001.<br />

[17] The Human Genome Project of the U.S. Department of Energy, http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml<br />

[18] Kasprzak, M., Personal communications, August 2003.<br />

[19] Li, J., “Implementation of Smith-Waterman Alignment Algorithm,” Iowa State University, Personal Communication, 2003.<br />

[20] Matsumoto, M. and T. Nishimura, “Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator,” ACM Transactions on Modeling and Computer Simulation, 8(1), January 1998, pp. 3–30.<br />

[21] Nonradioactive In Situ Hybridization Application Manual, Technical Manual, Roche Applied Science.<br />

[22] Percus, A. G. and D. C. Torney, “Greedy Algorithms for Optimized DNA Sequencing,” Technical Report, Los Alamos National Laboratory, Los Alamos, NM 87545.<br />


[23] Pevzner, P. A., Computational Molecular Biology, An Algorithmic Approach, The MIT Press, second printing 2001, Chapter 4, pp. 59–63.<br />

[24] Pevzner, P. A., H. Tang and M. S. Waterman, “An Eulerian Path Approach to DNA Fragment Assembly,” Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA; and Departments of Mathematics and Biological Sciences, University of Southern California, Los Angeles, CA, June 7, 2001.<br />

[25] Phan, V. T. and S. Skiena, “Dealing with Errors in Interactive Sequencing by Hybridization,” Oxford University Press, 17(10), 2002, pp. 1–9.<br />

[26] Skiena, S., Personal communications, September 2003.<br />

[27] Waterman, M. S., Introduction to Computational Biology: Maps, Sequences and Genomes, Chapman & Hall, London, 1995.
