An Improved Genetic Algorithm for DNA Sequencing - Penn State ...
The Pennsylvania State University The Graduate School Capital College An Improved Genetic Algorithm Solving the DNA Sequencing Problem with Errors A Master’s Paper in Computer Science by Waleed Youssef © 2004 Waleed Youssef Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science March 2004
The Pennsylvania State University<br />
The Graduate School<br />
Capital College<br />
An Improved Genetic Algorithm<br />
Solving the DNA Sequencing<br />
Problem with Errors<br />
A Master’s Paper in<br />
Computer Science<br />
by<br />
Waleed Youssef<br />
© 2004 Waleed Youssef<br />
Submitted in Partial Fulfillment<br />
of the Requirements<br />
for the Degree of<br />
Master of Science<br />
March 2004<br />
Abstract<br />
Genetic Algorithms have turned out to be very effective in solving the computationally<br />
NP-hard problem of DNA Sequencing. In general, Genetic Algorithms<br />
produce optimal or close to optimal solutions in polynomial time.<br />
In this research, we describe a new genetic algorithm for solving the DNA<br />
sequencing problem. The algorithm allows the input spectrum to contain<br />
both positive and negative errors, as could be expected from a hybridization<br />
experiment. The main features of the algorithm described here include a preprocessing<br />
step that reduces the size of the input spectrum using a dynamic<br />
programming technique and an efficient local optimization process. In experimental<br />
tests, the algorithm performed very well against existing algorithms:<br />
it outperformed them in terms of match percentages. The running time,<br />
although better than that of existing algorithms, was not highlighted as an important<br />
factor because different machines were used in testing. The algorithm also<br />
performed very well on large data sets generated from real genome data.<br />
Table of Contents<br />
Abstract<br />
Acknowledgements<br />
List of Figures<br />
List of Tables<br />
1 Introduction<br />
2 Preliminaries<br />
2.1 Definitions<br />
2.2 Basic Information<br />
2.3 Sequencing by Hybridization<br />
2.4 Fragments Assembly<br />
2.5 Genetic Algorithms<br />
3 Problem Formulation<br />
3.1 Describing the Problem<br />
3.2 Error Model<br />
3.3 Input and Output<br />
3.4 Other Related Information<br />
3.4.1 The Hamiltonian and Eulerian Paths<br />
3.4.2 Sequence Alignment<br />
4 Algorithm<br />
4.1 Preprocessing<br />
4.2 Encoding and Initialization<br />
4.3 Fitness<br />
4.4 Parent Selection<br />
4.5 Crossover<br />
4.6 Mutation<br />
4.7 Local Optimization<br />
4.8 Replacement Scheme<br />
4.9 Stopping Condition<br />
5 Standardizing the Data<br />
6 Experimental Results<br />
7 Conclusion<br />
Acknowledgements<br />
Dr. T. Bui has been a valued advisor. He provided me with excellent<br />
insights and knowledge. His expertise has made this project a wonderful<br />
experience for me. I owe a good deal to him for making this project a<br />
success. My special thanks to the committee members, Dr. Q. Ding, Dr.<br />
P. Naumov, and Dr. L. Null, for their positive comments and feedback.<br />
Also, thanks to everyone in the Computer Science department for the greatest<br />
learning experience I have ever had.<br />
I would also like to thank Dr. J. Blazewicz and Dr. M. Kasprzak, Institute<br />
of Computing Science, Poznan University of Technology, for providing me<br />
with the data used in [7]; Dr. S. Skiena, Department of Computer Science,<br />
State University of New York, for useful discussions; and Jie Li, Iowa State<br />
University, for providing me with an implementation of the Smith-Waterman<br />
algorithm.<br />
A version of this paper appears in [9].<br />
List of Figures<br />
1 A Typical Structure of Steady-State Genetic Algorithm<br />
2 A Typical Structure of Hybrid Genetic Algorithm<br />
3 Reconstructing the sequence in case of no errors<br />
4 Reconstructing the sequence in case of errors in the spectrum<br />
5 Recurrence Formula for Sequence Alignment<br />
6 The Enhanced Genetic Algorithm for DNA sequencing<br />
7 The Preprocessing Algorithm<br />
8 The Algorithm to repair the Chromosome<br />
9 The Algorithm for the Local Optimization Process<br />
10 The Spectrum Generator Algorithm<br />
11 Comparison between our algorithm and others<br />
12 Plot of Match Percentage against Error Combinations (l=10)<br />
13 Plot of Running Time against Error Combinations (l=10)<br />
14 Plot of Match Percentage against Error Combinations (l=20)<br />
15 Plot of Running Time against Error Combinations (l=20)<br />
16 Plot of Match Percentage against Negative Errors<br />
17 Plot of Running Time against Negative Errors<br />
18 Plot of Match Percentage against Positive Errors<br />
19 Plot of Running Time against Positive Errors<br />
20 Plot of Match Percentage against Optimal Length<br />
21 Plot of Running Time against Optimal Length<br />
22 Plot of Match Percentage against Fragment Length<br />
23 Plot of Running Time against Fragment Length<br />
List of Tables<br />
1 The Dynamic Programming Matrix representation for Example 4<br />
2 Genomes obtained from the GenBank<br />
3 Positive and Negative error combinations used in testing the algorithm<br />
4 Summary of results from our algorithm and others<br />
1 Introduction<br />
Determining the genome of living organisms has been a major research initiative<br />
worldwide in the last few years. One of the principal steps in this<br />
endeavor is the sequencing of DNA. Informally, DNA sequencing is the process<br />
of determining the correct order of nucleotides in a DNA segment. Many<br />
techniques have been developed for DNA sequencing. DNA sequencing experiments<br />
are typically performed in two stages: shotgun sequencing and<br />
walking. In shotgun sequencing, many short, randomly selected fragments of<br />
a DNA segment are sequenced. Due to the stochastic nature of this process,<br />
some parts of the DNA segment are left unsequenced or insufficiently<br />
covered. These parts are then covered by a deterministic finishing process<br />
called walking [22].<br />
The two most popular methods for DNA sequencing are the Sanger<br />
method and the Sequencing by Hybridization (SBH) method [23]. In this<br />
research we consider only the SBH method. SBH follows the methodology<br />
known as “break, read, and assemble”. In this methodology a DNA sequence<br />
is partitioned into smaller fragments. The fragments are then read using<br />
a fluorescent light. The assemble phase tries to recover the original sequence<br />
from the shorter fragments, i.e., to determine the exact sequence of<br />
nucleotides of the DNA molecule.<br />
From an algorithmic point of view, the DNA sequencing problem is the<br />
problem of constructing a chromosome that most likely contains all the DNA<br />
fragments in a given input set, called the spectrum. The spectrum is usually<br />
obtained through an experiment such as a hybridization experiment. It<br />
should be noted that the fragments in a spectrum overlap and that all fragments<br />
have the same length. If a spectrum contains all possible fragments of<br />
length l of a DNA sequence and there are no errors in the fragments, then<br />
there exist efficient algorithms for reconstructing the original DNA sequence<br />
from the spectrum. In general, however, there are errors in the spectrum,<br />
e.g., missing fragments or erroneous fragments, making the problem of reconstructing<br />
the original DNA sequence NP-hard [4].<br />
In this study we present a genetic algorithm for the DNA sequencing<br />
problem. This algorithm differs from others in that it can efficiently handle<br />
different types of errors in the input. Additionally, the algorithm includes<br />
a preprocessing step that effectively reduces the size of the input, thereby<br />
reducing the running time of the algorithm. This step also improves the<br />
chance of finding the optimal answer. The idea of the preprocessing step can<br />
be extended to create a hierarchical structure that enables the algorithm to<br />
deal with much longer sequences. Experimental results show that this new<br />
algorithm outperformed other algorithms from [1] and [7]. We also performed<br />
extensive tests of the algorithm on data that we generated systematically<br />
from genomes obtained from GenBank [13]. The data was generated using<br />
another algorithm we developed to simulate the Sequencing by Hybridization<br />
experiment. To determine the quality of the results obtained, the Smith-<br />
Waterman algorithm for sequence alignment was used [27]. The results of<br />
these experiments show that our algorithm is very robust across a large<br />
range of data generated with different types of errors.<br />
The rest of this paper is organized as follows. Section 2 describes some<br />
common terminology and background information. Section 3 defines the<br />
problem formally, gives background information about algorithmic methods<br />
and techniques used when developing the algorithm, and describes some techniques<br />
that were used previously to solve the same problem. The algorithm is<br />
described in Section 4. Techniques used to obtain and standardize the data<br />
are described in Section 5. Experimental results comparing the performance<br />
of our algorithm against others are given in Section 6. Section 6 also includes<br />
results showing the performance of our algorithm on large data sets that we<br />
generated. Conclusions and future directions are given in Section 7.<br />
2 Preliminaries<br />
DNA contains the genetic codes that are passed from generation to generation.<br />
It determines many facets of how organisms develop. Over many years,<br />
biologists have tried to determine and analyze the genome sequences that<br />
hold the characteristics of all living things. Many mysteries have been identified<br />
and many secrets have been revealed. One of the most challenging technical<br />
projects of recent years is the Human Genome Project. Sequencing the<br />
whole human genome will help reveal the estimated 30,000 to 35,000 human<br />
genes within our DNA as well as the regions controlling them. The resulting<br />
DNA sequence maps will be used by 21st century scientists to explore human<br />
biology and other complex phenomena [17].<br />
In the next few subsections, we describe some of the terminology<br />
needed for the rest of this paper. We also give brief preliminary information<br />
about DNA sequencing in general and the Sequencing by Hybridization<br />
experiment in particular. Additional background information is also<br />
discussed.<br />
2.1 Definitions<br />
DNA can be defined as a string of symbols drawn from the set Σ = {A, C, G, T},<br />
where A represents Adenine, C represents Cytosine, G represents Guanine,<br />
and T represents Thymine. Each symbol is known as a nucleotide. An<br />
oligonucleotide is a short sequence of nucleotides. It is also known as a<br />
fragment. A sequence or strand is a string of a larger number of nucleotides.<br />
Usually, genetic algorithms refer to the strand or sequence as a chromosome.<br />
A hybridization experiment is an experiment that takes a DNA strand and<br />
produces copies of fragments of that strand. These fragments usually<br />
overlap. The set of all fragments that results from a hybridization experiment<br />
is known as a spectrum. All fragments in a spectrum have the same length.<br />
Example 1.<br />
• A fragment of length 10: ACTGCTGGTT<br />
• A chromosome C: ACTGCTGGTGTCTGTACGAGGTACGTAGCA<br />
• Spectrum S of cardinality 21, obtained from chromosome C:<br />
{ACTGCTGGTG, CTGCTGGTGT, TGCTGGTGTC, GCTGGTGTCT,<br />
CTGGTGTCTG, TGGTGTCTGT, GGTGTCTGTA, GTGTCTGTAC,<br />
TGTCTGTACG, GTCTGTACGA, TCTGTACGAG, CTGTACGAGG,<br />
TGTACGAGGT, GTACGAGGTA, TACGAGGTAC, ACGAGGTACG,<br />
CGAGGTACGT, GAGGTACGTA, AGGTACGTAG, GGTACGTAGC,<br />
GTACGTAGCA}<br />
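The error-free spectrum of Example 1 is simply the set of all length-l substrings of the chromosome. A minimal Python sketch of that construction follows; the function name `spectrum` is an illustrative choice, not something defined in the paper.

```python
def spectrum(sequence, l):
    """Return the error-free spectrum: every length-l substring of sequence."""
    if l <= 0 or l > len(sequence):
        raise ValueError("fragment length must be between 1 and len(sequence)")
    # One fragment starts at each of the len(sequence) - l + 1 positions.
    return {sequence[i:i + l] for i in range(len(sequence) - l + 1)}


C = "ACTGCTGGTGTCTGTACGAGGTACGTAGCA"
S = spectrum(C, 10)
print(len(S))  # cardinality of the spectrum
```

For the chromosome C above, which has 30 nucleotides, this yields 30 − 10 + 1 = 21 distinct fragments, matching the cardinality given in Example 1.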
2.2 Basic Information<br />
DNA (deoxyribonucleic acid) consists of two strands, each of which contains<br />
nucleotides drawn from the set Σ = {A, C, G, T}. (Technically, there are<br />
other components in a DNA strand, such as phosphates.) The nucleotides in<br />
each strand are connected together in series. The two strands of the DNA<br />
are twisted together into the famous double helix structure. Furthermore,<br />
each nucleotide in a strand is connected to a complementary nucleotide in<br />
the other strand, where A is paired with T and C is paired with G. Thus,<br />
each strand in a DNA molecule completely determines the other. Sequencing techniques<br />
make use of this important characteristic: labeled complementary fragments<br />
are added to the experiment, and the fragments that pair up with the<br />
labeled ones are used to determine the original sequence of<br />
nucleotides.<br />
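The pairing rule (A with T, C with G) is what lets one strand determine the other. A small sketch of that rule in Python is given below; the names `PAIR` and `complement` are illustrative assumptions, and the sketch ignores strand orientation.

```python
# Base-pairing rule from the text: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the nucleotides that pair with each position of strand."""
    return "".join(PAIR[base] for base in strand)


print(complement("ACTGCTGGTT"))
```

Applying `complement` twice returns the original strand, which is the sense in which each strand completely determines the other.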
Three main areas of interest can be distinguished in the study of DNA:<br />
DNA sequencing, DNA assembling, and DNA mapping. DNA sequencing is<br />
the process of determining the original sequence of nucleotides from a set of<br />
DNA fragments. DNA assembling is the process of assembling the sequenced<br />
fragments into longer contigs. Finally, DNA mapping deals with whole<br />
chromosomes and tries to place marked DNA fragments (usually genes) on<br />
certain chromosome regions [5]. Our research concentrated mainly on DNA<br />
sequencing. We were also able to extend the algorithm presented here to<br />
perform some DNA assembly work, as will be shown in the next few sections.<br />
2.3 Sequencing by Hybridization<br />
Hybridization is a parallel experiment with high throughput. It is well suited to<br />
limited regions of a genome. It is a procedure in which two single-stranded<br />
DNA molecules containing complementary sequences of nucleotides bond together.<br />
A probe is a DNA molecule that is fluorescently labeled. By testing<br />
whether a probe hybridizes to a given sequence, it is possible to determine<br />
whether the sequence contains a piece that is complementary to the probe.<br />
Techniques have been devised that make it possible to test the hybridization<br />
of a single probe to hundreds of different sequences in a single automated<br />
experiment. In the hybridization experiment, DNA arrays (also known as<br />
DNA chips) containing thousands of short fragments of length l (for example,<br />
l = 10) attached to a surface are exposed to a solution containing<br />
the unknown fluorescently labeled DNA fragment. After the reaction, one can<br />
obtain a set of oligonucleotides that are fragments of the examined DNA<br />
sequence by reading a fluorescent image of the chip. Those fragments, or<br />
oligonucleotides, constitute the spectrum. The spectrum is read by exposing<br />
it to light. The original sequence is then reconstructed using combinatorial<br />
algorithms that take the generated spectrum as input and return as<br />
output a sequence that best represents the input spectrum.<br />
The reliability of a fragment depends on its binding energy to the target,<br />
which is a function of many factors such as the length of the probe, the<br />
oligonucleotide content of the probe, and similar sequences in the target.<br />
The probes should not be too short if we want to be able to<br />
reconstruct the sequence efficiently. Short probes overlap in correspondingly few<br />
positions, so if the probes are too short it is almost<br />
impossible to reconstruct the original sequence. Long probes increase the<br />
hybridization reliability. However, fragments longer than a few hundred nucleotides<br />
cannot be sequenced reliably by current methods, so the fragments<br />
will typically be rather short. In addition, long fragments are uninformative<br />
at the single nucleotide level. The fragment length also influences thermal<br />
stability.<br />
Sequencing by hybridization (SBH) is an elegant and efficient sequencing<br />
method in the case of error-free data. SBH is fast and convenient. However, it<br />
is limited to extremely short sequences, as even sequencing an 8-base sequence<br />
implies an array with $4^8 = 65{,}536$ elements. Sequences of a hundred bases<br />
would require a currently infeasible array size. In general, sequencing might<br />
be accomplished with the entire set of $4^N$ probes of length N. However, in the<br />
original experiment, errors exist. These errors correlate more with differences<br />
in melting temperature than with purely random effects.<br />
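To make the growth in array size concrete, the number of distinct probes of length N over a four-letter alphabet can be tabulated directly. The helper name `probe_count` below is hypothetical and used only for this illustration.

```python
def probe_count(n):
    """Number of distinct probes of length n over the alphabet {A, C, G, T}."""
    return 4 ** n


for n in (8, 10, 12):
    print(n, probe_count(n))
```

Each additional base multiplies the required array size by four, which is why a full probe set quickly becomes impractical as N grows.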
2.4 Fragments Assembly<br />
Fragments assembly is the process of constructing a longer sequence from the<br />
input spectrum. Shorter sequences can be assembled using the DNA sequencing<br />
process, but DNA sequencing is not very efficient for longer sequences.<br />
In general, the process of assembling large sequences can be divided into two<br />
main steps. The first step is to reconstruct longer sequences from input spectra.<br />
Reconstructing longer sequences from short fragments can be done efficiently<br />
using DNA sequencing algorithms. The second step is to align the sequences<br />
obtained from the DNA sequencing process using sequence alignment algorithms.<br />
The process can be viewed as a tree in which the shorter<br />
fragments are the leaves and the goal is to move up the tree to obtain longer<br />
sequences until the root is reached and the optimal sequence is obtained.<br />
2.5 Genetic Algorithms<br />
Before describing the problem more formally, we provide some background<br />
information on how genetic algorithms can be used to find good solutions to many NP-hard<br />
problems efficiently. Genetic algorithms were formally introduced in 1975 by<br />
John Holland at the University of Michigan [15]. In 1992 John Koza used genetic<br />
algorithms to evolve programs to perform certain tasks. He called his<br />
method genetic programming. Since then, genetic algorithms have become<br />
more and more popular. The continuing price and performance improvements<br />
of computational systems have made them attractive for many types<br />
of optimization problems. In particular, genetic algorithms work very well<br />
on mixed (continuous and discrete) combinatorial problems. They are less<br />
susceptible to getting ‘stuck’ at local optima than gradient search methods,<br />
but they tend to be computationally expensive.<br />
To use a genetic algorithm, the solution to the problem must be represented<br />
as a genome (or chromosome). The genetic algorithm then creates a<br />
population of solutions and applies genetic operators such as mutation and<br />
crossover to evolve the solutions in order to find the best one(s). Many variations<br />
of genetic algorithms exist, depending on the structure of the algorithm.<br />
A genetic algorithm that generates only one offspring per generation is called a<br />
steady-state genetic algorithm, as opposed to a generational genetic algorithm,<br />
which replaces the whole population, or a large subset of it, in each generation. A<br />
typical structure of a steady-state genetic algorithm is given in Figure 1. If a<br />
local optimization step is added to the steady-state genetic algorithm, then<br />
the algorithm is said to be hybridized and the scheme is called a hybrid genetic<br />
algorithm. A typical structure for the hybrid GA is shown in Figure 2. We<br />
provide a hybrid steady-state GA for the DNA sequencing problem.<br />
Genetic Algorithm<br />
generate a random initial population P<br />
repeat<br />
    select two parents p1 and p2 from the population<br />
    offspring ← crossover(p1, p2)<br />
    mutate(offspring)<br />
    if suited(offspring) then<br />
        replace(P, offspring)<br />
until (there is no improvement)<br />
return the best member of P<br />
Figure 1: A Typical Structure of Steady-State Genetic Algorithm<br />
Below, we introduce some common and important terminology of genetic<br />
algorithms, as well as some recommendations that should be taken into consideration<br />
when designing a new genetic algorithm. Following these recommendations<br />
should improve the evolution of the algorithm and eventually lead to<br />
better results.<br />
Genetic Algorithm<br />
generate a random initial population P<br />
repeat<br />
    select two parents p1 and p2 from the population<br />
    offspring ← crossover(p1, p2)<br />
    mutate(offspring)<br />
    localOptimize(offspring)<br />
    if suited(offspring) then<br />
        replace(P, offspring)<br />
until (there is no improvement)<br />
return the best member of P<br />
Figure 2: A Typical Structure of Hybrid Genetic Algorithm<br />
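Figures 1 and 2 differ only in the local optimization step. A minimal Python sketch of both structures, on a toy bit-string problem that is not part of the paper, might look as follows. The one-point crossover, the single-gene mutation, the replace-the-worst rule standing in for `suited`, and the `max_stall` stopping rule are all illustrative assumptions rather than the operators used by the algorithm of Section 4.

```python
import random


def steady_state_ga(fitness, length, pop_size=50, max_stall=200,
                    local_optimize=None, seed=0):
    """Steady-state GA over bit lists; passing local_optimize makes it hybrid."""
    rng = random.Random(seed)
    # Generate a random initial population P.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(map(fitness, population))
    stall = 0
    while stall < max_stall:                    # until there is no improvement
        p1, p2 = rng.sample(population, 2)      # select two parents
        cut = rng.randrange(1, length)          # one-point crossover (assumed)
        child = p1[:cut] + p2[cut:]
        i = rng.randrange(length)               # flip one random gene (assumed)
        child[i] = 1 - child[i]
        if local_optimize is not None:          # extra step of Figure 2
            child = local_optimize(child)
        worst = min(range(pop_size), key=lambda k: fitness(population[k]))
        if fitness(child) > fitness(population[worst]):   # "suited" (assumed)
            population[worst] = child
        new_best = max(best, fitness(child))
        stall = 0 if new_best > best else stall + 1
        best = new_best
    return max(population, key=fitness)


def flip_first_zero(bits):
    """Toy local search: turn the first 0 into a 1, if there is one."""
    bits = list(bits)
    if 0 in bits:
        bits[bits.index(0)] = 1
    return bits


# Toy objective: maximize the number of 1 bits.
print(steady_state_ga(sum, length=20))
print(steady_state_ga(sum, length=20, local_optimize=flip_first_zero))
```

Because only one offspring is produced per iteration and it replaces at most one member, the best member of the population is never lost, which is the defining property of the steady-state scheme.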
Population: A genetic algorithm starts with a set of initial solutions<br />
(chromosomes) called a population. The size of the population depends on<br />
the problem and on the encoding of the chromosomes. Larger populations<br />
usually need more time to evolve. Smaller populations might not be able to<br />
reach the optimal solution, since they are more likely to get stuck at local<br />
optima. A good population size is about 50 to 100. The population size<br />
has a direct effect on performance: the larger the population, the slower the<br />
algorithm.<br />
Encoding: Encoding depends on the problem and also on the size of<br />
the problem instance. There is no general-purpose way of encoding the<br />
chromosome. An encoding that is best for one problem might not be suitable<br />
for other problems.<br />
Crossover: Crossover is the process of combining two parents to obtain<br />
a new member of the population called an offspring. Crossover depends on<br />
the chosen encoding and on the problem. One possible crossover operator is<br />
a k-point crossover.<br />
In the k-point crossover, two parent chromosomes are cut in exactly k<br />
positions, resulting in k + 1 segments for each chromosome. Segments are<br />
then alternately concatenated, resulting in two new offspring. The two<br />
new offspring are compared and the better one is returned.<br />
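The k-point crossover described above can be sketched in Python as follows. The sketch assumes both parents have the same length and that cut positions are chosen uniformly at random; the names `k_point_crossover` and `best_child` are illustrative.

```python
import random


def k_point_crossover(parent_a, parent_b, k, rng=random):
    """Cut both parents at the same k positions and alternate the segments."""
    n = len(parent_a)
    if len(parent_b) != n:
        raise ValueError("parents must have the same length")
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(parent)")
    cuts = sorted(rng.sample(range(1, n), k))   # k distinct interior cut points
    bounds = [0] + cuts + [n]                   # k + 1 segments
    child1, child2 = [], []
    for seg, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        # Even segments keep their parent, odd segments are swapped.
        src1, src2 = (parent_a, parent_b) if seg % 2 == 0 else (parent_b, parent_a)
        child1.extend(src1[lo:hi])
        child2.extend(src2[lo:hi])
    return child1, child2


def best_child(parent_a, parent_b, k, fitness, rng=random):
    """Return the fitter of the two offspring, as described in the text."""
    return max(k_point_crossover(parent_a, parent_b, k, rng), key=fitness)
```

Passing a seeded `random.Random` instance as `rng` makes the cut positions reproducible, which is convenient when comparing runs.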
Mutation: The goal of mutation is to enhance the diversity of the population<br />
by introducing new members that are mutated from their parents.<br />
The mutation rate should be very low. The best rates seem to be about<br />
0.5%-5%. Increasing the mutation rate creates members in the population<br />
with new characteristics.<br />
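As a rough illustration of a low mutation rate, the sketch below changes each gene independently with probability `rate`. Whether the rate applies per gene or per offspring is not stated above, so the per-gene reading, the function name `mutate`, and the default of 1% are assumptions.

```python
import random


def mutate(chromosome, alphabet, rate=0.01, rng=random):
    """Replace each gene, with probability rate, by a different symbol."""
    mutated = []
    for gene in chromosome:
        if rng.random() < rate:
            # Pick any symbol other than the current one.
            gene = rng.choice([s for s in alphabet if s != gene])
        mutated.append(gene)
    return mutated


print("".join(mutate("ACTGCTGGTT", "ACGT", rate=0.05)))
```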
Selection: Many schemes exist for the selection process. One of the<br />
simplest methods is the basic selection method. In this method, a random<br />
number representing a chromosome in the population is selected. Each<br />
member of the population has the same probability of being selected. Another<br />
commonly used method is the roulette wheel selection method. In<br />
this method, members of the population with higher fitness have a higher<br />
chance of being selected. There are also more sophisticated methods that<br />
change the parameters of selection during the run of the genetic algorithm.<br />
They behave very similarly to simulated annealing. Choosing the right method<br />
to use is problem dependent.<br />
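The two selection schemes named above can be contrasted in a few lines of Python. The sketch assumes non-negative fitness values and makes the selection probability proportional to fitness; the function names are illustrative.

```python
import random


def uniform_select(population, rng=random):
    """Basic selection: every member is equally likely to be chosen."""
    return rng.choice(population)


def roulette_select(population, fitness, rng=random):
    """Roulette wheel selection: probability proportional to fitness."""
    weights = [fitness(member) for member in population]
    if min(weights) < 0 or sum(weights) == 0:
        raise ValueError("fitness values must be non-negative and not all zero")
    return rng.choices(population, weights=weights, k=1)[0]
```

With fitness-proportional weights, a member with zero fitness is never selected, whereas under basic selection it is as likely to be chosen as any other member.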
Local Optimization: The goal of the local optimization step is to try<br />
to enhance the current chromosome and obtain a better one that is either a<br />
local or a global optimum of the original problem. Local optimization<br />
is problem dependent. There are no common techniques for performing<br />
local optimization. An optimization step that is good for one problem might<br />
not be good for another.<br />
3 Problem Formulation<br />
In this section we give some basic algorithmic background information as<br />
well as a formal description of the DNA sequencing problem.<br />
3.1 Describing the Problem<br />
The DNA sequencing problem is the problem of determining a DNA strand<br />
based on a given spectrum. The problem can be modeled as follows. Let<br />
Σ = {A, C, G, T} be an alphabet. Here we consider a spectrum to be a set of<br />
strings of length l over Σ, so all fragments in a spectrum have equal length. A<br />
spectrum is said to be ideal if the following condition is true for all but one<br />
fragment in the spectrum: the suffix of length l − 1 of the fragment is a prefix of<br />
exactly one other fragment in the spectrum. The DNA sequencing problem<br />
can then be stated as the problem of constructing a string over Σ from a given<br />
spectrum (not necessarily an ideal spectrum), so that the resulting string is<br />
the shortest string that contains as many of the fragments in the spectrum<br />
as possible.<br />
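The definition of an ideal spectrum translates directly into a check on suffixes and prefixes. The Python sketch below reads "all but one fragment" as exactly one exception (the last fragment of the sequence); that reading and the name `is_ideal` are assumptions made for illustration.

```python
def is_ideal(spectrum):
    """Check the ideal-spectrum condition on a collection of fragments."""
    frags = set(spectrum)
    if len({len(f) for f in frags}) != 1:
        return False  # fragments must all have the same length l
    exceptions = 0
    for f in frags:
        # Other fragments whose prefix of length l-1 equals f's suffix of length l-1.
        followers = [g for g in frags if g != f and g[:-1] == f[1:]]
        if len(followers) != 1:
            exceptions += 1
    return exceptions == 1


S = {"ACAGT", "CAGTG", "AGTGA", "GTGAC", "TGACT", "GACTG"}
print(is_ideal(S))                # error-free spectrum of ACAGTGACTG
print(is_ideal(S - {"AGTGA"}))    # one fragment missing
```

The quadratic scan is adequate for small spectra; indexing fragments by their prefix would make the check linear in the size of the spectrum.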
3.2 Error Model<br />
Many variations of the sequencing problem exist, depending on the model of<br />
error used and on other factors in the experiment. Errors occurring during<br />
the hybridization experiment play an important role in reconstructing the<br />
original sequence. Algorithms solving the DNA sequencing problem usually<br />
consider one of two cases. The first case is to assume that the input spectrum<br />
is ideal, meaning that it has no errors. The other case is to deal with errors<br />
in the spectrum.<br />
In general, the input spectrum is not an ideal one. The errors appearing in<br />
a spectrum are usually due to errors in the hybridization experiment. Errors<br />
can be classified as positive or negative. The spectrum has positive errors<br />
when it contains fragments that are not part of the original sequence. It<br />
has negative errors when it fails to contain some oligonucleotides. Certain<br />
errors are random, meaning that they may disappear when the experiment is<br />
repeated. However, many hybridization errors are systematic, meaning that<br />
they are likely to repeat each time the experiment is run [25][26].<br />
If there are no errors, the problem of DNA sequencing is similar to the<br />
Shortest Superstring problem [23], which is defined as the problem of<br />
reconstructing a string given a collection of overlapping substrings. The Shortest<br />
Superstring problem seems to be much easier than the original problem of<br />
DNA sequencing, and there exist efficient algorithms for this problem [24].<br />
There also exists an approximation algorithm with an approximation factor<br />
of three, i.e., the superstring it produces is at most three times as long as the<br />
optimal shortest superstring [8]. To help illustrate the idea of DNA sequencing<br />
when the input spectrum contains no errors, let us consider the following<br />
example.<br />
Example 2. Let the original sequence to be found be ACAGTGACTG.<br />
Let the fragment length be 5, i.e., l=5. Assume that the hybridization experiment<br />
has 0% positive error and 0% negative error. Then, the output from<br />
the experiment would be the set S, where S={ACAGT, CAGTG, AGTGA,<br />
GTGAC, TGACT, GACTG}. The cardinality of S is n=6. In the case of no<br />
errors in the spectrum, each fragment overlaps the next one in exactly l − 1<br />
positions, so each fragment after the first contributes one new character.<br />
Thus, the total length can be calculated as n + l − 1. Hence, in<br />
this example, the optimal length is 6 + 5 − 1 = 10, and consecutive fragments<br />
overlap in exactly 4 positions. The final sequence can be determined as shown in Figure 3 as<br />
ACAGTGACTG.<br />
ACAGT.....<br />
.CAGTG....<br />
..AGTGA...<br />
...GTGAC..<br />
....TGACT.<br />
.....GACTG<br />
----------<br />
ACAGTGACTG<br />
Figure 3: Reconstructing the sequence in case of no errors<br />
The existence of errors in the input spectrum makes the problem of<br />
reconstructing the original sequence an NP-hard problem [4]. Missing fragments<br />
from the experiment turn the problem into the problem of finding the most<br />
likely sequence [4]. The most likely sequence is the shortest one containing<br />
almost all the fragments as substrings. Some fragments might be excluded<br />
from the final result. Those excluded fragments are the ones that represent<br />
the positive errors in the experiment. Also, adjacent fragments might not<br />
overlap completely. Under normal conditions, two consecutive fragments of length l<br />
overlap in l − 1 positions. However, because of negative errors, the longest<br />
overlap might be shorter than l − 1. Another source of difficulty exists when<br />
the spectrum contains repeated fragments. Most existing algorithms that<br />
allow for errors in the input spectrum put restrictions on the error model<br />
[5][11][12]. Few algorithms place no restrictions on the input<br />
error model. Two such algorithms are in [1] and [7]. Our algorithm also puts<br />
no restrictions on the error model. We require only an upper bound on both<br />
the negative and positive errors in the spectrum. A comparison between our<br />
algorithm and those two algorithms is presented in Section 6.<br />
Example 3. Using the same DNA sequence as in Example 2, let us now assume<br />
there are errors in the input spectrum. Assume that fragment AGTGA<br />
is a negative error, i.e., it is missing from the input spectrum S. Instead,<br />
fragment AGTCA appears in the set S as a positive error. That is, it<br />
is not part of the final optimal sequence. Then S={ACAGT, CAGTG,<br />
GTGAC, TGACT, GACTG, AGTCA}. The optimal length is calculated using<br />
the same formula as for the ideal spectrum, except that the number of negative errors is added<br />
to the formula and the number of positive errors is subtracted from it. Then, the optimal<br />
length would be n + l − 1 + 1 − 1 = 6 + 5 − 1 + 1 − 1, or 10. An algorithm solving the DNA<br />
sequencing problem with errors should detect the positive error fragment and<br />
try not to include it in the final solution. One possible solution is shown in<br />
Figure 4.<br />
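The length calculation in Example 3 can be written out as a worked equation. The symbols e⁻ and e⁺, denoting the numbers of negative and positive errors, are introduced here for illustration only and are not notation from the paper.

```latex
\[
L_{\mathrm{opt}} = n + l - 1 + e^{-} - e^{+} = 6 + 5 - 1 + 1 - 1 = 10
\]
```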
ACAGT.....<br />
.CAGTG....<br />
...GTGAC..<br />
....TGACT.<br />
.....GACTG<br />
----------<br />
ACAGTGACTG<br />
Figure 4: Reconstructing the sequence in case of errors in the spectrum<br />
3.3 Input and Output<br />
Algorithms solving the DNA sequencing problem take as input the spectrum<br />
of all fragments. The output is the most likely DNA sequence<br />
that includes all fragments. In the case of an ideal spectrum, the resulting<br />
sequence would be of length n + l − 1, where n is the cardinality of the<br />
spectrum and l is the length of the fragments. However, because of errors, this<br />
may not always be the case. The algorithm presented here deals with errors,<br />
so the output sequence would not necessarily be of length n + l − 1. Negative<br />
errors may cause the output sequence to be shorter in length. Positive errors<br />
may cause it to be longer. In some other cases, those types of errors would<br />
mistakenly cause the algorithm to converge to a sequence that does not<br />
necessarily represent the optimal solution.<br />
3.4 Other Related Information<br />
Other topics that help the reader build basic background in<br />
computational biology in general, and in DNA sequencing<br />
in particular, are covered in the next few subsections.<br />
3.4.1 The Hamiltonian and Eulerian Paths<br />
The Hamiltonian and Eulerian path approaches are widely used methods to<br />
solve the DNA sequencing problem when the input spectrum has no errors.<br />
A Hamiltonian path in a graph is defined as a path that visits every vertex in<br />
the graph exactly once. To illustrate the similarity between finding a Hamiltonian path in<br />
a graph and finding the DNA sequence containing all fragments in S when<br />
there are no errors in the spectrum, let us define a graph G = (V,E), where<br />
V is the vertex set and E is the edge set, representing a spectrum S as<br />
follows. The vertices are the elements of the spectrum S. An edge between two<br />
vertices exists if and only if the corresponding fragments have an overlap of<br />
length l − 1, i.e., the last l − 1 characters of one fragment are the first l − 1<br />
characters of the other. Since the problem of finding a Hamiltonian path is known to<br />
be NP-hard, it is unlikely to admit a polynomial time algorithm. Researchers<br />
have therefore tried to transform the problem into another problem that can be solved in<br />
polynomial time [6]. Pavel Pevzner proposed a transformation of the graph<br />
into a new graph where the problem is equivalent to finding an Eulerian path<br />
[23].<br />
An Eulerian path in a graph G is a path that visits every edge in G exactly once. The<br />
idea is to reduce the fragment assembly problem to a variation of<br />
the classical Eulerian path problem. Unlike the Hamiltonian path problem,<br />
the Eulerian path problem admits polynomial time algorithms that find an<br />
Eulerian path in a graph if one exists. This approach proved to be very effective when<br />
assembling fragments that contain no errors [24]. In this paper, we use a<br />
similar approach to construct new, longer fragments, as will<br />
be shown in Section 4.<br />
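A minimal sketch of the Eulerian path idea for an error-free spectrum is given below. It is an assumption-laden illustration, not the paper's implementation: each fragment is treated as an edge from its (l−1)-character prefix to its (l−1)-character suffix, Hierholzer's algorithm is used as one standard way to find the path, all fragments are assumed to have the same length of at least 2, and the name `eulerianReconstruct` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Returns the reconstructed sequence, or "" if no path uses every fragment.
std::string eulerianReconstruct(const std::vector<std::string>& spectrum) {
    if (spectrum.empty() || spectrum.front().size() < 2) return "";
    std::map<std::string, std::vector<std::string>> next;  // prefix -> suffixes
    std::map<std::string, int> balance;                    // out-degree minus in-degree
    for (const std::string& f : spectrum) {
        const std::string prefix = f.substr(0, f.size() - 1);
        const std::string suffix = f.substr(1);
        next[prefix].push_back(suffix);
        ++balance[prefix];
        --balance[suffix];
    }
    // Start at a vertex with one more outgoing than incoming edge, if any.
    std::string start = next.begin()->first;
    for (const auto& entry : balance) {
        if (entry.second == 1) { start = entry.first; break; }
    }
    // Hierholzer's algorithm with an explicit stack.
    std::vector<std::string> stack{start};
    std::vector<std::string> path;
    while (!stack.empty()) {
        const std::string v = stack.back();
        auto it = next.find(v);
        if (it != next.end() && !it->second.empty()) {
            stack.push_back(it->second.back());
            it->second.pop_back();
        } else {
            path.push_back(v);
            stack.pop_back();
        }
    }
    std::reverse(path.begin(), path.end());
    if (path.size() != spectrum.size() + 1) return "";  // some edge was not reachable
    std::string sequence = path.front();
    for (std::size_t i = 1; i < path.size(); ++i) sequence += path[i].back();
    return sequence;
}
```

With the spectrum of Example 2 this sketch returns ACAGTGACTG; if a fragment is missing, the path breaks and the function returns an empty string, which is one reason the error case needs a different method.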
3.4.2 Sequence Alignment<br />
Once results are obtained from the DNA sequencing algorithm, a reliable<br />
method for determining the quality of the solution is needed. One<br />
way of doing this is by aligning the result sequence with the optimal sequence,<br />
i.e., sequence alignment. The idea of sequence alignment is very simple. First,<br />
it compares each character in the result sequence with the corresponding character in<br />
the optimal sequence from which the spectrum was obtained. This seems to<br />
be an easy task, but the challenge is to do it efficiently. Aligning<br />
sequences of different lengths is an additional issue in the sequence alignment process.<br />
Second, a scheme for determining the quality of the alignment is needed.<br />
This is done by using a scoring system. The scoring system adds a bonus,<br />
or match value, if there is a match between a character in one sequence and the<br />
corresponding character in the other sequence. A penalty, or mismatch value,<br />
is added if there is a mismatch between a character and its corresponding<br />
character. If the corresponding character is a gap, then the gap cost is added.<br />
Gap insertions occur when inserting a space in the original sequence causes<br />
the cumulative score to increase. Gap extensions occur when one sequence<br />
is shorter than the other sequence. The shorter sequence is extended with<br />
gaps. Extension will not occur if it reduces the score of the alignment. As a<br />
result of the scoring system, each alignment gets a cumulative score which<br />
decreases in poorly matched regions and increases in highly matched<br />
regions.<br />
One algorithm that is worth mentioning in the area of sequence alignment<br />
is the Smith-Waterman algorithm [27]. The Smith-Waterman alignment<br />
algorithm uses dynamic programming. Dynamic programming<br />
views the problem as a set of subproblems. It solves the smaller subproblems<br />
first and uses their results to solve larger subproblems until the solution to<br />
the original problem is obtained. Figure 5 shows the dynamic programming<br />
recurrence for sequence alignment.<br />
\[
F(i,j) = \max \begin{cases}
F(i-1,\,j-1) + s(x_i, y_j), \\
F(i-1,\,j) + d, \\
F(i,\,j-1) + d.
\end{cases}
\]<br />
Figure 5: Recurrence Formula for Sequence Alignment<br />
This recurrence is defined over the two sequences to be aligned, x and y. F(i,j) is<br />
defined as the score (or cost) of aligning the first i characters of one sequence with<br />
the first j characters of the other sequence. The initial values are trivial and<br />
can be determined quickly. The recurrence equation is applied repeatedly<br />
to fill the matrix of F(i,j). F(i,j) is the maximum over three candidates built from values<br />
that have already been computed: F(i − 1,j − 1), F(i − 1,j), and F(i,j − 1). Here s(x<sub>i</sub>, y<sub>j</sub>) is the<br />
score for aligning character x<sub>i</sub> with character y<sub>j</sub>, and d is the penalty for gap insertion or<br />
extension. Once the matrix is built, the alignment can be reconstructed<br />
from it in reverse order, using an additional data structure that stores the path<br />
taken when building the matrix.<br />
The main characteristics of the Smith-Waterman algorithm are as follows. The<br />
result alignment can start and end anywhere in the original sequence. That is,<br />
the aligned sequence can start or end with either the start or end<br />
character of the original sequence or a gap, i.e., it can begin and end internally.<br />
This feature is important especially when the start and end points<br />
of the sequence are not known. The algorithm produces an optimal local<br />
alignment with the highest score. The next example will clarify how the<br />
Smith-Waterman sequence alignment algorithm works.<br />
Example 4. Consider the following two sequences, s and t, where s=TTCC<br />
and t=AATT. The dynamic programming matrix representing the alignment of<br />
those two sequences is shown in Table 1. The matrix was obtained using the<br />
recurrence formula given in Figure 5, where s(x<sub>i</sub>, y<sub>j</sub>) is 0 if x<sub>i</sub> matches y<sub>j</sub>,<br />
and 1 otherwise. The gap insertion or extension cost, d, is 1. In this case, the<br />
matrix represents the penalty for aligning those two sequences, so the minimum<br />
rather than the maximum of the three candidate values is taken at each cell. To obtain<br />
the alignment, one starts from the bottom right corner of the<br />
matrix and follows the path back until it reaches the top left corner. The bottom<br />
right entry, 4, is the total penalty. From<br />
Table 1, one possible alignment of those two sequences uses<br />
four replacements. Another possible alignment uses two insertions and two<br />
deletions.<br />
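Table 1: The Dynamic Programming Matrix representation for Example 4 (rows: s=TTCC; columns: t=AATT; "-" marks the empty prefix)<br />
&nbsp; | - | A | A | T | T<br />
- | 0 | 1 | 2 | 3 | 4<br />
T | 1 | 1 | 2 | 2 | 3<br />
T | 2 | 2 | 2 | 2 | 2<br />
C | 3 | 3 | 3 | 3 | 3<br />
C | 4 | 4 | 4 | 4 | 4<br />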
s=TTCC<br />
t=AATT<br />
... with four proper replaces<br />
s=--TTCC<br />
t=AATT--<br />
... with two inserts and two deletes
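The matrix in Table 1 can be reproduced with a short dynamic programming routine. The sketch below is illustrative only: it fills a penalty matrix (match 0, mismatch 1, gap 1) and therefore takes the minimum of the three candidates at each cell, and the name `alignmentPenalty` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Fill the penalty matrix for aligning s (rows) with t (columns).
Matrix alignmentPenalty(const std::string& s, const std::string& t,
                        int mismatch = 1, int gap = 1) {
    Matrix F(s.size() + 1, std::vector<int>(t.size() + 1, 0));
    // Aligning a prefix with the empty string costs one gap per character.
    for (std::size_t i = 1; i <= s.size(); ++i) F[i][0] = F[i - 1][0] + gap;
    for (std::size_t j = 1; j <= t.size(); ++j) F[0][j] = F[0][j - 1] + gap;
    for (std::size_t i = 1; i <= s.size(); ++i) {
        for (std::size_t j = 1; j <= t.size(); ++j) {
            const int diagonal = F[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? 0 : mismatch);
            F[i][j] = std::min({diagonal, F[i - 1][j] + gap, F[i][j - 1] + gap});
        }
    }
    return F;
}
```

Calling `alignmentPenalty("TTCC", "AATT")` yields the rows of Table 1, with the total penalty 4 in the bottom right cell.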
4 Algorithm<br />
In this section we describe a genetic algorithm for solving the DNA sequencing<br />
problem when the input may have both positive and negative errors. We<br />
do not require that the starting fragment of the sequence be known, as is<br />
required in [6]. We use a steady-state genetic algorithm, together with a local<br />
optimization procedure, to help improve the performance of the algorithm.<br />
Additionally, we have a preprocessing step that improves the algorithm even<br />
further. The overall algorithm is given in Figure 6. In the following subsections<br />
we give more details of the algorithm.<br />
Sequence(S) // S is a spectrum<br />
  preprocess(S)<br />
  generate a random initial population P<br />
  for each a ∈ P<br />
    LocalOptimize(a)<br />
  endfor<br />
  repeat<br />
    select two parents p<sub>1</sub> and p<sub>2</sub><br />
    u ← crossover(p<sub>1</sub>, p<sub>2</sub>)<br />
    LocalOptimize(u)<br />
    mutate(u)<br />
    replace(u, p<sub>1</sub>, p<sub>2</sub>, P)<br />
  until (there is no improvement)<br />
  return the best member of P<br />
  align output sequence<br />
Figure 6: The Enhanced Genetic Algorithm for DNA sequencing<br />
4.1 Preprocessing<br />
In general, the fewer and longer the fragments in the spectrum are,<br />
the easier the problem is. The idea of preprocessing is to merge certain<br />
fragments together, thereby creating a new spectrum that has fewer<br />
and longer fragments. In this step we create long chains of fragments of the<br />
form F<sub>1</sub> ... F<sub>k</sub>, where the F<sub>i</sub> are fragments, and the last l − 1 characters of F<sub>i</sub><br />
match the first l − 1 characters of F<sub>i+1</sub>. Our objective is to make k as large as<br />
possible. Each such chain of fragments is merged into one fragment in the<br />
new spectrum. The algorithm then works with this spectrum, which has<br />
variable-length fragments and fewer fragments than the original<br />
spectrum.<br />
The preprocessing algorithm creates a chain by selecting an unused fragment<br />
in the spectrum and adding it to the chain. The fragment is then<br />
marked used. The algorithm extends the chain by selecting an unused fragment<br />
that has an overlap of l − 1 with the last fragment in the chain. If<br />
such a fragment exists, it is added to the chain and marked used, and the<br />
process is repeated. If there is no such fragment, the chain is terminated. The<br />
algorithm then starts a new chain. The algorithm terminates when all fragments<br />
in the original spectrum have been used. The algorithm then merges<br />
the fragments in each chain to create a fragment for the new spectrum. The<br />
algorithm essentially performs a modified topological sort on the<br />
graph induced by the input spectrum. In this graph, the fragments are the<br />
vertices and an edge exists between two fragments if and only if they overlap<br />
in l − 1 positions. This algorithm can be implemented efficiently using<br />
dynamic programming. The algorithm for the preprocessing step<br />
is shown in Figure 7.<br />
Preprocessing(S) // S is a spectrum<br />
  loop<br />
    start a new chain C with an unused fragment, and mark it as used<br />
    for each fragment f ∈ S<br />
      if (f is not used) and (right(C, l − 1) = left(f, l − 1)) then<br />
        append f to the end of C<br />
        mark f as used<br />
      endif<br />
    endfor<br />
    exit when all fragments are used<br />
  endloop<br />
Figure 7: The Preprocessing Algorithm.<br />
Example 5. Suppose we have a DNA sequence CTAGACGTTC of length<br />
10. An ideal spectrum would consist of the following six fragments: CTAGA,<br />
TAGAC, AGACG, GACGT, ACGTT, and CGTTC, where we have assumed<br />
that the fragment length is 5. However, because of errors from the hybridization<br />
experiment, an input spectrum in this case may consist of the following<br />
fragments: {CTAGA, TAGAC, AGACG, TATCC, ACGTT, CGTTC},<br />
where the cardinality of the spectrum is 6. This spectrum differs from the<br />
ideal spectrum in that it does not contain the fragment GACGT (a negative<br />
error); instead it contains the fragment TATCC, which is not a substring<br />
of the original DNA sequence (a positive error). Thus, this spectrum has<br />
one negative error and one positive error. Using this spectrum as the input,<br />
the preprocessing algorithm would produce the chains [CTAGA,<br />
TAGAC, AGACG], [TATCC], and [ACGTT, CGTTC], which yield the spectrum<br />
consisting of the three fragments f<sub>1</sub>, f<sub>2</sub>, and f<sub>3</sub>, where f<sub>1</sub>=CTAGACG,<br />
f<sub>2</sub>=TATCC, and f<sub>3</sub>=ACGTTC. The genetic algorithm then takes f<sub>1</sub>, f<sub>2</sub>, and<br />
f<sub>3</sub> as input instead of the original spectrum, thus improving the probability<br />
of finding the optimal answer and improving the running time by working with<br />
fewer fragments.<br />
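The chain-building step can be sketched as follows. This is a simple greedy illustration under stated assumptions, not the paper's implementation: every fragment has length l, chains are started in input order and extended only to the right, and the name `preprocess` is hypothetical. The result therefore depends on the order of the input fragments.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Merge fragments that overlap in exactly l-1 characters into longer fragments.
std::vector<std::string> preprocess(const std::vector<std::string>& spectrum,
                                    std::size_t l) {
    if (l == 0) return spectrum;
    std::vector<std::string> merged;
    std::vector<bool> used(spectrum.size(), false);
    for (std::size_t start = 0; start < spectrum.size(); ++start) {
        if (used[start]) continue;
        std::string chain = spectrum[start];  // start a new chain
        used[start] = true;
        bool extended = true;
        while (extended) {
            extended = false;
            for (std::size_t j = 0; j < spectrum.size(); ++j) {
                if (used[j]) continue;
                // right(chain, l-1) == left(fragment, l-1)
                if (chain.compare(chain.size() - (l - 1), l - 1,
                                  spectrum[j], 0, l - 1) == 0) {
                    chain += spectrum[j].back();  // only one new character is added
                    used[j] = true;
                    extended = true;
                    break;
                }
            }
        }
        merged.push_back(chain);
    }
    return merged;
}
```

Applied to the spectrum of Example 5 in the order listed there, the sketch produces CTAGACG, TATCC, and ACGTTC.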
4.2 Encoding and Initialization<br />
After the preprocessing step is completed, we work only with the new spectrum<br />
from the preprocessing step, which has lower cardinality and longer<br />
fragments of variable length. We assume that fragments in this spectrum<br />
are indexed in some order. Each member of the population is a vector of<br />
fragment indexes representing a possible solution sequence to the problem.<br />
Given a vector u[1...m] of fragment indexes, the corresponding sequence is<br />
obtained by merging the fragments F<sub>u[1]</sub>, F<sub>u[2]</sub>, ..., F<sub>u[m]</sub> in that order, where<br />
F<sub>i</sub> is the ith fragment in the spectrum. Here, two adjacent fragments are<br />
put together by overlapping them as much as possible. We also maintain the<br />
constraint that each fragment index appears at most once in the encoding<br />
of a sequence. An algorithm to repair the chromosome is given in Figure 8.<br />
The vectors in the population can be of variable length. In what follows, we<br />
refer to each member of the population as a vector or a sequence.<br />
An initial population of size 120 is generated at random. The size of the<br />
population remains constant throughout the algorithm. All the vectors in the<br />
initial population have the same size, i.e., each vector has the same number<br />
of fragment indexes. However, since the fragments are of variable length after<br />
the preprocessing step, the corresponding sequences have different lengths.<br />
Example 6. Suppose the spectrum obtained after preprocessing contains<br />
the fragments CTAGACG, TATCC, ACGTT, and the fragments are indexed<br />
in that order from 1 to 3. Then, the vector (1, 3) yields the sequence<br />
CTAGACGTT, and the vector (2, 1) yields the sequence TATCCTAGACG.<br />
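Decoding a vector of fragment indexes into a sequence can be sketched as below. The helper names `overlap` and `decode` are hypothetical, indexes are 1-based as in Example 6, and the overlap is computed between each pair of adjacent fragments, which is one reading of "overlapping them as much as possible".

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest suffix of a that is also a prefix of b.
std::size_t overlap(const std::string& a, const std::string& b) {
    const std::size_t maxLen = std::min(a.size(), b.size());
    for (std::size_t k = maxLen; k > 0; --k) {
        if (a.compare(a.size() - k, k, b, 0, k) == 0) return k;
    }
    return 0;
}

// Merge the fragments named by u (1-based indexes) in the given order.
std::string decode(const std::vector<std::size_t>& u,
                   const std::vector<std::string>& fragments) {
    std::string sequence;
    for (std::size_t i = 0; i < u.size(); ++i) {
        const std::string& current = fragments[u[i] - 1];
        if (i == 0) {
            sequence = current;
        } else {
            const std::string& previous = fragments[u[i - 1] - 1];
            sequence += current.substr(overlap(previous, current));
        }
    }
    return sequence;
}
```

With the three fragments of Example 6, the vector (1, 3) decodes to CTAGACGTT and the vector (2, 1) to TATCCTAGACG.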
4.3 Fitness<br />
The fitness of each sequence is calculated based on two factors: (i) the amount<br />
of overlap between adjacent fragments in the sequence, and (ii) the length<br />
of the sequence. The idea is that more overlap between adjacent fragments<br />
results in a shorter sequence. Also, if the length of the sequence is equal to<br />
n + l − 1, where n is the cardinality of the original spectrum and l is the length of<br />
each fragment, a bonus value is added to the fitness. Note that<br />
n + l − 1 is the optimal length for a sequence that includes all fragments in the<br />
spectrum. More formally, let u[1...k] be a vector representing a sequence U<br />
in the population. The fitness of U is defined as follows.<br />
\[
f(U) = c \times \sum_{i=1}^{k-1} \left| F_{u[i]} \cap F_{u[i+1]} \right| + z
\]<br />
where z is the bonus value defined by<br />
\[
z = \begin{cases}
s, \quad \text{if } |U| = n + l - 1, \\
\dfrac{s}{\bigl|\, |U| - (n + l - 1) \,\bigr|}, \quad \text{otherwise.}
\end{cases}
\]<br />
For our experiments, c was set to 10 and s to 100. The fitness can be<br />
computed efficiently using dynamic programming.<br />
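Example 7. Consider the two vectors from Example 6, with n + l − 1 = 6 + 5 − 1 = 10<br />
as in Example 5. The vector (1, 3) has an overlap of 3 and yields a sequence of length 9;<br />
the vector (2, 1) has an overlap of 1 and yields a sequence of length 11. Using the formula for<br />
calculating the fitness results in the following:<br />
• Fitness(chromosome (1, 3)) = [10 × 3] + 100/|9 − 10| = 130<br />
• Fitness(chromosome (2, 1)) = [10 × 1] + 100/|11 − 10| = 110<br />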
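The fitness computation can be sketched as follows, assuming the fragments are passed in chromosome order and that n and l refer to the original spectrum, as the numbers in Example 7 suggest. The names `overlap` and `fitness` are hypothetical, and this direct computation does not use the dynamic programming speed-up mentioned in the text.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest suffix of a that is also a prefix of b.
std::size_t overlap(const std::string& a, const std::string& b) {
    const std::size_t maxLen = std::min(a.size(), b.size());
    for (std::size_t k = maxLen; k > 0; --k) {
        if (a.compare(a.size() - k, k, b, 0, k) == 0) return k;
    }
    return 0;
}

// f(U) = c * (sum of adjacent overlaps) + z, with z as defined in the text.
double fitness(const std::vector<std::string>& ordered, std::size_t n, std::size_t l,
               double c = 10.0, double s = 100.0) {
    if (ordered.empty()) return 0.0;
    std::size_t overlapSum = 0;
    std::size_t length = ordered.front().size();
    for (std::size_t i = 1; i < ordered.size(); ++i) {
        const std::size_t k = overlap(ordered[i - 1], ordered[i]);
        overlapSum += k;
        length += ordered[i].size() - k;  // only the non-overlapping part is new
    }
    const std::size_t target = n + l - 1;  // optimal length, assumes n, l >= 1
    const std::size_t gap = length > target ? length - target : target - length;
    const double z = (gap == 0) ? s : s / static_cast<double>(gap);
    return c * static_cast<double>(overlapSum) + z;
}
```

For the two chromosomes of Example 7 this gives 130 and 110, matching the values above.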
4.4 Parent Selection<br />
The parents are selected using the standard proportional selection method<br />
where sequences that have higher fitness have a better chance of being selected.<br />
The standard roulette wheel scheme is used in our algorithm [14].<br />
4.5 Crossover<br />
An offspring is constructed by selecting segments alternately from each parent after k<br />
cut points have been determined. Note that the members of the population<br />
are vectors of fragment indexes. This process may create offspring that contain<br />
duplicated fragment indexes. A repair algorithm, shown in Figure 8, is<br />
used to get rid of any repeated fragments and to ensure all fragments are represented<br />
within the offspring. The repair algorithm works by replacing the<br />
repeated fragments with fragments from the spectrum that are not currently<br />
in use by the sequence. We used a 3-point crossover in our testing.<br />
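The k-point crossover can be sketched as follows. The sketch assumes both parents have the same length and that the cut points are given in ascending order (a cut at position i means the source parent switches before position i); the function name `crossover` and this convention are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Build a child by copying segments alternately from p1 and p2.
std::vector<int> crossover(const std::vector<int>& p1, const std::vector<int>& p2,
                           const std::vector<std::size_t>& cuts) {
    std::vector<int> child;
    child.reserve(p1.size());
    bool fromFirst = true;
    std::size_t nextCut = 0;
    for (std::size_t i = 0; i < p1.size(); ++i) {
        // Switch the source parent whenever a cut point is reached.
        while (nextCut < cuts.size() && cuts[nextCut] == i) {
            fromFirst = !fromFirst;
            ++nextCut;
        }
        child.push_back(fromFirst ? p1[i] : p2[i]);
    }
    return child;
}
```

With parents (0,1,2,3,4,5) and (5,4,3,2,1,0) and cut points 1, 3, and 5, the child is (0,4,3,3,4,0), which contains duplicated indexes and shows why a repair step is needed.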
4.6 Mutation<br />
A sequence is mutated with a pre-determined probability. The mutation<br />
is done by randomly selecting two fragments in the sequence and swapping<br />
them. In our experiments, the mutation probability was set to 10%. We also<br />
tested the algorithm with other values.<br />
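The swap mutation can be sketched as below. The name `mutate` and its signature are hypothetical; the two positions are drawn independently, so they may coincide, in which case the vector is left unchanged.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// With the given probability, swap two randomly chosen entries of u.
void mutate(std::vector<int>& u, double probability, std::mt19937& rng) {
    if (u.size() < 2) return;
    std::bernoulli_distribution shouldMutate(probability);
    if (!shouldMutate(rng)) return;
    std::uniform_int_distribution<std::size_t> pick(0, u.size() - 1);
    const std::size_t i = pick(rng);
    const std::size_t j = pick(rng);
    std::swap(u[i], u[j]);
}
```

Because the operation only swaps entries, a vector without duplicates stays free of duplicates and needs no repair after mutation.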
Repair(C) // C is a chromosome<br />
  create an array a with size |C|<br />
  create two empty queues Q1 and Q2<br />
  for each fragment f ∈ C<br />
    check the current place p for fragment f in array a<br />
    if a[p] is marked then<br />
      add p to Q1[top]<br />
    else<br />
      mark a[p] as used<br />
    endif<br />
  endfor<br />
  for all fragments f ∈ a where f is not used<br />
    Q2[top] ← f<br />
  endfor<br />
  update C: replace fragments f<sub>i</sub> ∈ (Q1) with fragments f<sub>j</sub> ∈ (Q2)<br />
Figure 8: The Algorithm to repair the Chromosome<br />
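One way to realize the repair step of Figure 8 is sketched below. It is an interpretation, not the paper's code: fragment indexes are assumed to be 0-based and smaller than `fragmentCount`, the first queue stores the positions in the chromosome that hold a repeated index, and the name `repair` is hypothetical.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Replace repeated fragment indexes in c with indexes that do not occur in c.
void repair(std::vector<std::size_t>& c, std::size_t fragmentCount) {
    std::vector<bool> used(fragmentCount, false);
    std::queue<std::size_t> duplicates;  // Q1: positions holding a repeated index
    for (std::size_t pos = 0; pos < c.size(); ++pos) {
        if (used[c[pos]]) {
            duplicates.push(pos);
        } else {
            used[c[pos]] = true;
        }
    }
    std::queue<std::size_t> unused;      // Q2: indexes not present in c
    for (std::size_t f = 0; f < fragmentCount; ++f) {
        if (!used[f]) unused.push(f);
    }
    while (!duplicates.empty() && !unused.empty()) {
        c[duplicates.front()] = unused.front();
        duplicates.pop();
        unused.pop();
    }
}
```

For example, the chromosome (0,4,3,3,4,0) over six fragments becomes (0,4,3,1,2,5).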
4.7 Local Optimization<br />
The local optimization algorithm has two steps. The first step is to scan the<br />
sequence sequentially and identify a pair of adjacent fragments, say x and<br />
y, with the smallest overlap. Then, we find the fragment, say z, which has<br />
the highest overlap with x. Then, we replace y with z. The vector is then<br />
repaired, if needed, to eliminate duplicated fragments.<br />
The second step is to rearrange the fragments in the sequence in the<br />
hope of improving its fitness value. This is done by first finding the two<br />
pairs of adjacent fragments that have the two smallest overlaps. Let s,t<br />
be the first pair and x,y be the second pair. That is, assume that the<br />
vector u = (a,...,s,t,...,x,y,...,z). We then construct a new vector u′ by<br />
swapping the block of fragments from t to x with the block of fragments from y to the<br />
end of u. Thus, u′ = (a,...,s,y,...,z,t,...,x). If the fitness of u′ is better<br />
than that of u, we replace u by u′. Otherwise, we keep u and discard u′.<br />
By using dynamic programming, the local optimization algorithm<br />
and the repair algorithm can be implemented efficiently.<br />
LocalOptimize(C) // C is a chromosome<br />
  for each fragment f ∈ C<br />
    find two fragments, x and y, with min intersection between them<br />
    scan matrix M<br />
    find fragment z with max intersection with x<br />
    replace y with z<br />
    repair(C)<br />
  endfor<br />
  for each fragment f ∈ C<br />
    C<sub>1</sub> ← C<br />
    find two fragments, x and y, with smallest intersection between them<br />
    find another two fragments, s and t, with second smallest intersection between them<br />
    C<sub>2</sub> ← swap([t ... x],[y ... end])<br />
    if fitness(C<sub>2</sub>) > fitness(C<sub>1</sub>)<br />
      return (C<sub>2</sub>)<br />
    else<br />
      return (C<sub>1</sub>)<br />
    endif<br />
  endfor<br />
Figure 9: The Algorithm for the Local Optimization Process<br />
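The rearrangement in the second step, from u = (a,...,s,t,...,x,y,...,z) to u′ = (a,...,s,y,...,z,t,...,x), amounts to rotating the tail of the vector. The sketch below shows only that rearrangement, assuming the positions of t and y have already been found; the name `swapBlocks` is hypothetical, and the fitness comparison is left to the caller.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Move the block [yPos, end) in front of the block [tPos, yPos).
std::vector<int> swapBlocks(const std::vector<int>& u,
                            std::size_t tPos, std::size_t yPos) {
    std::vector<int> result = u;
    if (tPos >= yPos || yPos >= u.size()) return result;  // nothing to rearrange
    const auto first = result.begin() + static_cast<std::ptrdiff_t>(tPos);
    const auto middle = result.begin() + static_cast<std::ptrdiff_t>(yPos);
    std::rotate(first, middle, result.end());
    return result;
}
```

For u = (0,1,2,3,4,5,6) with t at position 2 and y at position 5, the result is (0,1,5,6,2,3,4); the caller would keep it only if its fitness exceeds that of u.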
4.8 Replacement Scheme<br />
The following replacement scheme is used. If the fitness of the new offspring<br />
is larger than the fitness of the poorer of the two parents, then we replace<br />
that parent with the new offspring. Otherwise, we discard the new offspring.<br />
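The replacement rule can be sketched in a few lines. The `Individual` struct and the function name `replaceWorseParent` are hypothetical; fitness values are assumed to be stored with each individual.

```cpp
#include <cstddef>
#include <vector>

struct Individual {
    std::vector<int> genes;  // fragment indexes
    double fitness;
};

// Replace the poorer parent if the child is fitter; return true if replaced.
bool replaceWorseParent(std::vector<Individual>& population,
                        std::size_t p1, std::size_t p2, const Individual& child) {
    const std::size_t worse =
        population[p1].fitness <= population[p2].fitness ? p1 : p2;
    if (child.fitness > population[worse].fitness) {
        population[worse] = child;
        return true;
    }
    return false;  // the offspring is discarded
}
```

Since at most one individual changes per generation, the population size stays constant, as required by the steady-state scheme.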
4.9 Stopping Condition<br />
The algorithm terminates if there is no improvement in the total fitness of the<br />
population in 400 consecutive generations, or if the number of generations<br />
exceeds 50,000. These values were set to give<br />
the best compromise between the quality of the solution and the<br />
performance of the algorithm.<br />
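The stopping condition can be tracked with a small helper such as the one below. The class name `StopCriterion` and its interface are hypothetical; the defaults follow the values given in the text.

```cpp
// Tracks the two stopping conditions: stalled total fitness and a generation cap.
class StopCriterion {
public:
    explicit StopCriterion(int maxStall = 400, int maxGenerations = 50000)
        : maxStall_(maxStall), maxGenerations_(maxGenerations) {}

    // Call once per generation with the population's total fitness.
    // Returns true when the algorithm should stop.
    bool update(double totalFitness) {
        ++generation_;
        if (generation_ == 1 || totalFitness > best_) {
            best_ = totalFitness;
            stall_ = 0;  // improvement resets the stall counter
        } else {
            ++stall_;
        }
        return stall_ >= maxStall_ || generation_ > maxGenerations_;
    }

private:
    int maxStall_;
    int maxGenerations_;
    int generation_ = 0;
    int stall_ = 0;
    double best_ = 0.0;
};
```

In the main loop, `update` would be called after the replacement step, and the loop would exit as soon as it returns true.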
5 Standardizing the Data<br />
Data play an important role in the computational biology field. Most algorithms<br />
in this field deal with large amounts of data, so obtaining the right<br />
set of data is crucial to developing good algorithms. The data have to be<br />
diverse and have different characteristics. In the area of DNA sequencing,<br />
we identified the following important characteristics when obtaining the test<br />
data. All data sets should come from different genomes. Each genome should<br />
have its own features. Spectra used in testing should have a variety of error<br />
percentages, both negative and positive, as well as different fragment<br />
lengths.<br />
In order to obtain data with the characteristics mentioned above, several<br />
steps have to be taken. The first step is to obtain genomes that have those<br />
characteristics. We selected three different genomes from GenBank [13].<br />
Table 2 shows the details of the genomes obtained and used in testing our<br />
new algorithm.<br />
Table 2: Genomes obtained from the GenBank<br />
Sequence | Length (bp)<br />
Human immunodeficiency virus 2 (HIV) | 10,359<br />
Drosophila melanogaster DNA sequence of white locus (Fly) | 14,245<br />
Canis familiaris clone RP81-60B6 (Complete Dog genome) | 165,116<br />
The second step is to develop an algorithm that simulates the hybridization<br />
experiment and generates spectra. It generates spectra in two<br />
different ways. The first is to simulate the hybridization experiment<br />
by generating a spectrum with a specific upper bound on the error<br />
percentages. The second is to generate a complete set of data with<br />
different parameters, such as different fragment lengths, different<br />
error percentages, and different sequence lengths. Using these variations<br />
of test data in testing our enhanced genetic algorithm ensured that<br />
it works well for different genomes with different characteristics and with data<br />
that are more practical and more diverse. The spectrum generator algorithm<br />
is briefly described in Figure 10.<br />
Generate(Genome)<br />
  read length of output spectrum<br />
  for each error combination e<sub>1</sub> in {0,5,8,10,20}<br />
    for each fragment length l<sub>2</sub> in {10,20,50}<br />
      for i=1...10<br />
        generate output file name<br />
        randomly select a start point in the input sequence<br />
        generate all fragments of length l<sub>2</sub> in the spectrum<br />
        introduce e<sub>1</sub>% positive errors<br />
        introduce e<sub>1</sub>% negative errors<br />
      endfor<br />
    endfor<br />
  endfor<br />
Figure 10: The Spectrum Generator Algorithm.<br />
The spectrum generator algorithm first determines the length of the sequence<br />
to be generated. Then, using the Mersenne Twister (MT) random<br />
number generator [20], it picks a random starting point and reads a<br />
sequence of the desired length. It ensures that all generated sequences are<br />
different. Next, it generates fragments with zero positive and<br />
negative errors. It then generates data with 5%, 8%, 10%, and 20% errors<br />
for negative, positive, and both types of errors. Table 3 shows the different<br />
combinations of errors generated using the spectrum generator algorithm. The<br />
total number of error combinations generated by the algorithm is 13. The<br />
algorithm also generates data with different fragment lengths to study the<br />
effect of using longer fragments on the quality of the output solution. The<br />
generator is able to produce spectra of any fragment length and size. We<br />
tested the genetic algorithm with fragments of length 10, 20, and 50 nucleotides.<br />
For each combination mentioned above, 10 different sequences, obtained from<br />
different positions within the parent genome, have been generated.<br />
Table 3: Positive and negative error combinations used in testing the algorithm<br />
&nbsp; | +0% | +5% | +8% | +10% | +20%<br />
-0% | x | x | x | x | x<br />
-5% | x | x | | |<br />
-8% | x | | x | |<br />
-10% | x | | | x |<br />
-20% | x | | | | x<br />
The technique used in generating fragments with errors divides the process<br />
into two steps. The first step is to produce the positive errors. The<br />
second step is to produce the negative errors. The numbers of positive and<br />
negative error fragments are calculated from the percentage of errors<br />
in the output fragments. For positive errors, a random<br />
fragment index in the fragment array is selected. Then, a random<br />
position within that fragment is selected, at which the correct nucleotide is<br />
replaced with another, randomly chosen nucleotide to introduce the positive error.<br />
The negative error is much simpler, because only a random fragment index is<br />
selected and then that fragment is removed from the final spectrum.<br />
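The two error-injection steps can be sketched as follows. Several points are assumptions made for illustration and are not stated in the paper: a positive error is added as a mutated copy of a randomly chosen fragment (the original fragment is kept), negative errors remove only original fragments, error counts are rounded down, and a mutated copy that happens to coincide with an existing fragment is not filtered out. The name `introduceErrors` is hypothetical.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <vector>

std::vector<std::string> introduceErrors(std::vector<std::string> spectrum,
                                         unsigned positivePercent,
                                         unsigned negativePercent,
                                         std::mt19937& rng) {
    const std::size_t original = spectrum.size();
    if (original == 0 || spectrum.front().empty()) return spectrum;
    const std::size_t positives = original * positivePercent / 100;
    const std::size_t negatives = original * negativePercent / 100;
    const std::string alphabet = "ACGT";

    // Positive errors: copy a random original fragment and change one nucleotide.
    std::uniform_int_distribution<std::size_t> pickFragment(0, original - 1);
    std::uniform_int_distribution<std::size_t> pickBase(0, alphabet.size() - 1);
    for (std::size_t e = 0; e < positives; ++e) {
        std::string f = spectrum[pickFragment(rng)];
        std::uniform_int_distribution<std::size_t> pickPos(0, f.size() - 1);
        const std::size_t pos = pickPos(rng);
        char replacement = f[pos];
        while (replacement == f[pos]) replacement = alphabet[pickBase(rng)];
        f[pos] = replacement;
        spectrum.push_back(f);  // erroneous fragments are stored after the originals
    }

    // Negative errors: remove random fragments among the remaining originals.
    std::size_t remaining = original;
    for (std::size_t e = 0; e < negatives && remaining > 0; ++e) {
        std::uniform_int_distribution<std::size_t> pickVictim(0, remaining - 1);
        spectrum.erase(spectrum.begin() + static_cast<std::ptrdiff_t>(pickVictim(rng)));
        --remaining;
    }
    return spectrum;
}
```

With ten input fragments, 20% positive and 20% negative errors add two erroneous fragments and remove two original ones, so the spectrum again has ten fragments.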
Another source of data that we used when testing our enhanced<br />
algorithm is from [7]. This data set has the following characteristics. It<br />
has a fragment length of 10, 20% negative errors, and 20% positive errors.<br />
Usually, in the hybridization experiment, the practical percentages of errors<br />
are in the range from 1% to 3% [23]. Nonetheless, our algorithm performed<br />
very well when tested using this set of data, as will be seen in the next section.<br />
6 Experimental Results<br />
In this section we first compare the performance of our algorithm<br />
with that of some existing algorithms for the DNA sequencing problem. We<br />
then show the results of testing our algorithm on an extensive set of data that<br />
we generated as described in the previous section. Our algorithm was<br />
implemented in C++ and was run on a PC with a 2.4GHz Intel Pentium IV<br />
processor and 512MB of RAM.<br />
Our first set of test data is from [7][18]. We used it to compare our algorithm<br />
to the Tabu Search algorithm in [1] and the Hybrid Genetic Algorithm<br />
in [7]. This set consists of spectra having 100, 200, 300, 400,<br />
and 500 fragments. There are 40 spectra for each size, for a total of 200<br />
instances. The fragment length in all of these instances is 10. Each instance<br />
has 20% positive errors and 20% negative errors.<br />
Figure 11: Comparison between our algorithm and others, in three panels (a), (b), and (c)<br />
To determine the quality of our solution, we follow [7] and use the classical<br />
pairwise Smith-Waterman sequence alignment algorithm described previously<br />
to compare the output of the algorithm with the original sequences from<br />
which the spectra were generated. We use two values from the output<br />
of the Smith-Waterman algorithm: the match percentage and the similarity<br />
score. In addition, as in [7], for each problem size we report the optimum<br />
number, that is, the number of instances of that size for which the<br />
algorithm reaches the optimal answer. Table 4 and Figure 11 summarize the<br />
results of the comparison and show that our algorithm performs significantly<br />
better than the other two algorithms as the sequence length gets longer,<br />
although the other two algorithms score slightly higher on the smallest<br />
instances. More details can be seen in Figures 11(a), (b), and (c) (see<br />
footnote 1).<br />
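As an illustration of the two quality measures, the following sketch computes<br />
a Smith-Waterman local alignment score together with the number of matched<br />
positions on an optimal alignment. It is not the implementation from [19]. The<br />
scoring values (match +1, mismatch -1, gap -1) and the definition of the match<br />
percentage as matched positions divided by the length of the original sequence<br />
are assumptions made for this example.<br />

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best local alignment score, matched positions on that alignment).

    The scoring parameters are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # matches[i][j]: matched positions on one optimal path ending at (i, j)
    matches = [[0] * cols for _ in range(rows)]
    best = (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            s = max(0, diag, up, left)
            if s == 0:
                m = 0                      # local alignment restarts here
            elif s == diag:
                m = matches[i - 1][j - 1] + (1 if same else 0)
            elif s == up:
                m = matches[i - 1][j]
            else:
                m = matches[i][j - 1]
            score[i][j], matches[i][j] = s, m
            best = max(best, (s, m))
    return best

def match_percentage(original, reconstructed):
    """Matched positions as a percentage of the original sequence length."""
    _, matched = smith_waterman(original, reconstructed)
    return 100.0 * matched / len(original)
```

With these assumed parameters, comparing ACGTACGT with ACGTTCGT gives a<br />
similarity score of 6 and a match percentage of 87.5, since seven of the eight<br />
positions match and one mismatch is penalized.<br />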
Although running times are available for all three algorithms, we cannot<br />
compare them directly, since different machines were used. For our<br />
algorithm, the average running time ranged from 0.6 seconds for the smallest<br />
problem size to 15.1 seconds for the largest problem size tested. The Hybrid<br />
GA and Tabu Search algorithms were run on a PC with a Pentium II 300MHz<br />
processor and 256MB of RAM. Their average running times range from 13.5<br />
seconds to 437.9 seconds for the Hybrid GA and from 14.1 seconds to 471.5<br />
seconds for the Tabu Search algorithm. More details can be found in Table 4.<br />
Table 4: Summary of results from our algorithm and others<br />
Measure | 100 | 200 | 300 | 400 | 500<br />
Enhanced GA (Pentium IV, 2.4GHz, 512MB RAM):<br />
Average Similarity Score (pt) | 106.7 | 200.6 | 291.6 | 352.2 | 451.6<br />
Average Similarity Score (%) | 98.9 | 97.6 | 96.7 | 92.9 | 92.0<br />
Running time (sec) | 0.6 | 1.5 | 3.6 | 8.6 | 15.1<br />
Optimum no. | 29 | 26 | 22 | 13 | 13<br />
HGA (Pentium II, 300MHz, 256MB RAM):<br />
Average Similarity Score (pt) | 108.4 | 199.3 | 274.1 | 301.7 | 326.0<br />
Average Similarity Score (%) | 99.7 | 97.7 | 94.3 | 86.9 | 82.0<br />
Running time (sec) | 13.5 | 63.4 | 154.9 | 263.4 | 437.9<br />
Optimum no. | 40 | 31 | 20 | 9 | 5<br />
Tabu Search (Pentium II, 300MHz, 256MB RAM):<br />
Average Similarity Score (pt) | 108.4 | 184.1 | 196.6 | 229.5 | 235.1<br />
Average Similarity Score (%) | 99.7 | 94.0 | 81.8 | 78.1 | 73.1<br />
Running time (sec) | 14.1 | 60.8 | 177.7 | 258.3 | 471.5<br />
Optimum no. | 40 | 24 | 11 | 6 | 2<br />
1 The algorithms included in the comparison in Figure 11 are the following: our new Enhanced GA,<br />
the Hybrid GA, and the Tabu Search. Figure 11(a) shows the Match Percentage, Figure 11(b) shows the<br />
Optimum Number, and Figure 11(c) shows the Similarity Score.
The second set of test data was generated by the spectrum generator<br />
algorithm using the three genomes obtained from GenBank [13] and<br />
referenced in Table 2. Table 3 lists all spectrum error combinations, positive<br />
and negative, that were used to test the algorithm. For each of the three<br />
genomes we tested the GA using 10 different sequences of each length in the<br />
set {100, 200, 300, 400, 500, 1000, 2000}. That is, for each genome we tested<br />
the algorithm using 70 different sequences. From each of these sequences we<br />
generated spectra with fragments of length 10 and 20. For each fragment length,<br />
13 different combinations of positive and negative errors were used, ranging<br />
from 0 to 20% errors as shown in Table 3. Hence, we tested all three genomes<br />
using a total of 3 × 70 × 2 × 13 = 5,460 spectra.<br />
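The count above can be checked by enumerating the experiment grid. In the<br />
sketch below the genome names and the error-combination indices are<br />
placeholders, not the actual entries of Tables 2 and 3.<br />

```python
from itertools import product

GENOMES = ["genome_1", "genome_2", "genome_3"]   # placeholder names
SEQ_LENGTHS = [100, 200, 300, 400, 500, 1000, 2000]
SEQS_PER_LENGTH = 10
FRAGMENT_LENGTHS = [10, 20]
ERROR_COMBOS = list(range(13))                   # placeholder indices, one per error combination

def experiment_grid():
    """Yield one tuple per spectrum in the second data set."""
    return product(GENOMES, SEQ_LENGTHS, range(SEQS_PER_LENGTH),
                   FRAGMENT_LENGTHS, ERROR_COMBOS)

sequences_per_genome = len(SEQ_LENGTHS) * SEQS_PER_LENGTH
total_spectra = sum(1 for _ in experiment_grid())
print(sequences_per_genome, total_spectra)
```

Each tuple identifies one spectrum by genome, sequence length, sequence<br />
number, fragment length, and error combination.<br />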
We also generated a similar set of spectra with fragments of length<br />
50. The algorithm was tested on spectra of all three fragment lengths: 10, 20,<br />
and 50. We observe that the longer the fragments are, the better the results<br />
are. In fact, with a fragment length of 50, our algorithm almost always found<br />
the optimal answers, and thus we do not include the data for fragments of<br />
length 50 here. We used different fragment lengths since in practice different<br />
hybridization techniques may require different fragment lengths. Normally,<br />
the hybridization rate is better if the fragment length is longer. However, for<br />
in situ hybridization, a small fragment length is required [21].<br />
As in the case of the first data set, we use the Smith-Waterman sequence<br />
alignment algorithm to determine the quality of the solutions returned by the<br />
algorithm. We used an implementation of the Smith-Waterman algorithm<br />
provided by Jie Li of Iowa State University [19]. Figures 12 and 13 show the<br />
performance and running time of our algorithm on the second set of data. In<br />
Figure 12, the graph shows the match percentage for spectra with fragment<br />
length 10. The x-axis shows the various error combinations in the input spectrum.<br />
The notation -a+b indicates spectra with a% negative errors and b%<br />
positive errors. Thirteen different error combinations were used to verify the<br />
performance of the new algorithm. These error combinations were selected<br />
for several reasons. Previous research in the same area used similar<br />
values for testing. We also wanted to provide error combinations that are<br />
uniformly distributed over the range from 0 to 20%. We found experimentally<br />
that these values are adequate to illustrate the strength of<br />
our new algorithm compared to other existing algorithms. Figure 13 shows<br />
the running time for spectra with a fragment length of 10.<br />
Figure 12: Plot of Match Percentage against Error Combinations (l=10)<br />
Figure 13: Plot of Running Time against Error Combinations (l=10)
Figure 14 shows the match percentage for spectra with a fragment length of 20.<br />
For each error combination, the match percentage shown is the average over all<br />
spectra generated from all three genomes. In all cases, the match percentage is<br />
over 90%. It can be observed that spectra with a longer fragment length appear<br />
to be easier to solve than the ones with a shorter fragment length. The same<br />
machine that we used to test the algorithm on the first data set was used<br />
for the second data set. Figure 15, for a fragment length of 20, shows that<br />
spectra with a higher error percentage seem to take longer than spectra with<br />
a lower error percentage.<br />
Figure 14: Plot of Match Percentage against Error Combinations (l=20)<br />
Figure 15: Plot of Running Time against Error Combinations (l=20)<br />
The next few figures provide more information about the performance<br />
of our new algorithm with respect to different types of input. Figure 16<br />
shows the match percentage against only the negative errors for all fragment<br />
lengths. The match percentages range from 96% to 99%, which we consider<br />
very good given the cardinality of the input spectrum. The graph shows that<br />
in some cases the match percentage is lower even though the error percentage<br />
is lower. This is because the spectrum generator is randomized, so a<br />
spectrum with fewer errors might be harder to reconstruct than another<br />
spectrum with more errors. Other types of errors, such as the repeated<br />
fragment error, also affect the match percentage.<br />
Figure 16: Plot of Match Percentage against Negative Errors<br />
Figure 17 shows that the running time increases with the percentage of<br />
negative errors: the higher the negative error percentage, the longer the time<br />
needed to obtain the result. The plot in Figure 17 summarizes results for all<br />
fragment lengths.<br />
Figure 17: Plot of Running Time against Negative Errors<br />
The match percentage for positive errors, as shown in Figure 18, is between<br />
96% and 98%, which is considered very good given that the cardinality<br />
of the spectrum is approximately 500 fragments. The plot shows the match<br />
percentage against the positive error percentage in the spectrum, over all<br />
fragment lengths, all negative errors, and all spectra sizes.<br />
In general, the running time increases as the percentage of positive errors<br />
increases. However, Figure 19 shows that the running time for 0% positive<br />
errors was worse than the running time for 5% or 8%. This is because, with<br />
short fragments, another type of error exists in the spectrum, known as the<br />
repeated fragments error. The repeated fragments error<br />
can prevent the genetic algorithm from obtaining optimal answers<br />
even when no other types of errors are present. Another factor that<br />
affects the quality of the solution obtained is the nature of the input sequences<br />
and spectra: some sequences are harder to reconstruct than others.<br />
Figure 18: Plot of Match Percentage against Positive Errors<br />
Figure 19: Plot of Running Time against Positive Errors<br />
Figure 20 shows how the match percentage decreases as the length of<br />
the chromosome increases. Longer chromosomes have spectra with larger<br />
cardinality. As the cardinality of the input spectrum increases, the chance<br />
of obtaining the optimal answer decreases.<br />
Figure 20: Plot of Match Percentage against Optimal Length<br />
It is clear from Figure 21 that the longer the chromosome, the<br />
longer the time needed to find the result. This is because longer chromosomes<br />
consist of more fragments, so the cardinality of the spectrum is larger.<br />
Larger spectra thus increase the running time needed to reconstruct the<br />
sequence and obtain the result.<br />
Figure 21: Plot of Running Time against Optimal Length
Figure 22 shows that the longer the fragment, the better the match percentage.<br />
The match percentages improve with increasing fragment<br />
length because longer fragments have a smaller chance of being repeated. Also,<br />
longer fragments decrease the effect of errors in the spectrum.<br />
Figure 22: Plot of Match Percentage against Fragment Length<br />
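The effect of fragment length on repeated fragments can be explored with a<br />
short script. This sketch counts, for several fragment lengths, how many<br />
distinct fragments occur more than once in a sequence. The function name<br />
repeated_fragments is hypothetical, and the uniformly random sequence is only<br />
a stand-in for a real genome, which is an assumption of this example.<br />

```python
import random
from collections import Counter

def repeated_fragments(sequence, l):
    """Return the number of distinct length-l fragments that occur more than once."""
    counts = Counter(sequence[i:i + l] for i in range(len(sequence) - l + 1))
    return sum(1 for c in counts.values() if c > 1)

# Assumption: a uniformly random sequence of length 2000 stands in for a genome.
rng = random.Random(2004)
sequence = "".join(rng.choice("ACGT") for _ in range(2000))
for l in (4, 6, 10, 20):
    print(l, repeated_fragments(sequence, l))
```

For very short fragments repeats are unavoidable, since a sequence of length<br />
2000 contains 1997 fragments of length 4 but only 256 distinct ones are<br />
possible. Real genomes may contain more repeats than a random sequence.<br />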
Figure 23 shows that the running time of the algorithm improves as the<br />
fragment length increases. This follows from the fact that increasing<br />
the fragment length decreases the cardinality of the spectrum. It also reduces<br />
the number of fragments each chromosome consists of, which causes the result<br />
to be obtained more quickly.<br />
Experimental results from the two data sets suggest that our algorithm<br />
performs very well compared with existing algorithms. It is also robust against<br />
different combinations of errors.<br />
Figure 23: Plot of Running Time against Fragment Length<br />
7 Conclusion<br />
This study has introduced a new enhanced genetic algorithm for the DNA<br />
Sequencing problem. The results produced by the algorithm were very good:<br />
in many cases they were optimal or close to optimal, and they were frequently<br />
better than those of existing algorithms. Taking into account the difference in<br />
speed of the machines on which the various algorithms were run, our algorithm<br />
seems to be comparable to, if not faster than, existing algorithms. One area we<br />
did not cover in this research is the repeated fragments error. Repeated fragments<br />
can prevent the algorithm from finding optimal answers even when there are<br />
no other types of errors in the spectrum. This type of error diminishes as the<br />
fragment length increases. We plan to investigate the<br />
problem of repeated fragments further.<br />
References<br />
[1] Blazewicz, J., P. Formanowicz, F. Glover, M. Kasprzak, and J. Weglarz,<br />
“An Improved Tabu Search Algorithm for DNA Sequencing with Errors,”<br />
Proceedings of the III Metaheuristics International Conference<br />
(MIC), 1999, pp. 69–75.<br />
[2] Blazewicz, J., P. Formanowicz, M. Kasprzak, W. T. Markiewicz, and<br />
J. Weglarz, “DNA Sequencing with Positive and Negative Errors,”<br />
Journal of Computational Biology 6, 1999, pp. 113–123.<br />
[3] Blazewicz, J., P. Formanowicz, M. Kasprzak, W. T. Markiewicz, and J.<br />
Weglarz, “Tabu Search for DNA Sequencing with False Negatives and<br />
False Positives,” European Journal of Operational Research 125, 2000,<br />
pp. 257–265.<br />
[4] Blazewicz, J. and M. Kasprzak, “Complexity of DNA Sequencing by Hybridization,”<br />
Theoretical Computer Science, 290, 2003, pp. 1459–1473.<br />
[5] Blazewicz, J., A. Kaczmarek, M. Kasprzak, W. T. Markiewicz and J.<br />
Weglarz, “Sequential and Parallel Algorithms for DNA Sequencing,”<br />
Computer Applications in the Biosciences 13, 1997, pp. 151–158.<br />
[6] Blazewicz, J., J. Kaczmarek, M. Kasprzak, J. Weglarz and W. T.<br />
Markiewicz, “Sequential Algorithms for DNA Sequencing,” Computational<br />
Methods in Science and Technology, 1, 1996, pp. 31–42.<br />
[7] Blazewicz, J., M. Kasprzak and W. Kuroczycki, “Hybrid Genetic Algorithm<br />
for DNA Sequencing with Errors,” Journal of Heuristics, 8, 2002,<br />
pp. 495–502.<br />
[8] Blum, A., T. Jiang, M. Li, J. Tromp and M. Yannakakis, “Linear Approximation<br />
of Shortest Superstrings,” Journal of the ACM, 41(4), 1994,<br />
pp. 630–647.<br />
[9] Bui, T. and W. Youssef, “An Enhanced Genetic Algorithm for DNA<br />
Sequencing by Hybridization with Positive and Negative Errors,” to appear<br />
in Genetic and Evolutionary Computation Conference (GECCO),<br />
June 26–30, 2004.<br />
[10] Cummings, M. R., Human Heredity: Principles and Issues, West Publishing<br />
Company, 1991.<br />
[11] Fogel, G. B. and K. Chellapilla, “Simulated Sequencing by Hybridization<br />
Using Evolutionary Programming,” Proc. of the IEEE Congress on<br />
Evolutionary Computation, CEC’99, 1999, pp. 445–452.<br />
[12] Fogel, G. B., K. Chellapilla and D. B. Fogel, “Reconstruction of DNA<br />
Sequence Information From a Simulated DNA Chip Using Evolutionary<br />
Programming,” Lecture Notes in Computer Science, edited by V. W.<br />
Porto, N. Saravanan, D. Waagen and A. E. Eiben, Vol. 1447, 1998, pp.<br />
429–436.<br />
[13] GenBank, http://www.ncbi.nlm.nih.gov/Genbank<br />
[14] Goldberg, D. E., Genetic Algorithms in Search, Optimization, and Machine<br />
Learning, Addison-Wesley, 1989.<br />
[15] Holland, J., Adaptation in Natural and Artificial Systems, Ann Arbor:<br />
University of Michigan Press, 1975.<br />
[16] Haan, N. M. and S. J. Godsill, “Sequential Methods For DNA Sequencing,”<br />
Department of Engineering, University of Cambridge, U.K., 2001.<br />
[17] The Human Genome Project of the U.S. Department of Energy,<br />
http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml<br />
[18] Kasprzak, M., Personal communications, August 2003.<br />
[19] Li, J., “Implementation of Smith-Waterman Alignment Algorithm,” Iowa<br />
State University, Personal Communication, 2003.<br />
[20] Matsumoto, M. and T. Nishimura, “Mersenne Twister: A 623-Dimensionally<br />
Equidistributed Uniform Pseudo-Random Number Generator,”<br />
ACM Transactions on Modeling and Computer Simulation,<br />
8(1), January 1998, pp. 3–30.<br />
[21] Nonradioactive In Situ Hybridization Application Manual, Technical<br />
Manual, Roche Applied Science.<br />
[22] Percus, A. G. and D. C. Torney, “Greedy Algorithms for Optimized<br />
DNA Sequencing,” Technical Report, Los Alamos National Laboratory,<br />
Los Alamos, NM 87545.<br />
[23] Pevzner, P. A., Computational Molecular Biology, An Algorithmic Approach,<br />
The MIT Press, second printing 2001, Chapter 4, pp. 59–63.<br />
[24] Pevzner, P. A., H. Tang and M. S. Waterman, “An Eulerian Path Approach<br />
to DNA Fragment Assembly,” Department of Computer Science<br />
and Engineering, University of California, San Diego, La Jolla, CA;<br />
and Departments of Mathematics and Biological Sciences, University of<br />
Southern California, Los Angeles, CA, June 7, 2001.<br />
[25] Phan, V. T. and S. Skiena, “Dealing with Errors in Interactive Sequencing<br />
by Hybridization,” Oxford University Press, 17(10), 2002, pp. 1–9.<br />
[26] Skiena, S., Personal communications, September 2003.<br />
[27] Waterman, M. S., Introduction to Computational Biology: Maps, Sequences<br />
and Genomes, Chapman & Hall, London, 1995.<br />