The Pennsylvania State University

The Graduate School

Capital College

An Improved Genetic Algorithm

Solving the DNA Sequencing

Problem with Errors

A Master’s Paper in

Computer Science

by

Waleed Youssef

© 2004 Waleed Youssef

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Master of Science

March 2004

Abstract

Genetic algorithms have proven very effective in solving the
NP-hard problem of DNA sequencing. In practice, genetic algorithms often
produce optimal or near-optimal solutions in polynomial time.

In this research, we describe a new genetic algorithm for solving the DNA

sequencing problem. The algorithm allows the input spectrum to contain

both positive and negative errors as could be expected from a hybridization

experiment. The main features of the algorithm described here include a preprocessing

step that reduces the size of the input spectrum using a dynamic

programming technique and an efficient local optimization process. In experimental

tests, the algorithm performed very well against existing algorithms.

It outperformed them in terms of match percentages. The running time,

although better than existing algorithms, was not highlighted as an important

factor because of different machines used in testing. The algorithm also

performed very well on large data sets generated from real genome data.

Table of Contents

Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
2 Preliminaries
    2.1 Definitions
    2.2 Basic Information
    2.3 Sequencing by Hybridization
    2.4 Fragments Assembly
    2.5 Genetic Algorithms
3 Problem Formulation
    3.1 Describing the Problem
    3.2 Error Model
    3.3 Input and Output
    3.4 Other Related Information
        3.4.1 The Hamiltonian and Eulerian Paths
        3.4.2 Sequence Alignment
4 Algorithm
    4.1 Preprocessing
    4.2 Encoding and Initialization
    4.3 Fitness
    4.4 Parent Selection
    4.5 Crossover
    4.6 Mutation
    4.7 Local Optimization
    4.8 Replacement Scheme
    4.9 Stopping Condition
5 Standardizing the Data
6 Experimental Results
7 Conclusion

Acknowledgements

Dr. T. Bui has been a valued advisor. He provided me with excellent

insights and knowledge. His expertise has made this project a wonderful

experience for me. I owe a good deal to him for making this project a

success. My special thanks to the committee members: Dr. Q. Ding, Dr.

P. Naumov, and Dr. L. Null for their positive comments and feedback.
Also, thanks to everyone in the Computer Science department for the greatest
learning experience I have ever had.

I would also like to thank Dr. J. Blazewicz and Dr. M. Kasprzak, Institute

of Computing Science, Poznan University of Technology, for providing me

with the data used in [7], Dr. S. Skiena, Department of Computer Science,

State University of New York, for useful discussions, and Jie Li, Iowa State

University for providing me with an implementation of the Smith-Waterman

algorithm.

A version of this paper appears in [9].

List of Figures

1 A Typical Structure of Steady-State Genetic Algorithm
2 A Typical Structure of Hybrid Genetic Algorithm
3 Reconstructing the sequence in case of no errors
4 Reconstructing the sequence in case of errors in the spectrum
5 Recurrence Formula for Sequence Alignment
6 The Enhanced Genetic Algorithm for DNA sequencing
7 The Preprocessing Algorithm
8 The Algorithm to repair the Chromosome
9 The Algorithm for the Local Optimization Process
10 The Spectrum Generator Algorithm
11 Comparison between our algorithm and others
12 Plot of Match Percentage against Error Combinations (l=10)
13 Plot of Running Time against Error Combinations (l=10)
14 Plot of Match Percentage against Error Combinations (l=20)
15 Plot of Running Time against Error Combinations (l=20)
16 Plot of Match Percentage against Negative Errors
17 Plot of Running Time against Negative Errors
18 Plot of Match Percentage against Positive Errors
19 Plot of Running Time against Positive Errors
20 Plot of Match Percentage against Optimal Length
21 Plot of Running Time against Optimal Length
22 Plot of Match Percentage against Fragment Length
23 Plot of Running Time against Fragment Length

List of Tables

1 The Dynamic Programming Matrix representation for Example 4
2 Genomes obtained from the GenBank
3 Positive and Negative error combinations used in testing the algorithm
4 Summary of results from our algorithm and others

1 Introduction

Determining the genome of living organisms has been a major research initiative
worldwide in the last few years. One of the principal steps in this

endeavor is the sequencing of DNA. Informally, DNA sequencing is the process

of determining the correct order of nucleotides in a DNA segment. Many

techniques have been developed for DNA sequencing. DNA sequencing experiments

are typically performed in two stages: shotgun sequencing and

walking. In shotgun sequencing, many short, randomly selected fragments of

a DNA segment are sequenced. Due to the stochastic nature of this process,

there are parts of the DNA segment that are left unsequenced or insufficiently

covered. These parts are then covered by a deterministic finishing process

called walking [22].

The two most popular methods for DNA sequencing are the Sanger

method and the Sequencing by Hybridization (SBH) method [23]. In this

research we consider only the SBH method. SBH follows the methodology

known as “break, read, and assemble”. In this methodology a DNA sequence

is partitioned into smaller size fragments. The fragments are then read using

a fluorescent light. The assemble phase tries to retrieve the original sequence

from the shorter length fragments, i.e., to determine the exact sequence of

nucleotides of the DNA molecule.

From an algorithmic point of view the DNA sequencing problem is the

problem of constructing a chromosome that most likely contains all the DNA

fragments in a given input set, called the spectrum. The spectrum is usually

obtained through some experiments such as a hybridization experiment. It

should be noted that the fragments in a spectrum have overlaps and all fragments

have the same length. If a spectrum contains all possible fragments of

length l of a DNA sequence and there are no errors in the fragments, then

there exist efficient algorithms for reconstructing the original DNA sequence

from the spectrum. In general, however, there are errors in the spectrum,

e.g., missing fragments or erroneous fragments, making the problem of reconstructing

the original DNA sequence an NP-hard problem [4].

In this study we present a genetic algorithm for the DNA sequencing

problem. This algorithm differs from others in that it can efficiently handle

different types of errors in the input. Additionally, the algorithm includes

a preprocessing step that effectively reduces the size of the input, thereby

reducing the running time of the algorithm. It also helps in improving the

chance of getting the optimal answer. The idea of the preprocessing step can

be extended to create a hierarchical structure that enables the algorithm to

deal with much longer sequences. Experimental results show that this new

algorithm outperformed other algorithms from [1] and [7]. We also performed

extensive tests of the algorithm on data that we generated systematically
from genomes obtained from the GenBank [13]. Data was generated using
another algorithm we developed to simulate the Sequencing by Hybridization
experiment. To determine the quality of the results obtained, the
Smith-Waterman algorithm for sequence alignment was used [27]. The results of

these experiments show that our algorithm is very robust against a large

range of data generated with different types of errors.

The rest of this paper is organized as follows. Section 2 describes some

common terminology and background information. Section 3 defines the

problem formally, gives background information about algorithmic methods

and techniques used when developing the algorithm, and defines some techniques

that were used before to solve the same problem. The algorithm is

described in Section 4. Techniques used to obtain and standardize the data

are described in Section 5. Experimental results comparing the performance

of our algorithm against others are given in Section 6. Section 6 also includes

results showing the performance of our algorithm on large data sets that we

generated. Conclusions and future directions are given in Section 7.

2 Preliminaries

DNA contains the genetic codes that are passed from generation to generation.

It determines many facets of how organisms develop. Over many years,

biologists have tried to determine and analyze sequences of genomes that

hold characteristics of all living things. Many mysteries have been identified

and many secrets have been revealed. One of the most challenging technical
projects of recent years is the Human Genome Project. Sequencing the

whole human genome will help reveal the estimated 30,000 to 35,000 human

genes within our DNA as well as the regions controlling them. The resulting

DNA sequence maps will be used by 21st century scientists to explore human

biology and other complex phenomena [17].

In the next few subsections, we will describe some of the terminology
needed for the rest of this paper. We will also give brief preliminary information

about DNA Sequencing in general and the Sequencing by Hybridization

experiment in particular. Additional background information will also be

discussed.

2.1 Definitions

DNA can be defined as a string of symbols drawn from the set Σ={A,C,T,G},

where A represents Adenine, C represents Cytosine, G represents Guanine,

and T represents Thymine. Each symbol is known as a nucleotide. An

oligonucleotide is a short sequence of nucleotides. It is also known as a

fragment. A sequence or strand is a longer string of nucleotides.

Usually, genetic algorithms refer to the strand or sequence as a chromosome.

A hybridization experiment is an experiment that takes a DNA strand and

produces copies of fragments of that strand. These fragments usually
overlap. The set of all fragments that results from a hybridization experiment

is known as a spectrum. All fragments in a spectrum have the same length.

Example 1.

• A fragment of length 10: ACTGCTGGTT

• A chromosome C: ACTGCTGGTGTCTGTACGAGGTACGTAGCA

• Spectrum S of cardinality 21, obtained from chromosome C:

{ACTGCTGGTG, CTGCTGGTGT, TGCTGGTGTC, GCTGGTGTCT,

CTGGTGTCTG, TGGTGTCTGT, GGTGTCTGTA, GTGTCTGTAC,

TGTCTGTACG, GTCTGTACGA, TCTGTACGAG, CTGTACGAGG,

TGTACGAGGT, GTACGAGGTA, TACGAGGTAC, ACGAGGTACG,

CGAGGTACGT, GAGGTACGTA, AGGTACGTAG, GGTACGTAGC,

GTACGTAGCA}
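As a minimal sketch (not part of the original paper), an error-free spectrum such as the one in Example 1 can be produced by sliding a window of length l over the chromosome. The function name `ideal_spectrum` is an illustrative assumption.

```python
def ideal_spectrum(sequence: str, l: int) -> list[str]:
    """Return all length-l fragments of `sequence`, ordered by start position.

    Hypothetical helper: it models an error-free hybridization experiment.
    """
    return [sequence[i:i + l] for i in range(len(sequence) - l + 1)]


C = "ACTGCTGGTGTCTGTACGAGGTACGTAGCA"
S = ideal_spectrum(C, 10)
print(len(S))        # 21 fragments, since 30 - 10 + 1 = 21
print(S[0], S[-1])   # ACTGCTGGTG GTACGTAGCA
```

A chromosome of length 30 with l = 10 yields 30 − 10 + 1 = 21 fragments, which agrees with the cardinality stated in Example 1.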

2.2 Basic Information

DNA (deoxyribonucleic acid) consists of two strands, each of which contains

nucleotides drawn from the set Σ={A, C, G, T}. (Technically, a DNA strand also contains
other components, such as phosphates.) The nucleotides in

each strand are connected together in series. The two strands of the DNA

are twisted together into the famous double helix structure. Furthermore,

each nucleotide in a strand is connected to a complementary nucleotide in

the other strand, where A is paired with T and C is paired with G. Thus,

each strand of a DNA molecule completely determines the other. Sequencing techniques
make use of this important characteristic: labeled complementary fragments are added
to the experiment, and the fragments that pair up with the labeled ones reveal
the original sequence of nucleotides.
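The pairing rule (A with T, C with G) can be illustrated with a short sketch. The names `PAIRING` and `complement` are assumptions made for this example only, and the function pairs nucleotides position by position without modeling strand orientation.

```python
# Base-pairing rule described in the text: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the strand that pairs position by position with `strand`."""
    return "".join(PAIRING[nucleotide] for nucleotide in strand)


fragment = "ACTGCTGGTT"
print(complement(fragment))               # TGACGACCAA
print(complement(complement(fragment)))   # ACTGCTGGTT
```

Applying the function twice returns the input, which reflects the statement that each strand completely determines the other.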

Three main areas of interest can be distinguished in the field of DNA:

DNA sequencing, DNA assembling, and DNA mapping. DNA sequencing is

the process of determining the original sequence of nucleotides from a set of

DNA fragments. DNA assembling is the process of assembling the sequenced

fragments into longer contigs. Finally, DNA mapping deals with the whole

chromosomes and tries to place marked DNA fragments (usually genes) on

certain chromosome regions [5]. Our research concentrated mainly on DNA

sequencing. Also, we were able to extend the algorithm presented here to

perform some DNA assembly work as will be shown in the next few sections.

2.3 Sequencing by Hybridization

Hybridization is a parallel experiment with high throughput. It is superb for

limited regions of a genome. It is a procedure in which two single-stranded

DNA molecules containing complementary sequences of nucleotides bond together.

A probe is a DNA molecule that is fluorescently labeled. By testing

whether a probe hybridizes to a given sequence, it is possible to determine

whether the sequence contains a piece that is complementary to the probe.

Techniques have been devised that make it possible to test the hybridization

of a single probe to hundreds of different sequences in a single automated

experiment. In the hybridization experiment, DNA arrays - also known as

DNA chips - containing thousands of short fragments with length l (for example,

l =10) and attached to a surface are applied to a solution containing

the unknown fluorescent-labeled DNA fragment. After the reaction, one can

obtain a set of oligonucleotides which are fragments of the examined DNA

sequence by reading a fluorescent image of the chip. Those fragments or

oligonucleotides constitute the spectrum. The spectrum is read by exposing

it to light. The original sequence is then reconstructed using combinatorial
algorithms that take as input the generated spectrum and return as
output a sequence that best represents the input spectrum.

The reliability of the fragment depends on its binding energy to the target.

It is a function of many factors such as the length of the probe, the

oligonucleotide contents of the probe, and similar sequences in the target.

The probes should not be too short if we want to be able to
reconstruct the sequence efficiently. Overlaps between probes are necessarily shorter
than the probe length, so if the probes are too short, the overlaps carry too little
information and it is almost
impossible to reconstruct the original sequence. Long probes increase the

hybridization reliability. However, fragments longer than a few hundred nucleotides

cannot be sequenced reliably by current methods, so the fragments

will typically be rather short. In addition, long fragments are uninformative

at the single nucleotide level. The fragment length also influences thermal

stability.

Sequencing by hybridization (SBH) is an elegant and efficient sequencing

method in the case of error-free data. SBH is fast and convenient. However, it

is limited to extremely short sequences, as even sequencing an 8-base sequence
implies an array with $4^8 = 65{,}536$ elements. Sequences of a hundred bases
would require a currently infeasible array size. In general, sequencing might
be accomplished with the entire set of $4^N$ probes of length N. However, in
real experiments, errors occur. These errors correlate more with differences
in melting temperature than with purely random effects.
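To make the growth of the array size concrete, the following sketch evaluates 4^N for a few probe lengths. The function name `probe_array_size` is a hypothetical helper introduced here.

```python
def probe_array_size(probe_length: int) -> int:
    """Number of distinct probes of the given length over the alphabet {A, C, G, T}."""
    return 4 ** probe_length


for n in (8, 10, 12):
    print(n, probe_array_size(n))
# 8 65536
# 10 1048576
# 12 16777216
```

Each additional base multiplies the number of probes by four, which is why complete arrays are practical only for short probes.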

2.4 Fragments Assembly

Fragments Assembly is the process of constructing a longer sequence from the

input spectrum. Shorter sequences can be assembled using the DNA sequencing

process. DNA sequencing is not very efficient for longer sequences.

In general, the process of assembling large sequences can be divided into two

main steps. The first step is to reconstruct longer sequences from input spectra.

Reconstructing longer sequences from short fragments can be efficiently

done using DNA sequencing algorithms. The second step is to align sequences

obtained from the DNA sequencing process through sequence alignment algorithms.

The process can be looked at as a tree representation where shorter

fragments are the leaves and the goal is to move up the tree to obtain longer

sequences until the root is reached and the optimal sequence is obtained.

2.5 Genetic Algorithms

Before describing the problem more formally, we provide some background

information on how genetic algorithms can be used to solve many NP-hard

problems efficiently. Genetic algorithms were formally introduced in 1975 by
John Holland at the University of Michigan [15]. In 1992 John Koza used genetic

algorithms to evolve programs to perform certain tasks. He called his

method genetic programming. Since then, genetic algorithms have become

more and more popular. The continuing price and performance improvements

of computational systems have made them attractive for many types

of optimization problems. In particular, genetic algorithms work very well

on mixed (continuous and discrete) combinatorial problems. They are less

susceptible to getting ‘stuck’ at local optima than gradient search methods,

but they tend to be computationally expensive.

To use a genetic algorithm, the solution to the problem must be represented

as a genome (or chromosome). The genetic algorithm then creates a

population of solutions and applies genetic operators such as mutation and

crossover to evolve the solutions in order to find the best one(s). Many variations

of genetic algorithms exist depending on the structure of the algorithm.

A genetic algorithm that generates only one offspring per generation is called

steady-state genetic algorithm, as opposed to a generational genetic algorithm

that replaces the whole population, or a large subset of it, per generation. A

typical structure of a steady-state genetic algorithm is given in Figure 1. If a

local optimization step is added to the steady-state genetic algorithm, then

the algorithm is said to be hybridized and the scheme is called hybrid genetic

algorithm. A typical structure for the hybrid GA is shown in Figure 2. We

provide a hybrid steady-state GA for the DNA sequencing problem.

Genetic Algorithm
    generate a random initial population P
    repeat
        select two parents p1 and p2 from the population
        offspring ← crossover(p1, p2)
        mutate(offspring)
        if suited(offspring) then
            replace(P, offspring)
    until (there is no improvement)
    return the best member of P

Figure 1: A Typical Structure of Steady-State Genetic Algorithm

Below, we introduce some common and important terminology of genetic

algorithms as well as some recommendations that should be taken into consideration
when designing a new genetic algorithm. Following these recommendations
can improve the evolution of the algorithm and eventually lead to
better results.

Genetic Algorithm
    generate a random initial population P
    repeat
        select two parents p1 and p2 from the population
        offspring ← crossover(p1, p2)
        mutate(offspring)
        localOptimize(offspring)
        if suited(offspring) then
            replace(P, offspring)
    until (there is no improvement)
    return the best member of P

Figure 2: A Typical Structure of Hybrid Genetic Algorithm
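The generic structure in Figures 1 and 2 can be sketched in Python on a toy problem (maximizing the number of ones in a bit string). This is only an illustration under stated assumptions, not the algorithm developed in this paper: the bit-string encoding, the one-point crossover, the "replace the worst member if the offspring is better" rule, the `patience` stopping parameter, and all function names are hypothetical choices.

```python
import random


def flip_one_bit(chromosome, fitness):
    """Toy local optimization: accept the first single-bit flip that improves fitness."""
    base = fitness(chromosome)
    for i in range(len(chromosome)):
        candidate = chromosome[:i] + [1 - chromosome[i]] + chromosome[i + 1:]
        if fitness(candidate) > base:
            return candidate
    return chromosome


def hybrid_steady_state_ga(fitness, length, pop_size=50, mutation_rate=0.02,
                           patience=200, local_optimize=flip_one_bit, seed=0):
    """Evolve bit strings; pass local_optimize=None for the plain steady-state GA."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(fitness(c) for c in population)
    stale = 0  # iterations since the best fitness last improved
    while stale < patience:
        p1, p2 = rng.sample(population, 2)            # parent selection
        cut = rng.randrange(1, length)                # one-point crossover
        offspring = p1[:cut] + p2[cut:]
        offspring = [1 - g if rng.random() < mutation_rate else g
                     for g in offspring]              # mutation
        if local_optimize is not None:
            offspring = local_optimize(offspring, fitness)
        worst = min(range(pop_size), key=lambda i: fitness(population[i]))
        if fitness(offspring) > fitness(population[worst]):  # "suited" test
            population[worst] = offspring
        if fitness(offspring) > best:
            best, stale = fitness(offspring), 0
        else:
            stale += 1
    return max(population, key=fitness)


best = hybrid_steady_state_ga(fitness=sum, length=20)
print(sum(best), "ones out of", len(best))
```

Only one offspring is produced per iteration, which is what makes the scheme steady-state; the optional `local_optimize` call is the single difference between the two figures.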

Population: A genetic algorithm starts with a set of initial solutions

(chromosomes) called a population. The size of the population depends on

the problem and on the encoding of the chromosomes. Larger populations

usually need more time to evolve. Smaller populations might not be able to

reach the optimal solution. They are more prone to getting stuck at local
optima. A good population size is about 50 to 100. The population size
has a direct effect on performance; the larger the population, the slower the

algorithm.

Encoding: Encoding depends on the problem and also on the size of
the problem instance. There is no general-purpose way of encoding the

chromosome. An encoding that is best for one problem might not be suitable

for other problems.

Crossover: Crossover is the process of combining two parents to obtain

a new member of the population called an offspring. Crossover depends on

the chosen encoding and on the problem. One possible crossover operator is

a k-point crossover.

In the k-point crossover, two parent chromosomes are cut in exactly k

positions resulting in k + 1 segments for each chromosome. Segments are

then alternately concatenated, resulting in two new offspring. The two
new offspring are compared and the better one is returned.
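A k-point crossover as described above might be sketched as follows. The list encoding, the random choice of cut positions, and the name `k_point_crossover` are assumptions made for illustration.

```python
import random


def k_point_crossover(p1, p2, k, fitness, rng=random):
    """Cut both parents at k shared positions, alternate segments, keep the fitter child."""
    if len(p1) != len(p2) or not 0 < k < len(p1):
        raise ValueError("parents must have equal length greater than k")
    cuts = sorted(rng.sample(range(1, len(p1)), k))   # k distinct interior cut points
    bounds = [0] + cuts + [len(p1)]                   # delimits k + 1 segments
    child_a, child_b = [], []
    for index, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        first, second = (p1, p2) if index % 2 == 0 else (p2, p1)
        child_a.extend(first[lo:hi])
        child_b.extend(second[lo:hi])
    return max(child_a, child_b, key=fitness)


child = k_point_crossover([0] * 8, [1] * 8, k=3, fitness=sum,
                          rng=random.Random(0))
print(child)
```

With an all-zeros and an all-ones parent, the returned child switches between parents exactly k times, which makes the k + 1 alternating segments easy to see.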

Mutation: The goal of mutation is to enhance the diversity of the population

by introducing new members that are mutated from their parents.

The mutation rate should be very low. The best rates seem to be about

0.5%-5%. Increasing the mutation rate creates members in the population

with new characteristics.

Selection: Many schemes exist for the selection process. One of the

simplest methods is the basic selection method. In this method, a random

number representing a chromosome in the population is selected. Each

member of the population has the same probability of being selected. Another

commonly used method is the roulette wheel selection method. In

this method, members with higher fitness from the population have a higher

chance of being selected. There are also more sophisticated methods that

change the parameters of selection during the run of the genetic algorithm.

They behave very similarly to simulated annealing. Choosing the right method

to use is problem dependent.
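Roulette wheel selection can be sketched as below, assuming non-negative fitness values. The function name and the fallback to uniform selection when all fitness values are zero are illustrative choices, not taken from the paper.

```python
import random


def roulette_wheel_select(population, fitness, rng=random):
    """Pick one member with probability proportional to its (non-negative) fitness."""
    weights = [fitness(member) for member in population]
    total = sum(weights)
    if total <= 0:
        return rng.choice(population)     # assumption: fall back to basic selection
    pick = rng.uniform(0, total)          # where the wheel stops
    running = 0.0
    for member, weight in zip(population, weights):
        running += weight
        if pick < running:
            return member
    return population[-1]                 # guards against floating-point rounding


rng = random.Random(1)
population = ["AAAA", "AACC", "ACGT"]
counts = {member: 0 for member in population}
for _ in range(3000):
    counts[roulette_wheel_select(population, lambda s: len(set(s)), rng)] += 1
print(counts)
```

In this toy run the fitness is the number of distinct symbols (1, 2, and 4), so "ACGT" should be drawn most often and "AAAA" least often.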

Local Optimization: The goal of the local optimization step is to try

to enhance the current chromosome and obtain a better one that is either a

local or global optimal solution to the original problem. The local optimization

is problem dependent. There are no common techniques for performing

local optimization. An optimization step that is good for one problem might

not be good for another.

3 Problem Formulation

In this section we give some basic algorithmic background information as

well as a formal description of the DNA sequencing problem.

3.1 Describing the Problem

The DNA sequencing problem is the problem of determining a DNA strand

based on a given spectrum. The problem can be modeled as follows. Let

Σ = {A,C,T,G} be an alphabet. Here we consider a spectrum as a set of

strings of length l over Σ. A spectrum consists of fragments of equal sizes. A

spectrum is said to be ideal if the following condition is true for all but one

fragment in the spectrum: the suffix of length l−1 in a fragment is a prefix of

exactly one other fragment in the spectrum. The DNA sequencing problem

can then be stated as the problem of constructing a string over Σ from a given

spectrum (not necessarily an ideal spectrum), so that the resulting string is

the shortest string that contains as many of the fragments in the spectrum

as possible.
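The definition of an ideal spectrum translates directly into a check. The sketch below is an illustration only; the name `is_ideal` is hypothetical, and the two test spectra are the ones used in Examples 2 and 3 of the next subsection.

```python
def is_ideal(spectrum) -> bool:
    """True if all but one fragment has its (l-1)-suffix as the prefix of exactly one other fragment."""
    fragments = set(spectrum)

    def successors(fragment):
        return [other for other in fragments
                if other != fragment and other[:-1] == fragment[1:]]

    violations = sum(1 for f in fragments if len(successors(f)) != 1)
    return violations == 1


error_free = ["ACAGT", "CAGTG", "AGTGA", "GTGAC", "TGACT", "GACTG"]
with_errors = ["ACAGT", "CAGTG", "GTGAC", "TGACT", "GACTG", "AGTCA"]
print(is_ideal(error_free))    # True
print(is_ideal(with_errors))   # False
```

In the error-free spectrum only the last fragment, GACTG, has no successor. In the second spectrum three fragments (CAGTG, GACTG, AGTCA) lack a successor, so the condition fails.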

3.2 Error Model

Many variations of the sequencing problem exist depending on the model of

error used and on other factors in the experiment. Errors occurring during

the hybridization experiment play an important role in reconstructing the

original sequence. Algorithms solving the DNA Sequencing problem usually

consider one of two cases. The first case is to assume that the input spectrum

is ideal, meaning that it has no errors. The other case is to deal with errors

in the spectrum.

In general the input spectrum is not an ideal one. The errors appearing in

a spectrum are usually due to errors in the hybridization experiment. Errors

can be classified as positive or negative. The spectrum has positive errors

when it contains fragments that are not part of the original sequence. It

has negative errors when it fails to contain some oligonucleotides. Certain

errors are random, meaning that they may disappear when the experiment is

repeated. However, many hybridization errors are systematic, meaning that

they are likely to repeat each time the experiment is run [25][26].

If there are no errors, the problem of DNA sequencing is similar to the
Shortest Superstring problem [23], which is defined as the problem of reconstructing
a string given a collection of overlapping substrings. In this error-free
setting, the problem seems to be much easier than the general problem of
DNA Sequencing, and there exist efficient algorithms for it [24].

There also exists an approximation algorithm with an approximation factor

of three, i.e., the superstring it produces is at most three times as long as the

optimal shortest superstring [8]. To help illustrate the idea of DNA Sequencing

when the input spectra contain no errors, let us consider the following

example.

Example 2. Let the original sequence to be found be ACAGTGACTG.

Let the fragment length be 5 i.e., l=5. Assume that the hybridization experiment

has 0% positive error and 0% negative error. Then, the output from

the experiment would be the set S, where S={ACAGT, CAGTG, AGTGA,

GTGAC, TGACT, GACTG}. The cardinality of S is 6. In the case of no

errors in the spectrum, each fragment overlaps the next in exactly l − 1
positions. Thus, the total length can be calculated as n + l − 1, where n is the
cardinality of S. Hence, in
this example, the optimal length will be 6 + 5 − 1 = 10. The overlap occurs in exactly

4 positions. The final sequence can be determined as shown in Figure 3 as

ACAGTGACTG.

ACAGT.....
.CAGTG....
..AGTGA...
...GTGAC..
....TGACT.
.....GACTG
----------
ACAGTGACTG

Figure 3: Reconstructing the sequence in case of no errors
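The layout in Figure 3 can be reproduced by chaining fragments through their overlaps of length l − 1. The following is a minimal sketch for an ideal spectrum only, assuming that no two fragments share the same prefix of length l − 1; the function name `reconstruct_ideal` is hypothetical.

```python
def reconstruct_ideal(spectrum) -> str:
    """Chain fragments of an ideal spectrum through their (l-1)-overlaps."""
    fragments = set(spectrum)
    by_prefix = {f[:-1]: f for f in fragments}
    if len(by_prefix) != len(fragments):
        raise ValueError("two fragments share a prefix; spectrum is not ideal")
    suffixes = {f[1:] for f in fragments}
    # The first fragment is the only one whose prefix is nobody's suffix.
    starts = [f for f in fragments if f[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting fragment")
    current = starts[0]
    sequence, used = current, {current}
    while current[1:] in by_prefix:
        current = by_prefix[current[1:]]
        if current in used:
            raise ValueError("cycle detected; spectrum is not ideal")
        used.add(current)
        sequence += current[-1]          # each step adds one new nucleotide
    if len(used) != len(fragments):
        raise ValueError("some fragments could not be chained")
    return sequence


S = {"ACAGT", "CAGTG", "AGTGA", "GTGAC", "TGACT", "GACTG"}
print(reconstruct_ideal(S))   # ACAGTGACTG
```

The first fragment contributes l nucleotides and each of the remaining n − 1 fragments contributes one, giving the length n + l − 1 = 10 stated in Example 2.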

The existence of errors in the input spectrum makes the problem of reconstructing

the original sequence an NP-hard problem [4]. Missing fragments

from the experiment turn the problem into the problem of finding the most

likely sequence [4]. The most likely sequence is the shortest one containing

almost all the fragments as substrings. Some fragments might be excluded

from the final result. Those excluded fragments are the ones that represent

the positive errors in the experiment. Also, fragments might not be

completely overlapped. Under normal situations, two fragments of length l

intersect in l − 1 positions. However, because of negative errors, the longest

overlap might not be of length l −1. Another source of difficulty exists when

the spectrum contains repeated fragments. Most existing algorithms that

allow for errors in the input spectrum put restrictions on the error model

[5][11][12]. There are few algorithms that have no restriction on the input

error model. Two such algorithms are in [1] and [7]. Our algorithm also puts

no restrictions on the error model. We require only an upper bound for both

the negative and positive errors in the spectrum. A comparison between our

algorithm and those two algorithms is presented in Section 6.

Example 3. Using the same DNA sequence as in Example 2, let us now assume

there are errors in the input spectrum. Assume that fragment AGTGA

is a negative error fragment, i.e., it is missing from the input spectrum S. Instead,
fragment AGTCA appears in the set S as a positive error. That is, it
is not part of the final optimal sequence. Then S={ACAGT, CAGTG, GTGAC,
TGACT, GACTG, AGTCA}. The optimal length is calculated using
the same formula as for the ideal spectrum, except that the number of negative errors is added
and the number of positive errors is subtracted. With n = 6 and l = 5, the optimal
length would be n + l − 1 + 1 − 1 = 10. The algorithm solving the DNA

sequencing problem with error should detect the positive error fragment and

try not to include it in the final solution. One possible solution is shown in

Figure 4.
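The bookkeeping in Example 3 can be checked with a short sketch that compares the observed spectrum with the ideal one. In a real experiment the ideal spectrum is unknown, so this is only a way to verify the arithmetic; the function names are hypothetical.

```python
def error_profile(ideal, observed):
    """Return (negative errors, positive errors) as sets of fragments."""
    ideal, observed = set(ideal), set(observed)
    negative = ideal - observed    # fragments missing from the observed spectrum
    positive = observed - ideal    # fragments that are not part of the sequence
    return negative, positive


def optimal_length(n: int, l: int, negative: int, positive: int) -> int:
    """Length formula from Example 3: n + l - 1 + negative - positive."""
    return n + l - 1 + negative - positive


ideal = {"ACAGT", "CAGTG", "AGTGA", "GTGAC", "TGACT", "GACTG"}
observed = {"ACAGT", "CAGTG", "GTGAC", "TGACT", "GACTG", "AGTCA"}
negative, positive = error_profile(ideal, observed)
print(sorted(negative), sorted(positive))    # ['AGTGA'] ['AGTCA']
print(optimal_length(len(observed), 5, len(negative), len(positive)))   # 10
```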

3.3 Input and Output

Algorithms solving the DNA Sequencing problem take as input the spectrum

of all fragments. The output is a DNA sequence that is the most likely one

that includes all fragments. In the case of an ideal spectrum, the result

ACAGT.....
.CAGTG....
...GTGAC..
....TGACT.
.....GACTG
----------
ACAGTGACTG

Figure 4: Reconstructing the sequence in case of errors in the spectrum

sequence would be of length n + l − 1, where n is the cardinality of the

spectrum and l is the length of the fragments. However, because of errors, this

may not be always the case. The algorithm presented here deals with errors,

so the output sequence would not necessarily be of length n+l −1. Negative

errors may cause the output sequence to be shorter in length. Positive errors

may cause it to be longer. In some other cases, those types of errors would

mistakenly cause the algorithm to converge to a sequence that does not

necessarily represent the optimal solution.

3.4 Other Related Information

The next few subsections cover additional topics that give the reader
basic background in computational biology in general and in DNA sequencing
in particular.

3.4.1 The Hamiltonian and Eulerian Paths

The Hamiltonian and Eulerian Path approaches are widely used methods to
solve the DNA sequencing problem when the input spectrum has no errors.
A Hamiltonian path in a graph is defined as a path that visits every vertex of
the graph exactly once. To illustrate the similarity between finding a Hamiltonian path in

a graph and finding the DNA sequence containing all fragments in S when

there are no errors in the spectrum, let us define a graph G = (V,E), where

V is the vertex set and E is the edge set, representing a spectrum S as

follows. The vertices are elements of the spectrum S. A directed edge from one
vertex to another exists if and only if the suffix of length l − 1 of the first fragment
equals the prefix of length l − 1 of the second, i.e., the corresponding fragments have an overlap of
length l − 1. Since the problem of finding a Hamiltonian path is known to
be NP-hard, it is unlikely to admit polynomial time algorithms. Researchers

tried to transform the problem into another problem that can be solved in

polynomial time [6]. Pavel Pevzner proposed a transformation of the graph

into a new graph where the problem is equivalent to finding an Eulerian path

[23].

An Eulerian path in a graph G is a path that traverses every edge of G exactly once. The

idea is to try to reduce the fragment assembly problem to a variation of

the classical Eulerian path problem. An Eulerian path is different from a

Hamiltonian path, as there exist polynomial time algorithms for finding an

Eulerian path in a graph if one exists. The Eulerian path approach proved to be very effective when assembling fragments that contain no errors [24]. In this paper, we use a similar approach to construct new, longer fragments, as will be shown in Section 4.
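
As a small illustration of the graph construction described above, the following Python sketch builds the overlap graph for an error-free spectrum and searches it for a Hamiltonian path by backtracking. The function names (`overlap_graph`, `hamiltonian_path`, `spell`) and the sample spectrum are assumptions made for illustration only. They are not part of the algorithm presented in this paper, and the exhaustive search is practical only for very small inputs.

```python
def overlap_graph(spectrum, l):
    """Directed graph: edge u -> v when the last l-1 characters of u
    equal the first l-1 characters of v."""
    return {u: [v for v in spectrum if v != u and u[-(l - 1):] == v[:l - 1]]
            for u in spectrum}


def hamiltonian_path(graph):
    """Backtracking search for a path visiting every vertex exactly once.
    Exponential in the worst case, so only suitable for tiny spectra."""
    def extend(path, visited):
        if len(path) == len(graph):
            return path
        for v in graph[path[-1]]:
            if v not in visited:
                found = extend(path + [v], visited | {v})
                if found:
                    return found
        return None

    for start in graph:
        found = extend([start], {start})
        if found:
            return found
    return None


def spell(path):
    """Spell the sequence: first fragment plus the last character of each
    following fragment (valid when consecutive overlaps are l-1)."""
    return path[0] + "".join(f[-1] for f in path[1:])


if __name__ == "__main__":
    # Hypothetical error-free spectrum with fragment length 5, in arbitrary order.
    spectrum = ["ACGTT", "CTAGA", "GACGT", "TAGAC", "CGTTC", "AGACG"]
    print(spell(hamiltonian_path(overlap_graph(spectrum, 5))))
```

With errors in the spectrum a Hamiltonian path may not exist at all, which is one reason a heuristic such as the genetic algorithm of Section 4 is used instead.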

3.4.2 Sequence Alignment

Once results are obtained from the DNA sequencing algorithm, a reliable method for determining the quality of the solution is needed. One way of doing this is to align the result sequence with the optimal sequence, i.e., sequence alignment. The idea of sequence alignment is simple. First, it compares each character (nucleotide) in the result sequence with the corresponding character in the optimal sequence from which the spectrum was obtained. This seems to be an easy task, but the challenge is to do it efficiently. Aligning sequences of different lengths is a further complication in the sequence alignment process.

Second, a scheme for determining the quality of the alignment is needed. This is done by using a scoring system. The scoring system adds a bonus, or match value, if a character in one sequence matches the corresponding character in the other sequence. A penalty, or mismatch value, is added if the two characters differ. If the corresponding character is a gap, then the gap cost is added. Gap insertions occur when inserting a space in the original sequence causes the cumulative score to increase. Gap extensions occur when one sequence is shorter than the other sequence. The shorter sequence is extended with gaps. Extension will not occur if it reduces the score of the alignment. As a result of the scoring system, each alignment receives a cumulative score that decreases in poorly matched regions and increases in highly matched regions.

One algorithm that is worth mentioning in the area of sequence alignment

is the Smith-Waterman algorithm [27]. The Smith-Waterman alignment

algorithm uses dynamic programming. Dynamic programming views the problem as a set of sub-problems. It solves the smaller sub-problems first and uses their results to solve larger sub-problems until the solution to the original problem is obtained. Figure 5 shows the recurrence formulation of dynamic programming for sequence alignment.

\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j), \\
F(i-1,\,j) + d, \\
F(i,\,j-1) + d.
\end{cases}
\]

Figure 5: Recurrence Formula for Sequence Alignment

In this recurrence, x and y are the two sequences to be aligned, and F(i,j) is defined as the score (or cost) of aligning the first i characters of one sequence with the first j characters of the other sequence. The initial values, i.e., the first row and the first column of the matrix, are trivial and can be determined quickly. The recurrence equation is applied repeatedly to fill the matrix F. Each entry F(i,j) is the maximum of three candidates built from values that have already been computed: F(i − 1,j − 1), F(i − 1,j), and F(i,j − 1). Here s(x_i, y_j) is the score for aligning character x_i with character y_j, and d is the penalty for a gap insertion or extension. Once the matrix is built, the aligned sequences can be reconstructed from it in reverse order, using an additional data structure that stores the path taken when building the matrix.

The main characteristic of the Smith-Waterman algorithm is that the aligned region can start and end anywhere in the original sequences. That is, the alignment can begin or end at the first or last character of a sequence, at a gap, or at an internal position. This feature is especially important when the start and end points of the sequence are not known. The algorithm produces an optimal local alignment, i.e., the one with the highest score. The next example illustrates how the dynamic programming matrix for sequence alignment is built.
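
Before turning to the worked example, the local-alignment idea can be sketched in a few lines of Python. The sketch below uses the standard textbook form of the Smith-Waterman recurrence, which adds a zero floor to the three candidates of Figure 5 so that an alignment may restart anywhere. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the function name are assumptions chosen for illustration; they are not the parameters used in our experiments.

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of x and y with a linear gap penalty."""
    rows, cols = len(x) + 1, len(y) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,                    # restart the local alignment
                          H[i - 1][j - 1] + s,  # align x[i-1] with y[j-1]
                          H[i - 1][j] + gap,    # gap in y
                          H[i][j - 1] + gap)    # gap in x
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    # The common substring "TT" gives a local score of 2 under these values.
    print(smith_waterman_score("TTCC", "AATT"))
```

Only the score is computed here; recovering the aligned region would additionally require a traceback from the cell holding the best score.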

Example 4. Consider the following two sequences, s and t, where s = TTCC and t = AATT. The dynamic programming matrix representing the alignment of those two sequences is shown in Table 1. The matrix was obtained using the recurrence formula given in Figure 5, where s(x_i, y_j) is 0 if x_i matches y_j and 1 otherwise, and the gap insertion or extension penalty, d, is 1. In this case the matrix represents the penalty for aligning the two sequences, so each entry takes the minimum, rather than the maximum, of the three candidates. To obtain the aligned sequences, one starts from the bottom right corner of the matrix and follows the path back to the top left corner. From Table 1, one possible alignment of the two sequences uses four replacements. Another possible alignment uses two inserts and two deletes. Both have a total penalty of 4, the value in the bottom right corner.

Table 1: The Dynamic Programming Matrix representation for Example 4

          A   A   T   T
      0   1   2   3   4
  T   1   1   2   2   3
  T   2   2   2   2   2
  C   3   3   3   3   3
  C   4   4   4   4   4

s=TTCC
t=AATT
... with four replacements

s=--TTCC
t=AATT--
... with two inserts and two deletes
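
The matrix in Table 1 can be reproduced with a short Python sketch. It fills the matrix with the penalty form of the recurrence, taking the minimum of the three candidates, with a mismatch penalty of 1 and a gap penalty of 1 as in Example 4. The function name `penalty_matrix` is an assumption made for illustration.

```python
def penalty_matrix(s, t, mismatch=1, gap=1):
    """Fill the alignment penalty matrix for s (rows) and t (columns)."""
    F = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        F[i][0] = i * gap          # s[:i] aligned against gaps only
    for j in range(1, len(t) + 1):
        F[0][j] = j * gap          # t[:j] aligned against gaps only
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = 0 if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = min(F[i - 1][j - 1] + sub,  # match or replacement
                          F[i - 1][j] + gap,      # delete from s
                          F[i][j - 1] + gap)      # insert into s
    return F


if __name__ == "__main__":
    for row in penalty_matrix("TTCC", "AATT"):
        print(" ".join(str(v) for v in row))
```

The bottom right entry, `F[len(s)][len(t)]`, is the total penalty of the best alignment, which is 4 for the two sequences of Example 4.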

4 Algorithm

In this section we describe a genetic algorithm for solving the DNA sequencing

problem when the input may have both positive and negative errors. We

do not require that the starting fragment of the sequence be known, as is

done in [6]. We use a steady-state genetic algorithm, together with a local

optimization procedure, to help improve the performance of the algorithm.

Additionally, we have a preprocessing step that improves the algorithm even

further. The overall algorithm is given in Figure 6. In the following subsections

we give more details of the algorithm.

Sequence(S) // S is a spectrum
    preprocess(S)
    generate a random initial population P
    for each a ∈ P
        LocalOptimize(a)
    endfor
    repeat
        Select two parents p1 and p2
        u ← crossover(p1, p2)
        LocalOptimize(u)
        mutate(u)
        replace(u, p1, p2, P)
    until (there is no improvement)
    return the best member of P
    Align output sequence

Figure 6: The Enhanced Genetic Algorithm for DNA sequencing

4.1 Preprocessing

In general, the fewer and the longer the fragments in the spectrum are,

the easier the problem is. The idea of preprocessing is to merge certain

fragments together, thereby creating a new spectrum that has fewer

and longer fragments. In this step we create long chains of fragments of the

form F_1 ... F_k, where the F_i are fragments, and the last l − 1 elements of F_i

match the first l − 1 elements of F_{i+1}. Our objective is to make k as large as

possible. Each such chain of fragments is merged into one fragment in the

new spectrum. The algorithm then works with this spectrum which has variable

length fragments and a smaller number of fragments than the original

spectrum.

The preprocessing algorithm creates a chain by selecting an unused fragment

in the spectrum and adding it to the chain. The fragment is then

marked used. The algorithm extends the chain by selecting an unused fragment

that has an overlap of l − 1 with the last fragment in the chain. If

such a fragment exists, it is added to the chain and marked used, and the

process is repeated. If there is no such fragment the chain is terminated. The

algorithm then starts a new chain. The algorithm terminates when all fragments

in the original spectrum have been used. The algorithm then merges

the fragments in each chain to create a fragment for the new spectrum. The

algorithm is basically trying to perform a modified topological sort on the

graph induced by the input spectrum. In this graph, the fragments are the

vertices and an edge exists between two fragments if and only if they overlap

in l − 1 positions. This algorithm can be efficiently implemented using a

dynamic programming technique. The algorithm for the preprocessing step

is shown in Figure 7.

Preprocessing(S) // S is a spectrum
    loop
        Start a new chain C with an unused fragment of S; mark it as used
        for each fragment f ∈ S
            if (f is not used) and (right(C, l − 1) = left(f, l − 1)) then
                Append f to the end of C
                Mark f as used
            endif
        endfor
        exit when all fragments are used
    endloop

Figure 7: The Preprocessing Algorithm.

Example 5. Suppose we have a DNA sequence CTAGACGTTC of length 10. An ideal spectrum would consist of the following six fragments: CTAGA, TAGAC, AGACG, GACGT, ACGTT, and CGTTC, where we have assumed that the fragment length is 5. However, because of errors from the hybridization experiment, an input spectrum in this case may consist of the following fragments: {CTAGA, TAGAC, AGACG, TATCC, ACGTT, CGTTC}, where the cardinality of the spectrum is 6. This spectrum differs from the ideal spectrum in that it does not contain the fragment GACGT (a negative error); instead it contains the fragment TATCC, which is not a substring of the original DNA sequence (a positive error). Thus, this spectrum has one negative error and one positive error. Using this spectrum as the input, the preprocessing algorithm would produce the chains [CTAGA, TAGAC, AGACG], [TATCC], and [ACGTT, CGTTC], which yield a spectrum consisting of the three fragments f_1 = CTAGACG, f_2 = TATCC, and f_3 = ACGTTC. The genetic algorithm then takes f_1, f_2, and f_3 as input instead of the original spectrum, which improves the probability of finding the optimal answer and reduces the running time because fewer fragments are involved.
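
The chaining step of Example 5 can be sketched in Python as follows. This is a minimal greedy illustration under stated assumptions, not the dynamic programming implementation used in our experiments: the function name `preprocess` is hypothetical, fragments are tried as chain seeds in input order, each chain is extended with the first unused fragment that overlaps its end in l − 1 positions, and l is assumed to be at least 2.

```python
def preprocess(spectrum, l):
    """Greedily merge fragments into chains with overlaps of l-1
    and return the merged, variable-length fragments."""
    used = [False] * len(spectrum)
    merged = []
    for start in range(len(spectrum)):
        if used[start]:
            continue
        used[start] = True
        chain = spectrum[start]            # seed a new chain
        extended = True
        while extended:                    # repeat until nothing fits
            extended = False
            for i, f in enumerate(spectrum):
                if not used[i] and chain[-(l - 1):] == f[:l - 1]:
                    chain += f[l - 1:]     # append only the new characters
                    used[i] = True
                    extended = True
                    break
        merged.append(chain)
    return merged


if __name__ == "__main__":
    spectrum = ["CTAGA", "TAGAC", "AGACG", "TATCC", "ACGTT", "CGTTC"]
    print(preprocess(spectrum, 5))
```

On the spectrum of Example 5 the sketch yields the three fragments CTAGACG, TATCC, and ACGTTC. Because the choice of seed and extension is greedy, a different input order can produce different chains.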

4.2 Encoding and Initialization

After the preprocessing step is completed, we work only with the new spectrum

from the preprocessing step which has lower cardinality and longer

fragments of variable length. We assume that fragments in this spectrum

are indexed in some order. Each member of the population is a vector of

fragment indexes representing a possible solution sequence to the problem.

Given a vector u[1...m] of fragment indexes, the corresponding sequence is obtained by merging the fragments F_u[1], F_u[2], ..., F_u[m] in that order, where F_i is the ith fragment in the spectrum. Here, two adjacent fragments are put together by overlapping them as much as possible. We also maintain the constraint that each fragment index appears at most once in the encoding of a sequence. An algorithm to repair the chromosome is given in Figure 8.

The vectors in the population can be of variable length. In what follows, we

refer to each member of the population as a vector or a sequence.

An initial population of size 120 is generated at random. The size of the

population remains constant throughout the algorithm. All the vectors in the

initial population have the same size, i.e., each vector has the same number

of fragment indexes. However, since the fragments are of variable length after

the preprocessing step, the corresponding sequences have different lengths.

Example 6. Suppose the spectrum obtained after preprocessing contains

the fragments CTAGACG, TATCC, ACGTT, and the fragments are indexed

in that order from 1 to 3. Then, the vector (1, 3) yields the sequence

CTAGACGTT, and the vector (2, 1) yields the sequence TATCCTAGACG.
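
The decoding used in Example 6 can be sketched in Python. The helper names `overlap` and `decode` are assumptions made for illustration; the sketch simply merges each fragment with its predecessor using the largest suffix-prefix overlap, as described above, and uses 1-based fragment indexes to match the example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def decode(vector, fragments):
    """Turn a vector of 1-based fragment indexes into a sequence."""
    frags = [fragments[i - 1] for i in vector]
    sequence = frags[0]
    for prev, cur in zip(frags, frags[1:]):
        sequence += cur[overlap(prev, cur):]  # add only the non-overlapping tail
    return sequence


if __name__ == "__main__":
    fragments = ["CTAGACG", "TATCC", "ACGTT"]
    print(decode((1, 3), fragments))
    print(decode((2, 1), fragments))
```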

4.3 Fitness

The fitness of each sequence is calculated based on two factors: (i) the amount

of overlap between adjacent fragments in the sequence, and (ii) the length

of the sequence. The idea is that more overlap between adjacent fragments

results in a shorter sequence. Also, if the length of the sequence is equal to

n + l − 1, where n is the cardinality of the original spectrum and l is the length of

each original fragment, a bonus value is added to the value of the fitness. Note that

n+l−1 is the optimal length for a sequence that includes all fragments in the

spectrum. More formally, let u[1...k] be a vector representing a sequence U

in the population. The fitness of U is defined as follows.

\[
f(U) = c \times \sum_{i=1}^{k-1} \left| F_{u[i]} \cap F_{u[i+1]} \right| + z
\]

where z is the bonus value defined by

\[
z =
\begin{cases}
s, & \text{if } |U| = n + l - 1, \\
s \,/\, \bigl|\, |U| - (n + l - 1) \,\bigr|, & \text{otherwise.}
\end{cases}
\]

For our experiment, c was set to 10 and s was 100. The fitness can be computed efficiently using a dynamic programming technique.

Example 7. Consider the previous example, with n = 6 and l = 5 taken from the original spectrum of Example 5, so that n + l − 1 = 10. Then, using the formula for calculating the fitness would result in the following:

• Fitness(chromosome (1, 3)) = [10 × 3] + 100/|9 − 10| = 130

• Fitness(chromosome (2, 1)) = [10 × 1] + 100/|11 − 10| = 110
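
The two values of Example 7 can be checked with a short Python sketch of the fitness formula. The function names and argument layout are assumptions made for illustration; `n` and `l` refer to the original spectrum (6 fragments of length 5), and the overlap between adjacent fragments is taken as the longest suffix-prefix match.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def fitness(vector, fragments, n, l, c=10, s=100):
    """c * (total overlap of adjacent fragments) + length bonus z."""
    frags = [fragments[i - 1] for i in vector]      # 1-based indexes
    total_overlap = sum(overlap(a, b) for a, b in zip(frags, frags[1:]))
    length = sum(len(f) for f in frags) - total_overlap
    target = n + l - 1                              # ideal sequence length
    z = s if length == target else s / abs(length - target)
    return c * total_overlap + z


if __name__ == "__main__":
    fragments = ["CTAGACG", "TATCC", "ACGTT"]
    print(fitness((1, 3), fragments, n=6, l=5))
    print(fitness((2, 1), fragments, n=6, l=5))
```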

4.4 Parent Selection

The parents are selected using the standard proportional selection method

where sequences that have higher fitness have a better chance of being selected.

The standard roulette wheel scheme is used in our algorithm [14].
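
A minimal Python sketch of roulette wheel selection is given below. It is a generic textbook version, assuming non-negative fitness values with a positive sum; the function name `roulette_select` and the use of an explicit random generator argument are illustrative choices, not details taken from our implementation.

```python
import random


def roulette_select(fitnesses, rng):
    """Return an index chosen with probability proportional to its fitness."""
    total = sum(fitnesses)
    pick = rng.random() * total        # a point on the wheel in [0, total)
    running = 0.0
    for i, fit in enumerate(fitnesses):
        running += fit
        if pick < running:
            return i
    return len(fitnesses) - 1          # guard against rounding at the end


if __name__ == "__main__":
    rng = random.Random(2004)
    fitnesses = [130.0, 110.0, 40.0]
    parents = [roulette_select(fitnesses, rng) for _ in range(2)]
    print(parents)
```

Calling the function twice, as in the usage lines, may return the same index for both parents; whether that is allowed is a design choice this sketch leaves open.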

4.5 Crossover

An offspring is constructed by selecting alternately from each parent after k

cutpoints have been determined. Note that the members of the population

are vectors of fragment indexes. This process may create offspring that contain

duplicated fragment indexes. A repair algorithm, shown in Figure 8, is

used to get rid of any repeated fragments and to ensure all fragments are represented

within the offspring. The repair algorithm works by replacing the

repeated fragments with fragments from the spectrum that are not currently

in use by the sequence. We used a 3-point crossover in our testing.
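
The alternating construction can be sketched in Python as follows. The function name `k_point_crossover`, the requirement that both parents have the same length, and the convention that the first segment comes from the first parent are assumptions made for this illustration.

```python
def k_point_crossover(p1, p2, cuts):
    """Build a child by taking segments alternately from p1 and p2.

    cuts is a sorted list of cut positions; parents must have equal length.
    """
    parents = (p1, p2)
    child, prev = [], 0
    for n, cut in enumerate(list(cuts) + [len(p1)]):
        child.extend(parents[n % 2][prev:cut])  # even segments from p1, odd from p2
        prev = cut
    return child


if __name__ == "__main__":
    p1 = [1, 2, 3, 4, 5, 6, 7, 8]
    p2 = [8, 7, 6, 5, 4, 3, 2, 1]
    # A 3-point crossover; the child contains duplicated indexes
    # and therefore needs the repair step.
    print(k_point_crossover(p1, p2, [2, 4, 6]))
```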

4.6 Mutation

A sequence is mutated with a pre-determined probability. The mutation

is done by randomly selecting two fragments in the sequence and swapping

them. In our experiments, the mutation probability is set at 10%. We also

used other values for testing the algorithm.
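
The swap mutation can be sketched in a few lines of Python. The function name `mutate` and the default arguments are illustrative assumptions; the rate of 0.10 corresponds to the 10% mutation probability mentioned above.

```python
import random


def mutate(vector, rate=0.10, rng=random):
    """With probability `rate`, swap two randomly chosen positions."""
    child = list(vector)
    if len(child) >= 2 and rng.random() < rate:
        i, j = rng.sample(range(len(child)), 2)  # two distinct positions
        child[i], child[j] = child[j], child[i]
    return child


if __name__ == "__main__":
    print(mutate([1, 2, 3, 4, 5], rate=1.0, rng=random.Random(7)))
```

Since the mutation only exchanges two entries, it never introduces duplicated fragment indexes and needs no repair.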

Repair(C) // C is a chromosome
    Create an array a with size |C|
    Create two empty queues Q1 and Q2
    for each fragment f ∈ C
        Check the current place p for fragment f in array a
        if a[p] is marked then
            add p to Q1[top]
        else
            mark a[p] as used
        endif
    endfor
    for all fragments f ∈ a where f is not used
        Q2[top] ← f
    endfor
    update C: replace fragments f_i ∈ Q1 with fragments f_j ∈ Q2

Figure 8: The Algorithm to repair the Chromosome
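
One possible reading of the repair procedure in Figure 8 is sketched below in Python. The function name `repair`, the 1-based fragment indexes, and the choice to fill duplicated positions with unused indexes in increasing order are assumptions made for illustration. The sketch also assumes that the chromosome is no longer than the number of fragments, so that enough unused indexes are available.

```python
from collections import deque


def repair(chromosome, num_fragments):
    """Replace repeated fragment indexes with indexes not yet in use."""
    seen = set()
    duplicate_positions = deque()      # plays the role of Q1
    for pos, idx in enumerate(chromosome):
        if idx in seen:
            duplicate_positions.append(pos)
        else:
            seen.add(idx)
    unused = deque(i for i in range(1, num_fragments + 1)
                   if i not in seen)   # plays the role of Q2
    repaired = list(chromosome)
    while duplicate_positions and unused:
        repaired[duplicate_positions.popleft()] = unused.popleft()
    return repaired


if __name__ == "__main__":
    print(repair([1, 2, 6, 5, 5, 6, 2, 1], num_fragments=8))
```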

4.7 Local Optimization

The local optimization algorithm has two steps. The first step is to scan the

sequence sequentially and identify a pair of adjacent fragments, say x and

y, with the smallest overlap. Then, we find the fragment, say z, which has

the highest overlap with x. Then, we replace y with z. The vector is then

repaired, if needed, to eliminate duplicated fragments.

The second step is to rearrange the fragments in the sequence in the

hope of improving its fitness value. This is done by first finding the two

pairs of adjacent fragments that have the two smallest overlaps. Let s,t

be the first pair and x,y be the second pair. That is, assume that the

vector u = (a,...,s,t,...,x,y,...,z). We then construct a new vector u′ by swapping the fragments from t to x with the fragments from y to the end of u. Thus, u′ = (a,...,s,y,...,z,t,...,x). If the fitness of u′ is better than that of u, we replace u by u′. Otherwise, we keep u and discard u′. By using a dynamic programming technique, the local optimization algorithm and the repair algorithm can be efficiently implemented.

LocalOptimize(C) // C is a chromosome
    for each fragment f ∈ C
        find two fragments, x and y, with min intersection between them
        scan matrix M
        find fragment z with max intersection with x
        replace y with z
        repair(C)
    endfor
    for each fragment f ∈ C
        C1 ← C
        find two fragments, x and y, with smallest intersection between them
        find another two fragments, s and t, with second smallest intersection between them
        C2 ← swap([t ... x], [y ... end])
        if fitness(C2) > fitness(C1)
            return (C2)
        else
            return (C1)
        endif
    endfor

Figure 9: The Algorithm for the Local Optimization Process
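
The second step of the local optimization, the segment swap, can be sketched in Python as follows. The function names, the representation of a junction p as the boundary between positions p and p + 1 of the vector, and the fitness function passed in as an argument are assumptions made for illustration. The sketch requires at least two junctions, i.e., a vector of at least three fragments.

```python
def weakest_junctions(overlaps):
    """Positions of the two smallest overlaps, in left-to-right order.

    overlaps[p] is the overlap between the fragments at positions p and p+1.
    """
    order = sorted(range(len(overlaps)), key=lambda p: overlaps[p])
    return tuple(sorted(order[:2]))


def swap_segments(u, i, j):
    """(a..s, t..x, y..z) -> (a..s, y..z, t..x), with s = u[i] and x = u[j]."""
    return u[:i + 1] + u[j + 1:] + u[i + 1:j + 1]


def rearrange(u, overlaps, fitness):
    """Keep the rearranged vector only if its fitness is strictly better."""
    i, j = weakest_junctions(overlaps)
    candidate = swap_segments(u, i, j)
    return candidate if fitness(candidate) > fitness(u) else u


if __name__ == "__main__":
    u = list("astmxyz")                 # letters stand for fragments
    overlaps = [4, 0, 3, 3, 1, 4]       # weakest junctions: (s,t) and (x,y)
    print("".join(swap_segments(u, *weakest_junctions(overlaps))))
```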

4.8 Replacement Scheme

The following replacement scheme is used. If the fitness of the new offspring

is larger than the fitness of the poorer of the two parents, then we replace

that parent with the new offspring. Otherwise, we discard the new offspring.

4.9 Stopping Condition

The algorithm terminates if there is no improvement in the total fitness of the

population in 400 consecutive generations, or if the number of generations

exceeds 50,000. These values were selected because they gave the best compromise between the quality of the solution and the performance of the algorithm.
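
The replacement scheme and the stopping condition can be combined into a generic steady-state loop, sketched below in Python. This is an illustrative skeleton under stated assumptions, not our C++ implementation: `make_offspring` is a hypothetical callback that performs selection, crossover, local optimization, and mutation and returns the two parent positions together with the child, and the toy problem in the usage lines is invented for the example.

```python
def steady_state_ga(population, fitness, make_offspring,
                    max_stagnant=400, max_generations=50_000):
    """Run until the total fitness stagnates or the generation cap is hit."""
    population = list(population)
    scores = [fitness(ind) for ind in population]
    best_total = sum(scores)
    stagnant = generation = 0
    while stagnant < max_stagnant and generation < max_generations:
        generation += 1
        i, j, child = make_offspring(population, scores)
        worse = i if scores[i] <= scores[j] else j   # poorer of the two parents
        child_score = fitness(child)
        if child_score > scores[worse]:              # otherwise discard the child
            population[worse], scores[worse] = child, child_score
        total = sum(scores)
        if total > best_total:
            best_total, stagnant = total, 0
        else:
            stagnant += 1
    best = max(range(len(population)), key=lambda k: scores[k])
    return population[best], generation


if __name__ == "__main__":
    # Toy problem: individuals are integers, fitness is the value itself,
    # and every child is one larger than the current best, capped at 5.
    def toy_offspring(pop, scores):
        return 0, 1, min(max(pop) + 1, 5)

    print(steady_state_ga([1, 2], lambda v: v, toy_offspring, max_stagnant=3))
```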

5 Standardizing the Data

Data play an important role in the computational biology field. Most algorithms in this field deal with large amounts of data, so obtaining the right set of data is crucial to developing good algorithms. The data have to be diverse and have different characteristics. In the area of DNA sequencing, we identified the following important characteristics for the test data. The data sets should come from different genomes, and each genome should have its own features. Spectra used in testing should have a variety of error percentages, both negative and positive, as well as different fragment lengths.

In order to obtain data with the characteristics mentioned above, many

steps have to be taken. The first step is to obtain genomes that have those

characteristics. We selected three different genomes from the GenBank [13].

Table 2 shows the details of the genomes obtained and used in testing our

new algorithm.

Table 2: Genomes obtained from the GenBank

Sequence                                                     Length (BP)
Human immunodeficiency virus 2 (HIV)                         10,359
Drosophila melanogaster DNA sequence of white locus (Fly)    14,245
Canis familiaris clone RP81-60B6 (Complete Dog genome)       165,116

The second step is to develop a new algorithm that simulates the hybridization experiment and generates spectra. It generates spectra in two different ways. The first is to simulate the hybridization experiment by generating a spectrum with a specific upper bound on the error percentages. The second is to generate a complete set of data that varies several parameters, such as the fragment length, the error percentage, and the sequence length. Using these variations of test data in testing our new enhanced genetic algorithm ensured that it works well for different genomes with different characteristics and with data that are more practical and more diverse. The spectrum generator algorithm is briefly described in Figure 10.

Generate(Genome)
    Read length of output spectrum
    for each error combination e1 in {0,5,8,10,20}
        for each fragment length l2 in {10,20,50}
            for i=1...10
                Generate output file name
                Randomly select a start point in the input sequence
                Generate all fragments of length l2 in the spectrum
                Introduce e1% positive errors
                Introduce e1% negative errors
            endfor
        endfor
    endfor

Figure 10: The Spectrum Generator Algorithm.

The spectrum generator algorithm first determines the length of the sequence to be generated. Then, using the Mersenne Twister (MT) random number generator [20], it picks a random starting point and reads a sequence of the desired length. It ensures that all generated sequences are different. The second step is to generate fragments with zero positive and negative errors. It then generates data with 5%, 8%, 10%, and 20% errors for negative, positive, and both types of errors. Table 3 shows the combinations of errors generated using the spectrum generator algorithm. The total number of error combinations generated by the algorithm is 13. The algorithm also generates data with different fragment lengths to study the effect of using longer fragments on the quality of the output solution. The data algorithm is able to generate fragments of any length and size. We tested the genetic algorithm with fragments of length 10, 20, and 50 nucleotides. For each combination mentioned above, 10 different sequences, obtained from different positions within the parent genome, have been generated.

Table 3: Positive and Negative error combinations used in testing the algorithm

        +0%   +5%   +8%   +10%  +20%
-0%     x     x     x     x     x
-5%     x     x
-8%     x           x
-10%    x                 x
-20%    x                       x

The technique used in generating fragments with errors divides the process into two steps. The first step is to produce the positive errors. The second step is to produce the negative errors. The numbers of positive-error and negative-error fragments are calculated from the percentage of errors required in the output spectrum. For positive errors, a random fragment index in the fragment array is selected. Then, a random position within that fragment is selected, and the correct nucleotide at that position is replaced with a different, randomly chosen nucleotide to introduce the positive error. The negative error is much simpler: a random fragment index is selected and that fragment is removed from the final spectrum.
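
The two error-injection steps can be sketched in Python. The sketch is an illustration under several assumptions that go beyond the description above: the function name `generate_spectrum` is hypothetical, an erroneous fragment is added to the spectrum next to the fragment it was derived from rather than replacing it, erroneous fragments that happen to coincide with a correct fragment are rejected and redrawn, and error counts are obtained by rounding the requested percentage of the ideal spectrum size. Python's `random.Random` is itself a Mersenne Twister generator.

```python
import random


def generate_spectrum(sequence, l, neg_pct, pos_pct, rng):
    """Ideal spectrum of `sequence` with positive and negative errors added."""
    ideal = sorted({sequence[i:i + l] for i in range(len(sequence) - l + 1)})
    ideal_set = set(ideal)
    n_neg = round(len(ideal) * neg_pct / 100)
    n_pos = round(len(ideal) * pos_pct / 100)
    spectrum = set(ideal)

    # Positive errors: mutate one random position of a random fragment.
    added, attempts = 0, 0
    while added < n_pos and attempts < 1000 * (n_pos + 1):
        attempts += 1
        frag = rng.choice(ideal)
        pos = rng.randrange(l)
        base = rng.choice([b for b in "ACGT" if b != frag[pos]])
        erroneous = frag[:pos] + base + frag[pos + 1:]
        if erroneous not in ideal_set and erroneous not in spectrum:
            spectrum.add(erroneous)
            added += 1

    # Negative errors: remove randomly chosen correct fragments.
    for frag in rng.sample(ideal, n_neg):
        spectrum.discard(frag)
    return sorted(spectrum)


if __name__ == "__main__":
    rng = random.Random(2004)
    print(generate_spectrum("CTAGACGTTC", 5, neg_pct=0, pos_pct=0, rng=rng))
```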

Another source of data that we used when testing our new enhanced

algorithm is from [7]. This data set has the following characteristics. It

has a fragment length of 10, 20% negative errors, and 20% positive errors.

Usually, in the hybridization experiment, the practical percentages of errors

are in the range from 1% to 3% [23]. Nonetheless, our algorithm performed

very well when tested using this set of data as will be seen in the next section.

6 Experimental Results

In this section we first describe the performance of our algorithm in comparison

with some existing algorithms for the DNA sequencing problem. We

then show the result of testing our algorithm on an extensive set of data that

we generated as mentioned before. Our algorithm was implemented in C++

and was run on a PC with Pentium IV 2.4GHz Intel processor with 512MB

of RAM.

Our first set of test data is from [7][18]. We used it to compare our algorithm

to the Tabu Search algorithm in [1] and the Hybrid Genetic Algorithm

in [7]. The data from this set consist of spectra having 100, 200, 300, 400,

and 500 fragments. There are 40 spectra for each size, for a total of 200

instances. The fragment length in all of these instances is 10. Each instance

has 20% positive errors and 20% negative errors.

Figure 11: Comparison between our algorithm and others, panels (a), (b), and (c)

To determine the quality of our solution, we follow [7] and use the classical pairwise Smith-Waterman sequence alignment algorithm described previously to compare the output of the algorithm with the original sequences from which the spectra were generated. We use two values from the output of the Smith-Waterman algorithm: the match percentage and the similarity score. In addition, as in [7], for each instance tested we include the number of times the algorithm finds the optimal answer. This number is called the optimum number. More formally, the optimum number is the number of times the algorithm is able to reach the optimal answer within the input data set. Table 4 and Figure 11 summarize the results of the comparison, and show that our algorithm performs significantly better than the other two algorithms as the sequence length gets longer. More details can be seen in Figures 11(a), (b), and (c) (see footnote 1).

Even though the running times are available for all algorithms, we cannot

compare the running times, since different machines were used. For our

algorithm, the average running time ranged from 0.6 seconds for the smallest

problem size to 15.1 seconds for the largest problem size tested. The Hybrid

GA and Tabu Search algorithms were run on a PC with a Pentium II 300MHz

processor and 256MB of RAM. The average running times range from 13.5

seconds to 437.9 seconds for the Hybrid GA and from 14.1 seconds to 471.5

seconds for the Tabu Search algorithm. More details can be found in Table 4.

Table 4: Summary of results from our algorithm and others
(columns give the number of fragments in the spectrum)

Algorithm (machine)                 Measure                         100     200     300     400     500
Enhanced GA                         Average Similarity Score (pt)   106.7   200.6   291.6   352.2   451.6
(Pentium IV, 2.4GHz, 512MB RAM)     Average Similarity Score (%)    98.9    97.6    96.7    92.9    92.0
                                    Running time (sec)              0.6     1.5     3.6     8.6     15.1
                                    Optimum no.                     29      26      22      13      13
HGA                                 Average Similarity Score (pt)   108.4   199.3   274.1   301.7   326.0
(Pentium II, 300MHz, 256MB RAM)     Average Similarity Score (%)    99.7    97.7    94.3    86.9    82.0
                                    Running time (sec)              13.5    63.4    154.9   263.4   437.9
                                    Optimum no.                     40      31      20      9       5
Tabu Search                         Average Similarity Score (pt)   108.4   184.1   196.6   229.5   235.1
(Pentium II, 300MHz, 256MB RAM)     Average Similarity Score (%)    99.7    94.0    81.8    78.1    73.1
                                    Running time (sec)              14.1    60.8    177.7   258.3   471.5
                                    Optimum no.                     40      24      11      6       2

Footnote 1: The algorithms included in the comparison in Figure 11 are our new Enhanced GA, the Hybrid GA, and the Tabu Search. Figure 11(a) shows the Match Percentage, Figure 11(b) shows the Optimum Number, and Figure 11(c) shows the Similarity Score.

The second set of test data we used was generated by the spectrum generator

algorithm using the three genomes obtained from the GenBank [13]

referenced in Table 2. Table 3 lists all spectrum error combinations, positive

and negative, that were used to test the algorithm. For each of the three

genomes we tested the GA using 10 different sequences of each length in the

set {100, 200, 300, 400, 500, 1000, 2000}. That is, for each genome we tested

the algorithm using 70 different sequences. From each of these sequences we

used spectra with fragments of length 10 and 20. For each fragment length,

13 different combinations of positive and negative errors were used, ranging

from 0 to 20% errors as shown in Table 3. Hence, we tested all three genomes

using a total of 3 × 70 × 2 × 13 = 5,460 spectra.
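
The count of error combinations and the total number of spectra can be verified with a few lines of Python. The enumeration assumes, as Figure 10 suggests, that the combinations containing both error types use the same percentage for negative and positive errors; the function names are illustrative.

```python
def error_combinations(levels=(5, 8, 10, 20)):
    """(negative %, positive %) pairs: none, positive only, negative only, both."""
    combos = [(0, 0)]
    combos += [(0, p) for p in levels]   # positive errors only
    combos += [(n, 0) for n in levels]   # negative errors only
    combos += [(e, e) for e in levels]   # both types, equal percentages
    return combos


def total_spectra(genomes=3, sequence_lengths=7, sequences_per_length=10,
                  fragment_lengths=2):
    """Genomes x sequences per genome x fragment lengths x error combinations."""
    sequences_per_genome = sequence_lengths * sequences_per_length  # 70
    return (genomes * sequences_per_genome * fragment_lengths
            * len(error_combinations()))


if __name__ == "__main__":
    print(len(error_combinations()), total_spectra())
```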

We have also generated a similar set of spectra with fragments of length

50. The algorithm was tested on all spectra of three different lengths: 10, 20,

and 50. We observe that the longer the fragments are, the better the results

are. In fact, with a fragment length of 50, our algorithm almost always found

the optimal answers, and thus, we do not include the data for fragments of

length 50 here. We used different fragment lengths since in practice different

hybridization techniques may require different fragment lengths. Normally,

the hybridization rate is better if the fragment length is longer. However, for

in situ hybridization, a small fragment length is required [21].

As in the case of the first data set, we use the Smith-Waterman sequence

alignment algorithm to determine the quality of the solutions returned by the

algorithm. We used an implementation of the Smith-Waterman algorithm

provided by Jie Li of Iowa State University [19]. Figures 12 and 13 show the

performance and running time of our algorithm on the second set of data. In

Figure 12, the graph shows the match percentage for spectra with fragment

length 10. The x-axis shows the various error combinations in the input spectrum.

The notation -a+b indicates spectra with a% negative errors and b% positive errors. Thirteen different error combinations were used to verify the performance of the new algorithm. These error combinations were selected for several reasons. Previous research in the same area used similar values for testing. We also wanted to provide error combinations that are distributed over the range from 0 to 20. We found experimentally that those values are adequate to illustrate the strength of our new algorithm compared to other existing algorithms. Figure 13 shows the running time for spectra with a fragment length of 10.

Figure 12: Plot of Match Percentage against Error Combinations (l=10)

Figure 13: Plot of Running Time against Error Combinations (l=10)

Figure 14 shows the match percentage for spectra with a fragment length of 20. For each

error combination, the match percentage shown is the average over all spectra

generated from all three genomes. In all cases, the match percentage is over

90%. It can be observed that spectra with a longer fragment length appear

to be easier to solve than the ones with a smaller fragment length. The same

machine that we used to test the algorithm on the first data set was used

for the second data set. Figure 15, with a fragment size of 20, shows that

spectra with a higher error percentage seem to take longer than spectra with

a smaller error percentage.

Figure 14: Plot of Match Percentage against Error Combinations (l=20)

The next few figures provide more information about the performance

of our new algorithm with respect to different types of input. Figure 16

shows the match percentage against only the negative errors for all fragment

lengths. It can be seen that the match percentages are in the range from

96% up to 99%. The match percentage is considered very good given the

cardinality of the input spectrum. The graph shows that in some cases the

match percentage is lower even when the error percentage decreases. This

is because the spectrum generator algorithm is a random algorithm, so a

spectrum with fewer errors might be harder to reconstruct than another

spectrum with more errors. Other types of errors, such as the repeated fragment error, also affect the match percentage.

Figure 15: Plot of Running Time against Error Combinations (l=20)

Figure 16: Plot of Match Percentage against Negative Errors

Figure 17 shows that the running time increases with the percentage of negative errors. The higher the negative error percentage, the longer the time needed to obtain the result. The plot in Figure 17 summarizes results for all fragment lengths.

Figure 17: Plot of Running Time against Negative Errors

The match percentage for positive errors, as shown in Figure 18, is between

96% and 98%, which is considered very good given that the cardinality

of the spectrum is approximately 500 fragments. The plot represents positive errors

in the spectrum versus the match percentages for all fragment lengths, all

negative errors, and all spectra sizes.

In general, the running time increases as the percentage of positive errors

increases. However, Figure 19 shows that the running time for 0% positive errors was worse than the running time for 5% or 8%. The reason is that with short fragments, there exists another type of error in the spectrum, known as the repeated fragments error. The repeated fragments error prevents the genetic algorithm from obtaining optimal answers even when there are no other types of errors included. Another factor that

affects the quality of the solution obtained is the nature of the input sequences

and spectra. Some sequences are harder to reconstruct while others are easier

to reconstruct.

Figure 18: Plot of Match Percentage against Positive Errors

Figure 19: Plot of Running Time against Positive Errors

Figure 20 shows how the match percentage decreases as the length of the chromosome increases. Longer chromosomes have spectra with larger cardinality. As the cardinality of the input spectrum increases, the chance of obtaining the optimal answer decreases.

Figure 20: Plot of Match Percentage against Optimal Length

It is clear from Figure 21 that the longer the chromosome, the longer the time needed to find the result. This is because longer chromosomes consist of more fragments, so the cardinality of the spectrum is larger for longer ones. Thus, larger spectra increase the running time needed to reconstruct the sequence and obtain the result.

Figure 21: Plot of Running Time against Optimal Length

Figure 22 shows that the longer the fragment, the better the match percentage.

Increasing the fragment length improves the match percentage because longer

fragments are less likely to be repeated. Longer fragments also reduce the

effect of errors in the spectrum.
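One way to see why longer fragments are less likely to be repeated is a simple back-of-the-envelope model. Assume, purely for illustration, that each nucleotide is drawn independently and uniformly from {A, C, G, T}. Any two windows of length l at different positions are then identical with probability 4^(-l), so the expected number of identical pairs among m windows is C(m, 2) / 4^l. Real genomes are not uniformly random, so this is only a sketch; the function name and the values of m and l below are assumptions.

```python
from math import comb


def expected_repeated_pairs(num_windows, fragment_length):
    """Expected number of identical window pairs under an i.i.d. uniform model."""
    return comb(num_windows, 2) / 4 ** fragment_length


# Illustrative values: about 500 windows and a few fragment lengths.
for length in (6, 8, 10, 12):
    print(length, expected_repeated_pairs(500, length))
```

Under this model each additional nucleotide in the fragment length divides the expected number of repeated pairs by four.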

Figure 22: Plot of Match Percentage against Fragment Length

Figure 23 shows that the running time of the algorithm improves as the

fragment length increases. This follows from the fact that increasing the

fragment length decreases the cardinality of the spectrum. It also reduces

the number of fragments each chromosome consists of, so the result is

obtained more quickly.

Figure 23: Plot of Running Time against Fragment Length

Experimental results from the two data sets suggest that our algorithm

performs very well against existing algorithms. It is also robust against

different combinations of errors.

7 Conclusion

This study has introduced a new enhanced genetic algorithm for the DNA

sequencing problem. The results produced by the algorithm were very good:

in many cases they were optimal or close to optimal, and they were frequently

better than those of existing algorithms. Taking into account the difference in

speed of the machines on which the various algorithms were run, our algorithm

seems to be comparable to, if not faster than, existing algorithms. One area we

did not cover in this research is the repeated fragments error. Repeated

fragments can prevent the algorithm from finding optimal answers even when

there are no other types of errors in the spectrum. This type of error

diminishes as the fragment length increases. We plan to investigate the

problem of repeated fragments further.
