On the Analysis of Optical Mapping Data - University of Wisconsin ...

More documents

Recommendations

Info

62 4.2 Methods Motivation: We start with optical maps obtained from a sampled genome and a reference map representing a ‘normal’ genome, usually derived in silico from a reference sequence. If a region within the reference has increased (resp. decreased) copy number in the sample, more (fewer) maps will originate from it on average compared to if it had normal copy number. Our goal is to detect such regions. At any locus along the genome, the number of optical maps overlapping it is a local measure of coverage depth. Intuitively, aberrant copy number should be reflected in a systematic change in this coverage depth. However, using this measure directly is problematic due to spatial dependences. Instead, we summarize the location of each map by its midpoint. These locations can then be viewed as independent random variables. Alignment: A fundamental prerequisite in our approach is to identify where in the in silico map, if at all, an optical map originated. This is an instance of the general alignment problem, which is usually approached by defining a score function that assigns a numeric score to each potential alignment and then searching for the alignment that maximizes this score. Dynamic Programming (DP) algorithms based on additive score functions have been used extensively in DNA and protein sequence alignment (Durbin et al., 1998), and with suitable modifications, they can be used to align restriction maps as well (Huang and Waterman, 1992). Chapters 1 and 3 discuss optical map alignment and related issues in some detail. For the purposes of this chapter, the goal of the alignment step is simply to infer the location of a given map. We assume that a reasonably sensitive alignment scheme with low false positive rate is available. Thinning: Consider the location (midpoint) of a randomly chosen optical map with respect to the reference genome. For shotgun optical mapping, it is natural to model this location as uniformly distributed over the underlying genome. Ignoring edge effects, it is equivalent to view map locations as realizations of a homogeneous Poisson process (Lander and Waterman,
63 1988). However, due to sampling errors and alignment efficiencies, the observed locations are better modeled as a non-homogeneous Poisson process (NHPP). Not all maps are successfully aligned, and we observe locations, up to errors in the alignment, only for those that do align. Alignment algorithms are not uniformly sensitive, and the chance that the origin of a map is correctly identified depends on the origin. For example, maps from regions where markers are sparse are less likely to align, while maps with rare patterns score higher and are more likely to align (see Figure 3.10). Copy number alteration in a region of the sampled genome is manifested as further change in the non-homogeneous alignment rate. To detect fluctuations that are due to copy number changes, we must first adjust for the normal fluctuations due to the inherent but unknown non-homogeneity. Notation: The availability of optical map data from normal cells can be used to establish normal variation in the absence of CNP, but this requires some preliminary work. Formally, let {Y 1 < · · · < Y n } be aligned locations of the optical maps of interest, and {X 1 < · · · < X m } be similar aligned locations for ‘normal’ optical maps. We model {X k } m 1 as realizations of a non-homogeneous Poisson process with rate λ 0 (·) and {Y k } n 1 as realizations of a nonhomogeneous Poisson process with rate λ(·). If the two genomes are identical, we have λ(c) = κλ 0 (c), where c denotes position along the reference genome, and κ represents a constant of proportionality reflecting the possibility of different overall coverage in the two genomes. κ can be estimated in this case by the ratio n/m of the number of aligned maps. If copy number alterations exist in the genome, κ is defined to be the constant of proportionality in regions of normal copy number. Interval counts: It is natural to work with interval counts when dealing with Poisson process data, so we discretize the sample space into intervals G i = (a i , b i ], i = 1, . . .,L tiling the reference genome. The choice of G i ’s is important and is discussed below. Assume for the moment that λ 0 and κ are known. For the i th interval, define the counts N i = ∑ k I {ai
Page 1 and 2:
ON THE ANALYSIS OF OPTICAL MAPPING
Page 3 and 4:
To my parents. i
Page 5 and 6:
DISCARD THIS PAGE
Page 7 and 8:
iv Page 3.3 Results . . . . . . . .
Page 9 and 10:
v LIST OF TABLES Table Page 1.1 Sum
Page 11 and 12:
vi LIST OF FIGURES Figure Page 1.1
Page 13 and 14:
ON THE ANALYSIS OF OPTICAL MAPPING
Page 15 and 16:
1 Chapter 1 Overview of Optical Map
Page 17 and 18:
3 hard, do not always have a unique
Page 19 and 20:
5 for microbial and other small gen
Page 21 and 22:
7 Figure 1.2 Close-up of a typical
Page 23 and 24:
9 0.96 0.98 1.00 1.02 1.04 Offset a
Page 25 and 26: 11 direct glimpse at the underlying
Page 27 and 28: 13 which is not surprising since we
Page 29 and 30: 15 Gapped alignments: The above des
Page 31 and 32: Figure 1.5 A visualization of align
Page 33 and 34: 19 Assembly: For these examples, th
Page 35 and 36: 21 Chapter 2 Modeling Optical Map D
Page 37 and 38: 23 Alternatively, it can be thought
Page 39 and 40: 25 and V (X i ) = E(V (Y i R i |R i
Page 41 and 42: 27 affect inference. If necessary,
Page 43 and 44: 29 Quantiles of fragment lengths (K
Page 45 and 46: 31 as a function of the parameters.
Page 47 and 48: 33 by rejecting maps that do not al
Page 49 and 50: 35 30 0.700 − 0.005 0 50 100 150
Page 51 and 52: 37 Chapter 3 Significance of Optica
Page 53 and 54: 39 using optical mapping data from
Page 55 and 56: 41 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8
Page 57 and 58: 43 3.3.2 Simplifications Direct app
Page 59 and 60: 45 Mean spurious score 0 −10 −2
Page 61 and 62: 47 3.3.3 Simulation Given a generat
Page 63 and 64: 49 3.4 Discussion 3.4.1 Uses Alignm
Page 65 and 66: 51 The ability to simulate from the
Page 67 and 68: 53 maps, where the separation betwe
Page 69 and 70: Figure 3.10 Schematic representatio
Page 71 and 72: 57 especially a short noisy one, to
Page 73 and 74: 59 Test statistics: Variability due
Page 75: 61 in sequence assembly and validat
Page 79 and 80: 65 and rate parameters Λ i = E(N i
Page 81 and 82: 67 with mean µ k for the k th stat
Page 83 and 84: 69 Estimated Copy Number in simulat
Page 85 and 86: 71 Posterior probabilities 1.0 0.8
Page 87 and 88: 73 (a) Observed counts and decoded
Page 89 and 90: 75 Conclusion: Copy number alterati
Page 91 and 92: 77 well in its current form, but th
Page 93 and 94: 79 Change in score 15 10 5 0 0.9 1.
Page 95 and 96: 81 will rarely be homozygous. It ma
Page 97 and 98: 83 E.T. Dimalanta, A. Lim, R. Runnh
Page 99 and 100: 85 Appendix A: Score functions for
Page 101 and 102: 87 Appendix B: Hidden Markov Model
Page 103 and 104: 89 which can be shown to have highe
show all

On the Analysis of Optical Mapping Data - University of Wisconsin ...

Create successful ePaper yourself

Delete template?

Save as template?