On the Analysis of Optical Mapping Data - University of Wisconsin ...


... restriction map can be derived in silico by identifying the enzyme recognition pattern in the reference sequence, and the primary goal of optical mapping is to determine how the genome under study differs from the reference copy in terms of their respective restriction maps. Such differences can be due to errors in the sequence, especially in the early stages of sequencing, but more importantly, they can reflect real biological variation. In either case, these broad goals are often tackled by breaking them down into smaller, more tractable problems.

Algorithmic challenges: Optical mapping has been very successful in obtaining restriction maps of relatively small genomes (e.g. microbes). A critical component of this success has been algorithmic research in the 1990s specifically aimed at optical mapping data, notably the work of Anantharaman et al. (1999) leading to the Gentig assembly software. With recent technological advances, the focus has shifted to larger genomes. The primary challenge introduced by this shift is scalability. Computational methods that work well for microbial genomes may fail for large genomes due to memory and speed limits of existing computational systems. Since mammalian genomes are larger than microbial genomes by several orders of magnitude, the relative coverage may be far lower. Careful statistical analysis is thus critical in making full use of the available data. New methods are also required to take advantage of in silico maps when they are available. It should be noted that restriction maps have many fundamental similarities with sequence data, and algorithms developed for sequence analysis can often be adapted to work with optical maps (e.g. Huang and Waterman, 1992).

Validation: Due to the nature of optical mapping data, it is rarely possible to know the true answer except in very special circumstances. It is therefore natural to use simulation to validate algorithmic techniques.
While this has been implicitly acknowledged in much of the algorithmic work on optical mapping, we think that the stochastic model used in simulation itself deserves closer attention. With the large data sets that are now available, we can also hope to use the data to validate models, at least in some limited ways. In particular, we have found graphical diagnostics to be especially useful in model checking (see Section 2.3), which is not surprising since well designed graphs can usually convey complex information more effectively than numerical summaries.

1.3.4 Algorithms

Problems in optical mapping are often approached indirectly by trying to answer simpler, more specific ones. This is not uncommon in computational biology, where the complexity of a problem may make a holistic solution difficult. Two algorithmic questions that play a recurrent role in many of these approaches are alignment and assembly. Each addresses a particular problem; however, it is often more useful to think of these as tools rather than solutions. Here, we give an overview of these two fundamental computational tasks.

Alignment

The problem of alignment is to detect association or overlap between two or more restriction maps. Such association is measured by a score function which assigns a numerical measure of goodness to any potential alignment. Different score functions may of course be used, and much rests on choosing a suitable one. Waterman et al. (1984) presented a score function for restriction map comparison, which was subsequently extended by Huang and Waterman (1992). Valouev et al. (2006) have developed score functions for the comparison problem specifically in the context of optical mapping. These score functions have been derived as model-based likelihood ratio test statistics, although this is not strictly necessary (Appendix A).

Given a suitable score function, dynamic programming is used to efficiently search for optimal alignments. In the context of alignment against a reference, for example, every individual optical map must be scored across the genome. Alignment algorithms for nucleotide sequence data, such as the Needleman-Wunsch and Smith-Waterman algorithms, can be adapted to work with restriction maps. Certain modifications are required to enable such use; these are described by Valouev et al. (2006).
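
To make the dynamic programming idea concrete, the following is a minimal sketch of aligning one optical map against a reference map, with both maps given as lists of fragment lengths. It is an illustration under stated assumptions and not the method of Valouev et al. (2006): the function names `chunk_score` and `fit_alignment`, the toy score (a fixed reward, a squared sizing penalty and a per-site penalty) and all parameter values are hypothetical, chosen only to show how the recursion can absorb missing or spurious cut sites. A likelihood ratio score of the kind discussed above would take the place of `chunk_score`.

```python
import math
from itertools import accumulate


def chunk_score(q_len, r_len, q_skipped, r_skipped,
                sd=2.0, site_penalty=3.0, match_reward=5.0):
    """Toy score (an assumption, not a likelihood ratio) for matching one
    stretch of the optical map to one stretch of the reference."""
    sizing = ((q_len - r_len) / sd) ** 2          # penalise length mismatch
    unmatched = q_skipped + r_skipped             # cut sites left unpaired
    return match_reward - sizing - site_penalty * unmatched


def fit_alignment(query, reference, max_merge=3):
    """Align the whole query map somewhere along the reference map.

    Cut sites are indexed 0..len(map); S[i][j] is the best score of an
    alignment that ends by pairing query site i with reference site j.
    Up to `max_merge` consecutive fragments may be merged on either side,
    which models missing cuts (query) and false cuts (query or reference).
    """
    q_pos = [0.0] + list(accumulate(query))       # cut-site coordinates
    r_pos = [0.0] + list(accumulate(reference))
    m, n = len(query), len(reference)

    S = [[-math.inf] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        S[0][j] = 0.0                             # free start on the reference

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            for g in range(max(0, i - max_merge), i):
                for h in range(max(0, j - max_merge), j):
                    if S[g][h] == -math.inf:
                        continue
                    cand = S[g][h] + chunk_score(
                        q_pos[i] - q_pos[g], r_pos[j] - r_pos[h],
                        i - g - 1, j - h - 1)
                    if cand > S[i][j]:
                        S[i][j] = cand
                        back[i][j] = (g, h)

    # Free end on the reference: take the best column in the last row.
    best_j = max(range(n + 1), key=lambda j: S[m][j])

    # Trace back the paired cut sites.
    path, i, j = [], m, best_j
    while back[i][j] is not None:
        path.append((i, j))
        i, j = back[i][j]
    path.append((i, j))
    path.reverse()
    return S[m][best_j], path


# Made-up fragment lengths (in kb) for illustration only.
reference = [10, 20, 15, 30, 25, 12]
score, path = fit_alignment([45, 25], reference)
```

In this made-up example the first query fragment (45) spans the reference fragments 15 and 30, as it would if the optical map had missed one cut. With the toy parameters above, the best alignment has score 7.0 and pairs cut sites `[(0, 2), (1, 4), (2, 5)]`, so reference site 3 is left unmatched. Scoring every optical map across the genome in this way costs time proportional to the product of the two map lengths (times `max_merge` squared), which is one reason scalability becomes a concern for large genomes.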
