On the Analysis of Optical Mapping Data - University of Wisconsin ...
is rejected at level $\alpha$ if $S > c_\alpha$, where $F_0(c_\alpha \mid M) = 1 - \alpha$. The advantage of this formulation
is that given a choice of $P_G$, we can in principle simulate from $F_0(\cdot \mid M)$ to obtain a suitable
cutoff, without requiring any probabilistic model for the optical map $M$. An effective choice
of $P_G$ is given by random permutations of the reference $\tilde{G}$. This preserves characteristics of
the reference that are known to affect the spurious score distribution, namely the number
and lengths of fragments. Permuting the order of fragments is also reasonable given the
additive nature of score functions, which essentially reward matches in order. Formally, if we
assume that the fragment lengths defining $G$ are i.i.d. from some distribution in a family $\mathcal{F}$,
permutation can be viewed as sampling from $P_G$ conditional on the set of fragment lengths in
$\tilde{G}$, which is sufficient for $\mathcal{F}$. Such tests are often called permutation tests (Cox and Hinkley,
1979, Chapter 6). See Figure 3.1 for a graphical justification of the i.i.d. assumption.
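The procedure described above can be illustrated with a minimal Python sketch. It is not the implementation used for the results that follow: `toy_best_score` is a hypothetical stand-in for a real alignment score (it is not the SOMA score), and the function names, the default number of permutations, and the empirical-quantile rule are all assumptions made for illustration.

```python
import math
import random


def toy_best_score(optical_map, reference):
    """Hypothetical stand-in for an alignment score (NOT the SOMA score).

    Slides the map along the reference without gaps and returns the best
    score, where a placement is scored as the negative sum of absolute
    differences between paired fragment lengths (0 is a perfect match).
    """
    m, n = len(optical_map), len(reference)
    if m > n:
        raise ValueError("map has more fragments than the reference")
    best = float("-inf")
    for start in range(n - m + 1):
        window = reference[start:start + m]
        score = -sum(abs(a - b) for a, b in zip(optical_map, window))
        best = max(best, score)
    return best


def permutation_cutoff(optical_map, reference, alpha=0.05, n_perm=200, seed=0):
    """Approximate the cutoff c_alpha for one map by permuting the reference.

    Each permutation keeps the number and lengths of reference fragments
    and changes only their order. Returns the empirical (1 - alpha)
    quantile of the best scores, together with the sorted null scores.
    """
    rng = random.Random(seed)
    null_scores = []
    for _ in range(n_perm):
        permuted = list(reference)  # copy, so the input is not modified
        rng.shuffle(permuted)
        null_scores.append(toy_best_score(optical_map, permuted))
    null_scores.sort()
    k = min(n_perm - 1, max(0, math.ceil((1 - alpha) * n_perm) - 1))
    return null_scores[k], null_scores


def is_significant(optical_map, reference, alpha=0.05, n_perm=200, seed=0):
    """Reject the null at level alpha if the observed best score exceeds c_alpha."""
    cutoff, _ = permutation_cutoff(optical_map, reference, alpha, n_perm, seed)
    return toy_best_score(optical_map, reference) > cutoff
```

Because the cutoff is computed separately for each map, no probabilistic model for the map itself is needed in this sketch; only the reference is randomized.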
3.3 Results<br />
3.3.1 Exploration<br />
We use optical map data from GM07535, a diploid normal human lymphoblastoid cell line,
for illustration. The data consists of 206796 optical maps longer than 300 Kb. These maps
are aligned against an in silico reference map derived from Build 35 of the human genome
sequence (International Human Genome Sequencing Consortium, 2004), with sequence gaps
replaced by their estimated length. We use a score function implemented in the SOMA
software suite with parameters that have been extensively used with optical map data. The
actual score function, henceforth referred to as the SOMA score, is described in Appendix
A. In addition to the best alignment scores against the in silico reference, we consider best
scores for each map against several independent random permutations of the reference. The
permutations are done separately for every chromosome, thus retaining the total length and
number of fragments within each. For the most part, we restrict our attention to ungapped
global alignments.
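The per-chromosome permutation scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce the results: the reference is assumed to be a dictionary mapping a chromosome name to its ordered list of fragment lengths, and the function names are hypothetical.

```python
import random


def permute_within_chromosomes(reference, rng):
    """Shuffle fragment order separately within each chromosome.

    `reference` maps a chromosome name to an ordered list of fragment
    lengths (an assumed representation). Fragments never move between
    chromosomes, so each chromosome keeps its total length and its
    number of fragments.
    """
    return {
        chrom: rng.sample(fragments, len(fragments))  # shuffled copy
        for chrom, fragments in reference.items()
    }


def independent_permutations(reference, n_perm, seed=0):
    """Draw several independent permuted references from one seeded generator."""
    rng = random.Random(seed)
    return [permute_within_chromosomes(reference, rng) for _ in range(n_perm)]
```

For example, `independent_permutations({"chr1": [12000, 4500, 30100], "chr2": [8000, 15200]}, n_perm=3)` returns three permuted references, each with three fragments on the hypothetical `chr1` and two on `chr2`.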
In theory, we can approximate the conditional null distribution $F_0(\cdot \mid M)$ by sampling from
it an arbitrary number of times. In practice, each such sample involves a permutation of