29.07.2014 Views

On the Analysis of Optical Mapping Data - University of Wisconsin ...




especially a short noisy one, to originate from somewhere in the reference but have its optimal alignment somewhere else. The null hypothesis of independence is not true in such a case, yet we would not want to declare the optimal alignment significant. Thus, it may be reasonable to define the best spurious score of M against G̃ as the maximum score among alignments that are not the true alignment. This is of course not observable, since we have no way of knowing the true alignment, or even whether it exists at all. There are other problems with this definition; e.g. what makes an alignment sufficiently different from the true alignment? Should alignments to incorrect but homologous regions be considered spurious? By formulating the problem as a test of independence, these issues are avoided.

Other methods: Valouev et al. (2006) suggest an approach to determine significance that is similar to ours in principle, but is completely model-based. They postulate that the fragment lengths in the reference genome G̃ are i.i.d. exponential variates, and describe a conditional model for optical maps given the reference. These are then used to formally derive the marginal distribution of optical maps, which reduces to an i.i.d. exponential distribution for the optical map fragment lengths, but with a different rate. Cutoffs are obtained by simulating both reference and optical maps under the null hypothesis of independence. This is a perfectly valid approach, but may be sensitive to parameter estimates as well as model misspecification, which is a legitimate concern since their conditional model excludes certain known sources of noise, namely desorption and scaling (see Chapter 2). Our conditional non-parametric approach bypasses these concerns.
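To make the simulation-based idea concrete, the following is a minimal sketch of how a cutoff could be obtained by simulating both reference and optical maps as independent i.i.d. exponential fragment lengths. It is not the procedure of Valouev et al. (2006): the function names, the rates, the number of simulations and, in particular, `toy_score` (an ungapped sum of absolute length differences) are all assumptions made for illustration, standing in for a real alignment score.

```python
import random


def simulate_fragments(n, rate, rng):
    """Draw n i.i.d. exponential fragment lengths with the given rate."""
    return [rng.expovariate(rate) for _ in range(n)]


def toy_score(map_frags, ref_frags):
    """Hypothetical stand-in for an alignment score (NOT a real optical
    map score): the best, over all ungapped placements of the map on the
    reference, of minus the summed absolute length differences."""
    m = len(map_frags)
    best = float("-inf")
    for start in range(len(ref_frags) - m + 1):
        window = ref_frags[start:start + m]
        s = -sum(abs(a - b) for a, b in zip(map_frags, window))
        best = max(best, s)
    return best


def null_cutoff(n_map, n_ref, map_rate, ref_rate,
                n_sim=200, alpha=0.05, seed=0):
    """Approximate upper-alpha cutoff of the optimal score when the map
    and the reference are simulated independently (the null hypothesis)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_sim):
        ref = simulate_fragments(n_ref, ref_rate, rng)
        om = simulate_fragments(n_map, map_rate, rng)
        scores.append(toy_score(om, ref))
    scores.sort()
    k = min(n_sim - 1, int((1 - alpha) * n_sim))
    return scores[k]


if __name__ == "__main__":
    # Assumed illustrative values: 10-fragment maps, 500-fragment reference.
    print(null_cutoff(n_map=10, n_ref=500, map_rate=1.0, ref_rate=1.2))
```

An observed optimal score above the returned cutoff would be declared significant at roughly level alpha under this toy model. The sensitivity noted in the text shows up directly here, since the cutoff depends on the assumed rates and on the exponential form of the fragment lengths.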

Direct approach vs regression: Estimating the mean spurious score µ(M) separately for each map is usually feasible and more powerful than regression. However, for alignments involving only part of an optical map, a cutoff based on the full map is not appropriate. This is a concern particularly for overlap matches, where alignments overhanging at the boundary of the reference map are allowed. The regression approach can still be used in such cases by considering only the aligned portion of the map. The regression on N and L as used above is of course not the only possible model, but Table 3.1 suggests that it explains most of the
