29.07.2014 Views

On the Analysis of Optical Mapping Data - University of Wisconsin ...




maps, where the separation between spurious and real scores is much clearer. Comparison with Figure 3.8 reveals an interesting point: for a fair proportion of real optical maps with high information content, the optimal score against the real reference is more likely to have arisen from the spurious score distribution than from the real one. This could be because the maps are of low quality, but it could also reflect real differences between the reference map and the actual genome. Maps of the latter type are of particular interest, as they may contain novel information about the underlying genome. This fact can be used to develop a filter that yields a smaller subset of maps relatively richer in “interesting” maps. One possible approach to calibrating such a filter is described below. The usefulness of such filters is yet to be explored.

Thinning: Even if all declared alignments were correct, the set of inferred locations would only be a subset of the full set of true shotgun locations, because not all maps are successfully aligned. The probability that a map is successfully aligned depends on the origin of the map, its length, and the errors involved (Figure 3.10). Averaging over the length and error distributions, this probability can be expressed as a location-specific truncation probability. This random truncation can be thought of as a thinning of the true coverage process, which is usually modeled as a homogeneous Poisson process (Lander and Waterman, 1988). A good estimate of the thinning rate is necessary to normalize observed coverage, which can then be used, for example, to study copy number alterations in the underlying genome (Chapter 4).
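The thinning idea can be illustrated with a minimal simulation sketch. Everything here is an assumption made for illustration: the genome length, the Poisson rate, the two-level retention probability `example_keep_prob`, and the function names are hypothetical and are not taken from the thesis. The sketch draws map start locations from a homogeneous Poisson process, keeps each one independently with a location-specific probability, and then divides binned counts by that probability to recover an estimate of the unthinned coverage.

```python
import random


def simulate_poisson_points(rate, length, rng):
    """Homogeneous Poisson process on [0, length), built from exponential gaps."""
    points = []
    x = rng.expovariate(rate)
    while x < length:
        points.append(x)
        x += rng.expovariate(rate)
    return points


def thin(points, keep_prob, rng):
    """Keep each point independently with probability keep_prob(x)."""
    return [x for x in points if rng.random() < keep_prob(x)]


def normalized_counts(points, keep_prob, length, n_bins):
    """Bin counts divided by the retention probability at each bin midpoint."""
    width = length / n_bins
    counts = [0] * n_bins
    for x in points:
        counts[min(int(x / width), n_bins - 1)] += 1
    return [c / keep_prob((i + 0.5) * width) for i, c in enumerate(counts)]


def example_keep_prob(x):
    # Hypothetical retention probability: maps from the second half of the
    # region are assumed to align only half as often as those from the first.
    return 0.8 if x < 500.0 else 0.4


rng = random.Random(0)
true_points = simulate_poisson_points(rate=2.0, length=1000.0, rng=rng)
observed = thin(true_points, example_keep_prob, rng)
estimate = normalized_counts(observed, example_keep_prob, 1000.0, n_bins=2)
```

With these assumed settings, each half of the region has an expected true count of 1000, so both entries of `estimate` should land near that value even though the raw observed counts differ by roughly a factor of two. In practice the retention probability is unknown and has to be estimated, which is what the rest of this section addresses.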

This estimation has traditionally been done by Monte Carlo simulation of noisy maps from a normal reference map, followed by alignment, thus replicating the pipeline that actual optical maps go through. The most time-consuming step in this process is alignment. In view of the discussion above, we may expect to be able to model the probability of a map being aligned as a function of ψ(M). In particular, for the SOMA score and ungapped global alignment, we fit the following logistic regression model to the alignments of simulated maps used in Figure 3.9:

$$P(\text{aligned} \mid M) = \frac{e^{\alpha + \beta \log \psi(M)}}{1 + e^{\alpha + \beta \log \psi(M)}} \tag{3.2}$$
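As a rough illustration of how a model of the form (3.2) might be evaluated and fitted, the sketch below uses plain Python with Newton's method for the two parameters α and β. It is not the fitting procedure used in the thesis: the function names, the synthetic values of log ψ(M), and the "true" coefficients (-4 and 2) are all hypothetical, and the aligned/not-aligned labels are simulated draws rather than outcomes of real alignments.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def prob_aligned(psi, alpha, beta):
    """P(aligned | M) from equation (3.2), given psi = psi(M) > 0."""
    return sigmoid(alpha + beta * math.log(psi))


def fit_logistic(log_psi, aligned, n_iter=25):
    """Fit alpha and beta by Newton's method on the Bernoulli log-likelihood."""
    alpha, beta = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(log_psi, aligned):
            p = sigmoid(alpha + beta * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient with respect to alpha
            g1 += (y - p) * x    # gradient with respect to beta
            h00 += w             # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta


# Synthetic stand-in for the simulated maps (all values are assumptions).
rng = random.Random(1)
true_alpha, true_beta = -4.0, 2.0
log_psi = [rng.uniform(0.0, 4.0) for _ in range(2000)]
aligned = [1 if rng.random() < sigmoid(true_alpha + true_beta * x) else 0
           for x in log_psi]
alpha_hat, beta_hat = fit_logistic(log_psi, aligned)
```

Once α and β have been estimated, `prob_aligned` gives an alignment probability for a map directly from ψ(M), without running the alignment step itself.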
