29.07.2014 Views

On the Analysis of Optical Mapping Data - University of Wisconsin ...




Baum-Welch updates: The parameter estimation step of the HMM needs some calculations specific to the negative binomial model. To state the results, we first need some notation, which roughly follows Durbin et al. (1998). The data are assumed to be a sequence $(x_i)_{i=1}^{L}$ of observed hits in $L$ successive intervals along the genome, with the corresponding sequence of random variables denoted by $(X_i)_{i=1}^{L}$. In practice, the data will be in the form of several sequences (e.g. one for every chromosome). The derivations below generalize trivially to this situation, provided we assume that all the sequences are generated by the same model. Each observation $X_i$ has an associated hidden state $\Pi_i$. We will often abbreviate $(x_i)_{i=1}^{L}$ by $x$, $(X_i)_{i=1}^{L}$ by $X$ and $(\Pi_i)_{i=1}^{L}$ by $\Pi$. The unobserved sequence $(\Pi_i)_{i=1}^{L}$ is a time-homogeneous stationary Markov process with a finite state space $S = \{1, 2, \ldots, K\}$. Given $\Pi_i$, the distribution of $X_i$ is entirely determined. This distribution is defined by the emission probabilities

$$e_k(b) = P(X_i = b \mid \Pi_i = k) \tag{B.1}$$
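
As an illustration of (B.1), the sketch below evaluates negative binomial emission probabilities for two hidden states. The text does not say how the emission parameters are parameterized, so the `(size, prob)` pairs, the dictionary `ETA`, and the function names are assumptions made only for this example.

```python
import math


def neg_binom_pmf(b: int, size: float, prob: float) -> float:
    """P(X = b) for a negative binomial with size > 0 and 0 < prob < 1.

    pmf(b) = Gamma(b + size) / (Gamma(size) * b!) * prob**size * (1 - prob)**b
    This (size, prob) form is an assumed parameterization, not the source's.
    """
    if b < 0:
        return 0.0
    log_pmf = (
        math.lgamma(b + size) - math.lgamma(size) - math.lgamma(b + 1)
        + size * math.log(prob) + b * math.log1p(-prob)
    )
    return math.exp(log_pmf)


# Hypothetical emission parameters: one (size, prob) pair per hidden state k.
ETA = {1: (2.0, 0.5), 2: (5.0, 0.3)}


def emission(k: int, b: int, eta=ETA) -> float:
    """e_k(b) = P(X_i = b | Pi_i = k) under the assumed parameterization."""
    size, prob = eta[k]
    return neg_binom_pmf(b, size, prob)


if __name__ == "__main__":
    for k in ETA:
        print(k, [round(emission(k, b), 4) for b in range(5)])
```

Working on the log scale with `math.lgamma` avoids overflow in the gamma functions when the hit counts are large.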

The evolution of the process $(\Pi_i)_{i=1}^{L}$ is governed by the transition probability matrix $P$, which has entries

$$a_{k,l} = P(\Pi_{i+1} = l \mid \Pi_i = k) \tag{B.2}$$

and stationary distribution $\pi_0$. The parameters in the model consist of the transition probabilities $a = ((a_{k,l}))$ along with any parameters involved in the emission distribution, which we denote by $\eta$. Collectively, the parameters are denoted by $\theta = (a, \eta)$. Estimation of the parameters, often referred to as 'training', can be accomplished by using the Baum-Welch algorithm, which can also be stated in terms of the more familiar EM algorithm. EM is an iterative procedure generally described in terms of observed data, missing data and parameters. In our case, the observed data are $x$ and the missing data are the hidden states $\Pi$.
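
To make the roles of the observed and missing data concrete, it may help to write out the complete-data likelihood that would be available if the hidden path were observed. The factorization below is a sketch that assumes the first state $\Pi_1$ is drawn from the stationary distribution $\pi_0$; this is consistent with the stationarity assumed above, but it is not written out explicitly in the text.

```latex
\begin{align*}
P(X = x, \Pi \mid \theta)
  &= \pi_0(\Pi_1) \prod_{i=1}^{L-1} a_{\Pi_i, \Pi_{i+1}} \prod_{i=1}^{L} e_{\Pi_i}(x_i), \\
\log P(X = x, \Pi \mid \theta)
  &= \log \pi_0(\Pi_1) + \sum_{i=1}^{L-1} \log a_{\Pi_i, \Pi_{i+1}}
     + \sum_{i=1}^{L} \log e_{\Pi_i}(x_i).
\end{align*}
```

On the log scale the transition terms depend only on $a$ and the emission terms only on $\eta$, which is what lets the two groups of parameters be handled separately.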

Given a current estimate for $\theta$, say $\theta_t$, the E-step involves computing the function

$$Q(\theta \mid \theta_t) = \sum_{\Pi} P(\Pi \mid X = x, \theta_t) \log P(X = x, \Pi \mid \theta) \tag{B.3}$$

and the M-step involves obtaining the next iterate

$$\theta_{t+1} = \arg\max_{\theta} Q(\theta \mid \theta_t) \tag{B.4}$$
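
The definitions in (B.3) and (B.4) can be checked numerically on a toy problem. The sketch below evaluates $Q(\theta \mid \theta_t)$ literally, by enumerating all $K^L$ hidden paths, and then takes the arg max over a short list of candidate parameter values as a stand-in for the M-step. Several things here are assumptions made for illustration: the `(size, prob)` negative binomial parameterization of $\eta$, drawing the first state from the stationary distribution, the 0-based state labels, the toy data `x`, and the list `candidates`. It is not the closed-form update derived for the model.

```python
import itertools
import math


def nb_pmf(b, size, prob):
    """Negative binomial pmf; (size, prob) is an assumed form of eta."""
    return math.exp(
        math.lgamma(b + size) - math.lgamma(size) - math.lgamma(b + 1)
        + size * math.log(prob) + b * math.log1p(-prob)
    )


def stationary(a, iters=1000):
    """Approximate the stationary distribution of `a` by power iteration."""
    K = len(a)
    pi = [1.0 / K] * K
    for _ in range(iters):
        pi = [sum(pi[k] * a[k][l] for k in range(K)) for l in range(K)]
    return pi


def all_log_joints(x, theta):
    """Map every hidden path to log P(X = x, Pi = path | theta)."""
    a, eta = theta
    K, pi0 = len(a), stationary(a)
    out = {}
    for path in itertools.product(range(K), repeat=len(x)):
        lp = math.log(pi0[path[0]])  # first state from the stationary law
        for i, (k, b) in enumerate(zip(path, x)):
            lp += math.log(nb_pmf(b, *eta[k]))  # emission term
            if i + 1 < len(path):
                lp += math.log(a[k][path[i + 1]])  # transition term
        out[path] = lp
    return out


def log_evidence(x, theta):
    """log P(X = x | theta), summing the joint over all hidden paths."""
    return math.log(sum(math.exp(lp) for lp in all_log_joints(x, theta).values()))


def q_function(x, theta, theta_t):
    """Q(theta | theta_t) as in (B.3), by brute-force enumeration."""
    lj_t = all_log_joints(x, theta_t)
    lj = all_log_joints(x, theta)
    evidence_t = sum(math.exp(lp) for lp in lj_t.values())
    # Weight each path by its posterior probability under theta_t.
    return sum(math.exp(lj_t[p]) / evidence_t * lj[p] for p in lj_t)


# Toy data and hypothetical parameter values (states labelled 0 and 1).
x = [0, 3, 9, 12, 2]
theta_t = ([[0.9, 0.1], [0.2, 0.8]], [(2.0, 0.5), (5.0, 0.3)])
candidates = [
    theta_t,
    ([[0.8, 0.2], [0.3, 0.7]], [(2.0, 0.4), (6.0, 0.3)]),
    ([[0.95, 0.05], [0.1, 0.9]], [(1.5, 0.5), (5.0, 0.35)]),
]

# Stand-in for (B.4): arg max of Q over the finite candidate list.
theta_next = max(candidates, key=lambda th: q_function(x, th, theta_t))

if __name__ == "__main__":
    print("Q(theta_t | theta_t) =", q_function(x, theta_t, theta_t))
    print("log-likelihood before:", log_evidence(x, theta_t))
    print("log-likelihood after: ", log_evidence(x, theta_next))
```

Because `theta_t` is itself among the candidates, the selected `theta_next` cannot have a lower likelihood than `theta_t`, which is the usual EM monotonicity property. Enumeration costs $K^L$ terms and is only feasible for very short sequences; in practice the posterior quantities needed for $Q$ are obtained with the forward-backward recursions.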
