On the Analysis of Optical Mapping Data - University of Wisconsin ...
Baum-Welch updates: The parameter estimation step of the HMM needs some calculations
specific to the negative binomial model. To state the results, we first need some
notation, which roughly follows Durbin et al. (1998). The data are assumed to be a sequence
$(x_i)_{i=1}^{L}$ of observed hits in $L$ successive intervals along the genome, with the corresponding
sequence of random variables denoted by $(X_i)_{i=1}^{L}$. In practice, the data will be in
the form of several sequences (e.g. one for every chromosome). The derivations below
generalize trivially to this situation, provided we assume that all the sequences are generated
by the same model. Each observation $X_i$ has an associated hidden state $\Pi_i$. We will often
abbreviate $(x_i)_{i=1}^{L}$ by $x$, $(X_i)_{i=1}^{L}$ by $X$ and $(\Pi_i)_{i=1}^{L}$ by $\Pi$. The unobserved sequence
$(\Pi_i)_{i=1}^{L}$ is a time-homogeneous stationary Markov process with a finite state space
$S = \{1, 2, \ldots, K\}$.
The distribution of $X_i$ is entirely determined by $\Pi_i$. This distribution is defined by the
emission probabilities
$$e_k(b) = P(X_i = b \mid \Pi_i = k) \tag{B.1}$$
The evolution of the process $(\Pi_i)_{i=1}^{L}$ is governed by the transition probability matrix $P$,
which has entries
$$a_{k,l} = P(\Pi_{i+1} = l \mid \Pi_i = k) \tag{B.2}$$
and stationary distribution $\pi_0$. The parameters in the model consist of the transition probabilities
$a = ((a_{k,l}))$ along with any parameters involved in the emission distribution, which
we denote by $\eta$. Collectively, the parameters are denoted by $\theta = (a, \eta)$. Estimation of the
parameters, often referred to as ‘training’, can be accomplished by using the Baum-Welch
algorithm, which can also be stated in terms of the more familiar EM algorithm. It is an
iterative procedure generally described in terms of observed data, missing data and parameters.
In our case, the observed data are $x$ and the missing data are the hidden states $\Pi$.
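The model ingredients defined so far, namely the emission probabilities (B.1), the transition probabilities (B.2), the stationary distribution and the joint probability of data and hidden states, can be illustrated with a minimal Python sketch. It is not the implementation used in this work. The negative binomial parameterization (`size`, `prob`), the function names, the zero-based state labels and the numerical values are all assumptions made for illustration.

```python
import math


def neg_binomial_pmf(b, size, prob):
    """P(X = b) for a negative binomial distribution.

    Assumed parameterization (not taken from the text): the number of
    failures before the `size`-th success, extended to real `size` > 0.
    """
    log_pmf = (math.lgamma(b + size) - math.lgamma(size) - math.lgamma(b + 1)
               + size * math.log(prob) + b * math.log1p(-prob))
    return math.exp(log_pmf)


def emission(k, b, eta):
    """e_k(b) = P(X_i = b | Pi_i = k) as in (B.1); eta[k] = (size, prob)."""
    size, prob = eta[k]
    return neg_binomial_pmf(b, size, prob)


def stationary_distribution(a, iterations=1000):
    """Approximate pi_0 satisfying pi_0 = pi_0 a by repeated multiplication."""
    K = len(a)
    pi = [1.0 / K] * K
    for _ in range(iterations):
        pi = [sum(pi[k] * a[k][l] for k in range(K)) for l in range(K)]
    return pi


def log_joint(x, states, a, eta):
    """log P(X = x, Pi = states | theta) with theta = (a, eta).

    The chain is started from the stationary distribution of `a`.
    """
    pi0 = stationary_distribution(a)
    total = math.log(pi0[states[0]]) + math.log(emission(states[0], x[0], eta))
    for i in range(1, len(x)):
        total += math.log(a[states[i - 1]][states[i]])
        total += math.log(emission(states[i], x[i], eta))
    return total


# Hypothetical two-state example; states are labelled 0..K-1 instead of 1..K.
a = [[0.9, 0.1], [0.2, 0.8]]
eta = [(2.0, 0.5), (5.0, 0.3)]
pi0 = stationary_distribution(a)
```

Working on the log scale in `log_joint` avoids numerical underflow for long sequences, which is the usual reason for preferring sums of logarithms over products of probabilities.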
Given a current estimate for $\theta$, say $\theta_t$, the E-step involves computing the function
$$Q(\theta \mid \theta_t) = \sum_{\Pi} P(\Pi \mid X = x, \theta_t) \log P(X = x, \Pi \mid \theta) \tag{B.3}$$
and the M-step involves obtaining the next iterate
$$\theta_{t+1} = \arg\max_{\theta} Q(\theta \mid \theta_t) \tag{B.4}$$
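To make the E-step and M-step concrete, the following Python sketch evaluates $Q(\theta \mid \theta_t)$ by brute-force summation over all $K^L$ hidden paths, which is feasible only for very short sequences, and then performs a restricted M-step by a grid search over a single transition probability. It is an illustration under stated assumptions, not the Baum-Welch updates themselves. The negative binomial parameterization, the helper names, the toy data and the choice to pass the initial distribution explicitly (rather than derive it from $a$) are all hypothetical.

```python
import itertools
import math


def nb_log_pmf(b, size, prob):
    """Log negative binomial pmf; the (size, prob) parameterization is assumed."""
    return (math.lgamma(b + size) - math.lgamma(size) - math.lgamma(b + 1)
            + size * math.log(prob) + b * math.log1p(-prob))


def log_joint(x, path, theta):
    """log P(X = x, Pi = path | theta), where theta = (pi0, a, eta)."""
    pi0, a, eta = theta
    total = math.log(pi0[path[0]]) + nb_log_pmf(x[0], *eta[path[0]])
    for i in range(1, len(x)):
        total += math.log(a[path[i - 1]][path[i]]) + nb_log_pmf(x[i], *eta[path[i]])
    return total


def log_likelihood(x, theta):
    """log P(X = x | theta), summing the joint over every hidden path."""
    K = len(theta[1])
    paths = itertools.product(range(K), repeat=len(x))
    return math.log(sum(math.exp(log_joint(x, p, theta)) for p in paths))


def q_function(x, theta, theta_t):
    """Q(theta | theta_t) as in (B.3), by enumeration of all K**L paths."""
    K = len(theta_t[1])
    log_px = log_likelihood(x, theta_t)
    q = 0.0
    for path in itertools.product(range(K), repeat=len(x)):
        posterior = math.exp(log_joint(x, path, theta_t) - log_px)
        q += posterior * log_joint(x, path, theta)
    return q


def with_self_transition(theta, p):
    """Copy of theta with a[0][0] = p and a[0][1] = 1 - p (two states only)."""
    pi0, a, eta = theta
    return (pi0, [[p, 1.0 - p], a[1]], eta)


# Toy data and starting values, chosen purely for illustration.
x = [0, 1, 0, 9, 12, 7, 1, 0]
theta_t = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [(2.0, 0.6), (5.0, 0.3)])

# Restricted M-step: maximize Q over a grid that contains the current value 0.5.
grid = [i / 20 for i in range(1, 20)]
best_p = max(grid, key=lambda p: q_function(x, with_self_transition(theta_t, p), theta_t))
theta_next = with_self_transition(theta_t, best_p)
```

Because the grid contains the current value, `theta_next` cannot have a lower $Q$ than `theta_t`, and by the standard EM argument `log_likelihood(x, theta_next)` is then no smaller than `log_likelihood(x, theta_t)`. In practice the sum over paths is replaced by the forward and backward recursions, since enumeration grows as $K^L$.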