27.01.2014 Views

PDF 2

PDF 2

PDF 2

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

RNA secondary structure<br />

prediction methods<br />

• Covariation using mutual information<br />

– requires reliable multiple sequence/structure<br />

alignment of many RNAs<br />

• Stochastic context free grammars (generalization<br />

of HMMs)<br />

• Energy minimization methods<br />

FOCUS of talk on energy minimization, but first<br />

briefly mention other two methods.<br />

Peter Clote<br />

July 2007


Stochastic context free grammars<br />

Context free grammar for<br />

L= {well-balanced parenthesis<br />

expressions}<br />

consists of rules<br />

S λ | (S) | SS<br />

Key notions:<br />

λ is empty word<br />

Terminal symbols: ‘(‘, ‘)’<br />

Variables: S<br />

Peter Clote<br />

July 2007


Derivation<br />

S (S) (SS) ((S)S) (( )S)<br />

(( )(S)) (( )( ))<br />

Leftmost derivation (apply rule to leftmost<br />

occurring variable in each step)<br />

One-one association between leftmost<br />

derivations and parse trees<br />

Peter Clote<br />

July 2007


Context free grammar for RNA<br />

consists of following rules (here,<br />

balanced parentheses correspond to<br />

Watson-Crick or GU base pairs).<br />

S λ | AS | CS | GS | US | ASU |<br />

USA | CSG | GSC | GSU | USG |<br />

SS<br />

Peter Clote<br />

July 2007


• Branching given by rule S SS<br />

• Watson-Crick base pairings given by rules<br />

S ASU | GSC | etc.<br />

• Unpaired bases (e.g. in hairpin loop) given<br />

by rules<br />

S AS | GS | etc.<br />

• Stochastic grammar has probabilities<br />

associated with rule applications<br />

Peter Clote<br />

July 2007


http://www.genetics.wustl.edu/eddy/tRNAscan-SE/<br />

Peter Clote<br />

July 2007


Secondary structure<br />

ACGUACGUACGU<br />

((((....))))<br />

RNA secondary structure is an<br />

outerplanar graph, or equivalently, a<br />

balanced parenthesis expression (only<br />

one kind of parenthesis).<br />

Peter Clote<br />

July 2007


Secondary structure definition<br />

Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Ding and Lawrence, A statistical sampling algorithm for {RNA} secondary structure<br />

prediction, Nucleic Acids Res., 31(24):7280--7301 (2003)<br />

Peter Clote<br />

July 2007


Type H pseudoknot, taken from Dirks, Pierce, “A partition function<br />

algorithm for nucleic acid secondary structure including pseudoknots<br />

J Comput Chem, 24(13):1664-1677, 2003<br />

Peter Clote<br />

July 2007


PSEUDOBASE collection of classified pseudoknots<br />

http://wwwbio.leidenuniv.nl/~batenburg/pkb.html<br />

Peter Clote<br />

July 2007


Example of pseudoknot causing frameshift, taken from<br />

Von Batenburg’s PSEUDOBASE<br />

http://wwwbio.leidenuniv.nl/~batenburg/PKBase/PKB00240.html<br />

Peter Clote<br />

July 2007


Counting number of<br />

secondary structures<br />

Peter Clote<br />

July 2007


Counting number of secondary<br />

structures<br />

• Secondary structures are balanced parenthesis<br />

expressions with dots<br />

• Catalan numbers give the number of balanced<br />

parenthesis expressions (without dots)<br />

Peter Clote<br />

July 2007


Catalan number c(n-1) is the number of<br />

balanced parenthesis expressions of length n.<br />

Initial Goal: Count number of secondary<br />

structures (Motzkin numbers)<br />

Peter Clote<br />

July 2007


Case 1: n+1 not base paired<br />

Case 2: n+1 base paired with intermediate j<br />

Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


General treatment of cases<br />

Case 1: n+1 not base paired<br />

Case 2: n+1 base paired with i<br />

Case 3: n+1 base paired with intermediate j<br />

Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Asymptotics for Motzkin numbers<br />

• Is there a closed formula for the asymptotic<br />

number S(n) of secondary structures, where any<br />

base can pair with any base, and where the<br />

threshold for hairpin loops is 1?<br />

• Two methods of solution:<br />

– Viennot, Vauchaussade de Chaumont, “Enumeration<br />

of RNA secondary structures by complexity”, Lecture<br />

Notes in Biomaths. 57 (1985) 360-365. Uses generating<br />

functions and influenced Nebel’s later work.<br />

– Stein, Waterman: applied Bender’s theorem<br />

Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Bender’s theorem for asymptotics<br />

Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


• Asymptotic (exponential) formula for the number<br />

S(n) of secondary structures is STARTING<br />

POINT for NEUTRAL NETWORKS (Schuster,<br />

Fontana, Stadler, Hofacker, et al.)<br />

Sequence Space {A,C,G,U} n<br />

(Genotype)<br />

Shape Space {‘(‘,’)’,’.’} n<br />

(Phenotype)<br />

Peter Clote<br />

July 2007


• What is the number of secondary structures for a<br />

given RNA nucleotide sequence, where base pairs<br />

are either Watson-Crick or GU wobble pairs, and<br />

hairpin loops have at least 3 (or more generally θ)<br />

unpaired bases?<br />

Peter Clote<br />

July 2007


From combinatorics to<br />

secondary structure energy<br />

minimization algorithms<br />

Peter Clote<br />

July 2007


Key idea<br />

• Absence of pseudoknots in secondary structures<br />

allows determination of optimal secondary<br />

structure for s i ,…,s j to be made independently of<br />

s k ,…,s l whenever i


• Pascal’s triangle is a dynamic programming<br />

computation of binomial coefficients:<br />

C(n,k) = C(n-1,k)+C(n-1,k-1)<br />

1<br />

11<br />

121<br />

1331<br />

14641<br />

Peter Clote<br />

July 2007


Idea of Nussinov-Jacobson: add the contribution<br />

due to each base pair, ignoring stabilizing energies for<br />

stacked base pairs or destabilizing energies of loops<br />

(unpaired) regions.<br />

Peter Clote<br />

July 2007


• M i,j equals the maximum number of base pairs in<br />

s i ,…,s j<br />

• M j,i equals index r such that if r,j base pair then<br />

maximum number of base pairs is realized in<br />

s i ,…,s j<br />

• M j,i used for linear time traceback<br />

Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


Dynamic programming order of filling in matrix:<br />

left to right along principal diagonal, then successively<br />

fill off-diagonals proceeding outwards.<br />

Peter Clote<br />

July 2007


Compute energy matrix M<br />

Cubic time FILL algorithm<br />

Upper triangular part of M contains energies<br />

Lower triangular part of M contains crumbs<br />

for traceback


acktrack(i,j)<br />

Linear time backtracking algorithm<br />

after the energy matrix is filled


Zuker-Stiegler modification of<br />

Nussinov-Jacobson<br />

1) An isolated base pair contributes no energy by itself;<br />

instead consider experimentally measured stacking<br />

energies<br />

5’-XA-3’<br />

3’-YB-5’<br />

2) Consider experimentally measured loop energies for<br />

hairpin,bulge and internal loops (not multiloops).<br />

3) Affine approximation for multiloop energy<br />

a+bI+cU<br />

for I internal base pairs and U unpaired bases in multiloop<br />

having external pair (i,j), i.e. within i+1,…,j-1.<br />

Peter Clote<br />

July 2007


mfold base stacking energies at 37<br />

degrees Celsius<br />

Peter Clote<br />

July 2007


Nearest neighbor model (stacked<br />

base pairs)<br />

See “Thermodynamic parameters for an expanded nearest-neighbor model for<br />

Formation of RNA duplexes with Watson-Crick base pairs”, Xia et al.<br />

Biochemistry 37:14719-14735 (1998)<br />

Peter Clote<br />

July 2007


See “Thermodynamic parameters for an expanded nearest-neighbor model for<br />

Formation of RNA duplexes with Watson-Crick base pairs”, Xia et al.<br />

Biochemistry 37:14719-14735 (1998)<br />

Peter Clote<br />

July 2007


See “Thermodynamic parameters for an expanded nearest-neighbor model for<br />

Formation of RNA duplexes with Watson-Crick base pairs”, Xia et al.<br />

Biochemistry 37:14719-14735 (1998)<br />

Peter Clote<br />

July 2007


Loop energies from Vienna RNA<br />

package v1.4<br />

PUBLIC int hairpin37[31] = {<br />

INF, INF, INF, 570, 560, 560, 540, 590, 560, 640, 650,<br />

660, 670, 678, 686, 694, 701, 707, 713, 719, 725,<br />

730, 735, 740, 744, 749, 753, 757, 761, 765, 769};<br />

PUBLIC int bulge37[31] = {<br />

INF, 380, 280, 320, 360, 400, 440, 459, 470, 480, 490,<br />

500, 510, 519, 527, 534, 541, 548, 554, 560, 565,<br />

571, 576, 580, 585, 589, 594, 598, 602, 605, 609};<br />

PUBLIC int internal_loop37[31] = {<br />

INF, INF, 410, 510, 170, 180, 200, 220, 230, 240, 250,<br />

260, 270, 278, 286, 294, 301, 307, 313, 319, 325,<br />

330, 335, 340, 345, 349, 353, 357, 361, 365, 369};<br />

Peter Clote<br />

July 2007


Classify loops by number of base pairs visible from<br />

Closing pair (i,j).<br />

Hairpin loop: 0-loop<br />

Peter Clote<br />

July 2007


Helix (stacked base pairs),<br />

Bulge, interior loop: 1-loops<br />

Peter Clote<br />

July 2007


Ding and Lawrence, A statistical sampling algorithm for {RNA} secondary structure<br />

prediction, Nucleic Acids Res., 31(24):7280--7301 (2003)<br />

Peter Clote<br />

July 2007


Peter Clote<br />

July 2007


• Waterman introduced the notion of order of<br />

multiloop, which yields an inefficient algorithm<br />

• Nevertheless, this classification is interesting.<br />

Peter Clote<br />

July 2007


Lyngsø and Pederson, RNA pseudoknot prediction in energybased models,<br />

J. Comput. Biol., 7:409-427 (2000)<br />

Peter Clote<br />

July 2007


Suboptimal secondary<br />

structures<br />

Zuker<br />

“On finding all suboptimal foldings of an RNA<br />

molecule”, Zuker, Science 244(4900):48-52<br />

(1989)


Lyngsø and Pederson, RNA pseudoknot prediction in energybased models,<br />

J. Comput. Biol., 7:409-427 (2000)<br />

Peter Clote<br />

July 2007


match. i-match<br />

• “An RNA folding method capable of identifying<br />

pseudoknots and base triples”, by Tabaska, Cary,<br />

Gabow and Stormo Bioinformatics 14(8) 1998.<br />

• BASIC IDEA: Given RNA sequence of length n,<br />

apply maximum weight matching to complete,<br />

weighted (undirected) graph G = (V,E)<br />

– V = {1,2,…,n}<br />

– E = all possible Watson-Crick and GU pair positions<br />

{i,j}<br />

– weight edges by mutual information using reliable<br />

sequence-structure alignment of many orthologous<br />

RNAs.<br />

Peter Clote<br />

July 2007


M<br />

i<br />

= ! , j fxi<br />

x log j 2(<br />

fxi,<br />

x / f j xf<br />

i x )<br />

,<br />

j<br />

x<br />

i<br />

, x<br />

j<br />

Cary-Stormo in ISMB’95 weighted edges<br />

{i,j} by mutual information, and applied Ed<br />

Rothberg’s implementation wmatch of<br />

Gabow’s O(n 3 ) maximum weight matching<br />

algorithm.<br />

Peter Clote<br />

July 2007


i-match algorithm<br />

Pseudoknots can be detected using Cary-Stormo algorithm.<br />

“An RNA folding method capable of identifying pseudoknots<br />

and base triples”, Tabaska et al., Bioinformatics 14 (8) 1998<br />

Peter Clote<br />

July 2007


i-match algorithm contd.<br />

“An RNA folding method capable of identifying pseudoknots<br />

and base triples”, Tabaska et al., Bioinformatics 14 (8) 1998<br />

Peter Clote<br />

July 2007


Handling base triples with bmatch<br />

“An RNA folding method capable of identifying pseudoknots<br />

and base triples”, Tabaska et al., Bioinformatics 14 (8) 1998<br />

Peter Clote<br />

July 2007


Handling base triples, contd.<br />

“An RNA folding method capable of identifying pseudoknots<br />

and base triples”, Tabaska et al., Bioinformatics 14 (8) 1998<br />

Peter Clote<br />

July 2007


Handling base triples, contd.<br />

“An RNA folding method capable of identifying pseudoknots<br />

and base triples”, Tabaska et al., Bioinformatics 14 (8) 1998<br />

Peter Clote<br />

July 2007


Raw output of i-match on tRNA<br />

“An RNA folding method capable of identifying pseudoknots<br />

and base triples”, Tabaska et al., Bioinformatics 14 (8) 1998<br />

Peter Clote<br />

July 2007


tRMA after i-match output<br />

filtration<br />

“An RNA folding method capable of identifying pseudoknots<br />

and base triples”, Tabaska et al., Bioinformatics 14 (8) 1998<br />

Peter Clote<br />

July 2007


match<br />

“An RNA folding method capable of identifying pseudoknots<br />

and base triples”, Tabaska et al., Bioinformatics 14 (8) 1998<br />

Peter Clote<br />

July 2007

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!