21.01.2013 Views

note - FIZ Karlsruhe

Introduction to similarity searching

Expectation value (-E)

Expectation value (E-value) is the statistical significance threshold for reporting matches against a sequence database. The E-value can be any positive number, and the default value is 10. This means that 10 matches may be expected to be found merely by chance. In general, the E-value is lowered to make the search more precise and raised to retrieve more answers.
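
The effect of lowering or raising the threshold can be illustrated with a minimal Python sketch. It is not part of STN or BLAST: the `Hit` class, the accession strings and the E-values are hypothetical, and the sketch simply assumes that a match is reported when its E-value does not exceed the threshold.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str   # hypothetical record identifier
    e_value: float   # expected number of chance matches scoring this well


def filter_by_e_value(hits, threshold=10.0):
    """Return the hits whose E-value does not exceed the threshold."""
    if threshold <= 0:
        raise ValueError("the E-value threshold must be a positive number")
    return [hit for hit in hits if hit.e_value <= threshold]


# Made-up hits for illustration only.
hits = [Hit("A1", 1e-30), Hit("B2", 0.5), Hit("C3", 8.0), Hit("D4", 25.0)]

default_set = filter_by_e_value(hits)         # default threshold of 10
strict_set = filter_by_e_value(hits, 0.001)   # lowered: more precise
broad_set = filter_by_e_value(hits, 100.0)    # raised: more answers
```

With these made-up values, the strict threshold keeps only the strongest match, while the raised threshold also admits the weakest one.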

Peptide similarity matrices (-M)

For the peptide-based searches SQP and TSQN, the advanced options provide scoring matrices in addition to the default BLOSUM-62. Guidelines from the NCBI regarding the use of scoring matrices for peptide queries of various lengths are shown in the table below.

Query Length    Matrix       Gap costs
>85             BLOSUM-62    (11,1) (BLAST default)
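
A length-based choice of matrix could be sketched in Python as follows. Only the BLOSUM-62 row survives in this copy of the table; the rows for shorter queries are assumptions recalled from the NCBI's published BLAST guidance and should be verified against the NCBI documentation before use. The function name `suggest_matrix` is hypothetical, and the handling of the boundary lengths 50 and 85 is an arbitrary choice.

```python
def suggest_matrix(query_length):
    """Suggest (matrix, (gap existence cost, gap extension cost)) for a peptide query.

    Only the final row (>85, BLOSUM-62) appears in this manual page.
    The other rows are assumptions taken from NCBI guidance as recalled.
    """
    if query_length < 1:
        raise ValueError("query length must be at least 1")
    if query_length < 35:
        return ("PAM-30", (9, 1))       # assumption, verify against NCBI
    if query_length <= 50:
        return ("PAM-70", (10, 1))      # assumption, verify against NCBI
    if query_length <= 85:
        return ("BLOSUM-80", (10, 1))   # assumption, verify against NCBI
    return ("BLOSUM-62", (11, 1))       # BLAST default, as in the table


matrix, gap_costs = suggest_matrix(120)
```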

Decide how many answers to keep

After the BLAST or GETSIM search is completed, a candidate answer set consisting of sequences exceeding a certain similarity threshold (see footnote 1) is collated. A diagram is generated that shows the scores of the retrieved sequences. The x-axis represents the number of sequence record answers with a specific similarity score, and the y-axis is scaled by the corresponding similarity scores. Also provided above the diagram are the Query Self Score, which is the similarity score of the query mapped against itself, and the score of the best answer retrieved in the results set.

After assessing the diagram, the whole candidate answer set, a subset of the answer set, or a subset represented by a minimum percentage of the Query Self Score may be kept. If only a subset is required, this is specified by answering "How many answers would you like to keep?" with either the number of desired records or the appropriate minimum percentage of the Query Self Score, followed by the percent sign (%). The candidate answer set may contain up to 10,000 best scoring candidate sequences.
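
The two kinds of reply can be illustrated with a small Python sketch. It is not the STN implementation: the function `select_answers` and the scores are hypothetical, and the sketch assumes that a plain number keeps that many of the best-scoring records, while a percentage keeps every record scoring at least that fraction of the Query Self Score.

```python
def select_answers(scores, reply, self_score):
    """Return the scores kept for a reply such as "3" or "60%"."""
    ranked = sorted(scores, reverse=True)   # best-scoring candidates first
    reply = reply.strip()
    if reply.endswith("%"):
        # Minimum percentage of the Query Self Score.
        percent = float(reply[:-1])
        cutoff = self_score * percent / 100.0
        return [score for score in ranked if score >= cutoff]
    # Plain number of desired records (assumed to mean the best ones).
    count = int(reply)
    if count < 0:
        raise ValueError("the number of answers cannot be negative")
    return ranked[:count]


# Made-up candidate scores; the query scores 500 against itself.
candidate_scores = [120, 300, 450, 90, 500]
top_three = select_answers(candidate_scores, "3", self_score=500)
at_least_90_percent = select_answers(candidate_scores, "90%", self_score=500)
```

Here the reply "90%" translates into a cutoff score of 450, so only the two best candidates are kept.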

Sort answers by similarity score

The answer set L-number generated after the search (above) contains sequence records sorted by descending database entry date. BLAST and GETSIM sequence records may be re-sorted into descending score order by typing SOR SCORE D at the arrow prompt. Alternatively, BLAST search results may be re-sorted into descending BLAST identity percentage order using SOR IDENT D.

Example

=> SOR SCORE D

PROCESSING COMPLETED FOR L2

L3 153 SOR L2 SCORE D
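
For readers who post-process exported results, the change of order can be mimicked in Python. This is only an analogy to the SOR command, not a description of how STN works internally; the record identifiers, field names and values are hypothetical.

```python
from operator import itemgetter

# Made-up records; ISO dates sort correctly as plain strings.
records = [
    {"id": "R1", "entry_date": "2012-11-03", "score": 212, "ident": 88.0},
    {"id": "R2", "entry_date": "2012-12-15", "score": 540, "ident": 97.5},
    {"id": "R3", "entry_date": "2012-09-21", "score": 365, "ident": 99.0},
]

# Initial order of the answer set: descending database entry date.
by_date = sorted(records, key=itemgetter("entry_date"), reverse=True)

# Analogue of SOR SCORE D: descending similarity score.
by_score = sorted(by_date, key=itemgetter("score"), reverse=True)

# Analogue of SOR IDENT D: descending identity percentage.
by_ident = sorted(by_date, key=itemgetter("ident"), reverse=True)

print([record["id"] for record in by_score])
```

Each sort produces a new list and leaves the earlier one unchanged, much as the SOR command in the example creates a new L-number (L3) from L2.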

1 For BLAST searches the threshold is defined by the "expectation value". For GETSIM searches the threshold is an automatically adjusted function of the query length, query type and database size.

Page 16 | GENESEQ on STN (DGENE) Workshop Manual
