11.07.2015 Views

Bioinformatics for DNA Sequence Analysis.pdf - Index of


Similarity Searching Using BLAST

Fig. 1.1. Growth of GenBank and DDBJ genetic databases over the past 10 years. The INSDC databases have grown, over the past 10 years, approximately 168-fold in total number of base pairs. While in the past the number of entries in INSDC databases doubled approximately every 2 years, a simple second-order polynomial regression (R² = 0.9995) of the data over the past 10 years indicates that the next redoubling will take a little over 4 years. This graph does not include HTG data.

2. Program Usage

2.1. Database File Formats

One of the largest sources of diversity among DNA databases lies in their file formats. While great efforts have been made to standardize file formats, the various types and purposes of sequence information and annotation entreat customized file types.

2.1.1. FASTA Format

First used with Pearson and Lipman’s FASTA program for sequence comparison (5), the FASTA file format is the simplest of the widely used formats available through the INSDC. It is composed of a definition or description line followed by the sequence. The definition line begins with a greater-than symbol (>) and marks the beginning of each new entry. The information following the greater-than symbol varies according to its source. Generally, an identifier follows (Table 1.1), after which optional description words may be included. If the sequence is retrieved through NCBI’s databases, a GI number precedes the identifier. Though it is recommended that the definition line be no greater than 80 characters, various types and levels of information are often included. The definition line is followed by the DNA sequence itself, in single or multi-line format. Nucleotides are represented by their standard IUB/IUPAC codes, including ambiguity codes (Table 1.2).
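To make the layout described above concrete, the following is a minimal Python sketch of a FASTA reader, not the chapter's own code. It assumes that the first whitespace-delimited token after the greater-than symbol is the identifier and that the rest of the line is the optional description. The names parse_fasta, invalid_symbols, IUPAC_DNA, and the two example records are hypothetical. The symbol set is the commonly used IUB/IUPAC nucleotide alphabet and is not copied from Table 1.2.

```python
import io
from typing import Iterable, Iterator, List, Tuple

# Commonly used IUB/IUPAC nucleotide codes, including ambiguity codes.
# Assumption: this set is illustrative and is not taken from Table 1.2.
IUPAC_DNA = set("ACGTURYSWKMBDHVN")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA entry."""
    identifier = None
    description = ""
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A definition line marks the beginning of a new entry.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            parts = line[1:].strip().split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif identifier is None:
            raise ValueError("sequence data found before any definition line")
        else:
            # Single-line or multi-line sequences are joined into one string.
            chunks.append(line)
    if identifier is not None:
        yield identifier, description, "".join(chunks)


def invalid_symbols(sequence: str) -> List[str]:
    """Return the sorted symbols in sequence that are not IUPAC codes."""
    return sorted(set(sequence.upper()) - IUPAC_DNA)


# Hypothetical two-record input used only for illustration.
EXAMPLE = """>seq1 hypothetical example record
ACGTNNRY
GGCC
>seq2
ACGTXZ
"""

for ident, desc, seq in parse_fasta(io.StringIO(EXAMPLE)):
    print(ident, repr(desc), len(seq), invalid_symbols(seq))
```

The reader streams the input line by line, so it does not need to hold a whole database file in memory. For real work, an established parser such as the one in Biopython is likely a better choice than this sketch.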
