Molecular Systematics of Nematodes - Russian Journal of Nematology
Molecular Systematics of Nematodes
Databases and BLAST
Sergei A. Subbotin
Department of Nematology, University of California, Riverside, USA
Gent University, Gent, Belgium
Center of Parasitology, Russian Academy of Sciences, Moscow, Russia
Databases
One of the most important tasks in molecular biology is comparing sequences you have obtained yourself with all known sequences collected in a database. This procedure is called a homology search. Numerous genetic databases are maintained around the world. Probably the largest nucleic acid databases are:
http://www.embl-heidelberg.de/
http://www.ncbi.nlm.nih.gov/
http://www.nig.ac.jp/
GenBank
NCBI Resources
NCBI (National Center for Biotechnology Information) is a resource for molecular biology information. NCBI creates and maintains public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information. The NCBI site is constantly being updated, and the changes include new databases and tools for data mining.
NCBI offers several searchable literature, molecular and genomic databases and many bioinformatic tools. An up-to-date list of databases and tools can be found on the NCBI Sitemap.
Location: www.ncbi.nlm.nih.gov
GenBank
From 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months.
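As a rough illustration of what an 18-month doubling time implies, the growth factor after t years is 2^(t/1.5). The sketch below assumes idealized, perfectly regular doubling; the function name `growth_factor` is made up for this example, and no actual GenBank counts are used.

```python
def growth_factor(years, doubling_months=18.0):
    """Return how many times larger a quantity becomes after `years`,
    assuming it doubles every `doubling_months` months (an idealization)."""
    return 2.0 ** (years * 12.0 / doubling_months)


if __name__ == "__main__":
    # Print the implied fold increase for a few time spans.
    for years in (1.5, 3, 6, 15):
        print(f"{years:>4} years -> x{growth_factor(years):,.0f}")
```

Under this assumption, fifteen years correspond to ten doublings, that is, roughly a thousandfold increase.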
GenBank
NCBI Sitemap
GenBank
Entrez
Entrez is a retrieval system for searching the major linked NCBI databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others. Entrez categories can be searched using subject, author, or unique identifiers such as accession numbers, as well as phrases, truncated terms, and combined sets. There is also a simple Entrez tutorial.
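Fielded Entrez searches of this kind can also be written as a URL. The sketch below is only an illustration: it assumes NCBI's E-utilities `esearch` endpoint (not mentioned in the slides) and the Entrez field tags `[Author]`, `[Organism]`, and `[Title]`. The helper name `entrez_query_url` and the second example query are hypothetical. The code only builds the URL and sends nothing over the network.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities "esearch" endpoint; the slides describe only the web form.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_query_url(db, terms, retmax=20):
    """Build (but do not send) an Entrez search URL.

    `terms` maps an Entrez field tag (e.g. "Author", "Organism") to a value;
    the fielded terms are combined with AND.
    """
    term = " AND ".join(f"{value}[{field}]" for field, value in terms.items())
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


if __name__ == "__main__":
    # Author search in PubMed.
    print(entrez_query_url("pubmed", {"Author": "Perry RN"}))
    # Hypothetical combined query against the nucleotide database.
    print(entrez_query_url("nucleotide", {"Organism": "Heterodera", "Title": "ITS"}))
```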
GenBank
PubMed: Allows searching by author names and journal titles, and offers a new Preview/Index option. The PubMed database provides access to over 12 million MEDLINE citations going back to the mid-1960s. It includes History and Clipboard options, which may enhance your search session. NCBI provides a simple PubMed tutorial.
PubMed
Search: Perry RN
GenBank
Nucleotide Database: The nucleotide database contains sequence data from GenBank, EMBL, and DDBJ, the members of the tripartite international collaboration of sequence databases. It allows the user to retrieve nucleotide sequences in both GenBank and FASTA formats.
The Entrez Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, and PDB. The number of bases in these databases continues to grow at an exponential rate. As of April 2006, there were over 130 billion bases in GenBank and RefSeq alone.
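The choice between GenBank and FASTA output can also be expressed as a retrieval URL. This is a minimal sketch, assuming NCBI's E-utilities `efetch` endpoint with `rettype` set to `fasta` or `gb` (the slides cover only the web interface). The helper name `fetch_url` and the accession `XX000000` are placeholders, not real identifiers.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities "efetch" endpoint; not part of the original slides.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_url(accession, fmt="fasta"):
    """Build (but do not send) a URL for one nucleotide record.

    fmt is "fasta" for FASTA format or "gb" for the GenBank flat-file format.
    """
    if fmt not in ("fasta", "gb"):
        raise ValueError("fmt must be 'fasta' or 'gb'")
    params = {"db": "nuccore", "id": accession, "rettype": fmt, "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # "XX000000" is a placeholder accession used only for illustration.
    print(fetch_url("XX000000", "fasta"))
    print(fetch_url("XX000000", "gb"))
```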
GenBank
Taxonomy Database: The taxonomy database contains the names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence. You can search for nucleotide, protein, and structure data from specific taxonomic groupings, from the domain level (archaea, bacteria, eukaryota) down to the species level.
GenBank
The Taxonomy Browser provides:
• a browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
Taxonomy Browser
BLAST (Sequence Similarity Search)
BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is protein or DNA. For a better understanding of BLAST, you can refer to the BLAST Course, which explains the basics of the BLAST algorithm, or to the NCBI BLAST tutorial.
http://www.ncbi.nlm.nih.gov/BLAST/
BLAST (Sequence Similarity Search)
Sequence alignments provide a powerful way to compare novel sequences with previously characterized genes. Both functional and evolutionary information can be inferred from well-designed queries and alignments. BLAST 2.0 (Basic Local Alignment Search Tool) provides a method for rapid searching of nucleotide and protein databases.
BLAST: the most popular data-mining tool ever!
For non-coding DNA, use blastn. Never forget that blastn is only suitable for closely related DNA sequences (more than 70 percent identical).
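To make the 70 percent figure concrete, percent identity can be computed from two already aligned sequences. The sketch below assumes one common convention, counting only columns where neither sequence has a gap; BLAST's own reported identity is computed over its local alignment and may differ. The function name `percent_identity` and the example sequences are made up for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical positions between two aligned sequences.

    Columns containing a gap in either sequence are ignored (an assumption;
    other conventions count gaps in the denominator).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    identity = percent_identity("ACGTACGTAC", "ACGTTCGTAA")
    verdict = "reasonable for blastn" if identity > 70 else "probably too divergent for blastn"
    print(f"{identity:.1f}% identical: {verdict}")
```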
BLAST (Sequence Similarity Search)
1. Point your browser to the NCBI BLAST server at: http://www.ncbi.nlm.nih.gov/BLAST
2. Under the Nucleotide heading, click the Nucleotide-Nucleotide (blastn) link
3. Paste your sequence in the search window
4. Click the Blast! button
5. Click the Format button (and wait)
An overview of the BLAST output
1. A graphic display: shows you where your query is similar to other sequences
2. A hit list: the names of sequences similar to your query, ranked by similarity
3. The alignments: every alignment between your query and the reported hits
4. The parameters: a list of the various parameters used for the search
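The submit-then-format sequence in the numbered steps above can be mirrored programmatically. The following is a sketch only: it assumes NCBI's BLAST URL API (`CMD=Put` to submit, `CMD=Get` with a request ID to retrieve results), which the slides do not mention. The helper names and the request ID `ABC123` are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumption: NCBI BLAST URL API endpoint; the slides describe only the web form.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def submit_params(query_sequence, database="nt"):
    """Form fields for submitting a blastn search (web steps 2 to 4)."""
    return {
        "CMD": "Put",            # submit a new search
        "PROGRAM": "blastn",     # nucleotide-nucleotide comparison
        "DATABASE": database,    # assumed default: the "nt" nucleotide collection
        "QUERY": query_sequence,
    }


def result_url(request_id, format_type="Text"):
    """URL for retrieving formatted results (web step 5) for a request ID."""
    params = {"CMD": "Get", "RID": request_id, "FORMAT_TYPE": format_type}
    return BLAST_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(submit_params("ACGTACGTACGTACGTACGT"))
    # "ABC123" is a placeholder; a real request ID is issued by the server after submission.
    print(result_url("ABC123"))
```

As with the web form, results are not available immediately; a real client would wait and poll before requesting the formatted output.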
BLAST (Sequence Similarity Search)
The graphic display
• Your query sequence is at the top
• Each bar represents the portion of another sequence that is similar to your query sequence
• Red bars indicate the most similar sequences, pink bars indicate somewhat weaker matches, and green bars indicate matches that are not impressive at all. Blue and black hits (not shown here) are bad hits
The hit list
Each line contains four important features:
• The sequence accession number and name: this hyperlink takes you to the database entry that contains the sequence
• Description
• The bit score: a measure of the statistical significance of the alignment. The higher the bit score, the more similar the two sequences. Matches below 50 bits are very unreliable
• The E-value (the expectation value): an estimate of the number of times you could expect such a good match purely by chance. The lower the E-value, the more significant the match. If the E-value is less than 1 × 10^-50, the hit is very similar to the query sequence and is very likely to be evolutionarily related.
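The bit score and the E-value are linked by the standard Karlin-Altschul relation E = m * n * 2^(-S), where S is the bit score, m the query length, and n the total database length. This relation is not given in the slides, and the sketch below ignores the effective-length corrections BLAST applies. The function names and the "intermediate" label are assumptions; the two thresholds are the ones quoted above.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S).

    Uses raw lengths; BLAST itself uses corrected "effective" lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def describe_hit(bit_score, e_value):
    """Apply the two rules of thumb quoted in the text."""
    if bit_score < 50:
        return "very unreliable"
    if e_value < 1e-50:
        return "very similar, likely evolutionarily related"
    return "intermediate"  # label chosen for this example only


if __name__ == "__main__":
    # Hypothetical numbers: a 500-base query against a database of 10**9 bases.
    for score in (40, 100, 400):
        e = expect_value(score, 500, 10**9)
        print(f"{score:>4} bits -> E = {e:.3g} ({describe_hit(score, e)})")
```

Each additional bit halves the expected number of chance matches, which is why high bit scores correspond to vanishingly small E-values.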
GenBank
1. Locus gives the locus name
2. Definition provides a short definition of the gene
3. Accession lists the accession number, a unique identifier within and across various databases
4. Source gives the common name of the organism to which the sequence belongs
5. Organism gives a more complete identification of the organism, including its technical (!!!) taxonomic classification
6. Reference introduces a section where the credits for the sequence determination are given
7. Features describes precisely the gene regions and the associated biological properties
Select FASTA in the Display window!
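The fields listed above sit in the header of a GenBank flat-file record, where the keyword occupies the first 12 columns of a line and continuation lines are indented. A minimal sketch of reading them follows, assuming that layout. The record in `SAMPLE_RECORD` is entirely made up (placeholder accession `XX000000`), and `parse_header` is a hypothetical helper, not part of any library.

```python
# Entirely hypothetical record, shown only to illustrate the layout.
SAMPLE_RECORD = """\
LOCUS       XX000000     120 bp    DNA     linear   INV 01-JAN-2000
DEFINITION  Hypothetical nematode ribosomal RNA gene, partial
            sequence.
ACCESSION   XX000000
SOURCE      Hypothetical nematode
  ORGANISM  Hypothetical nematode
            Eukaryota; Metazoa; Nematoda.
FEATURES             Location/Qualifiers
"""


def parse_header(record_text):
    """Collect header keywords (LOCUS, DEFINITION, ...) into a dict.

    Assumes the keyword sits in columns 1-12 and that lines with an empty
    keyword column continue the previous field. Stops at FEATURES or ORIGIN.
    """
    fields = {}
    current = None
    for line in record_text.splitlines():
        if line.startswith("FEATURES") or line.startswith("ORIGIN"):
            break
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            current = keyword
            fields[current] = value
        elif current and value:
            fields[current] += " " + value  # continuation line
    return fields


if __name__ == "__main__":
    for key, value in parse_header(SAMPLE_RECORD).items():
        print(f"{key}: {value}")
```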
GenBank
Select the sequence, then copy and paste it into a new text file. Create a file with several sequences, including the sequence of an outgroup taxon.
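Instead of pasting by hand, the combined file can be assembled with a few lines of code. This is a minimal sketch, assuming plain FASTA output with one `>` header line per sequence; the function name `to_fasta`, the taxon labels, and the short sequences are all placeholders.

```python
def to_fasta(records, width=60):
    """Turn a list of (name, sequence) pairs into multi-FASTA text.

    Whitespace inside sequences is removed, letters are upper-cased,
    and sequence lines are wrapped at `width` characters.
    """
    lines = []
    for name, seq in records:
        seq = "".join(seq.split()).upper()
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder names and sequences; the last record plays the role of the outgroup.
    records = [
        ("ingroup_taxon_1", "acgtacgtac gtacgtacgt"),
        ("ingroup_taxon_2", "acgtacgaacgtacgtacgt"),
        ("outgroup_taxon", "acgttcgaacgtaagtacgt"),
    ]
    print(to_fasta(records), end="")
```

The resulting text can be saved to a file and used as input for alignment software.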
TreeBASE
TreeBASE is a relational database designed to manage and explore information on phylogenetic relationships. Its main function is to store published phylogenetic trees and data matrices. It also includes bibliographic information on phylogenetic studies, and some details on taxa, characters, algorithms used, and analyses performed. The database is designed to allow retrieval and recombination of trees and data from different studies, and it can be explored interactively using trees included in the database. TreeBASE therefore provides a means of assessing and synthesizing phylogenetic knowledge.
http://www.treebase.org/treebase/
Useful databases for nematologists
WormBase (http://www.wormbase.org) is the central data repository for information about Caenorhabditis elegans and related nematodes. As a model organism database, WormBase extends beyond the genomic sequence, integrating experimental results with an extensively annotated view of the genome. WormBase also provides a large array of research and analysis tools.
NemaGene (http://www.nematode.net) is a web-accessible resource for investigating gene sequences from nematode genomes. The database is an outgrowth of the parasitic nematode EST project. ESTs (expressed sequence tags) are usually shorter than the full-length mRNAs from which they are derived and are prone to sequencing errors. The database provides EST cluster consensus sequences, enhanced online BLAST search tools, and functional classification of cluster sequences.
Useful databases for nematologists
NEMBASE (http://www.nematodes.org) is a database providing access to the sequence and associated metadata currently being generated as part of the parasitic nematode EST project. Users may query the database on the basis of BLAST annotation, sequence similarity, or expression profiles. NEMBASE also features an interactive tool that allows the simultaneous display and analysis of the relative similarity relationships of groups of sequences to other databases.
NemAToL (http://nematol.unh.edu) is an open database dedicated to collecting, archiving, and organizing video images, other morphological information, DNA sequences, alignments, and other reference materials for the study of the phylogeny, diversity, taxonomy, systematics, and ecology of nematodes.