DATABASE SEARCHING
1. Gene and nucleotide sequence information
Finding biological information in public databases can be confusing due to:
1. Choice of multiple databases
(GenBank, EMBL, UniGene, individual species databases).
Problem 1: Find the following information
about the human and mouse c-erbA genes:
a. Synonyms of the gene name.
b. The chromosome locations of the genes on the genetic
and physical chromosome maps.
c. The number and sequences of alternative transcripts.
d. The numbers of the first and last three codons of the
coding sequences of the above.
e. Diseases associated with variant alleles of the
human gene.
f. The percentage sequence identity between the human and mouse
proteins.
g. The relative degrees of expression among
different tissues.
in each of the following resources:
i) GenBank, using Entrez
ii) UniGene
iii) Ensembl
iv) DDBJ
v) The human and mouse Genome Browsers at UCSC
vi) Mouse Genome Informatics at the Jackson Laboratory
vii) Online Mendelian Inheritance in Man (OMIM).
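Items d and f of Problem 1 come down to simple sequence arithmetic once the records have been retrieved. The following is a minimal sketch, not a replacement for an alignment program: it assumes the coding sequence has already been extracted and that the two proteins have already been aligned (gaps written as "-"). The function names and the toy sequences are made up for illustration, and percent identity is computed here over non-gap columns only, which is one of several conventions in use.

```python
def first_and_last_codons(cds, n=3):
    """Return the first n and the last n codons of a coding sequence."""
    cds = cds.upper().replace(" ", "")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return codons[:n], codons[-n:]


def percent_identity(aligned_a, aligned_b):
    """Percentage of non-gap alignment columns with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy inputs, invented for this sketch (not real c-erbA data).
first, last = first_and_last_codons("ATGGCCAAATTTGGGTAA")
print(first, last)
print(percent_identity("MKT-LV", "MRTALV"))
```

A different denominator (for example the full alignment length, or the length of the shorter protein) gives a different percentage, so it is worth noting which convention a given database or tool reports.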
2. Degeneracy of information
(multiple entries of identical information)
Information for genes may be available as:
i) Genomic sequence - whole, exons, promoter etc.
ii) mRNA or cDNA sequence - intron sequence removed!
iii) ESTs - Expressed Sequence Tags - 5' or 3' fragments of cDNAs; may include vector sequence, mistakes
iv) STSs - Sequence Tagged Sites - short unique sequences within a gene, thereby characterizing it.
Problem 2: What types of sequence information
are available in GenBank for the human Hoxa-13 gene?
3. Degeneracy of Nomenclature
- some proteins have multiple functions and classifications
Problem 3: What are alternate names for the Thyroid
hormone receptor alpha and Foxa1 genes?
4. Alternate spellings of the gene name.
Problem 4: Using Hoxa13 as the query term in Entrez, do you want to change your answer to Problem 2 above?
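Because a hyphen or space in a gene name can change what a text search returns, it can help to try the spellings together. The sketch below is an assumption about how one might generate such variants; the helper names are hypothetical, and it only handles symbols that are letters followed by a number (such as Hoxa-13). The resulting string is an ordinary OR query that can be pasted into a search box.

```python
import re


def symbol_variants(symbol):
    """Spelling variants for a symbol made of letters followed by a number."""
    symbol = symbol.strip()
    match = re.fullmatch(r"([A-Za-z]+)[\s-]?(\d+)", symbol)
    if not match:
        return [symbol]  # pattern not recognised: leave the name unchanged
    stem, number = match.groups()
    # Joined, hyphenated and spaced forms of the same name.
    return [f"{stem}{sep}{number}" for sep in ("", "-", " ")]


def or_query(variants):
    """Join variants into a quoted OR query string."""
    return " OR ".join(f'"{v}"' for v in variants)


print(or_query(symbol_variants("Hoxa-13")))
```

Names that do not fit the letters-plus-number pattern (c-erbA, for example) are returned unchanged, so official synonyms still have to be looked up separately.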
Interlinking of databases, and their further linkage to literature retrieval services, now makes information retrieval easier.
Example: NCBI, accessed via Entrez
Problem 5: What is the chromosome map location
of the human THRA gene?
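Entrez can also be queried programmatically through the NCBI E-utilities. The sketch below only builds an ESearch URL for the Gene database and makes no network request; the function name is hypothetical, and the choice of the [Gene Name] and [Organism] field tags is an assumption about how one might phrase Problem 5 as a query.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(symbol, organism, retmax=20):
    """Build an ESearch URL that looks up a gene symbol in one organism."""
    term = f"{symbol}[Gene Name] AND {organism}[Organism]"
    params = {"db": "gene", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = gene_search_url("THRA", "Homo sapiens")
print(url)
```

Opening the printed URL in a browser should return a list of Gene IDs; the map location itself would then be read from the corresponding Gene record, as in the interactive Entrez search.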
5. Multiple accession or ID numbers for the same gene in different databases.
Identification of alternative accession or ID numbers is now possible using
various tools, e.g.:
Indirect ploys:
If information appears to be unavailable for a gene in a species,
the following approaches may be successful:
1. Check UniGene using information from another species in which the gene has been identified.
2. Homology searching with the gene from another species using e.g. BLAST or FASTA - very helpful if the sequence is recorded under a different name.
3. Verifying the attribution from the BLAST result by linking to Medline.
4. If the chromosomal location of the nucleotide sequence is known, verifying by querying an appropriate database, e.g. OMIM or related resources, or a species-specific database.
5. Repetition of any or all of the above steps.
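Step 2 works because a sequence search compares the sequences themselves and ignores whatever name the entry was filed under. BLAST and FASTA both start by looking for short words shared between the query and database sequences; the sketch below shows only that seeding idea in toy form and is not an implementation of either program. The function names, entry names and sequences are invented for illustration.

```python
def kmers(seq, k):
    """Set of all overlapping k-letter words in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_words(query, database, k=4):
    """Rank database entries by how many k-letter words they share with the query."""
    query_words = kmers(query, k)
    scored = [(name, len(query_words & kmers(seq, k)))
              for name, seq in database.items()]
    # Highest number of shared words first; ties broken by entry name.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy database: the first entry resembles the query but has an unrelated name.
database = {
    "entry_under_other_name": "ATGGCCAAGTTCGT",
    "unrelated_entry": "CCCCTTTTGGGGAAAA",
}
hits = rank_by_shared_words("ATGGCCAAGTTCGA", database)
print(hits)
```

A text search for the query's gene name would miss the first entry entirely, whereas the word comparison ranks it at the top; real tools then extend and score such seed matches statistically.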
Problem 6: Identify the name usually used for the Pancreastatin gene in mouse.
Problem 7: There is nucleotide sequence available for the mouse homeobox D11 gene. Is there sequence published for the chick?
(Searching Entrez Nucleotide for chick hox or homeobox D11 gives no result.)
InstaSeq: Google-like tool for sequence retrieval. Using fragments of a DNA, RNA or protein query, it retrieves
sequences from many databases on the web and returns results in a Google-like format.
paper2sequences: retrieval of a collection of sequences, e.g. as listed in a publication.
Accession codes are specified; lookup in multiple databases is automatic.
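Collecting the accession codes listed in a publication is the first step of this kind of retrieval. How paper2sequences does this internally is not described here; the sketch below is only one plausible approach, using a regular expression for common GenBank-style accessions (one letter plus five digits, or two letters plus six digits) and RefSeq-style accessions (two letters, an underscore, then digits), each with an optional version suffix. The accession numbers in the sample text are placeholders, not real records.

```python
import re

# Common accession shapes; other formats exist and are not covered here.
ACCESSION = re.compile(
    r"\b(?:[A-Z]{2}_\d{6,9}|[A-Z]{2}\d{6}|[A-Z]\d{5})(?:\.\d+)?\b"
)


def extract_accessions(text):
    """Return accession-like tokens in order of first appearance, without repeats."""
    return list(dict.fromkeys(ACCESSION.findall(text)))


# Placeholder text with made-up accession numbers.
sample = ("Sequences were deposited under accession numbers X00000 and "
          "AB000000; the reference transcript was NM_000000.1. "
          "AB000000 was used for primer design.")
print(extract_accessions(sample))
```

Each extracted code could then be submitted to the databases in turn; a pattern this loose will also pick up some tokens that are not accessions, so the list should be checked by eye.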
LITERATURE SEARCHING TOOLS
eJOURNALS
University of Cincinnati eJournals
LinkOut Journals at NCBI
CONTEXTUAL TEXT-BASED SEARCHING METHODS:
The vast amount of available information has led to the recent development
of tools to explore relationships between genes, proteins, etc. in databases.
The types of relationships may include:
- similar function
- involvement in the same cellular pathway
- inclusion in the same multicomponent complex
- involvement in a disease
- inclusion in multiple literature citations.
XplorMed: a web server for exploring scientific literature.
It identifies references containing specified words or word combinations,
and the context of those words with respect to other words in the references.
Exercise 1: Take the Learning by example tutorial to explore
the involvement of heparin in Alzheimer's disease at the XplorMed web server.
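The idea of placing a word in the context of the other words that appear with it can be illustrated by a simple co-occurrence count. This is a toy sketch and an assumption about the general principle only, not XplorMed's actual algorithm; the function name, the stop-word list and the four short "abstracts" are invented word lists, not real literature.

```python
import re
from collections import Counter

# Very small stop-word list, for illustration only.
STOPWORDS = {"the", "of", "in", "and", "a", "to", "is", "with", "by"}


def cooccurring_words(abstracts, keyword):
    """Count, per word, how many abstracts contain it together with the keyword."""
    keyword = keyword.lower()
    counts = Counter()
    for text in abstracts:
        words = set(re.findall(r"[a-z]+", text.lower()))
        if keyword in words:
            counts.update(words - {keyword} - STOPWORDS)
    return counts


# Invented word lists standing in for abstracts.
abstracts = [
    "heparin amyloid binding assay",
    "heparin amyloid aggregation",
    "amyloid imaging methods",
    "yeast kinase signalling",
]
context = cooccurring_words(abstracts, "heparin")
print(context.most_common(3))
```

Words that appear with the keyword in many abstracts are candidates for a second, narrower search, which is the kind of iterative exploration the tutorial walks through.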
MedMiner: The MedMiner filters extract and organize relevant sentences
in the literature based on a gene, gene-gene or gene-drug query.
This tool combines the GeneCards and PubMed search engines with user input
and automated server-side scripts in an integrated text filtering system.
Exercise 2: Take the Tutorial at the MedMiner web server to explore
the properties and relationships of genes involved in apoptosis.
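Sentence-level filtering for a gene-drug query can be sketched as keeping only the sentences that mention both terms. This is a simplified illustration under stated assumptions, not MedMiner's implementation: the function name is hypothetical, GENE1 and DRUGX are placeholder names, and sentences are split naively on end-of-sentence punctuation.

```python
import re


def sentences_mentioning(text, term_a, term_b):
    """Return the sentences of text that contain both terms (case-insensitive)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    a, b = term_a.lower(), term_b.lower()
    return [s for s in sentences if a in s.lower() and b in s.lower()]


# Placeholder text; GENE1 and DRUGX are not real names.
text = ("GENE1 expression was measured after DRUGX treatment. "
        "GENE1 was also sequenced. "
        "DRUGX was given orally.")
relevant = sentences_mentioning(text, "GENE1", "DRUGX")
print(relevant)
```

Plain substring matching would also match a longer name that contains the term (GENE10, for example), so a real filter needs word-boundary checks and the synonym lists discussed earlier.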
TXTGate: TXTGate is a literature index database and is part of an experimental platform to
evaluate (combinations of) information extraction and indexing from a variety
of biological annotation databases. It is designed for the summarization
and analysis of groups of genes based on text.
Exercise 3: Take the TXTGate tutorial.
Agilent Technologies Literature Search Software: download from http://www.labs.agilent.com/research/mtl/projects/sysbio/sysinformatics/litsearch.html