DATABASE & HOMOLOGY SEARCHING

DATABASE SEARCHING

1. Gene and nucleotide sequence information

Finding biological information in public databases can be confusing because of:

1. Choice of multiple databases (GenBank, EMBL, UniGene, individual species databases).

Problem 1: Find the following information about the human and mouse c-erbA genes:

a. Synonyms of the gene name.

b. The chromosome locations of the genes on the genetic and physical chromosome maps.

c. The number and sequences of alternative transcripts.

d. The nucleotide positions of the first and last three codons of the coding sequence of each transcript in (c).

e. Diseases associated with variant alleles of the human gene.

f. The percentage identity (often loosely called percentage homology) between the human and mouse proteins.

g. The relative degrees of expression among different tissues.

Resources to consult:

i) GenBank, using Entrez

ii) UniGene

iii) Ensembl

iv) DDBJ

v) The human and mouse Genome Browsers at UCSC

vi) Mouse Genome Informatics at the Jackson Laboratory

vii) Online Mendelian Inheritance in Man (OMIM).
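An Entrez search can also be issued programmatically through NCBI's E-utilities. The sketch below only builds ESearch URLs and does not contact the server. The choice of the `gene` database, the use of THRA (the symbol that appears later in this handout) as the c-erbA gene to look up, and the helper name `build_esearch_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, gene, organism, retmax=20):
    """Return an ESearch URL for one gene symbol in one organism."""
    # Entrez field tags restrict each word to a specific index.
    term = f"{gene}[Gene Name] AND {organism}[Organism]"
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# One URL per species named in Problem 1.
for organism in ("Homo sapiens", "Mus musculus"):
    print(build_esearch_url("gene", "THRA", organism))
```

Opening either URL in a browser returns a list of matching record identifiers, which can then be followed up in the individual databases listed above.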

2. Degeneracy of information (multiple entries of identical information)

Information for genes may be available as:

i) Genomic sequence - whole, exons, promoter etc.

ii) mRNA or cDNA sequence - intron sequence removed!

iii) ESTs - Expressed Sequence Tags - 5' or 3' fragments of cDNAs - may include vector sequence and sequencing errors

iv) STSs - Sequence Tagged Sites - short sequences that occur once in the genome and so mark a unique location, e.g. within a gene.
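The difference between a genomic record and an mRNA or cDNA record can be made concrete with a small sketch. It assumes exon coordinates are given as 0-based, end-exclusive intervals; the sequence and coordinates are invented for illustration and do not come from any database entry.

```python
def splice_exons(genomic, exons):
    """Join exon intervals (0-based, end-exclusive) into a cDNA-like sequence."""
    return "".join(genomic[start:end] for start, end in sorted(exons))


# Toy sequence: exons in upper case, the intron in lower case.
genomic = "ATGGCC" + "gtaagtttttcag" + "GATTAA"
exons = [(0, 6), (19, 25)]

cdna = splice_exons(genomic, exons)
print(cdna)                      # ATGGCCGATTAA
print(len(genomic), len(cdna))   # 25 12
```

The same gene therefore appears in a database with different lengths and coordinates depending on whether the entry is genomic or transcript-derived.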

Problem 2: What types of sequence information are available in GenBank for the human Hoxa-13 gene?

3. Degeneracy of Nomenclature - some proteins have multiple functions and classifications

Problem 3: What are alternate names for the Thyroid hormone receptor alpha and Foxa1 genes?

4. Alternate spellings of the gene name.

Problem 4: Using Hoxa13 as the query term in Entrez, do you want to change your answer to Problem 2 above?
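One way to guard against spelling differences is to search for several variants of a name at once. The following is a minimal sketch that only varies the separator between the letter and digit parts of a name; the function names and the use of the `[Gene Name]` field tag are assumptions, and real nomenclature differences (synonyms, capitalization conventions) go well beyond this.

```python
import re


def name_variants(name):
    """Return variants of a gene name that differ only in hyphen or space use."""
    # Split into runs of letters and runs of digits, dropping any separators.
    parts = re.findall(r"[A-Za-z]+|\d+", name)
    return sorted({sep.join(parts) for sep in ("", "-", " ")})


def or_query(name, field="Gene Name"):
    """Combine the variants into one Entrez-style OR query."""
    return " OR ".join(f'"{v}"[{field}]' for v in name_variants(name))


print(name_variants("Hoxa-13"))  # ['Hoxa 13', 'Hoxa-13', 'Hoxa13']
print(or_query("Hoxa13"))
```

Both "Hoxa-13" and "Hoxa13" produce the same set of variants, so the query no longer depends on which spelling the searcher happens to type.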

Links between databases, and from databases to literature retrieval services, now make information retrieval easier.

Example: NCBI, accessed via Entrez

Problem 5: What is the chromosome map location of the human THRA gene?

5. Multiple accession or id numbers for the same gene in different databases.

Identification of alternative accession or id numbers is now possible using various tools, e.g.:

MatchMiner
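Part of the confusion arises because a single gene can have separate identifiers for its genomic, transcript and protein records. As a small illustration (not a substitute for an ID-mapping tool), the sketch below strips the version suffix from an accession and classifies it by RefSeq prefix. The accession strings used are placeholders, not real records, and the function names are hypothetical.

```python
# RefSeq accession prefixes and the kind of record they denote.
REFSEQ_PREFIXES = {
    "NC_": "genomic (complete molecule, e.g. chromosome)",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA",
    "XP_": "predicted protein",
}


def split_version(accession):
    """Split 'ACCESSION.VERSION' into (accession, version or None)."""
    base, dot, version = accession.strip().partition(".")
    number = int(version) if dot and version.isdigit() else None
    return base, number


def describe(accession):
    """Return (accession without version, version, record type)."""
    base, version = split_version(accession)
    kind = REFSEQ_PREFIXES.get(base[:3].upper(), "not a RefSeq prefix in this table")
    return base, version, kind


# Placeholder accessions, for illustration only.
for acc in ("NM_000000.2", "NP_000000.1", "X00000"):
    print(describe(acc))
```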

Indirect ploys:

If information appears to be unavailable for a gene in a given species, the following approaches may be successful:

1. Check UniGene with information from another species in which the gene has been identified.

2. Run a homology search with the gene from another species, e.g. using BLAST or FASTA. This is very helpful if the sequence is recorded under a different name.

3. Verify the attribution of a BLAST hit by following its link to Medline.

4. If the chromosomal location of the nucleotide sequence is known, verify it by querying an appropriate database, e.g. OMIM or related resources, or a species-specific database.

5. Repeat any or all of the above steps.
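Homology search tools report how similar two aligned sequences are, usually as a percentage identity. The sketch below shows one common way to compute it from a pairwise alignment, counting only columns where neither sequence has a gap. Other definitions use a different denominator (for example the full alignment length), so this choice is an assumption; the two aligned strings are invented toy data.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Toy alignment: 7 ungapped columns, 5 of them identical.
seq_a = "MKT-LLVA"
seq_b = "MRTALLIA"
print(round(percent_identity(seq_a, seq_b), 1))  # 71.4
```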

Problem 6: Identify the name usually used for the Pancreastatin gene in mouse.

Problem 7: There is nucleotide sequence available for the mouse homeobox D11 gene. Is there sequence published for the chick?

(Searching Entrez Nucleotide for chick hox or homeobox D11 gives no result)

InstaSeq: Google-like tool for sequence retrieval.

Given fragments of a DNA, RNA or protein sequence as the query, it retrieves sequences from many databases on the web and returns the results in a Google-like format.

paper2sequences: retrieval of a collection of sequences, e.g. as listed in a publication.

The user specifies the accession codes, which are then looked up automatically in multiple databases.

LITERATURE SEARCHING TOOLS

eJOURNALS

University of Cincinnati eJournals

LinkOut Journals at NCBI

CONTEXTUAL TEXT-BASED SEARCHING METHODS:

The vast amount of available information has led to the recent development of tools to explore relationships between genes, proteins, etc. in databases.

The types of relationships may include:

- similar function

- involvement in the same cellular pathway

- inclusion in the same multicomponent complex

- involvement in a disease

- inclusion in multiple literature citations.
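The last kind of relationship, co-occurrence in the literature, is the simplest to illustrate. The sketch below counts how often pairs of terms appear in the same text. It is a toy illustration under strong assumptions (plain substring matching, three made-up placeholder texts) and is not the method used by any of the tools described in this section.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(texts, terms):
    """Count, for each pair of terms, the number of texts mentioning both."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        # Crude case-insensitive substring match; sorted so pairs are canonical.
        present = sorted(t for t in terms if t.lower() in lowered)
        counts.update(combinations(present, 2))
    return counts


# Placeholder texts written for this example; they are not real abstracts.
texts = [
    "Toy abstract one: heparin and Alzheimer are both mentioned.",
    "Toy abstract two: only heparin is mentioned.",
    "Toy abstract three: Alzheimer, heparin and apoptosis are mentioned.",
]
terms = ["heparin", "Alzheimer", "apoptosis"]

for pair, n in cooccurrence(texts, terms).most_common():
    print(pair, n)
```

Pairs with high counts are candidates for a real relationship, but co-mention alone does not establish one; the context of each mention still has to be read.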

XplorMed: a web server for exploring scientific literature. Identifies references containing specified words or word combinations and their context with respect to other words contained in those references.

Exercise 1: Take the Learning by example tutorial to explore the involvement of heparin in Alzheimer's disease at the XplorMed Webserver.

MedMiner: The MedMiner filters will extract and organize relevant sentences in the literature based on a gene, gene-gene or gene-drug query. This tool combines the GeneCards and PubMed search engines with user input and automated server-side scripts in an integrated text filtering system.
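The idea of extracting relevant sentences for a gene-gene or gene-drug query can be sketched in a few lines. This is only an illustration of sentence-level filtering, assuming a naive split on sentence-ending punctuation; it is not MedMiner's actual implementation, and GENE1 and DRUGX are hypothetical names.

```python
import re


def matching_sentences(text, required_terms):
    """Return the sentences of `text` that mention every required term."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    wanted = [t.lower() for t in required_terms]
    return [s for s in sentences if all(t in s.lower() for t in wanted)]


# Placeholder text with hypothetical gene and drug names.
text = (
    "GENE1 is mentioned alone here. "
    "GENE1 and DRUGX appear together in this sentence. "
    "DRUGX is mentioned alone here."
)
print(matching_sentences(text, ["GENE1", "DRUGX"]))
```

Filtering at the sentence level rather than the abstract level keeps only the passages where the two query terms are likely to be discussed in relation to each other.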

Exercise 2: Take the Tutorial at the MedMiner Webserver to explore the properties and relationships of genes involved in apoptosis.

TXTGate: TXTGate is a literature index database and is part of an experimental platform to evaluate (combinations of) information extraction and indexing from a variety of biological annotation databases. It is designed for text-based summarization and analysis of groups of genes.

Exercise 3: Take the TXTGate tutorial.

Agilent Technologies Literature Search Software: download from http://www.labs.agilent.com/research/mtl/projects/sysbio/sysinformatics/litsearch.html