GENOMICS II
 
 

Genomics II: Gene prediction and Analysis



The identification of exons, introns, repetitive sequences, splice sites, transcription and translation start sites etc. is termed annotation.

ENCODE (ENcyclopedia Of DNA Elements) project at NHGRI

Gene structures are predicted using three types of information:

1.  SIGNALS

A. Splice junctions

All methods give many false positives.  Accuracy increases significantly if the region in which a splice site is expected to occur can be narrowed.
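
The false-positive problem can be seen with a deliberately naive scan. Nearly all introns begin with GT and end with AG, but each dinucleotide also turns up by chance roughly once every 16 bases in random sequence. The sketch below is only an illustration (the function names are made up for this example and are not taken from any of the tools listed on this page); it shows how restricting the search window reduces the number of candidates.

```python
def find_dinucleotide_sites(seq, dinucleotide):
    """Return 0-based start positions of every occurrence of a dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]


def candidate_splice_sites(seq, start=0, end=None):
    """List candidate donor (GT) and acceptor (AG) positions in seq[start:end].

    Positions are reported in the coordinates of the full sequence.
    Narrowing (start, end) is the simplest way to cut down chance matches.
    """
    end = len(seq) if end is None else end
    window = seq[start:end]
    donors = [start + i for i in find_dinucleotide_sites(window, "GT")]
    acceptors = [start + i for i in find_dinucleotide_sites(window, "AG")]
    return donors, acceptors


# A short stretch taken from the mouse fragment in PROBLEM 3.
example = "CTTCCCAGGTAAGGCTTGGGAGATGACTACGGTAGGGGCCACAGTGACC"
donors, acceptors = candidate_splice_sites(example)
print(len(donors), "candidate donors;", len(acceptors), "candidate acceptors")
```

Real splice-site predictors score the flanking bases as well (weight matrices, HMMs, neural networks), which is why they do better than this dinucleotide scan.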

B. Sites associated with promoters

TATA boxes, transcription factor binding sites, CpG islands
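
As an illustration of one of these signals, CpG islands are commonly located with a sliding window that checks GC content and the ratio of observed to expected CpG dinucleotides. The sketch below assumes the widely quoted thresholds (window of 200 bp, GC content above 50%, observed/expected above 0.6); the function names and the step size are choices made for this example only.

```python
def cpg_stats(window):
    """Return (GC content, observed/expected CpG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C * G) / N.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def cpg_island_windows(seq, size=200, step=50, min_gc=0.5, min_ratio=0.6):
    """Return (start, GC content, obs/exp) for windows passing both thresholds."""
    hits = []
    for start in range(0, max(len(seq) - size, 0) + 1, step):
        gc, ratio = cpg_stats(seq[start:start + size])
        if gc > min_gc and ratio > min_ratio:
            hits.append((start, round(gc, 2), round(ratio, 2)))
    return hits
```

Overlapping windows that pass would normally be merged into a single island; that step is left out here to keep the sketch short.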

C. Other

Poly(A) sites - help locate the 3' end of the gene, downstream of the exon encoding the carboxyl terminus

In general, using signals other than splice junctions provides only a marginal improvement over methods that do not use them.


2.  CONTENT STATISTICS

Codon bias

Every species has a bias in its choice of codons.  Knowing the bias can help identify genes from sequence.
Coding regions also have asymmetries and periodicities that can help distinguish them from noncoding regions.  Long exons are easier to identify than short ones.
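
A minimal sketch of how codon bias can be turned into a content statistic: tabulate codon frequencies from known coding sequence, then score each reading frame of a query by its average log-odds against uniform codon usage (1/64). The function names and the pseudo-frequency for unseen codons are assumptions of this example; a real table would come from a resource such as the codon usage database, not from a single short sequence.

```python
import math
from collections import Counter


def split_codons(seq):
    """Split a sequence into complete codons, dropping any trailing bases."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_frequencies(cds):
    """Relative frequency of each codon in a coding sequence."""
    counts = Counter(split_codons(cds))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def coding_score(seq, freqs, pseudo=1e-3):
    """Average log-odds per codon of the codon table versus uniform usage.

    pseudo is a made-up floor for codons absent from the table.
    """
    codons = split_codons(seq)
    if not codons:
        return 0.0
    return sum(math.log(max(freqs.get(c, 0.0), pseudo) * 64) for c in codons) / len(codons)


def best_frame(seq, freqs):
    """Return the frame (0, 1 or 2) with the highest coding score."""
    return max(range(3), key=lambda frame: coding_score(seq[frame:], freqs))
```

Because the score is averaged over codons, short exons give noisy estimates, which is one reason long exons are easier to identify.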

Codon usage database  at Kazusa DNA Research Institute
 

3.  SIMILARITY TO KNOWN GENES

The accuracy of gene prediction methods has improved in the past few years because of the enormous increase in the number of known coding sequences available as examples.

Tips for using analysis and prediction programs with a genomic sequence:

     1. Remove or mask repetitive elements (Alu elements, LINEs, etc.)
     2. Perform a database search on the translated DNA (BLASTX or TFASTA)
     3. Perform an ORF/gene-finding search (GRAIL, GENIE, GENSCAN, etc.)
     4. Identify signal sites, e.g. splice sites
     5. Translate putative ORFs and do Functional Analysis (Blocks, Motifs, etc)

Always use more than one program to analyze your sequence.
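
Step 5 above (translating putative ORFs) can be sketched in a few lines. The example assumes the standard genetic code and only finds uninterrupted ATG-to-stop ORFs in the six reading frames, so it suits cDNA or intronless sequence and is not a substitute for the gene finders named in step 3. All function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (characters other than ACGT are kept)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, protein) for each ATG...stop ORF on both strands."""
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            protein = translate(s[frame:])
            # Every segment except the last is terminated by a stop codon.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_codons:
                    orfs.append((strand, frame, segment[start:]))
    return orfs
```

The resulting protein strings are what would be submitted to the functional analysis tools in step 5.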

Review: NHGRI Webvideo course

REPETITIVE ELEMENTS

Genomes contain a large proportion of repetitive sequence including:

    a) Simple repeats

    b) Transposons of several categories

Available databases for repeat identification and masking:

CENSOR at the Genetic Information Research Institute. Sublibraries for C. elegans, D. melanogaster etc.

RepeatMasker2
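
To make the idea of masking concrete, here is a small sketch that hard-masks simple (tandem) repeats with a regular expression. It is an assumption-laden toy: the tools above work by comparing the query against curated repeat libraries, which is also how transposons are found, whereas this example only catches short units repeated in tandem. The function name and default thresholds are invented for illustration.

```python
import re


def mask_simple_repeats(seq, max_unit=6, min_copies=4, mask_char="N"):
    """Replace tandem repeats of a 1..max_unit bp unit with mask_char.

    A repeat is masked when the unit occurs at least min_copies times in a row.
    """
    seq = seq.upper()
    # Group 1 is the repeat unit; the backreference demands further copies of it.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


print(mask_simple_repeats("ACGT" + "CA" * 6 + "GGTC"))
```

Masking with N keeps the sequence coordinates unchanged, so hits from later database searches still map back to the original sequence.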


ORF FINDING

The existence of ORFs in uncharacterized sequences is addressed by several types of method:

1. Searching for potential polyA or splice sites by pattern-identification models e.g. HMMs

 BCM Search Launcher at Baylor College of Medicine and Genefinder
 

2. Searching for exons using methods based on HMMs and codon usage frequencies

GRAIL

GENIE and FGENESH - HMM-based human gene structure prediction (multiple genes, both strands), and other tools at BCM Search Launcher

3.  Complete Gene searching using probability tables

GENSCAN at MIT

4.  Nomi Harris's list of gene annotation tools, including GFF format

5.  Anders Krogh's Compendium

6.  Gene Ontology Consortium

The goal of the Gene Ontology Consortium is to produce a dynamic controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.

7. SLAM webserver for comparative gene finding and alignment

Provides annotations, alignments, and conserved noncoding sequences for pairs of homologous sequences.

Also links to whole genome annotations of human, mouse & rat genomes.

TRANSCRIPTION START SITE PREDICTION

Computational prediction of transcription start sites is difficult in metazoans, since signals such as TATA and CAAT boxes and Inr sequences are not consistently present.

This is especially true of vertebrates.

30% or more of human genes are predicted to have more than one transcript, and many have several.

These genes may therefore have multiple transcription start sites and promoters

Databases

DBTSS - start sites for human, mouse & malaria parasite.


PROMOTER PREDICTION

FIE2 and Dragon Gene Start Finder: Molecular bioinformatics group, Singapore. Also identifies 5' gene start.

Gibbs Recursive Sampler: Transcription Factor Binding Site Determination.

MATCH™: A tool for searching transcription factor binding sites

Matinspector - Prediction of transcription factor binding sites.

PPNN - Promoter Prediction by Neural Network at Berkeley Drosophila Genome Project.

PROMO - TFBS prediction in single and related groups of sequences at ALGGEN server.

Promoter at the Technical University of Denmark.

Site Seer: Visualization and Analysis of Transcription factor Binding Sites.

Target Explorer: identification of New Target Genes for a Specified Set of TFs

YMF: program for discovery of novel TFBS by statistical overrepresentation




Transcription factor binding site prediction using orthologous sequences:

TFBSs are predicted at equivalent positions in pairs of orthologous sequences, which reduces the number of false-positive predictions.
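
The filtering idea can be sketched as follows: find motif matches in each of two aligned orthologous sequences, convert their positions to alignment columns, and keep only the matches that start in the same column in both. This is a simplified illustration; it assumes a pairwise alignment is already available (gaps written as "-"), uses an exact string instead of a position weight matrix, and the sequences and function names are invented for the example.

```python
def motif_columns(aligned, motif):
    """Alignment columns at which an exact motif match starts (gaps ignored)."""
    # columns[i] is the alignment column of the i-th ungapped base.
    columns = [i for i, ch in enumerate(aligned) if ch != "-"]
    ungapped = aligned.replace("-", "").upper()
    hits = set()
    pos = ungapped.find(motif)
    while pos != -1:
        hits.add(columns[pos])
        pos = ungapped.find(motif, pos + 1)
    return hits


def conserved_hits(aligned_a, aligned_b, motif):
    """Columns where the motif starts in both aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return sorted(motif_columns(aligned_a, motif) & motif_columns(aligned_b, motif))


# Hypothetical aligned promoter fragments.
human = "GGTATAAAGC--CTATAAAT"
mouse = "GCTATAAAGCTTCAGTCAAT"
print(sorted(motif_columns(human, "TATAAA")))   # matches in one sequence alone
print(conserved_hits(human, mouse, "TATAAA"))   # matches retained in both
```

In this toy case the single-sequence scan reports two sites and the orthologous comparison keeps only the one at a shared position.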

Consite

PromH: Azerbaijan National Academy of Sciences

PROMO - TFBS prediction in single and related groups of sequences at ALGGEN server.

TRAFAC

PROBLEM 1

Carry out a TFBS analysis with Consite, using the sequences provided on the home page.  Compare the results with those obtained using a single sequence and the Ahr-Arnt transcription factor as the query.

PROBLEM 2

Using TRAFAC, find the predicted TFBS in the promoter regions of the human and mouse Uteroglobin genes (symbol SCGB1A1).

PROBLEM 3

Below is a fragment of mouse chromosome 4.  Compare the abilities of gene-finding tools to predict the presence of a gene in the fragment.

GGTACCATGGAGCTGGCTTCAAAACCGGCCTCATCAGAAAGAAGAAAGTGGTGAGGGTATAGAGAAGCCC
ACTGGGCTCTAACTACCCAGCTAATGCCACTATCCAGAGGACAACGCAAGCCCTCCCTGGAACTGACAGG
ACACGGATTTTTTTTTTCCCAGACTTTTGCAGTTCTCAGAGCTAAAGCCAAAGAAAGCAGAGGGTGAAGA
CTCTGAGTGGGGGGAAGAAGGAAGCCAAGGTCATGTTTTGCCAGAAAAAAAAGGGGGCAGAGGGGGGTAA
AGGGGAAGACCCAGAACGGGCCATCGAGGGGGAATTTTGGGACAGAAAGTGAAAGCTCAAGGCATAAATT
GGAGCTGGTGGCAGCCTCAGCAGGCTCAAGCGGGCTCACTGGGGCACACACAAGCTCACTTGCACTTGGC
TCCTGAACCCCTTCACGCCACAGGAGGACATCTGAGCGATTCCAGTCACTCCCCTGTGAACCCGTCGGAG
CCTCGGCCTCTCAGATTCTTGCGCACACAGTCTCTCAGGCTCACTCACTCTCTGTGGCCTGCCTCAGATT
ATCTGCCATGGCCCCCCGTGTGACCCCACTCCTGGCCTTCAGCCTGCTGGTTCTCTGGACCTTCCCAGGT
AAGGCTTGGGAGATGACTACGGTAGGGGCCACAGTGACCTGTGGACAATATGTAGGTCTTAGACTATTAC
AATGGTCATGAAAGACCCTCCGAGTTCAGAGTCAGTGGATCCAAGGGCTCCTTATGCCTTCTGTGTGTGC
CTGTCAATATGAATGTGCCTGTATCCAGGATTCCAATGTGGCTCAGCTCTGTCATATCCTGGGGCTTATA
GCTGTGGGCTCCATGTCTTGTTACCTCAGCTTCCAGGCAACAGAAGGAAGGAACAGTGTATCTTCCTTGA
GGGACCTAAGACTATAATTAAGTGGGTGGAGTGTTTGCCTTGCCTTCGTGAAACCCTGGTTTCAATTGCA
CCACTGGGGTTGTGGGGGTGAGAGCTTTGCTGTGTATTTGCTTCCATCTGAGAGAACAAGAAACACTGAA
GACATGCTTAATAACTTGTTCCCCTTTTGGGGGTTGCATTTTTAAGGGCCTTGGTTTGGTTTGGTTTGGT
TAGTTTGGTTTAGTTTGGCTTTTGAGAAAGCATCTGTCCATGTTGCCCAGGCTAGCCTCCATCTCCTGGG
CTCACGTGATCCTACATCTCAGCTTCCCTCCTGGTAGCTGGGACTCCGGGTATGGCAGCCAGCTGCTGTA
GTCATTCAGAAGCCTGAAGGAGTCTCAGCCGAGGCAAGCCTCCCCTGTCCCAGCATTCTTCTGTTAACTC
TCCATTTGCTAAAACACTGTAAAAGCTGAGAAAACCTGGGAAAGGGTGTATGACCCCAGCATTTGGGATA
CGTTGAAAAATGGTTATTTACTAAATGGCTTAGAAGACTACAATATGTTCAGCAGCCAACTGAGCCACAG
TGGCACTCAAGGCTGAATATCAGGTCCAGCAATCACACACAGAGAGATGCTATCAAAGCCACTTAGGGAC
TGTGGCCTCCCTTCCTCATGGTCTCAGGGTCTTCTCCCCTTCTTTCTCCTATCTCAGCCCCAACTCTGGG
GGGTGCTAATGATGCGGAAGACTGCTGCCTGTCTGTGACCCAGCGCCCCATCCCTGGGAACATCGTGAAA
GCCTTCCGCTACCTTCTTAATGAAGATGGCTGCAGGGTGCCTGCTGTTGTGTGAGTTGCTTGTGGAAAGA
ATATCTGGCCCCATCCCCCCATGAGCCCTTGCTGATGCCATCATGGCTTTAACCCTGAACTCATGGCAGA
GCCCAGTTTTCATGGAAGCCTATGAAACAGGTCCCTACAAATAGTCTCCAAGCCTCTGCTCCTTACTCTA
GAGCCTTCTAGGAAACTGGGTTCCAGGGCTTTTATTCTCTCCAACCTCTGGCTACAGGTTCACCACACTA
AGGGGCTATCAGCTCTGTGCACCTCCAGACCAGCCCTGGGTGGATCGCATCATCCGAAGACTGAAGAAGT
CTTCTGCCAAGGCAAGCCTGACCCTCCTCAGTCCTGCCTCCGCCCTCCCAACACCCCGAGATTCCAGCTC
ATGACCCTGCCTCTCCTCCCTCCCCTTAGAACAAAGGCAACAGCACCAGAAGGAGCCCTGTGTCTTGAGT
AAAGAGATGTGAATCACTCTGGCCCAGGAAACCAAGGACCAGAAGAGAGGACCAGGCCTCCTGATGCTCT
GTCCCAGACCTAACCCAGCCAAGTCTGTGCCTAGAGAGTCGATGTGAGTGTGGACAAGAGAGTTTGTGTG
GCTAGAACACCATCTCTCTGTGGCTAGACTGCAGAGCTTCCAATAAAGCCGCTTGGTACC

Can you identify the gene using the Mouse Genome Browser at UCSC?

2. Predict the promoter region in the above sequence, and possible transcription factor binding sites
 

PROBLEMS OF GENOME ANNOTATION

1.  Inability of ab initio methods to identify 5' and 3' - untranslated exons

2.  Tendency to artificially join or split genes (especially when genes are tandemly duplicated)

3.  Inability of ab initio methods to cope with overlapping genes (surprisingly common in Drosophila)

4.  Problems using EST data - ESTs may be incorrect, may be contaminants, may be primed off internal poly(A) sequences, or may
     reflect abnormal or intermediate splice forms of a pre-mRNA

GASP experiment - compared the performance of a number of analytical tools on the 2.9 Mb Adh region of D. melanogaster

Described in Ashburner Genome Research (2000), 10: 391-3 & Stormo Genome Research (2000), 10: 394-7

OTHER AVAILABLE SOFTWARE

MAGPIE (prokaryotes), EGRET (eukaryotes) and other resources at the Rockefeller University

Glimmer (bacterial and archaeal genomes)

Uses interpolated Markov models (IMMs) to identify coding regions and distinguish them from noncoding DNA. The IMM approach uses a combination of Markov models from first through eighth order, weighting each model according to its predictive power. Glimmer 1.0 and 2.0 use 3-periodic nonhomogeneous Markov models in their IMMs.
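
The interpolation idea can be illustrated with a toy model. The sketch below blends Markov models of increasing order, trusting a higher order more when its context was seen often in training. It is not Glimmer's algorithm: the count-based weight, the "confidence" constant, the low maximum order, and the absence of 3-periodicity are all simplifying assumptions made for this example.

```python
import math
from collections import defaultdict


class SimpleIMM:
    """Toy interpolated Markov model over DNA (orders 0..max_order)."""

    def __init__(self, max_order=3, confidence=10.0):
        self.max_order = max_order
        self.confidence = confidence  # hypothetical constant controlling the weights
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> base -> count

    def train(self, sequences):
        for seq in sequences:
            seq = seq.upper()
            for i, base in enumerate(seq):
                for order in range(min(i, self.max_order) + 1):
                    self.counts[seq[i - order:i]][base] += 1

    def prob(self, base, context):
        """P(base | context), blending every order whose context was observed."""
        context = context.upper()
        context = context[max(0, len(context) - self.max_order):]
        zero = self.counts.get("", {})
        # Order 0 with add-one smoothing over the four bases.
        p = (zero.get(base, 0) + 1) / (sum(zero.values()) + 4)
        for order in range(1, len(context) + 1):
            seen = self.counts.get(context[-order:])
            if not seen:
                break
            n = sum(seen.values())
            weight = n / (n + self.confidence)  # more data, more trust in this order
            p = weight * seen.get(base, 0) / n + (1 - weight) * p
        return p

    def log_likelihood(self, seq):
        seq = seq.upper()
        return sum(
            math.log(self.prob(base, seq[max(0, i - self.max_order):i]))
            for i, base in enumerate(seq)
        )


# Toy training data; real models are trained on many known genes.
coding = SimpleIMM(max_order=2)
coding.train(["ATGGCCGCCGCCGCCTAA"])
noncoding = SimpleIMM(max_order=2)
noncoding.train(["ATATATTTTAAATATATA"])
query = "GCCGCC"
# A positive log-odds value favors the coding model.
print(coding.log_likelihood(query) - noncoding.log_likelihood(query))
```

A region is called coding when its log-odds under the coding model exceeds that under the noncoding model; a 3-periodic version would keep a separate set of counts for each codon position.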

VEIL - Hidden Markov Model for vertebrate genes

MORGAN uses decision tree technology combined with dynamic programming for vertebrate genes

MORGAN is an integrated system for finding genes in vertebrate DNA sequences. MORGAN uses a variety of techniques to accomplish this task, the most distinctive of which is a decision tree classifier. The decision tree system is combined with new methods for identifying start codons, donor sites, and acceptor sites, and these are brought together in a frame-sensitive dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and introns). The optimal segmentation is dependent on a separate scoring function that takes a subsequence and assigns to it a score reflecting the probability that the sequence is an exon. The scoring functions in MORGAN are sets of decision trees that are combined to give a probability estimate. Experimental results on a database of 570 vertebrate DNA sequences show that MORGAN has excellent performance by many different measures. On a separate test set, it achieves an overall accuracy of 95%, with a correlation coefficient of 0.78 and a sensitivity and specificity for coding bases of 83% and 79%. In addition, MORGAN identifies 58% of coding exons exactly; i.e., both the beginning and end of the coding regions are predicted correctly.
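
The base-level measures quoted above can be computed as in the sketch below. It assumes the convention common in gene-finding evaluations, where sensitivity is TP/(TP+FN) and "specificity" is TP/(TP+FP); whether MORGAN's figures use exactly these definitions should be checked against the paper. The labels ('C' for coding, 'N' for noncoding) and the function name are choices made for this example.

```python
import math


def base_level_metrics(actual, predicted):
    """Compare per-base labels ('C' coding, 'N' noncoding) of equal length."""
    if len(actual) != len(predicted):
        raise ValueError("label strings must have equal length")
    pairs = list(zip(actual, predicted))
    tp = sum(a == "C" and p == "C" for a, p in pairs)
    tn = sum(a == "N" and p == "N" for a, p in pairs)
    fp = sum(a == "N" and p == "C" for a, p in pairs)
    fn = sum(a == "C" and p == "N" for a, p in pairs)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tp / (tp + fp) if tp + fp else 0.0,
        "correlation": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# One exon predicted one base too far downstream.
print(base_level_metrics("NNCCCCNNNN", "NNNCCCCNNN"))
```

Exon-level accuracy (the 58% figure) is stricter: an exon counts as correct only if both of its boundaries match exactly.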

References

The paper describing MORGAN is S. Salzberg, A. Delcher, K. Fasman, and J. Henderson.  A Decision Tree System for Finding Genes in DNA.  Journal of Computational Biology 5:4 (1998), 667-680.  A more tutorial introduction is S. Salzberg.  Decision Trees and Markov Chains for Gene Finding.  In S. Salzberg, D. Searls, and S. Kasif (eds.), Computational Methods in Molecular Biology, pp. 187-203.  Amsterdam:  Elsevier Science B.V., 1998.

GeneMine - sequence analysis & visualization program

Genotator - Workbench for sequence annotation & browsing

ETOPE - Evolutionary test of predicted exons 

ESEfinder: Identification of exonic splicing enhancers (ESEs)

Point mutations frequently cause genetic diseases by disrupting the correct pattern of pre-mRNA splicing.

This may occur through inactivation of ESEs, resulting in exon skipping.

OTHER RESOURCES

ENCODE (ENCyclopedia Of DNA Elements) project at NHGRI

Bioinformatics & Computational Genomics Course at Weizmann Institute, 1998

MAVID -  Multiple Alignment program for large genomic sequences

PipMaker - Identification of conserved regions in aligned sequences. Genome comparison tools

VISTA - Visual Tools for Alignment. Visualizes long sequence alignments of DNA from two or more species with annotation information.
