Genomics II: Gene prediction and Analysis
The identification of exons, introns, repetitive sequences, splice sites, transcription and translation start sites etc. is termed annotation.
ENCODE (ENcyclopedia Of DNA Elements) project at NHGRI
Gene structures are predicted using three types of information:
1. SIGNALS
A. Splice junctions
All methods give many false positives. Accuracy increases significantly if the region in which a splice site is expected to occur can be narrowed.
B. Sites associated with promoters
TATA boxes, transcription factor binding sites, CpG islands
C. Other
PolyA sites - mark the 3' end of the transcript and so help locate the final exon (carboxyl terminus of the protein)
In general, using signals other than splice junctions provides only a marginal improvement over methods that do not use them.
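The false-positive problem for splice junctions can be illustrated with a deliberately naive scan. This is a minimal sketch, assuming that candidate donor sites are simply GT dinucleotides and candidate acceptor sites are AG dinucleotides; real predictors score the surrounding context as well. The function name and the window coordinates are illustrative only.

```python
def candidate_splice_sites(seq, start=0, end=None):
    """Return (donors, acceptors): 0-based positions of GT and AG
    dinucleotides that begin inside seq[start:end]."""
    seq = seq.upper()
    end = len(seq) if end is None else end
    donors, acceptors = [], []
    for i in range(start, min(end, len(seq)) - 1):
        pair = seq[i:i + 2]
        if pair == "GT":
            donors.append(i)      # possible 5' (donor) splice site
        elif pair == "AG":
            acceptors.append(i)   # possible 3' (acceptor) splice site
    return donors, acceptors


# First line of the mouse chromosome 4 fragment given in Problem 3
fragment = "GGTACCATGGAGCTGGCTTCAAAACCGGCCTCATCAGAAAGAAGAAAGTGGTGAGGGTATAGAGAAGCCC"

all_donors, all_acceptors = candidate_splice_sites(fragment)
# Narrowing the region (here an arbitrary 20-base window) shrinks the candidate list
win_donors, win_acceptors = candidate_splice_sites(fragment, start=40, end=60)
print(len(all_donors), len(all_acceptors))
print(len(win_donors), len(win_acceptors))
```

Even a 70-base stretch yields several candidates of each kind, which is why restricting the search region matters so much.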
2. CONTENT STATISTICS
Codon bias
Every species has a bias in its choice of codons. Knowing the bias can help identify genes from sequence.
Coding regions also have asymmetries & periodicities that can help distinguish them from noncoding regions. Long exons are easier to identify than short ones.
Codon usage database at Kazusa DNA Research Institute
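How codon bias can point to the coding frame is sketched below. This is a minimal sketch, assuming a codon frequency table estimated from known coding sequence and a uniform 1/64 background; the training sequence is invented toy data, and the function names and the `floor` value for unseen codons are illustrative choices rather than those of any published tool.

```python
import math
from collections import Counter


def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def frame_scores(seq, freqs, floor=1e-3):
    """Mean log-odds per codon for the three forward frames,
    relative to a uniform background of 1/64 per codon."""
    seq = seq.upper()
    scores = []
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        if not codons:
            scores.append(float("-inf"))
            continue
        log_odds = sum(math.log(freqs.get(c, floor) * 64) for c in codons)
        scores.append(log_odds / len(codons))
    return scores


# Toy "known gene" used to estimate the bias (invented for illustration)
freqs = codon_frequencies("ATGGCTGCTGCTAAAGCTTAA")
scores = frame_scores("GCTGCTGCTGCT", freqs)
best_frame = scores.index(max(scores))
print(best_frame)
```

In practice the table would come from many known genes of the species in question, for example from a codon usage database.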
3. SIMILARITY TO KNOWN GENES
Gene prediction accuracy has improved in the past few years owing to the enormous increase in the number of known coding sequences available as examples.
Tips for using analysis and prediction programs with a genomic sequence:
1. Remove or mask repetitive elements (Alus, LINEs etc.)
2. Perform a database search on translated DNA (BlastX or TFasta)
3. Perform an ORF gene-finding search (Grail, Genie, GenScan etc.)
4. Identify signal sites, e.g. splice sites
5. Translate putative ORFs and do functional analysis (Blocks, Motifs, etc.)
Always use more than one program to analyze your sequence.
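The simplest form of the ORF search in step 3 can be sketched as a six-frame scan for ATG...stop stretches. This is a minimal sketch, assuming the standard stop codons and ignoring introns entirely, so it is far cruder than the gene finders named above; `find_orfs` and `min_codons` are illustrative names.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end) for each ATG...stop ORF.
    Coordinates are 0-based, end-exclusive, on the strand that was scanned;
    min_codons counts the codons before the stop codon."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                       # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                    # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATTTGGGTAGCC", min_codons=3))
# [('+', 2, 2, 17)]
```

Running such a scan on the same sequence with several thresholds, and comparing it with the output of dedicated programs, is one way to follow the advice to use more than one tool.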
Review: NHGRI Webvideo course
REPETITIVE ELEMENTS
Genomes contain a large proportion of repetitive sequence, including:
a) Simple repeats
b) Transposons of several categories
Available databases for repeat identification and masking:
CENSOR at the Genetic Information Research Institute. Sublibraries for C. elegans, D. melanogaster etc.
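Masking of simple repeats can be sketched with a regular expression. This is a minimal sketch, assuming a "simple repeat" means a unit of 1 to 6 bases present in at least five tandem copies; both thresholds are arbitrary choices for illustration. Transposon-derived repeats need a library-based tool such as CENSOR and are not handled here.

```python
import re

# A unit of 1-6 bases followed by at least four more copies of itself
SIMPLE_REPEAT = re.compile(r"([ACGT]{1,6}?)\1{4,}")


def soft_mask_simple_repeats(seq):
    """Return the sequence with simple tandem repeats in lower case
    (soft-masking), leaving everything else in upper case."""
    return SIMPLE_REPEAT.sub(lambda m: m.group(0).lower(), seq.upper())


print(soft_mask_simple_repeats("ACGT" + "CA" * 6 + "GGATCC"))
# ACGTcacacacacacaGGATCC
```

Soft-masking keeps the original bases visible; replacing the matched span with N characters instead would hard-mask it.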
ORF FINDING
The existence of ORFs in uncharacterized sequences is addressed by several types of method:
1. Searching for potential polyA or splice sites by pattern-identification models e.g. HMMs
BCM Search Launcher at Baylor College of Medicine and Genefinder
2. Searching for exons using methods based on HMMs, codon usage frequencies
GENIE and FGENESH - HMM based Human Gene structure prediction (multiple genes, both chains), and other tools at BCM Search Launcher
3. Complete Gene searching using probability tables
GENSCAN at MIT
4. Nomi Harris's list of gene annotation tools, including GFF format
The goal of the Gene Ontology Consortium is to produce a dynamic controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.
7. SLAM webserver for comparative gene finding and alignment
Provides annotations, alignments and conserved noncoding sequences for pairs of homologous sequences. Also links to whole-genome annotations of the human, mouse & rat genomes.
TRANSCRIPTION START SITE PREDICTION
Computational prediction of transcription start sites is difficult in metazoans, since signals such as TATA and CAAT boxes and Inr sequences are not consistently present. This is especially true of vertebrates.
30% or more of human genes are predicted to have more than one transcript (many have multiple). These genes may therefore have multiple transcription start sites and promoters.
Databases
DBTSS - start sites for man, mouse & malaria parasite.
PROMOTER PREDICTION
FIE2 and Dragon Gene Start Finder: Molecular bioinformatics group, Singapore. Also identifies 5' gene start.
Gibbs Recursive Sampler: Transcription factor binding site determination.
MATCH™: A tool for searching transcription factor binding sites.
MatInspector - Prediction of transcription factor binding sites.
PPNN - Promoter Prediction by Neural Network at Berkeley Drosophila Genome Project.
PROMO - TFBS prediction in single and related groups of sequences at ALGGEN server.
Promoter at the Technical University of Denmark.
Site Seer: Visualization and analysis of transcription factor binding sites.
Target Explorer: Identification of new target genes for a specified set of TFs.
YMF: Program for discovery of novel TFBS by statistical overrepresentation.
Transcription factor binding site prediction using orthologous sequences:
TFBS are predicted at equivalent positions in pairs of orthologous sequences, which reduces the number of false positive predictions.
PromH: Azerbaijan National Academy of Sciences
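The idea of keeping only TFBS predicted at equivalent positions in two orthologous sequences can be sketched as follows. This is a minimal sketch, assuming the two sequences are already aligned (gaps written as "-") and that a site is described by an IUPAC consensus string; the toy alignment and the TATA-like pattern are invented for illustration, and none of the tools listed above is claimed to work exactly this way.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "N": "[ACGT]"}


def site_columns(aligned, consensus):
    """Alignment columns at which the consensus matches the ungapped sequence."""
    # cols[i] is the alignment column of the i-th ungapped base
    cols = [col for col, ch in enumerate(aligned) if ch != "-"]
    ungapped = aligned.replace("-", "").upper()
    # A lookahead reports overlapping matches as well
    pattern = re.compile("(?=" + "".join(IUPAC[c] for c in consensus.upper()) + ")")
    return {cols[m.start()] for m in pattern.finditer(ungapped)}


def conserved_sites(aligned_a, aligned_b, consensus):
    """Columns where both orthologous sequences carry a predicted site."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return sorted(site_columns(aligned_a, consensus)
                  & site_columns(aligned_b, consensus))


seq_a = "GGTATAAAGC--CTATATAGG"   # toy alignment row 1
seq_b = "GGTATATAGCTTC--GCGCGG"   # toy alignment row 2
print(sorted(site_columns(seq_a, "TATAWA")))      # [2, 13]
print(conserved_sites(seq_a, seq_b, "TATAWA"))    # [2]
```

The second hit in the first sequence has no counterpart at the same column of the second, so it is discarded as a likely false positive.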
PROBLEM 1
Carry out a TFBS analysis using ConSite, using the sequences provided on the home page. Compare the results with those obtained using a single sequence and the Ahr-Arnt transcription factor as query.
PROBLEM 2
Using TRAFAC, find the predicted TFBS in the promoter regions of the human and mouse Uteroglobin genes (symbol SCGB1A1).
PROBLEM 3
Below is a fragment of mouse chromosome 4. Compare the abilities of gene-finding tools to predict the presence of a gene in the fragment.
GGTACCATGGAGCTGGCTTCAAAACCGGCCTCATCAGAAAGAAGAAAGTGGTGAGGGTATAGAGAAGCCC
ACTGGGCTCTAACTACCCAGCTAATGCCACTATCCAGAGGACAACGCAAGCCCTCCCTGGAACTGACAGG
ACACGGATTTTTTTTTTCCCAGACTTTTGCAGTTCTCAGAGCTAAAGCCAAAGAAAGCAGAGGGTGAAGA
CTCTGAGTGGGGGGAAGAAGGAAGCCAAGGTCATGTTTTGCCAGAAAAAAAAGGGGGCAGAGGGGGGTAA
AGGGGAAGACCCAGAACGGGCCATCGAGGGGGAATTTTGGGACAGAAAGTGAAAGCTCAAGGCATAAATT
GGAGCTGGTGGCAGCCTCAGCAGGCTCAAGCGGGCTCACTGGGGCACACACAAGCTCACTTGCACTTGGC
TCCTGAACCCCTTCACGCCACAGGAGGACATCTGAGCGATTCCAGTCACTCCCCTGTGAACCCGTCGGAG
CCTCGGCCTCTCAGATTCTTGCGCACACAGTCTCTCAGGCTCACTCACTCTCTGTGGCCTGCCTCAGATT
ATCTGCCATGGCCCCCCGTGTGACCCCACTCCTGGCCTTCAGCCTGCTGGTTCTCTGGACCTTCCCAGGT
AAGGCTTGGGAGATGACTACGGTAGGGGCCACAGTGACCTGTGGACAATATGTAGGTCTTAGACTATTAC
AATGGTCATGAAAGACCCTCCGAGTTCAGAGTCAGTGGATCCAAGGGCTCCTTATGCCTTCTGTGTGTGC
CTGTCAATATGAATGTGCCTGTATCCAGGATTCCAATGTGGCTCAGCTCTGTCATATCCTGGGGCTTATA
GCTGTGGGCTCCATGTCTTGTTACCTCAGCTTCCAGGCAACAGAAGGAAGGAACAGTGTATCTTCCTTGA
GGGACCTAAGACTATAATTAAGTGGGTGGAGTGTTTGCCTTGCCTTCGTGAAACCCTGGTTTCAATTGCA
CCACTGGGGTTGTGGGGGTGAGAGCTTTGCTGTGTATTTGCTTCCATCTGAGAGAACAAGAAACACTGAA
GACATGCTTAATAACTTGTTCCCCTTTTGGGGGTTGCATTTTTAAGGGCCTTGGTTTGGTTTGGTTTGGT
TAGTTTGGTTTAGTTTGGCTTTTGAGAAAGCATCTGTCCATGTTGCCCAGGCTAGCCTCCATCTCCTGGG
CTCACGTGATCCTACATCTCAGCTTCCCTCCTGGTAGCTGGGACTCCGGGTATGGCAGCCAGCTGCTGTA
GTCATTCAGAAGCCTGAAGGAGTCTCAGCCGAGGCAAGCCTCCCCTGTCCCAGCATTCTTCTGTTAACTC
TCCATTTGCTAAAACACTGTAAAAGCTGAGAAAACCTGGGAAAGGGTGTATGACCCCAGCATTTGGGATA
CGTTGAAAAATGGTTATTTACTAAATGGCTTAGAAGACTACAATATGTTCAGCAGCCAACTGAGCCACAG
TGGCACTCAAGGCTGAATATCAGGTCCAGCAATCACACACAGAGAGATGCTATCAAAGCCACTTAGGGAC
TGTGGCCTCCCTTCCTCATGGTCTCAGGGTCTTCTCCCCTTCTTTCTCCTATCTCAGCCCCAACTCTGGG
GGGTGCTAATGATGCGGAAGACTGCTGCCTGTCTGTGACCCAGCGCCCCATCCCTGGGAACATCGTGAAA
GCCTTCCGCTACCTTCTTAATGAAGATGGCTGCAGGGTGCCTGCTGTTGTGTGAGTTGCTTGTGGAAAGA
ATATCTGGCCCCATCCCCCCATGAGCCCTTGCTGATGCCATCATGGCTTTAACCCTGAACTCATGGCAGA
GCCCAGTTTTCATGGAAGCCTATGAAACAGGTCCCTACAAATAGTCTCCAAGCCTCTGCTCCTTACTCTA
GAGCCTTCTAGGAAACTGGGTTCCAGGGCTTTTATTCTCTCCAACCTCTGGCTACAGGTTCACCACACTA
AGGGGCTATCAGCTCTGTGCACCTCCAGACCAGCCCTGGGTGGATCGCATCATCCGAAGACTGAAGAAGT
CTTCTGCCAAGGCAAGCCTGACCCTCCTCAGTCCTGCCTCCGCCCTCCCAACACCCCGAGATTCCAGCTC
ATGACCCTGCCTCTCCTCCCTCCCCTTAGAACAAAGGCAACAGCACCAGAAGGAGCCCTGTGTCTTGAGT
AAAGAGATGTGAATCACTCTGGCCCAGGAAACCAAGGACCAGAAGAGAGGACCAGGCCTCCTGATGCTCT
GTCCCAGACCTAACCCAGCCAAGTCTGTGCCTAGAGAGTCGATGTGAGTGTGGACAAGAGAGTTTGTGTG
GCTAGAACACCATCTCTCTGTGGCTAGACTGCAGAGCTTCCAATAAAGCCGCTTGGTACC
Can you identify the gene using the Mouse Genome Browser at UCSC?
2. Predict the promoter region in the above sequence, and possible transcription factor binding sites.
PROBLEMS OF GENOME ANNOTATION
1. Inability of ab initio methods to identify 5'- and 3'-untranslated exons
2. Tendency to artificially join or split genes (especially when genes are tandemly duplicated)
3. Inability of ab initio methods to cope with overlapping genes (surprisingly common in Drosophila)
4. Problems using EST data - ESTs may be incorrect, be contaminants, be primed off internal poly(A) sequences, or reflect abnormal or intermediate splice forms of a pre-mRNA
GASP experiment - compared the performance of a number of analytical tools on the 2.9 Mb Adh region of D. melanogaster
Described in Ashburner, Genome Research (2000), 10: 391-3 & Stormo, Genome Research (2000), 10: 394-7
OTHER AVAILABLE SOFTWARE
MAGPIE (prokaryotes), EGRET (eukaryotes) and other resources at the Rockefeller University
Glimmer (bacterial and archaeal genomes)
Uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA. The IMM approach uses a combination of Markov models from first through eighth order, weighting each model according to its predictive power. Glimmer 1.0 and 2.0 use 3-periodic nonhomogeneous Markov models in their IMMs.
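The interpolation idea can be sketched in a few lines. This is a minimal sketch, assuming a simple count-based weight (the more often a context was seen in training, the more its own estimate is trusted); it is a stand-in for illustration and not the weighting scheme Glimmer itself uses, and it omits the 3-periodic structure. The class name and the `pseudo_weight` constant are hypothetical.

```python
import math
from collections import defaultdict

BASES = "ACGT"


class InterpolatedMarkovModel:
    def __init__(self, max_order=8, pseudo_weight=4.0):
        self.max_order = max_order
        self.pseudo_weight = pseudo_weight  # hypothetical smoothing constant
        # counts[context][base]; the empty string is the order-0 context
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        for seq in sequences:
            seq = seq.upper()
            for i, base in enumerate(seq):
                if base not in BASES:
                    continue
                for k in range(min(i, self.max_order) + 1):
                    self.counts[seq[i - k:i]][base] += 1

    def prob(self, context, base):
        """P(base | context), blended from order 0 up to the longest seen context."""
        context = context.upper()
        if len(context) > self.max_order:
            context = context[len(context) - self.max_order:]
        zero = self.counts[""]
        # Order 0 with add-one smoothing so that no base gets probability zero
        p = (zero.get(base, 0) + 1) / (sum(zero.values()) + len(BASES))
        for k in range(1, len(context) + 1):
            seen = self.counts.get(context[len(context) - k:])
            total = sum(seen.values()) if seen else 0
            if total == 0:
                break  # longer contexts cannot have been seen either
            weight = total / (total + self.pseudo_weight)
            p = weight * seen.get(base, 0) / total + (1 - weight) * p
        return p

    def log_likelihood(self, seq):
        seq = seq.upper()
        return sum(math.log(self.prob(seq[max(0, i - self.max_order):i], b))
                   for i, b in enumerate(seq))


model = InterpolatedMarkovModel(max_order=3)
model.train(["ATGGCTGCTGCTGCTTAA"])          # toy training sequence
print(model.prob("GC", "T") > model.prob("GC", "A"))
```

A gene finder built on this idea would train one such model on coding sequence and another on noncoding sequence, then compare the two log-likelihoods for each candidate region.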
VEIL - Hidden Markov Model for vertebrate genes
MORGAN uses decision tree technology combined with dynamic programming for vertebrate genes
MORGAN is an integrated system for finding genes in vertebrate DNA sequences. MORGAN uses a variety of techniques to accomplish this task, the most distinctive of which is a decision tree classifier. The decision tree system is combined with new methods for identifying start codons, donor sites, and acceptor sites, and these are brought together in a frame-sensitive dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and introns). The optimal segmentation is dependent on a separate scoring function that takes a subsequence and assigns to it a score reflecting the probability that the sequence is an exon. The scoring functions in MORGAN are sets of decision trees that are combined to give a probability estimate. Experimental results on a database of 570 vertebrate DNA sequences show that MORGAN has excellent performance by many different measures. On a separate test set, it achieves an overall accuracy of 95%, with a correlation coefficient of 0.78 and a sensitivity and specificity for coding bases of 83% and 79%. In addition, MORGAN identifies 58% of coding exons exactly; i.e., both the beginning and end of the coding regions are predicted correctly.
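The base-level measures quoted above can be computed from a predicted and a true labelling of each base. This is a minimal sketch, assuming the convention common in gene-finding evaluations in which specificity is TP / (TP + FP), i.e. the fraction of predicted coding bases that are truly coding; whether the MORGAN paper uses exactly these definitions should be checked against the reference. The label strings ('C' coding, 'N' noncoding) and function names are illustrative.

```python
import math


def coding_base_metrics(predicted, actual):
    """Per-base accuracy measures from two equal-length label strings
    ('C' = coding, 'N' = noncoding)."""
    if len(predicted) != len(actual):
        raise ValueError("label strings must have equal length")
    tp = sum(p == "C" and a == "C" for p, a in zip(predicted, actual))
    fp = sum(p == "C" and a == "N" for p, a in zip(predicted, actual))
    fn = sum(p == "N" and a == "C" for p, a in zip(predicted, actual))
    tn = sum(p == "N" and a == "N" for p, a in zip(predicted, actual))
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(actual),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tp / (tp + fp) if tp + fp else float("nan"),
        "correlation": (tp * tn - fp * fn) / denom if denom else float("nan"),
    }


def exact_exon_fraction(predicted_exons, true_exons):
    """Fraction of true exons whose (start, end) are both predicted exactly."""
    true_set = set(true_exons)
    return len(true_set & set(predicted_exons)) / len(true_set)


m = coding_base_metrics("NCCCCNNNNN", "NNCCCCNNNN")
print(m["accuracy"], m["sensitivity"], m["specificity"])   # 0.8 0.75 0.75
print(exact_exon_fraction([(10, 50), (80, 120)], [(10, 50), (82, 120)]))  # 0.5
```

Overall accuracy can look high simply because most bases are noncoding, which is why the correlation coefficient and the exon-level measure are reported alongside it.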
References
The paper describing MORGAN is S. Salzberg, A. Delcher, K. Fasman, and J. Henderson. A Decision Tree System for Finding Genes in DNA. Journal of Computational Biology 5:4 (1998), 667-680. A more tutorial introduction is S. Salzberg. Decision Trees and Markov Chains for Gene Finding. In S. Salzberg, D. Searls, and S. Kasif (eds.), Computational Methods in Molecular Biology, pp. 187-203. Amsterdam: Elsevier Science B.V., 1998.
GeneMine - sequence analysis & visualization program
Genotator - Workbench for sequence annotation & browsing
ETOPE - Evolutionary test of predicted exons
ESEfinder: Identification of exonic splicing enhancers. Point mutations frequently cause genetic diseases by disrupting the correct pattern of pre-mRNA splicing. This may occur by inactivation of ESEs, resulting in exon skipping.
OTHER RESOURCES
ENCODE (ENCyclopedia Of DNA Elements) project at NHGRI
Bioinformatics & Computational Genomics Course (1998) at the Weizmann Institute
MAVID - Multiple alignment program for large genomic sequences
PipMaker - Identification of conserved regions in aligned sequences. Genome comparison tools
VISTA - Visual Tools for Alignment. Visualizes long sequence alignments of DNA from two or more species with annotation information.
Codon usage database at Kazusa DNA Research Institute