
MOTIF IDENTIFICATION

Motif Identification in biological sequences is valuable for:

Identifying the position and identity of residues that are more or less conserved in families

Determining variable insertion and deletion probabilities

Improving the sensitivity and speed of database searching techniques using consensus models

Methods for Motif Identification:

1. eMOTIF at Stanford University

A motif is expressed as a pattern specifying the amino acids that can occur at each position in a sequence.

The possibilities are:

'.', which allows any amino acid,
a specific amino acid, or
a group of amino acids that share some physico-chemical property, e.g. charge, hydrophobicity, size

eMOTIF ranks the motifs that it finds by both their specificity (expected false positives) and their coverage (the number of supplied sequences they match, i.e. true positives). The twenty highest-scoring motifs are returned and can be used to search the entire SWISS-PROT database.
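A pattern of this kind maps naturally onto a regular expression. The sketch below is only an illustration of the idea: the group names and their members in `GROUPS`, and the helper names `pattern_to_regex` and `coverage`, are assumptions made for this example and are not eMOTIF's own group table or syntax.

```python
import re

# Illustrative residue groups (an assumption for this sketch,
# not eMOTIF's actual substitution groups).
GROUPS = {
    "positive": "KRH",
    "negative": "DE",
    "small": "AGS",
    "aliphatic": "ILV",
}

def pattern_to_regex(positions):
    """Translate a list of position specs into a compiled regular expression.

    Each spec is '.', a single amino-acid letter, or a group name from GROUPS.
    """
    parts = []
    for spec in positions:
        if spec == ".":
            parts.append(".")
        elif spec in GROUPS:
            parts.append("[" + GROUPS[spec] + "]")
        elif len(spec) == 1 and spec.isalpha():
            parts.append(spec.upper())
        else:
            raise ValueError(f"unknown position spec: {spec!r}")
    return re.compile("".join(parts))

def coverage(regex, sequences):
    """Count how many of the supplied sequences contain the motif at least once."""
    return sum(1 for seq in sequences if regex.search(seq))

# Toy motif and toy sequences, made up for illustration.
motif = pattern_to_regex(["C", ".", ".", "aliphatic", "positive"])
family = ["MACGGLKT", "TTCPAVRQ", "MMMMMMMM"]
print(motif.pattern)
print(coverage(motif, family))
```

Counting matches in the supplied family corresponds to the coverage side of the ranking; estimating specificity would additionally require a background database, which this sketch does not attempt.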

Resource: eMOTIF and related resources.

2. SPLASH - Structural Pattern Localization Analysis by Sequential Histogram (requires installation).

3. TEIRESIAS - identifies and highlights motifs common to several nucleotide or protein sequences.

4. Cluster-Buster: identifies clusters of user-specified motifs.

5. SIRW - combines sequence motif searches with keyword searches.

6. MinimotifMiner: analyzes protein queries for the presence of short functional motifs that, in at least one protein, have been demonstrated to be involved in posttranslational modifications, binding to other proteins, nucleic acids, or small molecules, or protein trafficking.

EXAMPLE:

Find the proteins with nuclear receptor activity that contain the motif LLxxL in the SwissProt database.
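The motif-matching part of this exercise can be sketched locally. The code below only scans sequences already in memory; it does not query SwissProt or filter by nuclear receptor activity. The function name `find_motif` and the toy sequences are made up for illustration.

```python
import re

def find_motif(sequence, motif="LLxxL"):
    """Return 1-based start positions of every (possibly overlapping) match.

    'x' in the motif stands for any residue.
    """
    body = motif.replace("x", ".")
    # A lookahead makes the search report overlapping occurrences too.
    regex = re.compile("(?=" + body + ")")
    return [m.start() + 1 for m in regex.finditer(sequence)]

# Toy sequences, made up for illustration only.
proteins = {
    "toy1": "MKLLAELLLAG",
    "toy2": "MSTAAGG",
}
hits = {name: find_motif(seq) for name, seq in proteins.items() if find_motif(seq)}
print(hits)
```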

HIDDEN MARKOV MODELS

Models related to profiles, which describe multiple alignments in terms of states and sets of probabilities.

Based on the concept of a Markov chain:

the next state (e.g. residue in a sequence) depends on the current one

in a Markov chain of order k, the next state depends on the last k states
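As a small illustration of the order-k idea, the sketch below estimates the probability of the next symbol given the previous k symbols by simple counting in a single sequence. The function name `transition_probs` and the toy sequence are assumptions for this example.

```python
from collections import Counter, defaultdict

def transition_probs(sequence, k=1):
    """Estimate P(next symbol | previous k symbols) by counting in one sequence."""
    counts = defaultdict(Counter)
    for i in range(k, len(sequence)):
        context = sequence[i - k:i]      # the last k symbols
        counts[context][sequence[i]] += 1
    return {
        context: {sym: n / sum(nexts.values()) for sym, n in nexts.items()}
        for context, nexts in counts.items()
    }

probs = transition_probs("ACGACGACT", k=2)
print(probs["AC"])   # distribution over the symbol that follows "AC"
```

With k=1 this reduces to the ordinary first-order chain, where only the current symbol matters.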

The states refer to positions in the alignment and are of the following types:

1. Main (or match) states: contain the probabilities for each residue (e.g. 0.5A, 0.3C, 0.1G, 0.1T)

2. Deletion states: contain no residue (i.e. refer to a deletion)

3. Insert states: contain inserted residues after the corresponding main states (i.e. refer to insertions).

4. Start and End states

The probabilities are:

1. The emission probability - the probability that residues A, B, etc. will occur at a given position

2. The transition probability - the probability that one state (see above) will change into another.

Insert states have an internal loop to allow for insertion of multiple residues.

In constructing HMMs, emission probabilities are generally not allowed to be zero, so that residues absent from a given position in the sequence set under study can still be accommodated when they occur there in larger sequence sets. Probabilities are estimated using Dirichlet priors, i.e. pseudocounts derived from a Dirichlet distribution are added to the observed counts. An analogy is estimating the probability that e.g. a 5 will be thrown next with a pair of weighted dice, even though a 5 has not been thrown in the last four throws.
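The effect of a prior on the emission probabilities can be sketched with the simplest case, a symmetric Dirichlet prior, where the same pseudocount `alpha` is added for every symbol and each probability becomes (count + alpha) / (N + alpha * K) for N observations and K symbols. The value `alpha=1.0` and the function name `emission_probs` are illustrative assumptions; real packages typically use mixtures of Dirichlet priors fitted to known alignments.

```python
def emission_probs(column, alphabet="ACGT", alpha=1.0):
    """Posterior-mean emission probabilities for one alignment column.

    alpha is the pseudocount added for every symbol (a symmetric Dirichlet
    prior); alpha=1.0 is an arbitrary illustrative choice.
    """
    observed = [c for c in column if c in alphabet]   # skip gaps such as '-'
    total = len(observed) + alpha * len(alphabet)
    return {a: (observed.count(a) + alpha) / total for a in alphabet}

# G and T never occur in this column, yet their probabilities stay above zero.
probs = emission_probs("AAAC")
print(probs)
```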

The probability structure for an aligned set of sequences can then be used to identify other sequences (e.g. in a database) with the same sets of probabilities, similar primary structure and hence similar biological activity.

In hidden Markov models the states are hidden (just sets of probabilities), but can generate visible symbols.

Once an HMM has been constructed for a sequence family, other motifs or sequences which conform to that model may be identified by database searching.

Steps in use of HMMs to detect motifs in unknown sequences or remote homologies:

1. Create a multiple alignment from a family of related sequences

2. Calculate corresponding probabilities and construct an HMM (automatically)

3. Search a sequence database with the HMM (automatically)
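The three steps can be sketched in a heavily simplified form. The code below keeps only the match states: it builds per-column emission probabilities from a gap-free toy alignment and scores every ungapped window of each database sequence by log-odds against a uniform background. Insert and delete states and transition probabilities are omitted, so this is a position-specific profile rather than a full HMM. All names, the toy alignment, and the parameter values are assumptions for this example.

```python
import math

ALPHABET = "ACGT"

def build_profile(alignment, alpha=1.0):
    """Step 2 (simplified): per-column emission probabilities with pseudocounts.

    Assumes a gap-free alignment of equal-length sequences.
    """
    profile = []
    for column in zip(*alignment):
        total = len(column) + alpha * len(ALPHABET)
        profile.append({a: (column.count(a) + alpha) / total for a in ALPHABET})
    return profile

def best_hit(profile, sequence, background=0.25):
    """Step 3 (simplified): best log-odds score over all ungapped windows."""
    width = len(profile)
    best = (-math.inf, None)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(math.log2(col[res] / background)
                    for col, res in zip(profile, window))
        best = max(best, (score, start))
    return best

# Step 1: a toy multiple alignment (made up for illustration).
alignment = ["TATA", "TATA", "TACA", "TATT"]
profile = build_profile(alignment)
database = {"seq1": "GGCTATAGG", "seq2": "GGGGCCCCG"}
for name, seq in database.items():
    score, start = best_hit(profile, seq)
    print(name, round(score, 2), start)
```

A sequence containing a window that resembles the alignment scores well above one that does not, which is the basis for ranking database hits.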

The simplest form of HMM is the linear type, but more complex models can also be constructed, e.g.

parallel HMMs - several linear HMMs connected to the same start and end states

- enable several protein families to be modelled simultaneously

loop HMMs - models where a state can also be a model itself. Similar to:

wheel HMMs, which describe motifs repeating periodically in phase, e.g. DNA bends

have been used to predict:

a region of high bendability immediately downstream of the transcription start site in introns

regions of high bendability phased at 10.5 bp in ds DNA - predicts bends in double helix

OTHER USES OF HMMS:

1. Ab initio gene finding

2. Radiation hybrid mapping

3. Secondary and tertiary protein structure prediction

4. Detection of unequal evolutionary rates in molecular sequences

EXAMPLES:

PDGF protein family: see Hidden Models in Biopolymers. Science (1998) 282: pp 1436-7

GC-rich DNA sequence tracts: BCG lecture at Weizmann Institute
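To make the GC-rich tract example concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is not the model from the lecture cited above: the states, the start, transition and emission probabilities, and the test sequence are all made-up illustrative values.

```python
import math

STATES = ("AT_rich", "GC_rich")
# All probabilities below are made-up illustrative values, not fitted parameters.
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
    "GC_rich": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},
}

def viterbi(sequence):
    """Return the most probable hidden-state path for a DNA sequence (log space)."""
    if not sequence:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES}
    backpointers = []
    for symbol in sequence[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][symbol]))
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

path = viterbi("ATATATGCGCGCGCATATAT")
print("".join("G" if s == "GC_rich" else "a" for s in path))
```

The observed bases are the visible symbols; the AT-rich and GC-rich labels are the hidden states that the decoding recovers.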

HMM TOOL PACKAGE

Builds an HMM of sequences entered and searches the profile against the

EXAMPLE: provided at site above

PACKAGES REQUIRING DOWNLOAD OR SUBSCRIPTION

HmmBuild and HmmSearch in Celera Discovery System

HMMer at Washington University. Also enables searches of the Pfam database of protein motifs and HMMs

SAM at UC Santa Cruz

HMMPro (commercial) - nifty graphics & additional tools

OTHER RELATED RESOURCES

Computational Biochemistry Course at Stanford Lectures:

Pattern Matches and Consensus Sequences

Blocks, Profiles & Hidden Markov Models

Protein Profile vs Protein or DNA Database; Sequence vs Profile database searches on Decypher (link at introduction to above lecture)

3. NEURAL NETWORKS

Models consisting of networks with interconnected units evolving in time. Most current applications involve a layered feed-forward or Multilayer Perceptron (MLP) architecture with hidden and visible layers. Visible layers are in contact with the outside world, and include the input and output layers.

Output layer - expresses structural or functional features

Hidden layer(s)

Input layer - where e.g. sequence data is encoded

As with HMMs, models are initially generated and trained on an existing data set, then used to predict sequence and structure information for query sequences or structures.
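The layer structure can be sketched with a tiny, untrained network. The code below one-hot encodes a nucleotide window in the input layer and passes it through one hidden layer to a single output unit. The window length, layer sizes, random weights, and function names are all assumptions for illustration; a real predictor would learn its weights from labelled training data.

```python
import math
import random

ALPHABET = "ACGT"

def one_hot(window):
    """Input layer: encode each nucleotide as four 0/1 units."""
    return [1.0 if base == a else 0.0 for base in window for a in ALPHABET]

def dense(inputs, weights, biases):
    """One fully connected layer with a logistic (sigmoid) activation."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
        for row, b in zip(weights, biases)
    ]

def random_layer(n_in, n_out, rng):
    """Untrained weights; a real network would learn these from labelled data."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
window = "ACGTAC"
x = one_hot(window)                                # 6 bases x 4 units = 24 inputs
hidden = dense(x, *random_layer(len(x), 3, rng))   # hidden layer, 3 units
output = dense(hidden, *random_layer(3, 1, rng))   # output layer, 1 unit
print(len(x), len(hidden), output[0])
```

The single output value lies between 0 and 1 and could be read as, for example, a score for the window being part of a coding region once the network has been trained.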

Applications in pattern recognition include motifs, alpha helices, splice sites, exons.

Example: GRAIL method for prediction of coding sequences:

(see BCG course for more information)

4. GIBBS SAMPLING METHOD

Example: Identification of sequences mediating coordinate transcriptional regulation in S. cerevisiae

Motifs in 5' upstream regions identified with AlignACE in the Church lab at Harvard Med. School

(Publication available online).

5. OTHER METHODS

BLOCKS

MEME

MOTIF AND FINGERPRINT DATABASES

BLOCKS - conserved sequence blocks in protein families

InterPro Integrated Resource of Protein Domains and Functional Sites

Pfam - Protein families database of alignments and HMMs

PRINTS - database of protein fingerprints (group of conserved motifs)

ProDom - protein domain database

PROSITE - database of protein families and domains

SMART - domain identification

EXAMPLE:

Find the domains and homologs of the protein:

MNSSSANITYASRKRRKPVQKTVKPIPAEGIKSNPSKRHRDRLNTELDRLASLLPFPQDVINKLDKLSVL
RLSVSYLRAKSFFDVALKSSPTERNGGQDNCRAANFREGLNLQEGEFLLQALNGFVLVVTTDALVFYASS
TIQDYLGFQQSDVIHQSVYELIHTEDRAEFQRQLHWALNPSQCTESGQGIEEATGLPQTVVCYNPDQIPP
ENSPLMERCFICRLRCLLDNSSGFLAMNFQGKLKYLHGQKKKGKDGSILPPQLALFAIATPLQPPSILEI
RTKNFIFRTKHKLDFTPIGCDAKGRIVLGYTEAELCTRGSGYQFIHAADMLYCAESHIRMIKTGESGMIV
FRLLTKNNRWTWVQSNARLLYKNGRPDYIIVTQRPLTDEEGTEHLRKRNTKLPFMFTTGEAVLYEATNPF
PAIMDPLPLRTKNGTSGKDSATTSTLSKDSLNPSSLLAAMMQQDESIYLYPASSTSSTAPFENNFFNESM
NECRNWQDNTAPMGNDTILKHEQIDQPQDVNSFAGGHPGLFQDSKNSDLYSIMKNLGIDFEDIRHMQNEK
FFRNDFSGEVDFRDIDLTDEILTYVQDSLSKSPFIPSDYQQQQSLALNSSCMVQEHLHLEQQQQHHQKQV
VVEPQQQLCQKMKHMQVNGMFENWNSNQFVPFNCPQQDPQQYNVFTDLHGISQEFPYKSEMDSMPYTQNF
ISCNQPVLPQHSKCTELDYPMGSFEPSPYPTTSSLEDFVTCLQLPENQKHGLNPQSAIITPQTCYAGAVS
MYQCQPEPQHTHVGQMQYNPVLPGQQAFLNKFQNGVLNETYPAELNNINNTQTTTHLQPLHHPSEARPFP
DLTSSGFL

ONLINE MULTIPLE ALIGNMENT ANALYSIS TOOLS

CINEMA - color interactive editor for multiple alignments

Protein scan against a Profile database at ProfileScan at ISREC

MUSCA - multiple alignments constrained by sequence motif identification

Tcoffee@igs - computation, evaluation and combination of multiple sequence alignments.

Examples:

1. For the set of sequences below (first put into FASTA multiple alignment format):

1. SHKQIYYSDKYDDEEFEYRHVMLPKDIAKLVPKTHLMSESEWRNLGVQQSQGWVHYMIHEPEPHILLFRRPLP

2. MSKDIYYSDKYYDEQFEYRHVVLPKELVKMVPKTHLMTEAEWRSIGVQQSRGWIHYMIHKPEPHILLFRRPKT

3. LQCKILYSDKYYDDMFEYRHVILPKDLARLVPTSRLMSEMEWRQLGVQQSQGWVHYMIHKPEPHVLLFKRPRT

4. PRDTIQYSEKYYDDKFEYRHVILPPDVAKEIPKNRLLSEGEWRGLGVQQSQGWVHYALHRPEPHILLFRRE

5. GQIQYSEKYFDDTFEYRHVVLPPEVAKLLPKNRLLSENEWRAIGVQQSRGWVHYAVHRPEPHIMLFRRPLN

6. GNNDFYYSNKYEDDEFEYRHVHVTKDVSKLIPKNRLMSETEWRSLGIQQSPGWMHYMIHGPERHVLLFRRPLA

7. FIDQIHYSPRYADDEYEYRHVMLPKAMLKAIPTDYFNPETGTLRILQEEEWRGLGITQSLGWEMYEVHVPEPHILLFKREKD

i) Make blocks of the common motif(s) using Block Maker at the BLOCKS server

ii) Find the Block corresponding to yours in the BLOCKS database using LAMA

iii) Generate a phylogenetic tree of the Block sequences from the Block Maker result page

(also at EBI or Pfam)

iv) Search the nr protein database with the Block using MAST

v) Generate a Sequence Logo using the Blocks Multiple Alignment Processor

vi) Identify motifs using MEME and compare to those obtained with Block Maker

What family of proteins do the sequences above belong to?

2. Identify the conserved domain in sequence 1 above using protein-protein BLAST, followed by a CD Search.

3. Using sequence 1 above as query, find other homologs using an HMM-based method at PFAM Protein Search

4. Find information on the domain structure, sequence and phylogenetic relationships of the Transforming Growth Factor beta protein family using PFAM and links to other related resources. Check the results against those found in the previous section on Phylogenetic Analysis.

5. Identify the motif in sequence 7 above using eMOTIF-SEARCH, and other sequences that contain it.
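Example 1 asks for the sequences to be put into FASTA format first. A minimal sketch of that step follows; the header prefix `seq`, the 60-character line width, and the function name `to_fasta` are illustrative choices, and only the first two sequences are shown.

```python
def to_fasta(sequences, prefix="seq", width=60):
    """Format sequences as FASTA text with generated headers.

    prefix and width are illustrative choices; headers become >seq1, >seq2, ...
    """
    lines = []
    for number, sequence in enumerate(sequences, start=1):
        lines.append(f">{prefix}{number}")
        # Wrap each sequence at a fixed line width.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"

# First two sequences from Example 1.
example = [
    "SHKQIYYSDKYDDEEFEYRHVMLPKDIAKLVPKTHLMSESEWRNLGVQQSQGWVHYMIHEPEPHILLFRRPLP",
    "MSKDIYYSDKYYDEQFEYRHVVLPKELVKMVPKTHLMTEAEWRSIGVQQSRGWIHYMIHKPEPHILLFRRPKT",
]
print(to_fasta(example), end="")
```

The resulting text can be pasted into the input box of most of the servers listed above.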