ALGORITHMS FOR HOMOLOGY SEARCHING


1.  Needleman - Wunsch  and derived methods.

The most accurate and rigorous sequence comparison methods.

A.  Needleman Wunsch

Method is based on distance and uses Dynamic Programming.

The method produces a global alignment.

B. Smith-Waterman:

The Smith - Waterman modification of the Needleman-Wunsch method determines local alignments :  

Uses similarity scores.

For prefixes ending at si and tj  in sequences sm and tn of length m and n repectively

Similarity S(si, tj ) = 1     for identities
                                -1/3 for replacements (mismatches)

a variable gap penalty is used:

gap weights (wk )  =  1 + k/3 (k gaps total)

If Hij is the similarity of prefixes ending at si and tj  in sequences sm and tn of length m and n respectively
the algorithm selects the maximum value of:

Hi-1,j-1 + S(si, tj )    or

(Hi-k,,j) - wk)    or

(Hi,j-k) - wk)    or

0

Since the Smith-Waterman method determines local similarity , 0 is also included in the possible choices.  In this way regions of similarity rise above a sea of zeros.

PROBLEM 1:

Show that for the sequences CTG and CTA the optimal alignment has a similarity Hij  = 1.67.  What is the local similarity and alignment of the 2 sequences CTATAATC and CTGTATC?

SW is the most rigorous - therefore the slowest and most demanding on computer time

Web resources:

ALION at Stamford.  Needleman - Wunsch (global) and Smith-Waterman (local) alignment algorithms available


HEURISTIC METHODS

The computer time needed to determine the optimal alignment for 2 sequences using dynamic programming alone is proportional to the product of their sequence lengths. This time can be reduced by using a heuristic method, an approximate, usually multistep method not guaranteed to give exactly the correct answer.

1. FASTA:

Uses four steps to calculate sequence similarity:

STEP 1.  Identification of  identities shared between the 2 sequences using a lookup table

e.g. for: A=1, R=2, N=3, etc
and sequence NDAPL

                            aminoacid                 assigned number         sequence position
                                A                                         1                                 3
                                R                                         2                                 0
                                N                                         3                                 1
                                D                                         4                                 2
 
 

Regions shared by the 2 sequences with the highest density of identities (ktup = 1)
                                                                                     or pairs of identities (ktup = 2)  identified.

For protein searches a value of ktup = 2 is usual
        only regions with at least 2 adjacent identical residues compared

for more sensitive searches use ktup = 1;  5X slower, however, than for ktup = 2

For nucleotide searches, use value of ktup = 4 to 6 unless short - use ktup = 1

To find all regions of similarity between 2 sequences lookup table is used in comparison with diagonal method, using ktup matches and penalizing for intervening mismatches.
(diagonal method refers to diagonal derived from dot matrix plot, e.g. where one sequence is compared to another along x & y axes)

For protein sequences PAM250 matrix used to give maximum similarity values

STEP 2.  After 10 best local regions are found, rescore with scoring matrix

STEP 3.  See if regions of best score > CUTOFF can be joined to give alignment with gaps
Similarity score calculated minus penalty for gaps = 20 (usually)

STEP 4.  Construct an optimal alignment of the query and library sequence using the Needleman -Wunsch - Sellers (dynamic programming) algorithm.

Webserver available for pairwise alignment (LFASTA).

3.  BLAST

Heuristic for similarity assessment  - begins with identification of short regions of  homology scoring above a specified threshold value

Regions then joined and reexamined for cost.  Process repeated.

Gives an assessment of statistical significance, length adjusted.

Many versions available, e.g. at http://www.ncbi.nlm.nih.gov/BLAST/

For 2 sequences use BLAST 2 sequences option

ADVANCED OPTIONS: Gaps allowed, gap penalties adjustable.

Advantages: enables overall relationship and homology between whole sequences to be conceptualized

4. BLAT

Very fast alignment tool available for the human/mouse/rat genome browsers at UC Santa Cruz

PROBLEM 1.   Uracil-DNA glycosylase is an enzyme essential to all forms of life, from simple RNA viruses to humans, but homology between species can be very low (in the "twighlight zone",  <20%).  Compare the ability of different homology search algorithms  (Smith-Waterman at MPsrch, BLAST, FASTA, ) to find orthologs (proteins of the same function in different organisms) of Shope fibroma virus Uracil-DNA glycosylase:

>gi|418154|sp|P32941|UNG_SFVKA URACIL-DNA GLYCOSYLASE (UDG)
MRRVFLSHEPYVIEYHEDWENIITRLVDMYNEVAEWILKDDTSPTPDKFFKQLSVSLKDKRVCVCGIDPY
PRDATGVPFESHNFTKKTIKYIAETVSNITGVRYYKGYNLNNVEGVFPWNYYLSCKIGETKSHALHWKRI
SKLLLQHITKYVNVLYCLGKTDFANIRSILETPVTTVIGYHPAAREKQFEKDKGFEIVNVLLEINDKPSI
RWEQGFSY

Find the sequence of the human ortholog, and its degree of homology to the above protein.


HOMOLOGY SEARCHING AGAINST SEQUENCE DATABASES

1. SMITH - WATERMAN

MPsrch at EBI

2.  FASTA

Web resources:

FASTA and TFASTA at http://iubio.bio.indiana.edu:81/srsfasta/sfsectionsearch.html
(also species selectable)
                                  and http://www2.ebi.ac.uk/fasta3/

3. BLAST

Many versions available, e.g. at http://www.ncbi.nlm.nih.gov/BLAST/


DATABASE HOMOLOGY SEARCHING PROBLEMS:

1.  Examples in the Exercises compiled by Francisco M. de La.Vega in the Appendix to Chapters 1 and 2 of the VSNS course.

2.  Carry out a Blastp search using default parameters with the M.Musculus chondradherin protein, accession #NM_007689.  Explain the alignment obtained for the best hit.  How can you obtain a more unambiguous alignment result?

3.
A region of chromosome 5 of the yeast genome suspected to include a gene involved in gene amplification in Saccharomyces Cerevisiae was cloned and partially sequenced. The sequence is given below:

5' end:

TTCTCTGCCTTATTCTGCGGAAACGCTCCAAAAGATCAANATTAAATCACCTATTCCAGCANCATATTGGTAAT
AAAGGTCTTTTTATGGTGCAN

3'end:

TTTTTCAGATNGGATGGCANGACCAACTTCGGCAGGTGTGTTTCTAGCTGTAG
ATAATCTCCCCACGGTACTGTAGAGGCTTCCAGCCTTTTAAT

What is the probable identity of the gene?

4.   Obtain 1 kb of nucleotide sequence upstream of the transcription start site of the mouse Acute Phase Response factor (APRF) gene beginning with the Genbank sequence Accession number  L29278.1. What type of sequence does L29278.1 turn out to be?

Searching with ESTs

Ests are powerful tools for gene dicovery and annotation, but can also lead to many pitfalls.

Positive characteristics:

1. They are fragments of the transcribed regions of a genome, and therefore focus on the part (e.g. 3% for the human genome) that is "useful".

2. In coordination with other sequence annotation tools they are very powerful tools for gene strucure determination.

Negative characteristics:

1. Only fragments, 3' or 5', of cDNAs, and relatively small: 300-500 nt . May not indicate intron/exon boundaries, translation start site.

2. Often contain sequencing errors and vector sequences.

3. May contain intron sequences because vector picked up DNA corresponding to unspliced RNA.

4. Contain sequences from 5' or 3' untranslated regions (UTRs), present in all mRNAs.

5. May yield confusing results due to alternative splicing of primary transcript. Estimated to occur in 30% or more of human genes.

6. May yield multiple homology search hits due to:

                                                               a) closely related members of the same gene/protein family
                                                               b) related sequences: pseudogenes, Long Internal Nuclear Element (LINE) repeats, viral sequences, etc.
 

Power of EST searching may be increased by:

                                                                           a) using multiple, preferably overlapping, ESTs for the same gene.
                                                                           b) using ESTs in coordination with other identifiers of gene characteristics e.g. splice sites.

Programs using the latter approach are sim4 and Transcript Assembly Program (TAP).

Problem 8: What gene is associated with the EST accession number W64798.1 ? Find the transcription and translation start sites of the gene.
 

Guides and Tutorials:

1. Lectures on Sequence Similarity Searching on the VSNS Home Page

2. Exercises compiled by Francisco M. de La.Vega in the Appendix to Chapters 1 and 2 of the
    VSNS course.

3.  Sequence analysis tutorials developed at PSC.

4. Blast on the Web by Bruno A. Gaeta, Biotechniques (2000), 28  pp436 - 439.