WEIGHT MATRICES FOR SIMILARITY SCORING

In sequence comparison, costs or similarities were previously determined by simple scoring methods such as Hamming distance, where the basic considerations are match or mismatch.
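As a minimal sketch of such match/mismatch scoring, the Hamming distance between two equal-length sequences can be computed as follows. The function name and the example sequences are illustrative assumptions, not taken from the course material.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    # Every mismatch costs 1, every match costs 0, whatever the residues are.
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical example sequences that differ at a single position.
print(hamming_distance("ACDEFG", "ACDQFG"))
```

Every mismatch carries the same cost here, regardless of how similar the two residues are.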

In biological macromolecules, mismatched residues may still be similar, e.g. in chemical nature, so a mismatch may not be critical for function.

There are also many examples where dissimilar substitutions in sequences from different species are tolerated with no loss of function.

More sophisticated methods for measuring cost or similarity have therefore been designed. 
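The idea of a weight matrix can be sketched as a lookup table of pairwise scores that replaces the plain match/mismatch rule. The weights below are toy values chosen for illustration only (an assumption, not taken from PAM, BLOSUM or any other published matrix), and the function names are hypothetical.

```python
# Toy symmetric similarity weights for four residues (hypothetical values,
# NOT taken from any published substitution matrix).
TOY_WEIGHTS = {
    ("I", "I"): 4, ("V", "V"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("I", "V"): 3, ("D", "E"): 2,
    ("I", "D"): -3, ("I", "E"): -3, ("V", "D"): -3, ("V", "E"): -2,
}


def pair_weight(x: str, y: str) -> int:
    """Look up the weight of a residue pair; the table is symmetric."""
    if (x, y) in TOY_WEIGHTS:
        return TOY_WEIGHTS[(x, y)]
    return TOY_WEIGHTS[(y, x)]


def similarity_score(a: str, b: str) -> int:
    """Sum the pairwise weights over an ungapped, equal-length comparison."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(pair_weight(x, y) for x, y in zip(a, b))


# Both comparisons contain exactly one mismatch, but the similar
# substitution (I to V) is penalised less than the dissimilar one (I to D).
print(similarity_score("IDIE", "VDIE"))
print(similarity_score("IDIE", "DDIE"))
```

With a weight matrix, two comparisons with the same number of mismatches can receive different scores, which is the distinction that simple match/mismatch scoring cannot make.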

1.  PAM (Point Accepted Mutation, also expanded as Percent Accepted Mutation) Matrices


Basic Construction Steps:

The matrix constructed as above is termed the PAM1 matrix; it corresponds to one accepted point mutation per 100 residues.
Higher-order PAM matrices reflect longer evolutionary periods and are therefore used to find more highly diverged sequences in sequence homology searches.
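In the standard Dayhoff scheme (stated here as background, since the notes do not spell it out), the mutation probability matrix for PAM n is obtained by multiplying the PAM1 mutation probability matrix by itself n times. The sketch below illustrates this on a toy three-letter alphabet. The matrix values are hypothetical and are not real Dayhoff data, and the helper names are assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM1-like" mutation probability matrix for a 3-letter alphabet
# (hypothetical values, NOT Dayhoff data). Entry [i][j] is the probability
# that residue i is replaced by residue j in one step; each row sums to 1.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

toy_pam100 = mat_power(TOY_PAM1, 100)
toy_pam250 = mat_power(TOY_PAM1, 250)

# The diagonal (probability that a residue is unchanged) shrinks as the
# PAM number grows, i.e. higher-order matrices model greater divergence.
for name, m in (("PAM1", TOY_PAM1), ("PAM100", toy_pam100), ("PAM250", toy_pam250)):
    print(name, [round(m[i][i], 3) for i in range(3)])
```

A scoring matrix for searches would then be derived from such probabilities, a step this sketch leaves out.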

PROBLEM 1.

For the previous problem on Uracil-DNA glycosylase from Shope papilloma virus (SWISS-PROT accession number P32941), search for orthologs using MPsrch, varying the PAM value from 100 to 400.
 
Resources: VSNS course Chapter 1.3.

PAM matrices in Bioinformatics course at McMaster University

2. BLOSUM Matrices

Derived from the BLOCKS database of conserved sequences from protein families.

BLOSUM 62: the matrix is calculated so that sequences more than 62% identical are merged, which avoids over-weighting the contributions of multiple entries of closely related sequences.  It is most closely related to the PAM 160 matrix.
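The merging step can be sketched as follows: aligned, ungapped segments of equal length are grouped whenever their percent identity reaches the threshold, and each group then counts as a single entry. This is a rough illustration only. The single-linkage style of merging, the inclusion of the exact boundary case, the function names and the example segments are all assumptions, not details given in these notes.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions between two equal-length segments."""
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def cluster_by_identity(segments, threshold=62.0):
    """Group segments linked by identity at or above the threshold.

    Uses a small union-find structure, so two segments end up in the same
    cluster if a chain of sufficiently similar segments connects them.
    Treating exactly-threshold pairs as linked is an assumption made here.
    """
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if percent_identity(segments[i], segments[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, seg in enumerate(segments):
        clusters.setdefault(find(i), []).append(seg)
    return list(clusters.values())


# Hypothetical aligned segments: the first three are closely related.
segments = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHWWW", "MNPQRSTVWY"]
for cluster in cluster_by_identity(segments):
    # Each member of a cluster of size k contributes with weight 1/k.
    print(cluster, "weight per member:", 1.0 / len(cluster))
```

Down-weighting each cluster to a single entry keeps a family with many near-identical members from dominating the observed substitution counts.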

BLOSUM matrices may be more useful for heuristic methods such as FASTA and BLAST, less so for others.


Compared to PAM 160, the BLOSUM 62 matrix is less tolerant of substitutions to or from hydrophilic amino acids, but more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches.


Resources:

BLOSUM Matrices (McMaster University).

BLOCKS database.

VSNS course