
WEIGHT MATRICES FOR SIMILARITY SCORING

So far in sequence comparison, costs or similarities have been determined by simple scoring methods such as Hamming distance, which consider only whether each position is a match or a mismatch.
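As a reminder of what such a simple score looks like, here is a minimal sketch of Hamming distance for two sequences of equal length. The function name and the example strings are illustrative only.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    # Every mismatch costs 1, regardless of which residues are involved.
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("GATTACA", "GACTATA"))
```

Every mismatch contributes the same cost, which is exactly the limitation that weight matrices address.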

In biological macromolecules, mismatched residues may still be similar, e.g. in chemical nature, so a mismatch may not be crucial for function.

There are also many examples where dissimilar substitutions in sequences from different species are tolerated with no loss of function.

More sophisticated methods for measuring cost or similarity have therefore been designed.
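The idea can be sketched by scoring an ungapped alignment with a small weight matrix. The scores below are toy values chosen for illustration (an assumption, not taken from any published PAM or BLOSUM table), and the names TOY_SCORES, pair_score and alignment_score are hypothetical.

```python
# Toy symmetric similarity scores for four residues (illustrative values only).
# Chemically similar pairs (I/L, D/E) score higher than dissimilar ones.
TOY_SCORES = {
    ("I", "I"): 4, ("L", "L"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("I", "L"): 2, ("D", "E"): 2,
    ("I", "D"): -3, ("I", "E"): -3, ("L", "D"): -4, ("L", "E"): -3,
}


def pair_score(x: str, y: str) -> int:
    """Look up a residue pair in either order (the matrix is symmetric)."""
    if (x, y) in TOY_SCORES:
        return TOY_SCORES[(x, y)]
    return TOY_SCORES[(y, x)]


def alignment_score(a: str, b: str) -> int:
    """Sum the pair scores over an ungapped alignment of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


# Both comparisons mismatch at all four positions (Hamming distance 4),
# yet the weight matrix separates conservative from radical replacements.
conservative = alignment_score("ILDE", "LIED")
radical = alignment_score("ILDE", "DEIL")
print(conservative, radical)
```

With these toy values the conservative alignment scores well above the radical one, although a match/mismatch scheme would rate them the same.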

1. PAM (Percent Accepted Mutation) Matrices

Basic Construction Steps:

Align a group of protein sequences that are at least 85% identical

Construct a Phylogenetic Tree

Tally replacements for every pair of adjacent sequences after pairwise alignment

Construct a matrix showing how often any amino acid is replaced by any other

Correct for chance occurrence of amino acids in the group

Adjust matrix values to reflect a probability of one amino acid change in 100

The matrix constructed above is termed the PAM1 matrix.

It reflects the amount of evolutionary time required for one amino acid in 100 to change on average.
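The last construction steps (tally, correct for amino acid frequency, scale to one change in 100) can be sketched roughly as follows. This is a simplified sketch, not the original procedure: the three-letter alphabet, the counts and the frequencies are invented toy data, and the way relative mutability is computed here is an assumption made for brevity.

```python
def pam1_from_counts(counts, freqs):
    """Turn symmetric replacement counts into a PAM1-like probability matrix.

    counts[a][b] is the tallied number of a<->b replacements (zero diagonal),
    freqs[a] is the background frequency of residue a.
    Returns pam1[a][b], the probability that a is replaced by b.
    """
    letters = sorted(freqs)
    # Total replacements involving each residue.
    changes = {a: sum(counts[a][b] for b in letters if b != a) for a in letters}
    # Simplified relative mutability: changes per unit of frequency (assumption).
    mutability = {a: changes[a] / freqs[a] for a in letters}
    # Scale so that, averaged over the composition, 1 residue in 100 changes.
    lam = 0.01 / sum(freqs[a] * mutability[a] for a in letters)
    pam1 = {}
    for a in letters:
        row = {}
        for b in letters:
            if a == b:
                row[b] = 1.0 - lam * mutability[a]
            else:
                row[b] = lam * mutability[a] * counts[a][b] / changes[a]
        pam1[a] = row
    return pam1


# Invented toy data for a three-letter alphabet.
TOY_COUNTS = {
    "A": {"A": 0, "B": 30, "C": 10},
    "B": {"A": 30, "B": 0, "C": 20},
    "C": {"A": 10, "B": 20, "C": 0},
}
TOY_FREQS = {"A": 0.5, "B": 0.3, "C": 0.2}
TOY_PAM1 = pam1_from_counts(TOY_COUNTS, TOY_FREQS)

for residue, row in TOY_PAM1.items():
    print(residue, {b: round(p, 5) for b, p in row.items()})
```

Each row sums to 1, and the frequency-weighted chance of any change is 0.01, which is the defining property of a PAM1 matrix.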

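PAM1 matrices can be multiplied according to the rules of matrix multiplication: e.g. PAM1 × PAM1 = PAM2.

PAM2 represents the time for 2 amino acid changes per 100 residues, PAM250 the time for 250 changes per 100 residues, etc. Because the same position can change more than once, PAM250 does not mean that every residue has been replaced.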
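The relation PAM1 × PAM1 = PAM2 can be tried out with a small sketch. The two-state matrix below is a toy stand-in for the real 20 × 20 PAM1 matrix (an assumption for illustration), and the helper names mat_mul and pam_n are hypothetical.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Raise a PAM1-like probability matrix to the n-th power."""
    size = len(pam1)
    # Start from the identity matrix (no evolutionary change).
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Toy two-state matrix: each residue has a 1% chance of changing per PAM unit.
TWO_STATE_PAM1 = [[0.99, 0.01], [0.01, 0.99]]

pam2 = pam_n(TWO_STATE_PAM1, 2)
pam250 = pam_n(TWO_STATE_PAM1, 250)
print([round(p, 4) for p in pam2[0]])
print([round(p, 4) for p in pam250[0]])
```

In this toy case the diagonal of the 250th power stays well above zero, because repeated and reverse changes at the same position mean that 250 changes per 100 residues do not replace every residue.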

PAM 100 is taken to reflect an evolutionary period of roughly 10 million years, although rates of change differ between protein families.

Higher-order PAM matrices reflect longer evolutionary periods.

They are therefore used to find more highly diverged sequences in sequence homology searches.

PROBLEM 1.

For the previous problem on Uracil-DNA glycosylase from Shope papilloma virus (SWISS-PROT Database accession number P32941), search for orthologs using MPsrch, varying PAM values from 100 to 400.

Resources: VSNS course Chapter 1.3.

PAM matrices in Bioinformatics course at McMaster University

2. BLOSUM Matrices

BLOSUM matrices are derived from the BLOCKS database of conserved sequence regions from protein families.

BLOSUM 62: the matrix is calculated so that sequences more than 62% identical are merged into a single cluster, which avoids over-counting the contributions of multiple closely related entries. It is most closely related to the PAM 160 matrix.
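The merging step can be sketched as a simple clustering of aligned, ungapped block sequences. This is only a sketch under assumptions: the sequences are invented, the use of single linkage and of an inclusive 62% cut-off are choices made here, and the function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of identical positions in two aligned, equal-length segments."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_identity(seqs, threshold=62.0):
    """Merge sequences into clusters when any member pair reaches the threshold."""
    clusters = []
    for s in seqs:
        merged, remaining = [s], []
        for c in clusters:
            if any(percent_identity(s, t) >= threshold for t in c):
                merged.extend(c)      # s links to this cluster: absorb it
            else:
                remaining.append(c)   # unrelated cluster: keep as is
        clusters = remaining + [merged]
    return clusters


def sequence_weights(clusters):
    """Weight each sequence so that every cluster counts as one sequence."""
    return {s: 1.0 / len(c) for c in clusters for s in c}


# Invented block of four aligned segments.
TOY_BLOCK = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGWWWW", "MNPQRSTVWY"]
toy_clusters = cluster_by_identity(TOY_BLOCK)
toy_weights = sequence_weights(toy_clusters)
print(toy_clusters)
print(toy_weights)
```

The first two segments are 90% identical and end up in one cluster with a weight of 0.5 each, so the pair contributes no more to the substitution counts than a single sequence would.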

BLOSUM matrices may be more useful for heuristic methods such as FASTA and BLAST, less so for others.

Compared to the PAM 160 matrix, the BLOSUM 62 matrix is less tolerant of substitutions to or from hydrophilic amino acids, but more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches.

Resources:

BLOSUM Matrices (McMaster University).

BLOCKS database.

VSNS course