MULTIPLE ALIGNMENT

1. INTRODUCTION

Alignment of multiple sequences enables:

Motifs and regions critical to function to be characterized

3-Dimensional structure to be elucidated

Evolutionary history to be inferred

The optimal alignment is that which is the most consistent with previous biological knowledge,
i.e. aligns regions with the same functional or structural activity.

To assess the optimal alignment of sequences of undetermined function, computational methods must be used, again using cost functions to determine the alignment having minimum cost among all possible alignments. The cost of a multiple alignment can be calculated:

a) Column by column, adding the pairwise costs of all pairs of residues in each column.
The total cost is the sum of all column costs. This is an example of a sum-of-pairs (SP) cost.

b) As the sum of all pairwise costs, comparing each sequence pairwise to all others.
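The two ways of calculating the cost can be illustrated with a minimal sketch. The cost values (mismatch = 1, gap = 2) and the toy alignment are assumptions chosen for illustration, not values from the course material. With a cost that is simply summed over symbol pairs, both calculations give the same total.

```python
from itertools import combinations

def pair_cost(a, b, mismatch=1, gap=2):
    """Cost of aligning two symbols; '-' is a gap (assumed illustrative costs)."""
    if a == "-" and b == "-":
        return 0          # two gaps facing each other cost nothing
    if a == "-" or b == "-":
        return gap        # residue against gap
    return 0 if a == b else mismatch

def sp_cost_by_columns(alignment):
    """(a) Sum, over columns, of the costs of all symbol pairs in the column."""
    total = 0
    for column in zip(*alignment):
        total += sum(pair_cost(a, b) for a, b in combinations(column, 2))
    return total

def sp_cost_by_pairs(alignment):
    """(b) Sum, over sequence pairs, of the cost of each pairwise comparison."""
    total = 0
    for s, t in combinations(alignment, 2):
        total += sum(pair_cost(a, b) for a, b in zip(s, t))
    return total

# Toy alignment: three rows of equal length.
alignment = ["AC-GT", "ACAGT", "A-AGA"]
print(sp_cost_by_columns(alignment))  # 10
print(sp_cost_by_pairs(alignment))    # 10
```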

Some alignment methods must handle the sequences in pairs first.

When two sequences are compared within the multiple alignment, the resulting pairwise alignment is termed a projection
of the multiple alignment in the direction of the two sequences under comparison.
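A projection can be sketched as follows: take the two rows of interest and drop every column in which both rows hold a gap. The function name `project` and the toy alignment are illustrative assumptions.

```python
def project(alignment, i, j, gap="-"):
    """Project a multiple alignment onto rows i and j.

    Columns where both rows contain a gap carry no information about
    the pair and are removed.
    """
    kept = [(a, b) for a, b in zip(alignment[i], alignment[j])
            if not (a == gap and b == gap)]
    row_i = "".join(a for a, _ in kept)
    row_j = "".join(b for _, b in kept)
    return row_i, row_j

alignment = ["AC-GT", "A--GT", "ACAGT"]
print(project(alignment, 0, 1))  # ('ACGT', 'A-GT')
```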

Analogous to drawing a line representing the optimal pairwise alignment through a matrix, optimal multiple alignment can be represented as the path through a multidimensional lattice enclosing a polyhedron. Projections within the multiple alignment are then analogous to optical projection onto faces of the polyhedron.

For alignment of 3 sequences (represented by a cube) the computer must choose the minimum of 7 values at each node; for 4 sequences, 15 values; for k sequences, 2^k - 1.
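The count 2^k - 1 comes from the possible steps into a lattice node: each of the k sequences either advances by one residue or receives a gap, and the step in which no sequence advances is excluded. A short sketch that enumerates these steps (the function name is an illustrative assumption):

```python
from itertools import product

def predecessor_steps(k):
    """All non-zero 0/1 step vectors in a k-dimensional lattice.

    A 1 means the sequence contributes a residue to the column,
    a 0 means it contributes a gap.
    """
    return [step for step in product((0, 1), repeat=k) if any(step)]

for k in (2, 3, 4):
    print(k, len(predecessor_steps(k)))
# 2 3
# 3 7
# 4 15
```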

Determining the cost of multiple alignments of many long sequences therefore involves very large or impossible amounts of computer time, and approximate methods must be used. The most common is the Carrillo-Lipman method. This method depends on the concept that the path of the optimal alignment will follow the diagonal of the cost hyperlattice if no gaps are allowed. The distance the path is displaced from the diagonal will be proportional to the number of gaps introduced. The path will then not lie along the diagonal but within a certain "volume" around it. If the computer can limit its calculations of costs to this volume or some limited estimate of it, instead of performing them at every node, computation time can be significantly reduced. The following steps are involved:

1. The optimal pairwise alignment cost is determined for each pair of sequences.

2. A heuristic method, e.g. alignment along trees, is used to estimate the projected pairwise alignment cost. (The projected heuristic cost).

3. The difference (D_i,j)^* between the costs of the projected heuristic and optimal pairwise alignments is calculated for each sequence pair.

4. The Carrillo-Lipman bound for each sequence pair is then the sum of:

Optimal pairwise alignment cost for that pair + sum of cost differences (D_i,j) for all other pairs.

5. The computer will then only calculate costs in the volume defined by the lower bounds (the optimal pairwise alignments) and the upper bounds (the Carrillo-Lipman bounds). This is equivalent to reflecting the projections of the pairs of sequences from the faces back onto the diagonal of the alignment hyperlattice.
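Steps 3 and 4 above can be sketched in a few lines, following the wording of the steps literally. The function name, the dictionary layout and the cost values are illustrative assumptions; in practice the optimal costs would come from pairwise dynamic programming and the projected costs from a heuristic multiple alignment.

```python
def pairwise_upper_bounds(optimal, projected):
    """Upper bound per sequence pair, as described in steps 3 and 4.

    optimal:   dict mapping a pair (i, j) to its optimal pairwise cost
    projected: dict mapping a pair (i, j) to its projected heuristic cost
    """
    # Step 3: difference between projected heuristic and optimal cost.
    diffs = {pair: projected[pair] - optimal[pair] for pair in optimal}
    total_diff = sum(diffs.values())
    # Step 4: optimal cost of the pair plus the differences of all OTHER pairs.
    return {pair: optimal[pair] + (total_diff - diffs[pair]) for pair in optimal}

# Made-up costs for three sequences.
optimal = {(0, 1): 10, (0, 2): 12, (1, 2): 8}
projected = {(0, 1): 11, (0, 2): 15, (1, 2): 8}

bounds = pairwise_upper_bounds(optimal, projected)
print(bounds)  # {(0, 1): 13, (0, 2): 13, (1, 2): 12}
```

For each pair, only lattice nodes whose pairwise cost lies between the optimal cost (lower bound) and this value (upper bound) would need to be examined.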

MULTIPLE ALIGNMENT OF (CODING) DNA FROM ALIGNED AMINO ACID SEQUENCES

RevTrans at the Technical University of Denmark. The approach is based on the slower disappearance of phylogenetic signal from protein sequences compared with DNA sequences.
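The general idea of building a DNA alignment from a protein alignment can be sketched as follows: each aligned residue is replaced by its codon from the original coding sequence, and each gap by three gap characters. This is only an illustration of the principle, not RevTrans's own code; the function name and the toy sequences are assumptions.

```python
def thread_codons(aligned_protein, coding_dna, gap="-"):
    """Build one row of a codon alignment from an aligned protein row.

    coding_dna is the unaligned coding sequence for the same protein,
    three nucleotides per residue, in reading frame.
    """
    residues = aligned_protein.replace(gap, "")
    if len(coding_dna) < 3 * len(residues):
        raise ValueError("coding DNA is too short for the protein")
    codons = (coding_dna[i:i + 3] for i in range(0, 3 * len(residues), 3))
    # Gap in the protein row -> three gaps; residue -> next codon.
    return "".join(gap * 3 if aa == gap else next(codons)
                   for aa in aligned_protein)

# Toy protein alignment and the matching (hypothetical) coding sequences.
protein_alignment = {"seq1": "MK-L", "seq2": "MKAL"}
coding = {"seq1": "ATGAAACTG", "seq2": "ATGAAGGCTCTT"}

dna_alignment = {name: thread_codons(row, coding[name])
                 for name, row in protein_alignment.items()}
print(dna_alignment["seq1"])  # ATGAAA---CTG
print(dna_alignment["seq2"])  # ATGAAGGCTCTT
```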

PROBLEM 1

Using the programs available at BCM Search Launcher, perform a multiple alignment on the sequences below:

>7FAB_light_chain
ASVLTQPPSVSGAPGQRVTISCTGSSSNIGAGHNVKWYQQLPGTAPKLLIFHNNARFSVSKSGTSATLAITGLQAEDEAD
YYCQSYDRSLRVFGGGTKLTVLRQPKAAPSVTLFPPSSEELQANKATLVCLISDFYPGAVTVAWKADGSPVKAGVETTTP
SKQSNNKYAASSYLSLTPEQWKSHKSYSCQVTHEGSTVEKTVAP
>2FB4_light_chain
QSVLTQPPSASGTPGQRVTISCSGTSSNIGSSTVNWYQQLPGMAPKLLIYRDAMRPSGVPDRFSGSKSGASASLAIGGLQ
SEDETDYYCAAWDVSLNAYVFGTGTKVTVLGQPKANPTVTLFPPSSEELQANKATLVCLISDFYPGAVTVAWKADGSPVK
AGVETTKPSKQSNNKYAASSYLSLTPEQWKSHRSYSCQVTHEGSTVEKTVAPTECS
>2FB4_heavy_chain
EVQLVQSGGGVVQPGRSLRLSCSSSGFIFSSYAMYWVRQAPGKGLEWVAIIWDDGSDQHYADSVKGRFTISRNDSKNTLF
LQMDSLRPEDTGVYFCARDGGHGFCSSASCFGPDYWGQGTPVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFP
QPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKRVEPKSC
>7FAB_heavy_chain
AVQLEQSGPGLVRPSQTLSLTCTVSGTSFDDYYWTWVRQPPGRGLEWIGYVFYTGTTLLDPSLRGRVTMLVNTSKNQFSL
RLSSVTAADTAVYYCARNLIAGGIDVWGQGSLVTVSSASTKGPSVFPLAPTAALGCLVKDYFPEPVTVSWNSGALTSGVH
TFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEP
>1FC1
PSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPQVKFNWYVDGVQVHNAKTKPREQQYNSTYRVVSVLTVLHQNWLDGK
EYKCKVSNKALPAPIEKTISKAKGQPREPQVYTLPPSREEMTKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPV
LDSDGSFFLYSKLTVDKSRWQQGNVFSCSVMHEALHNHYTQKSLSLS
>BS1-fragment
VTISCTGSSSNIGAGNHVKWYQQLPG
>BS2-fragment
VTISCTGTSSNIGSITVNWYQQLPG
>BS3-fragment
LRLSCSSSGFIFSSYAMYWVRQAPG
>BS4-fragment
LSLTCTVSGTSFDDYYSTWVRQPPG
>BS5-fragment
PEVTCVVVDVSHEDPQVKFNWYVDG
>BS6-fragment
ATLVCLISDFYPGAVTVAWKADS
>BS7-fragment
AALGCLVKDYFPEPVTVSWNSG
>BS8-fragment
VSLTCLVKGFYPSDIAVEWESNG

The first five sequences are the light or heavy chains of 3 antibody molecules, and the last 8 are fragments of these molecules. Hence the fragments should be in alignment with the appropriate regions of their corresponding antibodies in the multiple alignment.

Why is this not always so in the Clustal alignment? Remember, Clustal first aligns along a tree.

What are the conserved features of the sequences, and how do they relate to their function?

PROBLEM 2

Looking at the 3-dimensional structures of the immunoglobulin domains below:

>B1, 7FAB light chain variable region
ASVLTQPPSVSGAPGQRVTISCTGSSSNIGAGHNVKWYQQLPGTAPKLLIFHNNARFSVSKSGTSATLAITGLQAEDEAD
YYCQSYDRSLRVFGGGTKLTVLR
>B2, 2FB4 light chain variable region
QSVLTQPPSASGTPGQRVTISCSGTSSNIGSSTVNWYQQLPGMAPKLLIYRDAMRPSGVPDRFSGSKSGASASLAIGGLQ
SEDETDYYCAAWDVSLNAYVFGTGTKVTVLGQ
>B3, 2FB4 heavy chain variable region
EVQLVQSGGGVVQPGRSLRLSCSSSGFIFSSYAMYWVRQAPGKGLEWVAIIWDDGSDQHYADSVKGRFTISRNDSKNTLF
LQMDSLRPEDTGVYFCARDGGHGFCSSASCFGPDYWGQGTPVTVSS
>B4, 7FAB heavy chain variable region
AVQLEQSGPGLVRPSQTLSLTCTVSGTSFDDYYWTWVRQPPGRGLEWIGYVFYTGTTLLDPSLRGRVTMLVNTSKNQFSL
RLSSVTAADTAVYYCARNLIAGGIDVWGQGSLVTVSS
>B5, 1FC1 heavy chain constant region
PSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPQVKFNWYVDGVQVHNAKTKPREQQYNSTYRVVSVLTVLHQNWLDGK
EYKCKVSNKALPAPIEKTISKAKG
>B6, 7FAB light chain constant region
QPKAAPSVTLFPPSSEELQANKATLVCLISDFYPGAVTVAWKADGSPVKAGVETTTPSKQSNNKYAASSYLSLTPEQWKS
HKSYSCQVTHEGSTVEKTVAPtscs
>B7, 7FAB heavy chain constant region
ASTKGPSVFPLAPTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHK
PSNTKVDKKVEPksa
>B8, 1FC1 heavy chain constant region
QPREPQVYTLPPSREEMTKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFFLYSKLTVDKSRWQQGN
VFSCSVMHEALHNHYTQKSLSL

experts have derived so-called "structurally verified alignments" for parts of them (called "motifs").

The following correspond to the different beta-strands (A to G) in the regions of the above immunoglobulins, and can be taken as the "standard of truth", i.e. these sequences should align in the optimal multiple alignment:

A       B       C      D      E        F       G

VLTQPP  TISCTG  NVKWY  SVSKS  TSATLAI  YYCQSY  VFG
VLTQPP  TISCSG  TVNWY  SGSKS  ASASLAI  YYCAAW  VFG
QLVQSG  RLSCSS  AMYWV  TISRN  NTLFLQM  YFCARD  YWG
QLEQSG  SLTCTV  YWTWV  TMLVN  NQFSLRL  YYCARN  VWG
SVFLFP  EVTCVV  KFNWY  KTKPR  VVSVLTV  YKCKVS  IEK
SVTLFP  TLVCLI  TVAWK  GVETT  ASSYLSL  YSCQVT  VEK
SVFPLA  ALGCLV  TVSWN  GVHTF  LSSVVTV  YICNVN  VDK
QVYTLP  SLTCLV  AVEWE  NYKTT  LYSKLTV  FSCSVM  TQK

How close do the multiple alignment tools at BCM and Biobenchelper come to the (correct) optimal alignment corresponding to the above "standard of truth"?
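One simple way to make this comparison, sketched here under stated assumptions, is to find the alignment column at which each motif starts in each aligned row and check whether the columns agree. The function names are illustrative, the rows are short toy fragments rather than real program output, and only the first occurrence of a motif in a row is considered.

```python
def motif_start_column(aligned_row, motif, gap="-"):
    """Return the alignment column where motif starts in this row, or None."""
    ungapped = aligned_row.replace(gap, "")
    pos = ungapped.find(motif)
    if pos < 0:
        return None
    # Map positions in the ungapped sequence back to alignment columns.
    columns = [c for c, ch in enumerate(aligned_row) if ch != gap]
    return columns[pos]

def motif_is_aligned(aligned_rows, motifs):
    """True if every row contains its motif and all motifs start in one column."""
    starts = [motif_start_column(row, m) for row, m in zip(aligned_rows, motifs)]
    return None not in starts and len(set(starts)) == 1

# Motif B for two of the sequences, taken from the table above.
motifs = ["TISCTG", "RLSCSS"]

rows = ["VTISCTGSS", "LRLSCSSSG"]            # motifs start in the same column
shifted = ["VTISCTGSS-", "-LRLSCSSSG"]       # second row shifted by one column

print(motif_is_aligned(rows, motifs))     # True
print(motif_is_aligned(shifted, motifs))  # False
```

Repeating the check for each of the motifs A to G gives a rough count of how many structurally verified blocks a given program recovers.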

Other sequence sets and alignments for assessing multiple alignment programs are available at the BAliBASE Web site.

Notes on Chapter 3, VSNS course:

1. The projected optimal cost is, of course, unknown. We can only get it once the optimal alignment is known.

2. Figure 9 calculates "compensation terms" relative to the projected optimal, which is a theoretical exercise due to the above.

3. Figure 11 calculates these terms relative to the pairwise optimal, which is known. You may therefore want to skip Figures 9 and 10.

4. Exercise 51:

Lower bound = sum of (Weight*Cost) values i.e. sum of weighted optimal pairwise alignment costs.

Projected cost = estimated cost of projection of alignment in the direction of each pair of sequences.

Pairwise cost = cost of optimal pairwise alignment.

Delta = sum of epsilon values, i.e. differences in projected & optimal alignment costs.

Max. delta = sum of Max. epsilon values, i.e. compensation terms.

MULTIPLE ALIGNMENT EDITORS AND DISPLAY TOOLS

Boxshade server - paste alignment from BCM Multiple Alignment site, try output format = RTF, input format = other.

GeneDoc, available from the Pittsburgh Supercomputing Center.

OTHER INTERNET RESOURCES

A Gentle Guide to Multiple Alignment. VSNS course at University of Bielefeld.

Multiple Alignment, Stanford Course #218

Multiple Alignment tutorial at the Pittsburgh Supercomputing Center.

MALIG and MALGEN - multiple alignment of large numbers of sequences.

MAVID: Multiple Alignment Server for large genomic sequences at UC Berkeley.

Biobenchelper Compendium of Multiple Alignment Tools & Resources

VISTA - Visual Tools for Alignments. Visualizes long sequence alignments of DNA from two or more species with annotation information.

http://www.sciencecentral.com/scidir/site/480922