ClustalMU



MULTIPLE SEQUENCE ALIGNMENT (MSA)



Introduction:



ClustalMU is a program written in Perl whose aim is to obtain the best alignment of multiple protein sequences. It follows the ClustalW method, designed and optimized by Julie D. Thompson et al., which divides multiple alignment into three steps:

- First, it performs global pairwise alignments between all the sequences to measure their similarity, and builds a distance matrix from the results.

- Next, it builds a guide tree with the Neighbour-Joining method, using the distances in the matrix from the first step.

- Finally, it performs the progressive alignment in the order given by the guide tree.

Once these steps are done, it creates an output file in which all the sequences are aligned. This is a powerful way to find the amino acids that are especially conserved across all the proteins, having been preserved during evolution by negative selection. These regions may be important for the proteins' function or for maintaining their functional structure.


How ClustalMU works:



The process is divided into several steps, detailed below:

Step 1) Extracting the sequences from a FASTA file:

It reads any number of sequences from a FASTA file, stores them in a hash, and then splits them into two separate arrays: one for the sequences (values) and another for the names (keys).
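
The reading step can be illustrated with a minimal sketch. It is written in Python for brevity and is not ClustalMU's Perl code; the function name read_fasta and the two example records are hypothetical. It keeps the same layout the text describes: one mapping from name to sequence, then two parallel lists.

```python
def read_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()       # header line starts a new record
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line       # sequence may span several lines
    return sequences


# Hypothetical two-record input, only for illustration.
example = """>seqA
MKT
LLV
>seqB
MRT
"""

records = read_fasta(example)
names = list(records.keys())      # the keys: sequence names
seqs = list(records.values())     # the values: sequences, in the same order
```

Python dictionaries keep insertion order, so names and seqs stay aligned index by index.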


Step 2) Reading a substitution matrix (e.g.: BLOSUM62):

It extracts the score for every pair of amino acids and saves it in a hash, which will be needed when aligning.

e.g.: Aligning an alanine (A) with an arginine (R) has a score of -1, so it saves {"A"}{"R"} as keys and -1 as the value.
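
A minimal Python sketch of this two-level lookup follows. It assumes the matrix file is whitespace-separated, with a header row of column letters and optional comment lines starting with "#"; the source does not say which file layout ClustalMU expects, and read_matrix is a hypothetical name. The three-letter excerpt uses the BLOSUM62 scores for A, R and N.

```python
def read_matrix(text):
    """Parse a whitespace-separated substitution matrix into nested dicts."""
    matrix = {}
    columns = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                          # skip blanks and comments
        fields = line.split()
        if columns is None:
            columns = fields                  # first data line: column letters
            continue
        row, values = fields[0], fields[1:]
        matrix[row] = {col: int(v) for col, v in zip(columns, values)}
    return matrix


# Small excerpt (A, R, N only), assumed layout.
excerpt = """# excerpt of BLOSUM62
   A  R  N
A  4 -1 -2
R -1  5  0
N -2  0  6
"""

blosum = read_matrix(excerpt)
score_a_r = blosum["A"]["R"]   # same idea as the {"A"}{"R"} keys in the text
```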


Step 3) Obtaining an alignment order:

This order is needed for the global pairwise alignment. ClustalMU creates an array containing every possible combination of two sequences to be aligned.

e.g.: For 5 protein sequences, it generates all the combinations without repetition: 1-2; 1-3; 1-4; 1-5; 2-3; 2-4; 2-5; 3-4; 3-5; 4-5.
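
The same list of pairs can be generated in a few lines. This is a Python sketch, not the program's own code, and alignment_order is a hypothetical name; it numbers the sequences from 1 to match the example.

```python
from itertools import combinations


def alignment_order(n):
    """All unordered pairs of sequence numbers 1..n, without repetition."""
    return list(combinations(range(1, n + 1), 2))


order = alignment_order(5)
# Formatted like the example in the text: "1-2; 1-3; ..."
as_text = "; ".join(f"{a}-{b}" for a, b in order)
```

For n sequences this gives n*(n-1)/2 pairs, so the number of pairwise alignments grows quadratically with the number of sequences.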


Step 4) The global pairwise alignment:

It aligns every pair of sequences in the order given by the order array, scoring each alignment with the substitution matrix hash created in step 2. The substitution matrix matters because not all mismatches are equivalent: it assigns a different score to each possible pair of different amino acids.

After each alignment is finished, ClustalMU computes a score for it, which is needed to construct the distance matrix. The scores are saved in an array that runs parallel to another array recording which pair of sequences produced each score.

e.g.: An example of how ClustalMU aligns and calculates the distance.

Aligning fuguercc5 and urokinaseiso2: the distance is 0.958413085087872
Aligning fuguercc5 and apolipoprotein: the distance is 0.973029406646946
Aligning fuguercc5 and homoercc5: the distance is 0.558135704874835
Aligning fuguercc5 and urokinaseiso3: the distance is 0.954585000870019
Aligning fuguercc5 and tetraercc5: the distance is 0.442317730990082
Aligning urokinaseiso2 and apolipoprotein: the distance is 0.89735516372796
Aligning urokinaseiso2 and homoercc5: the distance is 0.962121212121212
Aligning urokinaseiso2 and urokinaseiso3: the distance is 0.32252027448534
Aligning urokinaseiso2 and tetraercc5: the distance is 0.949809545149003
Aligning apolipoprotein and homoercc5: the distance is 0.978919631093544
Aligning apolipoprotein and urokinaseiso3: the distance is 0.88646288209607
Aligning apolipoprotein and tetraercc5: the distance is 0.959220255433565
Aligning homoercc5 and urokinaseiso3: the distance is 0.95998023715415
Aligning homoercc5 and tetraercc5: the distance is 0.645092226613966
Aligning urokinaseiso3 and tetraercc5: the distance is 0.94174322204795


These distances are what is needed to construct the first distance matrix.
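
The scoring part of a global alignment can be sketched with the standard Needleman-Wunsch recurrence. This is a Python illustration under stated assumptions, not ClustalMU's code: the linear gap penalty of -8 is a hypothetical value (the text does not mention how gaps are scored), and the tiny matrix holds only the BLOSUM62 entries for A and R. How ClustalMU turns a score into the distances listed above is not stated, so the sketch stops at the score.

```python
def global_alignment_score(a, b, score, gap=-8):
    """Needleman-Wunsch score of the best global alignment of a and b.

    score[x][y] is the substitution score; gap is an assumed linear penalty.
    """
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]                                # column 0: all gaps
        for j in range(1, len(b) + 1):
            cur.append(max(
                prev[j - 1] + score[a[i - 1]][b[j - 1]],  # match / mismatch
                prev[j] + gap,                            # gap in b
                cur[j - 1] + gap,                         # gap in a
            ))
        prev = cur
    return prev[-1]


# BLOSUM62 entries for alanine and arginine only.
tiny = {"A": {"A": 4, "R": -1}, "R": {"A": -1, "R": 5}}

same = global_alignment_score("AR", "AR", tiny)      # two matches
mismatch = global_alignment_score("A", "R", tiny)    # one mismatch
with_gap = global_alignment_score("AR", "A", tiny)   # one match, one gap
```

Only two rows of the dynamic-programming table are kept, which is enough when the score alone is needed; recovering the alignment itself would require the full table and a traceback.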


Step 5) Constructing the distance matrix:

Distance matrices are constructed in order to obtain the guide tree. Once it has a distance matrix, ClustalMU searches it for the minimum distance, saves references to the two sequences that give that distance, and rebuilds the matrix, recalculating the distances to the pair that gave the minimum. This step is repeated until all the sequences have been joined together.

e.g.: An example of a distance matrix constructed by ClustalMU.

fuguercc5 0
urokinaseiso2 0.958413085087872 0
apolipoprotein 0.973029406646946 0.89735516372796 0
homoercc5 0.558135704874835 0.962121212121212 0.978919631093544 0
urokinaseiso3 0.954585000870019 0.32252027448534 0.88646288209607 0.95998023715415 0
tetraercc5 0.442317730990082 0.949809545149003 0.959220255433565 0.645092226613966 0.94174322204795 0
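
Filling the matrix from the two parallel arrays of step 4 and finding its minimum can be sketched as follows. This is a Python illustration with hypothetical function names; to keep it short it uses only three of the six sequences, with their distances copied from the output in step 4.

```python
from itertools import combinations

names = ["fuguercc5", "homoercc5", "tetraercc5"]
pairs = list(combinations(range(len(names)), 2))     # (0,1), (0,2), (1,2)
distances = [0.558135704874835, 0.442317730990082, 0.645092226613966]


def build_matrix(n, pairs, distances):
    """Symmetric n x n matrix filled from parallel pair/distance lists."""
    matrix = [[0.0] * n for _ in range(n)]
    for (i, j), d in zip(pairs, distances):
        matrix[i][j] = matrix[j][i] = d
    return matrix


def closest_pair(matrix):
    """Indices (i, j), i < j, of the smallest off-diagonal distance."""
    n = len(matrix)
    return min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: matrix[p[0]][p[1]])


matrix = build_matrix(len(names), pairs, distances)
i, j = closest_pair(matrix)
closest_names = (names[i], names[j])

# Print the lower triangle in the same layout as the example above.
for row in range(len(names)):
    print(names[row], *matrix[row][:row + 1])
```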



Step 6) Creating the guide tree:

The guide tree is constructed while the distance matrix is being reduced. In step 5, a node is defined every time a minimum distance is found: the two sequences (or groups of sequences) that give the minimum distance are joined because a node exists between them. By joining sequences in this way, we finally obtain the guide tree that the progressive alignment would follow.

The results obtained for 6 protein sequences are shown below:

From these sequences: fuguercc5 urokinaseiso2 apolipoprotein homoercc5 urokinaseiso3 tetraercc5
The order to follow when aligning is: (2,((1,4),(3,(0,5))));
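
The joining loop of steps 5 and 6 can be sketched in Python as below. Two things here are assumptions, since the text does not give them: the closest pair is chosen by the raw minimum distance (as step 5 describes, rather than the Q-criterion of textbook Neighbour-Joining), and the distance from a new node u = (i,j) to any other node k is recalculated as (d(i,k) + d(j,k) - d(i,j)) / 2. The function name and the choice to append each new node at the end of the list are also hypothetical. The input is the distance matrix shown in step 5, with sequences numbered from 0.

```python
lower = [  # lower triangle of the step 5 matrix, rows 0..5
    [],
    [0.958413085087872],
    [0.973029406646946, 0.89735516372796],
    [0.558135704874835, 0.962121212121212, 0.978919631093544],
    [0.954585000870019, 0.32252027448534, 0.88646288209607, 0.95998023715415],
    [0.442317730990082, 0.949809545149003, 0.959220255433565,
     0.645092226613966, 0.94174322204795],
]
n = len(lower)
full = [[0.0] * n for _ in range(n)]
for i, row in enumerate(lower):
    for j, dist in enumerate(row):
        full[i][j] = full[j][i] = dist


def join_closest_pairs(matrix):
    """Repeatedly join the closest pair; return node labels in join order."""
    d = [row[:] for row in matrix]
    labels = [str(i) for i in range(len(matrix))]
    nodes = []
    while len(labels) > 1:
        size = len(labels)
        i, j = min(((a, b) for a in range(size) for b in range(a + 1, size)),
                   key=lambda p: d[p[0]][p[1]])
        keep = [k for k in range(size) if k not in (i, j)]
        # Assumed update rule: distance from the new node to each kept node.
        new_row = [(d[i][k] + d[j][k] - d[i][j]) / 2 for k in keep]
        new_label = f"({labels[i]},{labels[j]})"
        nodes.append(new_label)
        # Rebuild the matrix without i and j, with the new node placed last.
        d = [[d[a][b] for b in keep] + [new_row[idx]]
             for idx, a in enumerate(keep)]
        d.append(new_row + [0.0])
        labels = [labels[k] for k in keep] + [new_label]
    return nodes


nodes = join_closest_pairs(full)
guide_tree = nodes[-1]
```

Under these assumptions the last node produced is (2,((1,4),(3,(0,5)))), the same order as the output above. That agreement is suggestive, but it does not confirm that ClustalMU uses this exact update rule.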


Although we worked very hard to finish the program, we were unable to complete it. ClustalMU obtains the order in which the progressive alignment would be done, but we could not write the regular expression that would let it work out, from that order, which sequences to align at each step.

When the construction of the guide tree ends, ClustalMU obtains an array (@nodes) with the following structure:

$nodes[0]=(2,((1,4),(3,(0,5))))
$nodes[1]=((1,4),(3,(0,5)))
$nodes[2]=(3,(0,5))
$nodes[3]=(0,5)
$nodes[4]=(1,4)

Its structure shows the order in which ClustalMU has found the nodes. The next step would be to read this array and carry out the progressive alignment.
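
One possible way to read such a nested order is sketched below in Python; it is a suggestion, not part of ClustalMU. Nested parentheses are hard to match with a single regular expression, so the sketch uses a small recursive parser instead, followed by a post-order walk that lists which groups to align at each step. The function names are hypothetical, and the parser assumes well-formed input with exactly two children per node, as in $nodes[0].

```python
def parse_node(text):
    """Parse a string like '(2,((1,4),(3,(0,5))))' into nested tuples."""
    pos = 0

    def parse():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                 # skip '('
            left = parse()
            pos += 1                 # skip ','
            right = parse()
            pos += 1                 # skip ')'
            return (left, right)
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        return int(text[start:pos])  # a leaf: a sequence index

    return parse()


def alignment_steps(tree):
    """Post-order list of (left_group, right_group) merges."""
    steps = []

    def visit(node):
        if isinstance(node, int):
            return [node]
        left = visit(node[0])
        right = visit(node[1])
        steps.append((left, right))  # children are merged before parents
        return left + right

    visit(tree)
    return steps


tree = parse_node("(2,((1,4),(3,(0,5))))")
steps = alignment_steps(tree)
for left, right in steps:
    print("align", left, "with", right)
```

Each step pairs two groups of sequence indices, so sequences 1 and 4 are aligned first and sequence 2 is added last, after all the others have been merged.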

