Search of motifs with EM algorithm

Materials and Methods

Finding motifs in DNA and aminoacids sequences has been a problem since in the most datasets the parameters are not known. In this case one of the approaches is to use a computational model. The one which we use here is our own edition of the classic MEME computer program first done by Timothy Bailey(1995).

This program runs through the EM algorithm. This algorithm is an iterative optimization method to estimate some unknown parameters (matrix Z), given measurement data (matrix P). But we don't really have the measurement data because it is 'hidden', and we need to integrate it out.

The intuition behind EM is to alternate between estimating the unknown parameters and the hidden variables until it arrives to convergence. Convergence is the base of heuristics*, this is an internal auto-learning system that allows computer to improve the probabilistic model in each iteration. This idea has been around for a long time.

*Heuristics are rules derived from the own experience that give aproximate fast solutions to complex problems.

The expectation - maximization algorithm has two main step: The 'E-step' and the 'M-step'.

'E' (expectaction) STEP

The E-step calculates a matrix of probabilities (the Z matrix). This is the probability of each nucleotide to be the start of the motif. This calculus is made from a probability matrix (the P matrix), in which there are defined the probabilities of each nucleotide in each position of the motif plus the probability of each nucleotide to be background (this is all the sequence which is not motif).

'M' (Maximization) STEP

The M-step calculates a new P matrix from the Z matrix calculate in the step above. In this way the algorithm maximizes its likelihood about the real motif.

This two steps are repeated some times (until convergence is reached, that is when the likelihood doesn't improve more). So at the end of the program we get an output where we can see the sequence dataset with the motif found and the likelihood of that motif.

MEME Models

In the MEME program there are three basic models: OOPS, ZOOPS and TCM.
OOPS means One Occurrence Per Sequence. This model assumes that all the sequences of the dataset have been created by a probabilistic model which parameters can be calculated as to find the motif.This model assumes that each motif appears exactly once in each sequence in the data set.
ZOOPS means Zero or One Ocurrence Per Sequence. This one is similar to OOPS, but it adds one more variable to have in account the possibility that in one or more sequences there were no motif.
TCM model is a random process that take the sequences as a Two-Component Mixture. One component, the background, generates one letter at a time; the other component, the motif, generates a string of letters.TCM is the most general model which allows a motif to appear any number of times en each sequence.

The other models have some improvements compared with the OOPS, but that improvements increased the complexity of the program. For these reason we have chosen to implement our program based on this model.


The Program

Now we are going to explain how we have implemented the program step by step.

Given INPUT

First step: randomly generation of a set of sequences

Selection of a random position in each sequence which will be the putative starting point of the hidden motif.

Generation of a dependent position nucleotide-frequency matrix: rows will be each of the four nucleotides (A, C, G, T); the number of columns will be the total motif length and at the end, there will be another extra column that will contain background nucleotide frequencies.

E-step: create the Z matrix where the number of rows is the number of the sequences and the number of columns is the length of the sequences. For normalizing the matrix we divide each cell score by the sum of all the scores obtained in the row.

M-step: create the P matrix again which will contain the probability of each nucleotide to be in each motif position using the calculated results on the E-step. This matrix is also normalized, this matrix is normalized by dividing each calculated probability by the sum of all the position probabilities (for example if position 1 is A:0.5, G:0.25, C:0.42, T:0.12; then normalized A:0.39, G:0.19, C:0.33, T:0.09, then all suming 1).

Iteration: by repeating many times these steps, the positions that will be reinforced will be those that are more repeated in all the data set. That is because as their are equal in the sequences the program will retain them in the P matrix and not dilute them it with all the other random nucleotides in the sequences.

OUTPUT: our program gives the sequences with the motif highlightted. Also gives the value of matrix Z for each position and the likelihood of the last P matrix, which defines the pattern of the found motif. For example:


	     ATGCCGGCCG TATATA ATCGCTAGCT // Z value = 0.39 
	     CGCAG TATATA AGTGACACAGTACAT // Z value = 0.26
	     TAGACACTTGTAGATCGATCA TATATA // Z value = 0.45

	     Likelihood = -4320.055
	
check our PROGRAM !!!!

Homepage