Search of motifs with EM algorithm

Discussion

After testing our program with different datasets, we can affirm that is not always good enough for our purpouses, because it has big problems refering the initiation value of the random 'P' matrix.

If we run the program, reached the convergence, we realize that the maximum value that predicts the probability that any position could be the motif start point is not statistically significant. We ran the program with the following randomly generated sequences. As we show below, these contain the marked motif (which is 6- nucleotide long):


Sequence 1: CACGTACCTACTCTGTATATACAACTACAA

Sequence 2: GGGGGTATATACTTTTCCTCCGAATCATTA

Sequence 3: GACAGTGGATCTCTAACCGCTATGTATATA

Sequence 4: CGAAACGTATATAACCACGTGAGCACTTAT

Sequence 5: TCTGACAGAGTGCTCTCCCGCACTATATAG


We ran the program 10 times and in all of them the motif found was not the real motif hidden in each sequences.

The motifs found were:

Sequence 1: CACGTACCTACTCTGTATATACAACTACAA

Sequence 2: GGGGGTATATACTTTTCCTCCGAATCATTA

Sequence 3: GACAGTGGATCTCTAACCGCTATGTATATA

Sequence 4: CGAAACGTATATAACCACGTGAGCACTTAT

Sequence 5: TCTGACAGAGTGCTCTCCCGCACTATATAG

And the probabilities of each sequence motif starting point are shown in the next table (these values didn't changed after running the program 10 times):


Sequence

Max Prob Motif Start Point

10.05953
20.07148
30.05777
40.06361
50.06384

The probability of any concrete position could be the motif starting point is too low, so in any case, it cannot be considered as a real motif starting point.

So, considering that the motif is not found by our program we conclude that is due to its incapacity to find a position with a higher probability to be the motif starting point.

We also try a lot of different set of sequences, like only 3 sequences and shorter, or more sequences and longer, etc. The best result we had is shown in next table:

CCGTATATAAGCGCGCCAAGAATGGTTATC

CTCCGACAGGATCACGGCGCGCATATATAG

ACTTGTATATATAGCTCCGGCCGCGCGCTG

Sequence

Max Prob Motif Start Point

10.36817
20.39502
30.30516

After thinking about why that happened, we believe that EM algorithm can only find the real hidden motif depending on the following parameters :

  1. The background randomness (for this reason, we tested our program with sequences randomly generated).
  2. The length of the sequence
  3. The length of the motif
  4. The nucleotide composition of the motif
  5. The number of sequences that we run

EM algorithm limitations:

  • EM algorithm is difficult to initialize and the quality of the final solution depends on the initilitation quality
  • As the algorithm tends to converge to an local optimum, then the result is far away from the best solution, eventhought it is usually an acceptable solution
  • The iteration number to get the optimal result is unknown
  • It is difficult to know the worth of the algorithm
  • The MEME program has problems to find a motif if the first random step does not get near it. So, if the first P matrix random constructed does not fall before the motif, the algorithm will tend to other local maximum


Triying to find solutions:

Once we have had our program done and working, it seemed not to work as well as expected. It found the motif once for each 8 times we ran the program, and as explained before the Z matrix values were to low to be statistically significant.

In an atempt to find a solution we tried to change the weak part of the program, the random star. So we try to make ten iteration, each one creating a random P matrix and doing only once all the path trough the program, and storing each matrix in a big hash. Then comparing the likelihood of each matrix and take the best one as our starting P matrix for the main body of the program. But by doing it the program didn't compile, and we haven't be able to fix it.

To see this test: Improvement failed


HomePage