Search of motifs with EM algorithm

Introduction

Many research projects are focused on the discovery of the mechanisms that control different cellular processes in order to understand many cellular responses that can be both organism specifics, or shared in many organisms under different conditions. These cellular mechanisms are regulated by coding or non coding sequences (called motifs), which are present in a concrete location inside an organism genomic DNA. For example the 'TATA' box.

The length of these sequences is a particular feature of each motif.

A motif can control cellular mechanism in different ways depending on its belonging to a specific region: coding or non coding region.

If the motif is placed inside a genomic sequence that encodes for a protein required for cellular survival, it's easy to imagine the need of a motif conservation along the evolution.

However as some non coding sequences (located upstream the coding regions which are regulating) are important for the regulation of cellular pathways, the selective pressure also tends to keep them conserved along the evolution despite of not encoding for a protein.

These regulatory sequences, are often difficult to identify, as they are sometimes short and surrounded by large amounts of background.

The comprehension of cellular responses regulation by the identification of these motifs inside the genome of one or more species, is a goal that can be achieved both by empirical research (for example by doing a microarray), or by computational assisted motif prediction.

Nowadays, there are many programs which can represent useful tools to find motifs shared in a set of sequences.

Background

By doing this essay we have tried to implement the Expectation-Maximization algorithm (which is the basis of the MEME program) in order to find a motif shared in a set of given sequences.

This workshop is based on Timoty Bailey's thesis about Expectation Maximization algorithm (EM). We have implemented a program, similar to Bailey's MEME, that, given a set of sequences and a width of motif, it can discover the occurrences of motifs in the given dataset and output the position of the motif ocurrences and a description of the motif.

As we have mentioned above, the principal input to our program is a set of DNA sequences, and its output is a series of a probabilistic sequence model, each corresponding to one motif, whose parameters have been estimated by expectation - maximization.

For running the EM algorithm, we have to assume that the sequences in the dataset have been created from a probabilistic model whose parameters are unknown. So, we only have to choose one of the possible models, and from there, try to find the motif in the sequences by calculating the missing parameters.

The aim for our work was to implement a program based on the mentioned algorithm, which could read a sequence dataset and found a motif.


HomePage