Introduction Goals Methodology Programs Results Conclusions References Authors Acknowledgements

RESULTS


The exhaustive analysis of results detailed above is available in Conclusions.

Analysis of algorithm complexity and computational cost of brute-force pattern matching algorithm versus Knuth-Morris-Pratt algorithm

In order to determine the computational achievement of Knuth-Morris-Pratt algorithm comparing with force-brute pattern matching algorithm, we implemented two programs for each one, which function is generating all the possible patterns of a determined mRNA and verifying if those patterns are unique or not in the same mRNA, with the correspondence algorithm.
Comparisons %
Brute-force Algorithm Knuth-Morris-Pratt Algorithm KMP/CE x 100 KMP Gain
Growth Hormone (GH)
Homo sapiens
948.044 866.803 91,43 8,57
Insulin-like growth factor-1 (IGF-1)
Homo sapiens
72.677.840 66.679.912 91,75 8,25
Erythropoietin (EPO)
Homo sapiens
2.426.543 2.212.664 91,18 8,82
Erythropoietin (EPO)
Mus musculus
439.250 399.104 90,86 9,14

Nucleotides
Growth Hormone (GH)
Homo sapiens
822
Insulin-like growth factor-1 (IGF-1)
Homo sapiens
7260
Erythropoietin (EPO)
Homo sapiens
1340
Erythropoietin (EPO)
Mus musculus
579

Reproduction of Mus musculus mRNA results by Knuth-Morris-Pratt algorithm

In order to corroborate the results obtained by Dr. Robert Castelo, Mus musculus EPO gene, we implemented our program of unique subsequences search (explained in Programs) based on Knuth-Morris-Pratt algorithm, but only in mouse chromosome 5 to save time, because EPO gene locates in this chromosome.

Obtained results are that the following subsequences are unique:

CGACAGTCGAGTTC

AGTGGTCTACGTAG

GGGTCTACGCCAAC

Moreover, the program also informs about the shift of these patterns in the chromosome, a very imporant fact for using the results to its application in relation to gene doping, and also in those cases that found subsequences locate in two different exons separated by intron. The found shift for each subsequence is the next one:

Shift
CGACAGTCGAGTTC
136.403.346
AGTGGTCTACGTAG
136.402.282
GGGTCTACGCCAAC
136.401.412

As our program has a modifyng the length automatic strategy (in order to find the minimum unique subsequences) and we only checked in chromosome 5, we got also minor length subsequences than 14 nucleotides for the first and third subsequences, that if we had checked in whole genome, subsequences would have not been uniques.

For the first subsequence, the program finds one subsequence of 12 nucleotides: CGACAGTCGAGT, with a shift of 136.403.438. And for the third subsequence, it finds a subsequence of 11 nucleotides: GGGTCTACGCC, with a shift of 136.401.415. If we compare the shift of these smaller sebsequences with the shift of the subsequences they come from, it does not agree with initial position. This is because all the subsequences are found in the negative strand. Considering this fact, the shift coincides.

If you want to consult this search, click on the outputs below:

Prediction of EPO mRNA secondary structure

To elucidate if the found unique subsequences are located in a exposed region of the mouse EPO mRNA we have used the mFOLD program. This program predicts the folding package of a mRNA or DNA sequence. The results are the next ones:

Click on the image to get the postscript that allows you to blow up the mRNA structure.
The three unique subsequences are marked in green.

Now, the three unique subsequences are enlarged separately:

    
Subsequence AGTGGTCTACGTAG    

Subsequence GGGTCTACGCCAAC


Subsequence CGACAGTCGAGTTC