Introduction Goals Methodology Programs Results Conclusions References Authors Acknowledgements

CONCLUSIONS


The results of exhaustive analysis, which are detailed here, are available in Results.

Analysis of algorithm complexity and computational cost of brute-force pattern matching algorithm versus Knuth-Morris-Pratt algorithm

Brute-force pattern matching algorithm runs in a quadratic time: O(nm). It is a kind of algorithm that requires no pre-processing on pattern. As we have explained in Methodology and Programs, the idea is simple: pattern and sequence are compared nucleotide by nucleotide; in case of a mismatch, the pattern is shifted one position to the right and the comparison is repeated, until a match is found or the end of the sequence is reached.

The worst complexity cost of this algorithm is O(nm). This would happen if it is used on e.g. a sequence of n A's followed by a C to search for a pattern of m A's and a C (assume m < n):

Sequence
Pattern

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAC

In this case, up to m nucleotides would have to be compared for all n - m + 1 examined shifts, which sums up to the mentioned worst case complexity of roughly O(nm).

In our results, for example for EPO mRNA of mouse, the sequence length (m) was 579 nucleotides. Our program generates about 566 patterns from this sequence in a length of 14 nucleotides (n). So, the approximately worst number of operations expected is 579*14*566 = 4.587.996. But implementing our program, the number of operations that realizes is 439.250. This difference is because we are not in the worst cases. If we divide the total number of operations that realizes by the total number of patterns that generates from the mRNA (439.250/566) we get a total of 776 operations aproximately for each pattern. If EPO mouse mRNA has 579 nucleotides, this means that for each pattern compares 1,35 nucleotides. Now, we can understand it: it does not compare all the nucleotides of the pattern every time, it shifts to another comparison after an average of 1 - 2 nucleotides finding a mismatch. And the same with the other mRNAs checked.

Although this easy pattern matching algorithm is still used in some applications such as biological research, presents an obvious disadvantage: it is rather inefficient method for sequence strings on small alphabets, because uses small shifts (some of which appear to be negligible) and the position variable has to be backed up each time a mismatch occurs, which means that the input sequence has to be buffered.

On the other hand, Knuth-Morris-Pratt algorithm runs in a lineal time: O(n+m). When we analyzed the brute-force pattern matching algorithm, we judged to be ineffective, because we had to backup the sequence position after each mismatch, and some sequence nucleotides were examined twice or more, although, by the information about the pattern we had, we knew that several shifts could have been skipped. Knuth-Morris-Pratt works quite more efficiently: pattern and sequence nucleotides are compared in a left-to-right scan. If a mismatch occurs, the algorithm searches for the prefix of the pattern that determines how far the pattern can be shifted to the right without missing a possible match. As we have explained in Methodology, the data we need to find the next shifting position is stored in an auxiliary table, next table, which is computed in a pre-processing step by comparing the pattern with itself.

We have just mentioned that the Knuth-Morris-Pratt has a linear worst case boundary. If we analyze the algorithm in Programs, it seems that would be executed up to n and up to m times, making a complexity of O(nm). But this is not so. If you remember our implemented algorithm, whenever sequence and pattern nucleotide match, position is incremented. When a mismatch occurs, position can only be decremented according to the next table as many times as it was incremented before, which means the total number of times is limited by 2n. This signifies that we have a runtime of O(2n) in the worst case, not considering the next table construction, which there is a similar complexity. So, we would have an overall runtime of O(2n) + O(2m) equivalent to O(n+m).

This is one of the major advantages of the KMP algorithm; another good reason to use this algorithm is that, it does not require buffering of the scanned sequence.

Disadvantages of the KMP are that this method requires a high amount of pre-processing, and that it is still not the most efficient solution for larger alphabets. But four our case, biological alphabet L = {A,C, G, T}, it is useful.

Continuing with the previous example of our results of EPO, but now with Knuth-Morris-Pratt algorithm, it does approximately 399.104 operations instead of 439.250. So, we conclude that Knuth-Morris-Pratt gets a 9,14% of computational achievement. And the same with the rest of mRNAs checked, that you can see in Results. The average of computational achievement of Knuth-Morris-Pratt algorithm versus brute-force pattern matching algorithm is about 8,695%.

Reproduction of Mus musculus mRNA results by Knuth-Morris-Pratt algorithm

We can conclude that effectively our program works correctly and can corroborate results obtained by Dr. Robert Castelo for the EPO gene of Mus musculus, as we have demonstrated in Results. Also can gives the shift of found patterns, an important fact for those patterns which locate between two exons separated by intron. Moreover, we have checked that our automatic strategy of modifying length also works efficiently, because when we reproduce the three subsequences, we get other shorter and unique subsequences in chromosome 5. If this search is carried out in the whole genome, these shorter subsequences would not be found as unique ones.

Prediction of EPO mRNA secondary structure

As it can be seen in the Result's pictures obtained by mFOLD, the localizations of unique subsequences found for EPO mRNA are exposed enough of the secondary structure prediction for this gene.