Cerca de subseqüències úniques

In this case, up to m nucleotides would have to be compared for all n - m + 1 examined shifts, which sums up to the mentioned worst case complexity of roughly O(nm).

In our results, for example for EPO mRNA of mouse, the sequence length (m) was 579 nucleotides. Our program generates about 566 patterns from this sequence in a length of 14 nucleotides (n). So, the approximately worst number of operations expected is 579*14*566 = 4.587.996. But implementing our program, the number of operations that realizes is 439.250. This difference is because we are not in the worst cases. If we divide the total number of operations that realizes by the total number of patterns that generates from the mRNA (439.250/566) we get a total of 776 operations aproximately for each pattern. If EPO mouse mRNA has 579 nucleotides, this means that for each pattern compares 1,35 nucleotides. Now, we can understand it: it does not compare all the nucleotides of the pattern every time, it shifts to another comparison after an average of 1 - 2 nucleotides finding a mismatch. And the same with the other mRNAs checked.

Although this easy pattern matching algorithm is still used in some applications such as biological research, presents an obvious disadvantage: it is rather inefficient method for sequence strings on small alphabets, because uses small shifts (some of which appear to be negligible) and the position variable has to be backed up each time a mismatch occurs, which means that the input sequence has to be buffered.

On the other hand, Knuth-Morris-Pratt algorithm runs in a lineal time: O(n+m). When we analyzed the brute-force pattern matching algorithm, we judged to be ineffective, because we had to backup the sequence position after each mismatch, and some sequence nucleotides were examined twice or more, although, by the information about the pattern we had, we knew that several shifts could have been skipped. Knuth-Morris-Pratt works quite more efficiently: pattern and sequence nucleotides are compared in a left-to-right scan. If a mismatch occurs, the algorithm searches for the prefix of the pattern that determines how far the pattern can be shifted to the right without missing a possible match. As we have explained in Methodology, the data we need to find the next shifting position is stored in an auxiliary table, next table, which is computed in a pre-processing step by comparing the pattern with itself.

We have just mentioned that the Knuth-Morris-Pratt has a linear worst case boundary. If we analyze the algorithm in Programs, it seems that would be executed up to n and up to m times, making a complexity of O(nm). But this is not so. If you remember our implemented algorithm, whenever sequence and pattern nucleotide match, position is incremented. When a mismatch occurs, position can only be decremented according to the next table as many times as it was incremented before, which means the total number of times is limited by 2n. This signifies that we have a runtime of O(2n) in the worst case, not considering the next table construction, which there is a similar complexity. So, we would have an overall runtime of O(2n) + O(2m) equivalent to O(n+m).

This is one of the major advantages of the KMP algorithm; another good reason to use this algorithm is that, it does not require buffering of the scanned sequence.

Disadvantages of the KMP are that this method requires a high amount of pre-processing, and that it is still not the most efficient solution for larger alphabets. But four our case, biological alphabet L = {A,C, G, T}, it is useful.

Continuing with the previous example of our results of EPO, but now with Knuth-Morris-Pratt algorithm, it does approximately 399.104 operations instead of 439.250. So, we conclude that Knuth-Morris-Pratt gets a 9,14% of computational achievement. And the same with the rest of mRNAs checked, that you can see in Results. The average of computational achievement of Knuth-Morris-Pratt algorithm versus brute-force pattern matching algorithm is about 8,695%.