Cerca de subseqüències úniques

Introduction

Goals

Methodology

Programs

Results

Conclusions

References

Authors

Acknowledgements

METHODOLOGY

To achieve the main goals of this project and to tackle the problem of pattern matching between symbols, we have implemented two algorithms: the brute-force pattern matching algorithm and Knuth-Morris-Pratt algorithm.

BRUTE-FORCE PATTERN MATCHING ALGORITHM

This algorithm is based on searching matching patterns in a DNA sequence, by means of the "naive" algorithm of going through all sequence nucleotide per nucleotide, making the appropriate comparison with the pattern. The main factor to bear in mind is the problem of searching all possible movements throughout the sequence.

For example, if we consider a sequence t = {CTCACTGCCTGCCTAG} and a pattern p = {CTGCCTAG}, the pattern p is inside t from the ninth symbol, and we can say that the pattern p appears inside t with movement s, in this case, s = 8.

This lead the symbols:

t[s+1]t[s+2]t[s+3]...t[s+m]

to match perfectly with the symbols of p:

t[s+1] == p[0]
t[s+2] == p[1]
t[s+3] == p[2]
...
t[s+m] == p[m-1]

So, in this way, we can find a maximum of n - m + 1 movements s.

This algorithm, also known as "force-brute", works in a simple and obvious way: when a mismatch occurs between the pattern and the sequence, the comparison process stops and starts again moving the pattern one position throughout the sequence. In the example, the comparison between the pattern and the sequence starts with the first position of them. In this case, the comparison also continues with the two following nucleotides, in which there is also coincidence. Therefore, the comparison between the pattern and the sequence starts another time, but now compares the position 0 of pattern array with the position 1 of sequence array (second nucleotide), and so on, displacing the starting point of pattern comparison only one position throughout the sequence.

As a result, we find that sequence t contains the pattern p with a movement s equal to 8.

KNUTH-MORRIS-PRATT ALGORITHM

Knuth-Morris-Pratt (KMP) algorithm is a lineal algorithm for pattern matching. It's similar to brute-force pattern matching algorithm but KMP is based on the fact that, when it finds the first symbol that doesn't coincide between the pattern and the sequence, it already knows the pattern symbols compared until that position. In that way, after a mismatch the process doesn't continue (from the beginning of the pattern) with the next sequence character, and it doesn't compare those positions that already knows that don't generate an exact occurrence of the pattern. So, avoids redundant comparisons.

KMP takes advantage of the pattern information, because the sequence where it search is unknown. To benefit from this fact, it is necessary to construct a table of coincident movements inside the pattern before initiating the search. This table, named "next table", has a position for each pattern position,and in each position n contains the maximum number of coincident symbols with the pattern before that position (without coinciding the whole prefix of n symbols). In other words, for each position n is checked the coincidences between the different prefix of n and the pattern that precedes n.

Continuing with the previous example, in which there is a sequence t ={CTCACTGCCTGCCTAG} and a pattern p = {CTGCCTAG}, and the pattern p is inside t from the ninth symbol. Then, as before, we can declare that the pattern p is inside t with a movement s = 8.

In KMP, first of all, is calculated the next table of the pattern. For the pattern p = {CTGCCTAG} it would be:

In order to build next table, is assigned a value of -1 to position 0 of next table array and a value of 0 to position 1. The value of -1 to position 0 is like a signal used for initiating the comparison with the next sequence nucleotide in those cases that occurs a mismatch between the pattern position 0 and one sequence symbol. It would have to compare the prefix of position 1 of next table array, consisting only of position 0, with the pattern before the position 1. As the prefix is always maximum, even being coincident, it is assigned directly the value of 0 to position 1. Then, it continues with position 2 of pattern array, where it is look for the number of coincident nucleotides with the pattern before that position. In this case, it is simple, because it is only necessary check if position 1 (T) is equal to position 0 (C) of pattern array. As there are not equal, the value of next table array of position 2 is 0. Later, it moves to position 3, comparing if the prefix, consisted of positions 1 and 2 or only of position 2, coincide with the pattern from its beginning. In other words if the prefix constituted by positions 1 and 2 is equal to positions 0 and 1 or if position 2 is equal to position 0 of the pattern. In this case, there is no coincidence either, so the next table value of position 3 is 0. If it is checked position 4, it would have to compare the prefix consisted of positions 1, 2 and 3, positions 2 and 3 or position 3 with the pattern that precedes them. Now, position 3 is equal to position 0 of pattern array, so the value of next table array of position 4 is 1 (the maximum prefix of position 4, that coincides with the pattern, has one nucleotide). What happens in position 5 is the same in position 4. The maximum prefix that coincides with the pattern has one nucleotide, so the next table value of position 5 is 1. Position 6 has different prefix: constituted by positions 1, 2, 3, 4 and 5, by positions 2, 3, 4 and 5, by positions 3, 4 and 5, by positions 4 and 5 and by position 5. The prefix constituted by positions 4 and 5 coincides with positions 0 and 1 of pattern array, and the value of position 6 of next table array is 2 (there are two coincident nucleotides). Finally, it is checked last position 7. Any prefix of this position match with the pattern. So, the next table array has a value of 0 in position 7.

Once we have built the next table, its is proceed to implement KMP algorithm. Its method is the following one: every time there is a match between two symbols (one of the sequence and the other of the pattern), the algorithm moves to next position of both, sequence and pattern. On the other hand, when there are two different symbols, it starts searching the pattern in the sequence from the pattern position that indicates the next table.

For example:
The two first nucleotides of the sequence there are C and T and correspond with the two first symbols of the pattern. But third sequence nucleotide is C and third pattern nucleotide is G. KMP algorithm looks for the value of third pattern symbol which is 0. Now, compares the third sequence position (where it was the mismatch) with the position 0 of the pattern, which as there is an array, it is the first position, corresponding to C. Algorithm avoids the comparison of the pattern with the second sequence position, because uses the next table as a memory system and already knows that the comparison is unnecessary, as it can not match with the pattern.

In the penultimate pattern comparison with the sequence, in which the pattern starts the comparison from sequence nucleotide 5, there's a match between the symbols until arrives to sequence nucleotide number 11, where there is a G, that is compared with pattern nucleotide A. As there is a mismatch, it looks for the next table value of this A, that is 2. Then, starts comparing the sequence symbol G with the position 2 of pattern array. Finally, finds the pattern p inside the sequence t with a movement s equal to 8.

Once we have implemented Knuth-Morris-Pratt in Perl programming language for unique subsequence search in human genome for GH, IGF-1 and EPO genes, it was determined that the time of program execution would be about 40* days long approximately for each gene at best. This period of time was higher than disposal antecedents by our project supervisor, Dr. Robert Castelo, who had implemented the brute-force pattern matching algorithm in C programming language in order to find unique patterns of EPO gene in mouse genome. At the beginning, it was thought that implementing Knuth-Morris-Pratt algorithm, which is more efficient than brute-force pattern matching algorithm, it would be a computational achievement, in spite of the fact that Perl language is slower than C language. But the high difference in terms of time between this two programming languages, has made impossible get the unique subsequences for the required genes in the established time limit. However, we have made a program based on the implementation of Knuth-Morris-Pratt algorithm, that is capable of realizing the unique patterns search in the whole genome for a determinate mRNA. In order to verify the program correct operation we have corroborated the results obtained by Dr. Robert Castelo. That is, we have reproduced the three unique patterns found in EPO gene of mouse, that are unique subsequences of chromosome 5 of mouse genome, where the locus of this gene is located.

*The stimation of these approximately 40 days was carried out in the next way:
Our program contained a group of "warning signals" that inform us about its execution state every moment: the length, the pattern and the chromosome, and for each chromosome, every time it had made 50.000 lines. With those "prints" to the terminal window we could detect that, in one of the fastest computer which Dr. Robert Castelo allowed us to access in, the program process about 50.000 lines every 5 seconds.Considering that mouse genome has a total of 63.299.105 lines, the program would take 6.329,91 seconds to process the whole genome for each pattern. As an hour is equal to 3.600 seconds, this means it would be 1,75 hours. If our program generates 566 patterns of length 14 by a mRNA of 579 nucleotides, it would take a total of 990,5 hours to run, that are equivalent to a 41 days approximately.

As we have mentioned before in Goals, in order to monitoring the gene expression of mRNA in a possible gene doping case, we require specific marked sequences and there have to selectively hybridate with the target mRNA. For this reason, we were interested in obtained unique patterns of GH, IGF-1 and EPO genes, which were accessible to marked PNAs and not hidden in secondary structure of the mRNA. Therefore, once we got the unique patterns in the whole genome (specifically in chromosome 5 of mouse genome) by implementing Knuth-Morris-Pratt, we predicted secondary structure of the mRNA and the localization of found unique subsequences of EPO gene with mFOLD program. This program, created by Dr. Michael Zuker of Rensselaer Polytechnic Institute's School of Science, is available as a web service. mFOLD is a software that predicts the fold and secondary structure in both, RNA and DNA.

The project methodology is the next one:

Brute-force pattern matching algorithm implementation of sequences in order to search for unique patterns in one mRNA.

Knuth-Morris-Pratt pattern matching algorithm implementation of sequences in order to search for unique patterns in one mRNA

Calculation and comparison of the nucleotides operations number that realize both algorithms to find unique patterns for each mRNA inside its own mRNA.

Implementing Knuth-Morris-Pratt algorithm, check if unique subsequences of 14 nucleotides (CGACAGTCGAGTTC, AGTGGTCTACGTAG i GGGTCTACGCCAAC), found by Dr. Robert Castelo for EPO gene in mouse, are unique in chromosome 5 of this organism.

Localize these unique subsequences in the secondary structure of mRNA of EPO mouse gene in order to determine if there are found in more internal or more exposed regions.

The sequences we have used to elaborate this project, obtained in databases of NCBI, there are the next ones:

	mRNA	Accession number
Growth Hormone (GH)	Homo sapiens	NM_000515
Insulin-like growth factor-1 (IGF-1)	Homo sapiens	NM_000618
Erythropoietin (EPO)	Homo sapiens	NM_000799
Erythropoietin (EPO)	Mus musculus	NM_007942