Cerca de subseqüències úniques

Introduction

Goals

Methodology

Programs

Results

Conclusions

References

Authors

Acknowledgements

PROGRAMS

COMPARISON OF THE COMPLEXITY BETWEEN BRUTE-FORCE PATTERN MATCHING ALGORITHM AND KNUTH-MORRIS-PRATT ALGORITHM

In order to compare the efficiency of the brute-force pattern matching algorithm with Knuth-Morris-Pratt algorithm, we made a program in Perl that implements the brute-force pattern matching algorithm. The logic of this algorithm is detailed in Methodology.

The main program function is calculate the number of operations that brute-force pattern matching algorithm does, and then, we can compare with Knuth-Morris-Pratt algorithm. For that reason, the input of the program as a parameter is the sequence of mRNA. The program generates all the possible patterns from mRNA and seek which of them are unique. Moreover, the program contains a meter ($x variable), which has the number of operations that the program does. To read the code of this program, download correspondencia.pl.

To compare the algorithmic complexity, we made a similar program, which also generate pattterns from mRNA and seek if there were unique or not in that mRNA, but with Knuth-Morris-Pratt algorithm. The results of this comparison can be consulted in Results.

SEARCH OF UNIQUE SUBSEQUENCES OF ONE MRNA IN A GENOME

DOWNLOAD THE PROGRAM CODE cercasubsequencies.pl

NECESSARY PARAMETERS TO EXECUTE THE PROGRAM
[Parameter0]	Name of the mRNA file in FASTA format
[Parameter1]	Name of the directory where there is the genome
[Parameter2]	Specific length to start the pattern search. It is an optional parameter. The program start to search with a length of 14 by default

The main function of the program consists in looking for unique and specfic subsequences from one mRNA in one enterely genome (human, mouse,...).

The way we developed the program to achieve this goal has been creating optimum program's substructures for each problem to solve:

Next table construction

Next table is a sine qua non factor of Knuth-Morris-Pratt algorithm and, for this reason, one of the first steps. The basis of this coincidences table in the Knuth-Morris-Pratt algorithm are available in Methodology.

The program code, which allow us to calculate this next table in the program, is encapsulated inside a function named nexttable:

sub nexttable{

    my $patro = $_[0]; 

    @patro = split(//,$patro);
    $m = scalar(@patro);

    my $i = 2;
    my $c = 0;

    $coin[0] = -1;
    $coin[1] = 0;

    while ($i < $m){

	$j = 1;
    
	while ($j < $i){
	
	    $w = 0;
	    while ($w < $j && $j < $i){
		if ($patro[$j] eq $patro[$w]){
		    $c = $c + 1;
		    $j = $j + 1;
		    $w = $w + 1;
		}else{    
		    if ($c > 0){
			$w = 0;
			$c = 0;
		    }else{
			$j = $j + 1;
			$c = 0;
		    }
		}
	    }
	}
	$coin[$i] = $c;
	$i = $i + 1;
	$c = 0;
    }
    return @coin;
}

This function receives as first and unique parameter the pattern, which it will use for building the next table. This table is based on the construction of a coincidences table in an array named @coin (coincidences) indexed by the same positions of the pattern. This array always have the value of -1 in the first position (0) and the value of 0 in the second position (1). Then, the algorithm goes through the rest of pattern positions comparing the maximum previous position prefix with the pattern and annotating the coincidences in each position. If you want a detailed explanation of this operation, you can see program code.

Once you have the next table, although it is very important, there are more problems to be solved yet.

Knuth-Morris-Pratt Algorithm

The logic algorithm and why the achievement in efficiency regarding to the brute-force pattern matching algorithm is detailed in Methodology.

Its implementation in situ the program is in two functions: one function named knuth, to check if the patterns are unique in the mRNA; and another function, knuth4genome, to check the same verification in the whole genome.

The first function, knuth, receives three parameters: the next table calculated before, the mRNA sequence and the pattern. And returns 1 (true) or 0 (false) if the pattern is unique or not in the mRNA. The second function, knuth4genome, receives four parameters: the next table, the genome directory address, the pattern and the pattern length. And returns 1 (true)* or 0 (false) if the pattern is unique or not in the genome, and in the case that is unique, also returns the movement and the chromosome where it is localize.

* The function returns 1 when the unique patterns meter is equal or less than 1. The reason of that is for those patterns which are unique in the mRNA but there are not found in the genome, because the pattern can be located in two different exons with an intron among them.

However, in this second function, before implementing KMP algorithm, we must considered some preparations:

To verify if the pattern is unique in the whole genome, is necessary that the genome locates in a directory. The program opens the directory with the Perl function opendir and saves all chromosome files in an array, except hidden files "." and "..":

my @fitxerscromosomes = ();

opendir(GENOMA,$genoma);
@fitxerscromosomes = grep { $_ ne '.' and $_ ne '..' } readdir(GENOMA);
closedir(GENOMA);

Later, using an iteration with while, the program checks if the pattern is unique in the chromosome. In each case, opens the chromosome and realize the check line per line. We must remind to concatenate the last nucleotides of previous line (depending on pattern length) in the beginning of each line.

while ($z < scalar(@fitxerscromosomes && $c<=1){

    open(CROMOSOMA,"< $genoma/$fitxerscromosomes[$z]"); 

    my $trosset = "";
    my $defline = <CROMOSOMA>;
    while (<CROMOSOMA>) {
        chomp;
        my $seq = $trosset . $_;   
        $seq = uc($seq);
        $trosset = substr($_,length($_)-$llargada);
        *******
    }
    close(CROMOSOMA);
    $z = $z + 1;
}

Finally, some unique subsequences can be located in the negative strand of the chromosome sequences. In order to solve this fact, the program makes the "reverse complement" sequence of the pattern before initiating the search:

my $patrorc = reverse($patro);     
$patrorc =~ tr/acgtACGT/tgcaTGCA/;

And when the program is doing the search line per line, if in the actual line doesn't find the pattern, then tries to repeat the same search but with the reverse complement of the pattern, and if doesn't find it neither, continues with the habitual algorithm procedure.

By the way, in both functions the KMP algorithm is the next one:


my @mrna = split(//,$mrna);
my $n = scalar(@mrna);

my @patro = split(//,$patro);
my $m = scalar(@patro);

my $s = 0;
my $c = 0;
my $i = 1; 
    
while ($s <= $n-$m){

    $i = $coin[$i];

    $x = $x + 1;
	
    while ($i >= 0 && $i < $m && $mrna[$s+$i] eq $patro[$i]){
	$i = $i + 1;
	$x = $x + 1;
    }
	
    if ($i == $m){
	$c = $c + 1;
	$i = $i - 1;
	$s = $s + ($i - $coin[$i]);
	$i = 1;
    }else{
	if ($s == 0){
            if ($i == 0){
		$s = 1;
		$i = 1;
	    }else{
		$s = $i - $coin[$i];
	    }
	}else{
	    if ($i == 0){
		$s = $s + 1;
		$i = 1;
	    }else{
		$s = $s + ($i - $coin[$i]);
	    }
	}
    }
}

This algorithm is based on going through the sequence, where we want to check if the patterns is unique or not, comparing with the pattern. At the moment of modifying the movement, if the pattern matches exactly or not, the program uses the information from the next table. In this way, the program achieves moving the pattern over the sequence enough to save the sequence positions that are not necessary to compare and also, start the next comparison in a fix pattern position (saving pattern positions that doesn't need to be compared neither). For a more detailed explanation, see program code.

Pattern generation

The program is capable of opening the FASTA file of mRNA by itself and generates all the possible patterns, with the Perl function substr, for a fix length. In this way, is sure that the program analyzes all the subsequences susceptible to be unique in mRNA.

$inici = 0;
$llargada = 14;
while ($inici <= $n - $llargada){
    $patro = substr($mrna, $inici, $llargada);
    *****
    *****
    $inici = $inici + 1;  
}

Automatic strategy of modifying the length

Once the program has checked unique patterns for a fix length in mRNA, there are two possible options:

If the program doesn't find unique pattern, adds one unit to the length and and repeats the search of unique subsequences from mRNA another time.

If the program finds unique patterns, this unique patterns, their movement and chromosome where are located, are saved in three result arrays. Then, the program repeats the search of unique subsequences but for a inferior length (one unit less). Moreover, in this case the program doesn't search from the mRNA, but from the unique patterns find it in the previous length, because any unique pattern of an inferior length (for example 13) will be situated inside one of the unique patterns found with a length of 14.

Considerations:

If the program has gone to an inferior length and finds more patterns, it continues going to inferior lengths. At the moment that doesn't finds unique patterns, keeps the results that has saved previously in the three result arrays. And logs off the iteration with the evaluation of $tinc_patro_unic as true (1), because enters to "elsif" detailed below.
The other option is that the program has gone to a superior length. If it doesn't find patterns, continues going to superior lengths. At the moment that finds unique patterns, saves them in the three result arrays. As the previous length will be always less than the actual (because it has incremented the length), the program logs off the iteration also evaluating $tinc_patro_unic variable as true (1) in the second "if" detailed below.

if (scalar(@unics) > 0) {
    @resultat = @unics;
    @desplacaments_resultat = @desplacament;
    @cromosoma_resultat = @cromosoma;

    if ($llargada_anterior < $llargada) {
        $tinc_patro_unic = 1;

    } else {
        $llargada = $llargada - 1;
	@possiblespatrons = @unics;
    }

} elsif (scalar(@resultat) > 0) {
    $tinc_patro_unic = 1;

} else {
        $llargada =$llargada = $llargada + 1;
}