HOMOLOGY SEARCH: U12 INTRONS

4. Scripts

       Three different scripts have been written to carry out with this project. They have been applied in different parts of the protocol to do the data analysis easier.
exons2introns.pl
       This scrips is used to obtain predicted U12 introns from predicted U12-intron-flanking exons, that are stored in a GFF file.
       GFF is a format for describing genes and other features associated with DNA, RNA and Protein sequences. GFF files format consists on nine columns separated by tabs: chromosome, source, feature, start, end, score, strand, frame and attributes or comments. The symbol # at the begging of the line denotes comment lines.
       Our script reads every line of the input (GFF file with predicted U12-intron-flanking exons). If it finds a line starting with the symbol # (a comment) it goes to the next line directly. After skipping all the comments, the script splits the line into an array (@gff) using the tabs (every position of the array is one column). We want to read two lines at the same time, so we have to create a new array (@previousgff). @previousgff reads the first line and @gff the second one. When the script is reading two lines at the same time ($llegint > 0) we will be able to create the predicted U12 intron from the two predicted U12-intron-flanking exons. We only want to create a new intron if we are in the same chromosome and in the same gene, so we should compare column 9 ($previousgff[8] or $GROUP vs $gff[8]). If we are in the same gene and the flanking exons are predicted U12-intron-flanking exons (third column should contain the word 'U12'), we can then print out our predicted U12 intron. Moreover, we have to check the symbol (- or +) of the strand because the extremes can be reversed. Once we have proved it, we cant print the output (the predicted U12 intron) into an array (@introngff), conserving the GFF format. In the second column we will print the kind of predicted U12 intron (atac vs gtag). The start of the intron would be the end of the left flanking exon plus one and its end would be the start of the next exon minus one. The new score would be the average of the score of the two predicted U12-intron-flanking exons.
introns2sequence.pl
       With this second script we can obtain the sequence of a region of the genome from a GFF file. Concretely, we want to concatenate 50 pb of the sequence of every predicted U12-intron-flanking exons. With the output we will be able to run different kinds of BLAST.
       The first thing you have to do when you apply this script to a GFF files is tipping the name of the output file (name.fa) that would contain the nucleotide sequence (100 pb) and its localization (the predicted U12 intro where it comes from). The script stores the input file in a variable ($entrada) using a filehandler. Then, introns2sequence.pl reads every line of the input, chomps it and stores it into an array (@introngff) using the tabs, as the previous script. The next step is to calculate the localization (start position and end position) of the exon flanking sequences. The first half of the sequence would be between the start position of the predicted U12 intron minus one and 50 pb back ($up). The other half would be the end of the predicted U12 intron plus one and 50 pb more ($down). Once the script knows what it have to find, we apply this information to the whole human genome ($path shows the localization of a human nucleotide database) and using chr_subseq we can pick up the sequence (50 pb in $up and 50 pb in $down). To get the 100 pb all together, we concatenate $up and $down into a new variable ($seq). Finally, we can print out the result with FASTA format. The identification of the sequence will be the number of the gene, the chromosome where it is located and the start position and end position of the predicted U12 intron. At the beginning we should put the '>' symbol.
woundedknee.pl
       With this last script we can classify the output of the BLAST analysis (predicted U12 introns) into five different categories: alignment across junction, two exons, one exon flank right, one exon flank left and no-hits. The output will consist on different files containing every category for the different BLAST analysis (blastPnr, blastPEST, blastCnr and blastCEST). We will identify every output file with: name of the category plus the name of the input file without the extension (.txt). To do this we have to split the name of the input ($entrada) using '.' and use the first component of the array (@outname).
       The script will read every line of the input file. It will skip all the lines with comments (starting with #) and all the lines starting with a word and when $n = = 1, it means that we have already chosen the first hit of the BLAST of every query. We only would consider the first line of every query because it has the highest score.
       The main idea of the script is reading the lines of the BLAST output, to choose the best hit of every query (the first subject as it is those with the highest score and lowest e-value), and classify it in order of predicted U12 category. After splitting every line of the input and defining two arrays in order to work with two lines at the same time (@blast and @previousblast) we will estimate the quality of the matching by comparing its e-value ($previousblast[$evalue]) and the number of gaps ($previousblast[$gaps]). If these values are lower than 1 and lower than 4 respectively, we will analyse the first subject (first hit). We will classify the subject in order of some parameters that define five different categories (possibilities of BLAST matching). In all cases the % of identity should be lower than 97.0 % ($identity).

Two exons: to define this category we have to operate with two lines at the same time because our query matches with two different regions of the same subject (about 50 pb in one region and 50 pb in another region). So, there is no alignment across the junction of the two pieces of the flanking exons. We will consider a rang of about 10 nucleotides as possible error margin.

       In all other categories we will work with only the first line (@previousblast) because the query only matches with one region of a subject. The second line (@blast) will be another subject.

Alignment across junction: it means that almost all the query (about 100 pb) matches with the subject in only one region (it finds the concatenation of the two exon flanking sequences in a subject all together). We will consider an error margin between 36 and 65 or higher because some exons can be smaller than 50 pb.
One exon flank-left: in this case only one left flanking exon sequence (about 50 pb) matches with the subject. The other may be an intron sequence as we are working with EST and 'nr' databases. We consider an error margin of about 10 nucleotides.
One exon flank-right: the same case as before but in the case of the right flanking exon.
No hits found!: BLAST could not be able to found any good hit. It means that the two predicted regions as predicted U12-intron-flanking exons should be in reality intron sequences that are not represented in EST and 'nr' databases. In this group may be we can add those BLAST results that were skipped because of their low e-value and/or high number of gaps.

       After assignin the best hit of a query in one category we assign $n = 1 as not to print any result with the next hits of the same query (remember that we skip all the lines that start with a letter and when $n = 1). Moreover we have to undefine @out as not to print the same result twice. Finally, we have to assign 0 to $n and $llegint to start again with the next query.
       The output files will consist on four columns of an array (@out): query, subject, category and identity. We use the command 'format' to organize in columns the output.

HOME