THUNNUS ORIENTALIS

METHODOLOGY



The primary objective of this project was to identify and annotate all the selenoproteins in the genome of Thunnus orientalis. In order to find the regions codifying for selenoproteins we used a homology-based approach using the genome of related eukaryotic organisms that have been studied for selenoproteins. First of all, to start the process we obtained the genome of Thunnus orientalis, which is fragmented in many contigs. We have predicted the selenoproteins of this bony fish from its contigs.

Queries

The main reason why we chose the zebrafish as our reference specie, was because of its closer phylogenetic relationship to Thunnus orientalis and because we consider that its proteins are better annotated than other related species.

We have obtained the majority of proteins from SelenoDB, a specialized selenoprotein database. In many cases we observed that the protein that we were looking for wasn’t well named, its name figured as ‘NONE’ inside its protein family. For this reason, we did a Blast using a database known as ENSEMBL in order to try and find the correct name of the protein.

We also found out that certain proteins were not well annotated in the SelenoDB database where we initially searched for them. One of the main reasons was that the initial codon did not start with the amino acid methionine (Met). In such cases we searched for the protein sequence in ENSEMBL to make sure that there was no error in the annotation of the protein blasting the initial protein in this data base. In case that the output of this database consisted in a sequence starting with Met we took this sequence as a reference and re-started the automatization process to obtain the archives necessary to proceed our analysis. However, in some cases we didn't obtain a sequence from zebrafish starting with Met so as our last resource we used the human protein to make our comparisons.




Automatization process

After having obtained the sequence of the genome of Thunnus orientalis and the different selenoprotein queries, the following procedure involved several steps that were automatized in two informatic programs named donttunablastme and MIDGARD.

To start running the program in a same folder, we needed the following documents:

   -  dontunablastme: it gives the different PATHS to the system to allow the use of the
      different programmes and that they finally call midgard.

   -  midgard: has all the process by which proteins from the indicated genome will be
      predicted.

   -  A file where we had a column list of the proteins that we wanted to use as queries.

   -  Files, in fasta format, of the different proteins named as we have indicated in the
      previous file, and that will be used as a query during the program.


To start running the program, we will open the terminal and order the following command:

                  $ ./dontunablastme                  


This order will ask the user the file where it is located the list of proteins that should be processed.

Once the program finishes, the outputs will be the different files, named as the protein query, and inside we will find all the necessary files to do our protein prediction.

However, the different primordial steps of the program will be described to make sure it is clearly understood.



Change U symbol for an X

The first step of our program involved a letter change in the protein query sequence, so that it changed an U for an X. This procedure was done to make sure there were no problems when using other programs such as Exonerate or Genewise to proceed with the alignment of the sequences.

Moreover, in some proteins we have detected erroneous symbols at the end of it, reason why we have eliminated them.






Blast (Basic Local Alignment Search Tool)

Blast is a frequently used tool when comparing primary biological sequence information, such as proteins or nucleic acids, because it is capable of finding regions of local similarity between sequences. In our analyses we used tblastn, that compares a protein query sequence against a nucleotide sequence. By doing it we found genomic regions that could encode a protein similar to the query. The most important parameter that we based on to select the best hit alignment was the E-value score that can be interpreted as the expected number of distinct alignments that we would obtain by chance in a database search. The higher the E-value is, the less significant the match is. We chose those hits with an E-value lower than 0,01.

The use of blast in the UNIX terminal requires the installation of the program and the previous creation of a genome’s database reason why it is necessary a tool called formatdb. Some command line options that we used are the following:

   -  i: input file for formatting genomic sequence in fasta format (protein query).
   -  p: specifies the type of search (tblastn).
   -  d: specifies the target database (Thunnus orientalis).
   -  o: specifies the output file.
   -  m: it is used to set a tabular format


To obtain the necessary data to continue with the program, we used this command:
Scheme:

   $blastall -p tblastn -i <query> -d  <genome> -o <blast> -m8 -e 0.01   

Our command:

$ blastall -p tblastn -i $protname -d
   /cursos/BI/genomes/vertebrates/2014/Thunnus_orientalis/genome.fa -o $nomblast -m8 -e 0.01   


To analyze Blast in a more visual form we added the following command:

Scheme:

    $ blastall -p tblastn -i <query> -d  <genome> -o <blast> -e0.01    

Our results:

$ blastall -p tblastn -i $protname -d  
    /cursos/BI/genomes/vertebrates/2014/Thunnus_orientalis/genome.fa -o full_$nomblast -e 0.01    


The Thunnus Orientalis genome is annotated in very small fragments (contigs), this fact has caused us some problems when starting our analysis. When using Blast, it found us many different hits of the same protein in different contigs, so when using the other programs we used all the contig that contained one of the hits to make sure that we would find the whole protein to be able to reconstruct it. This also enabled us to search for other hits that could be possible gene copies.

However, we must bare in mind that there are different selenoprotein families that contain more than one protein, so when doing this procedure it is not surprising to find out some hits that belong to other proteins of the same family, associating the hit according to the greatest e-value provided by Blast to assign to the pertinent protein and compare the homology with the other proteins of the same family.




Extracting the sequences of interest
from the genome

Once obtained the blast output we were able to extract the genomic subsequence corresponding to each hit, by using fastafetch. As it has been mentioned, our contigs were very short. So instead of elongating both ends of the subsequence to ensure the presence of the whole gene, including fragments of the gene that are poorly conserved and we were not able to find by running Blast, we used the whole contig to make sure that the different programmes of gene prediction are able to function with sequences of such characteristics. With high probability, and without elongation, all our hits still contained all the gene of interest, reason why we skipped the process and did not use the program named fastasubseq.

Scheme:

$ fastafetch <genome> <genome> <contig_name>   >  <contig>     

Our command:

  $ fastafetch /cursos/BI/genomes/vertebrates/2014/Thunnus_orientalis/genome.fa $nomindex  
$llistacontigs[$contig] > contig_$nl1\_$protname



Gene prediction, extraction and translation

To predict the gene and align the protein query sequence against the DNA sequence, our program used a tool known as exonerate, which is efficient for pairwise sequence comparison as well as exon prediction, it recognizes exons and introns in the DNA sequence. This program can also align sequences using many alignment models in either an exhaustive or heuristic way. Some commands that we used to extract information more precisely were the following:

    -   -m p2g: means from protein to genome.
    -   -- showtargetgff: shows the target in fasta format gff.
    -   -q: specify query sequences as a fasta format file.
    -   -t: specify target sequences as a fasta format file.
    -   --exhaustive yes: it does a more accurate search.

Scheme:

exonerate -m p2g --showtargetgff -q <query> -t <contig> --exhaustive yes > <exonerate>     

Our command:

  exonerate -m p2g --showtargetgff -q $protname -t contig_$nl2\_$protname --exhaustive yes >
$nomexonerateprevi



We were very interested in the output resulting from the exonerate program. However, our program also gave us an output from genewise that we would use in case exonerate didn’t give us any prediction. In such cases, we would continue with the process manually.

Some commands that we used to extract information more precisely were the following:

   -  cdna: output is shown in cDNA.
   -  pep: output is to show protein translation and splicing frameshifts.
   -  pretty: output is to show pretty ascii output.
   -  gff: output is printed in Gene Feature Format file.
   -  both: input is analyzed in both strands.

Scheme:

     $ genewise -cdna -pep -pretty -gff -both <query> -t <contig> > <genewise>     

Our command:

    $ genewise -cdna -pep -pretty -gff -both $protname contig_$nl6\_$protname > $nomgenewise    



To extract the exon sequence in a fasta format file the package that was used was fastaseqfromGFF. The output file generated contained the cDNA nucleotide sequence of the corresponding exons and in order to translate this file into an amino acid sequence our program included fastatranslate to do the process that has been described.

Scheme:

     $ fastaseqfromGFF.pl <contig_file>> <exonerate_file_with_exons> > <cDNA_predicted>    

Our command:

    $ fastaseqfromGFF.pl contig_$nl3\_$protname $nomexonerate > genepredict_$nl3\_$protname    



Scheme:

     $ fastatranslate -f <cDNA_predicted> -F 1 > <protein_predicted>     

Our command:

     $ fastatranslate -f "genepredict_$nl4\_$protname" -F 1 > "protpredict_$nl4\_$protname"   





Comparison between the query protein
and the predicted protein

Once all the candidate proteins were characterized, our program also included the execution of t-coffee to subjectively compare the sequences between the predicted protein and our query, to determine if our protein was well predicted, as well as to detect those cases in which our proteins were splitted in different contigs.

Scheme:

     $ t_coffee <query> <protein_predicted> > <t_coffee>     

Our command:

    $ t_coffee $protname protpredict_$nl5\_$protname > $nomtcofee    



Protein assignation

Once our program had run, we needed to associate the hits (and its related files) we had found to a determined protein. To do this we assumed that the best hit was the one with the lowest E-value.

In the analysis of protein families we found that, due to the intrafamily homology, the same contigs were associated to different proteins inside the family. In those cases, we analyzed each protein individually and we related it to the contig with the best E-value, excluding this contig of the list of possible contigs for the other proteins of the family (usually, the contig that was related to a protein had worst E-values for the other proteins of the family). Following these steps we could associate each of the analyzed proteins to the contig/s that contained its sequence.

In some cases we found that, when all the proteins of a family were associated to their contigs, there were other hits related to them that we could not relate to any other protein known inside this family. What we did was to name them with a “~” after the name of the closest protein they had, in order to analyze the origin of this match. When analyzing TR selenoproteins we found 3 of this cases, so we named them “~A”, “~B” and “~C” respectively.

When all the associations were done, we proceeded to analyze the Sec aminoacids (or their corresponding Cys), the gene length and the number of exons and introns from the exonerate files of each protein. In those cases in which we did not have this file we made our analysis from the genewise file.

After this we looked at the alignment obtained from t-coffee software between our predicted protein and the initial protein (from zebrafish or human), with which we could observe the good prediction of the protein, or, in those cases in which the protein was fragmented into various contigs, the division of the protein in these contigs. In all cases, we did the valoration in a subjective way.




SECIS prediction

After having found the selenoproteins in our genome we searched for the presence of SECIS elements using the web SECISearch3 . When searching in SECISearch3 we must take into account that the program predicts the SECIS elements according to the sequence that forms the input, but it is necessary to analyze each of them individually to ensure that the element predicted is located in the 3’UTR region, and if it is in another location we would reject it.

In many cases the contig did not contain the whole selenoprotein sequence, and for this reason we could not be sure that negative results in SECIS prediction were completely certain, because maybe the SECIS element was located outside the analyzed contig and we could not predict it, but it existed. To ensure those results we followed the next procedure: we assumed that the contigs were sorted following their position in the tuna genome, which means that the fragmented proteins could be found in consecutive contigs. With this assumption in mind, in those cases in which we did not find any SECIS elements, we used fastafech to extract the next 3 consecutives contigs after the last one that matched our protein, and we analyzed them using SECIsearch3. This methodology allowed us to predict SECIS elements in some proteins in which we had not found them.




Selenoprofiles

To ensure that our protein prediction was well performed we run the Selenoprofiles program . The main purpose of selenoprofiles is the accurate search of a set of protein families, in a known genome, from homologue genes. One of the main virtues of this program is its flexibility, but previously it needs the installation of some programs. Selenoprofiles was initially developed to search selenoproteins and therefore as a consequence it is capable of interpreting the UGA codon as the 21st amino acid while other programs are not designed for this item.

$ ./Selenoprofiles results_folder -t
    /cursos/BI/genomes/vertebrates/2014/Thunnus_orientalis/genome.fa -s "Thunnus orientalis"     




tRNA-UGA prediction

To try and find the tRNA of a particular codon we used two programmes named Aragorn and tRNAscan-SE. Both of them are capable of predicting in a genome all the present tRNA.

     $ aragorn -v -s -d -c -l -j -a -q -rn -w -ifro <min>,<max> -t -gc -tv -seq -br -fasta -fo -o <aragorn> <genome>     

    $ tRNAscan-SE <genome> -G -o <tRNAscan-SE>     


Once having the output of tRNAscan-SE programmes we filtered all the tRNAs that were specific for the Sec amino acid with the following command:

    $ cat <tRNAscan-SE> | egrep tRNA-seC > <tRNA_Sec_output> | cut -f 1 > <contigs_tRNAscan-SE>     


In the case of the output generated by the Aragorn program, its interpretation was a bit complicated because in the file there was no indication of the localization of the hit. This fact was the main reason why we used blastn program to find the sequence.

    $ blastall -p blastn -i <query> -d <genome> -o <output_blast> -m8 -e 0.00000000000000000001    

    $ blastall -p blastn -i tRNA-Sec_query_1 -d
/cursos/BI/genomes/vertebrates/2014/Thunnus_orientalis/genome.fa
-o contig_tRNA-Sec_1 -m8 -e 0.00000000000000000001     


Finally, we recompiled the names of the contigs that appeared in the sequence with the following order:

    $ cat *contig | cut -f 2 | sort | uniq > contigs_amb_de_tRNA-seC