THRICHECHUS MANATUS LATIROSTRIS

METHODOLOGY



In order to find the regions codifying for selenoproteins in the genome of Trichechus manatus latirostris we used a homology-based approach using the genome of Loxodonta africana (elephant), the closer mammal in phylogenetics to our organism that have been studied for selenoproteins. First of all, we were provided with the genome of Trichechus manatus latirostris, which is fragmented in many contigs. We predicted the selenoproteins from this contigs.



Queries

We have obtained the majority of proteins from SelenoDB, a specialized selenoprotein database. In some cases we did not obtain the elephant sequence in SelenoDB, therefore we used the human protein to make our comparisons.



Change U symbol for an X

First of all, in case that we were looking for a selenoprotein, we had to change the U symbol (which meant a TAG codon) for an X. This procedure was done to make sure there were no problems when using other programs such as Exonerate to proceed with the alignment of the sequences. Moreover, in some proteins we have detected erroneous symbols at the end of it, reason why we have eliminated them.




Blast (Basic Local Alignment Search Tool)

Blast is a very used tool to compare primary biological sequence information, such as proteins or nucleic acids, as it is able to find regions of local similarity between sequences. In our analyses we used tblastn, which compares a protein query sequence against a nucleotide sequence. By doing that we found genomic regions that could encode a protein similar to the query. The most important parameter that we based on to select the best hit alignment was the E-value score that can be interpreted as the expected number of distinct alignments that we would obtain by chance in a database search. The higher the E-value is, the less significant the match is. We chose those hits with an E-value lower than 0,1.


The use of blast in the UNIX terminal requires the installation of the program and the previous creation of a genome’s database, reason why it is necessary a tool called formatdb. Some command line options that we used are the following:


  • i: input file for formatting genomic sequence in fasta format (protein query).

  • p: specifies the type of search (tblastn).

  • d: specifies the target database (Trichechus manatus latirostris)

  • o: specifies the output file.

  • m: it is used to set a tabular format


To obtain the necessary data to continue with the program, we used this command:
Scheme:

   $blastall -p tblastn -i <query> -d <genome> -o <blast>

Our command:

$blastall -p tblastn -i $protname -d
   /cursos/BI/genomes/2015/Trichecus_manatus_latirostris/genome.fa -o $nomblast   


The Blast file contained many different hits of the same protein in different contigs so we choose the smallest and greatest query and we sum 50000 positions and subtracted 50000 to make sure that we would find the whole protein to be capable of reconstructing it. This also enabled us to search for other hits that could be possible gene copies.

However, we have to take into account that in some selenoprotein families (e.g. GPx, DI) we can find that different proteins share the same hits. We follow a procedure where we compare the hit with the best E-value (provided by Blast) and we compare the homology with the protein of the elephant, but also with the other proteins of the same family.




Extracting the sequences of interest
from the genome

Once obtained the blast output we were able to extract the genomic subsequence corresponding to each hit, by using fastafetch.


Scheme:

$fastafetch <genome> <genome> <contig_name>   > <contig>     

Our command:

$fastafetch /cursos/BI/genomes/2015/Trichechus_manatus_latirostris/genome.fa
/cursos/BI/genomes/2015/Trichechus_manatus_latirostris/genome.index $namehit > $namehit.fa


We use "genome.index" file, were the Trichechus manatus latirostris genome is organised by scaffolds.




Hit selection

We have to extract the genomic regions that probably codify for our queries and we therefore want to evaluate. The first step is to delimit the hit to obtain the most bounded region that contain the gene of interest. This step facilitates the process because we avoid to work with unnecessarily long sequences.

Blast output indicates us the beginning and final position for every alignment at aminoacid and nucleotides level of the genome of Trichechus manatus latirostris. We have to bear in mind that alignment is not perfect and it may be that part of the gene is not in the region aligned with the protein. For this reason, it is very important to take upstream and downstream positions of that gene. That way we make sure that the gene of interest is found within the delimited region.

We have taken 50000 nucleotides upstream and 50000 nucleotides downstream to make sure we don't leave any protein out. To obtain a file with the hit for every scaffold of interest we use the program Fastasubseq .


Scheme:

$fastasubseq $namehit.fa <start> <length> > <genomic output>.fa


Our command:

$fastasubseq namehit.fa <start> <length> > $namehit.genomic.fa





Gene prediction, extraction and translation

To predict the gene and align the protein query sequence against the DNA sequence, our program used a tool known as exonerate , which is efficient for pairwise sequence comparison as well as exon prediction, it recognizes exons and introns in the DNA sequence. This program can also align sequences using many alignment models in either an exhaustive or heuristic way. Some commands that we used to extract information more precisely were the following:


  • -m p2g: means from protein to genome.

  • -- showtargetgff: shows the target in fasta format gff.

  • -q: specify query sequences as a fasta format file.

  • -t: specify target sequences as a fasta format file.

  • --exhaustive yes: it does a more accurate search.


Scheme:

exonerate -m p2g --showtargetgff -q <query> -t <contig> --exhaustive yes > <exonerate>     

Our command:

$exonerate -m p2g --showtargetgff -q $nameprotein.fa -t $namehit.genomic.fa > $namehit.exonerate.out



In some cases, if we had problems with the final alignment we did the same command with the exhaustive option:


$exonerate -m p2g --showtargetgff -q selS.fa -t genomicJH594625.1genomic.fa --exhaustive yes > JH594625.1exonerate.out



To extract the lenght and the number of the predict exons from exonerate we used the followind command:


egrep -w exon $namehit.exonerate.out > $namehit.pred.gff



To extract the exon sequence in a fasta format file the package that was used was fastaseqfromGFF. The output file generated contained the cDNA nucleotide sequence of the corresponding exons and in order to translate this file into an amino acid sequence our program included fastatranslate to do the process that has been described.


Scheme:

$fastaseqfromGFF.pl <contig_file> > <exonerate_file_with_exons> > <cDNA_predicted>

Our command:

$fastaseqfromGFF.pl $namehit.genomic.fa $namehit.pred.gff > $namehit.pred.nuc.fa



Scheme:

     $fastatranslate -f <cDNA_predicted> -F 1 > <protein_predicted>     

Our command:

$fastatranslate -F 1 $namehit.pred.nuc.fa > $namehit.pred.aa.fa





Comparison between the query protein
and the predicted protein

Finally we of t-coffee to subjectively compare the homology between the sequences of predicted protein and our query, to determine if our protein was well predicted. In order to see our selenocysteine aligned, we changed the predicted stop codons, which are indicate with “*”, for an U.


Scheme:

     $t_coffee <query> <protein_predicted> > <t_coffee>     

Our command:

$t_coffee selS.fa $namehit.pred.aa.fa





Protein assignation

Once our program had run, we needed to associate the hits (and its related files) we had found to a determined protein. To do this we assumed that the best hit was the one with the highest homology with Loxodonta africana protein and has de Sec align with a Sec or Cys.

In the analysis of protein families we found that, due to the intrafamily homology, the same contigs were associated to different proteins inside the family. This way, without doing a phylogenetic tree, we only could predict those who are exclusives for one alignment.

When all the associations were done, we proceeded to analyse the Sec aminoacids (or their corresponding Cys). We checked in Exonerate file if the U in the protein predicted is a TGA codon and therefore a probably selenocysteine.

After this we looked at the alignment obtained from T-coffee between our predicted protein and the initial protein (from elephant or human), with which we observe the good prediction of the protein, or, in those cases in which the protein was fragmented into various contigs, the division of the protein in these contigs. In all cases, we did the validation in a subjective manner.




SECIS prediction

An additional checking that our protein is a selenoprotein is to search for the presence of SECIS elements using Seblastian. This website supplies an openly accessible tool to predict eukaryotic selenoproteins and SECIS elements in nucleotide sequences. In selenoprotein transcripts we encounter a peculiar stem loop structure called SECIS element (SEC insertion sequence), which act as a signal for recoding this UGA to Sec.

The program predicts the SECIS elements according to the sequence that forms the input, so it is necessary to analyse each of them individually to ensure that the element predicted is located in the 3’UTR region. In case that is located in another part of the genome we would reject it.

Nonetheless, in some cases the contig did not comprise the whole sequence of the selenoprotein so we cannot guarantee that if the program does not find a SECIS is because it does not exist.

Sometimes, Seblastian predicted more than one SECIS element. When this happened, we looked at its position in the genome and we chose the one that is located closer to 3’UTR of the contig or protein in question. If there is no SECIS in this location, we interpreted as if no SECIS was predicted.




Genome absolute coordinates

This program predicts coordinates from the region selected in the program Fastasubseq. That way, we add the start number introduced in the command Fastasubseq to all results as shown in discussion.