Methods

The aim of this project was to identify in silico all the selenoproteins in the genome of Chlorocebus sabaeus. In order to find the putative regions coding for selenoproteins we used an homology-based approach to scan the genome for members of the known eukaryotic selenoprotein families.


Methods summary

The computational pipeline adopted was the following: first we extracted all the selenoporteins from Homo Sapiens and Macaca mulatta as our queries. Using blast, we identified genomic regions in Chlorocebus sabaeus' genome with high similarity to codify for selenoproteins. Second, we extracted from the genome the subsequence corresponding to each hit. Next, we ran gene-structure prediction programs such as GeneWise and Exonerate, which yielded both the sequence and the structure of the potential protein-coding genes. Finally, we compared the predicted translated sequence to the initial protein query using T-COFFEE, a sequence alignment software. Using this approach we were able to determine coding regions from known selenoproteins. In parallel, we ran Selenoprofiles to scan for selenoproteins in the genome of Chlorocebus sabaeus. This tool also relies on homology but it achieves more accuracy by using an alignment profile instead of a single query. Therefore, it was used to confirm the results. Moreover, we further characterized the predicted selenoproteins by determining the presence of SECIS elements in the 3'-UTR region.
Finally we followed a similar approach to search for the machinery responsible for the translation of selenoproteins. But, in this case, we did not search for SECIS elements. In the same vein, we scanned the genome for the presence of the selenocysteine tRNA using the software tRNAscan-SE.

Genome

The genome of Chlorocebus sabaeus was sequenced and assembled by the Vervet Genomics Consortium. The whole sequence is splitted up in chromosomes and it can be found in the following link (GenBank Assembly ID: GCA_000409795.1)

Queries

An homology-based approach requires a priori information of known selenoproteins. It is feasible to think that the phylogenetically closer the query and the target organism are, the more accuracy the analysis achieves. Therefore, the most appropriate organism to use as query is the one philogenetically closest to Chlorocebus sabaeus. In this case, the closest organism with annotated selenoproteins is Macaca mulatta. However, human selenoproteins are more studied and better annotated. Hence, we complemented the search with a second scan using the selenoproteins from Homo Sapiens. Besides, we searched for the non-mammalian selenoproteins using Selenoprofiles. All the mammalian selenoproteins and SECIS elements that we used in our analysis can be found in SelenoDB 2.0, a comprehensive database for information on eukaryotic selenoproteins. In some cases the selenoproteins were not well annotated in SelenoDB and then we used the selenoproteins that can be found in NCBI.

In most of the cases, there were annotated several isoforms of a single selenoprotein and there were discrepancies between NCBI and SelenoDB. Therefore, in order to reduce the number of querys and to select the most appropriate query, we used the following criteria in the given order:

  1. The coding selenoprotein is the one that shows highest similarity between different species.
  2. The coding selenoprotein exhibits the typical structure of a coding gene with exons and introns. The structure is conserved between closely related species.
  3. The coding selenoprotein has significative expression levels in EST database.

BLAST

BLAST (Basic Local Alignment Search Tool) is the most frequently used tool for comparing primary biological sequence information, such as proteins or nucleic acids. It is based on a heuristic algorithm that searches for local similarity between a query sequence and a sequence database. Depending on the type of query and the type of database, the selection of the specific flavour of blast is different. In our analysis we used tblastn and blastn. tblastn compares a protein query sequence against a nucleotide sequence dynamically translated in the six possible reading frames. blastn compares a nucleotide query sequence against a nucleotide database sequence. More information about BLAST and its flavours can be found in the official website

Once the protein queries were selected, we used tblastn to find genomic regions highly likely to encode a protein similar to the query. An important parameter to take into consideration in blast is the expectation value (E-value) of a given hit, which is the number of hits one can “expect” to see only by chance when searching a database of a particular size. We set blastto retrieve only those hits with an E-value lower than 1x10^-5 (1 out of 100.0000 to find the hit by chance).

The use of blast in the UNIX terminal requires, apart from the installation of the program, the previous creation of a genome's database. This can be accomplished by running a tool called formatdb. This program formats the database (in FASTA format) to a binary file, so it can be searched by BLAST.

formatdb -i genome.fa -p F -n db

  • -i: input file, genomic sequence in fasta format
  • -p: file type, F correspond to nucleotide
  • -n: output name of the database.

More information about this program can be obtained in the following link

The executable (blastall) accepts many arguments and options to provide an optimal search. The ones that we used were the following:

blastall -p tblastn -i query_file.fa -d genome_file -o query_file.blast -m 9 -e 0.00001

  • -p: specifies the type of search (blast flavour)
  • -i: specifies input query file
  • -d: specifies the target database
  • -e: E-value cut-off. Default is 10. Only hits with E-value smaller than the specified will be reported
  • -m: speficies output alignment view. The value 9 is to set a tabular format with comment lines and post search sorting.
  • -o: specifies the output file

There are other options which can also be taken into account, such as the gap opening penalty, the gap extension penalty, the mismmatch penalty or the threshold for HSP, among others. More information about blastall is available in the following website

Blast search reports an output file which contains the following information for each one of the hits found:

  • Query id: The query sequence id (e.g. SPP00000081_2.0)
  • Subject id (scaffold): The matching subject sequence id (e.g. gi|511782574|gb|CM001960.1|)
  • % identity: percentage of identity between a subsequence of the query and the corresponding subsequence found in the database (e.g. 94.61%)
  • Alignment length: the length of the alignment between a subsequence of the query and the corresponding subsequence found in the database (e.g. 167)
  • Mismatches: number of mismatches of the alignment (e.g. 7)
  • Gap openings: number of opening gaps in the alignment (e.g. 1)
  • q.start: start position of the subsequence “blasted” in the query file (e.g. 23)
  • q.end: final position of the subsequence “blasted” in the query file(e.g. 187)
  • s.start: start position of the subsequence found in the target genome (e.g. 32693020)
  • s.end: final position of the subsequence found in the target genome (e.g. 32693520)
  • e-value: expectation value of the given hit. (e.g. 2e-57)

Extracting the sequences of interest from the genome: fastaindex, fastafetch, fastasubseq

Once we had obtained the Blast output, it was used to extract the genomic subsequence corresponding to each hit. Of note, the interval between the minimum q.start and the maximum q.end of each hit was elongated at both ends by 100.000 nucleotides. This method worked when a single hit per scaffold was found. However, we usually obtained more than one hit per scaffold. Hence, we merged all the hits belonging to the same scaffold and treated them as a single unit. Using that methodology we ensured that the subsequence extracted from the genome did not only contain all the hits found by a given query but also a significant content of surrounding nucleotides.
The following programs were executed in the UNIX shell:
Fastaindex is a utility which comes with the exonerate package and its purpose is to index a genome for faster downstream analysis.

fastaindex genome.fa genome.index

  • first argument: specifies the target database
  • second argument: specifies the output index

Fastafetch is another utility from exonerate package. Its function is to extract the sequence of a particular scaffold in a given genome.

fastafetch genome.fa genome.index scaffold_name > scaffold.fa

  • first argument: specifies the target database
  • second argument: specifies the indexed database
  • third argument: specifies the name of the scaffold to be extracted as a fasta file from the multifasta genome (e.g. gi|511782573|gb|CM001961.1|)

The extracted fasta file corresponds to a complete scaffold, which is too large for subsequent analysis. Hence, we extracted the subsequence corresponding to the hit obtained with blast. Fastasubseq is also included in theexonerate package and its purpose is to extract fasta format subsequences from a fasta file. The program's input requires the target scaffold file (in fasta format), as well as the starting position and the length of the subsequence we aim to extract.

fastasubseq scaffold.fa initial_position length > subsequence.fa

  • first argument: specifies the scaffold fasta file that we have extracted with fastafetch
  • second argument: specifies the starting position
  • third argument: specifies the length of the sequence to be extracted

As previously said, it is highly recommended to extent the target subsequence beyond the start and the final position of the blast hit. This is because blast's local alignment algorithm rarely retrieves protein queries as a unique hit, instead, each exon or each highly identical portion between the query and the genome is reported as an individual hit. In consequence, if this extension is not performed, some poorly-conserved exons might be lost. We assumed that a 50.000-100.000 extension is enough to capture all the elements of the potential coding gene.

Gene prediction: Exonerate

Once the subsequence is extracted, this is used for gene prediction using two tools: exonerate and genewise.
Exonerate is a highly complete generic tool for pairwise sequence comparison as well as exon prediction. It can align sequences using many alignment models in either an exhaustive or heuristic way. In our study we used this program to align the protein query sequence against the DNA sequence previously extracted using fastasubseq. With this method it can recognize the exons and the introns in the DNA subsequence, and therefore it can generate an annotation of the gene.
Exonerate can also be executed using the UNIX command line:

exonerate -m p2g --showtargetgff -q query.fa -t extracted_subsequence.fa > output.gff

  • - m Specify alignment model type. p2g means protein to genome
  • -- showtargetgff: Include GFF output on target in results
  • - q: Specify query sequences as a fasta format file
  • - t: Specify target sequences as a fasta format file

The GFF output file contains a table with the predicted gene annotation in terms of exons and introns and the corresponding position in the subsequence fasta file.
The current version of the program (2.2.0)it contains an option to extract the corresponding cDNA sequence (only exons), but it does not work properly. Thus,in order to extract the cDNA sequence based on the information in the GFF file, we ran the following Perl script called: fastaseqfromGFF.pl

fastaseqfromGFF.pl requires indications of the exons from the GFF file:

cat output.gff | egrep -w exon > exons.gff

Then, this information is passed to the script, as well as the subsequence extracted from fastasubseq:

fastaseqfromGFF.pl subsequence.fa exons.gff > sequence_exons.fa

The output file will contain the cDNA sequence corresponding to the exons of the predicted protein. In order to translate the sequence this file can be passed to fastatranslate, also included in the exonerate package:

Fastatranslate -F 1 sequence_exons.fa > protein_sequence.fa

  • - F: indicates the reading frame to translate (1-6)
  • - showtargetgff: Include GFF output on target in results
  • - q: Specify query sequences as a fasta format file
  • - t: Specify target sequences as a fasta format file

Gene prediction: genewise

GeneWise is another tool for gene prediction but, in contrast to exonerate, the algorithm used is exhaustive. It compares a protein sequence to a genomic DNA sequence, allowing for introns and frameshifting errors.
Genewise can be run with the following command line:

genewise -pep -pretty -cdna -gff (-trev) protein_query.fa subsequence.fa > output.gff

  • -pep: to show the protein translation
  • -pretty: to show the pretty ASCII alignment
  • -cdna: to show the predicted cDNA
  • -gff: to show the gene prediction in GFF format
  • -trev: by default genewise scans in the forward strand. This parameter allows the scan in the reverse strand.

More information about genewise and its parameters is available here: http://www.ebi.ac.uk/Tools/psa/genewise/help/

Multiple sequence alignment

Once the candidate proteins are characterized, they should be aligned to its respective query in order to detect changes in the sequence. There are plenty of tools which align two or more sequences, here we used T-COFFEE (Notredame et al, 2000). The program can be executed in the website, or it also can be executed using UNIX terminal:

t_coffee sequence1.fa sequence2.fa

SECIS prediction

As mentioned, all eukaryotic selenoproteins contain a specific stem-loop structure designated as the Sec insertion sequence (SECIS) in the 3'-UTR. This sequence is required for the incorporation of selenocysteine when the ribosome records a UGA codon.
To further characterize the selenoproteins found, we scanned for the presence of putative SECIS elements by two different methods: SECISearch3 and blast.

SECISearch3

The first approach consists on de novo prediction of SECIS elements using SECISearch3. SECIS elements have some conserved features in the primary sequence and the secondary structure which can be used for de novo prediction (Mariotti et al, 2013). SECISearch3 combines the features of Infernal, Covels and its predecessor SECISearch to predict stable SECIS elements with high sensitivity. However, the specificity is low and many hits are reported, so subsequent manual curation is mandatory.
Although the original SECISearch is available as an executable, SECISearch3 has to be executed from the website where it is hosted: http://gladyshevlab.org/SelenoproteinPredictionServer/ or http://seblastian.crg.es/

BLAST for SECIS elements

The second one approach relies on homology between SECIS elements of orthologous selenoproteins.
The conservation of SECIS elements is poor. This is why the de novo prediction is challenging and not highly specific. However Kryukov et al. described that SECIS elements between mammalian orthologous selenoproteins exhibit sequence similarity. Thus, the de novo approach was complemented by a homology-based approach using blastn. The same cut-off E-value of 10-5 used for proteins was established for the SECIS elements search.

Scripts

The following perl program can be used to automatize the described pipeline and to find all the selenoproteins at once. It is the script to rule them all:

Selenoprofiles

Selenoprofiles is another homology-based in silico method to predict selenoproteins. It has been the most valuable tool for selenoproteome characterization in the last two years.
The selenoprofiles pipeline is similar to the one described above. However, the main difference is that Selenoprofiles takes as an input an alignment representing the protein family, rather than a single protein query. With that alignment it builds a position-specific scoring matrix (profile) that is used for psitblastn to find homologous regions in the target genome. Next, it internally runs exonerate and genewise to predict candidate selenoproteins and finally it filters out likely false positives to reduce the number of potential candidates. In summary, given a profile and a target genome, Selenoprofiles identifies all members of the given family in the target genome.
We executed Selenoprofiles from the UNIX command line with the following parameters:

Selenoprofiles RESULTS/protein_name -t genome.fa -s "Chlorocebus sabaeus" -p protein_name

  • -t: target genome
  • -s: specie
  • -p: protein family

More information can be found in the website http://big.crg.cat/services/selenoprofiles

tRNA-scan-SE

All the organisms that codify for selenoproteins must contain the selenocysteine tRNA, which is necessary for their translation. In order to find the selenocysteine tRNA in the genome we used the software tRNAscan-SE, a tool that finds tRNAs in a given DNA sequence in fasta format.
We downloaded the executable and we ran it in UNIX command line with the following arguments:

tRNAscan-SE genome.fa -G -o "tRNA_output.txt"

The output contains all the tRNAs found in the genome. Then we extracted the selenocysteïne (TGA) tRNA by executing egrep:

egrep TGA tRNA_output > SeCys_tRNA.txt

More information about tRNAscan-SE can be found in the following Website