Annotation of selenoproteins
in Spermophilus dauricus

Methods & Materials

The aim of this project was to identify in silico all the selenoproteins in the genome of Spermophilus dauricus, whose selenoproteins have never been annotated before, by using an homology-based approach.

Queries

It is reasonable to think that the phylogenetically closer one specie and the target organism are, the more accurate the analysis will be. Therefore, the most appropriate organism to use as query is the one philogenetically closest to Spermophilus dauricus. In this case, the closest organism with annotated selenoproteins is Ictidomys tridecemlineatus. However, human selenoproteins are more studied and better annotated, so we complemented the search by analysing both organisms.

The selenoproteins of Ictidomys tridecemlineatus were obtained from the source:

www.selenodb.org

Spermophilus dauricus genome

Spermophilus dauricus genome was needed in order to identify its selenoproteome.
Such genome was provided by the University Pompeu Fabra, and can be found in the following directory:

/cursos/20428/BI/genomes/2017/Spermophilus_dauricus/genome.fa

Return

Manual annotation

The procedure allows only to work with one query at the same time, and try to identify whether there is a hit in the genome. A hit means a sequence in the genome that is significantly similar to the query that we are analyzing. This process has to be repeated for each query. First of all, all the needed programs were exported for the analysis:

module load modulepath/goolf-1.7.20
module load BLAST+/2.2.30-goolf-1.7.20
module load Exonerate/2.2.0-goolf-1.7.20
module load T-Coffee/11.00.8cbe486-goolf-1.7.20
export PATH=/cursos/20428/BI/bin:$PATH
export PATH=/cursos/20428/BI/soft/genewise/x86_64/bin:$PATH

export WISECONFIGDIR=/cursos/20428/BI/soft/genewise/x86_64/wise2.2.0/wisecfg/

Tblastn

To locate the genomic region where our selenoproteins can potentially be, and since we were given protein queries, tBLASTN (Basic Local Alignment Search Tool) program was used. This program searches translated nucleotide databases using a protein query. Spermophilus dauricus genome was, therefore, translated from nucleotides to proteins, and Ictidomys Tridecemlineatus protein queries were used to find potential alignments. The command used in shell was:

tblastn -query query1.fa -db /cursos/20428/BI/genomes/2017/Spermophilus_dauricus/genome.fa -out homologia1.blast

Then, a list of hits ordered by the score was presented. When a hit has to be chosen, the most important value to take into account is the E-Value, that has to be as low as possible. This parameter shows the probability that something will happen by chance. A considered good value is below 0,05. Other important parameter is the score, that shows the probability of the query to be in the sequence. This value has to be as high as possible. The last important parameter is to check the identity of the two sequences in the alignment. We need a high percentage of identity to accept the hit. After selecting the significant hits, it is important to register the information referent to the position of the hit, that is, where it starts and ends in the scaffold, and whether it is the direct or reverse chain.

Fastafetch

The significant hits obtained have to be extracted from the genome. First of all, it is important to export the region sequence where the scaffold is present by using Fastafetch, which provides a new file with the regions where the hits are included. By doing this, it is necessary to run Fasta index, a command used to arrange data in the most suitable way, which is easier for the program.

fastafetch /cursos/20428/BI/genomes/2017/Spermophilus_dauricus/genome.fa /cursos/20428/BI/genomes/2017/Spermophilus_dauricus/genome.index "KZ294443.1" > hit1_query1.fa

KZ294443.1 is an example that refers to the name of a hit.

Fastasubseq

From the region obtained in fastafetch, a command is applied to further delimit the region where the hits are located:

fastasubseq hit1_query1.fa start length > subseqHIT1_query1.fa

It is important to write the start position and the length of the hit. But not the same values as the ones that were obtained in tblastn, as it has to be ensured that the sequence contains the gene of interests. For that reason, elongation of 50000 nucleotides 5’ upstream and 50000 3’ downstream has to be done.

Exonerate

The next step is to predict the gene of the genome sequence and align it with the sequence of the query. The command is:

exonerate -m p2g --showtargetgff -q query1.fa subseqHIT1_query1.fa --exhaustive yes > HIT1_query1.exonerate

Egrep

This program allows to extract the exonic part from the exonerate file, using the command:

egrep -w exon HIT1_query1.exonerate > HIT1_query1.exonerate.gff

Fastaseq from GFF

The next step is to extract the cDNA sequence that encodes for the final protein, using the next command:

fastaseqfromGFF.pl subseqHIT1_query1.fa HIT1_query1.exonerate.gff > HIT1_query1.nt.fa

Fastatranslate

This program translates the cDNA sequence obtained in the step before into aminoacids, with this command:

fastatranslate -f HIT1_query1.nt.fa -F 1 > HIT1_query1.predicted.fa

Genewise

GeneWise compares a protein sequence to a genomic DNA sequence, allowing for introns and frameshifting errors. Therefore it predicts the protein directly from the subseq output, which allows to skip some of the steps performed before. The command used is the following:

For contigs on the forward strand:

genewise -pep -pretty -cdna -gff query.fa subseq_contigs.fa > genewise_bo.fa

For contigs on the reverse strand:

genewise -pep -pretty -cdna -gff -trev query.fa subseq_contigs.fa > genewise_bo.fa

T-coffee

Finally, this program compares the aminoacids obtained from the genome of Spermophilus Dauricus with a query that was used as reference. The command is:

t_coffee query1.fa HIT1_query1.predicted.fa > HIT1_query1.final

SECIS prediction

The t-coffee confirmed the alignment between the sequence of the genome and the query of Ictidomys Tridecemlineatus, but to confirm that this genome sequence refers to a selenoprotein, it is important to determine whether there is a SECIS element in this genome region. This is achieved using SECISearch3 and Seblastian, programs that are based on the SECIS and selenoproteins prediction.

Data acquisition

In order to automatize all the process, we first ran a parsing program script_1, in order to organize all the queries in specific directories. Having done that, we then ran script_2, which first performed tblastn. Then, it processed the results by ruling out all hits with e-Value smaller than 0.01. Also all hits that were found in the same scaffold and had the same direction were joint together into a unique hit with the minimum of the starts and the maximum of the ends. Finally, subseq, exonerate, egrep, fastaseqfromGFF, fastatranslate and tcoffe were called as described above for every hit that had been selected.

Script_3 performed genewise-based protein prediction for all the hits that were selected before. All 3 programs were modified from M. Schikora C et al, 2016. Finally, we used script_4 and script_5 to create the tables and images in the discussion.

Data analysis

Genewise and exonerate predicted a lot of proteins from the tblastn results. In order to find the correct selenoprotein or cys homologue, we manually analysed the different output obtained. We focused our attention on hits with small e-values and great identities. We only selected hits that had a really good alignment with the query in both Exonerate and Genewise when possible. Also we made sure to check that a Seci in 3’UTR existed in a plausible position if the predicted protein contained a Sec residue. Finally, the prediction performed by Seblastian helped us decide which selenoprotein was the correct one.