Discovering Gavialis gangeticus selenoproteins

Gene prediction

SECIS prediction

Automation

 

Materials and Methods

One of the goals in this work is to computationally identify all selenoproteins in Gavialis gangeticus genome. In order to complete our objectives we use an homology based technique. We have used as queries, sequences of known selenoproteins and searched regions of the genome in which crocodile has homologues.

Obtaining Gavialis gangeticus genome

The genome was facilitated by the university and can be found at the directory:

/cursos/BI/genomes/2015/Gavialis_gangeticus

Obtaining queries

As mentioned in the introduction, we used as preferred reference genome G. Gallus. In SelenoDB we can find all the selenoproteins that have been noted for that genome including the gene sequence, promoter sequence, transcript sequence, protein sequence... We have used protein sequences of the different selenoproteins to search for homologues in our genome. In case we could not find some of the selenoproteins in G. Gallus genome we would recurre to the human genome because it is the best studied to date.

Search regions of the genome containing selenoproteins: BLAST

BLAST (Basic Local Alignment Search Tool) is a program that uses an heuristic algorithm and compares sequences of interest with other known sequences through alignment so that we can find similar regions. Furthermore, it calculates the statistical significance of the alignments it finds.

Depending on the type of blast, we could compare aminoacids or nucleotides but in this project we have used tblastn which compares and aligns aminoacid and nucleotide sequences. More specifically we have compared protein sequences from known selenoproteins (chicken genome) with nucleotide sequences from the Gavialis gangeticus genome.

For each alignment we obtained multiple parameters:

Scaffold: location of the alignment in crocodile’s genome

Identity: how similar are the aligned sequences provided with a score

Start and End position: start and end position of the contained hit

E-value: describes how many times you can expect to find an alignment as good as that one in the database by chance

The command used is:

$blastall –p tblastn –i query.fa –d databaseBLAST –o outputfile

-p: specifies the type of blast that we want to use

-i: specifies the input file. In our project we use protein sequences from the chicken genome as queries

-d: specifies the database that will be used for the search. In our project we use the G. gangeticus genome provided by the university

In this project we use as threshold E-value < 0,0001

Sequence of interest: FASTAINDEX, FASTAFETCH, FASTASUBSEQ

In order to extract the sequence found in the genome region we need different programs:

  • Indexing the genome
  • We focus on the scaffold with the best alignment provided by BLAST. To do this, we must execute fastaindex separating the gavial genome segments and indexing them.

    $ fastaindex genome.fa genome.index

    The first argument, genome.fa, is the input

    The second argument, genome.index, is the output

  • Getting the whole scaffold
  • Once we have the indexed genome we have to extract the whole scaffold i.e. the nucleotide sequence of the gavial genome which corresponds to the alignment.

    In order to do that, we need fastafetch program

    $ fastafetch genome.fa genome.index nom_scaffold > nomscaffold.fa

    - First argument: input file (genome of interest provided by the university)

    - Second argument: indexed genome

    - Third argument: name of the scaffold with the best alignment

    Therefore, we create a file with the whole scaffold sequence which contains the nucleotide sequence corresponding to the alignment obtained through BLAST.

  • Obtaining the sequence of interest of the scaffold
  • We need to take into consideration that the alignment made by BLAST only considers the coding regions of the genome of interest. However, the gene also contains introns and, therefore, we need to take a wider range of nucleotides.

    $fastasubseq nomscaffold.fa posicio_inicial llargada > subsequencia.fa

    - First argument: input (file containing the whole sequence of the scaffold)

    - Second argument: starting position in the scaffold sequences

    - Third argument: length of the sequence we want to extract

    We get a file with the sequence corresponding to the alignment obtained through BLAST.




Gene prediction: EXONERATE

After obtaining the sequence of interest of the scaffold we can use exonerate to predict the nucleotide sequence structure of the gene. In order to do that, exonerate aligns the query (protein sequence extracted from SelenoDB) with the sequence extracted from the scaffold of gavial's genome.

$ exonerate –m p2g --showtargetgff –exhaustive yes –q query.fa –t subsequencia.fa | egrep –w

exon > exons.exonerate.gff

-m: mode. In this case we use genome to protein, in which we compare an aminoacid sequence with the genome of interest.

-showtargetgff: results will be shown in GFF format (genome coordinates).

-exhaustive: to extend the positions of the genome sequence

-q: query sequence (protein sequence from SelenoDB) in fasta format

-t: target (nucleotide sequence of the scaffold)

The addition of | egrep -W exon > exons.exonerate.gff will create a file with exons of the predicted gene.

Obtaining the coding sequence of a gene in nucleotides

Afterwards, we need to get the nucleotide sequence of the predicted protein through the GFF file coordinates and the whole scaffold sequence.

In order to do that we need to execute fastaseqfromGFF program.

$ fastaseqfromGFF.pl subsequencia.fa exons.exonerate.gff > exons.fa

- First argument: fasta file with the scaffold sequence

- Second argument: GFF file with the coordinates delimitating the exons of the predicted gene

Translation of the nucleotide sequence of the problem gene to protein

Then, we need to translate the predicted nucleotide sequence to an amino acid sequence (protein sequence).

The program needed to do that is fastatranslate.

$ fastatranslate –f exons.fa –F 1 > proteina_predita.fa

-f: specifies the file containing the predicted nucleotide sequence

-F: specifies the reading pattern of translation

Comparison of two protein sequences: T-COFFEE

Finally, both the predicted selenoprotein sequence and the known selenoprotein (the one used as query) need to be compared.

Using T-coffee program we get an alignment of both proteins in which we can observe the conserved and non-conserved aminoacid residues.

SECIS prediction

As explained before, a SECIS element is an RNA structural motif that in eukaryotes is located in 3'-UTR segment of selenoprotein genes.

In order to better characterize the predicted selenoproteins, we proceeded to predict de novo SECIS elements using SECISearch3, which allows obtaining DNA sequences that correspond to SECIS elements of different species.

Automation

We have created a bash script which imports all necessary packages, creates the folders the program will use, runs the notation programs and also calls for filters (You can find the script here).

Furthermore, the bash script uses some filters and a file matcher written using Phyton programming language (You can find the script here).

Download the full package here.