Gene prediction

Obtaining the genome

Obtaining the queries

BLAST

Removing sequence

EXONERATE

Coding sequence

T-COFFEE

SECIS Prediction

Automation

 

Gene prediction

One goal of our work is to identify computationally all selenoproteins present in the genome of Balaenoptera acutorostrata, a kind of whale. To achieve this, we use a technique based on homology, i.e., through amino acid sequences of known selenoproteins (queries) we will look for regions of the genome where the whale has encoded homologues of these proteins.

Obtaining Balaenoptera acutorostrata genome

The genome of Balaenoptera acutorostrata is found at the following directory:

/cursos/BI/genomes/vertebrates/2014/Balaenoptera_acutorostrata

Information about the genome can be found in NCBI that includes a set of databases relevant to biotechnology and biomedicine.

Obtaining queries

In our project, selenoproteins used to find the homologous sequences come from Tursiops truncatus, the bottlenose dolphin, the closest species that has an annotated selenoprotein.
There were also comparisons among all selenoproteins of Homo sapiens as it is the mammal that has recorded more. And also comparisons with Equus cavallus were considered.
All mammalian selenoproteins that have been used to search queries may be found in the SelenoDB 2.0 database. This page aims to provide high quality annotations of genes, proteins and SECIS elements.

Search regions of the genome containing selenoprotein: BLAST

Obtaining the database

The genome of interest has been provided in multifasta format and it must be transformed into a database to be able to perform BLAST. For this reason formatdb is run.

$ formatdb –i genome.fa –p F –n fitxerdesortida

-i: input, i.e.: genome of interest provided by the university.
-p: question - asking whether it is a database of protein, as it is not, it is written an "F" for false.
-n: output, i.e.: name of the database.

The BLAST (Basic Alignment Search Tool) s an algorithm that makes possible the comparison of the problem sequence with other sequences to find regions of similarity. Compare nucleotide sequences or amino acid sequences stored in different databases and calculates the statistical significance of the alignments.
tblastn, has been used. tblastn is a type of BLAST that compares amino acid sequences (recorded selenoproteins) and nucleotide sequence (whale genome) in order to get different alignments. For each alignment were obtained different parameters:

  • Scaffold: region of the genome where the alignment is localized.
  • Similarity: similarity between two aligned sequences.
  • Start i End position: start position and end hit.
  • E-value: ddescribes the probability of finding an alignment determined by chance. The program made to automate the search only considers hits with an E-value less than 1·10-4.

The command to run BLAST is:

$blastall –p tblastn –i fitxerquery.fa –d nombbddBLAST –o fitxerdesortida
-p: BLAST program, in this case tblastn.
-i: query aligned with the genome of interest.
-d: genome database format. 
-o: output, output file.

Removing the sequence of interest: FASTAINDEX, FASTAFETCH, FASTASUBSEQ

Below it is explained how are executed different programs that accompany EXONERATE allowing the extraction of the genomic sequence in the found region.

Indexing the genome of B. acutorostrata

From this point on, we focus on the scaffold that contains the best alignment and that has provided BLAST. To get this scaffold we execute fastaindex separating the minke whale genome segments and indexing them as scaffolds.

$ fastaindex genome.fa genome.index

- First argument: the input file is specified, i.e. the genome of interest provided by the university.
- Second argument: the output file is specified, i.e. the genome of interest indexed.

Getting the whole scaffold of interest

Fastafetch is another program to extract the nucleotide sequence of the genome scaffold from the indexed genome in multifasta format.

$ fastafetch genome.fa genome.index nom_scaffold > nomscaffold.fa

- First argument: the input file is specified, i.e. the genome of interest provided by the university.
- Second argument: the genome of interest specified indexed.
- Third argument: specifies the name of the scaffold where the best alignment has been found.

This way you get a file in multifasta format with all the nucleotide sequence containing the selected scaffold.

Obtaining a certain sequence of the scaffold

Then run the program for only fastasubseq sequence scaffold that interests us, which is the alignment obtained with BLAST.

$fastasubseq nomscaffold.fa posicio_inicial llargada > subsequencia.fa

- First argument: the fasta file is specified with the whole sequence of the selected scaffold.
- Second argument: specifies the starting position.
- Third argument: specifies the length of the sequence we want to extract.

The alignment the BLAST has made only considers the coding regions of the genome of interest. When taking the sequence of the scaffold, we should take into account that there are introns and therefore we take a wider range of nucleotides. The program made to automate the search expands to 100,000 nucleotides at 5' and 100,000 nucleotides at 3'.

Gene prediction: EXONERATE

From the query and the previously extracted genomic region is generated an annotation of the gene that will create the protein through the Exonerate program.
This program aligns the query and the genomic region, and predicts the functional elements contained in the latter, i.e., exons and introns. Exonerate do not recognize the letter U in the query, so we have to switch to an X.

Exonerate can be run using the following UNIX command:

$exonerate –m p2g --showtargetgff –exhaustive yes –q query.fa –t subsequencia.fa | egrep –w

exon > exons.exonerate.gff

-m: mode, in this case p2g means "genome to protein", i.e. comparing a sequence of amino acids (selenoprotein of reference) with the genome of interest (minke whale).
-showtargetgff: displays the results in a GFF file format that gives information about the coordinates of the genome.
-exhausitve: to extend (or not) the positions of the genome sequence.
-q:  query sequence in fasta format.
-t: target, in this case the nucleotide subsequence obtained with fastasubseq.

In addition, there is an addition of a pipe that allows you to copy exonic gene sequences with the egrep command –w exon.

The resulting file will have a GFF format and contain all exonic sequences of the gene that is predicted.

Obtaining the coding sequence of a gene in nucleotides

With the fastaseqfromGFF program we get the nucleotide sequence of the predicted protein from the initial genome through the GFF file coordinates, where the exons are, and the file containing the subsequence. The final file will contain exons of the gene in fasta format.

$fastaseqfromGFF.pl  subsequencia.fa exons.exonerate.gff > exons.fa

Translation of the nucleotide sequence of the problem gene to protein

The fastatranslate program translates the nucleotide exonic sequence to its corresponding amino acid sequence.

$fastatranslate –f exons.fa –F 1 > proteina_predita.fa

-f: specifies the file containing the exons in nucleotides.
-F: specifies the reading pattern of the translation. In this case: reading frame 1.

We get a file containing the amino acid sequence of the predicted protein in fasta format.

Comparison of two protein sequences: T-COFFEE

With the T-COFFEE program two protein sequences are compared: the predicted selenoprotein and the known selenoprotein.
We get an alignment of both proteins where the conserved and non-conserved amino acid residues are observed.

The program can be run from the website or by using the following command in the UNIX terminal:

$t_coffee sequencia1.fa sequencia2.fa

SECIS Predicion

As explained in the introduction, a SECIS element (SElenoCysteine Insertion Sequence) is a structural motif of mRNA that in eukaryotes is in the 3'UTR segment of the gene in selenoproteins.

o better characterize the predicted selenproteins, we proceeded to predict their SECIS elements using the SECISEARCH3 website, which allows obtaining DNA sequences that correspond to recorded SECIS elements of different species. In this case, the predictions are de novo, and are made from the sequences obtained through the fastasubseq command.

To choose the appropriate SECIS element, it was taken into consideration the proximit if the final exon of the predicted protein, the gene direction and the grade of the given SECIS element.

Automation

The following program conducted with Perl has been used to automate the search for selenoproteins in Balaenoptera acutorostrata. The script can be found by pressing on the following link.

The used bash can also be found by pressing on the following link. Bash allows the exportation of all the necessary information for the posterior execution of the different programes, and also it runs the automation program.