Roderic Guigó, IMIM/UPF/CRG, Barcelona

COMPUTATIONAL GENE IDENTIFICATION


The Problem

The Gene Identification Problem can be formulated as the problem of deducing the amino acid sequences encoded in a given genomic DNA sequence.

Why is the problem relevant?

From an applied standpoint, it is of great relevance given the genome projects underway. Genome sequences, even for higher eukaryotic organisms, are being generated almost automatically. Locating the protein coding genes is the first necessary step in converting the nucleotide sequence of an organism's genome into valuable knowledge about the organism's biology.
From a basic standpoint, the problem is also very relevant. It is the problem of understanding the way genes are specified in the genome. Being able to deduce the amino acid sequences encoded in a given genomic DNA sequence by relying only on the sequence is, after all, what it means to decipher the genetic code.

Why is the problem difficult?

In higher eukaryotic organisms, genes are neither contiguous nor continuous. First, genes coding for different proteins are separated by large intergenic regions that do not code for proteins. Second, a given protein sequence is usually not specified by a continuous DNA sequence: genes are often split into a number (possibly large) of (small) coding fragments known as Exons, separated by (larger) non-coding intervening fragments known as Introns (see figure below). Intronic and intergenic DNA often makes up most of the genome in higher eukaryotic organisms. In the human genome, for instance, only a very small fraction of the DNA, possibly as low as 2%, corresponds to protein coding exons.

the pathway from DNA to protein sequences

In the next sections, we show that although signals exist in the DNA sequence that instruct the cellular machinery along the pathway from DNA to protein sequences, our knowledge of the way such signals are recognized and processed by the cell is still limited, and it is usually impossible to infer the genes encoded in a given DNA sequence by relying only on these signals.

A few types of signals on the DNA sequence are involved in gene specification

The figure below outlines the pathway from DNA to protein sequences in a higher eukaryotic cell. The main steps in this pathway are:

Transcription. The continuous sequence of DNA corresponding to a single gene is copied to an RNA sequence.
Splicing. The primary RNA transcript is spliced to remove intron sequences, producing a shorter RNA molecule, known as messenger RNA (mRNA).
Translation. The mRNA sequence is translated into a protein sequence by a sub-cellular structure known as the ribosome. The ribosome binds to an initiation codon and scans the sequence, synthesizing the amino acid sequence specified by consecutive non-overlapping codons. Scanning of the mRNA proceeds until the ribosome finds one of the three codons that do not specify amino acids (the Stop Codons). At that point, elongation of the amino acid sequence ends, and the final protein product is released.
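The splicing and translation steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the sequence is kept in the DNA alphabet (T instead of U), exon coordinates are given by hand as 0-based, end-exclusive pairs, translation starts at the first ATG, and the function names and the toy sequence are invented for this example.

```python
# Standard genetic code, built from the conventional TCAG ordering of bases.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def splice(genomic, exons):
    """Concatenate the exon segments, dropping everything in between.

    `exons` is a list of (start, end) pairs, 0-based and end-exclusive.
    """
    return "".join(genomic[start:end] for start, end in exons)


def translate(mrna):
    """Translate from the first ATG up to the first in-frame stop codon."""
    start = mrna.find("ATG")
    if start < 0:
        return ""
    protein = []
    # Read consecutive, non-overlapping codons.
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the product
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy gene (hypothetical): two exons separated by one intron (GT...AG).
genomic = "CC" + "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA" + "CC"
exons = [(2, 8), (20, 26)]
mrna = splice(genomic, exons)
protein = translate(mrna)
```

Here the exon coordinates are supplied by hand; finding them from the sequence alone is precisely the gene identification problem.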

Signals (short strings of nucleotides) exist in the DNA sequence that instruct the cellular machinery during these steps: the Promoter Elements and the Transcription Termination Motif during Transcription, the Donor Sites and Acceptor Sites during Splicing, and the Initiation Codon and the Stop Codon during Translation. Although eventually recognized by the cellular machinery through intermediate RNA molecules, the signals involved in gene specification are all ultimately encoded in the primary DNA sequence.

DNA signals involved in gene specification are apparently ill-defined and highly unspecific

DNA signals involved in gene specification are ill-defined, lack generality, and are highly unspecific. With currently available detection methods, it is usually impossible to distinguish the signals truly processed by the cellular machinery from the much more frequent ones that are apparently non-functional. As a consequence, attempting to predict gene structure by processing DNA sequence signals alone often results in a computationally intractable combinatorial explosion of potential products.
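To get a feeling for this combinatorial explosion, the following Python sketch counts how many different genes could be assembled from a set of candidate exons. It is a deliberately simplified model, not a gene prediction method: a "gene" is taken to be any non-empty chain of non-overlapping exons, and frame compatibility and the distinction between first, internal and last exons are ignored. The function name is invented for this example.

```python
def count_gene_assemblies(exons):
    """Count non-empty chains of non-overlapping candidate exons.

    `exons` is a list of (start, end) pairs, 0-based and end-exclusive.
    Dynamic programming: chains_ending_at[j] is the number of chains
    whose last exon is exon j.
    """
    exons = sorted(exons)
    chains_ending_at = []
    for j, (start_j, _end_j) in enumerate(exons):
        # Exon j alone, plus every chain ending in an exon that finishes
        # before exon j starts.
        compatible = sum(
            chains_ending_at[i] for i in range(j) if exons[i][1] <= start_j
        )
        chains_ending_at.append(1 + compatible)
    return sum(chains_ending_at)


# Twenty mutually compatible candidate exons already give over a million chains.
twenty_exons = [(100 * i, 100 * i + 50) for i in range(20)]
n_assemblies = count_gene_assemblies(twenty_exons)
```

With n mutually non-overlapping candidates the count is 2^n - 1, so even a modest number of false positive sites makes exhaustive enumeration impractical.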

In the figure below, we plot the potential start sites, acceptor sites and donor sites that can be identified along the 2000 bp long sequence containing the three-exon beta-globin gene. Sites have been identified using a Position Weight Matrix, with a cutoff chosen so that no true sites are missed. From such signals, hundreds of potential exons can be constructed, which in turn can be combined into millions of potential genes. The cell apparently finds its way precisely through this puzzle, and only one (or a few) of such genes appear to be actually specified.
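A Position Weight Matrix scan of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the procedure used to produce the figure: the matrix holds log-odds scores against a uniform background, a pseudocount avoids zero frequencies, and the four aligned "donor sites" used for training are invented toy data rather than real splice sites.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds Position Weight Matrix from aligned example sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(
                (column.count(base) + pseudocount) / total / background
            )
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, cutoff):
    """Return (position, score) for every window scoring at least `cutoff`."""
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        score = sum(pwm[j][sequence[i + j]] for j in range(width))
        if score >= cutoff:
            hits.append((i, score))
    return hits


# Toy training set (hypothetical aligned sites, for illustration only).
toy_donor_sites = ["AAGGTAAGT", "CAGGTGAGT", "AAGGTAAGA", "TAGGTAAGT"]
pwm = build_pwm(toy_donor_sites)

sequence = "TTTTCAGGTAAGTCCCC"
all_hits = scan(sequence, pwm, cutoff=float("-inf"))
best_position, best_score = max(all_hits, key=lambda hit: hit[1])
```

Lowering the cutoff so that no true site is missed lets many more windows through, which is the source of the large number of candidate sites described in the text.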

Information other than sequence signals can be used to infer the genes potentially encoded in a DNA sequence.

Information from a number of sources, other than the sequence signals recognized by the cellular machinery, can be used to infer the genes encoded in a DNA sequence. Roughly, this information can be categorized as follows:

Intrinsic. Information derived only from the query sequence itself (without reference to other known sequences).
- Signal. Information derived from sequence signals. Sequence signals can be not only identified, but also scored. A wide variety of methods exist to score and locate sequence signals, collectively known as Search by Signal methods.
- Content. Information derived from the fact that coding regions in the DNA exhibit peculiar statistical properties. A wide variety of coding measures (or statistics) have been developed over the years. They are collectively known as Search by Content methods.
Extrinsic. Information derived by comparing the query sequence with other known sequences (in the public databases). Usually these are coding sequences, such as amino acid sequences or ESTs, but they can also be regulatory sequences, such as promoters, or even intrinsically non-coding sequences, such as repeats.

In the figure below, we plot how this additional information can help us to localize the exons of the beta-globin gene.

Sequence signals (Acceptor sites in blue, Donor sites in red, and Initiation Codons in green) can be not only identified, but also scored. Although functional sequence patterns show sequence conservation, this conservation is usually not absolute. Among the possible patterns, some are more likely than others, and this "likelihood" can be measured.
A number of coding sequence statistics can be computed along the sequence. Here, we have computed a measure of the 3-base periodicity on a sliding window along the sequence. Coding regions are known to be characterized by a strong 3-base periodicity. Thus, highly periodic regions (higher in red along the line plotted below the predicted sites) are more likely to be coding.

All this intrinsic information can be used to score the predicted exons, and eventually filter out unlikely candidates.
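One possible way to measure 3-base periodicity on a sliding window is sketched below. The text does not say which statistic was used for the figure, so the choice here is an assumption: the spectral power at frequency 1/3, summed over the indicator sequences of the four bases and divided by the window length. The function names, window size and step are likewise illustrative.

```python
import cmath


def periodicity_3(window):
    """Spectral power at frequency 1/3, summed over A, C, G, T indicators.

    For each base, positions where it occurs contribute a unit vector
    rotated by one third of a turn per position; a base that prefers one
    codon position makes these vectors add up instead of cancelling.
    """
    power = 0.0
    for base in "ACGT":
        component = sum(
            cmath.exp(-2j * cmath.pi * k / 3)
            for k, b in enumerate(window)
            if b == base
        )
        power += abs(component) ** 2
    return power / len(window)


def sliding_periodicity(sequence, window=120, step=3):
    """Return (start, score) for each window along the sequence."""
    return [
        (i, periodicity_3(sequence[i:i + window]))
        for i in range(0, len(sequence) - window + 1, step)
    ]


# A perfectly periodic toy sequence versus a homopolymer.
periodic_score = periodicity_3("GCA" * 40)
flat_score = periodicity_3("A" * 120)
profile = sliding_periodicity("GCA" * 60)
```

Real coding regions are far less regular than the toy repeat, so in practice the score is compared between windows rather than against a fixed threshold.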

The position of Stop Codons can be used to delimit long ORFs where exons can occur, and to establish the frame in which a predicted exon can be read. Only frame-compatible exons (exons matching adjacent colors) can be assembled into functional genes, which substantially limits the number of possible candidates.
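The use of Stop Codons and reading frames described above can be sketched as follows. The code looks at the forward strand only, and the frame convention is an assumption made for this example: the frame of an exon is the number of bases at its start that belong to a codon begun in the previous exon. All function names are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_positions(sequence, frame):
    """Positions of stop codons read in the given frame (0, 1 or 2)."""
    return [
        i
        for i in range(frame, len(sequence) - 2, 3)
        if sequence[i:i + 3] in STOP_CODONS
    ]


def open_regions(sequence, frame, min_length=1):
    """Stretches (start, end), end-exclusive, free of stops in this frame."""
    regions = []
    start = frame
    for stop in stop_positions(sequence, frame):
        if stop - start >= min_length:
            regions.append((start, stop))
        start = stop + 3  # resume right after the stop codon
    if len(sequence) - start >= min_length:
        regions.append((start, len(sequence)))
    return regions


def frames_compatible(upstream_length, upstream_frame, downstream_frame):
    """Can the downstream exon continue the reading frame of the upstream one?

    The upstream exon leaves (length - frame) % 3 bases of an unfinished
    codon; the downstream exon must start with the bases that complete it.
    """
    leftover = (upstream_length - upstream_frame) % 3
    return downstream_frame == (3 - leftover) % 3


toy = "ATGAAATAGCCCTGAGG"
stops_by_frame = [stop_positions(toy, frame) for frame in range(3)]
regions_frame0 = open_regions(toy, 0)
```

An exon candidate is then kept only if it lies inside an open region of its frame, and two candidates are joined only if `frames_compatible` holds for them.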

Database matches are extremely useful, since they constitute very strong evidence of the existence of the genes, but they do not always exist. In this case, matches to amino acid sequences in the SWISSPROT database in the three frames have been plotted in purple, matches to cDNA sequences in the dbEST database in green, and matches to repetitive sequences in the REPEAT database in orange.

Up-to-date electronic bibliographies on Computational Gene Identification are maintained by

Wentian Li from Rockefeller University at http://linkage.rockefeller.edu/wli/gene/list.html
Misha Gelfand from the Russian Academy of Sciences at http://www-hto.usc.edu/software/procrustes/fans_ref/

Roderic Guigo (i Serra), IMIM and UB. rguigo@indy.imim.es