Promoter prediction

Introduction

Position Weight Matrix (PWM)

A PWM is a motif descriptor that attempts to capture the intrinsic variability characteristic of sequence patterns.
A PWM is usually derived from a set of aligned, functionally related sequences.
The matrix records how many times each nucleotide has been observed at each position. Because these values are absolute frequencies, we normalize the PWM: to obtain the relative frequencies, we divide the value at each position of the matrix by the number of sequences used to build the matrix.
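
The counting and normalization steps described above can be sketched as follows. This is a minimal illustration in Python, not the Promfinder source (which is written in Perl); the function names and the toy alignment are hypothetical.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"

def count_matrix(aligned_sequences):
    """Return one {nucleotide: count} dict per column of the alignment."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must all have the same length")
    matrix = []
    for column in zip(*aligned_sequences):
        counts = Counter(column)
        matrix.append({nt: counts.get(nt, 0) for nt in NUCLEOTIDES})
    return matrix

def normalize(matrix, n_sequences):
    """Divide every count by the number of sequences used to build the matrix."""
    return [{nt: count / n_sequences for nt, count in column.items()}
            for column in matrix]

# Hypothetical toy alignment of four TATA-like sites (illustrative data only).
sites = ["TATAAA", "TATATA", "TATAAA", "CATAAA"]
counts = count_matrix(sites)
pwm = normalize(counts, len(sites))
print(pwm[0])  # relative frequencies at the first position
```

Each column of the normalized matrix sums to 1, so matrices built from different numbers of sequences become comparable.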

Position Weight Matrix: TRANSFAC.

Sequence in FASTA format

We use sequences in a predetermined format, the FASTA format. A FASTA record begins with a single-line description of the sequence, followed by the lines of sequence data. The description line is distinguished from the sequence data by a greater-than symbol (">") in the first column.
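
A minimal sketch of reading this format, assuming plain-text input with one or more records. The function name `parse_fasta` and the example records are hypothetical, and the code is illustrative Python rather than Promfinder's own Perl.

```python
def parse_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    identifier, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if identifier is not None:
                yield identifier, "".join(chunks)
            identifier, chunks = line[1:], []
        elif identifier is not None:
            chunks.append(line.upper())
    if identifier is not None:
        yield identifier, "".join(chunks)

# Hypothetical two-record input (illustrative data only).
example = """>seq1 hypothetical example record
ggtataaaagc
GGGCGG
>seq2
ACGT
""".splitlines()

records = list(parse_fasta(example))
for name, sequence in records:
    print(name, len(sequence))
```

Sequence lines are joined and upper-cased so that later matrix lookups do not depend on the case used in the file.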

To obtain sequences in FASTA format: NCBI.

Objective

This program was created to predict possible gene promoter regions along a DNA sequence using the Position Weight Matrix method.

MATERIAL

Position Weight Matrix: TATA box and/or GC box and/or others (TRANSFAC)

Problem Sequences in FASTA format

Operating System: Linux (UNIX)

Programming Language: Perl

The Program: PROMFINDER

To run Promfinder you need:

-options

file_sequence.fa

file_matrix.txt

file_sequence.fa can include one or more sequences.

file_matrix.txt can contain one or more matrices.

The program is structured in three sections:

1. Initializing the program:

Declare the options of the program. We also declare all the variables we'll use.
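
As an illustration of this initialization step, the sketch below declares the four options documented for Promfinder (-v, -m, -s, -t) with Python's standard argparse module. It is not the Perl code of Promfinder; the program name `promfinder_sketch`, the order of the two positional file arguments and the help texts are assumptions.

```python
import argparse

def build_parser():
    """Declare the options; flag names follow the Promfinder option list."""
    parser = argparse.ArgumentParser(prog="promfinder_sketch")  # hypothetical name
    parser.add_argument("-v", action="store_true",
                        help="report each step of the execution")
    parser.add_argument("-m", action="store_true",
                        help="show information about the matrices")
    parser.add_argument("-s", action="store_true",
                        help="show information about each sequence")
    parser.add_argument("-t", type=float, default=0.8, metavar="x.x",
                        help="score threshold (default: 0.8)")
    # Assumed order: sequence file first, then matrix file.
    parser.add_argument("file_sequence")
    parser.add_argument("file_matrix")
    return parser

# Parse an explicit argument list so the example does not read sys.argv.
args = build_parser().parse_args(
    ["-s", "-t", "0.9", "file_sequence.fa", "file_matrix.txt"])
print(args.s, args.v, args.t, args.file_sequence, args.file_matrix)
```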

2. Processing the sequence:

The program reads the sequence or sequences and runs each routine on each sequence. The first step is to capture the identifier (">.......") of each sequence so that it can be identified in the results. The next step is to build an array from the sequence; the sequence is then ready to be scanned with the PWM.
2.1) PWM Processing:
This section is repeated for each sequence as many times as there are PWMs in the matrix file.

a) Open file_matrix.txt.
b) Matrix normalization.
The result of this operation is the relative frequency of each nucleotide at each position of the matrix.

c) For each matrix we calculate the consensus sequence and its score.
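
Step c) can be sketched as below. The page does not state how Promfinder computes the score, so this example assumes the consensus is the most frequent nucleotide at each position and that its score is the sum of those maximum relative frequencies. The function name and the toy matrix are hypothetical, and the code is Python rather than Promfinder's Perl.

```python
def consensus(pwm):
    """Return (consensus sequence, consensus score) for a normalized matrix.

    Assumption: the score is the sum of the highest relative frequency
    at each position. Ties are resolved in A, C, G, T order.
    """
    sequence = "".join(max("ACGT", key=column.get) for column in pwm)
    score = sum(max(column.values()) for column in pwm)
    return sequence, score

# Hypothetical normalized matrix for a six-position TATA-like motif.
pwm = [
    {"A": 0.0,  "C": 0.25, "G": 0.0, "T": 0.75},
    {"A": 1.0,  "C": 0.0,  "G": 0.0, "T": 0.0},
    {"A": 0.0,  "C": 0.0,  "G": 0.0, "T": 1.0},
    {"A": 1.0,  "C": 0.0,  "G": 0.0, "T": 0.0},
    {"A": 0.75, "C": 0.0,  "G": 0.0, "T": 0.25},
    {"A": 1.0,  "C": 0.0,  "G": 0.0, "T": 0.0},
]

consensus_sequence, consensus_score = consensus(pwm)
print(consensus_sequence, consensus_score)  # TATAAA 5.5
```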
2.2) Evaluating candidates:
Each candidate has a length equal to the number of positions in the PWM.
The program only shows those candidates which surpass the threshold chosen by the user or assigned by default.

a) Estimate the score of each candidate.
b) If the score is high enough, when the execution of Promfinder ends the program will show the initial position, the final position, the score and the sequence of each candidate.
c) Close file_matrix if there are no more matrices. If there are more matrices, go back to point b) of section 2.1.
d) Close file_sequence if there are no more sequences. If there are more, go back to the beginning of section 2.
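
The candidate evaluation can be pictured as a sliding window over the sequence. The sketch below is a Python illustration under stated assumptions, not Promfinder's Perl implementation: a candidate's score is taken to be the sum of the relative frequencies of its nucleotides divided by the best score the matrix allows (so it lies between 0 and 1, which fits the default threshold of 0.8), and positions are reported 1-based and inclusive. The function name, matrix and sequence are hypothetical.

```python
def scan(sequence, pwm, threshold=0.8):
    """Return (start, end, score, candidate) for windows that surpass threshold.

    Assumptions: score = window sum / best possible sum; 1-based positions.
    """
    width = len(pwm)
    best = sum(max(column.values()) for column in pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        raw = sum(column.get(nt, 0.0) for column, nt in zip(pwm, window))
        score = raw / best
        if score > threshold:
            hits.append((start + 1, start + width, round(score, 3), window))
    return hits

# Hypothetical normalized matrix (six positions) and test sequence.
pwm = [
    {"A": 0.0,  "C": 0.25, "G": 0.0, "T": 0.75},
    {"A": 1.0,  "C": 0.0,  "G": 0.0, "T": 0.0},
    {"A": 0.0,  "C": 0.0,  "G": 0.0, "T": 1.0},
    {"A": 1.0,  "C": 0.0,  "G": 0.0, "T": 0.0},
    {"A": 0.75, "C": 0.0,  "G": 0.0, "T": 0.25},
    {"A": 1.0,  "C": 0.0,  "G": 0.0, "T": 0.0},
]
hits = scan("GGCATAAAGGTATAAAGC", pwm)
for hit in hits:
    print(hit)
# (3, 8, 0.909, 'CATAAA')
# (11, 16, 1.0, 'TATAAA')
```

Raising the threshold to 0.95 would keep only the exact consensus match at positions 11 to 16.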

3. Graphical representation and end of Promfinder:

While Promfinder runs, it stores all the sequences, the candidates (with their initial and final positions) and the matrix names in order to represent the results graphically. The graphical representation shows the exact location of the candidates in the sequence.
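
The page does not describe how Promfinder draws its graphics, so the following is only a plain-text stand-in for the idea: it marks each stored candidate under the sequence using its initial and final positions (assumed 1-based and inclusive). The function name and the data are hypothetical, and the code is illustrative Python.

```python
def text_map(sequence, hits):
    """Return the sequence followed by one marker line per (start, end) hit."""
    lines = [sequence]
    for start, end in hits:
        marker = " " * (start - 1) + "^" * (end - start + 1)
        lines.append(f"{marker} {start}-{end}")
    return "\n".join(lines)

# Hypothetical sequence with two stored candidates.
picture = text_map("GGCATAAAGGTATAAAGC", [(3, 8), (11, 16)])
print(picture)
```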

Finally, Promfinder closes.

Promfinder Options:

-v: Information about the program execution; each step is reported on the shell by means of the subroutine "sub print_mess".
-m: Information about the matrices: the number of matrices contained in the file, the consensus sequences and their scores.
-s: Information about the sequence. Promfinder shows the sequence name (the FASTA id), its number of nucleotides (length) and its G and C content (as an absolute value and as a percentage).
-t x.x: Specifies a threshold. If the user does not use this option, the program assigns a default threshold value (0.8).
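
The figures reported by the -s option (length and G and C content, as a count and as a percentage) can be computed as in this small sketch. It is illustrative Python with a hypothetical function name, not the Perl code of Promfinder.

```python
def gc_content(sequence):
    """Return (length, number of G and C nucleotides, GC percentage)."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    percent = 100.0 * gc / len(sequence) if sequence else 0.0
    return len(sequence), gc, percent

length, gc, percent = gc_content("GGCATA")  # hypothetical sequence
print(length, gc, percent)  # 6 3 50.0
```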

Download Promfinder
