
Practical 20: Databases of Protein Domains
Bioinformatics 2007

Introduction

Proteins have a modular organisation. They are made up of different regions with specific functions: protein (or functional) domains. Each domain is characterized by one or more sequence motifs, which are related to the function carried out by the domain. The need to preserve that function is what prevents the motif from gradually disappearing through the accumulation of mutations during evolution. The same domain may be found in many different proteins from the same organism and in many different organisms. For instance, the RNA-binding domain is one of the most abundant in all eukaryotes.

We can represent a conserved domain as a multiple alignment. From the multiple alignment we can build a description of the sequence motif using a consensus sequence, a position-specific scoring matrix (or weight matrix) or a hidden Markov model (this will be studied in the structural biology course).

A number of databases store information on known protein domains, among them Prosite, Pfam, Interpro and SMART. Through Interpro (Mulder et al., 2005) we can access several of these domain databases at once.

Using a formal representation of the domain (for example, a position-specific scoring matrix) we can search sequence databases for other molecules that contain the same domain. These searches are usually very sensitive and allow us to detect remote homologies.

Examples:

tyrosine kinase motif

CLUSTAL W (1.82) multiple sequence alignment

ABL_CALVI/28-40           YIHRDLAARNCLV 13
ABL_DROME/505-517         YIHRDLAARNCLV 13
ABL_FSVHY/308-320         FIHRDLAARNCLV 13
ABL2_HUMAN/405-417        FIHRDLAARNCLV 13
ABL1_MOUSE/359-371        FIHRDLAARNCLV 13
ABL1_HUMAN/359-371        FIHRDLAARNCLV 13
ABL1_CAEEL/428-440        FIHRDLAARNCLV 13
7LES_DROME/2339-2351      FVHRDLACRNCLV 13
7LES_DROVI/2351-2363      FVHRDLACRNCLV 13
                          ::*****.*****
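
The step from a multiple alignment to a consensus sequence and a position-specific scoring matrix can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any of the databases mentioned above: the uniform background frequency (1/20 per residue), the pseudocount of 1 and the function names are all assumptions made for the example.

```python
import math
from collections import Counter

# The nine aligned tyrosine kinase motif segments from the alignment above.
ALIGNMENT = [
    "YIHRDLAARNCLV", "YIHRDLAARNCLV", "FIHRDLAARNCLV",
    "FIHRDLAARNCLV", "FIHRDLAARNCLV", "FIHRDLAARNCLV",
    "FIHRDLAARNCLV", "FVHRDLACRNCLV", "FVHRDLACRNCLV",
]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def consensus(alignment):
    """Return the most frequent residue of each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict (residue -> log-odds score) per alignment column.

    Assumption: uniform background and a simple additive pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    denom = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in zip(*alignment):
        counts = Counter(col)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / denom) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_hit(pssm, sequence):
    """Slide the matrix along the sequence; return (score, start, window)."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20 standard amino acids contribute 0.
        score = sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))
        hits.append((score, start, window))
    return max(hits)


if __name__ == "__main__":
    print(consensus(ALIGNMENT))
    # Toy query (hypothetical): the motif embedded in a few flanking residues.
    print(best_hit(build_pssm(ALIGNMENT), "MKTFIHRDLAARNCLVGEN"))
```

Scanning a whole sequence database with `best_hit` and keeping the high-scoring windows is, in essence, how a matrix-based domain search works; real tools add calibrated score thresholds and better background models.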

Practical

We will identify protein domains in a sequence of interest using Interpro, a resource of protein domains.

We have cloned the following fragment:

>fragment_seq1
ACGTGTATCAGAGCTCATCAGAGGGTAAAGTTCACAAAAGACCACACTGTCAGACAGAAAGAGGAAGTAT
CTCCAGAGGCAGTTGGTGTCACCAGCCAGCGACCAGTGTTTTGTCCTTTTCATAAAAAGGAGCAGCTGAA
GCTGTACTGTGAGACATGTGACAAACTGACATGTCGAGACTGTCAGTTGTTAGAACATAAAGAGCATAGA
TACCAATTTATAGAAGAAGCTTTTCAGAATCAGAAAGTGATCATAGATACACTAATCACCAAACTGATGG
AAAAAACAAAATACATAAAATTCACAGGAAATCAGATCCAAAACAGAATTATTGAAGTAAATCAAAATCA
AAAGCAGGTGGAACAGGATATTAAAGTTGCTATATTTACACTGATGGTAGAAATAAATAAAAAAGGAAAA
GCTCTACTGCATCAGTTAGAGAGCCTTGCAAAGGACCATCGCATGAAACTTATGCAACAACAACAGGAAG

Part 1. Identify which gene this sequence corresponds to and get the complete protein sequence

* Copy the sequence above and paste it into BLAST to identify the complete gene entry (select the blastn program)

* After the first page click Format and wait for the results.

* How many homologous sequences do we find with the BLAST search? From which species?

* Click on an entry with 100% identity to our sequence (presumably the gene it belongs to). Which gene is this?

* Go to the protein entry for this gene (click on /protein_id)

* At Display select "FASTA" (at top of entry) and click "Display"

* Keep the sequence for the next section (copy in a text file or leave this window open)
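
The same search can also be scripted. The sketch below only builds the requests for the NCBI BLAST URL API, which works in two steps much like the web form: a submission (`CMD=Put`) that returns a request ID, and a later retrieval (`CMD=Get`) with that ID. The database name `nt`, the output format `Text` and the helper names are assumptions for illustration; check the current NCBI documentation and usage guidelines before sending real requests.

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submit_query(sequence, program="blastn", database="nt"):
    """Encode the form fields for submitting a search (CMD=Put)."""
    return urlencode({
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": sequence,
    })


def build_fetch_query(request_id, format_type="Text"):
    """Encode the form fields for retrieving results by request ID (CMD=Get)."""
    return urlencode({"CMD": "Get", "RID": request_id, "FORMAT_TYPE": format_type})


def extract_rid(page):
    """Pull the request ID out of the submission response, or return None.

    Assumption: the response contains a line of the form 'RID = <id>'.
    """
    match = re.search(r"RID = (\S+)", page)
    return match.group(1) if match else None


def submit(sequence):
    """Send the search to NCBI and return the request ID (needs network)."""
    data = build_submit_query(sequence).encode("ascii")
    with urlopen(BLAST_URL, data=data) as response:
        return extract_rid(response.read().decode("utf-8", errors="replace"))


if __name__ == "__main__":
    # Only prints the encoded request; nothing is sent.
    print(build_submit_query("ACGTGTATCAGAGCTCATCAGAGGGTAAAG"))
```

Results are not available immediately, so a real script would wait and poll with the `CMD=Get` request until the search has finished.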

Part 2. Identify known domains on the protein

* Go to Interpro and select InterProScan (menu at the left)

* Paste the protein sequence in the window and Submit Job

* How many Interpro domains hit the protein?

* Go to Table View.

* How can we know how reliable the hits to these domains are?

* Go to Raw Output

    *   Find the domain boundaries of motif PF00643 zf-B_box in our protein sequence

* Go back to Picture View

* Click on the Interpro entry defined as "Zn-finger, B-box" (IPR000315 Znf_Bbox)

    * Read the description of the domain

    * How many proteins in the Interpro database contain this domain?

    * Go down the entry to see different domain architectures of the proteins that contain this domain

* Click on the Pfam entry of the hit corresponding to "Zn-finger, B-box" (PF00643 zf-B_box)

    * Go to Alignment and click on "Get alignment" and "View HMM logo"

       * Which are the best conserved residues?

    * Go to Species Distribution

       * Which species contains the most proteins with this domain?

    * Go to Domain organisation

       * How many domain combinations also contain the Bromodomain?
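
Reading domain boundaries and reliability from the Raw Output can be sketched as follows. The sketch assumes a tab-separated layout in which column 4 is the method, column 5 the database accession, column 6 its description, columns 7 and 8 the start and end of the match, and column 9 the E-value or score; verify this against the output you actually get. The example line is invented to show the layout: its coordinates and E-value are placeholders, not the result for our protein.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    method: str
    accession: str
    description: str
    start: int
    end: int
    evalue: str  # kept as text, since some methods may report no E-value


def parse_raw(text: str) -> List[Hit]:
    """Parse tab-separated raw output lines into Hit records.

    Assumed column order (0-based): 3 method, 4 accession, 5 description,
    6 start, 7 end, 8 E-value. Lines with fewer columns are skipped.
    """
    hits = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 9:
            continue
        hits.append(Hit(fields[3], fields[4], fields[5],
                        int(fields[6]), int(fields[7]), fields[8]))
    return hits


def boundaries(hits: List[Hit], accession: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, evalue) for every match to the given accession."""
    return [(h.start, h.end, h.evalue) for h in hits if h.accession == accession]


# Invented placeholder line: NOT the real result for the practical.
EXAMPLE = "\t".join([
    "query", "CRC64PLACEHOLDER", "100", "HMMPfam", "PF00643", "zf-B_box",
    "10", "50", "1e-5", "T", "01-Jan-2000", "IPR000315", "Znf_Bbox",
])

if __name__ == "__main__":
    for start, end, evalue in boundaries(parse_raw(EXAMPLE), "PF00643"):
        print(f"PF00643 matches residues {start}-{end} (E-value {evalue})")
```

The smaller the E-value reported for a match, the less likely it is to have arisen by chance, which is one way to judge how reliable a hit is.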

Part 3. Analyse the similarity to mouse protein B-raf

* Go back to BLAST results and click on the entry ">gi|553877|gb|M64429.1|MUSBRAF Mouse B-raf oncogene mRNA, complete cds"

* Get the protein entry in FASTA format as before

* Go to ClustalW and paste the two protein sequences (TIF1 and B-raf) in FASTA format into the window.

* Analyse the protein alignment.

* Open a new browser window, go to Interpro and select InterProScan

* Paste the B-raf protein sequence in the window and Submit Job

* Compare the results with those obtained with TIF1. Which protein domains are shared and which are not?

* Compare the domain organization of the two proteins to the multiple sequence alignment obtained with ClustalW.
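
The two comparisons in this part can be made concrete with a short sketch: percent identity over the aligned (ungapped) columns of a pairwise alignment, and the shared versus unique domains of two proteins. The function names, the toy aligned strings and the domain labels `D1`, `D2`, `D3` are hypothetical placeholders, not the actual TIF1 and B-raf results.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def compare_domains(domains_a, domains_b):
    """Split two domain lists into shared and protein-specific sets."""
    a, b = set(domains_a), set(domains_b)
    return {"shared": a & b, "only_first": a - b, "only_second": b - a}


if __name__ == "__main__":
    # Toy rows standing in for two lines of a ClustalW alignment.
    print(percent_identity("AC-GT", "ACTGA"))
    # Placeholder accessions standing in for two InterProScan result lists.
    print(compare_domains(["D1", "D2"], ["D2", "D3"]))
```

A low overall identity combined with a shared domain would suggest that the similarity reported by BLAST is confined to that domain rather than spread along the whole protein.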

Additional files:

Link to protein sequences in fasta format here.

Mar Albà, February 2006