InterProScan : Protein domains wrapper identifier (poster)

Quevillon E. (1), Silventoinen V. (1), Servant F. (2), Zdobnov E. M. (3), Lopez R. (1)* and Apweiler R. (1)

1 : European Bioinformatics Institute – Wellcome Trust Genome Campus Hinxton – UK
2 : McGill University – Montreal – CA
3 : European Molecular Biology Laboratory – Heidelberg – DE
* : corresponding author ([email protected]).

Emmanuel Quevillon
European Bioinformatics Institute
EMBL Outstation – Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD, UK
Tel : +44 1223 49 4443
Fax : +44 1223 494472

Abstract :
When carrying out protein sequence analysis, the aim is to find out as much as possible about a query sequence. The first step is to compare the protein sequence against a non-redundant protein sequence database using Blast (1) or Fasta (2). However, these searches only reveal which sequences are similar to the query sequence. To learn more about the query protein's specific function, searches against so-called secondary databases (also known as pattern or signature databases) are necessary. If a search returns significant matches, these results help to assign the query protein to a particular family. If the structure and function of the family are known, searches of the secondary databases offer a fast track to inferring biological function. Examples of these databases are Pfam (3), SMART (4) and PRINTS (5). Each analyses the data in the primary databases differently, so their results contain different information. The InterPro consortium was born by uniting these secondary databases (Table 1).

InterPro (6) (http://www.ebi.ac.uk/interpro) is a searchable database providing information on protein function and annotation. InterPro entries are grouped based on protein signatures or 'methods'. These groups represent superfamilies, families or subfamilies of sequences.

InterProScan (7) is a tool that can be used to search InterPro with query sequences. It combines the protein function recognition methods of the InterPro member databases into one resource (the methods and their databases are shown in Table 1). A number of search applications are launched, each against a specific member database, and each returns a list of hits to that database. These results are merged and returned to the user as a list of matches to InterPro.

Input Sequences :
InterProScan is available at the EBI through a web interface (http://www.ebi.ac.uk/InterProScan) where users can paste their own sequence(s) or upload a sequence file.
Input sequences can be nucleotide (interactive only), in which case the user chooses a translation code table and a minimum ORF size for translated proteins. It is also possible to submit InterProScan jobs via email to
[email protected].

Output Results :
The results of each application (Table 1) are parsed to produce a merged file in raw (tab-delimited) format. A converter then produces the results in XML format, which can easily be parsed or reused later. HTML output (graphical and table views) is produced on the fly by parsing the XML file. The graphical view (Fig 1) provides a cartoon that represents the domain locations on each sequence and their corresponding accession numbers in the member databases. This view also provides links to the InterPro database and to its SRS (13) indexed version. The table view (Fig 2) provides information about InterPro entries: the parent-child relationship; GO classification; and the location, E-value and status of each match shown in the graphical view.

A free, downloadable standalone version (3.3), written in Perl, is currently available from the EBI (ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan) for users who would like to run their own installation. This standalone version transparently supports various queuing systems (such as LSF, OpenPBS and SGE). It consists of:
– Perl core package containing all the scripts and modules to run InterProScan.
– Data package which contains all the data needed by each application to run.
– Binary package precompiled for 6 different platforms (Linux, OSF1, AIX, Sun, IRIX and MacOSX).

A new version of InterProScan is currently being written from scratch. It removes the dependency on gmake and is more flexible (configurable). It also allows users to utilise several queuing systems.
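The raw-to-XML conversion described under Output Results can be illustrated with a minimal Python sketch. This is not the converter shipped with InterProScan: the column order in `COLUMNS`, the element names (`results`, `protein`, `match`) and the identifiers in the sample data are all assumptions made up for illustration, and the real raw format may use different or additional columns.

```python
import xml.etree.ElementTree as ET

# Assumed column order for this sketch only; not the documented raw format.
COLUMNS = ["sequence_id", "database", "signature",
           "start", "end", "evalue", "status", "interpro"]


def raw_to_xml(raw_text: str) -> ET.Element:
    """Group tab-delimited match lines by sequence and build an XML tree."""
    root = ET.Element("results")
    proteins = {}  # sequence_id -> <protein> element
    for line in raw_text.splitlines():
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, fields))
        protein = proteins.get(record["sequence_id"])
        if protein is None:
            protein = ET.SubElement(root, "protein", id=record["sequence_id"])
            proteins[record["sequence_id"]] = protein
        # Every remaining column becomes an attribute of the <match> element.
        attributes = {k: v for k, v in record.items() if k != "sequence_id"}
        ET.SubElement(protein, "match", attributes)
    return root


# Made-up sample data: two matches on the same query sequence.
sample = (
    "seq1\tPfam\tSIG_DEMO_1\t10\t50\t1e-5\tT\tIPR_DEMO_1\n"
    "seq1\tPRINTS\tSIG_DEMO_2\t5\t40\t2e-3\tT\tIPR_DEMO_1\n"
)
root = raw_to_xml(sample)
xml_text = ET.tostring(root, encoding="unicode")
```

Grouping matches under one element per query sequence keeps the XML easy to re-parse when an HTML view has to be generated per sequence.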
Fig 1 : Graphical view of an InterProScan result.
Fig 2 : Table view of an InterProScan result.

Database                   Application
ProDom (9)                 BlastProDom
PRINTS                     FingerPrintScan
SMART                      Hmmpfam
TIGRFAMs (10)              Hmmpfam
Pfam                       Hmmpfam
PROSITE (8)                Pfscan
PIRSF (11)                 Hmmpfam
SUPERFAMILY (SCOP) (12)    Hmmpfam

Table 1 : Database members and their applications.
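Table 1 pairs each member database with the application that searches it. The launch-and-merge step described in the abstract might be sketched as follows in Python. Only the database and application names come from Table 1; the `Hit` record, the `scan` function, the runner callables and the identifiers in the toy data are hypothetical stand-ins, not part of InterProScan.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Database -> application pairs, as listed in Table 1.
DATABASE_APPS: Dict[str, str] = {
    "ProDom": "BlastProDom",
    "PRINTS": "FingerPrintScan",
    "SMART": "Hmmpfam",
    "TIGRFAMs": "Hmmpfam",
    "Pfam": "Hmmpfam",
    "PROSITE": "Pfscan",
    "PIRSF": "Hmmpfam",
    "SUPERFAMILY": "Hmmpfam",
}


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one match reported by one application."""
    signature: str  # accession of the signature in the member database
    start: int
    end: int


# Hypothetical runner type: takes a protein sequence, returns its hits.
Runner = Callable[[str], List[Hit]]


def scan(sequence: str,
         runners: Dict[str, Runner],
         signature_to_entry: Dict[str, str]) -> List[dict]:
    """Run every available application and merge the hits into one list."""
    merged = []
    for database, application in DATABASE_APPS.items():
        runner = runners.get(database)
        if runner is None:
            continue  # no runner supplied for this member database
        for hit in runner(sequence):
            merged.append({
                # None when the signature maps to no entry in the lookup
                "entry": signature_to_entry.get(hit.signature),
                "database": database,
                "application": application,
                "signature": hit.signature,
                "start": hit.start,
                "end": hit.end,
            })
    merged.sort(key=lambda match: (match["start"], match["end"]))
    return merged


# Toy runners with made-up identifiers, standing in for real searches.
toy_runners: Dict[str, Runner] = {
    "Pfam": lambda seq: [Hit("SIG_DEMO_1", 10, 50)],
    "PRINTS": lambda seq: [Hit("SIG_DEMO_2", 5, 40)],
}
matches = scan("M" * 60, toy_runners, {"SIG_DEMO_1": "IPR_DEMO_1"})
```

Keeping the database-to-application table as data rather than code means a new member database can be added without touching the merge logic.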
References :
1) Blast. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W. and Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389-3402.
2) Fasta. Pearson W.R. and Lipman D.J. (1988) Improved tools for biological sequence analysis. PNAS 85, 2444-2448.
3) Pfam. Bateman A., Birney E., Cerruti L., Durbin R., Etwiller L., Eddy S.R., Griffiths-Jones S., Howe K.L., Marshall M., Sonnhammer E.L.L. (2002) The Pfam protein families database. Nucleic Acids Res. 30, 276-280.
4) SMART. Letunic I., Goodstadt L., Dickens N.J., Doerks T., Schultz J., Mott R., Ciccarelli F., Copley R.R., Ponting C.P., Bork P. (2002) Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res. 30, 242-244.
5) PRINTS. Attwood T.K., Bradley P., Flower D.R., Gaulton A., Maudling N., Mitchell A.L., Moulton G., Nordle A., Paine K., Taylor P., Uddin A., Zygouri C. (2003) PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res. 31, 400-402.
6) InterPro. Mulder N.J., Apweiler R., Attwood T.K., Bairoch A., Barrell D., Bateman A., Binns D., Biswas M., Bradley P., Bork P., Bucher P., Copley R.R., Courcelle E., Das U., Durbin R., Falquet L., Fleischmann W., Griffiths-Jones S., Haft D., Harte N., Hulo N., Kahn D., Kanapin A., Krestyaninova M., Lopez R., Letunic I., Lonsdale D., Silventoinen V., Orchard S.E., Pagni M., Peyruc D., Ponting C.P., Selengut J.D., Servant F., Sigrist C.J.A., Vaughan R., Zdobnov E.M. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res. 31, 315-318.
7) InterProScan. Zdobnov E.M. and Apweiler R. (2001) InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17(9), 847-848.
8) PROSITE. Falquet L., Pagni M., Bucher P., Hulo N., Sigrist C.J.A., Hofmann K., Bairoch A. (2002) The PROSITE database, its status in 2002. Nucleic Acids Res. 30, 235-238.
9) ProDom. Corpet F., Servant F., Gouzy J., Kahn D. (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28, 267-269.
10) TIGRFAMs. Haft D.H., Selengut J.D., White O. (2003) The TIGRFAMs database of protein families. Nucleic Acids Res. 31, 371-373.
11) PIR SuperFamily. Wu C.H., Huang H., Yeh L.S.L., Barker W.C. (2003) Protein family classification and functional annotation. Comput. Biol. Chem. 27, 37-47.
12) SUPERFAMILY. Gough J., Karplus K., Hughey R., Chothia C. (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol. 313(4), 903-919.
13) SRS. Etzold T. and Argos P. (1993) SRS - an indexing and retrieval tool for flat file data libraries. Comput. Appl. Biosci. 9, 49-57.