Output formats
==============

In this version of InterProScan, you can retrieve output in any of the following four formats:

- `TSV `__: A simple tab-delimited file format
- `XML `__: The InterProScan XML format (`XSD available here `__).
- `JSON `__: Full output of results in JSON format
- `GFF3 `__: The `GFF 3.0 `__ format

InterProScan 5 can output results for protein and nucleotide sequences in all formats. **Please note** you can only trace protein match positions to the original nucleotide sequence with GFF3, XML and JSON outputs.

You can override the default output formats using the **-f** option, e.g.:

::

    ./interproscan.sh -f XML -f JSON -i /path/to/sequences.fasta -b /path/to/output_file

or

::

    ./interproscan.sh -f XML, JSON -i /path/to/sequences.fasta -b /path/to/output_file

These two equivalent commands will output the results in XML and JSON format.

Tab-separated values format (TSV)
---------------------------------

Basic tab-delimited format. Outputs only those sequences with domain matches.

Example output
~~~~~~~~~~~~~~

::

    P51587 14086411a2cdf1c4cba63020e1622579 3418 Pfam PF09103 BRCA2, oligonucleotide/oligosaccharide-binding, domain 1 2670 2799 7.9E-43 T 15-03-2013
    P51587 14086411a2cdf1c4cba63020e1622579 3418 ProSiteProfiles PS50138 BRCA2 repeat profile. 1002 1036 0.0 T 18-03-2013 IPR002093 BRCA2 repeat GO:0005515|GO:0006302
    P51587 14086411a2cdf1c4cba63020e1622579 3418 Gene3D G3DSA:2.40.50.140 2966 3051 3.1E-52 T 15-03-2013
    ...

The TSV format presents the match data in columns as follows:

1. Protein accession (e.g. P51587)
2. Sequence MD5 digest (e.g. 14086411a2cdf1c4cba63020e1622579)
3. Sequence length (e.g. 3418)
4. Analysis (e.g. Pfam / PRINTS / Gene3D)
5. Signature accession (e.g. PF09103 / G3DSA:2.40.50.140)
6. Signature description (e.g. BRCA2 repeat profile)
7. Start location
8. Stop location
9. Score - is the e-value (or score) of the match reported by the member database method (e.g. 3.1E-52)
10. Status - is the status of the match (T: true)
11. 
Date - is the date of the run
12. InterPro annotations - accession (e.g. IPR002093)
13. InterPro annotations - description (e.g. BRCA2 repeat)
14. GO annotations with their source(s), e.g. GO:0005515\(InterPro\)|GO:0006302\(PANTHER\)|GO:0007195\(InterPro,PANTHER\). This is an optional column; only displayed if the :code:`--goterms` option is switched on
15. Pathways annotations, e.g. REACT\_71. This is an optional column; only displayed if the :code:`--pathways` option is switched on

If a value is missing in a column, for example, the match has no InterPro annotation, a '-' is displayed.

Extensible Markup Language (XML)
--------------------------------

XML representation of the matches - this is the richest form of the data. The XML Schema Definition (XSD) file links are below the example output.

Example output
~~~~~~~~~~~~~~

::

MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPNLFKTPQRKPSYNQLASTPIIFKEQGLTLPLYQSPVKELDKFKLDLGRNVPNSRHKSLRTVKTKMDQADDVSCPLLNSCLSESPVVLQCTHVTPQRDKSVVCGSLFHTPKFVKGRQTPKHISESLGAEVDPDMSWSSSLATPPTLSSTVLIVRNEEASETVFPHDTTANVKSYFSNHDESLKKNDRFIASVTDSENTNQREAASHGFGKTSGNSFKVNSCKDHIGKSMPNVLEDEVYETVVDTSEEDSFSLCFSKCRTKNLQKVRTSKTRKKIFHEANADECEKSKNQVKEKYSFVSEVEPNDTDPLDSNVAHQKPFESGSDKISKEVVPSLACEWSQLTLSGLNGAQMEKIPLLHISSCDQNISEKDLLDTENKRKKDFLTSENSLPRISSLPKSEKPLNEETVVNKRDEEQHLESHTDCILAVKQAISGTSPVASSFQGIKKSIFRIRESPKETFNASFSGHMTDPNFKKETEASESGLEIHTVCSQKEDSLCPNLIDNGSWPATTTQNSVALKNAGLISTLKKKTNKFIYAIHDETSYKGKKIPKDQKSELINCSAQFEANAFEAPLTFANADSGLLHSSVKRSCSQNDSEEPTLSLTSSFGTILRKCSRNETCSNNTVISQDLDYKEAKCNKEKLQLFITPEADSLSCLQEGQCENDPKSKKVSDIKEEVLAAACHPVQHSKVEYSDTDFQSQKSLLYDHENASTLILTPTSKDVLSNLVMISRGKESYKMSDKLKGNNYESDVELTKNIPMEKNQDVCALNENYKNVELLPPEKYMRVASPSRKVQFNQNTNLRVIQKNQEETTSISKITVNPDSEELFSDNENNFVFQVANERNNLALGNTKELHETDLTCVNEPIFKNSTMVLYGDTGDKQATQVSIKKDLVYVLAEENKNSVKQHIKMTLGQDLKSDISLNIDKIPEKNNDYMNKWAGLLGPISNHSFGGSFRTASNKEIKLSEHNIKKSKMFFKDIEEQYPTSLACVEIVNTLALDNQKKLSKPQSINTVSAHLQSSVVVSDCKNSHITPQMLFSKQDFNSNHNLTPSQKAEITELSTILEESGSQFEFTQFRKPSYILQKSTFEVPENQMTILKTTSEEC
RDADLHVIMNAPSIGQVDSSKQFEGTVEIKRKFAGLLKNDCNKSASGYLTDENEVGFRGFYSAHGTKLNVSTEALQKAVKLFSDIENISEETSAEVHPISLSSSKCHDSVVSMFKIENHNDKTVSEKNNKCQLILQNNIEMTTGTFVEEITENYKRNTENEDNKYTAASRNSHNLEFDGSDSSKNDTVCIHKDETDLLFTDQHNICLKLSGQFMKEGNTQIKEDLSDLTFLEVAKAQEACHGNTSNKEQLTATKTEQNIKDFETSDTFFQTASGKNISVAKESFNKIVNFFDQKPEELHNFSLNSELHSDIRKNKMDILSYEETDIVKHKILKESVPVGTGNQLVTFQGQPERDEKIKEPTLLGFHTASGKKVKIAKESLDKVKNLFDEKEQGTSEITSFSHQWAKTLKYREACKDLELACETIEITAAPKCKEMQNSLNNDKNLVSIETVVPPKLLSDNLCRQTENLKTSKSIFLKVKVHENVEKETAKSPATCYTNQSPYSVIENSALAFYTSCSRKTSVSQTSLLEAKKWLREGIFDGQPERINTADYVGNYLYENNSNSTIAENDKNHLSEKQDTYLSNSSMSNSYSYHSDEVYNDSGYLSKNKLDSGIEPVLKNVEDQKNTSFSKVISNVKDANAYPQTVNEDICVEELVTSSSPCKNKNAAIKLSISNSNNFEVGPPAFRIASGKIVCVSHETIKKVKDIFTDSFSKVIKENNENKSKICQTKIMAGCYEALDDSEDILHNSLDNDECSTHSHKVFADIQSEEILQHNQNMSGLEKVSKISPCDVSLETSDICKCSIGKLHKSVSSANTCGIFSTASGKSVQVSDASLQNARQVFSEIEDSTKQVFSKVLFKSNEHSDQLTREENTAIRTPEHLISQKGFSYNVVNSSAFSGFSTASGKQVSILESSLHKVKGVLEEFDLIRTEHSLHYSPTSRQNVSKILPRVDKRNPEHCVNSEMEKTCSKEFKLSNNLNVEGGSSENNHSIKVSPYLSQFQQDKQQLVLGTKVSLVENIHVLGKEQASPKNVKMEIGKTETFSDVPVKTNIEVCSTYSKDSENYFETEAVEIAKAFMEDDELTDSKLPSHATHSLFTCPENEEMVLSNSRIGKRRGEPLILVGEPSIKRNLLNEFDRIIENQEKSLKASKSTPDGTIKDRRLFMHHVSLEPITCVPFRTTKERQEIQNPNFTAPGQEFLSKSHLYEHLTLEKSSSNLAVSGHPFYQVSATRNEKMRHLITTGRPTKVFVPPFKTKSHFHRVEQCVRNINLEENRQKQNIDGHGSDDSKNKINDNEIHQFNKNNSNQAAAVTFTKCEEEPLDLITSLQNARDIQDMRIKKKQRQRVFPQPGSLYLAKTSTLPRISLKAAVGGQVPSACSHKQLYTYGVSKHCIKINSKNAESFQFHTEDYFGKESLWTGKGIQLADGGWLIPSNDGKAGKEEFYRALCDTPGVDPKLISRIWVYNHYRWIIWKLAAMECAFPKEFANRCLSPERVLLQLKYRYDTEIDRSRRSAIKKIMERDDTAAKTLVLCVSDIISLSANISETSSNKTSSADTQKVAIIELTDGWYAVKAQLDPPLLAVLKNGRLTVGQKIILHGAELVGSPDACTPLEAPESLMLKISANSTRPARWYTKLGFFPDPRPFPLPLSSLFSDGGNVGCVDVIIQRAYPIQWMEKTSSGLYIFRNEREEEKEAAKYVEAQQKRLEALFTKIQEEFEEHEENTTKPYLPSRALTRQQVRALQDGAELYEAVKNAADPAYLEGYFSEEQLRALNNHRQMLNDKKQAQIQLEIRKAMESAEQKEQGLSRDVTTVWKLRIVSYSKKEKDSVILSIWRPSSDLYSLLTEGKRYRIYHLATSKSKSKSERANIQLAATKKTQYQQLPVSDEILFQIYQPREPLHFSKFLDPDFQPSCSEVDLIGFVVSVVKKTGLAPFVYLSDECYNLLAIKFWIDLNEDIIKPHMLIAASNLQWRPESKSGLLTLFAGDFSVFSASPKEGHFQETF
NKMKNTVENIDILCNEAENKLMHILHANDPKWSTPTKDCTSGPYTAQIIPGTGNKLLMSSPNCEIYYQSPLSLCMAKRKSVSTPVSAQMTSKSCKGEKEIDDQKNCKKRRALDFLSRLPLPPPVSPICTFVSPAAQKAFQPPRSCGTKYETPIKKKELNSPQMTPFKKFNEISLLESNSIADEELALINTQALLSGSTGEKQFISVSESTRTAPTSSEDYLRLKRRCTTSLIKEQESSQASTEECEKNKQDTITTKKYI
... ... ... ... ... ...

The XML Schema Definition
-------------------------

The XML Schema Definition (XSD) is available `here `__.

Listed below are the XSD files for the InterProScan 5 XML output format (with the InterProScan release versions they apply to noted in brackets afterwards).

- `interproscan-model-4.6.xsd `__ (as produced by InterProScan 5 from version 5.63-95.0 onwards)
- `interproscan-model-4.5.xsd `__ (as produced by InterProScan 5 from version 5.51-85.0 to 5.62-94.0)
- `interproscan-model-3.0.xsd `__ (as produced by InterProScan 5 from version 5.31-70.0 to 5.50-84.0)
- `interproscan-model-2.2.xsd `__ (as produced by InterProScan 5 from version 5.28-67.0 to 5.30-69.0)
- `interproscan-model-2.1.xsd `__ (as produced by InterProScan 5 from version 5.26-65.0 to 5.27-66.0)
- `interproscan-model-2.0.xsd `__ (as produced by InterProScan 5 from version 5.21-60.0 to 5.25-64.0)
- `interproscan-model-1.4.xsd `__ (as produced by InterProScan 5 in version 5.20-59.0 only)
- `interproscan-model-1.3.xsd `__ (as produced by InterProScan 5 in version 5.19-58.0 only)
- `interproscan-model-1.2.xsd `__ (as produced by InterProScan 5 from version 5.17-56.0 to 5.18-57.0)
- `interproscan-model-1.1.xsd `__ (as produced by InterProScan 5 from version RC7 to 5.16-55.0)
- `interproscan-model-1.0.xsd `__ (InterProScan 5 version RC1 to RC6)

JavaScript Object Notation (JSON)
---------------------------------

JSON representation of the matches - an alternative to XML format.

As new releases are made public, the changes to the expected JSON format are documented in :ref:`Change log for InterProScan JSON output format`.
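
As a rough illustration of how this output might be consumed, the Python sketch below walks a trimmed-down document that uses the same key layout as the example output in this section (``results``, ``matches``, ``signature``, ``locations``). The function name ``iter_match_locations`` and the ``SAMPLE`` string are hypothetical, and the handling of a missing ``entry`` is an assumption; key names can change between releases, so check the JSON change log for the version you run.

```python
import json

# Trimmed-down document with the same key layout as the example output.
SAMPLE = """
{
  "interproscan-version": "5.26-65.0",
  "results": [{
    "md5": "88d47cc807fe8e977130b0cc93e0bd61",
    "matches": [{
      "signature": {
        "accession": "PIRSF038992",
        "signatureLibraryRelease": {"library": "PIRSF", "version": "3.01"},
        "entry": {"accession": "IPR002915", "type": "FAMILY"}
      },
      "locations": [{"start": 1, "end": 265, "evalue": 3.3E-94, "score": 302.6}]
    }]
  }]
}
"""


def iter_match_locations(document):
    """Yield (md5, library, signature, InterPro entry, start, end) per location."""
    for result in document.get("results", []):
        for match in result.get("matches", []):
            signature = match["signature"]
            library = signature["signatureLibraryRelease"]["library"]
            # Assumption: "entry" may be null or absent when a signature
            # has no InterPro annotation, so fall back to an empty dict.
            entry = signature.get("entry") or {}
            for location in match.get("locations", []):
                yield (
                    result.get("md5"),
                    library,
                    signature["accession"],
                    entry.get("accession"),
                    location["start"],
                    location["end"],
                )


rows = list(iter_match_locations(json.loads(SAMPLE)))
for row in rows:
    print(*row, sep="\t")
```

Flattening to one tuple per location keeps the sketch close to the column layout of the TSV output, which can make the two formats easier to compare.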
Example output
~~~~~~~~~~~~~~

::

    {
      "interproscan-version": "5.26-65.0",
      "results": [{
        "sequence" : "MSKIGKSIRLERIIDRKTRKTVIVPMDHGLTVGPIPGLIDLAAAVDKVAEGGANAVLGHMGLPLYGHRGYGKDVGLIIHLSASTSLGPDANHKVLVTRVEDAIRVGADGVSIHVNVGAEDEAEMLRDLGMVARRCDLWGMPLLAMMYPRGAKVRSEHSVEYVKHAARVGAELGVDIVKTNYTGSPETFREVVRGCPAPVVIAGGPKMDTEADLLQMVYDAMQAGAAGISIGRNIFQAENPTLLTRKLSKIVHEGYTPEEAARLKL",
        "md5" : "88d47cc807fe8e977130b0cc93e0bd61",
        "matches" : [ {
          "signature" : {
            "accession" : "PIRSF038992",
            "name" : "Aldolase_Ia",
            "description" : null,
            "type" : null,
            "signatureLibraryRelease" : {
              "library" : "PIRSF",
              "version" : "3.01"
            },
            "models" : {
              "PIRSF038992" : {
                "accession" : "PIRSF038992",
                "name" : "Aldolase_Ia",
                "description" : null,
                "key" : "PIRSF038992"
              }
            },
            "entry" : {
              "accession" : "IPR002915",
              "name" : "DeoC/FbaB/lacD_aldolase",
              "description" : "DeoC/FbaB/ lacD aldolase",
              "type" : "FAMILY",
              "goXRefs" : [ {
                "identifier" : "GO:0016829",
                "name" : "lyase activity",
                "databaseName" : "GO",
                "category" : "MOLECULAR_FUNCTION"
              } ],
              "pathwayXRefs" : [ {
                "identifier" : "R-HSA-71336",
                "name" : "Pentose phosphate pathway (hexose monophosphate shunt)",
                "databaseName" : "Reactome"
              }, {
                "identifier" : "R-HSA-6798695",
                "name" : "Neutrophil degranulation",
                "databaseName" : "Reactome"
              } ]
            }
          },
          "locations" : [ {
            "start" : 1,
            "end" : 265,
            "hmmStart" : 2,
            "hmmEnd" : 262,
            "hmmBounds" : "INCOMPLETE",
            "evalue" : 3.3E-94,
            "score" : 302.6,
            "envelopeStart" : 1,
            "envelopeEnd" : 265
          } ],
          "evalue" : 3.0E-94,
          "score" : 302.7
        }, {
          ...
        }]
    }

Generic Feature Format Version 3 (GFF3)
---------------------------------------

The GFF3 format is a flat tab-delimited file, which is much richer than the TSV output format. It allows you to trace back from matches to predicted proteins and to nucleic acid sequences. It also contains a FASTA format representation of the predicted protein sequences and their matches. Documentation of all the columns and attributes used is available at http://www.sequenceontology.org/gff3.shtml.
**Please note** in GFF3 sequence identifiers "...may contain any characters, but must escape any characters not in the set..." (1)

::

    a-zA-Z0-9.:^*$@!+_?-|.

1. http://www.sequenceontology.org/gff3.shtml

Example output
~~~~~~~~~~~~~~

::

    ##gff-version 3
    ##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269
    ##interproscan-version 5.26-65.0
    ##sequence-region AACH01000027 1 1347
    ##seqid|source|type|start|end|score|strand|phase|attributes
    AACH01000027 provided_by_user nucleic_acid 1 1347 . + . Name=AACH01000027;md5=b2a7416cb92565c004becb7510f46840;ID=AACH01000027
    AACH01000027 getorf ORF 1 1347 . + . Name=AACH01000027.2_21;Target=pep_AACH01000027_1_1347 1 449;md5=b2a7416cb92565c004becb7510f46840;ID=orf_AACH01000027_1_1347
    AACH01000027 getorf polypeptide 1 449 . + . md5=fd0743a673ac69fb6e5c67a48f264dd5;ID=pep_AACH01000027_1_1347
    AACH01000027 Pfam protein_match 84 314 1.2E-45 + . Name=PF00696;signature_desc=Amino acid kinase family;Target=null 84 314;status=T;ID=match$8_84_314;Ontology_term="GO:0008652";date=15-04-2013;Dbxref="InterPro:IPR001048","Reactome:REACT_13"
    ##sequence-region 2
    ...
    >pep_AACH01000027_1_1347
    LVLLAAFDCIDDTKLVKQIIISEIINSLPNIVNDKYGRKVLLYLLSPRDPAHTVREIIEV
    LQKGDGNAHSKKDTEIRRREMKYKRIVFKVGTSSLTNEDGSLSRSKVKDITQQLAMLHEA
    GHELILVSSGAIAAGFGALGFKKRPTKIADKQASAAVGQGLLLEEYTTNLLLRQIVSAQI
    LLTQDDFVDKRRYKNAHQALSVLLNRGAIPIINENDSVVIDELKVGDNDTLSAQVAAMVQ
    ADLLVFLTDVDGLYTGNPNSDPRAKRLERIETINREIIDMAGGAGSSNGTGGMLTKIKAA
    TIATESGVPVYICSSLKSDSMIEAAEETEDGSYFVAQEKGLRTQKQWLAFYAQSQGSIWV
    DKGAAEALSQYGKSLLLSGIVEAEGVFSYGDIVTVFDKESGKSLGKGRVQFGASALEDML
    RSQKAKGVLIYRDDWISITPEIQLLFTEF
    ...
    >match$8_84_314
    KRIVFKVGTSSLTNEDGSLSRSKVKDITQQLAMLHEAGHELILVSSGAIAAGFGALGFKK
    RPTKIADKQASAAVGQGLLLEEYTTNLLLRQIVSAQILLTQDDFVDKRRYKNAHQALSVL
    LNRGAIPIINENDSVVIDELKVGDNDTLSAQVAAMVQADLLVFLTDVDGLYTGNPNSDPR
    AKRLERIETINREIIDMAGGAGSSNGTGGMLTKIKAATIATESGVPVYICS
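
The nine columns named in the ``##seqid|source|type|...`` header line of the example can be split with a few lines of Python. This is only a minimal sketch, not part of InterProScan: the helper names ``parse_gff3_line`` and ``read_features`` are hypothetical, the sample line is the Pfam match from the example with the tab separators written out explicitly, and it assumes that the feature table ends where the first FASTA header (a line starting with ``>``) begins.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Return a dict for one feature line, or None for directives and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated key=value pairs; undo %XX escapes.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


def read_features(lines):
    """Collect feature records until the FASTA part of the file starts."""
    features = []
    for line in lines:
        if line.startswith(">"):  # assumption: first FASTA header ends the table
            break
        record = parse_gff3_line(line)
        if record is not None:
            features.append(record)
    return features


sample = [
    "##gff-version 3\n",
    "AACH01000027\tPfam\tprotein_match\t84\t314\t1.2E-45\t+\t.\t"
    "Name=PF00696;signature_desc=Amino acid kinase family;"
    "Target=null 84 314;status=T;ID=match$8_84_314\n",
    ">match$8_84_314\n",
    "KRIVFKVGTSSLTNEDGSLSRSKVKDITQQLAMLHEAGHELILVSSGAIAAGFGALGFKK\n",
]
features = read_features(sample)
print(features[0]["attributes"]["Name"], features[0]["start"], features[0]["end"])
```

The score column is left as a string because GFF3 uses ``.`` for a missing value, so converting it to a number is best done by the caller.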