Installation requirements¶
InterProScan is developed to run on Linux. There are no versions planned for Windows or Apple macOS operating systems. This is due to constraints in the various third-party binaries that InterProScan runs.
Note that InterProScan and the individual member database analyses are processor and memory intensive.
The minimum specification is a machine with 2 cores and 4 GB of RAM, which allows the analysis of a small number of sequences at a time. With more resources, the analysis runs faster and more sequences can be analysed at a time.
Software requirements:
64-bit Linux
Perl 5 (default on most Linux distributions)
Python 3 (InterProScan 5.30-69.0 onwards)
Java JDK/JRE version 11 (InterProScan 5.37-76.0 onwards)
Environment variables set
$JAVA_HOME should point to the location of the JVM
$JAVA_HOME/bin should be added to the $PATH
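The two environment variable requirements above can be checked with a short script. This is a minimal sketch, not part of InterProScan: the function name `java_env_problems` is hypothetical, and it assumes a POSIX-style PATH as found on Linux.

```python
import os


def java_env_problems(env):
    """Return a list of problems found with JAVA_HOME and PATH (empty if none)."""
    problems = []
    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        problems.append("JAVA_HOME is not set")
        return problems
    # $JAVA_HOME/bin should be one of the PATH entries.
    java_bin = os.path.normpath(os.path.join(java_home, "bin"))
    path_entries = [
        os.path.normpath(entry)
        for entry in env.get("PATH", "").split(os.pathsep)
        if entry
    ]
    if java_bin not in path_entries:
        problems.append(f"{java_bin} is not on PATH")
    return problems
```

Calling `java_env_problems(os.environ)` checks the current shell environment; an empty list means both variables look as described above.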
How to check these on a system?¶
Which version of Linux am I running?¶
InterProScan has been prepared with 64-bit binaries. To determine if you have a 32-bit or a 64-bit system, enter on the command line:
uname -a
The exact response will depend upon the hardware vendor & architecture, however typical responses may look like:
64-bit as hinted by x86_64
$ uname -a
Linux bob.com 2.6.32-358.6.2.el6.x86_64 #1 SMP Tue May 14 15:48:21 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
32-bit as hinted by i686
$ uname -a
Linux jim.com 2.6.32-50-generic-pae #112-Ubuntu SMP Tue Jul 9 20:44:31 UTC 2013 i686 GNU/Linux
If you are still in any doubt, ask your systems administrator.
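Since Python 3 is already a requirement, the same check can be scripted. The sketch below is only an illustration: the set of accepted machine names is an assumption based on the example output above, not a list defined by InterProScan.

```python
import platform

# Machine names commonly reported for 64-bit x86 systems.
# Assumption for illustration; other 64-bit architectures are not covered.
X86_64_NAMES = {"x86_64", "amd64"}


def is_64bit_x86(machine):
    """Return True if the machine string names a 64-bit x86 architecture."""
    return machine.lower() in X86_64_NAMES


# platform.machine() returns the same hardware name that `uname -m` prints.
print(platform.machine(), is_64bit_x86(platform.machine()))
```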
Testing your Perl installation¶
To test that Perl 5 is installed, enter on the command line
perl -version
This should report that a version of Perl is available, similar to:
This is perl, v5.10.1 (*) built for i486-linux-gnu-thread-multi
Copyright 1987-2009, Larry Wall
...etc
A default Perl installation is sufficient: no third party Perl modules need to be installed.
Alternatively, you could change the value of the ‘perl.command’ property in your interproscan.properties configuration file to point at a suitable Perl installation. The default value is:
perl.command=perl
Testing your Python installation¶
To test that Python 3 is installed, enter on the command line
python3 --version
This should report that a version of Python is available, similar to:
Python 3.5.1
A default Python installation is sufficient: no third party Python modules need to be installed.
You could also change the value of the ‘python3.command’ property in your interproscan.properties configuration file to point at a suitable Python installation. The default value is:
python3.command=python3
Testing the Java environment¶
To test your environment, enter on the command line
java -version
This should report that a version of Java is available, similar to:
openjdk version "11.0.4" 2019-07-16
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.4+11)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.4+11, mixed mode)
InterProScan release 5.37-76.0 or later will only run with Java version 11.* or later.
You can get Java from many places. We have tested Java 11 from the OpenJDK binaries from https://adoptopenjdk.net/. You can get information on OpenJDK reference implementations at https://jdk.java.net/ and download from https://openjdk.java.net/install/index.html.
InterProScan releases prior to 5.37-76.0 required Java 8.
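The version rule above (Java 11 from release 5.37-76.0, Java 8 before that) can be checked by parsing the first line of the `java -version` output. The following is a sketch with a hypothetical helper name; it handles both the legacy "1.8.0_74" numbering and the newer "11.0.4" numbering.

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output.

    "1.8.0_74" is reported as 8, "11.0.4" as 11.
    Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Legacy scheme: the major version is the second component (1.8 -> 8).
    if first == 1 and second is not None:
        return int(second)
    return first
```

Note that `java -version` writes to standard error, so when capturing the output from a script (for example with `subprocess.run(["java", "-version"], capture_output=True, text=True)`), read the `stderr` attribute of the result.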
Appendix - Historical Java version testing information¶
Any Oracle JDK/JRE or OpenJDK with Java 11.x should work with InterProScan. Historical information about Java versions tested and confirmed to work or not work is included below for reference, but this is not an exhaustive list!
Oracle JDK/JRE for InterProScan 5.17-56.0 or later
Version | Build | Operating System | Architecture | Status |
---|---|---|---|---|
1.8.0_74 | 1.8.0_74-b02 | Linux | x64 | Works |
1.8.0_60 | 1.8.0_60-b27 | Linux | x86 | Works |
1.7.* | | Linux | x86 | Doesn’t work |
OpenJDK for InterProScan 5.17-56.0 or later
Version | Operating System | Architecture | Status | Misc |
---|---|---|---|---|
1.8.0_66 | Linux | x64 | Works | |
1.7.* | Linux | x64 | Doesn’t work | |
Oracle JDK/JRE for InterProScan 5.16-55.0 or before
Version | Build | Operating System | Architecture | Status |
---|---|---|---|---|
1.8.0 | 1.8.0-Works | Linux | x64 | Doesn’t work |
1.7.0_51 | 1.7.0_51-b13 | Linux | x86 | Works |
1.7.0_40 | | Linux | x64 | Works |
1.7.0 | | Linux | x64 | Works |
1.6.0_45 | | Linux | x64 | Works |
1.6.0_37 | | Linux | x64 | Works |
1.6.0_22 | | Linux | x64 | Works |
1.6.0_11 | | Linux | x64 | Works |
1.6.0_07 | | Linux | x64 | Works |
1.6.0_05 | | Linux | x64 | Works |
1.6.0_04 | | Linux | x64 | Works |
1.6.0_03 | | Linux | amd64 | Doesn’t work |
1.6.0_02 | | Linux | amd64 | Doesn’t work |
OpenJDK for InterProScan 5.16-55.0 or before
Version | Operating System | Architecture | Status | Misc |
---|---|---|---|---|
1.7.0_25 | Linux | x64 | Works | |
1.6.0_30 | Linux | i686 | Works | |
1.6.0_27 | Linux | x64 | Works | |
1.6.0_24 | Linux (Red Hat Distribution) | x64 | Doesn’t work | Reported by user |
Obtaining a copy of InterProScan¶
Firstly check your system satisfies the Installation requirements. To install the InterProScan 5 software you then need to complete the following steps:
Install the core InterProScan
Configure the Pre-calculated Match Lookup
Obtaining the core InterProScan software¶
mkdir my_interproscan
cd my_interproscan
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/interproscan-5.67-99.0-64-bit.tar.gz
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/interproscan-5.67-99.0-64-bit.tar.gz.md5
# Recommended checksum to confirm the download was successful:
md5sum -c interproscan-5.67-99.0-64-bit.tar.gz.md5
# Must return *interproscan-5.67-99.0-64-bit.tar.gz: OK*
# If not - try downloading the file again as it may be a corrupted copy.
(Direct link: https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/interproscan-5.67-99.0-64-bit.tar.gz)
As the compressed file is large, it is strongly recommended that you use md5sum to check that the file has been downloaded without errors, as described above.
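If `md5sum` is not available, the same check can be done with Python's standard library. This is a minimal sketch with hypothetical function names; it assumes the .md5 file uses the usual md5sum layout, with the hexadecimal digest as the first whitespace-separated field.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(archive_path, md5_path):
    """Compare the archive's digest with the first field of the .md5 file."""
    with open(md5_path, "r", encoding="utf-8") as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(archive_path) == expected
```

Reading in chunks keeps memory use low, which matters because the archive is large. If `matches_md5_file` returns False, download the file again.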
Extract the tarball:
tar -pxvzf interproscan-5.67-99.0-*-bit.tar.gz
# where:
# p = preserve the file permissions
# x = extract files from an archive
# v = verbosely list the files processed
# z = filter the archive through gzip
# f = use archive file
This is a completely self-contained version that includes member database specific binaries and model / signature files. This should run ‘out of the box’ on a Linux system. Note that it excludes analyses that contain components for which you are obliged to acquire your own license.
Index hmm models¶
Before you run InterProScan for the first time, you should run the command:
python3 setup.py -f interproscan.properties
This command will press and index the hmm models to prepare them into a format used by hmmscan.
Panther models¶
Previous versions of InterProScan required a separate installation of Panther data. From InterProScan 5.47-82.0 onwards, this is not necessary: Panther data is bundled together with the rest of the application data.
Using the Local Pre-calculated Match Lookup Service (optional)¶
This service is switched on by default, so you don’t need to do any more installation or configuration, unless you want to install your own Pre-calculated Match Lookup Service. The uncompressed Match Lookup Service uses more than 1 TB of disk space, so it is recommended just to use the default setup.
The pre-calculated match lookup web service is able to provide matches to more than 500 million protein sequences, including all of the sequences in UniProtKB.
By default InterProScan is configured (in the interproscan.properties file) to use the web service hosted at the EBI. Your servers will need to have external access to http://www.ebi.ac.uk to use it.
InterProScan uses this service to retrieve pre-calculated matches, reducing the need for compute on your server and speeding up the response time.
If you are behind a firewall that prevents such access and you are unable to configure access, you could either install the lookup service locally or turn off the use of this service, which means the analysis will run locally without any match lookup.
To turn off the use of the service, either use the -dp command line option, or edit interproscan.properties and comment out (by adding a # to the start of the line) or delete the following line, near the bottom of the file:
precalculated.match.lookup.service.url=http://www.ebi.ac.uk/interpro/match-lookup
It is important to note that we run the latest available version of the pre-calculated match lookup service at the EBI. In the event of a new release, you will be required to either install the latest version of InterProScan 5, or to install the required version of the lookup service locally (see The InterProScan Lookup Match Service).
Running InterProScan¶
Once you have uncompressed your copy of InterProScan (see Obtaining a copy of InterProScan), you can run InterProScan directly from the command line.
Run the supplied shell script. If you run this script with no arguments, you will be presented with the usage instructions:
./interproscan.sh
After a short delay, you will see the following usage instructions:
Welcome to InterProScan-5.57-90.0
Running InterProScan v5 in STANDALONE mode... on Linux
usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M
-Xmx2048M -jar interproscan-5.jar
Please give us your feedback by sending an email to
interhelp@ebi.ac.uk
-appl,--applications <ANALYSES> Optional, comma separated list of analyses. If this option
is not set, ALL analyses will be run.
-b,--output-file-base <OUTPUT-FILE-BASE> Optional, base output filename (relative or absolute path).
Note that this option, the --output-dir (-d) option and the
--outfile (-o) option are mutually exclusive. The
appropriate file extension for the output format(s) will be
appended automatically. By default the input file path/name
will be used.
-cpu,--cpu <CPU> Optional, number of cores for inteproscan.
-d,--output-dir <OUTPUT-DIR> Optional, output directory. Note that this option, the
--outfile (-o) option and the --output-file-base (-b) option
are mutually exclusive. The output filename(s) are the same
as the input filename, with the appropriate file extension(s)
for the output format(s) appended automatically .
-dp,--disable-precalc Optional. Disables use of the precalculated match lookup
service. All match calculations will be run locally.
-dra,--disable-residue-annot Optional, excludes sites from the XML, JSON output
-etra,--enable-tsv-residue-annot Optional, includes sites in TSV output
-exclappl,--excl-applications <EXC-ANALYSES> Optional, comma separated list of analyses you want to
exclude.
-f,--formats <OUTPUT-FORMATS> Optional, case-insensitive, comma separated list of output
formats. Supported formats are TSV, XML, JSON, and GFF3.
Default for protein sequences are TSV, XML and GFF3, or for
nucleotide sequences GFF3 and XML.
-goterms,--goterms Optional, switch on lookup of corresponding Gene Ontology
annotation (IMPLIES -iprlookup option)
-help,--help Optional, display help information
-i,--input <INPUT-FILE-PATH> Optional, path to fasta file that should be loaded on Master
startup. Alternatively, in CONVERT mode, the InterProScan 5
XML file to convert.
-incldepappl,--incl-dep-applications <INC-DEP-ANALYSES> Optional, comma separated list of deprecated analyses that
you want included. If this option is not set, deprecated
analyses will not run.
-iprlookup,--iprlookup Also include lookup of corresponding InterPro annotation in
the TSV and GFF3 output formats.
-ms,--minsize <MINIMUM-SIZE> Optional, minimum nucleotide size of ORF to report. Will only
be considered if n is specified as a sequence type. Please be
aware of the fact that if you specify a too short value it
might be that the analysis takes a very long time!
-o,--outfile <EXPLICIT_OUTPUT_FILENAME> Optional explicit output file name (relative or absolute
path). Note that this option, the --output-dir (-d) option
and the --output-file-base (-b) option are mutually
exclusive. If this option is given, you MUST specify a single
output format using the -f option. The output file name will
not be modified. Note that specifying an output file name
using this option OVERWRITES ANY EXISTING FILE.
-pa,--pathways Optional, switch on lookup of corresponding Pathway
annotation (IMPLIES -iprlookup option)
-t,--seqtype <SEQUENCE-TYPE> Optional, the type of the input sequences (dna/rna (n) or
protein (p)). The default sequence type is protein.
-T,--tempdir <TEMP-DIR> Optional, specify temporary file directory (relative or
absolute path). The default location is temp/.
-verbose,--verbose Optional, display more verbose log output
-version,--version Optional, display version number
-vl,--verbose-level <VERBOSE-LEVEL> Optional, display verbose log output at level specified.
-vtsv,--output-tsv-version Optional, includes a TSV version file along with any TSV
output (when TSV output requested)
Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk) The InterProScan
software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).
Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the
individual member database websites for details.
Available analyses:
TIGRFAM (XX.X) : TIGRFAMs are protein families based on hidden Markov models (HMMs).
SFLD (X) : SFLD is a database of protein families based on hidden Markov models (HMMs).
SUPERFAMILY (X.XX) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.
PANTHER (XX.X) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
Gene3D (X.X.X) : Structural assignment for whole genes and genomes using the CATH domain structure database.
Hamap (XXXX_XX) : High-quality Automated and Manual Annotation of Microbial Proteomes.
ProSiteProfiles (XXX_XX) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
Coils (X.X.X) : Prediction of coiled coil regions in proteins.
SMART (X.X) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs).
CDD (X.XX) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.
PRINTS (XX.X) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.
PIRSR (XXXX_XX) : PIRSR is a database of protein families based on hidden Markov models (HMMs) and Site Rules.
ProSitePatterns (XXXX_XX) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
AntiFam (X.X) : AntiFam is a resource of profile-HMMs designed to identify spurious protein predictions.
Pfam (XX.X) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
MobiDBLite (X.X) : Prediction of intrinsically disordered regions in proteins.
PIRSF (X.XX) : The PIRSF concept is used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
Deactivated analyses:
SignalP_EUK (X.X) : Analysis SignalP_EUK-X.X is deactivated, because the following parameters are not set in the interproscan.properties file: binary.signalp.X.X.path
SignalP_GRAM_NEGATIVE (X.X) : Analysis SignalP_GRAM_NEGATIVE-X.X is deactivated, because the following parameters are not set in the interproscan.properties file: binary.signalp.X.X.path
SignalP_GRAM_POSITIVE (X.X) : Analysis SignalP_GRAM_POSITIVE-X.X is deactivated, because the following parameters are not set in the interproscan.properties file: binary.signalp.X.X.path
Phobius (X.XX) : Analysis Phobius-X.XX is deactivated, because the following parameters are not set in the interproscan.properties file: binary.phobius.pl.path.X.XX
TMHMM (X.X) : Analysis TMHMM-X.Xc is deactivated, because the following parameters are not set in the interproscan.properties file: binary.tmhmm.path, tmhmm.model.path
The latest analysis versions can be obtained by running the InterProScan script without any options specified.
InterProScan test run¶
This distribution of InterProScan provides a set of protein test sequences, which you can use to check how InterProScan behaves on your system. First, if you have not yet run the initialisation script, run the following command:
python3 setup.py -f interproscan.properties
This command will press and index the hmm models to prepare them into a format used by hmmscan. This command need only be run once.
You can then run the following two test case commands:
./interproscan.sh -i test_all_appl.fasta -f tsv -dp
./interproscan.sh -i test_all_appl.fasta -f tsv
The first test should create an output file with the default file name test_all_appl.fasta.tsv, and the second would then create test_all_appl.fasta_1.tsv (since the default filename already exists).
Both of the above test commands should run successfully before you run InterProScan on your own input set of sequences.
What should you get?
InterProScan should run through properly without any warnings, and it will create a TSV output file containing several member database matches, including Gene3D, PIRSF etc.
The member database binaries supplied with InterProScan should run on most Linux systems, however if they don’t work on a particular system then see the FAQ page, What should I do if one of the binaries included with InterProScan 5 doesn’t work on my system?.
Command-line options¶
-dp / --disable-precalc (optional)¶
InterProScan is a computationally expensive program, sometimes taking a couple of minutes to characterise a single sequence. It calculates matches to InterPro signatures based purely on the amino acid sequence that is submitted to it. Therefore, two identical amino acid sequences will produce identical outputs (although if the sequences differ by just one residue, the outputs may or may not be the same). We can take advantage of this feature, and increase the speed of InterProScan, by pre-calculating matches for sequences already found in UniProtKB. When a sequence is submitted to it, InterProScan calculates an MD5 checksum for the amino acid sequence and then uses that checksum to query the pre-calculated lookup service (see What is the InterProScan 5 Lookup Service?) to see whether the sequence has already been encountered. If it has, the pre-calculated results are returned to the user; if not, the InterProScan search algorithms are run against the sequence.
By default, the pre-calculated match lookup is turned on. If you wish to turn it off, you should add the “--disable-precalc” option to the command line. Users also have the option of using an EBI-hosted instance of the lookup service (this is what is enabled by default) or downloading a copy and running it locally. For more information, read the section on configuring the match lookup service below.
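The checksum step described above can be illustrated with Python's standard library. This is only a sketch of the idea: the normalisation shown (removing whitespace and upper-casing) is an assumption, and InterProScan's own implementation may prepare the sequence differently.

```python
import hashlib


def sequence_md5(sequence):
    """Return the MD5 hex digest of an amino acid sequence.

    Assumption: whitespace is removed and the sequence is upper-cased
    before hashing, so that line wrapping and case do not change the key.
    """
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()
```

Two identical sequences always produce the same digest, while a difference of a single residue produces a different one, which is what makes the digest usable as a lookup key.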
-appl / --applications application_name (optional)¶
By default, all available analyses are run, however if you wish to restrict to a single analysis, use the -appl option. The argument to the -appl option should be one of the analyses named at the bottom of the usage instructions. Analysis names may or may not contain version numbers. For example:
./interproscan.sh -appl Pfam -i /path/to/sequences.fasta
If you wish to specifically run two or more analyses you can include multiple -appl arguments:
./interproscan.sh -appl Pfam-33.1 -appl PRINTS-42.0 -i /path/to/sequences.fasta
or you can use a single -appl option with a comma-separated list of analyses:
./interproscan.sh -appl CDD,COILS,Gene3D,HAMAP,MobiDBLite,PANTHER,Pfam,PIRSF,PRINTS,PROSITEPATTERNS,PROSITEPROFILES,SFLD,SMART,SUPERFAMILY,TIGRFAM -i /path/to/sequences.fasta
A list of all available analyses is in the section “Included analyses”.
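When InterProScan is called from a script, the comma-separated form is convenient to assemble programmatically. The helper below is a hypothetical sketch; it uses only options shown in the usage instructions (-i, -appl, -f, -b, -dp).

```python
def build_interproscan_command(input_path, analyses=None, formats=None,
                               output_base=None, disable_precalc=False,
                               script="./interproscan.sh"):
    """Build an argument list for interproscan.sh from Python values."""
    command = [script, "-i", input_path]
    if analyses:
        # A single -appl option with a comma-separated list, no spaces.
        command += ["-appl", ",".join(analyses)]
    if formats:
        command += ["-f", ",".join(formats)]
    if output_base:
        command += ["-b", output_base]
    if disable_precalc:
        command.append("-dp")
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.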
-i / --input sequence_file¶
To analyse the contents of a fasta file, you should add one argument as in the following example:
./interproscan.sh -i /path/to/sequences.fasta
This will return results in the default formats as described above, i.e. TSV, XML and GFF3 files for protein sequences, or GFF3 and XML files for nucleotide sequences. The file names are based upon the name of the fasta file (sequences.fasta.tsv, sequences.fasta.xml and sequences.fasta.gff3 in this case).
-iprlookup / --iprlookup¶
Option that provides mappings from matched member database signatures to the InterPro entries that they are integrated into. Starting from release InterProScan-5.40-77.0, you don’t have to explicitly specify this option, as InterProScan will always provide mappings to InterPro entries.
-goterms / --goterms (optional)¶
Option that provides mappings to the Gene Ontology (GO). These mappings are based on the matched manually curated InterPro entries. (IMPLIES -iprlookup option)
-b / --output-file-base file_name (optional)¶
Optionally, you can supply a path and base name (excluding a file extension) for the results file as follows:
./interproscan.sh -i /path/to/sequences.fasta -b /path/to/output_file
The appropriate file extension will be added to each output file, depending upon the format(s) requested. (It is therefore recommended that you do not include a file extension yourself.)
Note that using this option will not overwrite existing files. If a file with the required name exists at the path specified, an underscore and a number are inserted in front of the file extension (e.g. output_file_1.tsv).
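The naming behaviour described above can be modelled as follows. This is an illustration only, not InterProScan's code: the function name is hypothetical, and counting upwards past 1 for repeated collisions is an assumption.

```python
import os


def non_clobbering_name(base, extension, exists=os.path.exists):
    """Return base.extension, or base_N.extension if that name is taken."""
    candidate = f"{base}.{extension}"
    counter = 0
    while exists(candidate):
        counter += 1
        # Insert "_<number>" in front of the file extension.
        candidate = f"{base}_{counter}.{extension}"
    return candidate
```

For example, if /path/to/output_file.tsv already exists, the sketch returns /path/to/output_file_1.tsv. The `exists` parameter defaults to a real file system check and can be replaced with any callable for testing.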
-o / --outfile (optional)¶
This option can be given instead of the -b option. If you provide this argument, you must specify a single output format. The output file will be given the name specified by this option.
Note that this option will overwrite existing files with the same path / name.
-pa / --pathways (optional)¶
Option that provides mappings from matches to pathway information, which is based on the matched manually curated InterPro entries. (IMPLIES -iprlookup option). The different pathways databases that InterProScan provides cross links to are:
MetaCyc
Reactome
-t / --seqtype (optional)¶
InterProScan supports analysis of both protein and nucleic acid sequences (DNA/RNA). Your input sequences are interpreted as protein sequences by default. If you would like to scan nucleotide sequences, you must set the -t option:
./interproscan.sh -t n -i /path/to/sequences.fasta
-T / --tempdir (optional)¶
Optionally, you can specify the location of the InterProScan temporary directory. This directory is used as a working directory. The default temporary directory will be in the same directory as the InterProScan script file (interproscan.sh). By default, this directory is completely cleaned up after InterProScan has finished all analyses successfully.
Example usage:
./interproscan.sh -T /path/to/temp-directory -i /path/to/sequences.fasta
-dra / --disable-residue-annot (optional)¶
Optionally, you can prevent InterProScan from calculating the residue level annotations and displaying them in the output where available. If you don’t require this information, disabling the feature will improve performance and result in smaller output files.
-version / --version (optional)¶
Display the version number of the InterProScan software you are running.
Included analyses¶
This distribution of InterProScan includes:
PROSITE (Profiles and Patterns)
SMART (unlicensed components only by default - this analysis has simplified post-processing that includes an E-value filter, however you should not expect it to give the same match output as the fully licensed version of SMART)
A number of other analyses are available in InterProScan. These analyses use licensed code and data provided by third parties. If you wish to run these analyses it will be necessary for you to obtain a licence from the vendor and configure your local InterProScan installation to use these:
The InterPro team would like to thank the developers and maintainers of all of these analyses for their valued and on-going support.
Output format¶
Please see Output formats.
Optional configuration¶
Working directory for temporary files¶
Besides the -T option, there is a second way of changing the temporary/working directory (where fasta files, binary output etc. are written to). You can do this by editing the interproscan.properties file and changing the path for the property:
temporary.file.directory=temp/[UNIQUE]
NOTE: Leave /[UNIQUE] on the end - this is replaced with a timestamped / unique directory for each run. This directory is cleaned up and deleted at the end of each run of InterProScan.
Configuring the Pre-calculated Match Lookup Service¶
As this is a web service, your servers will need to have external access to http://www.ebi.ac.uk to use it. If you are behind a firewall that prevents such access and you are unable to configure access, you can either turn off use of this service or download a copy and run a local match lookup service.
To turn off use of the service, either use the -dp command line option, or edit interproscan.properties and comment out (add a # to the start of the line) or delete the following line, near the bottom of the file:
precalculated.match.lookup.service.url=http://www.ebi.ac.uk/interpro/match-lookup
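Commenting out the property can also be scripted, for example when preparing many installations. The following is a minimal sketch with a hypothetical function name; it works on the text of interproscan.properties and leaves every other line untouched.

```python
LOOKUP_PROPERTY = "precalculated.match.lookup.service.url"


def comment_out_lookup(properties_text):
    """Return the properties text with the lookup URL line commented out."""
    lines = []
    for line in properties_text.splitlines(keepends=True):
        # Only an active (not yet commented) property line is changed.
        if line.lstrip().startswith(LOOKUP_PROPERTY + "="):
            line = "#" + line
        lines.append(line)
    return "".join(lines)
```

Running the function twice has no further effect, because a line that already starts with # no longer matches. It is advisable to keep a backup copy of interproscan.properties before writing the result back.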
Running InterProScan on an LSF/SGE Cluster¶
Please see Cluster Mode.
Input formats¶
InterProScan 5 supports the FASTA file format.
An example of a simple FASTA format file containing unaligned sequences:
> seq1 Description of seq1.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA
TAGTA
> seq2 Description of seq2.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
CATCGTCAGTTACTGC
InterProScan 5 supports unaligned sequences only. Sequences should contain only valid IUPAC amino acid or nucleic acid characters. Gap (‘-‘), period (‘.’), asterisk (‘*’) and underscore (‘_’) symbols are not allowed: they produce warnings and InterProScan exits immediately.
Example for supported protein sequence:
MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPN
Example for supported nucleic acid sequence:
atgaaatataaacgcattgtgtttaaagtgggcaccagcagcctgaccaacg
Unsupported sequences:
-RFLLLSLARFSNNRFGVQLLQIANVNLKVRRYG (illegal gap character at the start)
RFLLLSL--ARFSNNRFGVQLLQIANVNLKVRRYG (illegal gap character in the middle)
RFLLLSLARFSNNRFGVQLLQIANVNLKVRRYG* (illegal asterisk character at the end)
RFLLLSL_ARFSNNRFGVQLLQIANVNLKVRRYG (illegal underscore character)
RFLLLSL.ARFSNNRFGVQLLQIANVNLKVRRYG (illegal period character)
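Input can be screened for these four symbols before a run is started. The sketch below uses a hypothetical function name and checks only the gap, period, underscore and asterisk characters named above; it does not validate the full IUPAC alphabet.

```python
# The four symbols that the text above lists as not allowed.
ILLEGAL_CHARACTERS = frozenset("-._*")


def illegal_characters(sequence):
    """Return (position, character) pairs for each illegal symbol.

    Positions are 1-based, counted from the start of the sequence.
    """
    return [
        (position, character)
        for position, character in enumerate(sequence, start=1)
        if character in ILLEGAL_CHARACTERS
    ]
```

An empty list means none of the four symbols was found. A common case is a trailing asterisk used as a stop symbol in translated sequences, which has to be stripped before the file is submitted.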
Output formats¶
In this version of InterProScan, you can retrieve output in any of the following four formats:
TSV: A simple tab-delimited file format
XML: The InterProScan XML format (XSD available here).
JSON: Full output of results in JSON format
GFF3: Output of results in GFF3 format
InterProScan 5 can output results for protein and nucleotide sequences in all formats. Please note you can only trace protein match positions to the original nucleotide sequence with GFF3, XML and JSON outputs.
You can override the default output formats using the -f option, e.g.:
./interproscan.sh -f XML -f JSON -i /path/to/sequences.fasta -b /path/to/output_file
or
./interproscan.sh -f XML,JSON -i /path/to/sequences.fasta -b /path/to/output_file
These two equivalent commands will output the results in XML and JSON format.
TSV: basic tab-delimited format. Outputs only those sequences with domain matches.
Example output¶
P51587 14086411a2cdf1c4cba63020e1622579 3418 Pfam PF09103 BRCA2, oligonucleotide/oligosaccharide-binding, domain 1 2670 2799 7.9E-43 T 15-03-2013
P51587 14086411a2cdf1c4cba63020e1622579 3418 ProSiteProfiles PS50138 BRCA2 repeat profile. 1002 1036 0.0 T 18-03-2013 IPR002093 BRCA2 repeat GO:0005515|GO:0006302
P51587 14086411a2cdf1c4cba63020e1622579 3418 Gene3D G3DSA:2.40.50.140 2966 3051 3.1E-52 T 15-03-2013
...
The TSV format presents the match data in columns as follows:
Protein accession (e.g. P51587)
Sequence MD5 digest (e.g. 14086411a2cdf1c4cba63020e1622579)
Sequence length (e.g. 3418)
Analysis (e.g. Pfam / PRINTS / Gene3D)
Signature accession (e.g. PF09103 / G3DSA:2.40.50.140)
Signature description (e.g. BRCA2 repeat profile)
Start location
Stop location
Score - is the e-value (or score) of the match reported by member database method (e.g. 3.1E-52)
Status - is the status of the match (T: true)
Date - is the date of the run
InterPro annotations - accession (e.g. IPR002093)
InterPro annotations - description (e.g. BRCA2 repeat)
GO annotations with their source(s), e.g. GO:0005515(InterPro)|GO:0006302(PANTHER)|GO:0007195(InterPro,PANTHER). This is an optional column; only displayed if the --goterms option is switched on
Pathways annotations, e.g. REACT_71. This is an optional column; only displayed if the --pathways option is switched on
If a value is missing in a column, for example, the match has no InterPro annotation, a ‘-‘ is displayed.
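The column layout above maps naturally onto a small parser. This is a sketch under stated assumptions: the key names are labels chosen for this example (they are not defined by InterProScan), and the columns are assumed to appear in the order listed, with the optional ones at the end.

```python
# Hypothetical key names for the TSV columns, in the order listed above.
TSV_COLUMNS = [
    "protein_accession", "sequence_md5", "sequence_length", "analysis",
    "signature_accession", "signature_description", "start", "stop",
    "score", "status", "date", "interpro_accession", "interpro_description",
    "go_annotations", "pathway_annotations",
]


def parse_tsv_line(line):
    """Parse one TSV output line into a dict keyed by column label."""
    fields = line.rstrip("\n").split("\t")
    # zip stops at the shorter input, so absent optional columns are omitted.
    record = dict(zip(TSV_COLUMNS, fields))
    for key in ("sequence_length", "start", "stop"):
        record[key] = int(record[key])
    # A '-' marks a missing value; represent it as None.
    return {key: (None if value == "-" else value)
            for key, value in record.items()}
```

The score is kept as a string because, like other columns, it may hold the ‘-‘ placeholder rather than a number.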
XML representation of the matches - this is the richest form of the data. The XML Schema Definition (XSD) file links are below the example output.
Example output¶
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5" interproscan-version="5.26-65.0">
<protein>
<sequence md5="14086411a2cdf1c4cba63020e1622579">MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPNLFKTPQRKPSYNQLASTPIIFKEQGLTLPLYQSPVKELDKFKLDLGRNVPNSRHKSLRTVKTKMDQADDVSCPLLNSCLSESPVVLQCTHVTPQRDKSVVCGSLFHTPKFVKGRQTPKHISESLGAEVDPDMSWSSSLATPPTLSSTVLIVRNEEASETVFPHDTTANVKSYFSNHDESLKKNDRFIASVTDSENTNQREAASHGFGKTSGNSFKVNSCKDHIGKSMPNVLEDEVYETVVDTSEEDSFSLCFSKCRTKNLQKVRTSKTRKKIFHEANADECEKSKNQVKEKYSFVSEVEPNDTDPLDSNVAHQKPFESGSDKISKEVVPSLACEWSQLTLSGLNGAQMEKIPLLHISSCDQNISEKDLLDTENKRKKDFLTSENSLPRISSLPKSEKPLNEETVVNKRDEEQHLESHTDCILAVKQAISGTSPVASSFQGIKKSIFRIRESPKETFNASFSGHMTDPNFKKETEASESGLEIHTVCSQKEDSLCPNLIDNGSWPATTTQNSVALKNAGLISTLKKKTNKFIYAIHDETSYKGKKIPKDQKSELINCSAQFEANAFEAPLTFANADSGLLHSSVKRSCSQNDSEEPTLSLTSSFGTILRKCSRNETCSNNTVISQDLDYKEAKCNKEKLQLFITPEADSLSCLQEGQCENDPKSKKVSDIKEEVLAAACHPVQHSKVEYSDTDFQSQKSLLYDHENASTLILTPTSKDVLSNLVMISRGKESYKMSDKLKGNNYESDVELTKNIPMEKNQDVCALNENYKNVELLPPEKYMRVASPSRKVQFNQNTNLRVIQKNQEETTSISKITVNPDSEELFSDNENNFVFQVANERNNLALGNTKELHETDLTCVNEPIFKNSTMVLYGDTGDKQATQVSIKKDLVYVLAEENKNSVKQHIKMTLGQDLKSDISLNIDKIPEKNNDYMNKWAGLLGPISNHSFGGSFRTASNKEIKLSEHNIKKSKMFFKDIEEQYPTSLACVEIVNTLALDNQKKLSKPQSINTVSAHLQSSVVVSDCKNSHITPQMLFSKQDFNSNHNLTPSQKAEITELSTILEESGSQFEFTQFRKPSYILQKSTFEVPENQMTILKTTSEECRDADLHVIMNAPSIGQVDSSKQFEGTVEIKRKFAGLLKNDCNKSASGYLTDENEVGFRGFYSAHGTKLNVSTEALQKAVKLFSDIENISEETSAEVHPISLSSSKCHDSVVSMFKIENHNDKTVSEKNNKCQLILQNNIEMTTGTFVEEITENYKRNTENEDNKYTAASRNSHNLEFDGSDSSKNDTVCIHKDETDLLFTDQHNICLKLSGQFMKEGNTQIKEDLSDLTFLEVAKAQEACHGNTSNKEQLTATKTEQNIKDFETSDTFFQTASGKNISVAKESFNKIVNFFDQKPEELHNFSLNSELHSDIRKNKMDILSYEETDIVKHKILKESVPVGTGNQLVTFQGQPERDEKIKEPTLLGFHTASGKKVKIAKESLDKVKNLFDEKEQGTSEITSFSHQWAKTLKYREACKDLELACETIEITAAPKCKEMQNSLNNDKNLVSIETVVPPKLLSDNLCRQTENLKTSKSIFLKVKVHENVEKETAKSPATCYTNQSPYSVIENSALAFYTSCSRKTSVSQTSLLEAKKWLREGIFDGQPERINTADYVGNYLYENNSNSTIAENDKNHLSEKQDTYLSNSSMSNSYSYHSDEVYNDSGYLSKNKLDSGIEPVLKNVEDQKNTSFSKVISNVKDANAYPQTVNEDICVEELVTSSSPCKNKNAAIKLSISNSNNFEVGPPAFRIASGKIVCVSHETIKKVKDIFTDSFSKVIKENNENKSKICQTKIMAGCYEALDDSEDILHNSLDNDECSTHSHKVFADIQSEEILQHNQNMSGLEKVSKISPCDVS
LETSDICKCSIGKLHKSVSSANTCGIFSTASGKSVQVSDASLQNARQVFSEIEDSTKQVFSKVLFKSNEHSDQLTREENTAIRTPEHLISQKGFSYNVVNSSAFSGFSTASGKQVSILESSLHKVKGVLEEFDLIRTEHSLHYSPTSRQNVSKILPRVDKRNPEHCVNSEMEKTCSKEFKLSNNLNVEGGSSENNHSIKVSPYLSQFQQDKQQLVLGTKVSLVENIHVLGKEQASPKNVKMEIGKTETFSDVPVKTNIEVCSTYSKDSENYFETEAVEIAKAFMEDDELTDSKLPSHATHSLFTCPENEEMVLSNSRIGKRRGEPLILVGEPSIKRNLLNEFDRIIENQEKSLKASKSTPDGTIKDRRLFMHHVSLEPITCVPFRTTKERQEIQNPNFTAPGQEFLSKSHLYEHLTLEKSSSNLAVSGHPFYQVSATRNEKMRHLITTGRPTKVFVPPFKTKSHFHRVEQCVRNINLEENRQKQNIDGHGSDDSKNKINDNEIHQFNKNNSNQAAAVTFTKCEEEPLDLITSLQNARDIQDMRIKKKQRQRVFPQPGSLYLAKTSTLPRISLKAAVGGQVPSACSHKQLYTYGVSKHCIKINSKNAESFQFHTEDYFGKESLWTGKGIQLADGGWLIPSNDGKAGKEEFYRALCDTPGVDPKLISRIWVYNHYRWIIWKLAAMECAFPKEFANRCLSPERVLLQLKYRYDTEIDRSRRSAIKKIMERDDTAAKTLVLCVSDIISLSANISETSSNKTSSADTQKVAIIELTDGWYAVKAQLDPPLLAVLKNGRLTVGQKIILHGAELVGSPDACTPLEAPESLMLKISANSTRPARWYTKLGFFPDPRPFPLPLSSLFSDGGNVGCVDVIIQRAYPIQWMEKTSSGLYIFRNEREEEKEAAKYVEAQQKRLEALFTKIQEEFEEHEENTTKPYLPSRALTRQQVRALQDGAELYEAVKNAADPAYLEGYFSEEQLRALNNHRQMLNDKKQAQIQLEIRKAMESAEQKEQGLSRDVTTVWKLRIVSYSKKEKDSVILSIWRPSSDLYSLLTEGKRYRIYHLATSKSKSKSERANIQLAATKKTQYQQLPVSDEILFQIYQPREPLHFSKFLDPDFQPSCSEVDLIGFVVSVVKKTGLAPFVYLSDECYNLLAIKFWIDLNEDIIKPHMLIAASNLQWRPESKSGLLTLFAGDFSVFSASPKEGHFQETFNKMKNTVENIDILCNEAENKLMHILHANDPKWSTPTKDCTSGPYTAQIIPGTGNKLLMSSPNCEIYYQSPLSLCMAKRKSVSTPVSAQMTSKSCKGEKEIDDQKNCKKRRALDFLSRLPLPPPVSPICTFVSPAAQKAFQPPRSCGTKYETPIKKKELNSPQMTPFKKFNEISLLESNSIADEELALINTQALLSGSTGEKQFISVSESTRTAPTSSEDYLRLKRRCTTSLIKEQESSQASTEECEKNKQDTITTKKYI</sequence>
<xref id="P51587"/>
<matches>
...
<hmmer3-match score="341.9" evalue="0.0">
<signature name="BRCA-2_helical" desc="BRCA2, helical" ac="PF09169">
<entry type="DOMAIN" name="BRCA2_hlx" desc="Breast cancer type 2 susceptibility protein, helical domain" ac="IPR015252">
<go-xref category="BIOLOGICAL_PROCESS" name="double-strand break repair via homologous recombination" id="GO:0000724" db="GO"/>
<go-xref category="MOLECULAR_FUNCTION" name="single-stranded DNA binding" id="GO:0003697" db="GO"/>
<go-xref category="BIOLOGICAL_PROCESS" name="DNA recombination" id="GO:0006310" db="GO"/>
</entry>
<models>
<model name="BRCA-2_helical" desc="BRCA2, helical" ac="PF09169"/>
</models>
<signature-library-release version="27.0" library="PFAM"/>
</signature>
<locations>
<hmmer3-location env-start="2479" env-end="2667" hmm-end="195" hmm-start="1" evalue="9.6E-102" score="0.0" end="2667" start="2479"/>
</locations>
</hmmer3-match>
...
<superfamilyhmmer3-match evalue="0.0">
<signature name="BRCA2 helical domain" ac="SSF81872">
<entry type="DOMAIN" name="BRCA2_hlx" desc="Breast cancer type 2 susceptibility protein, helical domain" ac="IPR015252">
<go-xref category="BIOLOGICAL_PROCESS" name="double-strand break repair via homologous recombination" id="GO:0000724" db="GO"/>
<go-xref category="MOLECULAR_FUNCTION" name="single-stranded DNA binding" id="GO:0003697" db="GO"/>
<go-xref category="BIOLOGICAL_PROCESS" name="DNA recombination" id="GO:0006310" db="GO"/>
</entry>
<models>
<model name="BRCA2 helical domain" ac="0039279"/>
<model name="BRCA2 helical domain" ac="0040951"/>
</models>
<signature-library-release version="1.75" library="SUPERFAMILY"/>
</signature>
<locations>
<superfamilyhmmer3-location end="2668" start="2479"/>
</locations>
</superfamilyhmmer3-match>
...
<rpsblast-match>
<signature ac="cd08964" desc="L-asparaginase_II" name="L-asparaginase_II">
<models>
<model ac="cd08964" desc="L-asparaginase_II" name="L-asparaginase_II"/>
</models>
<signature-library-release library="CDD" version="3.14"/>
</signature>
<locations>
<rpsblast-location evalue="8.66035E-152" score="433.09" start="50" end="364">
<sites>
<rpsblast-site description="homotetramer interface" numLocations="51">
<site-locations>
<site-location residue="Y" start="271" end="271"/>
<site-location residue="R" start="246" end="246"/>
<site-location residue="Y" start="229" end="229"/>
...
</site-locations>
</rpsblast-site>
...
</sites>
</rpsblast-location>
</locations>
</rpsblast-match>
...
</matches>
</protein>
</protein-matches>
The XML Schema Definition (XSD) is available here.
Listed below are the XSD files for the InterProScan 5 XML output format (with the InterProScan release versions they apply to noted in brackets afterwards).
interproscan-model-4.6.xsd (as produced by InterProScan 5 from version 5.63-95.0 onwards)
interproscan-model-4.5.xsd (as produced by InterProScan 5 from version 5.51-85.0 to 5.62-94.0)
interproscan-model-3.0.xsd (as produced by InterProScan 5 from version 5.31-70.0 to 5.50-84.0)
interproscan-model-2.2.xsd (as produced by InterProScan 5 from version 5.28-67.0 to 5.30-69.0)
interproscan-model-2.1.xsd (as produced by InterProScan 5 from version 5.26-65.0 to 5.27-66.0)
interproscan-model-2.0.xsd (as produced by InterProScan 5 from version 5.21-60.0 to 5.25-64.0)
interproscan-model-1.4.xsd (as produced by InterProScan 5 in version 5.20-59.0 only)
interproscan-model-1.3.xsd (as produced by InterProScan 5 in version 5.19-58.0 only)
interproscan-model-1.2.xsd (as produced by InterProScan 5 from version 5.17-56.0 to 5.18-57.0)
interproscan-model-1.1.xsd (as produced by InterProScan 5 from version RC7 to 5.16-55.0)
interproscan-model-1.0.xsd (InterProScan 5 version RC1 to RC6)
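The XML example above can be walked with Python's standard library. This is a minimal sketch, not an official parser: the element and attribute names (protein, xref, matches, signature, locations, ac, start, end) are taken from the example, while the helper names local and iter_matches and the trimmed SAMPLE document are illustrative assumptions. Tags are compared by local name in case the real file declares an XML namespace.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def child(parent, name):
    """Return the first direct child with the given local name, or None."""
    return next((c for c in parent if local(c.tag) == name), None)

def iter_matches(xml_text):
    """Yield (xref ids, signature accession, start, end) for every match location."""
    root = ET.fromstring(xml_text)
    for protein in root.iter():
        if local(protein.tag) != "protein":
            continue
        ids = [x.get("id") for x in protein if local(x.tag) == "xref"]
        matches = child(protein, "matches")
        if matches is None:
            continue
        for match in matches:
            signature = child(match, "signature")
            locations = child(match, "locations")
            if signature is None or locations is None:
                continue
            for loc in locations:
                yield ids, signature.get("ac"), int(loc.get("start")), int(loc.get("end"))

# Hypothetical, heavily trimmed document modelled on the example above.
SAMPLE = """<protein-matches>
  <protein>
    <sequence md5="placeholder">MPIG</sequence>
    <xref id="P51587"/>
    <matches>
      <hmmer3-match score="341.9" evalue="0.0">
        <signature name="BRCA-2_helical" ac="PF09169"/>
        <locations>
          <hmmer3-location start="2479" end="2667"/>
        </locations>
      </hmmer3-match>
    </matches>
  </protein>
</protein-matches>"""

for row in iter_matches(SAMPLE):
    print(row)
```

For very large result files, ET.iterparse may be a better fit than loading the whole document at once.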
JSON representation of the matches - an alternative to XML format. As new releases are made public, the changes to the expected JSON format are documented in Change log for InterProScan JSON output format.
Example output¶
{
"interproscan-version": "5.26-65.0",
"results": [{
"sequence" : "MSKIGKSIRLERIIDRKTRKTVIVPMDHGLTVGPIPGLIDLAAAVDKVAEGGANAVLGHMGLPLYGHRGYGKDVGLIIHLSASTSLGPDANHKVLVTRVEDAIRVGADGVSIHVNVGAEDEAEMLRDLGMVARRCDLWGMPLLAMMYPRGAKVRSEHSVEYVKHAARVGAELGVDIVKTNYTGSPETFREVVRGCPAPVVIAGGPKMDTEADLLQMVYDAMQAGAAGISIGRNIFQAENPTLLTRKLSKIVHEGYTPEEAARLKL",
"md5" : "88d47cc807fe8e977130b0cc93e0bd61",
"matches" : [ {
"signature" : {
"accession" : "PIRSF038992",
"name" : "Aldolase_Ia",
"description" : null,
"type" : null,
"signatureLibraryRelease" : {
"library" : "PIRSF",
"version" : "3.01"
},
"models" : {
"PIRSF038992" : {
"accession" : "PIRSF038992",
"name" : "Aldolase_Ia",
"description" : null,
"key" : "PIRSF038992"
}
},
"entry" : {
"accession" : "IPR002915",
"name" : "DeoC/FbaB/lacD_aldolase",
"description" : "DeoC/FbaB/ lacD aldolase",
"type" : "FAMILY",
"goXRefs" : [ {
"identifier" : "GO:0016829",
"name" : "lyase activity",
"databaseName" : "GO",
"category" : "MOLECULAR_FUNCTION"
} ],
"pathwayXRefs" : [ {
"identifier" : "R-HSA-71336",
"name" : "Pentose phosphate pathway (hexose monophosphate shunt)",
"databaseName" : "Reactome"
}, {
"identifier" : "R-HSA-6798695",
"name" : "Neutrophil degranulation",
"databaseName" : "Reactome"
} ]
}
},
"locations" : [ {
"start" : 1,
"end" : 265,
"hmmStart" : 2,
"hmmEnd" : 262,
"hmmBounds" : "INCOMPLETE",
"evalue" : 3.3E-94,
"score" : 302.6,
"envelopeStart" : 1,
"envelopeEnd" : 265
} ],
"evalue" : 3.0E-94,
"score" : 302.7
}, {
...
}]
}
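As a rough illustration of how the JSON output might be consumed, the sketch below flattens the structure shown above into one row per match location. The key names (results, matches, signature, entry, locations, accession, start, end) come from the example; the function name summarise and the trimmed SAMPLE document are assumptions made for illustration, and the handling of a null entry is a guess about signatures that are not integrated into InterPro.

```python
import json

def summarise(doc):
    """Return (signature accession, entry accession, start, end) per location."""
    rows = []
    for result in doc.get("results", []):
        for match in result.get("matches", []):
            signature = match["signature"]
            entry = signature.get("entry") or {}  # assumed to be null when unintegrated
            for location in match.get("locations", []):
                rows.append((signature["accession"], entry.get("accession"),
                             location["start"], location["end"]))
    return rows

# Hypothetical, trimmed version of the example output.
SAMPLE = json.loads("""
{
  "interproscan-version": "5.26-65.0",
  "results": [{
    "md5": "88d47cc807fe8e977130b0cc93e0bd61",
    "matches": [{
      "signature": {"accession": "PIRSF038992",
                    "entry": {"accession": "IPR002915"}},
      "locations": [{"start": 1, "end": 265}]
    }]
  }]
}
""")

for row in summarise(SAMPLE):
    print(row)
```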
The GFF3 format is a flat tab-delimited file, which is much richer than the TSV output format. It allows you to trace back from matches to predicted proteins and to nucleic acid sequences. It also contains a FASTA format representation of the predicted protein sequences and their matches. Documentation of all the columns and attributes used is available at http://www.sequenceontology.org/gff3.shtml.
Please note in GFF3 sequence identifiers “…may contain any characters, but must escape any characters not in the set…” (1)
a-zA-Z0-9.:^*$@!+_?-|.
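The escaping rule quoted above can be sketched as follows. GFF3 uses URL-style percent-encoding for escaped characters; the function name escape_seqid is hypothetical, and this is an illustration of the rule rather than the code InterProScan itself uses.

```python
import re

# Any byte outside the set of characters permitted unescaped in a GFF3 seqid.
_UNSAFE = re.compile(rb"[^a-zA-Z0-9.:^*$@!+_?|-]")

def escape_seqid(seqid):
    """Percent-encode every byte that GFF3 does not allow unescaped in a seqid."""
    raw = seqid.encode("utf-8")
    escaped = _UNSAFE.sub(lambda m: b"%%%02X" % m.group()[0], raw)
    return escaped.decode("ascii")

print(escape_seqid("seq 1;a"))
print(escape_seqid("sp|P51587|BRCA2_HUMAN"))
```

Identifiers that already consist only of permitted characters pass through unchanged, so the function can be applied to every identifier without checking first.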
Example output¶
##gff-version 3
##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269
##interproscan-version 5.26-65.0
##sequence-region AACH01000027 1 1347
##seqid|source|type|start|end|score|strand|phase|attributes
AACH01000027 provided_by_user nucleic_acid 1 1347 . + . Name=AACH01000027;md5=b2a7416cb92565c004becb7510f46840;ID=AACH01000027
AACH01000027 getorf ORF 1 1347 . + . Name=AACH01000027.2_21;Target=pep_AACH01000027_1_1347 1 449;md5=b2a7416cb92565c004becb7510f46840;ID=orf_AACH01000027_1_1347
AACH01000027 getorf polypeptide 1 449 . + . md5=fd0743a673ac69fb6e5c67a48f264dd5;ID=pep_AACH01000027_1_1347
AACH01000027 Pfam protein_match 84 314 1.2E-45 + . Name=PF00696;signature_desc=Amino acid kinase family;Target=null 84 314;status=T;ID=match$8_84_314;Ontology_term="GO:0008652";date=15-04-2013;Dbxref="InterPro:IPR001048","Reactome:REACT_13"
##sequence-region 2
...
>pep_AACH01000027_1_1347
LVLLAAFDCIDDTKLVKQIIISEIINSLPNIVNDKYGRKVLLYLLSPRDPAHTVREIIEV
LQKGDGNAHSKKDTEIRRREMKYKRIVFKVGTSSLTNEDGSLSRSKVKDITQQLAMLHEA
GHELILVSSGAIAAGFGALGFKKRPTKIADKQASAAVGQGLLLEEYTTNLLLRQIVSAQI
LLTQDDFVDKRRYKNAHQALSVLLNRGAIPIINENDSVVIDELKVGDNDTLSAQVAAMVQ
ADLLVFLTDVDGLYTGNPNSDPRAKRLERIETINREIIDMAGGAGSSNGTGGMLTKIKAA
TIATESGVPVYICSSLKSDSMIEAAEETEDGSYFVAQEKGLRTQKQWLAFYAQSQGSIWV
DKGAAEALSQYGKSLLLSGIVEAEGVFSYGDIVTVFDKESGKSLGKGRVQFGASALEDML
RSQKAKGVLIYRDDWISITPEIQLLFTEF
...
>match$8_84_314
KRIVFKVGTSSLTNEDGSLSRSKVKDITQQLAMLHEAGHELILVSSGAIAAGFGALGFKK
RPTKIADKQASAAVGQGLLLEEYTTNLLLRQIVSAQILLTQDDFVDKRRYKNAHQALSVL
LNRGAIPIINENDSVVIDELKVGDNDTLSAQVAAMVQADLLVFLTDVDGLYTGNPNSDPR
AKRLERIETINREIIDMAGGAGSSNGTGGMLTKIKAATIATESGVPVYICS
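A minimal sketch of reading the feature rows from such a file is given below. It assumes the nine columns are separated by tab characters, as the format description states (in the rendered example above they appear as spaces), and it stops at the first FASTA header. The function name parse_gff3_features and the returned dictionary layout are illustrative choices, not part of InterProScan.

```python
def parse_gff3_features(lines):
    """Return a list of dicts, one per nine-column feature row."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">") or line.startswith("##FASTA"):
            break  # the FASTA section follows; no more feature rows
        if not line or line.startswith("#"):
            continue  # blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature row
        attributes = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        features.append({
            "seqid": cols[0], "source": cols[1], "type": cols[2],
            "start": int(cols[3]), "end": int(cols[4]), "score": cols[5],
            "strand": cols[6], "phase": cols[7], "attributes": attributes,
        })
    return features

# One row from the example above, with tabs restored between the columns.
SAMPLE = [
    "##gff-version 3\n",
    "AACH01000027\tPfam\tprotein_match\t84\t314\t1.2E-45\t+\t.\t"
    "Name=PF00696;signature_desc=Amino acid kinase family;ID=match$8_84_314\n",
    ">match$8_84_314\n",
    "KRIVFKVGTSSLTNEDGSLSRSKVKDITQQ\n",
]

for feature in parse_gff3_features(SAMPLE):
    print(feature["type"], feature["start"], feature["end"], feature["attributes"]["Name"])
```

Attribute values are kept as raw strings, so quoted, comma-separated values such as Dbxref would still need to be split by the caller.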
Nucleic acid sequences scan¶
The Open Reading Frame prediction tool¶
InterProScan 5 takes advantage of the Open Reading Frame (ORF) prediction tool Emboss getorf. The getorf application itself and all of its dependencies are integrated in InterProScan. You do not need to install the Emboss package on your own, but you may use a local installation if you wish.
If you want to use a local installation you must edit the interproscan.sh script. This script sets 2 environment variables for Emboss getorf. Set these to the correct paths for your installation of Emboss.
# set environment variables for getorf
export EMBOSS_ACDROOT=bin/nucleotide
export EMBOSS_DATA=bin/nucleotide
In addition open and edit your properties file (interproscan.properties), which you will find in your InterProScan root directory. Search for the property ‘binary.getorf.path’ and change the path to your local getorf binary.
binary.getorf.path=/path/to/bin/nucleotide/getorf
How can I scan nucleic acid sequences in InterProScan 5?¶
./interproscan.sh -t n -i /path/to/nucleic_acid_sequences.fasta
or run the following commands:
#translate the nucleic_acid_sequences
./bin/nucleotides/translate -i /path/to/nucleic_acid_sequences.fasta -o /path/to/output_orfs_sequences.fasta
#if output_orfs_sequences.fasta has more than 32,000 sequences then chunk the file then send the chunks to InterProScan
#run InterProScan on the translated output
./interproscan.sh -i /path/to/output_orfs_sequences.fasta
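The chunking step mentioned in the comments above could be done along these lines. This is a sketch only: the function name chunk_fasta is hypothetical, and the default of 32,000 records simply mirrors the threshold quoted above.

```python
def chunk_fasta(lines, max_records=32000):
    """Yield lists of FASTA lines, each holding at most max_records sequences."""
    chunk, count = [], 0
    for line in lines:
        if line.startswith(">"):
            if count == max_records:
                yield chunk           # current chunk is full; start a new one
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk

# Tiny demonstration with a chunk size of 2.
demo = [">a\n", "AA\n", ">b\n", "CC\n", ">c\n", "GG\n"]
for number, chunk in enumerate(chunk_fasta(demo, max_records=2), start=1):
    print(number, "".join(chunk).count(">"))
```

Each chunk can then be written to its own file and passed to ./interproscan.sh with the -i option.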
Which output formats are supported?¶
Supported output formats are GFF3 and XML, which allow you to trace back from the match to the position inside your nucleic acid sequence. Please note that the TSV format is not available for nucleic acid sequence analysis.
Redundant sequences and identifiers in your FASTA file¶
InterProScan 5 is able to handle FASTA file entries with the same sequence, but different identifiers. For instance you have the following 2 sequences in your input file:
>sequence_1
ABC
>sequence_2
ABC
InterProScan 5 will condense these into a single sequence with two identifier cross-references in the XML output file:
<nucleotide-sequence>
<sequence md5="e9b174d63adc63bab79c90fdbc8d1670">ABC</sequence>
<xref id="sequence_1"/>
<xref id="sequence_2"/>
<orf strand="SENSE" start="1" end="3">
...
and in the GFF3 output:
##sequence-region sequence_1|sequence_2 1 3
sequence_1|sequence_2 provided_by_user nucleic_acid 1 3
...
Entries with the same identifier and the same sequence will be merged into one.
Please note: non-unique identifiers are not supported. InterProScan 5 will exit (with exit code 0) and will print out a list of all non-unique identifiers.
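It can be useful to check an input file for these cases before submitting it. The sketch below groups identifiers that share a sequence and flags identifiers that are attached to more than one distinct sequence. Reading "non-unique identifiers" as "the same identifier used for different sequences" is an interpretation of the text above, and the function names read_fasta and check_redundancy are hypothetical.

```python
from collections import defaultdict

def read_fasta(lines):
    """Return a list of (identifier, sequence) pairs from FASTA lines."""
    records, ident, seq = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if ident is not None:
                records.append((ident, "".join(seq)))
            parts = line[1:].split()
            ident, seq = (parts[0] if parts else ""), []
        elif line:
            seq.append(line)
    if ident is not None:
        records.append((ident, "".join(seq)))
    return records

def check_redundancy(records):
    """Return (sequences shared by several ids, ids used for several sequences)."""
    by_sequence = defaultdict(set)
    by_ident = defaultdict(set)
    for ident, seq in records:
        by_sequence[seq].add(ident)
        by_ident[ident].add(seq)
    shared = {s: sorted(ids) for s, ids in by_sequence.items() if len(ids) > 1}
    clashing = sorted(i for i, seqs in by_ident.items() if len(seqs) > 1)
    return shared, clashing

demo = [">sequence_1", "ABC", ">sequence_2", "ABC",
        ">sequence_3", "ABD", ">sequence_3", "ABE"]
shared, clashing = check_redundancy(read_fasta(demo))
print(shared)
print(clashing)
```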
Improving performance¶
InterProScan does not select one best ORF from the getorf output. Instead it takes the ORFs generated, selects the N longest and submits them for analysis. N is set by the binary.getorf.parser.filtersize property described below; the default is 8. This means analysing nucleotide sequences can take much longer than analysing protein sequences.
To improve InterProScan performance while running large nucleotide input files (> 10,000 sequences) you can:
First use an external program to translate your input. This is the best approach. There are various options, one of which is emboss-transeq (http://emboss.open-bio.org/rel/rel6/apps/transeq.html) from Emboss. If you use transeq then please use the -clean option to change STOP codon positions from ‘*’ to ‘X’, because InterProScan does not accept sequences with the ‘*’ character.
and/or…
Chunk the input and then send the chunks to InterProScan. For tips on configuring the general InterProScan CPU usage see also improving performance.
Selecting the ORFs to analyse¶
For improved performance, InterProScan will select the longest 8 ORFs predicted for each nucleic acid sequence. This can be changed using the “binary.getorf.parser.filtersize” setting in the interproscan.properties file:
binary.getorf.parser.filtersize=8
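The effect of this setting can be pictured with a short sketch. It is an illustration of "keep the N longest ORFs", not InterProScan's own implementation: the function name longest_orfs is hypothetical, and how InterProScan orders ORFs of equal length is not stated here (this sketch keeps their input order).

```python
def longest_orfs(orfs, filtersize=8):
    """Return the filtersize longest ORF sequences, longest first."""
    return sorted(orfs, key=len, reverse=True)[:filtersize]

print(longest_orfs(["MK", "MKLLV", "MKL", "M"], filtersize=2))
```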
The InterProScan Lookup Match Service¶
The InterProScan match lookup service stores pre-calculated InterProScan results for the sequences in the InterPro database. When InterProScan is queried with a known sequence, it retrieves the result from the lookup service and reports the result immediately, thereby reducing compute requirements and improving performance.
For sequences not in the lookup service, InterProScan will calculate these from scratch using the various analyses requested by the user.
The default interproscan.properties configuration will use the lookup service hosted at EBI http://www.ebi.ac.uk/interpro/match-lookup/version. This will be the most recent lookup service version and is only compatible with the most recent InterProScan release:
precalculated.match.lookup.service.url=http://www.ebi.ac.uk/interpro/match-lookup
#proxy set up
precalculated.match.lookup.service.proxy.host=
precalculated.match.lookup.service.proxy.port=3128
The default lookup service will not be used in the following scenarios (therefore all calculations are performed locally):
The version number of the service does not match your version of InterProScan.
The service cannot be accessed (e.g., for firewall reasons or it is temporarily unavailable).
You disable the lookup feature. To disable the service you could either:
Use the “-dp” command line option.
Set the “precalculated.match.lookup.service.url=” property in your interproscan.properties configuration file (to an empty value).
You can choose to download and install the InterProScan lookup service locally if required. This offers you several advantages:
Control over the version of the lookup service. If you choose to upgrade InterProScan less frequently than the release cycle, you can ensure that you are using a lookup service that is synchronized with the version of InterProScan that you are running.
A dedicated service. You will not be competing with other users for access to the service.
Control over the scale of the service. The service is extremely responsive (a few milliseconds per sequence request) and a single web server will cope with a high load. However, if you expect to put the service under a very high load, you may choose to run the service in parallel on multiple machines, potentially with load balancing.
Run the service behind your firewall for maximum security.
Because of the very large size of the Berkeley database used by the Lookup Service, you are recommended to observe the following minimum requirements:
Java 11
Recommended minimum 2 cores (processors)
4GB RAM (of which > 2GB will be consumed by the service when you run it)
Version 5.67-99.0 of the lookup service is only compatible with version 5.67-99.0 of InterProScan. The instructions below are for installing the latest version; previous versions of the lookup service can be downloaded from https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/.
This service is a very large download! You are strongly recommended to check the md5 checksum (as described below) to ensure that the file has been downloaded correctly.
# Create and enter a suitable directory
mkdir i5_lookup_service
cd i5_lookup_service
# Download the tarball and the MD5 file.
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/lookup_service_5.67-99.0.tar.gz
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/lookup_service_5.67-99.0.tar.gz.md5
# Recommended checksum to confirm the download was successful:
md5sum -c lookup_service_5.67-99.0.tar.gz.md5
# Must return *lookup_service_5.67-99.0.tar.gz: OK*
# If not - try downloading the file again as it may be a corrupted copy.
(Direct link: https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/lookup_service_5.67-99.0.tar.gz)
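Where md5sum is not available, the same check can be approximated in Python. This sketch assumes the .md5 file uses the usual md5sum layout (the checksum as the first whitespace-separated token), which is implied by the md5sum -c command above; the function names are hypothetical.

```python
import hashlib

def md5_of_file(path, block_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def matches_md5_file(path, md5_path):
    """Compare a file's MD5 with the checksum stored in an md5sum-style file."""
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(path) == expected

# Example call (paths as used in the download commands above):
# matches_md5_file("lookup_service_5.67-99.0.tar.gz",
#                  "lookup_service_5.67-99.0.tar.gz.md5")
```

Reading in blocks keeps memory use flat, which matters for a download of this size.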
Extract the tarball:
tar -pxvzf lookup_service_5.67-99.0.tar.gz
# where:
# p = preserve the file permissions
# x = extract files from an archive
# v = verbosely list the files processed
# z = filter the archive through gzip
# f = use archive file
The service can be run in one of two ways:
Run with graphical user interface (to set port number)¶
If you are running it on a machine with a desktop interface and just want to test the lookup service, a simple user interface is included to allow you to set the port number to run the service.
Note that in the example below, the memory available to Java has been set to 8000MB (using the -Xmx8000m switch). This is recommended as a good starting value - you may choose to set this higher if the service will be used heavily. (We have tested it with -Xmx36000m without problems).
cd lookup_service_5.67-99.0
java -Xmx8000m -jar server-5.67-99.0-jetty-console.war
A new window will open. Set the port number as required and click the “Start” button to start the web service running.
The initialization of the web service usually takes a while, depending on the machine you are running it on. After successful initialization you will be forwarded to the ‘InterProScan 5 Pre-calculated Match Lookup Service’ landing page within your browser, and from then on the lookup service is ready to be used.
Run “Headless” (no graphical user interface)¶
It is most likely that you will want to run the lookup service “headless”, i.e. purely as a command line tool. In this case, the port number and other options can be passed in on the command line as follows:
Note that in the example below, the memory available to Java has been set to 8000MB (using the -Xmx8000m switch). This is recommended as a good starting value - you may choose to set this higher if the service will be used heavily. (We have tested it with -Xmx36000m).
cd lookup_service_5.67-99.0
java -Xmx8000m -jar server-5.67-99.0-jetty-console.war [--option=value] [--option=value]
# Example command:
# java -Xmx8000m -jar server-5.67-99.0-jetty-console.war --headless --port 8080
Where options include:
Options:
--sslProxied - Running behind an SSL proxy
--port n - Create an HTTP listener on port n (default 8080)
--bindAddress addr - Accept connections only on address addr (default: accept on any address)
--forwarded - Set reverse proxy handling using X-Forwarded-For headers
--contextPath /path - Set context path (default: /)
--headless - Don't open graphical console, even if available
--help - Print this help message
--tmpDir /path - Temporary directory, default is /tmp
The lookup service is very large and could take over an hour to start. Example output from a successful startup is given below:
$ java -Xmx8000m -jar server-5.67-99.0-jetty-console.war
10242 [Thread-2] INFO org.simplericity.jettyconsole.DefaultJettyManager - Added web application on path / from war /example/path/to/server-5.67-99.0-jetty-console.war
10243 [Thread-2] INFO org.simplericity.jettyconsole.DefaultJettyManager - Starting web application on port 8080
10245 [Thread-2] INFO org.eclipse.jetty.server.Server - jetty-8.1.12.v20130726
10818 [Thread-2] INFO org.eclipse.jetty.plus.webapp.PlusConfiguration - No Transaction manager found - if your webapp requires one, please configure one.
12226 [Thread-2] INFO org.eclipse.jetty.webapp.StandardDescriptorProcessor - NO JSP Support for /, did not find org.apache.jasper.servlet.JspServlet
12243 [Thread-2] INFO / - No Spring WebApplicationInitializer types detected on classpath
12344 [Thread-2] INFO / - Initializing Spring root WebApplicationContext
Initializing BerkeleyDB Match Database (creating indexes): Please wait...
Initializing BerkeleyDB MD5 Database (creating indexes): Please wait...
1049793 [Thread-2] INFO / - Initializing Spring FrameworkServlet 'mvc'
Initializing BerkeleyDB Match Database (creating indexes): Please wait...
Initializing BerkeleyDB MD5 Database (creating indexes): Please wait...
1050000 [Thread-2] INFO org.eclipse.jetty.server.AbstractConnector - Started @0.0.0.0:8080
Note that an “Address already in use” error indicates that the lookup service (or another existing service) is already running on that machine and port. Either stop the existing service, or configure the lookup service to use a different port using the --port option.
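One way to avoid this error is to check whether the port is free before starting the service. The sketch below tries to bind to the port on the local machine; the function name port_is_free is hypothetical, and the check covers only the address given (127.0.0.1 by default), so a service bound to a different interface might not be detected.

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # typically "Address already in use"
        return True

# Example: check the default port used by the lookup service.
print(port_is_free(8080))
```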
Once successfully started, the service will wait, ready to receive any requests that are passed its way. It will continue listening for requests until the service is stopped. To confirm all is running correctly you can now test the service.
To test the service:
# Assuming the lookup service has been started on the same machine and you are using
# the default port of 8080 then...
# in a web browser:
http://localhost:8080/version
http://localhost:8080/matches?md5=2E38C8D754C63117A4FA5F5E44F2194E
# or using curl on the command line:
curl "http://localhost:8080/version"
curl "http://localhost:8080/matches?md5=2E38C8D754C63117A4FA5F5E44F2194E"
# To access your lookup service from another machine replace "localhost" with
# the fully qualified name of the machine where the lookup service is running.
# The Linux command "uname -n" can be used to find the machine name.
# Alternatively you could use the machine's IP address instead of the hostname.
This should return an XML file containing match data (you may need to “view source” on your web browser to see this properly).
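The same test can be scripted. The sketch below builds the /version and /matches URLs shown above and fetches them with the Python standard library; the function names are hypothetical, and the MD5 value must be supplied by the caller (how it should be computed for a given sequence is not covered here).

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def version_url(base_url):
    """URL of the version endpoint for a lookup service base URL."""
    return base_url.rstrip("/") + "/version"

def matches_url(base_url, md5):
    """URL of the matches endpoint for one sequence MD5."""
    return base_url.rstrip("/") + "/matches?" + urlencode({"md5": md5})

def fetch(url, timeout=10):
    """Return the response body as text; raises an error if the service is unreachable."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

print(matches_url("http://localhost:8080", "2E38C8D754C63117A4FA5F5E44F2194E"))
# With the service running:
# print(fetch(version_url("http://localhost:8080")))
```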
If you leave it running then the lookup service is now ready to receive any requests that may come its way.
To configure your local installation of InterProScan 5 to use your lookup service, edit the interproscan.properties file and set the property precalculated.match.lookup.service.url to point to your service.
Replace host with the machine name and port with the port number your server is running on:
precalculated.match.lookup.service.url=http://host:port
# Note: You can check your lookup service URL is accessible using curl on
# the command line of the machine you will be running InterProScan from
# For example, "curl http://host:port/" should return the expected HTML source
For example, if you are running the server on a machine named lookuphost on port 8080, you should set the property as follows:
precalculated.match.lookup.service.url=http://lookuphost:8080
Or if you are running the server locally on port 8080, you should set the property as follows:
precalculated.match.lookup.service.url=http://localhost:8080
You can also substitute the server name with an IP address if necessary.
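If the property needs to be changed on several installations, the edit can be scripted. This is a minimal sketch that rewrites one key=value line in the text of a properties file; the function name set_property is hypothetical, and it handles only simple single-line properties such as the one shown above.

```python
def set_property(text, key, value):
    """Return properties text with key set to value, appending the key if absent."""
    out, found = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped.startswith("#") and stripped.split("=", 1)[0].strip() == key:
            out.append(f"{key}={value}")
            found = True
        else:
            out.append(line)  # comments and other properties are kept untouched
    if not found:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"

KEY = "precalculated.match.lookup.service.url"
original = "#proxy set up\n" + KEY + "=http://www.ebi.ac.uk/interpro/match-lookup\n"
print(set_property(original, KEY, "http://lookuphost:8080"), end="")
```

Passing an empty value produces the "property set to an empty value" form, which disables the lookup.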
Please note that if you need to access the internet through a proxy server then you will also need to update the following properties:
precalculated.match.lookup.service.proxy.host=
precalculated.match.lookup.service.proxy.port=3128
Running InterProScan 5 in Cluster Mode¶
In the “cluster” mode, InterProScan 5 activates a master/worker parallelisation mode which takes advantage of your cluster capabilities to distribute the analysis components on the cluster making large jobs complete faster. The benefits of this mode will be seen with larger inputs (approx >32000 protein sequences depending on resources). However, for smaller inputs the default “standalone” mode (or “singleseq” mode for one sequence) will still be preferable due to the overhead in initialising InterProScan in cluster mode.
This documentation should be read in conjunction with the information on the page Running InterProScan 5.
Currently we support Load Sharing Facility (LSF) and Sun Grid Engine (SGE), now known as Oracle Grid Engine. InterProScan 5 has been tested on SGE 8.1.2 running 64-bit Linux. However, “cluster” mode is currently not as fault tolerant as the default “standalone” mode, so we recommend the more stable “standalone” mode.
You can configure InterProScan 5 to run on other clusters by changing the submission commands below.
Initial Setup¶
Before running InterProScan 5 in cluster mode, the following configuration must be completed correctly for your cluster setup.
Edit the interproscan.properties file.
Add or modify the properties below appropriately for your cluster.
Note - you must set the submission command including the ‘QUEUE_NAME’ correctly for your LSF, SGE or other cluster.
If you are in any doubt about any of these settings, you should consult the systems administrator who maintains your cluster.
#Specify your cluster (LSF, SGE or any other cluster)
grid.name=lsf
#grid.name=other-cluster
#Java Virtual Machine (JVM) maximum idle time for jobs.
#Default is 180 seconds, if not specified. When reached the worker will shutdown.
jvm.maximum.idle.time.seconds=180
#JVM maximum life time for workers.
#Default is 14400 seconds, if not specified. After this period has passed the worker will shutdown unless it is busy.
jvm.maximum.life.seconds=14400
#Maximum number of jobs per clusterRunId. Default is 3000.
grid.jobs.limit=3000
#commands to start new jvms
worker.command=java -Xms256m -Xmx1024m -jar interproscan-5.jar
worker.high.memory.command=java -Xms256m -Xmx2048m -jar interproscan-5.jar
#directory for any log files generated by InterProScan
log.dir=logs
Cluster submission commands¶
On your cluster the following submission command properties should be configured. LSF example:
#Grid submission commands (e.g. LSF bsub or SGE qsub) for starting remote workers
#The following 2 commands are used by the master to spawn normal or high memory workers
grid.master.submit.command=bsub -q QUEUE_NAME
grid.master.submit.high.memory.command=bsub -q QUEUE_NAME -M 8192
#The following 2 commands are used by workers to spawn normal or high memory workers
grid.worker.submit.command=bsub -q QUEUE_NAME
grid.worker.submit.high.memory.command=bsub -q QUEUE_NAME -M 8192
#network growth
#if the main/master InterProScan job runs on a submission node and other nodes cannot submit jobs, set max.tier.depth to 1; otherwise it can be greater than 1
max.tier.depth=1
SGE equivalent:
grid.master.submit.command=qsub -cwd -V -b y -N i5t1worker
grid.master.submit.high.memory.command=qsub -cwd -V -b y -N i5t1hmworker
grid.worker.submit.command=qsub -cwd -V -b y -N i5t2worker
grid.worker.submit.high.memory.command=qsub -cwd -V -b y -N i5t2hmworker
We recommend reading the SGE manual (http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html) for the different qsub options.
Note: the SGE cluster mode is a new feature that has not been tested extensively, and we would welcome any feedback you may have.
Other clusters¶
For other clusters, change the submission property grid.master.submit.command to suit your cluster requirements.
Master configuration options¶
If you require that the master InterProScan should not run any analysis but only do housekeeping, change the following property to false (from version 5.1-44.0 onwards).
#allow master interproscan to run binaries
master.can.run.binaries=false
Example usage on an LSF, SGE and other clusters¶
To enable InterProScan 5 to “farm out” analysis components on LSF, it is necessary to run the interproscan.sh script with the -mode cluster switch. This turns on the ability for the “master” to create child “worker” processes on the cluster that are able to take analysis steps from the master and run them remotely.
As an example:
./interproscan.sh -mode cluster -clusterrunid uniqueName -i /path/to/sequences.fasta -b /path/to/output_file
Please note, in cases where the main (master) InterProScan jvm dies unexpectedly you might still see workers running, but they will shutdown as soon as they reach their maximum idle time.
clusterrunid¶
--clusterrunid (alias -crid) is a mandatory option that takes an argument.
This can be used for monitoring your distributed jobs within a single run. On LSF clusters, the value for --clusterrunid is passed as the LSF project option -P.
In cluster mode InterProScan 5 spawns new “worker” Java processes according to the volume of analysis that needs to be performed.
In house tested cluster versions¶
Platform LSF
Version | Result
---|---
8.0.1 | Tested successfully
9.1.1.1 | Tested successfully 1)
1) From this LSF version on you have to include the -n option in your bsub command if you want to assign more than 1 CPU to workers (1 CPU is the default value in this version). We strongly recommend doing so, otherwise InterProScan will be much slower in CLUSTER mode. How many CPUs you need to reserve depends on your cluster nodes and your binary CPU settings. If you need help with that, please don’t hesitate to contact us using EMBL-EBI’s support form.
SGE
Version | Result
---|---
8.1.2 | Tested successfully
Running InterProScan 5 in CONVERT mode¶
InterProScan 5’s CONVERT mode allows you to reformat an existing InterProScan XML result file into any other possible output format (TSV, GFF3, JSON). For compatibility reasons you can also convert XML results into InterProScan 4.8 raw format (RAW). This will give our users enough time to migrate their pipeline to InterProScan 5.
Please note it is NOT possible to reformat any non-XML format. XML is the richest data type and is therefore the only format which allows us to produce any other format of interest.
For more information on the InterProScan formats available, see output formats (OutputFormats.html).
To enable InterProScan 5 to run in CONVERT mode you need to set the mode option to ‘CONVERT’.
Usage instructions¶
./interproscan.sh -mode convert
You will see the following usage instructions:
Welcome to InterProScan 5RC7
usage: java -XX:+UseParallelGC -XX:+AggressiveOpts
-XX:+UseFastAccessorMethods -Xms512M -Xmx2048M -jar
interproscan-5.jar
Please give us your feedback by sending an email to
interhelp@ebi.ac.uk
-b,--output-file-base <OUTPUT-FILE-BASE> Optional, base output filename
(relative or absolute path).
Note that this option and the
--outfile (-o) option are
mutually exclusive. The
appropriate file extension for
the output format(s) will be
appended automatically. By
default the input file
path/name will be used.
-d,--output-dir <OUTPUT-DIR> Optional, output directory.
Note that this option and the
--outfile (-o) option or the
--output-file-base (-b) option
are mutually exclusive. The
appropriate file extension for
the output format(s) will be
appended automatically. By
default the input file
path/name will be used.
-f,--formats <OUTPUT-FORMATS> Optional, case-insensitive,
comma separated list of output
formats. Supported formats are
TSV, XML, JSON, and GFF3.
Default for protein sequences
are TSV, XML and GFF3, or
for nucleotide sequences
GFF3 and XML.
-i,--input <INPUT-FILE-PATH> Optional, path to fasta file
that should be loaded on
Master startup. Alternatively,
in CONVERT mode, the
InterProScan 5 XML file to
convert.
-o,--outfile <EXPLICIT_OUTPUT_FILENAME> Optional explicit output file
name (relative or absolute
path). Note that this option
and the --output-file-base
(-b) option are mutually
exclusive. If this option is
given, you MUST specify a
single output format using the
-f option. The output file
name will not be modified.
Note that specifying an output
file name using this option
OVERWRITES ANY EXISTING FILE.
-T,--tempdir <TEMP-DIR> Optional, specify temporary
file directory (relative or
absolute path). The default
location is temp/.
Copyright (c) EMBL European Bioinformatics Institute, Hinxton, Cambridge,
UK. (http://www.ebi.ac.uk) The InterProScan software itself is provided
under the Apache License, Version 2.0
(http://www.apache.org/licenses/LICENSE-2.0.html). Third party components
(e.g. member database binaries and models) are subject to separate
licensing - please see the individual member database websites for
details.
Example Usage¶
# Convert from XML format to all other available formats
./interproscan.sh -mode convert -f tsv,gff3,raw -i /path/to/existing_output_file.xml -b /path/to/output_file_basename
# Convert from XML format to TSV format (which automatically includes all available InterPro entry/GO term/pathways information)
./interproscan.sh -i /path/to/existing_output_file.xml -mode convert -f tsv -o /path/to/new_output_file.tsv
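When CONVERT mode is driven from a script, the option rules in the usage text above (-o and -b are mutually exclusive, and -o requires a single output format) can be checked before the command is run. The sketch below only builds the argument list; the function name convert_command is hypothetical, and the command is assumed to be run from the InterProScan root directory.

```python
def convert_command(xml_path, formats, outfile=None, output_base=None):
    """Build the interproscan.sh argument list for CONVERT mode."""
    if outfile and output_base:
        raise ValueError("-o and -b are mutually exclusive")
    if outfile and len(formats) != 1:
        raise ValueError("-o requires exactly one output format")
    command = ["./interproscan.sh", "-mode", "convert",
               "-i", xml_path, "-f", ",".join(formats)]
    if outfile:
        command += ["-o", outfile]
    if output_base:
        command += ["-b", output_base]
    return command

print(convert_command("/path/to/existing_output_file.xml", ["tsv"],
                      outfile="/path/to/new_output_file.tsv"))
```

The resulting list can be passed to subprocess.run(command, check=True).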
Improving performance¶
If InterProScan is taking a long time to run, or you just want to improve on the run time you are getting, then consider some of the following:
By default InterProScan uses 8 CPU cores on your machine. Most of the time this configuration is sufficient. However, if you have more cores available and enough memory to support more threads, you can change the number of CPU cores used by adding the option below to the InterProScan command line, where N is the desired number of cores:
-cpu N
The value N for -cpu represents the maximum number of threads (embedded workers) InterProScan will start and run at a time.
Remember that the more cores you specify, the more memory InterProScan will require to run successfully. Here are some observed numbers that may act as a guide, but you may have to experiment with your own data. The input sequences were taken from UniProt.
-cpu | max memory used (GB) | input sequence count | input sequence size (MB) | run time |
---|---|---|---|---|
16 | 8 | 8,000 | 3 | 2 hrs |
16 | 12 | 16,000 | 6 | 4 hrs |
16 | 15 | 160,000 | 56 | 12 hrs |
Suppose you have a machine with 32 cores available and you want to use all or most of them. We recommend specifying -cpu 30, as the main InterProScan process will always use 1 core.
Each database analysis may also have options to specify how many threads to assign to it; for example, HMMER3-based analyses such as Gene3D have this option. However, we do not recommend changing the default CPU values for each analysis.
If your FASTA input file contains a large number of sequences, say over 160,000 protein sequences, then consider splitting your input into smaller chunks (this depends on resources, but batches of 80,000 protein sequences are a suggested starting point). You can then submit the smaller input files to InterProScan and process the results afterwards.
For DNA/RNA sequences a much smaller number is suggested (e.g. 12,000 sequences). However for improved performance you could translate these using an external tool and then submit the necessary protein sequences instead, see running nucleic acid sequences for more information.
Do you need all the output InterProScan supplies by default? See How to run InterProScan for more details, for example you may consider options such as:
Which result data are you interested in? Do you require all applications (see the -appl option)?
Do you require the residue level annotation? If not, this calculation can be disabled with the -dra option.
Make use of the default lookup service, or your own local lookup service to avoid the need for calculating known results again (on by default, read more).
This mode is still experimental, so we do not recommend running it in production.
If you want to analyse sequences on a cluster/farm and would like to set the number of reserved cores for each node, see Running InterProScan in CLUSTER mode.
For nucleic acid sequences, consider reducing the number of ORFs to analyse.
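The -cpu guidance above can be turned into a small helper. This is only a sketch: the function name suggest_cpu is hypothetical, and the rule of leaving two cores free is an assumption taken from the 32-core example (-cpu 30), not an official formula. Check the memory figures in the table before raising the value.

```shell
# Hypothetical helper: suggest a -cpu value for a machine with a given core count.
# Assumption: leave two cores free, mirroring the "32 cores -> -cpu 30" example.
suggest_cpu() {
  cores=$1
  n=$((cores - 2))
  # Never suggest fewer than one worker thread.
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}

# Example use (nproc is part of GNU coreutils and reports the available cores):
#   ./interproscan.sh -i input.fasta -f tsv -cpu "$(suggest_cpu "$(nproc)")"
```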
By default the Phobius, SignalP and TMHMM member database analyses are deactivated because they contain licensed components. In order to activate these analyses please obtain the relevant license and files from the provider (ensuring the software version numbers are the same as those supported by your current InterProScan installation).
An example of how to activate the Phobius 1.01, SignalP 4.1 and TMHMM 2.0 analyses with InterProScan 5.19-58.0 is given below. Files can be placed in any location as long as your interproscan.properties configuration is updated accordingly.
Phobius¶
Website: http://phobius.sbc.su.se/data.html
Files required by InterProScan:
bin/phobius/1.01/decodeanhmm
bin/phobius/1.01/phobius.model
bin/phobius/1.01/phobius.options
bin/phobius/1.01/phobius.pl
Example interproscan.properties configuration:
phobius.signature.library.release=1.01
binary.phobius.pl.path=bin/phobius/1.01/phobius.pl
SignalP¶
Website: http://www.cbs.dtu.dk/services/SignalP/
For academic users there is a download site at: http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?signalp Other users are requested to contact software@cbs.dtu.dk.
Files required by InterProScan:
bin/signalp/4.1/signalp
bin/signalp/4.1/bin/nnhowplayer.Linux_i386
bin/signalp/4.1/bin/nnhowplayer.Linux_i486
bin/signalp/4.1/bin/nnhowplayer.Linux_i586
bin/signalp/4.1/bin/nnhowplayer.Linux_i686
bin/signalp/4.1/bin/nnhowplayer.Linux_ia64
bin/signalp/4.1/bin/nnhowplayer.Linux_x86_64
Example interproscan.properties configuration:
signalp_euk.signature.library.release=4.1
signalp_gram_positive.signature.library.release=4.1
signalp_gram_negative.signature.library.release=4.1
binary.signalp.path=bin/signalp/4.1/signalp
signalp.perl.library.dir=bin/signalp/4.1/lib
Please confirm that the following line in the “signalp” binary is set to the required location:
BEGIN {
$ENV{SIGNALP} = 'bin/signalp/4.1';
}
TMHMM¶
Website: http://www.cbs.dtu.dk/services/TMHMM/
There is a download page http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?tmhmm for academic users; other users are requested to contact CBS Software Package Manager at software@cbs.dtu.dk.
Files required by InterProScan:
bin/tmhmm/2.0c/decodeanhmm
data/tmhmm/2.0c/TMHMM2.0c.model
Example interproscan.properties configuration:
tmhmm.signature.library.release=2.0c
binary.tmhmm.path=bin/tmhmm/2.0c/decodeanhmm
tmhmm.model.path=data/tmhmm/2.0c/TMHMM2.0c.model
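Before enabling a licensed analysis, it can help to confirm that the required files are where the configuration expects them. The sketch below assumes the files were placed at the default relative paths listed above; check_required_files is a hypothetical helper name, not part of InterProScan.

```shell
# Hypothetical helper: report which of the listed files are missing.
# $1 = InterProScan installation directory, remaining arguments = relative paths.
# Prints one "missing: <path>" line per absent file and returns 1 if any is absent.
check_required_files() {
  base=$1
  shift
  missing=0
  for f in "$@"; do
    if [ ! -e "$base/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Example use for TMHMM, run from the InterProScan installation directory:
#   check_required_files . bin/tmhmm/2.0c/decodeanhmm data/tmhmm/2.0c/TMHMM2.0c.model
```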
Providing your feedback¶
Support requests¶
Support requests should be sent by using EBI Support & Feedback. We will endeavour to respond to support requests as quickly as possible.
General discussion and suggestions¶
Send your comments and suggestions about InterProScan 5 to EBI’s Support & Feedback as well.
Known issues¶
This page documents the latest list of known issues, and we are working to fix them as soon as possible. For assistance with other InterProScan problems, please contact us using EMBL EBI’s support form.
1. CDD/RPSBlast errors¶
On some Linux systems, you may get rpsblast errors like:
bin/blast/ncbi-blast-2.10.1+/rpsbproc: error while loading shared libraries: libgomp.so.1: cannot open shared object file: No such file or directory
The missing library, libgomp.so.1, is provided by the libgomp1 package. On Ubuntu you might install it as follows:
sudo apt-get install -y libgomp1
Other distributions have similar installation commands.
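To see in advance which shared libraries a bundled binary cannot find, you can inspect it with ldd, a standard Linux tool. This is a minimal sketch; missing_libs is a hypothetical helper name, and the rpsbproc path in the usage comment is the one from the error message above.

```shell
# Hypothetical helper: list the shared libraries that a binary cannot resolve.
# Prints the "not found" lines reported by ldd; prints nothing if all resolve.
missing_libs() {
  ldd "$1" 2>/dev/null | grep 'not found'
}

# Example use, run from the InterProScan installation directory:
#   missing_libs bin/blast/ncbi-blast-2.10.1+/rpsbproc
```

Each line of output names a library that needs to be installed through your distribution's package manager.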
2. Coils errors¶
If you see an error concerning Coils, for example the error below, it means the binary we provide is not compatible with your system.
Cannot run program ".../bin/ncoils/2.2.1/ncoils": error=2, No such file or directory
In this case, you may need to compile the Coils binary yourself, which is straightforward:
cd src/coils/ncoils/2.2.1
make
cd ../../../..
cp src/coils/ncoils/2.2.1/ncoils bin/ncoils/2.2.1/ncoils
These steps should update the Coils binary.
3. Prosite/pfsearchV3 errors¶
On some Linux systems, you may get pfsearchV3 errors like:
Error output from binary: bin/prosite/pfsearchV3: error while loading shared libraries: libpcre2.so: cannot open shared object file: No such file or directory
The missing library is pcre2. On Ubuntu you might install it as follows:
sudo apt-get install -y libpcre2-dev
Other distributions have similar installation commands.
Additionally, on some systems, especially in cluster environments, the following error has been reported:
Error setting affinity!
Error running prosite binary bin/prosite/pfsearchV3
This means the provided binary is not compatible with your system. We have included alternative binaries that you can try by updating your interproscan.properties file to point to them:
binary.prosite.pfscanv3.path=${bin.directory}/prosite/altbin/pfscanV3.noaf
binary.prosite.pfsearchv3.path=${bin.directory}/prosite/altbin/pfsearchV3.noaf
If the problem persists, you may need to compile those binaries from source, making sure to build without affinity by using the following flag:
cmake -DUSE_AFFINITY=OFF ..
4. HMMER errors¶
The HMM libraries provided by some member databases (SUPERFAMILY and SFLD) are not compatible with newer HMMER versions, and an error will occur when those libraries are indexed by an hmmpress version greater than ‘3.1b1’. To avoid this issue we recommend using the HMMER binaries bundled with InterProScan.
If you encounter errors not listed above, please contact us using EMBL EBI’s support form.
Contacting us¶
Please give us enough background information when you contact us, such as:
the Linux distribution and version
the InterProScan version
the Java version
command line used
the complete error log if possible
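Most of the system details listed above can be gathered with a few standard commands. The sketch below is one way to do it; collect_info is a hypothetical helper name, and reading /etc/os-release assumes a reasonably recent Linux distribution.

```shell
# Hypothetical helper: print system details that are useful in a support request.
collect_info() {
  echo "== Kernel and architecture =="
  uname -a
  echo "== Linux distribution =="
  if [ -r /etc/os-release ]; then
    cat /etc/os-release
  else
    echo "(/etc/os-release not found)"
  fi
  echo "== Java version =="
  if command -v java >/dev/null 2>&1; then
    # java prints its version on standard error, so redirect it to standard output
    java -version 2>&1
  else
    echo "(java not found on PATH)"
  fi
}

# Example use: save the report to a file that can be attached to the request.
#   collect_info > system_info.txt
```

The InterProScan version, the exact command line used and the error log still need to be added by hand.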
FAQ¶
What should I do if one of the binaries included with InterProScan doesn’t work on my system?¶
Please see the section Compiling binaries for instructions on how to compile the various binaries on your own system.
Where can I find the XSD of the XML output?¶
The XML Schema Definition (XSD) is linked under the Extensible Markup Language (XML) section of the InterProScan OutputFormats page.
Can I use different binary versions than listed?¶
InterProScan 5 is designed to run with the same binaries used by the supported member database analysis versions. This ensures that the output results returned are as the member database intended. This is why for example you will find multiple versions of HMMER (e.g. for the SMART and Pfam analyses) bundled with InterProScan and referenced in the interproscan.properties configuration file.
Swapping the binary versions is not recommended. InterProScan could fail (e.g. if the input/output of the binary has changed and is no longer recognised). Even if no errors are thrown, you would be running with an unexpected binary and we cannot guarantee the results would match what the analysis intended.
If you are having problems running the provided versions of certain binaries on your system, please follow these instructions.
Which cluster does InterProScan support?¶
In theory InterProScan is written flexibly enough to run on any cluster platform, not only on LSF and SGE. However, LSF and SGE are the only platforms we can test here at the EBI. We have had feedback from users who run it successfully on a PBS cluster. For further information on how to configure your cluster version please follow the documentation.
Is there a Galaxy wrapper for InterProScan?¶
Do you want to add InterProScan 5 to your Galaxy analysis pipeline? You can find the wrapper for InterProScan 5 on GitHub.
When using InterProScan 5 with Galaxy, the cluster integration is done via Galaxy, which means you cannot use InterProScan 5’s in-built CLUSTER mode.
Documentation and contact details¶
Galaxy Tool Shed link for InterProScan 5: http://toolshed.g2.bx.psu.edu/view/bgruening/interproscan5
Contact: Bjoern Gruening (bjoern.gruening@gmail.com)
Publication¶
Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013). Galaxy tools and workflows for sequence analysis with applications in molecular plant pathology. PeerJ 1:e167 (http://dx.doi.org/10.7717/peerj.167)
I get Java errors on running InterProScan¶
If a simple test of InterProScan fails, please check that your installed version of Java is suitable; see installation requirements for more details. The latest version runs with Java 11.
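The Java check can be scripted. This is a sketch under stated assumptions: both function names are hypothetical, and the parsing assumes the usual first line of java -version output, which contains the version in double quotes (for example version "11.0.2" or the older style version "1.8.0_292").

```shell
# Hypothetical helper: extract the quoted version string from `java -version`.
java_version_string() {
  java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: reduce a version string to its major version number.
# Old-style strings such as 1.8.0_292 carry the major version in the second field.
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;
    *)   echo "$1" | cut -d. -f1 ;;
  esac
}

# Example use:
#   major=$(java_major "$(java_version_string)")
#   [ "$major" -ge 11 ] || echo "Java $major found, but Java 11 is required"
```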
How to analyse a huge amount of protein sequences (>30000)?¶
The following guidance is good practice when you use InterProScan to annotate large sequence sets. To give you an idea of InterProScan’s run time, in-house we are able to annotate a complete Escherichia coli proteome (~3,000 protein sequences) on our farm (standalone mode) within ~1 hour. Other sequence sets of 16,000 protein sequences have taken ~5 hours on a machine with 8 cores and 8 GB RAM.
If you want to annotate huge amounts of protein sequences, we strongly recommend splitting your input into chunks of, say, 80,000 sequences. If you are analysing nucleic acid sequences, the chunk size should be even smaller. You would then run an individual InterProScan job for each chunk file. That way you get intermediate results, and if InterProScan crashes halfway through you do not lose everything. See Improving performance.
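The chunking approach described above can be sketched with awk. The function names and the chunk file naming scheme are hypothetical, and the sketch assumes a plain FASTA file in which every record starts with a ">" header line. Only the -i, -f and -b options from the InterProScan help are used.

```shell
# Hypothetical helper: split a FASTA file into chunks of N sequences each.
# $1 = input FASTA file, $2 = sequences per chunk, $3 = output file prefix.
# Writes <prefix>_0001.fasta, <prefix>_0002.fasta, and so on.
split_fasta() {
  awk -v n="$2" -v prefix="$3" '
    /^>/ {
      # Start a new chunk file every n header lines.
      if (count % n == 0) {
        if (out != "") close(out)
        out = sprintf("%s_%04d.fasta", prefix, count / n + 1)
      }
      count++
    }
    # Lines before the first header are ignored.
    out != "" { print > out }
  ' "$1"
}

# Hypothetical helper: run one InterProScan job per chunk, one after another.
# $1 = the same prefix that was passed to split_fasta.
run_chunks() {
  for chunk in "$1"_*.fasta; do
    [ -e "$chunk" ] || continue
    ./interproscan.sh -i "$chunk" -f tsv -b "${chunk%.fasta}"
  done
}

# Example use:
#   split_fasta proteins.fasta 80000 chunk
#   run_chunks chunk
```

Because each chunk writes its own output file, a failed job only needs that one chunk to be rerun.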
Should I filter by e-value?¶
The e-values are specific to each individual InterPro member database and therefore cannot be compared directly, nor can a single threshold be applied to them all. This is because some member databases use the e-values for post-processing (e.g. SMART, Panther), while others just output them as part of their results but actually use other measures for filtering results (e.g. Pfam and the HMMER GA cut-off). Therefore as far as InterProScan is concerned, if a match is in the output then it is a match!
Why do I see “Pre-calculated match lookup service failed - analysis proceeding to run locally”?¶
This is a warning to say that the match lookup service you are trying to use could not be used, so InterProScan will calculate the results locally on your system instead. In this situation InterProScan will continue to run, but this is likely to result in slower performance than normal.
This warning could occur because the lookup service your installation of InterProScan is configured to use is either:
Not (or no longer) compatible with your version of InterProScan.
Not accessible through your internet, proxy or firewall system configuration.
Temporarily down.
See more information about the lookup service to understand what it does and how to configure it.
How is InterProScan 5 different from InterProScan 4? How do I migrate?¶
InterProScan 4 is obsolete. If you are still using InterProScan 4 then we recommend you send us a support request as soon as possible.
InterProScan 5 differs from InterProScan v4.x in the following ways:
New analysis type: Phobius for transmembrane and signal peptide prediction
New feature: ability to map InterPro results back to the original nucleotide sequences that were submitted
New feature: option to look up biological pathways that the protein is potentially involved in
New output formats: “IMPACT” XML format and GFF3.0
InterProScan 4.8 is no longer supported or updated. For more details on how to migrate to InterProScan 5 send us a support request.
Installing and compiling binaries used in InterProScan¶
The binaries that we distribute with InterProScan should work on most Linux systems. However, in some cases they may not work on a particular system. If you are trying to run InterProScan and you get an error, then you may need to compile the binary causing the error on your own system in order for it to work.
Once a binary has been compiled you can either:
Replace the binary in the relevant bin subdirectory of your InterProScan installation with your newly compiled version
Or update the location of the binary in your interproscan.properties configuration to point to your newly compiled version
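If you take the second route, the property can be edited by hand or with a small script. The sketch below is one way to do it; set_property is a hypothetical helper name, and it assumes the property is written as key=value with no spaces around the equals sign. It keeps a backup copy of the file before changing it.

```shell
# Hypothetical helper: set a key=value line in a properties file.
# $1 = properties file, $2 = property name, $3 = new value.
# Replaces the existing line for the key, or appends one if the key is absent.
set_property() {
  cp "$1" "$1.bak"
  awk -v key="$2" -v val="$3" '
    BEGIN { FS = "=" }
    $1 == key { print key "=" val; found = 1; next }
    { print }
    END { if (!found) print key "=" val }
  ' "$1.bak" > "$1"
}

# Example use, pointing InterProScan at a locally compiled Coils binary
# (the /opt path is only an illustration):
#   set_property interproscan.properties binary.coils.path /opt/ncoils/ncoils
```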
InterProScan is designed to work with the same binary versions as used by the supported member database analyses. Therefore it is important to use the binary version numbers listed below, see the FAQ for more information.
cath-resolve-hits is a tool written in C/C++ and is used as part of the post-processing for CATH-Gene3D. The binary bundled with InterProScan should work on most systems. If you get errors, download cath-resolve-hits v0.15.2 that corresponds to your system from the following page https://github.com/UCLOrengoGroup/cath-tools/releases/tag/v0.16.10 into bin/gene3d/4.3.0/ and rename it to bin/gene3d/4.3.0/cath-resolve-hits.
If the precompiled binary doesn’t solve your problems, compile the binary for your system by following the instructions at http://cath-tools.readthedocs.io/en/latest/build/
Then either replace the relevant binary with your new one or update the relevant interproscan.properties values to point at the new file location. The default property values are:
cath.resolve.hits.path=bin/gene3d/4.3.0/cath-resolve-hits
The pfscanV3 and pfsearchV3 binary packages for your platform can be downloaded from https://ftp.expasy.org/databases/prosite/ps_scan/.
Alternatively, you may download the source code from https://github.com/sib-swiss/pftools3 and compile the binaries yourself.
Then either replace the relevant files with your new ones or update the relevant interproscan.properties values to point at the new file locations:
binary.prosite.pfscanv3.path=bin/prosite/pfscan
binary.prosite.pfsearchv3.path=bin/prosite/pfsearch
wget ftp://selab.janelia.org/pub/software/hmmer/2.3.2/hmmer-2.3.2.tar.gz
tar -xzvf hmmer-2.3.2.tar.gz
cd hmmer-2.3.2
./configure --enable-threads
make
make check
make install
Then either replace the relevant binary with your new one or update the relevant interproscan.properties values to point at the new file location. The default property values are:
binary.hmmer2.hmmsearch.path=bin/hmmer/hmmer2/2.3.2/hmmsearch
binary.hmmer2.hmmpfam.path=bin/hmmer/hmmer2/2.3.2/hmmpfam
Instructions for downloading and compiling Hmmer 3.1b1 can be found at: http://hmmer.org/download.html
Then either replace the relevant binary with your new one or update the relevant interproscan.properties values to point at the new file location. The default property values are:
binary.hmmer3.path=bin/hmmer/hmmer3/3.1b1
binary.hmmer3.hmmscan.path=bin/hmmer/hmmer3/3.1b1/hmmscan
binary.hmmer3.hmmsearch.path=bin/hmmer/hmmer3/3.1b1/hmmsearch
If you get Coils (ncoils) errors, you may need to compile the Coils binary yourself, which is straightforward:
cd src/coils/ncoils/2.2.1
make
cd ../../../..
cp src/coils/ncoils/2.2.1/ncoils bin/ncoils/2.2.1/ncoils
The steps above normally solve the problem.
Instructions for compiling the “ncoils” binary can also be found in the src/coils/ncoils/2.2.1/README file in your extracted InterProScan 5 distribution (release 5.17-56.0 onwards).
Then either replace the relevant binary with your new one or update the relevant interproscan.properties values to point at the new file location. The default property values are:
binary.coils.path=bin/ncoils/2.2.1/ncoils
Instructions for compiling the “fingerPRINTScan” binary can be found in the src/prints/fingerprintscan/3597/INSTALL file in your extracted InterProScan 5 distribution (release 5.17-56.0 onwards) and are summarised as below:
cd src/prints/fingerprintscan/3597/
./configure
make
cd _interproscan_dir
cp src/prints/fingerprintscan/3597/fingerPRINTScan bin/prints/
where “_interproscan_dir” is the directory where you have installed InterProScan 5.
If you choose not to replace the relevant binary with your new one then instead you can update the relevant interproscan.properties values to point at the new file location. The default property values are:
binary.fingerprintscan.path=bin/prints/fingerPRINTScan
CDD uses two separate applications from NCBI for analysis in InterProScan. If the rpsblast and rpsbproc applications provided with InterProScan are not working for you,
download rpsblast/rpsbproc from NCBI (https://blast.ncbi.nlm.nih.gov/Blast.cgi):
rpsblast is part of the main BLAST package, so download https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.11.0+-x64-linux.tar.gz and look for rpsblast after uncompressing the tar file.
rpsbproc is available from ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/rpsbproc/
If the downloaded binaries do not work, you will have to compile them for your system.
We are working on a summary of how to compile rpsblast/rpsbproc for the latest BLAST release, ncbi-blast-2.11.0+.
For the older release ncbi-blast-2.6.0+, the instructions are given below. They could be adapted to work for ncbi-blast-2.11.0+.
Instructions on how to compile rpsblast/rpsbproc for interproscan are summarised as follows:
First check the c++ compiler version
c++ --version
If the C++ compiler version is less than 4.8, compilation will most likely fail and you should upgrade to a C++ compiler version 4.8 or above.
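The version comparison can be scripted rather than read by eye. This is a sketch only: version_ge is a hypothetical helper name, it relies on the -V (version sort) option of GNU sort, and the usage comment assumes a GCC-style compiler that supports -dumpversion.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to version $2.
# Relies on version sorting (sort -V), as provided by GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example use:
#   if version_ge "$(c++ -dumpversion)" 4.8; then
#     echo "C++ compiler is recent enough"
#   else
#     echo "C++ compiler is older than 4.8, please upgrade before compiling"
#   fi
```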
If you have a c++ version 4.8 or above then follow the instructions below.
mkdir cddblast
cd cddblast
wget ftp://ftp.ncbi.nih.gov/blast/executables/blast+/2.6.0/ncbi-blast-2.6.0+-src.tar.gz
wget ftp://ftp.ncbi.nih.gov/blast/executables/blast+/2.6.0/ncbi-blast-2.6.0+-src.tar.gz.md5
md5sum -c ncbi-blast-2.6.0+-src.tar.gz.md5
# Above command should return "ncbi-blast-2.6.0+-src.tar.gz: OK" if download successful
tar xvzf ncbi-blast-2.6.0+-src.tar.gz
cd ncbi-blast-2.6.0+-src/c++/src/app/
wget -r --no-parent -l 1 -np -nd -nH -P rpsbproc ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/rpsbproc/rpsbproc-src/
#edit Makefile.in and make sure SUB_PROJ is assigned two applications as follows: SUB_PROJ = blast rpsbproc
cd ../../
./configure
/usr/bin/make
#after compilation is complete
cp ReleaseMT/bin/rpsblast <interproscan_install_dir>/bin/blast/ncbi-blast-2.6.0+/
cp ReleaseMT/bin/rpsbproc <interproscan_install_dir>/bin/blast/ncbi-blast-2.6.0+/
The complete instruction set can be found here: ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/rpsbproc/README
If you choose not to replace the relevant binary with your new one then instead you can update the relevant interproscan.properties values to point at the new file location. The default property values are:
binary.rpsblast.path=bin/blast/ncbi-blast-2.6.0+/rpsblast
binary.rpsbproc.path=bin/blast/ncbi-blast-2.6.0+/rpsbproc
Instructions for compiling the “sfld_preprocess” and “sfld_postprocess” binaries can be found in the src/sfld/1/README file in your extracted InterProScan 5 distribution (release 5.22-61.0 onwards).
Then either replace the relevant binary with your new one or update the relevant interproscan.properties values to point at the new file location. The default property values are:
sfld.postprocess.command=bin/sfld/sfld_postprocess
By default the Phobius, SignalP and TMHMM member database analyses are deactivated because they contain licensed components. For instructions on how to activate these analyses, obtain the relevant licenses and compile the binaries please see “activating licensed analyses”.
Configuration Options¶
This page will give you an overview and a detailed description about some of the available configuration options in your InterProScan 5 properties file (interproscan.properties).
Option | Description | Default setting |
---|---|---|
Precalculated match lookup and proxy setup | | |
precalculated.match.lookup.service.proxy.host | Host name of your proxy (e.g. http://proxy.example.ebi.ac.uk). You would need to set that option if the pre-calculated match lookup service is enabled and you have a proxy (communication layer) between you and the world wide web. Please note user proxy-authentication is not supported at the moment. | Not set |
precalculated.match.lookup.service.proxy.port | Open port of your proxy (e.g. 8080) | Not set |
precalculated.match.lookup.service.url | Web address of the precalculated match lookup service. Used if the pre-calculated match lookup service is enabled. You would only want to change that if you have installed a local version of the lookup service | http://www.ebi.ac.uk/interpro/match-lookup |
Other properties | | |
exclude.sites.from.output | Calculate residue level annotation and include in the output where available? | false |
Cluster mode benchmark run¶
We have run InterProScan 5 (I5) in CLUSTER mode against a complete Escherichia coli proteome to give you some benchmark figures in terms of analysis runtime.
This page can serve as a reference point for runtime, and also shows how to set up I5 for improved speed.
How was the set of input sequences assembled for this run?¶
For this run we decided to run I5 against the complete proteome of Escherichia coli (Taxon 83333). We downloaded the proteome from the Reference proteomes website (RELEASE 2014_04).
RELEASE 2014_04 is based on UniProt Release 2014_04, Ensembl release 75 and Ensembl Genome release 21. The E. coli proteome for this release contains 4303 protein sequences. You can download the sequence file here.
Which I5 command was used for this run?¶
We switched off the pre-calculated match lookup service and turned on the CLUSTER mode.
./interproscan.sh \
  -i 83333.fasta \
  -dp \
  -f tsv,html \
  --goterms \
  -mode cluster \
  -clusterrunid benchmark-5.7-48.0
What does the interproscan.properties file look like?¶
The default settings are very conservative. To speed up the CLUSTER mode we’ve added or changed the values of the following 6 attributes within the DEFAULT interproscan.properties file:
grid.throttle=false
master.steps.to.consumer.ratio=1
steps.to.consumer.ratio=1
max.tier.depth=2
thinmaster.number.of.embedded.workers=5
thinmaster.maxnumber.of.embedded.workers=5
The full setting file can be found here.
On which cluster/farm did we run I5?¶
We ran I5 on our internal LSF cluster. As of 1st of May 2014, there are approximately 680 nodes comprising 16,000 hyper-threaded CPU cores.
 | Run 1 | Run 2 | Run 3 |
---|---|---|---|
Wall clock time | 1h 37min | N/A | N/A |
Max number of workers | 45 | N/A | N/A |
Contact us¶
For further assistance with installing and using InterProScan, please reach out to us through our help desk or create an issue on our GitHub repository.