Running InterProScan 5 in Cluster Mode

In “cluster” mode, InterProScan 5 activates a master/worker parallelisation mode that takes advantage of your cluster to distribute the analysis components across its nodes, allowing large jobs to complete faster. The benefits of this mode are seen with larger inputs (approximately >32,000 protein sequences, depending on resources). For smaller inputs the default “standalone” mode (or “singleseq” mode for a single sequence) is still preferable, because of the overhead of initialising InterProScan in cluster mode.

This documentation should be read in conjunction with the information on the page Running InterProScan 5.

Currently we support Load Sharing Facility (LSF) and Sun Grid Engine (SGE), now known as Oracle Grid Engine. InterProScan 5 has been tested on SGE 8.1.2 running 64-bit Linux. However, cluster mode is currently not as fault tolerant as the default “standalone” mode, so we recommend the more stable “standalone” mode where it is sufficient.

You can configure InterProScan 5 to run on other clusters by changing the submission commands below.

Initial Setup

Before running InterProScan 5 in cluster mode, the following configuration must be completed correctly for your cluster setup.

Edit the interproscan.properties file.

Add or modify the properties below appropriately for your cluster.

Note - you must set the submission command including the ‘QUEUE_NAME’ correctly for your LSF, SGE or other cluster.

If you are in any doubt about any of these settings, you should consult the systems administrator who maintains your cluster.

#Specify your cluster (LSF, SGE or any other cluster)
grid.name=lsf
#grid.name=other-cluster

#Java Virtual Machine (JVM) maximum idle time for jobs.
#Default is 180 seconds, if not specified. When reached the worker will shutdown.
jvm.maximum.idle.time.seconds=180

#JVM maximum life time for workers.
#Default is 14400 seconds, if not specified. After this period has passed the worker will shutdown unless it is busy.
jvm.maximum.life.seconds=14400

#Maximum number of jobs per clusterRunId. Default is 3000.
grid.jobs.limit=3000

#commands to start new jvms
worker.command=java -Xms256m -Xmx1024m -jar interproscan-5.jar
worker.high.memory.command=java -Xms256m -Xmx2048m -jar interproscan-5.jar

#directory for any log files generated by InterProScan
log.dir=logs

Cluster submission commands

The following submission command properties should be configured for your cluster. LSF example:

#Grid submission commands (e.g. LSF bsub or SGE qsub) for starting remote workers
#The following 2 commands are used by the master to spawn normal or high memory workers
grid.master.submit.command=bsub -q QUEUE_NAME
grid.master.submit.high.memory.command=bsub -q QUEUE_NAME -M 8192

#The following 2 commands are used by workers to spawn normal or high memory workers
grid.worker.submit.command=bsub -q QUEUE_NAME
grid.worker.submit.high.memory.command=bsub -q QUEUE_NAME -M 8192

#network growth
#If the main/master InterProScan job runs on a submission node and other nodes cannot submit jobs, set max.tier.depth to 1; otherwise it can be greater than 1

max.tier.depth=1

SGE equivalent:

grid.master.submit.command=qsub -cwd -V -b y -N i5t1worker
grid.master.submit.high.memory.command=qsub -cwd -V -b y -N i5t1hmworker

grid.worker.submit.command=qsub -cwd -V -b y -N i5t2worker
grid.worker.submit.high.memory.command=qsub -cwd -V -b y -N i5t2hmworker
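
As with the LSF example, you will usually need to name a queue and may want to request more than one slot; the exact flags are site specific. A minimal sketch, assuming a queue called QUEUE_NAME and a parallel environment called smp (both placeholders that must match names defined on your own cluster):

#QUEUE_NAME and the 'smp' parallel environment below are assumptions - substitute the names configured on your cluster
grid.master.submit.command=qsub -cwd -V -b y -N i5t1worker -q QUEUE_NAME -pe smp 4
grid.worker.submit.command=qsub -cwd -V -b y -N i5t2worker -q QUEUE_NAME -pe smp 4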

We recommend reading the SGE manual (http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html) for details of the different qsub options.

Note: The SGE cluster mode is a new feature that has not been tested extensively, and we would welcome any feedback you may have.

Other clusters

For other clusters, change the submission properties above (grid.master.submit.command, grid.worker.submit.command and their high-memory equivalents) to suit your cluster requirements.
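
As a hypothetical sketch only, where mysubmit and MY_QUEUE are placeholders for your scheduler's submit command and queue name (how the worker JVM command is handed to the scheduler varies, so check the submission behaviour with your cluster administrator):

#mysubmit and MY_QUEUE are hypothetical placeholders, not real commands
grid.name=other-cluster
grid.master.submit.command=mysubmit -q MY_QUEUE
grid.master.submit.high.memory.command=mysubmit -q MY_QUEUE
grid.worker.submit.command=mysubmit -q MY_QUEUE
grid.worker.submit.high.memory.command=mysubmit -q MY_QUEUE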

Master configuration options

If you require that the master InterProScan process does not run any analyses itself but only performs housekeeping, set the following property to false (available from version 5.1-44.0 onwards).

#allow master interproscan to run binaries
master.can.run.binaries=false

Example usage on LSF, SGE and other clusters

To enable InterProScan 5 to “farm out” analysis components on the cluster, run the interproscan.sh script with the -mode cluster switch. This allows the “master” to create child “worker” processes on the cluster that take analysis steps from the master and run them remotely.

As an example:

./interproscan.sh -mode cluster -clusterrunid uniqueName -i /path/to/sequences.fasta -b /path/to/output_file
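
If your site requires long-running processes to run under the scheduler as well, the master itself can also be submitted as a job. A sketch for LSF, where QUEUE_NAME, the log file names and the paths are placeholders; note that in this case the node the master runs on must itself be able to submit jobs (see max.tier.depth above):

bsub -q QUEUE_NAME -o i5_master.out -e i5_master.err ./interproscan.sh -mode cluster -clusterrunid uniqueName -i /path/to/sequences.fasta -b /path/to/output_file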

Please note that in cases where the main (master) InterProScan JVM dies unexpectedly, you might still see workers running; they will shut down as soon as they reach their maximum idle time.

clusterrunid

--clusterrunid (alias -crid) is a mandatory option that takes an argument.

This can be used for monitoring your distributed jobs within a single run. On LSF clusters, the value of --clusterrunid is passed to LSF as the project name (the -P option).
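
For example, on LSF you could then list all workers belonging to a run with the project filter (a sketch, assuming the run was started with -clusterrunid uniqueName as in the example above):

bjobs -P uniqueName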

In cluster mode InterProScan 5 spawns new “worker” Java processes according to the volume of analysis that needs to be performed.

In-house tested cluster versions

Platform LSF

Version    Result
8.0.1      Tested successfully
9.1.1.1    Tested successfully (1)

(1) From this LSF version on, you have to include the -n option in your bsub command if you want to assign more than 1 CPU to workers (1 CPU is the default value in this version). We strongly recommend doing so, otherwise InterProScan will be much slower in cluster mode. How many CPUs you need to reserve depends on your cluster nodes and your binary CPU settings. If you need help with this, please don’t hesitate to contact us using EMBL-EBI’s support form.
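
For example, the LSF submission commands shown earlier might become the following (a sketch; the slot count of 8 is only an illustration and should be chosen to match your node sizes and binary CPU settings):

grid.master.submit.command=bsub -q QUEUE_NAME -n 8
grid.master.submit.high.memory.command=bsub -q QUEUE_NAME -n 8 -M 8192
grid.worker.submit.command=bsub -q QUEUE_NAME -n 8
grid.worker.submit.high.memory.command=bsub -q QUEUE_NAME -n 8 -M 8192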

SGE

Version    Result
8.1.2      Tested successfully