Nucleic acid sequences scan
The Open Reading Frame prediction tool
InterProScan 5 takes advantage of the Open Reading Frame (ORF) prediction tool Emboss getorf. The getorf application itself and all of its dependencies are integrated in InterProScan. You do not need to install the Emboss package on your own, but you may use a local installation if you wish.
If you want to use a local installation, you must edit the interproscan.sh script. This script sets two environment variables for Emboss getorf. Set these to the correct paths for your installation of Emboss.
# set environment variables for getorf
In addition, open and edit your properties file (interproscan.properties), which you will find in your InterProScan root directory. Search for the property ‘binary.getorf.path’ and change its value to the path of your local getorf binary.
How can I scan nucleic acid sequences in InterProScan 5?
./interproscan.sh -t n -i /path/to/nucleic_acid_sequences.fasta
Alternatively, translate the sequences yourself and then run InterProScan on the translated output:
# translate the nucleic acid sequences
./bin/nucleotides/translate -i /path/to/nucleic_acid_sequences.fasta -o /path/to/output_orfs_sequences.fasta
# if output_orfs_sequences.fasta contains more than 32,000 sequences, split it into chunks and send each chunk to InterProScan
# run InterProScan on the translated output
./interproscan.sh -i /path/to/output_orfs_sequences.fasta
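The chunking step mentioned in the comments above can be sketched with a small awk-based helper. This is a minimal sketch rather than part of InterProScan: the function name split_fasta and the chunk file naming scheme are hypothetical, and it assumes every record starts with a ">" header line.

```shell
# split_fasta <input.fasta> <max_sequences_per_chunk> <output_prefix>
# Hypothetical helper: writes <output_prefix>_0001.fasta, _0002.fasta, ...
# with at most <max_sequences_per_chunk> FASTA records in each file.
split_fasta() {
    awk -v max="$2" -v prefix="$3" '
        /^>/ {
            # Open a new chunk file every `max` header lines.
            if (count % max == 0) {
                if (out != "") close(out)
                chunk++
                out = sprintf("%s_%04d.fasta", prefix, chunk)
            }
            count++
        }
        # Lines before the first header are skipped.
        out != "" { print > out }
    ' "$1"
}

# Example usage (paths are placeholders):
#   split_fasta /path/to/output_orfs_sequences.fasta 32000 /path/to/chunk
#   for f in /path/to/chunk_*.fasta; do ./interproscan.sh -i "$f"; done
```

Each chunk is then analysed by a separate InterProScan run, so the results end up in one output file per chunk.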
Which output formats are supported?
Supported output formats are GFF3 and XML, which allow you to trace back from the match to the position inside your nucleic acid sequence. Please note that the TSV format is not available for nucleic acid sequence analysis.
Redundant sequences and identifiers in your FASTA file
InterProScan 5 is able to handle FASTA file entries with the same sequence but different identifiers. For instance, suppose your input file contains the following two sequences:
InterProScan 5 will condense these into a single sequence with two identifier cross-references in the XML output file:
<orf strand="SENSE" start="1" end="3">
and in the GFF3 output:
##sequence-region sequence_1|sequence_2 1 3
sequence_1|sequence_2 provided_by_user nucleic_acid 1 3
Entries with the same identifier and the same sequence will be merged into one.
Please note: non-unique identifiers are not supported. InterProScan 5 will exit (with exit code 0) and will print out a list of all non-unique identifiers.
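To catch repeated identifiers before starting a run, a quick pre-check along the following lines may help. It is only a sketch: the function name duplicate_ids is hypothetical, and it assumes the identifier is the first whitespace-delimited token after ">", which may differ from how InterProScan itself parses headers. It also reports identifiers whose repeated entries carry identical sequences, so it is stricter than the rules described above.

```shell
# duplicate_ids <input.fasta>
# Hypothetical helper: prints each identifier that occurs more than once.
# Assumption: the identifier is the first token on the header line, without ">".
duplicate_ids() {
    awk '/^>/ {
        id = substr($1, 2)
        # Print an identifier once, the second time it is seen.
        if (++seen[id] == 2) print id
    }' "$1"
}

# Example usage (path is a placeholder):
#   duplicate_ids /path/to/nucleic_acid_sequences.fasta
```

An empty result means no identifier was repeated under that assumption.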
InterProScan does not select a single best ORF from the getorf output. Instead, it takes the N longest ORFs generated for each sequence and submits them for analysis. N is set by the binary.getorf.parser.filtersize property described below; the default is 8. As a result, analysing nucleotide sequences can take much longer than analysing protein sequences.
To improve InterProScan performance when processing large nucleotide input files (> 10,000 sequences) you can:
First use an external program to translate your input. This is the best approach. There are various options, one of which is transeq (http://emboss.open-bio.org/rel/rel6/apps/transeq.html) from Emboss. If you use transeq, please use the -clean option to change STOP codon positions from ‘*’ to ‘X’, because InterProScan does not accept sequences containing the ‘*’ character.
Chunk the input and then send the chunks to InterProScan. For tips on configuring the general InterProScan CPU usage see also improving performance.
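The external translation option above could look roughly like this. It is a sketch that assumes a local Emboss installation with transeq on the PATH. The helper names translate_clean and count_stop_lines are hypothetical, and the choice of -frame 6 (all six reading frames) is an assumption, not something InterProScan prescribes. The second helper simply checks that no ‘*’ characters remain in the translated sequences.

```shell
# translate_clean <nucleotides.fasta> <proteins.fasta>
# Hypothetical wrapper around Emboss transeq.
# -frame 6 translates all six frames (an assumption; adjust to your data).
# -clean writes STOP codons as X instead of *.
translate_clean() {
    transeq -sequence "$1" -outseq "$2" -frame 6 -clean
}

# count_stop_lines <proteins.fasta>
# Prints the number of sequence (non-header) lines that still contain "*".
count_stop_lines() {
    grep -v '^>' "$1" | grep -c '[*]' || true
}

# Example usage (paths are placeholders):
#   translate_clean /path/to/nucleic_acid_sequences.fasta /path/to/translated.fasta
#   count_stop_lines /path/to/translated.fasta
#   ./interproscan.sh -i /path/to/translated.fasta
```

If count_stop_lines prints anything other than 0, the file still contains characters that InterProScan will reject.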
Selecting the ORFs to analyse
For improved performance, InterProScan will select the longest 8 ORFs predicted for each nucleic acid sequence. This can be changed using the new “binary.getorf.parser.filtersize” setting in the interproscan.properties file.
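Changing the setting amounts to editing one line in interproscan.properties. As a sketch, the hypothetical helper below rewrites that line, or appends it if it is missing. It assumes the property is written in plain key=value form with no spaces around "=".

```shell
# set_filtersize <properties-file> <n>
# Hypothetical helper: sets binary.getorf.parser.filtersize to <n>.
# Assumption: the property appears as key=value at the start of a line.
set_filtersize() {
    if grep -q '^binary\.getorf\.parser\.filtersize=' "$1"; then
        # Rewrite the existing line via a temporary file (avoids non-portable sed -i).
        sed "s/^binary\.getorf\.parser\.filtersize=.*/binary.getorf.parser.filtersize=$2/" "$1" > "$1.tmp" \
            && mv "$1.tmp" "$1"
    else
        # Property not present yet: append it.
        printf 'binary.getorf.parser.filtersize=%s\n' "$2" >> "$1"
    fi
}

# Example usage (run from the InterProScan root directory):
#   set_filtersize interproscan.properties 4
```

A smaller value means fewer ORFs per nucleotide sequence are analysed, which shortens the run at the cost of possibly missing matches in shorter ORFs.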