Precalculated match lookup

InterProScan uses a lookup service to check whether a protein submitted to it has been encountered before and, therefore, whether precalculated matches are available for it (see “How to Run” in the User documentation). This generic mechanism is based on a REST web service that retrieves data from a BerkeleyDB database.

The client for this service is built into the InterProScan software, so that InterProScan can look up results from the web service. The web service itself can be installed and run “out of the box” using Jetty.

The service supports two simple queries:

  • “Do these sequences need to be analysed?” This query returns the protein sequences that have not been analysed previously. A protein counts as previously analysed even if it has no matches.

    • Input: Set of protein sequence MD5 checksums

    • Output: MD5 checksums of proteins that have not been analysed previously

  • “What are the matches for these sequences?”

    • Input: Set of protein sequence MD5 checksums

    • Output: Simple “BerkeleyMatchXML” document containing all matches.

Both of these queries are used in InterProScan; the first ensures that protein sequences with no matches are not re-analysed needlessly.
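The two queries are keyed on MD5 checksums of the protein sequences, so it may help to see how such a key can be derived and how the first query behaves as a set operation. The following is a minimal sketch, not the InterProScan implementation: the class name LookupQuerySketch, the sequence normalisation (whitespace removed, upper-cased) and the upper-case hexadecimal encoding are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; names and normalisation rules are illustrative assumptions.
final class LookupQuerySketch {

    // Returns the MD5 checksum of a protein sequence as 32 upper-case hex characters.
    // ASSUMPTION: the sequence is stripped of whitespace and upper-cased before hashing.
    static String md5Hex(String sequence) {
        String normalised = sequence.replaceAll("\\s+", "").toUpperCase(Locale.ROOT);
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] hash = digest.digest(normalised.getBytes(StandardCharsets.US_ASCII));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02X", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be provided by every Java platform implementation.
            throw new IllegalStateException("MD5 digest unavailable", e);
        }
    }

    // Models the first query: of the submitted checksums, return those that are
    // not in the set of previously analysed proteins. A protein with no matches
    // is still "analysed", so it must be present in alreadyAnalysed.
    static Set<String> needAnalysis(Set<String> submitted, Set<String> alreadyAnalysed) {
        Set<String> result = new LinkedHashSet<>(submitted);
        result.removeAll(alreadyAnalysed);
        return result;
    }
}
```

Because the key is a checksum of the sequence rather than an identifier, two submissions of the same sequence under different names map to the same lookup entry.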

Incorporation into InterProScan

InterProScan hooks into this service from the ProteinLoader class. A BerkeleyPrecalculatedProteinLookup, which implements the PrecalculatedProteinLookup interface, is injected into the ProteinLoader.

A MatchHttpClient instance is injected into the BerkeleyPrecalculatedProteinLookup and is used to query the web service. The URL of the web service is set on the client from properties, so that users who wish to install the web service locally can point InterProScan at their own instance.
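As an illustration of property-driven configuration, the sketch below reads a service URL from a java.util.Properties object. It is an assumption-laden example rather than the actual wiring: the class name LookupUrlConfigSketch, the property key shown, the example URL, and the convention that a missing or blank value means "no lookup service" are all illustrative and should be checked against the properties file shipped with your InterProScan version.

```java
import java.net.URI;
import java.util.Optional;
import java.util.Properties;

// Hypothetical configuration helper; not part of InterProScan.
final class LookupUrlConfigSketch {

    // ASSUMPTION: illustrative property key; verify against your own configuration.
    static final String URL_KEY = "precalculated.match.lookup.service.url";

    // Returns the configured lookup URL, or empty if the property is absent or blank.
    static Optional<URI> lookupUrl(Properties props) {
        String raw = props.getProperty(URL_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(URI.create(raw.trim()));
    }
}
```

A local installation would then be selected by setting the property to the address of the locally running service, for example a hypothetical http://localhost:8080/lookup.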

The BerkeleyPrecalculatedProteinLookup then uses the client to query for pre-calculated matches and for proteins that have been previously analysed. It returns complete InterProScan Protein objects, each with a set of Matches, to the ProteinLoader instance. The ProteinLoader then persists these matches and ensures that the Protein objects concerned are not scheduled for reanalysis.
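The flow described above can be sketched as a partitioning step: ask which checksums still need analysis, then fetch matches for the rest. This is a simplified model under stated assumptions, not the real classes. LookupFlowSketch, LookupClient and Partition are hypothetical names, LookupClient is only a stand-in for the role played by MatchHttpClient (its methods are not that class's API), and matches are represented as plain strings rather than InterProScan Match objects.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the lookup flow; none of these types exist in InterProScan.
final class LookupFlowSketch {

    // Stand-in for the web-service client (assumed interface, for illustration only).
    interface LookupClient {
        // First query: checksums that have NOT been analysed previously.
        Set<String> notYetAnalysed(Set<String> md5s);

        // Second query: matches keyed by checksum; proteins without matches may be absent.
        Map<String, List<String>> matchesFor(Set<String> md5s);
    }

    // Outcome of the lookup: what can be persisted directly and what must be analysed.
    static final class Partition {
        final Map<String, List<String>> precalculated = new LinkedHashMap<>();
        final Set<String> toAnalyse = new LinkedHashSet<>();
    }

    static Partition partition(Set<String> md5s, LookupClient client) {
        Partition result = new Partition();
        result.toAnalyse.addAll(client.notYetAnalysed(md5s));

        // Everything else has been analysed before, possibly with zero matches.
        Set<String> known = new LinkedHashSet<>(md5s);
        known.removeAll(result.toAnalyse);

        if (!known.isEmpty()) {
            Map<String, List<String>> matches = client.matchesFor(known);
            for (String md5 : known) {
                // Keep proteins with no matches, so that they are not re-analysed.
                result.precalculated.put(md5, matches.getOrDefault(md5, Collections.emptyList()));
            }
        }
        return result;
    }
}
```

Keeping previously analysed proteins with an empty match list in the precalculated map mirrors the point made earlier: a protein with no matches still counts as analysed and should not be scheduled again.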