The HitKeeper Homepage

What is HitKeeper?

query triangle

HitKeeper [1] is a collection of software tools, mainly written in Perl/SQL, intended to help bioinformatic researchers with "friendly" tools. It allows fully automatic handling of multiple databases on a large scale, calculation of "hits", and querying.

Status: Production!

HitKeeper is developed at the Swiss Institute of Bioinformatics. It forms the "back end" of the MyHits website [2], which is in turn an extension to the database Hits [3]. - The metamotif query language is based on experience with mmsearch [4].

MyHits is in full production status since 2003 and currently handles more than 7 mio sequences (with weekly updates!), and 20 mio hits on these. With regard to these data, HitKeeper can be considered as robust and scalable.

Further publications related to HitKeeper are given below.

Features

The software is originally designed for the investigation of the relationships between protein sequences and motifs defined (or predicted) on them. These "motifs" are defined by a (heterogeneous) collection of "predictors", which include RegEx, generalized profiles, and hidden Markov models. In contrast to other projects with similar scope, HitKeeper attempts to provide a generic solution to the handling of redundancy and incremental update of biological databases. It also promotes the development of original query tools.

Some particular features are:

HitKeeper integrates sequence (e.g. peptide) and motif (e.g. pattern, HMM) and classification (e.g. taxonomy) databases, and controls the automatic computation of "hits" between them.
It implements an automatic mechanism to incorporate the frequent updates ("continuous data flow") of public databases, and to keep databases and computational results synchronised.
It uses an algorithm that avoids unnecessary and time-consuming recalculations: only data that have not been calculated before, or where the entries have been modified, are recomputed.
Once updated, HitKeeper can automagically deploy these "unified" databases to other computing services on a site.
HitKeeper implements a query syntax to retrieve information. These queries enable the user to express any constraints for searching proteins, e.g. retrieving sequences that contain specific motifs, or a defined arrangement of motifs ("metamotifs"), or queries based on the classification of sequences.
Modular and extensible structure (e.g. simple integration of custom parsers/data formats).
To interact with other services in the "Bioinformatics world", HitKeeper makes use of Web services.
The documentation is interactive - examples are available online and can be imported directly into the Web interface.

In the long run, the target of HitKeeper is to arrive at a system that is capable of storing biological sequences in the widest sense, including protein, DNA, and 3D-structures. Consequently, large and partially redundant collections of sequences as well as heterogeneous formats are expected -- so the system was designed to be modular and highly flexible.

Authors

The main authors are Marco Pagni and Jörg Hau, with contributions - be it with suggestions, testing, discussions or code - from Dmitry Kuznetsov, Vassilios Ioannidis, Laurent Falquet, Lorenzo "Luli" Cerutti, Monique Zahn-Zabal, Michael Muller, and Victor Jongeneel. Thank you!

License

HitKeeper and all related software are Free Software and are published under Version 2 of the GNU General Public License (GPL). You can redistribute it and/or modify it under the same term, which ensures that its source code is free and that any derivatization, or implementation of it in other software, will also remain free.

The difference between "free software" and "freeware" is of legal importance. If you do not understand any portion of this license, please seek appropriate professional legal advice. If you do not or if - for any reason - you can not accept all of the conditions of the GPL, then you must not use nor distribute this software.

System Requirements

HitKeeper is a collection of scripts - mainly written in Perl - that interact with each other and with an SQL database engine. The system is primarily developed using MySQL, and in parallel it is ported to Oracle. HitKeeper runs on most "unixoid" Platforms (Linux and OSX tested with success).

Detailed instructions for installation, operational qualification testing, and maintenance are provided inside the distribution.

Where to get it

HitKeeper is hosted at sourceforge.net. To download, you have several possibilities:

Visit the Sourceforge project page.
If you want "just" to browse the code and the documentation, use this viewer.
If you want to download everything in one archive file, download a GNU tarball.
If you have a subversion client installed, run svn co https://hitkeeper.svn.sourceforge.net/svnroot/hitkeeper hitkeeper

References and related papers

[1] J. Hau, M. Muller, M. Pagni. HitKeeper, a generic software package for hit list management. Source Code for Biology and Medicine 2 (2007), 2.

[2] M. Pagni, V. Ioannidis, L. Cerutti, M. Zahn-Zabal, C.V. Jongeneel, L. Falquet. MyHits: a new interactive resource for protein annotation and domain identification. Nucleic Acids Res. 32 (2004), W332-335 (Web Server issue).

[3] M. Pagni, C. Iseli, T. Junier, L. Falquet, V. Jongeneel, P. Bucher. TrEST, trGEN and Hits: Access to databases of predicted protein sequences. Nucleic Acids Research 29 (2001), 148-151.

[4] T. Junier, M. Pagni, P. Bucher. mmsearch: a motif arrangement language and search program. Bioinformatics 17 (2001), 1234-1235.

[5] M. Pagni, J. Hau, H. Stockinger. A Multi-protocol Bioinformatics Web Service: Use SOAP, Take a REST or Go with HTML. Eighth IEEE International Symposium on Cluster Computing and the Grid (ccgrid) (2008), 728-734.

[6] M. Pagni, V. Ioannidis, L. Cerutti, M. Zahn-Zabal, C. V. Jongeneel, J. Hau, O. Martin, D. Kuznetsov, L. Falquet. MyHits: improvements to an interactive resource for analyzing protein sequences. Nucleic Acids Res. 35 (2007), W433-W437.