Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that requires full-text search, especially cross-platform ones. Lucene is now considered one of the most successful open source tools and is widely regarded as the reference for search engine systems.
Solr is a high-performance search server built using Lucene Java, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface. We recently published a review of the book Solr 1.4 Enterprise Search Server from Packt Publishing.
Nutch is open source web-search software. It builds on Apache Hadoop and Solr, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats. We have contributed a lot to Nutch in recent years, and Julien has recently been elected Nutch project leader. We provide competitive monitoring and hosting solutions for Nutch; please contact us if you are interested.
GATE is one of the most widely used human language processing systems in the world. It is developed and maintained at the University of Sheffield in the Natural Language Processing group. It provides ways to get structured information from unstructured textual data and is the perfect complement to IR tools such as Lucene.
UIMA is an Apache project inherited from IBM, which is comparable to GATE. It is geared towards multimodal analysis, scalability and interoperability.
Over the years we have actively contributed to some of the projects above and combined them to build bespoke solutions on numerous occasions. We have a strong focus on very large scale processing and have developed solutions based on Hadoop and deployed them on Amazon EC2.
Our open source project Behemoth facilitates the deployment of GATE- or UIMA-based applications on a Hadoop cluster. Why not give it a try?