DigitalPebble

DigitalPebble is a consulting company specialised in linguistic engineering, document management, information retrieval and information extraction. Our expertise is based on open source solutions such as Lucene and GATE.

Text Classification is the automatic assignment of a label to a span of text, be it an entire document, a paragraph or a sentence.

This is usually achieved with Machine Learning techniques, which require a collection of labelled documents but no human intervention for coding rules or heuristics. Once a Machine Learning algorithm has generated a model from the training data, that model can be used to classify new unlabelled documents automatically.
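To make the train-then-classify cycle concrete, here is a minimal sketch in Java. It is not the DigitalPebble API: the class name `TinyNaiveBayes`, its methods and the toy training data are all hypothetical, and the algorithm (multinomial Naive Bayes with add-one smoothing) is only one of many that could be used.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative multinomial Naive Bayes classifier (hypothetical, not the DigitalPebble API). */
class TinyNaiveBayes {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^\\p{L}\\p{Nd}]+");
    }

    /** Adds one labelled document to the training data. */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue; // split() can yield a leading empty string
            counts.merge(token, 1, Integer::sum);
            tokensPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the most probable label for the text, or null if nothing was trained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // Log prior: share of training documents carrying this label.
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = tokenCounts.get(label);
            int labelTokens = tokensPerLabel.getOrDefault(label, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                int c = counts.getOrDefault(token, 0);
                // Add-one (Laplace) smoothing so unseen tokens do not zero the score.
                score += Math.log((c + 1.0) / (labelTokens + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes model = new TinyNaiveBayes();
        model.train("junk", "cheap pills buy now");
        model.train("junk", "buy cheap watches now");
        model.train("relevant", "meeting agenda for monday");
        model.train("relevant", "project report attached");
        System.out.println(model.classify("buy cheap pills"));
    }
}
```

The only human input is the set of labelled examples passed to `train`; no rules are written by hand, and `classify` then labels text the model has never seen.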

Text Classification can be used to filter a collection of documents. Consider an Information Retrieval system based, for instance, on Lucene, Solr or Nutch. A filtering step would automatically detect documents you do not want to present to your users, for example because their content is unsuitable. Filtering also improves the quality of the results by removing from the index documents which might rank highly but are irrelevant, such as junk pages.
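A filtering step of this kind can be sketched as a function applied to documents before they reach the index. The snippet below is a hypothetical illustration: `IndexFilter`, `keepIndexable` and the label names are assumptions, the classifier is passed in as a plain function from text to label, and no Lucene, Solr or Nutch API is used.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Illustrative pre-indexing filter (hypothetical, independent of any search framework). */
class IndexFilter {

    /**
     * Returns the documents whose predicted label is not in rejectedLabels.
     * Documents for which the classifier returns null are kept.
     */
    static List<String> keepIndexable(List<String> documents,
                                      Function<String, String> classifier,
                                      Set<String> rejectedLabels) {
        List<String> kept = new ArrayList<>();
        for (String doc : documents) {
            String label = classifier.apply(doc);
            boolean rejected = label != null && rejectedLabels.contains(label);
            if (!rejected) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in classifier for the demo: a real one would be a trained model.
        Function<String, String> classifier =
                text -> text.toLowerCase().contains("buy now") ? "junk" : "relevant";

        List<String> docs = new ArrayList<>();
        docs.add("Buy now, limited offer");
        docs.add("Quarterly project report");

        Set<String> rejected = new HashSet<>();
        rejected.add("junk");

        System.out.println(keepIndexable(docs, classifier, rejected));
    }
}
```

Keeping the classifier behind a simple function means the same filter could be called from a crawler, an indexing pipeline or a batch clean-up job without changes.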

DigitalPebble has developed an efficient solution for Text Classification which is based on open source Machine Learning implementations. This tool is also available as a GATE plugin, which allows it to be combined with Natural Language Processing modules (stemmer, part-of-speech tagger, etc.).

Our Text Classification API can be used in other frameworks such as UIMA or embedded directly in any existing Java application.

An overview of the tool can be found here:  (pdf)

The API is an open source component released under the Apache License and is available from our Resources page.