Text Classification is the automatic assignment of a label to a span of text, be it an entire document, a paragraph or a sentence. This is usually achieved with Machine Learning techniques, which require a collection of labelled documents but no human intervention for coding rules or heuristics. Once a Machine Learning algorithm has generated a model from the training data, this model can be used to classify new unlabelled documents automatically.

Text Classification can be used for filtering a collection of documents. Consider an Information Retrieval system based, for instance, on Lucene, Solr or Nutch. A filtering functionality would allow automatic detection of documents you do not want to present to your users, for example because their content is not suitable. Filtering the data also improves the quality of the results by removing documents from the index which might rank highly but be irrelevant, such as junk pages.

DigitalPebble has developed an efficient solution for Text Classification based on open source Machine Learning implementations. The tool is also available as a GATE plugin, which enables its combination with Natural Language Processing modules (stemmer, part-of-speech tagger, etc.). Our Text Classification API can be used in other frameworks such as UIMA or embedded directly in any existing Java application.

An overview of the tool can be found here: (pdf). The API is an Open Source component under the Apache License and is available from our Resources page.
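
To make the train-then-classify workflow and the filtering use case more concrete, here is a minimal sketch in Python. It is not the DigitalPebble API: the `NaiveBayesTextClassifier` class, the `tokenize` helper, the `junk` and `ok` labels and the sample documents are all hypothetical, and a toy Naive Bayes model with add-one smoothing stands in for whatever algorithm a real system would use.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Toy multinomial Naive Bayes classifier (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocabulary = set()

    def train(self, labelled_documents):
        """Build the model from (text, label) pairs."""
        for text, label in labelled_documents:
            self.label_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)

    def classify(self, text):
        """Return the most probable label for an unlabelled text."""
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(doc_count / total_docs)
            total_tokens = sum(self.word_counts[label].values())
            for token in tokenize(text):
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled collection used as training data.
training_data = [
    ("cheap pills buy now limited offer", "junk"),
    ("win money now click here", "junk"),
    ("lucene index search tutorial", "ok"),
    ("solr query ranking explained", "ok"),
]
classifier = NaiveBayesTextClassifier()
classifier.train(training_data)

# Filtering step: keep only documents not classified as junk.
documents = [
    "click here to win cheap pills",
    "tutorial on search ranking with solr",
]
kept = [doc for doc in documents if classifier.classify(doc) != "junk"]
```

In a search setting, a filter like this would typically run before documents are added to the index, so that junk pages never compete in the ranking.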