Text Classification API
Version 1.4 of our Text Classification API is now available under the Apache License and can be used freely.
A (slightly dated) overview can be found here: (pdf). The best reference at this stage is the javadoc included
in the archive or the test classes in the source code.
The project is now hosted on GitHub; questions and contributions can be submitted there.
GATE Toolbox
DigitalPebble's GATE Toolbox is a collection of Processing Resources for GATE. It contains the following components:
SentenceSplitter
Based on JavaCC. It differs from the default GATE component in the following ways:
- speed: 4-5 times faster and more robust
- autonomous: independent of the Tokenizer
- extra parameters: choose between splitting on single or multiple '\n', and find sentences inside an existing annotation (e.g. Paragraph)
- coverage: better recognition of acronyms and abbreviations
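The difference between splitting on single and multiple '\n' can be illustrated with a minimal Java sketch. This is not the Toolbox's JavaCC implementation or its API; the class name NewlineSplitSketch, the method splitOnNewlines and the boolean flag are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: hypothetical names, not the Toolbox's SentenceSplitter code.
class NewlineSplitSketch {

    /**
     * Splits text into segments at newline boundaries.
     * singleNewline = true  : every '\n' ends a segment.
     * singleNewline = false : only a run of two or more '\n' ends a segment.
     */
    static List<String> splitOnNewlines(String text, boolean singleNewline) {
        Pattern boundary = Pattern.compile(singleNewline ? "\n+" : "\n{2,}");
        List<String> segments = new ArrayList<>();
        for (String piece : boundary.split(text)) {
            String trimmed = piece.trim();
            // Drop empty pieces produced by leading or trailing newlines.
            if (!trimmed.isEmpty()) {
                segments.add(trimmed);
            }
        }
        return segments;
    }
}
```

With the flag set to false, a title followed by a single line break stays attached to the next line, which is usually what is wanted for hard-wrapped text; with the flag set to true, each line becomes its own segment.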
Regular Expression Annotator
- takes as input a tab-separated file in which each line holds a pattern, a tab (\t) and an annotation type
- easier to use than JAPE for extracting simple entities
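To make the input format concrete, here is a minimal Java sketch that reads "pattern \t annotation type" lines and reports the matching spans. It is not the Toolbox's own code: the class RegexRulesSketch, its methods and the sample patterns are hypothetical. It also assumes that the last tab on a line separates the pattern from the type and that patterns use java.util.regex syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: hypothetical names, not the Toolbox's annotator.
class RegexRulesSketch {

    /** One rule: a compiled pattern and the annotation type it produces. */
    static final class Rule {
        final Pattern pattern;
        final String type;
        Rule(Pattern pattern, String type) { this.pattern = pattern; this.type = type; }
    }

    /** A match, as [start, end) character offsets plus the annotation type. */
    static final class Span {
        final int start;
        final int end;
        final String type;
        Span(int start, int end, String type) { this.start = start; this.end = end; this.type = type; }
    }

    /** Parses lines of the form: pattern TAB annotationType. */
    static List<Rule> parseRules(String tsv) {
        List<Rule> rules = new ArrayList<>();
        for (String line : tsv.split("\\r?\\n")) {
            int tab = line.lastIndexOf('\t');
            if (tab <= 0) {
                continue; // skip blank or malformed lines
            }
            String type = line.substring(tab + 1).trim();
            if (!type.isEmpty()) {
                rules.add(new Rule(Pattern.compile(line.substring(0, tab)), type));
            }
        }
        return rules;
    }

    /** Applies every rule to the text, in rule order, and collects the spans. */
    static List<Span> annotate(String text, List<Rule> rules) {
        List<Span> spans = new ArrayList<>();
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(text);
            while (m.find()) {
                spans.add(new Span(m.start(), m.end(), rule.type));
            }
        }
        return spans;
    }
}
```

In a real GATE Processing Resource each span would presumably become an annotation of the given type on the document; the sketch stops at offsets so that it stays self-contained.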
Language Identifier
- identifies 18 languages (da-de-ee-el-en-es-fi-fr-hu-is-it-nl-no-pl-pt-ru-sv-th)
- can be applied to a whole document or to any text covered by an annotation (e.g. Sentence)
Version: 1.1
Download: Toolbox.tar.gz
License: LGPL
RASP4UIMA
DigitalPebble has ported the RASP system to Apache UIMA.
RASP is a domain-independent, robust parsing system for English. For ease of installation, the system is distributed as binaries for three widespread Unix architectures (Intel 32-bit and 64-bit Linux, and Sparc/Solaris), with source code for most of the modules. It is free for research purposes. RASP was originally developed on a UK EPSRC-funded project, and it has continued to be extended and enhanced since that project ended. An informal description of the RASP system is online, with examples of system output. The first public release of the system was in January 2002; the second release (RASPv2) is now available. To obtain it, go to the RASP licence and download page.
RASP4UIMA wraps the NLP modules of RASP (Sentence Parser, Tokenizer, Part of Speech Tagger, Morphological Analyser and Dependency Parser) as UIMA Analysis Engines.
Version: 1.1 beta
Download: rasp4uima1.1.pear
Documentation: (html)
RASP2 plugin for GATE
DigitalPebble has ported the RASP system to GATE. The RASP plugin wraps the NLP modules of RASP (Tokenizer, Part of Speech Tagger, Morphological Analyser and Dependency Parser) as individual GATE Processing Resources, which allows them to be easily replaced or combined with existing GATE PRs.
This component is part of the standard distribution of GATE and is documented here.