Text Classification API
Version 1.4 of our Text Classification API is now available under the Apache Licence and can be used freely.
A (slightly dated) overview can be found here: (pdf). The best reference at this stage is the javadoc included
in the archive or the test classes in the source code.
The project is now hosted on Google Code; questions and contributions can be submitted there.
GATE Toolbox
DigitalPebble's GATE Toolbox is a collection of Processing Resources for GATE. It contains the following components:
SentenceSplitter
Based on JavaCC. It differs from the default GATE component in the following ways:
- speed: 4-5 times faster, and more robust
- autonomy: does not depend on the Tokenizer
- extra parameters: choose between splitting on single or multiple '\n', and optionally find sentences inside an existing annotation (e.g. Paragraph)
- coverage: better recognition of acronyms and abbreviations
Regular Expression Annotator
- takes as input a tab-separated file in which each line holds a pattern and an annotation type (pattern \t annotation type)
- easier to use than JAPE for extracting simple entities
Language Identifier
- identifies 18 languages (da-de-ee-el-en-es-fi-fr-hu-is-it-nl-no-pl-pt-ru-sv-th)
- can be applied to a whole document or to any text covered by an annotation (e.g. Sentence)
Version: 1.1
Download: Toolbox.tar.gz
License: LGPL
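The rule-file format used by the Regular Expression Annotator (one pattern and one annotation type per line, separated by a tab) can be illustrated with a short Java sketch. This is not the Toolbox's code: the class `RegexAnnotatorSketch`, its nested `Rule` and `Annotation` types, and the sample rules are all hypothetical, and the sketch assumes that patterns follow `java.util.regex` syntax and that matches are reported as character offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: these are not the Toolbox's actual classes or API.
class RegexAnnotatorSketch {

    // One rule per line of the rule file: pattern <TAB> annotation type.
    static final class Rule {
        final Pattern pattern;
        final String type;
        Rule(Pattern pattern, String type) { this.pattern = pattern; this.type = type; }
    }

    // A match, expressed as character offsets plus the annotation type.
    static final class Annotation {
        final int start;
        final int end;
        final String type;
        Annotation(int start, int end, String type) {
            this.start = start; this.end = end; this.type = type;
        }
        @Override public String toString() { return type + "[" + start + "," + end + ")"; }
    }

    // Parses rule lines, skipping blank lines and lines without a tab or a type.
    static List<Rule> parseRules(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab <= 0) continue;
            String type = line.substring(tab + 1).trim();
            if (type.isEmpty()) continue;
            rules.add(new Rule(Pattern.compile(line.substring(0, tab)), type));
        }
        return rules;
    }

    // Applies every rule to the text and collects one annotation per match.
    static List<Annotation> annotate(String text, List<Rule> rules) {
        List<Annotation> found = new ArrayList<>();
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(text);
            while (m.find()) {
                found.add(new Annotation(m.start(), m.end(), rule.type));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Illustrative rules only; a real rule file would be read from disk.
        List<Rule> rules = parseRules(Arrays.asList(
                "\\d{4}-\\d{2}-\\d{2}\tDate",
                "[A-Z]{2,}\tAcronym"));
        for (Annotation a : annotate("GATE 1.1 was tagged on 2008-05-14.", rules)) {
            System.out.println(a);
        }
    }
}
```

Keeping each rule on a single line is what makes this kind of file simpler to maintain than a JAPE grammar for simple entities: adding an entity type means adding one line.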
RASP4UIMA
DigitalPebble has ported the RASP system to Apache UIMA.
RASP is a domain-independent, robust parsing system for English. For ease of installation, the system is distributed as binaries for three widespread Unix architectures (Intel 32-bit and 64-bit Linux, and Sparc/Solaris), with source code for most of the modules. It is free for research purposes. RASP was originally developed on a UK EPSRC-funded project; since the end of that project it has continued to be extended and enhanced on an ongoing basis. An informal description of the RASP system is online, with examples of system output. The first public release of the system was in January 2002; the second release (RASPv2) is now available. To obtain it, go to the RASP licence and download page.
RASP4UIMA wraps the NLP modules of RASP (Sentence Parser, Tokenizer, Part of Speech Tagger, Morphological Analyser and Dependency Parser) as UIMA Analysis Engines.
Version: 1.1 beta
Download: rasp4uima1.1.pear
Documentation: (html)
RASP2 plugin for GATE
DigitalPebble has ported the RASP system to GATE. The RASP plugin wraps the NLP modules of RASP (Tokenizer, Part of Speech Tagger, Morphological Analyser and Dependency Parser) as individual GATE Processing Resources, which allows them to be easily replaced or combined with existing GATE PRs.
This component is part of the standard distribution of GATE and is documented here.
Java API for Web-1T Corpus
The Web 1T 5-gram corpus contains n-grams, from unigrams through to 5-grams, compiled from counts on a one-trillion-word corpus. It is distributed by the Linguistic Data Consortium for researchers.
We have developed a Java API for querying the Web 1T corpus (or any corpus in a similar format). Unlike Get1T, our API allows on-the-fly queries of the full set of Web 1T n-grams, even on a machine with modest hardware. The API also helps create n-gram corpora from other sources (Lucene indices, the BNC).
Contact us for more details and the terms of use.
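As a rough illustration of what "a corpus in a similar format" might involve, the Java sketch below counts n-grams from a token sequence, writes them one per line as `n-gram<TAB>count`, and streams through those lines to answer a single query without loading them all into memory. It is not the DigitalPebble API: the class `NGramSketch` and its methods are hypothetical, the tab-separated line layout is an assumption about the data format, and a linear scan is used only for brevity (an index over sorted files would be needed for queries at the scale of the full corpus).

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: names are illustrative, not the DigitalPebble API.
class NGramSketch {

    // Counts all n-grams of order n in a token sequence; TreeMap keeps keys sorted.
    static TreeMap<String, Long> countNGrams(List<String> tokens, int n) {
        TreeMap<String, Long> counts = new TreeMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String gram = String.join(" ", tokens.subList(i, i + n));
            counts.merge(gram, 1L, Long::sum);
        }
        return counts;
    }

    // Serialises counts as one "n-gram<TAB>count" entry per line (assumed layout).
    static String toLines(Map<String, Long> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Streams the lines lazily and returns the count for one n-gram, or 0 if absent.
    static long lookup(BufferedReader reader, String ngram) {
        String prefix = ngram + "\t";
        return reader.lines()
                .filter(line -> line.startsWith(prefix))
                .mapToLong(line -> Long.parseLong(line.substring(prefix.length()).trim()))
                .findFirst()
                .orElse(0L);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "cat", "saw", "the", "cat");
        String lines = toLines(countNGrams(tokens, 2));
        System.out.print(lines);
        long count = lookup(new BufferedReader(new StringReader(lines)), "the cat");
        System.out.println("count(the cat) = " + count);
    }
}
```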