digitalpebble

DigitalPebble is a consulting company specialising in linguistic engineering, document management, and information retrieval and extraction. Our expertise is based on open-source solutions such as Lucene and GATE.

Text Classification API

Version 1.4 of our Text Classification API is now available under the Apache Licence and can be used freely.
A (slightly dated) overview is available as a PDF. The best references at this stage are the javadoc included in the archive and the test classes in the source code.
The project is now hosted on GitHub; questions and contributions are welcome there.


GATE Toolbox
DigitalPebble's GATE Toolbox is a collection of Processing Resources for GATE. It contains the following components:

SentenceSplitter based on JavaCC. It differs from the default GATE component by:

  • speed: 4-5 times faster and more robust
  • autonomous: independent from Tokenizer
  • extra parameters: choose between splitting on single or multiple '\n', and optionally find sentences inside an existing annotation (e.g. Paragraph)
  • coverage: better recognition of acronyms and abbreviations
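
The choice between splitting on single or multiple '\n' can be illustrated with a minimal sketch. It uses plain regular expressions rather than JavaCC and only shows how the newline option changes the resulting segments; it does not detect sentence boundaries within a segment. The class name, method name and boolean flag are hypothetical and are not the Toolbox's actual parameters.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: not the Toolbox implementation (which is based on JavaCC).
// The names NewlineBoundarySketch, split and singleNewline are assumptions.
class NewlineBoundarySketch {

    // Cut the text at newline boundaries.
    // singleNewline = true : every run of one or more '\n' is a boundary.
    // singleNewline = false: only runs of two or more '\n' are boundaries;
    //                        a lone '\n' is kept inside the segment as a space.
    static List<String> split(String text, boolean singleNewline) {
        String boundary = singleNewline ? "\\n+" : "\\n{2,}";
        List<String> segments = new ArrayList<>();
        for (String piece : text.split(boundary)) {
            String cleaned = piece.replace('\n', ' ').trim();
            if (!cleaned.isEmpty()) {
                segments.add(cleaned);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String text = "Title line\nFirst paragraph continues\nhere.\n\nSecond paragraph.";
        System.out.println(split(text, true));
        System.out.println(split(text, false));
    }
}
```

Splitting on a single '\n' suits text where each line is a unit (titles, list items), while splitting only on multiple '\n' suits hard-wrapped text where a lone line break does not end a sentence.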
Regular Expression Annotator
  • takes as input a tab-separated file in which each line holds a pattern and an annotation type (pattern \t annotation type)
  • easier to use than JAPE for extracting simple entities 
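
To make the rule format concrete, here is a minimal sketch of how "pattern \t annotation type" lines could be parsed and applied to a text. It uses only java.util.regex and does not call the GATE API; the class names, the handling of malformed lines and the two sample rules are assumptions made for illustration, not the Toolbox's actual behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical names, not the Toolbox's API.
class RegexRuleSketch {

    // One rule parsed from a line of the form: pattern<TAB>annotationType
    static final class Rule {
        final Pattern pattern;
        final String type;
        Rule(Pattern pattern, String type) {
            this.pattern = pattern;
            this.type = type;
        }
    }

    // A matched span: character offsets [start, end) plus the annotation type.
    static final class Span {
        final int start;
        final int end;
        final String type;
        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Parse rule lines. Lines without a tab, or with an empty pattern or type,
    // are skipped (an assumption; the real component may behave differently).
    static List<Rule> parseRules(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab <= 0) {
                continue;
            }
            String type = line.substring(tab + 1).trim();
            if (type.isEmpty()) {
                continue;
            }
            rules.add(new Rule(Pattern.compile(line.substring(0, tab)), type));
        }
        return rules;
    }

    // Apply every rule to the text and collect one span per match.
    static List<Span> annotate(String text, List<Rule> rules) {
        List<Span> spans = new ArrayList<>();
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(text);
            while (m.find()) {
                spans.add(new Span(m.start(), m.end(), rule.type));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Two sample rules, written as they would appear in the rule file.
        List<Rule> rules = parseRules(Arrays.asList(
                "\\d{4}-\\d{2}-\\d{2}\tDate",
                "[\\w.]+@[\\w.]+\\.\\w+\tEmail"));
        String text = "Mail info@example.org before 2010-03-15.";
        for (Span s : annotate(text, rules)) {
            System.out.println(s.type + "\t" + text.substring(s.start, s.end));
        }
    }
}
```

In GATE itself each span would become an annotation of the given type over the matched offsets; the sketch stops at collecting offsets so that it stays independent of any GATE version.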
Language Identifier
  • identifies 18 languages (da-de-ee-el-en-es-fi-fr-hu-is-it-nl-no-pl-pt-ru-sv-th)
  • can be applied to a whole document or any text covered by an annotation (e.g. Sentence)
Version: 1.1
Download: Toolbox.tar.gz
License: LGPL


RASP4UIMA

DigitalPebble has ported the RASP system to Apache UIMA.

RASP is a domain-independent, robust parsing system for English. For ease of installation, the system is distributed as binaries for three widespread Unix architectures (Intel 32-bit and 64-bit Linux, and SPARC/Solaris), with source code for most of the modules. It is free for research purposes. RASP was originally developed on a UK EPSRC-funded project and has continued to be extended and enhanced since that project ended. An informal description of the RASP system, with examples of system output, is available online. The first public release of the system was in January 2002; the second release (RASPv2) is now available from the RASP licence and download page.

RASP4UIMA wraps the NLP modules of RASP (Sentence Parser, Tokenizer, Part of Speech Tagger, Morphological Analyser and Dependency Parser) as UIMA Analysis Engines.

Version: 1.1 beta
Download: rasp4uima1.1.pear
Documentation: (html)


RASP2 plugin for GATE

DigitalPebble has ported the RASP system to GATE. The RASP plugin wraps the NLP modules of RASP (Tokenizer, Part of Speech Tagger, Morphological Analyser and Dependency Parser) as individual GATE Processing Resources, which allows them to be easily replaced or combined with existing GATE PRs.
This component is part of the standard distribution of GATE and is documented here.