digitalpebble
 
         
 
 
RASP4UIMA

DigitalPebble has ported the RASP system to Apache UIMA.

RASP is a domain-independent, robust parsing system for English. For ease of installation, it is distributed as binaries for three widespread Unix architectures (Intel 32-bit and 64-bit Linux, and SPARC/Solaris), with source code for most of the modules. It is free for research purposes. RASP was originally developed on a UK EPSRC-funded project and has been extended and enhanced continuously since that project ended. An informal description of the RASP system is online, with examples of system output. The first public release of the system was in January 2002; the second release (RASPv2) is now available. To obtain it, go to the RASP licence and download page.

RASP4UIMA wraps the NLP modules of RASP (Sentence Parser, Tokenizer, Part of Speech Tagger, Morphological Analyser and Dependency Parser) as UIMA Analysis Engines.

Version: 1.1 beta
Download: rasp4uima1.1.pear
Documentation: (html)


 
GATE Toolbox
DigitalPebble's GATE Toolbox is a collection of Processing Resources for GATE. It contains the following components:

SentenceSplitter based on JavaCC. It differs from the default GATE component in:
  • speed: 4-5 times faster, and more robust
  • autonomy: does not depend on the Tokenizer
  • extra parameters: can split on single or multiple '\n', and can find sentences inside an existing annotation (e.g. Paragraph)
  • coverage: better recognition of acronyms and abbreviations
Regular Expression Annotator
  • takes as input a tab-separated file in which each line holds a pattern and an annotation type (pattern \t annotation type)
  • easier to use than JAPE for extracting simple entities
Language Identifier
  • identifies 18 languages (da-de-ee-el-en-es-fi-fr-hu-is-it-nl-no-pl-pt-ru-sv-th)
  • can be applied to a whole document or any text covered by an annotation (e.g. Sentence)
Version: 1.1
Download: Toolbox.tar.gz
License: LGPL
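
To illustrate the input format of the Regular Expression Annotator described above, here is a minimal Java sketch that reads "pattern \t annotation type" lines and reports the character spans each pattern matches. It is not the Toolbox's code: the class name RegexRuleSketch, its nested Rule and Span types, and the choice to skip blank lines are all assumptions made for this example, and no GATE classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: these names are illustrative, not part of the GATE Toolbox.
class RegexRuleSketch {

    // One rule: a compiled pattern and the annotation type it produces.
    static final class Rule {
        final Pattern pattern;
        final String type;
        Rule(Pattern pattern, String type) { this.pattern = pattern; this.type = type; }
    }

    // One match: character offsets [start, end) plus the annotation type.
    static final class Span {
        final int start;
        final int end;
        final String type;
        Span(int start, int end, String type) { this.start = start; this.end = end; this.type = type; }
        @Override public String toString() { return type + "[" + start + "," + end + ")"; }
    }

    // Parses lines of the form "pattern<TAB>annotation type".
    // Blank lines are skipped (an assumption of this sketch).
    static List<Rule> parseRules(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;
            int tab = line.lastIndexOf('\t'); // the type follows the last tab
            if (tab <= 0) throw new IllegalArgumentException("Expected pattern<TAB>type: " + line);
            rules.add(new Rule(Pattern.compile(line.substring(0, tab)), line.substring(tab + 1).trim()));
        }
        return rules;
    }

    // Applies every rule to the text, in rule order, and collects all matches.
    static List<Span> annotate(String text, List<Rule> rules) {
        List<Span> spans = new ArrayList<>();
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(text);
            while (m.find()) spans.add(new Span(m.start(), m.end(), rule.type));
        }
        return spans;
    }

    public static void main(String[] args) {
        List<Rule> rules = parseRules(Arrays.asList(
                "\\d{4}-\\d{2}-\\d{2}\tDate",
                "[A-Z]{2,}\tAcronym"));
        for (Span s : annotate("Contact IBM before 2024-01-15.", rules)) System.out.println(s);
    }
}
```

Each rule is independent of the others, which is what makes this format simpler than a JAPE grammar for flat entities such as dates or codes: there are no phases, priorities or left-hand-side constraints to manage.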

 
RASP2 plugin for GATE

DigitalPebble has ported the RASP system to GATE. The RASP plugin wraps the NLP modules of RASP (Tokenizer, Part of Speech Tagger, Morphological Analyser and Dependency Parser) as individual GATE Processing Resources, which allows them to be easily replaced or combined with existing GATE PRs.

This component is part of the standard distribution of GATE.

 
Java API for Web-1T Corpus

The Web 1T 5-gram corpus contains n-grams, from unigrams to 5-grams, with counts compiled from a corpus of one trillion words. It is distributed by the Linguistic Data Consortium for researchers.

We have developed a Java API for querying the Web 1T corpus (or any corpus in a similar format). Unlike Get1T, our API allows on-the-fly queries of the full set of Web 1T n-grams, even on a machine with modest hardware. The API also helps create n-gram corpora from other sources (Lucene indices, the BNC corpus).
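
As a rough illustration of how n-gram counts can be queried on the fly without loading a corpus into memory, the sketch below runs a binary search directly over a sorted file on disk. It does not show our API: the class NgramLookupSketch and its lookup method are hypothetical, and the sketch assumes an ASCII file of "n-gram \t count" lines sorted by n-gram in byte order.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: not the DigitalPebble API, only one possible lookup technique.
class NgramLookupSketch {

    // Returns the count stored for the n-gram, or -1 if the n-gram is absent.
    // Assumes ASCII "n-gram<TAB>count" lines sorted by n-gram in byte order.
    static long lookup(RandomAccessFile file, String ngram) throws IOException {
        long lo = 0;
        long hi = file.length();
        // Invariant: if the n-gram is present, its line starts at an offset in [lo, hi).
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            if (mid == 0) {
                file.seek(0);
            } else {
                file.seek(mid - 1);
                file.readLine(); // advance to the first line starting at or after mid
            }
            long start = file.getFilePointer();
            String line = start < hi ? file.readLine() : null;
            if (line == null) { // no line starts in [mid, hi)
                hi = mid;
                continue;
            }
            int tab = line.lastIndexOf('\t');
            if (tab < 0) throw new IOException("Malformed line: " + line);
            int cmp = line.substring(0, tab).compareTo(ngram);
            if (cmp == 0) return Long.parseLong(line.substring(tab + 1).trim());
            if (cmp < 0) {
                lo = file.getFilePointer(); // target starts after this line
            } else {
                hi = mid;                   // target starts before mid
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java NgramLookupSketch <sorted-ngram-file> "<n-gram>"
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
            System.out.println(lookup(file, args[1]));
        }
    }
}
```

Each query touches only a logarithmic number of lines, so memory use stays constant regardless of file size. A practical implementation would also need an index telling which file holds a given n-gram and buffered reads, both of which are left out here.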

Contact us for more details and the terms of use.