RASP4UIMA 1.0 Beta

RASP Analysis Engines

RASP4UIMA provides five types of Analysis Engines. Each one requires the annotation types produced by the engines that precede it.
A description of the different modules in the original RASP is available at http://www.informatics.sussex.ac.uk/research/groups/nlp/rasp/offline-demo.html.
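
The ordering constraint can be illustrated with a small sketch. It is not part of RASP4UIMA: the table of required and produced types below is an assumption read off the engine descriptions on this page, and the function name missing_types is hypothetical.

```python
# Illustrative only: this table is an assumption based on the engine
# descriptions on this page, not something read from RASP4UIMA descriptors.
# Each entry maps an engine to (types it requires, types it produces).
ENGINES = {
    "SentenceSplitter": (set(), {"Sentence"}),
    "Tokenizer": ({"Sentence"}, {"Token"}),
    "POSTagger": ({"Sentence", "Token"}, {"WordForm"}),
    # The Morpher adds a lemma attribute to WordForms, not a new type.
    "Morpher": ({"Sentence", "Token", "WordForm"}, set()),
    "Parser": ({"Sentence", "Token", "WordForm"}, {"Clause", "Dependency"}),
}


def missing_types(pipeline):
    """Return (engine, missing types) pairs for a proposed engine order."""
    available = set()
    problems = []
    for name in pipeline:
        requires, produces = ENGINES[name]
        missing = requires - available
        if missing:
            problems.append((name, sorted(missing)))
        available |= produces
    return problems


if __name__ == "__main__":
    # Running the Parser without the POS Tagger leaves WordForm missing.
    print(missing_types(["SentenceSplitter", "Tokenizer", "Parser"]))
```

An empty result means every engine finds the annotation types it needs from the engines that ran before it.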

SentenceSplitter

Takes the text of a document as input and creates annotations of type Sentence. The RASP sentence splitter recognizes acronyms and common abbreviations (e.g. Dr.), so the period in such forms is not treated as a sentence boundary.
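
The following toy splitter shows why an abbreviation list matters. It is a deliberately naive stand-in, not the algorithm RASP uses: the abbreviation set and the function split_sentences are assumptions made for illustration only.

```python
import re

# Hypothetical abbreviation list; RASP's own resources are not reproduced here.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Prof.", "e.g.", "i.e."}


def split_sentences(text):
    """Return (begin, end) character offsets, one pair per sentence."""
    spans = []
    begin = None
    end = None
    for match in re.finditer(r"\S+", text):
        if begin is None:
            begin = match.start()  # first word of a new sentence
        end = match.end()
        word = match.group()
        # A word ending in . ! or ? closes the sentence unless it is
        # a known abbreviation.
        if word[-1] in ".!?" and word not in ABBREVIATIONS:
            spans.append((begin, end))
            begin = None
    if begin is not None:  # trailing text without final punctuation
        spans.append((begin, end))
    return spans


if __name__ == "__main__":
    sample = "Dr. Smith arrived. He sat down."
    for b, e in split_sentences(sample):
        print(repr(sample[b:e]))
```

The offsets play the role of the begin and end positions of a Sentence annotation over the document text.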

Tokenizer

Creates annotations of type Token, using the information about Sentences. A Token is a simple annotation which contains only a list of WordForms. This separation between Tokens and WordForms is based on the MAF ISO proposal.

POS Tagger

The WordForms are created by the Part of Speech tagger. The tagset is close to CLAWS C7, although it is in fact a cut-down version of the CLAWS C2 tagset. For more details see e.g. Appendix C of Jurafsky, D. and Martin, J., Speech and Language Processing, Prentice-Hall, 2000.

Each WordForm gets a POS attribute, which is a simple String, and a probability.

The POS Tagger takes two parameters: a String which corresponds to the parameters used in the original RASP, and a boolean indicating whether to generate several different WordForms for a Token.
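
The relation between Tokens and WordForms can be sketched as plain data classes. This is a minimal model under stated assumptions, not the RASP4UIMA type system itself: the class and field names, the helper attach_word_forms, and the reading of the boolean parameter as "keep all candidate tags rather than only the most probable one" are all assumptions. The tags and probabilities in the usage lines are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WordForm:
    pos: str                      # POS tag, a plain string
    probability: float            # probability assigned by the tagger
    lemma: Optional[str] = None   # filled in later by the Morpher


@dataclass
class Token:
    begin: int                    # character offsets in the document
    end: int
    word_forms: List[WordForm] = field(default_factory=list)


def attach_word_forms(token, candidates, multiple):
    """Attach WordForms built from (tag, probability) candidates.

    With multiple=False only the most probable candidate is kept
    (assumed meaning of the boolean parameter).
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not multiple:
        ranked = ranked[:1]
    token.word_forms = [WordForm(pos=t, probability=p) for t, p in ranked]
    return token


if __name__ == "__main__":
    # Illustrative tags and probabilities, not real tagger output.
    token = attach_word_forms(Token(0, 4), [("VV0", 0.3), ("NN1", 0.7)], multiple=True)
    print([(w.pos, w.probability) for w in token.word_forms])
```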

Morpher

Next, the tagger output is lemmatized, based on the tags assigned to word tokens. See Briscoe and Carroll (2002) for further details and a reference to a detailed paper describing this module. In RASP4UIMA the Morpher adds a lemma attribute to each WordForm.

Parser

The probabilistic parser analyses the PoS tag sequence, or a chart of the initially more probable tags, and generates a parse forest representation containing all possible subanalyses with associated probabilities. From this representation it is able to construct the n-best syntactic trees and/or (weighted) grammatical relations.

The parser generates annotations of the types Clause and Dependency. A Clause has an attribute rule and contains an array of subclauses. A subclause can be a WordForm or another Clause element. Dependencies have a type and a subtype and link two WordForms as head and dependent.
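
The recursive shape of Clause and the head/dependent link of Dependency can be sketched as follows. This is an illustrative model only: the Python class and field names, the text field on WordForm, the helper leaves, and the rule names and relation labels in the usage lines are assumptions, not values taken from RASP4UIMA.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class WordForm:
    text: str   # convenience field for this sketch only
    pos: str


@dataclass
class Clause:
    rule: str
    # A subclause is either a WordForm or another Clause.
    subclauses: List[Union["Clause", WordForm]] = field(default_factory=list)


@dataclass
class Dependency:
    type: str
    subtype: str
    head: WordForm
    dependent: WordForm


def leaves(clause):
    """Collect the WordForms under a Clause in left-to-right order."""
    found = []
    for child in clause.subclauses:
        if isinstance(child, Clause):
            found.extend(leaves(child))
        else:
            found.append(child)
    return found


if __name__ == "__main__":
    # Placeholder rule names and labels, chosen for illustration.
    the, dog, barks = WordForm("the", "AT"), WordForm("dog", "NN1"), WordForm("barks", "VVZ")
    tree = Clause("S", [Clause("NP", [the, dog]), Clause("VP", [barks])])
    subject = Dependency("ncsubj", "_", head=barks, dependent=dog)
    print([w.text for w in leaves(tree)], subject.head.text, subject.dependent.text)
```

Walking the tree with leaves recovers the WordForms in sentence order, while a Dependency links two of those WordForms directly without going through the Clause structure.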

The main parameter of the Parser is the string specifying its output. Please refer to the documentation of RASP for more details about the values allowed for this parameter and the other options available.

Note that the Parser requires a recent machine with at least 1.5 GB of RAM.

RASP4UIMA Type System

The illustration below summarizes the Types used in RASP4UIMA.