
RASP4UIMA 1.0 Beta
RASP Analysis Engines
RASP4UIMA provides five types of Analysis Engines. Each one requires the annotation types produced by the engines that precede it.
A description of the different modules in the original RASP is available at http://www.informatics.sussex.ac.uk/research/groups/nlp/rasp/offline-demo.html.
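The ordering constraint above can be sketched as a small check. This is only an illustration in Python, not the UIMA or RASP4UIMA API: the engine names, the requires/produces table, and the function check_pipeline are assumptions read off the descriptions in this section. The sketch tracks annotation types only, so it does not capture whether the Parser also expects the lemma attribute added by the Morpher.

```python
# Hypothetical model of the ordering constraint; not the UIMA API.
# Each engine maps to (annotation types it requires, annotation types it creates).
ENGINES = {
    "SentenceSplitter": (set(), {"Sentence"}),
    "Tokenizer": ({"Sentence"}, {"Token"}),
    "POSTagger": ({"Token"}, {"WordForm"}),
    "Morpher": ({"WordForm"}, set()),  # only adds a lemma to existing WordForms
    "Parser": ({"WordForm"}, {"Clause", "Dependency"}),
}


def check_pipeline(order):
    """Return the first engine whose required types are missing, or None."""
    available = set()
    for name in order:
        requires, produces = ENGINES[name]
        if not requires <= available:
            return name
        available |= produces
    return None


full = ["SentenceSplitter", "Tokenizer", "POSTagger", "Morpher", "Parser"]
print(check_pipeline(full))
print(check_pipeline(["SentenceSplitter", "POSTagger"]))
```

With the full sequence the check finds nothing missing. When the Tokenizer is skipped, it reports the POS Tagger as the first engine that cannot run.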
SentenceSplitter
Takes the text of a document as input and creates annotations of type Sentence. The RASP sentence splitter recognizes acronyms and common abbreviations (e.g. Dr.).
Tokenizer
Creates annotations of type Token, using the Sentence annotations. A Token is a simple annotation that contains only a list of WordForms. This separation between Tokens and WordForms is based on the MAF (Morpho-syntactic Annotation Framework) ISO proposal.
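The Token/WordForm separation can be pictured with two small data classes. This is a minimal sketch in Python, not the generated UIMA types: the field names begin, end and word_forms are assumptions (begin and end mirror the character offsets that UIMA annotations carry), and the sample text is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordForm:
    # Assumed to carry character offsets; the PoS and lemma attributes
    # described in the following sections are left out here.
    begin: int
    end: int


@dataclass
class Token:
    begin: int
    end: int
    # A Token holds nothing but its list of WordForms.
    word_forms: List[WordForm] = field(default_factory=list)


text = "Dr. Smith left."
token = Token(begin=4, end=9)
covered = text[token.begin:token.end]

# After tokenization the list is still empty; a later engine fills it.
token.word_forms.append(WordForm(begin=token.begin, end=token.end))
print(covered, len(token.word_forms))
```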
POS Tagger
The WordForms are created by the part-of-speech (PoS) tagger. The tagset is close to CLAWS C7 (see e.g. Appendix C of Jurafsky, D. and Martin, J., Speech and Language Processing, Prentice-Hall, 2000, for more details), although it is in fact a cut-down version of the CLAWS C2 tagset.
Each WordForm gets a POS attribute, which is a simple String, and a probability.
The POS Tagger takes two parameters: a String that corresponds to the parameters used in the original RASP, and a boolean that indicates whether or not to generate different WordForms for a Token.
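One plausible reading of the boolean parameter is that it switches between keeping every candidate tag of a Token as its own WordForm and keeping only the most probable one. The sketch below illustrates that reading in Python. It is an assumption, not the RASP4UIMA implementation: the function make_word_forms, its arguments, and the probabilities are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WordForm:
    pos: str            # PoS tag, a plain string
    probability: float  # probability attached to that tag


def make_word_forms(candidates: List[Tuple[str, float]], multiple: bool) -> List[WordForm]:
    """Build WordForms for one Token from (tag, probability) candidates.

    multiple=True keeps one WordForm per candidate tag;
    multiple=False keeps only the most probable tag.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    kept = ranked if multiple else ranked[:1]
    return [WordForm(pos=tag, probability=p) for tag, p in kept]


# Invented candidates for an ambiguous word such as "walk".
candidates = [("VV0", 0.3), ("NN1", 0.7)]
print(make_word_forms(candidates, multiple=True))
print(make_word_forms(candidates, multiple=False))
```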
Morpher
Next, the tagger output is lemmatized, based on the tags assigned to word tokens. See Briscoe and Carroll (2002) for further details and a reference to a detailed paper describing this module.
In RASP4UIMA, the Morpher adds a lemma attribute to the WordForms.
Parser
The probabilistic parser analyses the PoS tag sequence, or a chart of the initially more probable tags, and generates a parse forest representation containing all possible subanalyses with associated probabilities. From this representation it can construct the n-best syntactic trees and/or (weighted) grammatical relations.
The parser generates annotations of type Clause and Dependency. A Clause has a rule attribute and contains an array of subclauses. A subclause can be a WordForm or another Clause. Dependencies have a type and a subtype and link two WordForms, one as head and one as dependent.
The main parameter of the Parser is the string specifying its output. Please refer to the RASP documentation for more details about the values allowed for this parameter and the other options available.
Note that the Parser requires a recent machine with at least 1.5 GB of RAM.
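The Clause and Dependency annotations described above form a tree plus a set of labelled links. The following Python sketch models them with plain data classes. It is not the RASP4UIMA type system itself: the field names, the rule names (S, NP, VP) and the example sentence are placeholders chosen for illustration, and leaves is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class WordForm:
    text: str
    pos: str


@dataclass
class Clause:
    rule: str
    # A subclause is either a WordForm or another Clause.
    subclauses: List[Union["Clause", WordForm]] = field(default_factory=list)


@dataclass
class Dependency:
    type: str
    subtype: str
    head: WordForm
    dependent: WordForm


def leaves(clause: Clause) -> List[WordForm]:
    """Collect the WordForms under a Clause, left to right."""
    found = []
    for child in clause.subclauses:
        if isinstance(child, Clause):
            found.extend(leaves(child))
        else:
            found.append(child)
    return found


the, dog, barks = WordForm("the", "AT"), WordForm("dog", "NN1"), WordForm("barks", "VVZ")
# Placeholder rule names; real RASP rule names differ.
tree = Clause("S", [Clause("NP", [the, dog]), Clause("VP", [barks])])
# Subject relation: "barks" is the head, "dog" the dependent; subtype left empty.
dep = Dependency(type="ncsubj", subtype="", head=barks, dependent=dog)

print([w.text for w in leaves(tree)])
```

Because a Dependency points at the same WordForm objects that sit at the leaves of the Clause tree, both views of the parse share the tagger's output and do not copy it.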
RASP4UIMA Type System
The illustration below summarizes the Types used in RASP4UIMA.
