desc/BNCCPE.xml
InputDirectory
/data/BNCXML/corpus/A_All
OutputDirectory
/home/pebble/data/BNCXML/corpus/A-done
-1
immediate
desc/BNCTypeSystemDescriptor.xml
BNCTypeSystemDescriptor
1.0
com.digitalpebble.bncxml.SourceDocumentInformation
uima.tcas.Annotation
uri
uima.cas.String
desc/BNCXMLCollectionReaderDescriptor.xml
org.apache.uima.java
com.digitalpebble.bncxml.BNCXmlCollectionReader
BNCXMLCollectionReaderDescriptor
Reads BNC XML files from a directory
1.0
InputDirectory
String
false
true
com.digitalpebble.rasp.Token
com.digitalpebble.rasp.Sentence
com.digitalpebble.bncxml.SourceDocumentInformation
true
false
true
desc/BNCXMLConsumerDescriptor.xml
org.apache.uima.java
com.digitalpebble.bncxml.BNCXMLConsumer
BNCXMLConsumerDescriptor
Generates an XML representation of BNC documents. Requires Sentences and Tokens in the default view and the original XML of the file in a view called 'xml'.
1.0
DigitalPebble
OutputDirectory
String
false
true
OutputDirectory
temp-uima-output/bncxml
com.digitalpebble.rasp.WordForm
com.digitalpebble.rasp.Token
com.digitalpebble.rasp.Sentence
com.digitalpebble.rasp.Dependency
com.digitalpebble.bncxml.SourceDocumentInformation
xml
false
false
false
desc/CombinedRASPDescriptor.xml
org.apache.uima.java
false
CombinedRASPDescriptor
1.0
SentenceSplitter
Tokenizer
PosTagger
Morpher
Parser
true
true
false
desc/Morpher.xml
org.apache.uima.java
true
com.digitalpebble.rasp.morph.MorphoAnnotator
Morpher
1.0
com.digitalpebble.rasp.Token
com.digitalpebble.rasp.Sentence
true
true
false
desc/Parser.xml
org.apache.uima.java
true
com.digitalpebble.rasp.parser.ParserAnnotator
Parser
1.0
output
specifies the type and content of annotations generated by the Parser
-oa : trees labelled with grammar aliases
-ot : trees labelled with grammar rule names;
-otg : rule-labelled trees and grammatical relations;
-og : grammatical relations
-ogio : grammatical relations weighted using a variant of the inside-outside algorithm
-ogw : weighted grammatical relations computed from the top n trees (only useful if the number of parses option -n is set to >1)
-otgio
String
false
true
parseNum
Give the maximum number of parses that should be produced for each sentence. The default is 1; a value of zero indicates all parses.
Integer
false
false
time
Set a CPU time limit (in seconds) for the processing of each sentence (default 20).
Integer
false
false
subcategorisation
Turn on the use of verb subcategorisation frame probabilities; there is built-in information for around 500 verbs.
Boolean
false
false
phrasalVerbs
Turn on the use of a list of phrasal verbs, which normally allows more accurate identification of verb-particle constructions.
Boolean
false
false
parseNum
1
time
20
subcategorisation
false
phrasalVerbs
true
output
-otg
com.digitalpebble.rasp.WordForm
com.digitalpebble.rasp.Token
com.digitalpebble.rasp.Sentence
com.digitalpebble.rasp.Dependency
true
true
false
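The parser settings listed above can be checked before a run. The following Python sketch is an illustration only and is not part of RASP4UIMA: the option names, descriptions and configured values are copied from Parser.xml, while `check_parser_settings` and its messages are hypothetical.

```python
# Output options as listed in Parser.xml.
OUTPUT_OPTIONS = {
    "-oa": "trees labelled with grammar aliases",
    "-ot": "trees labelled with grammar rule names",
    "-otg": "rule-labelled trees and grammatical relations",
    "-og": "grammatical relations",
    "-ogio": "grammatical relations weighted using a variant of inside-outside",
    "-ogw": "weighted grammatical relations computed from the top n trees",
    "-otgio": "listed in the descriptor without a description",
}

# Values configured in Parser.xml.
CONFIGURED = {
    "output": "-otg",
    "parseNum": 1,
    "time": 20,
    "subcategorisation": False,
    "phrasalVerbs": True,
}


def check_parser_settings(overrides=None):
    """Merge overrides into the configured values and list any problems found."""
    settings = dict(CONFIGURED)
    settings.update(overrides or {})
    problems = []
    if settings["output"] not in OUTPUT_OPTIONS:
        problems.append("unknown output option: %s" % settings["output"])
    if settings["parseNum"] < 0:
        problems.append("parseNum must be 0 (all parses) or a positive number")
    if settings["time"] <= 0:
        problems.append("time must be a positive number of seconds")
    # The descriptor notes that -ogw only helps when more than one parse is kept.
    if settings["output"] == "-ogw" and settings["parseNum"] == 1:
        problems.append("-ogw is only useful when parseNum is greater than 1")
    return settings, problems


if __name__ == "__main__":
    print(check_parser_settings({"output": "-ogw"})[1])
```

A parseNum of zero passes the last check because the descriptor defines zero as "all parses".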
desc/PosTagger.xml
org.apache.uima.java
true
com.digitalpebble.rasp.tagger.PosTagger
PosTagger
1.0
parametersString
Parameters for the POS executable. See the RASP documentation for more details. Only the parameter specifying the format (i.e. multiple tags) is set implicitly.
String
false
true
generateMultipleTags
If true, a Token will get one or more WordForms, each with a probability; otherwise there will be only one WordForm per Token, with a probability set to 1.0.
Boolean
false
true
parametersString
B1 b C1 N t auxiliary_files/slb.trn d auxiliary_files/seclarge.lex j auxiliary_files/unkstats-seclarge m auxiliary_files/tags.map
generateMultipleTags
true
com.digitalpebble.rasp.Sentence
com.digitalpebble.rasp.Token
com.digitalpebble.rasp.Token
com.digitalpebble.rasp.WordForm
en
true
true
false
desc/RASPModulesCPE.xml
InputDirectory
/usr/local/bin/apache-uima/examples/data/xml
Language
en
TokenizerPath
/usr/local/bin/RASP/token/token.ix86_linux
-1
immediate
desc/RASPTypes.xml
RASPTypes
1.1
com.digitalpebble.rasp.Token
A token for Rasp
uima.tcas.Annotation
wordForms
A Token is related to one or more WordForm
uima.cas.FSArray
com.digitalpebble.rasp.WordForm
com.digitalpebble.rasp.Sentence
Annotation for a Sentence
uima.tcas.Annotation
com.digitalpebble.rasp.WordForm
A WordForm consists of a POS tag, a lemma and possibly a probability. There is one or more WordForm per Token (as in the MAF ISO Norm)
uima.tcas.Annotation
lemma
lemma of the Form
uima.cas.String
POS
POS tag for a given form
uima.cas.String
probability
uima.cas.Double
suffix
uima.cas.String
com.digitalpebble.rasp.Dependency
A dependency between two word forms
uima.tcas.Annotation
deptype
uima.cas.String
subtype
uima.cas.String
head
com.digitalpebble.rasp.WordForm
dep
com.digitalpebble.rasp.WordForm
com.digitalpebble.rasp.Clause
A clause as returned by the RASP analyser. It can contain one or more word forms or clauses
uima.tcas.Annotation
rule
uima.cas.String
subclauses
array of subelements. contains WordForms or Clauses
uima.cas.FSArray
uima.tcas.Annotation
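The type system above can be pictured as plain data classes. The following Python sketch is an illustration only, not the generated JCas code: the class and field names mirror the types and features declared in RASPTypes.xml, while `best_word_form` and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Annotation:
    """Stand-in for uima.tcas.Annotation: a span of character offsets."""
    begin: int
    end: int


@dataclass
class WordForm(Annotation):
    lemma: Optional[str] = None
    pos: Optional[str] = None           # feature "POS" in the descriptor
    probability: Optional[float] = None
    suffix: Optional[str] = None


@dataclass
class Token(Annotation):
    # Feature "wordForms": a Token is related to one or more WordForm.
    word_forms: List[WordForm] = field(default_factory=list)


@dataclass
class Sentence(Annotation):
    pass


@dataclass
class Dependency(Annotation):
    deptype: Optional[str] = None
    subtype: Optional[str] = None
    head: Optional[WordForm] = None
    dep: Optional[WordForm] = None


@dataclass
class Clause(Annotation):
    rule: Optional[str] = None
    # Feature "subclauses": contains WordForms or Clauses.
    subclauses: List[Union[WordForm, "Clause"]] = field(default_factory=list)


def best_word_form(token):
    """Hypothetical helper: return the most probable WordForm of a Token."""
    if not token.word_forms:
        return None
    return max(token.word_forms, key=lambda wf: wf.probability or 0.0)


# Made-up values, for illustration only.
TOKEN = Token(0, 6, [
    WordForm(0, 6, lemma="vary", pos="VVN", probability=0.8),
    WordForm(0, 6, lemma="varied", pos="JJ", probability=0.2),
])
BEST = best_word_form(TOKEN)
print(BEST.pos)
```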
desc/SentenceSplitter.xml
org.apache.uima.java
true
com.digitalpebble.rasp.splitter.SentenceSplitter
SentenceSplitter
Simple sentence splitter which calls an external command
1.0
com.digitalpebble.rasp.Sentence
true
true
false
desc/Tokenizer.xml
org.apache.uima.java
true
com.digitalpebble.rasp.tokenizer.NativeTokenAnnotator
Tokenizer
1.0
com.digitalpebble.rasp.Sentence
com.digitalpebble.rasp.Token
true
true
false
doc/BNC.html
RASP4UIMA

RASP4UIMA 1.0 Beta
Resources for the BNC XML Edition
RASP4UIMA contains two additional resources for the XML Edition of the British National Corpus. The British National Corpus (BNC) is a 100 million word collection of samples of written
and spoken language from a wide range of sources, designed to represent a wide
cross-section of British English from the later part of the 20th century, both spoken and
written. The latest edition is the BNC XML Edition, released in 2007.
These resources are provided as-is and are mainly intended to
illustrate the use of RASP4UIMA. The corresponding XML descriptors can
be found in the /desc directory of RASP4UIMA.
BNC XML Collection Reader
This resource takes as input a directory containing XML documents
that conform to the BNC DTD and converts them into UIMA CASes. The original XML
is kept in a separate SOFA while the <s> and <w> elements
are converted into annotations of type com.digitalpebble.rasp.Sentence and com.digitalpebble.rasp.Token.
The POS tagger, Morpher and Parser of RASP4UIMA can then be used on these documents.
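The conversion of <s> and <w> elements into offset-based annotations can be sketched in a few lines. The following Python sketch is an illustration only and does not reproduce the actual BNCXmlCollectionReader: the function name, the sample document and the choice of joining tokens with a single space are assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document; real BNC files contain far more markup.
SAMPLE = '<doc><s n="1"><w>He</w><w>left</w></s><s n="2"><w>Fine</w></s></doc>'


def to_text_and_spans(xml_source):
    """Return plain text plus (begin, end) offsets for sentences and tokens."""
    root = ET.fromstring(xml_source)
    pieces, sentences, tokens = [], [], []
    offset = 0
    for s in root.iter("s"):
        sentence_begin = offset
        for w in s.iter("w"):
            form = (w.text or "").strip()
            if not form:
                continue
            tokens.append((offset, offset + len(form)))
            # Assumption: tokens are separated by exactly one space.
            pieces.append(form + " ")
            offset += len(form) + 1
        # A sentence ends where its last token ends (before the separator).
        sentence_end = offset - 1 if offset > sentence_begin else sentence_begin
        sentences.append((sentence_begin, sentence_end))
    return "".join(pieces), sentences, tokens


TEXT, SENTENCES, TOKENS = to_text_and_spans(SAMPLE)
for begin, end in SENTENCES:
    print(begin, end, TEXT[begin:end])
```

In a CAS, each pair of offsets would become the begin and end of a Sentence or Token annotation over the document text.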
BNC XML Consumer
This resource takes a CAS generated by the BNC XML Collection Reader and annotated by RASP4UIMA.
It regenerates the XML content of the original document but
replaces the original <s> and <w> elements with the
corresponding information generated by the RASP modules. In
short, the POS tags and lemmas are replaced with the RASP equivalents.
In addition the grammatical relations found by RASP are added to the XML.
The example below shows a sentence from a BNC document as generated by the BNC XML Consumer.
<s n="5">
<w id="1" pos="ICS">After</w>
<w id="2" pos="AT1">a</w>
<w id="3" pos="VVN">varied</w>
<w id="4" pos="NN1">career</w>
<w id="5" pos="II">in</w>
<w id="6" pos="NN1">teaching</w>
<w id="7" pos=",">,</w>
<w id="8" pos="PPHS1">he</w>
<w id="9" pos="VVD">became</w>
<w id="10" pos="VVN">involved</w>
<w id="11" pos="IW">with</w>
<w id="12" pos="JJR">older</w>
<w id="13" pos="NN">people</w>
<w id="14" pos="CS">while</w>
<w id="15" pos="VVG">taking</w>
<w id="16" pos="AT1">a</w>
<w id="17" pos="JB">postgraduate</w>
<w id="18" pos="NN1">degree</w>
<w id="19" pos="CC">and</w>
<w id="20" pos="NN1">training</w>
<w id="21" pos="CSA">as</w>
<w id="22" pos="AT1">a</w>
<w id="23" pos="JJ">social</w>
<w id="24" pos="NN1">worker</w>
<w id="25" pos=".">.</w>
<grlist>
<gr type='ncmod' subtype='_' head='9' modifier='1' />
<gr type='dobj' subtype='' head='1' modifier='4' />
<gr type='det' subtype='' head='4' modifier='2' />
<gr type='ncsubj' subtype='' head='3' modifier='4' />
<gr type='ncmod' subtype='_' head='4' modifier='3' />
<gr type='passive' subtype='' head='3' />
<gr type='ncmod' subtype='_' head='4' modifier='5' />
<gr type='dobj' subtype='' head='5' modifier='6' />
<gr type='ncsubj' subtype='' head='9' modifier='8' />
<gr type='xcomp' subtype='_' head='9' modifier='10' />
<gr type='xcomp' subtype='_' head='10' modifier='14' />
<gr type='iobj' subtype='' head='10' modifier='11' />
<gr type='passive' subtype='' head='10' />
<gr type='dobj' subtype='' head='11' modifier='13' />
<gr type='ncmod' subtype='_' head='13' modifier='12' />
<gr type='xcomp' subtype='_' head='14' modifier='15' />
<gr type='dobj' subtype='' head='15' modifier='19' />
<gr type='det' subtype='' head='19' modifier='16' />
<gr type='ncmod' subtype='_' head='19' modifier='17' />
<gr type='conj' subtype='' head='19' modifier='18' />
<gr type='conj' subtype='' head='19' modifier='20' />
<gr type='ncmod' subtype='_' head='20' modifier='21' />
<gr type='dobj' subtype='' head='21' modifier='24' />
<gr type='det' subtype='' head='24' modifier='22' />
<gr type='ncmod' subtype='_' head='24' modifier='23' />
</grlist>
</s>
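Output in this format can be read back with a standard XML parser. The following Python sketch is an illustration only: `read_relations` is a hypothetical helper, and the sample is a shortened copy of the sentence above, reduced to its first four tokens and the relations that involve only those tokens.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<s n="5">
<w id="1" pos="ICS">After</w>
<w id="2" pos="AT1">a</w>
<w id="3" pos="VVN">varied</w>
<w id="4" pos="NN1">career</w>
<grlist>
<gr type='dobj' subtype='' head='1' modifier='4' />
<gr type='det' subtype='' head='4' modifier='2' />
<gr type='ncmod' subtype='_' head='4' modifier='3' />
<gr type='passive' subtype='' head='3' />
</grlist>
</s>"""


def read_relations(sentence_xml):
    """Return (type, head word, modifier word) tuples for one <s> element."""
    s = ET.fromstring(sentence_xml)
    # Map token ids to surface forms so relation arguments can be resolved.
    words = {w.get("id"): (w.text or "") for w in s.iter("w")}
    relations = []
    for gr in s.iter("gr"):
        head = words.get(gr.get("head"))
        # Unary relations such as 'passive' carry no modifier attribute.
        modifier_id = gr.get("modifier")
        modifier = words.get(modifier_id) if modifier_id else None
        relations.append((gr.get("type"), head, modifier))
    return relations


RELATIONS = read_relations(SAMPLE)
for relation in RELATIONS:
    print(relation)
```

The head and modifier attributes refer to the id attributes of the <w> elements in the same sentence, which is why a per-sentence lookup table is enough.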
doc/index.html
RASP4UIMA

RASP4UIMA 1.0 beta
Overview
RASP4UIMA is
an integration of the RASP
System into the Apache UIMA
framework.
RASP is a
domain-independent, robust parsing system for English. For
ease of installation, the system is distributed in the form of binaries
for 3 widespread unix architectures (Intel-32bit and -64bit/Linux, and
Sparc/Solaris). It is free for
research purposes. RASP is described in :
UIMA is an
Apache project in incubation which provides a component
framework for analysing unstructured content such as text, audio and
video. It comprises an SDK and tooling for composing and running
analytic components written in Java and C++.
Please contact the respective projects with any questions about RASP or UIMA. You can use the DigitalPebble user group for questions specific to RASP4UIMA.
Installation
This version of RASP4UIMA has been tested on Apache UIMA 2.1.0. It is
available as a PEAR package and can be installed with the PEAR installer. You
will also need to download and install RASP2 from the RASP
project page.
Run the PearInstaller
(e.g. /usr/local/bin/apache-uima/bin/runPearInstaller.sh). Select
the RASP4UIMA pear file and a target directory for the installation. In
this manual we assume RASP4UIMA has been
installed in /usr/local/bin/RASP4UIMA.

Note: RASP4UIMA relies on a Java system property (rasp.home) to determine where the original RASP executables are located. See $RASP4UIMA/metadata/setenv.txt for more details.
Make sure you specify the location of RASP with -Drasp.home when
you call the UIMA executables. For instance, if you want to run your
component in the Collection Processing Engine Configurator GUI
application, you need to add the settings from
the component's setenv.txt file to the cpeGui.bat (cpeGui.sh) script
file in the <UIMA_HOME>/bin directory.
Test
Once RASP4UIMA has been installed with the PEAR installer, you
can test the installation with the Collection Processing Engine (CPE).
Please refer to the UIMA documentation for more details on the use of
this tool. Don't forget to add the variables from setenv.txt to the script (see above).
For this test we'll use the Collection Reader and the Xmi Writer CAS Consumer
available in the UIMA examples. These two components are
respectively in charge of converting a collection of documents into
CASes and serializing the CASes into XML files. Their descriptors can be found in the /examples directory of UIMA.
Click on the Add button of the Analysis Engines section and select the file SentenceSplitter.xml located
in the /desc directory of the RASP4UIMA installation (e.g. /usr/local/bin/RASP4UIMA/desc).
Repeat the procedure for the files Tokenizer.xml, PosTagger.xml, Morpher.xml and Parser.xml.
You should get something similar to the screenshot below:

Click on the Play button. After a while you should get a summary of the
process. To inspect the results, you can use the UIMA AnnotationViewer,
which takes as input a directory containing XML files in XMI
format and a TypeSystem file. The TypeSystem description for
RASP4UIMA is in the file /desc/RASPTypes.xml.
Once you've specified both the input directory and the TypeSystem,
click on View and
double click on one of the documents of the list. You should get
something similar to the screenshot below. More details about the
Annotation Types generated by RASP4UIMA can be found in the Modules
section.

doc/logo2.gif