RASP4UIMA 1.0 Beta


Resources for the BNC XML Edition

RASP4UIMA contains two additional resources for the XML Edition of the British National Corpus (BNC). The BNC is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a broad cross-section of British English from the later part of the 20th century. The latest edition is the BNC XML Edition, released in 2007.

These resources are provided as-is and are intended mainly to illustrate the use of RASP4UIMA. The corresponding XML descriptors can be found in the /desc directory of RASP4UIMA.

BNC XML Collection Reader

This resource takes as input a directory containing XML documents that conform to the BNC DTD and converts them into UIMA CASes. The original XML is kept in a separate SOFA, while the <s> and <w> elements are converted into annotations of type com.digitalpebble.rasp.Sentence and com.digitalpebble.rasp.Token respectively.

The POS tagger, Morpher and Parser of RASP4UIMA can then be used on these documents.
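
The mapping from <s> and <w> elements to stand-off annotations can be illustrated with a short Python sketch. This is not the Collection Reader's code, which is a UIMA component. It only shows the general idea of turning inline markup into character offsets over the plain text. The input fragment, the function name to_annotations and the tuple layout are assumptions made for illustration; real BNC files carry more attributes and surrounding structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a minimal fragment using only <s> and <w> elements.
SAMPLE = '<s n="1"><w>He </w><w>became </w><w>involved</w></s>'


def to_annotations(sentence_xml):
    """Return (text, sentence_span, token_spans) for one <s> element.

    Spans are (label, begin, end) tuples with character offsets into text,
    loosely mirroring Sentence and Token annotations over a document.
    """
    s = ET.fromstring(sentence_xml)
    text = ""
    tokens = []
    for w in s.iter("w"):
        form = w.text or ""
        # Skip any whitespace around the word so the span covers the word only.
        begin = len(text) + (len(form) - len(form.lstrip()))
        end = begin + len(form.strip())
        tokens.append(("Token", begin, end))
        text += form
    text = text.rstrip()
    if tokens:
        sentence = ("Sentence", tokens[0][1], tokens[-1][2])
    else:
        sentence = ("Sentence", 0, 0)
    return text, sentence, tokens


if __name__ == "__main__":
    text, sentence, tokens = to_annotations(SAMPLE)
    print(repr(text))
    print(sentence)
    for label, begin, end in tokens:
        print(label, begin, end, text[begin:end])
```

Keeping offsets rather than copies of the words is what allows later components to attach further information to the same spans without modifying the text.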

BNC XML Consumer

This resource takes a CAS generated by the BNC XML Collection Reader and annotated by RASP4UIMA. It regenerates the XML content of the original document but replaces the original <s> and <w> elements with the corresponding information generated by the RASP modules. In short, the POS tags and lemmas are replaced with their RASP equivalents. In addition, the grammatical relations found by RASP are added to the XML.

The example below shows a sentence from a BNC document as generated by the BNC XML Consumer.

<s n="5">
<w id="1" pos="ICS">After</w>
<w id="2" pos="AT1">a</w>
<w id="3" pos="VVN">varied</w>
<w id="4" pos="NN1">career</w>
<w id="5" pos="II">in</w>
<w id="6" pos="NN1">teaching</w>
<w id="7" pos=",">,</w>
<w id="8" pos="PPHS1">he</w>
<w id="9" pos="VVD">became</w>
<w id="10" pos="VVN">involved</w>
<w id="11" pos="IW">with</w>
<w id="12" pos="JJR">older</w>
<w id="13" pos="NN">people</w>
<w id="14" pos="CS">while</w>
<w id="15" pos="VVG">taking</w>
<w id="16" pos="AT1">a</w>
<w id="17" pos="JB">postgraduate</w>
<w id="18" pos="NN1">degree</w>
<w id="19" pos="CC">and</w>
<w id="20" pos="NN1">training</w>
<w id="21" pos="CSA">as</w>
<w id="22" pos="AT1">a</w>
<w id="23" pos="JJ">social</w>
<w id="24" pos="NN1">worker</w>
<w id="25" pos=".">.</w>

<grlist>
<gr type='ncmod' subtype='_' head='9' modifier='1' />
<gr type='dobj' subtype='' head='1' modifier='4' />
<gr type='det' subtype='' head='4' modifier='2' />
<gr type='ncsubj' subtype='' head='3' modifier='4' />
<gr type='ncmod' subtype='_' head='4' modifier='3' />
<gr type='passive' subtype='' head='3' />
<gr type='ncmod' subtype='_' head='4' modifier='5' />
<gr type='dobj' subtype='' head='5' modifier='6' />
<gr type='ncsubj' subtype='' head='9' modifier='8' />
<gr type='xcomp' subtype='_' head='9' modifier='10' />
<gr type='xcomp' subtype='_' head='10' modifier='14' />
<gr type='iobj' subtype='' head='10' modifier='11' />
<gr type='passive' subtype='' head='10' />
<gr type='dobj' subtype='' head='11' modifier='13' />
<gr type='ncmod' subtype='_' head='13' modifier='12' />
<gr type='xcomp' subtype='_' head='14' modifier='15' />
<gr type='dobj' subtype='' head='15' modifier='19' />
<gr type='det' subtype='' head='19' modifier='16' />
<gr type='ncmod' subtype='_' head='19' modifier='17' />
<gr type='conj' subtype='' head='19' modifier='18' />
<gr type='conj' subtype='' head='19' modifier='20' />
<gr type='ncmod' subtype='_' head='20' modifier='21' />
<gr type='dobj' subtype='' head='21' modifier='24' />
<gr type='det' subtype='' head='24' modifier='22' />
<gr type='ncmod' subtype='_' head='24' modifier='23' />
</grlist>
</s>
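
Output of this kind can be read back with any XML parser. The Python sketch below uses an excerpt of the sentence above (tokens 8 to 13 and the relations among them) and resolves each <gr> element to word forms. It assumes, judging by the example, that the head and modifier attributes refer to the id values of the <w> elements, and that relations such as passive have no modifier. The function name read_relations is hypothetical and not part of RASP4UIMA.

```python
import xml.etree.ElementTree as ET

# Excerpt of the sentence shown above: tokens 8 to 13 and their relations.
SAMPLE = """<s n="5">
<w id="8" pos="PPHS1">he</w>
<w id="9" pos="VVD">became</w>
<w id="10" pos="VVN">involved</w>
<w id="11" pos="IW">with</w>
<w id="12" pos="JJR">older</w>
<w id="13" pos="NN">people</w>
<grlist>
<gr type='ncsubj' subtype='' head='9' modifier='8' />
<gr type='xcomp' subtype='_' head='9' modifier='10' />
<gr type='iobj' subtype='' head='10' modifier='11' />
<gr type='passive' subtype='' head='10' />
<gr type='dobj' subtype='' head='11' modifier='13' />
<gr type='ncmod' subtype='_' head='13' modifier='12' />
</grlist>
</s>"""


def read_relations(sentence_xml):
    """Return (type, head_word, modifier_word) triples for one <s> element."""
    s = ET.fromstring(sentence_xml)
    # Map each token id to its surface form (assumption: ids are unique per sentence).
    words = {w.get("id"): (w.text or "") for w in s.iter("w")}
    triples = []
    for gr in s.iter("gr"):
        head = words.get(gr.get("head"))
        # Relations without a modifier attribute (e.g. passive) yield None.
        modifier_id = gr.get("modifier")
        modifier = words.get(modifier_id) if modifier_id else None
        triples.append((gr.get("type"), head, modifier))
    return triples


if __name__ == "__main__":
    for rel_type, head, modifier in read_relations(SAMPLE):
        print(rel_type, head, modifier)
```

For a full document, the same function could be applied to each <s> element in turn, since the token ids in the example restart within each sentence.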