Programs to Extract the Annotations from Raw Documents – Part 2

We have previously written – Programs to Extract the Annotations from Raw Documents – Part 1.

Here is the program to get the annotated values from the XML files and put them into the database.

AnnotationImplementationPosTaggerDB.java

 a> Logical stop words for GATE annotation processing –

 static String stop_words[] = {"few days", "Some people", "toes", "or run", "There", "weeks", "the pain", "Both the pain", "It", "the body", "the tongue", "The size", "dark"};

 b> Set the GATE Home

 Gate.setGateHome(new File("E:/To Take/gate/gate-7.0-build4195-ALL/"));

 c> Initialisation of GATE

 Gate.init();

 Database connection opening and related infrastructure code is not shown here, as the attached artifacts contain all the project files.

 d> Open file one by one

 File f = new File("E:/gate/symptom_gate/symptoms/" + fileNames[i]);

 e> Open the Gate Document –

 Document doc = (Document) Factory.createResource("gate.corpora.DocumentImpl",

Utils.featureMap(gate.Document.DOCUMENT_URL_PARAMETER_NAME, f.toURL(),

gate.Document.DOCUMENT_MIME_TYPE_PARAMETER_NAME, "text/xml"));

 f> Get the annotation set –

AnnotationSet annSet = doc.getAnnotations();

Set annotTypesRequired = new HashSet();

annotTypesRequired.add("CandidateTermAdj");

annotTypesRequired.add("CandidateTermPreposition");

annotTypesRequired.add("CandidateTermWh");

annotTypesRequired.add("NounChunk");

 g> Get Annotation Iterator

 Set<Annotation> peopleAndPlaces = new HashSet<Annotation>(annSet.get(annotTypesRequired));

 h> Get the Annotation Value

 Iterator it = peopleAndPlaces.iterator(); Annotation currAnnot;

String originalContent = ((DocumentContent)doc.getContent()).toString();

 i> Check whether the annotation value is a stop word or not

 currAnnot = (Annotation) it.next();

long insertPositionStart = currAnnot.getStartNode().getOffset().longValue();

long insertPositionEnd = currAnnot.getEndNode().getOffset().longValue();

//System.out.println( originalContent.substring((int)insertPositionStart, (int)insertPositionEnd));

String theWord = originalContent.substring((int)insertPositionStart, (int)insertPositionEnd);

for(int j=0;j<stop_words.length;j++)

{

if(theWord.equalsIgnoreCase(stop_words[j]))

{

stopword = true;

break;

}

}
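The linear scan above works, but for larger stop-word lists a set-based lookup is simpler and faster. A minimal sketch in plain Java, independent of GATE (the class name is ours, not part of the project):

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: case-insensitive stop-word lookup backed by a HashSet.
class StopwordFilter {
    private final Set<String> stopWords = new HashSet<>();

    StopwordFilter(String[] words) {
        // Normalise once at construction so each lookup is O(1).
        for (String w : words) {
            stopWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    boolean isStopword(String term) {
        return stopWords.contains(term.toLowerCase(Locale.ROOT));
    }
}
```

With this, the whole loop collapses to a single `if (filter.isStopword(theWord))` check.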

 j> Insert the whole row as the symptom description reference in the database –

 if(!stopword)

 {

 String annotatedString = originalContent.substring((int)insertPositionStart, (int)insertPositionEnd);

out.write(annotatedString);

 Statement stmt1=null;

String insertannotatedString = StringEscapeUtils.escapeSql(annotatedString);

String insertsymptomDesc = StringEscapeUtils.escapeSql(symptomDesc);

String inserthyperlinkText = StringEscapeUtils.escapeSql(hyperlinkText);

String insertDiseaseText = StringEscapeUtils.escapeSql(hyperlinkText);

String insertDiseaseName = StringEscapeUtils.escapeSql(diseaseName);

 String insertSql = "Insert into search_symptom_gate(symptom_text,disease_code,disease_url,disease_name) values ('" + insertannotatedString + "','" + insertsymptomDesc + "','" + inserthyperlinkText + "','" + insertDiseaseName + "')";
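String concatenation plus escapeSql works, but a java.sql.PreparedStatement sidesteps manual escaping entirely, since the JDBC driver handles quoting. A sketch over the same table layout (the class and method names here are ours, not taken from the project files):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical alternative: parameterised insert instead of string concatenation.
class SymptomInsert {
    // The ? placeholders are filled by the driver; no escapeSql needed.
    static final String INSERT_SQL =
        "insert into search_symptom_gate(symptom_text,disease_code,disease_url,disease_name) "
        + "values (?,?,?,?)";

    static void insertRow(Connection conn, String annotatedString,
            String symptomDesc, String hyperlinkText, String diseaseName) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, annotatedString);
            ps.setString(2, symptomDesc);
            ps.setString(3, hyperlinkText);
            ps.setString(4, diseaseName);
            ps.executeUpdate();
        }
    }
}
```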

 Here are the attached Java Files.

Enter your email address:

Delivered by FeedBurner

Programs to Extract the Annotations from Raw Documents – Part 1

In our work, after writing the JAPE files and populating the gazetteer files (*.lst),

we used two program files to write all the annotations to the database, from which we actually served the web application (Symptom Search).

 We will describe those two files line by line, and attach the other project files with a broad description of each, across the two parts of this post.

Below is the First Program

 1>BatchProcessAppPosTagger.java

This program is used to run over all the symptom description files and process the annotations.

a> Open the Application file –

gappFile = new File("E:/gate/symptom_gate/PosTaggerApplication.gapp");

b> Set Home for GATE

Gate.setGateHome(new File("E:/To Take/gate/gate-7.0-build4195-ALL/"));

c> Initialisation of Gate

Gate.init();

d> load the saved application

CorpusController application = (CorpusController)PersistenceManager.loadObjectFromFile(gappFile);

e> Create a Corpus to use. We recycle the same Corpus object for each iteration. The string parameter to newCorpus() is simply the GATE-internal name to use for the corpus; it has no particular significance.

Corpus corpus = Factory.newCorpus("Symptom Corpus");

application.setCorpus(corpus);

f> Open folder of the Symptom Description files and process the files one by one

File f = new File("E:/gate/symptom_gate/symptoms");

g> Open a document in gate recognised format –

Document doc = Factory.newDocument(docFile.toURL(), encoding);

h> put the document in the corpus

corpus.add(doc);

i> run the application

application.execute();

j> remove the document from the corpus again

corpus.clear();

k> If we want to write out just specific annotation types, we must extract the annotations into a Set

if(annotTypesToWrite != null) {

l> Create a temporary Set to hold the annotations we wish to write out

Set annotationsToWrite = new HashSet();

m> Only extract annotations from the default (unnamed) AnnotationSet in this example

AnnotationSet defaultAnnots = doc.getAnnotations();

n> Extract all the annotations of each requested type and add them to the temporary set

AnnotationSet annotsOfThisType =

defaultAnnots.get((String)annotTypesIt.next());

if(annotsOfThisType != null) {

annotationsToWrite.addAll(annotsOfThisType);

o> create the XML string using these annotations

docXMLString = doc.toXml(annotationsToWrite);

p> Release the document, as it is no longer needed

Factory.deleteResource(doc);

q> Write the xml file with all the annotations –

// output the XML to <inputFile>.out.xml

String outputFileName = docFile.getName() + ".out.xml";

File outputFile = new File(docFile.getParentFile(), outputFileName);

// Write output files using the same encoding as the original

FileOutputStream fos = new FileOutputStream(outputFile);

BufferedOutputStream bos = new BufferedOutputStream(fos);

OutputStreamWriter out;

if(encoding == null) {

out = new OutputStreamWriter(bos);

}

else {

out = new OutputStreamWriter(bos, encoding);

}

out.write(docXMLString);
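The stream setup above (FileOutputStream, BufferedOutputStream, OutputStreamWriter) can be condensed with java.nio and try-with-resources, which also guarantees the writer is flushed and closed even on error. A sketch, assuming Java 7+ (the class name is ours):

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical condensed version of the XML-writing step above.
class XmlWriterSketch {
    // Write the annotated XML string out, defaulting to UTF-8 when no encoding is given.
    static void writeXml(File outputFile, String docXMLString, String encoding)
            throws IOException {
        Charset cs = (encoding == null) ? StandardCharsets.UTF_8 : Charset.forName(encoding);
        // try-with-resources closes and flushes the writer automatically
        try (BufferedWriter out = Files.newBufferedWriter(outputFile.toPath(), cs)) {
            out.write(docXMLString);
        }
    }
}
```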

Program formats are taken from the GATE site and then modified as per functional requirements.


Gate Ontology Update

Instructions for creating an ontology in GATE Developer –

(1) CREOLE plugins to be configured –

Tools
Ontology
Ontology_Based_Gazetteer
Ontology_Tools
ANNIE

Place the attached disease_symptom.owl in "<<Gate Installation>>\plugins\Ontology_Tools\resources"

Click on Language Resource and

add OWLIM Ontology –

Follow picture – OI-01.jpg

Double Click the DiseaseSymptom Ontology and see the OI-02.jpg

How to create classes there –

Follow the P-01, P-02 and P-03 jpg files and also the Gate Documentation.

Now Load the PRs in the following sequence

Document Reset PR
ANNIE English Tokeniser
ANNIE POS Tagger
GATE Morphological Analyser
ANNIE Sentence Splitter

as Processing Resources.

Now create an OntoRoot Gazetteer from the PR list.

Give the values as

Ontology: select previously created DiseaseSymptom;
Tokeniser: select previously created Tokeniser;
POS Tagger: select previously created POS Tagger;
Morpher: select previously created Morpher.

Now create Flexible Gazetteer.

Select previously created OntoRoot Gazetteer for gazetteerInst.

For another parameter, inputFeatureNames, click on the button on the right and,
when prompted with a window, add 'Token.root' in the provided textbox, then click the Add button.
Click OK, give a name to the new PR (optional) and then click OK.

Create an application called symptom pipeline –

maintain the sequence with – OI-03.jpg

Run the application.

You can see the results in OI-04.jpg

Now, if we create an OntoGazetteer, we have to define the mapping in mapping.def.

A sample mapping is done for one class, and the resulting "mapping.def" is attached.

Replace this file in <<gate installation folder>>\plugins\ANNIE\resources\gazetteer.

Then we can create an OntoGazetteer.
Our steps will be

(1) Creating the Ontology In Gate Developer.

(2) Define the mapping of the ontology classes in mapping.def with the annotation file list (i.e. linking files like diseasename.lst with the class)

A sample ontology file is attached (human_disease.owl).

All the pictures and owl files are attached here and here.


A medical application made with GATE

Work which we have done with GATE –

Steps which we have taken –

1> Extracted all the symptom information from our disease database application, then parsed and extracted clean text, without HTML tags, from it.

2> We have made a sample gazetteer from disease names.

3> We have made a sample gazetteer with some words as we thought to be useful from first ten documents.

4> First we parsed the documents with disease names as gazetteer rules. (First JAPE Rule)

5> We took the symptoms gazetteer and parsed with it in a JAPE rule.

6> Then we annotated the symptom matches plus the 2-3 words after them in a JAPE rule. (Second JAPE Rule)

7> We took the symptom annotations and the 2-3 words before them in a JAPE rule. (Third JAPE Rule)

8> We took the sentences containing the matches from point 6 in a JAPE rule. (Fourth JAPE Rule)

9> We took the sentences containing the matches from point 7 in a JAPE rule. (Fifth JAPE Rule)

10> We then made guidelines for the JAPE rules, refined the "symptom" gazetteer from a sample of 20-30 files, and ran the GATE program to extract terms from over 300 documents.

11> From these extracted annotations, 70-80% of the text terms were correct. These we could put in a Lucene store with links to the actual diseases. Some terms, which did not carry much meaning, we had to delete manually. Then we linked these text terms with the diseases.

We used semantic annotations here; the Tokeniser, Sentence Splitter, NE Transducer and Orthomatcher are all parts of the semantic annotation library we used to do our work.


Text Analysis with GATE – Part 7

JAPE: Regular Expressions over Annotations

JAPE is a Java Annotation Patterns Engine. JAPE provides finite state transduction over annotations based on regular expressions.
JAPE is a version of Common Pattern Specification Language.

JAPE lets us recognise regular expressions in annotations on documents. A regular language can only describe sets of strings, not graphs, and GATE's model of annotations is based on graphs. Regular expressions are normally applied to character strings, a simple linear sequence of items, but here they are applied to a much more complex data structure. The result is that in certain cases the matching process is non-deterministic (i.e. the results are dependent on random factors like the addresses at which data is stored in the virtual machine).

A JAPE grammar consists of a set of phases, each of which consists of a set of pattern/action rules. The phases run sequentially and constitute a cascade of finite state transducers over annotations. The left-hand-side (LHS) of the rules consist of an annotation pattern description. The right-hand-side (RHS) consists of annotation manipulation statements. Annotations matched on the LHS of a rule may be referred to on the RHS by means of labels that are attached to pattern elements.
An Example –

Phase: Jobtitle
Input: Lookup
Options: control = appelt debug = true
Rule: Jobtitle1
(
{Lookup.majorType == jobtitle}
(
{Lookup.majorType == jobtitle}
)?
)
:jobtitle
-->

:jobtitle.JobTitle = {rule = "JobTitle1"}

The LHS is the part preceding the '-->' and the RHS is the part following it. The LHS specifies a pattern to be matched to the annotated GATE document, whereas the RHS specifies what is to be done to the matched text. In this example, we have a rule entitled 'Jobtitle1', which will match text annotated with a 'Lookup' annotation with a 'majorType' feature of 'jobtitle', followed optionally by further text annotated as a 'Lookup' with 'majorType' of 'jobtitle'. Once this rule has matched a sequence of text, the entire sequence is allocated a label by the rule; in this case, the label is 'jobtitle'. On the RHS, we refer to this span of text using the label given in the LHS, 'jobtitle'. We say that this text is to be given an annotation of type 'JobTitle' and a 'rule' feature set to 'JobTitle1'.
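By analogy, a rule over our own symptom gazetteer lookups might look like the following (illustrative only; the phase name, annotation type and majorType value are assumptions, not taken from the project files):

```
Phase: Symptom
Input: Lookup
Options: control = appelt

Rule: Symptom1
(
{Lookup.majorType == symptom}
):symptom
-->
:symptom.Symptom = {rule = "Symptom1"}
```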

 All these contents are collected from the General Architecture for Text Engineering documentation user guide.

We have only tried to extract information from the above document to understand the software perspective.


Creating and Running Application file in GATE

Running GATE based on a gazetteer –

Working logic to run the GATE Application was –

1> Take the .lst files in the gazetteer

2> Map the lists.def file with the updated gazetteers.

3> Then put the PRs for processing in the GATE application file.

4> Write JAPE rules for taking the gazetteer values and adding some words to them.

5> Write out the sentence containing the word phrase matched with the gazetteer.

In GATE we have done the application creation with the following sequence maintained in GATE Developer-

1> Document Reset PR
2> ANNIE English Tokeniser
3> ANNIE Sentence Splitter
4> ANNIE POS Tagger
5> ANNIE NE Transducer
6> ANNIE Gazetteer
7> ANNIE OrthoMatcher
8> JAPE Transducers (i.e. the JAPE programming files)


Text Analysis with GATE – Part 6

Components of GATE

GATE Documents

Documents are modelled as content, annotations and features. The content of a document can be in any form in GATE.
The features are <attribute, value> pairs stored in a Feature Map. Attributes are String values while the values can
be any Java object. The annotations are grouped in sets. A document has a default annotation set and any number of named annotation sets.

Annotation Sets

A GATE document can have one or more annotation layers and as many named ones as necessary.

An annotation set holds a number of annotations and maintains a series of indices in order to provide fast access to the contained annotations. GATE annotation sets are defined by the gate.AnnotationSet interface and a default implementation is
provided – gate.annotation.AnnotationSetImpl.

Annotations

An annotation is a form of meta-data attached to a particular section of document content. The connection between the annotation and the content it refers to is made by means of two pointers that represent the start and end locations of the covered content.
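The two-pointer idea can be sketched in plain Java (illustrative only; in GATE the offsets live on Node objects inside an Annotation, as the extraction code in Part 2 shows):

```java
// Toy illustration of offset-based annotation: the annotation stores only
// start/end positions; the covered text is recovered from the document content.
class OffsetSpan {
    final long start;
    final long end;

    OffsetSpan(long start, long end) {
        this.start = start;
        this.end = end;
    }

    // Recover the text the annotation points at, as done with getOffset() in GATE.
    String coveredText(String documentContent) {
        return documentContent.substring((int) start, (int) end);
    }
}
```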

GATE Corpora

A corpus in GATE is a Java List (i.e. an implementation of java.util.List) of documents. GATE corpora are defined by the gate.Corpus interface and the following implementations are available – gate.corpora.CorpusImpl, used for transient corpora, and
gate.corpora.SerialCorpusImpl, used for persistent corpora that are stored in a serial datastore (i.e. as a directory in a file system).

Processing Resources

Processing Resources (PRs) represent entities that are primarily algorithmic, such as parsers, generators or ngram modellers.
They are created using the GATE Factory in a manner similar to the Language Resources. Besides the creation-time parameters, they also have a set of run-time parameters that are set by the system just before executing them. Analysers are a particular type of processing resource in the sense that they always have a document and a corpus among their run-time parameters.

Controllers

Controllers are used to create GATE applications. A Controller handles a set of Processing Resources and can execute them following a particular strategy. GATE provides a series of serial controllers.

 All these contents are collected from the General Architecture for Text Engineering documentation user guide.

We have only tried to extract information from the above document to understand the software perspective.


Text Analysis with GATE – Part 5

GATE Embedded

Integrating GATE-based language processing in applications using GATE Embedded (the GATE API) :

add $GATE_HOME/bin/gate.jar and the JAR files in $GATE_HOME/lib to the Java CLASSPATH ($GATE_HOME is the GATE root directory, which is stored in an environment variable in the OS)

Initialise GATE with gate.Gate.init();

We have worked with GATE in following areas (We will explain these in our later posts) –

Language Resources (LRs): entities that hold unstructured raw data.

Processing Resources (PRs): entities that process data.

Visualisation Resources: we have not used these in our application.

These resources are collectively named CREOLE resources.

All CREOLE resources have some associated meta-data in a special XML file named creole.xml.
The most important role of that meta-data is to specify the set of parameters and default values for the processing resource.
The valid parameters for resource are described in the resource’s section of its creole.xml file or in Java annotations on the resource class.

All resource types have creation-time parameters for the resource initialisation phase. Processing Resources have run-time parameters that get used during execution.

Controllers are used to define GATE applications and have the role of controlling the processing of the documents in Corpora.

CREOLE resources are Java Beans. A resource is created with a default constructor, its parameters are set via the bean setters, and then an init() method is called. The Factory class takes care of all this and also takes care of restoring data from DataStores.
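The bean lifecycle described above can be sketched as follows (a toy class of ours, not a real CREOLE resource; a real one would implement GATE's Resource interface and be built by the Factory):

```java
// Sketch of the bean convention CREOLE resources follow: default constructor,
// setters for creation-time parameters, then an init() call.
class SimpleResource {
    private String encoding;
    private boolean initialised = false;

    SimpleResource() { }                       // default constructor

    void setEncoding(String encoding) {        // bean-style parameter setter
        this.encoding = encoding;
    }

    String getEncoding() {
        return encoding;
    }

    SimpleResource init() {                    // called after all parameters are set
        if (encoding == null) {
            encoding = "UTF-8";                // fall back to a default value
        }
        initialised = true;
        return this;
    }

    boolean isInitialised() {
        return initialised;
    }
}
```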

 All these contents are collected from the General Architecture for Text Engineering documentation user guide.

We have only tried to extract information from the above document to understand the software perspective.


Text Analysis with GATE – Part 4

GATE comes with various built-in components:

  • Language Resources modelling Documents and Corpora, and various types of Annotation Schema.
  • Processing Resources that are part of the ANNIE system.
  • Gazetteers.
  • Ontologies.
  • Machine Learning resources.
  • Parsers and taggers.
  • Other miscellaneous resources.

ANNIE: a Nearly-New Information Extraction System

ANNIE components are

 1 Document Reset PR

 The document reset resource enables the document to be reset to its original state, by removing all the annotation sets and their contents, apart from the original markups.

 2 Tokeniser PR

 The tokeniser splits the text into very simple tokens such as numbers, punctuation and words of different types. For example, we distinguish between words in uppercase and lowercase, and between certain types of punctuation. 

For Tokeniser rules, Token Types, English Tokenisers – Please refer to 6.2.1, 6.2.2 and 6.2.3 of TAO.pdf.

 3 Gazetteer

The role of the gazetteer is to identify entity names in the text based on lists. The gazetteer lists used are plain text files, with one entry per line. Each list represents a set of names, such as names of cities, organisations, days of the week, etc.

An index file (lists.def) is used to access these lists; for each list, a major type is specified and, optionally, a minor type. It is also possible to include a language in the same way (fourth column), where lists for different languages are used, though ANNIE is only concerned with monolingual recognition. By default, the Gazetteer PR creates a Lookup annotation for every gazetteer entry it finds in the text. One can also specify an annotation type (fifth column) specific to an individual list. Each gazetteer list should reside in the same directory as the index file.
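For illustration, a lists.def index in this format might look like the following (the entries are examples of ours, in the colon-separated form described above: list file name, major type, then an optional minor type):

```
diseasename.lst:disease
symptoms.lst:symptom
day.lst:date:day
```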

4 Sentence Splitter

The sentence splitter is a cascade of finite-state transducers which segments the text into sentences. This module is required for the tagger. The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.

Each sentence is annotated with the type ‘Sentence’. Each sentence break (such as a full stop) is also given a ‘Split’ annotation.

The sentence splitter is domain and application-independent.

 There is an alternative ruleset for the Sentence Splitter which treats newlines and carriage returns differently. In general, this version should be used when a new line on the page indicates a new sentence.

 5 Part of Speech Tagger

 The tagger produces a part-of-speech tag as an annotation on each word or symbol. The tagger uses a default lexicon and ruleset (the result of training on a large corpus taken from the Wall Street Journal). 

 6 Semantic Tagger

 Please see Jape Rule Section in GATE Documentation.

 All these contents are collected from the General Architecture for Text Engineering documentation user guide.

We have only tried to extract information from the above document to understand the software perspective.


Text Analysis with GATE – Part 3

The basic business of GATE is annotating documents. Core concepts are:

  • the documents to be annotated
  • corpora comprising sets of documents, grouping documents for the purpose of running uniform processes across them
  • annotations that are created on documents, with annotation types such as 'Name' or 'Date'
  • annotation sets comprising groups of annotations
  • processing resources that manipulate and create annotations on documents
  • applications, comprising sequences of processing resources, that can be applied to a document or corpus

Please refer to Chapter 3 of GATE Documentation to get comprehensive ideas on the above concepts.

GATE Framework

The framework performs these functions:

  • component initialising
  • management and visualisation of data structures for common information types
  • generalised data storage and process execution

A set of text analysis components plus the framework is a deployment unit which can be embedded in another application.

All GATE resources are Java Beans, which are simply Java classes that obey certain interface conventions.

We will show creation of GATE Resources as specified in GATE Documentation in later posts.

All these contents are collected from the General Architecture for Text Engineering documentation user guide.

We have only tried to extract information from the above document to understand the software perspective.
