TIES v2.x

User's Guide

Introduction

TIES (Trainable Information Extraction System) is an ML-based Information Extraction (IE) system currently under development at ITC-irst within the Dot.Kom project. TIES automatically learns rules from a corpus previously annotated with a predefined set of XML tags. The XML tags identify instances of entities from a set of relevant elements defined by the user.

TIES is developed in an object-oriented fashion in Java. The application packages supply a set of interfaces and classes for training, testing and running an extraction task in both traditional (natural text) and wrapper (machine-generated or rigidly structured text) domains. TIES v1.x was essentially a reimplementation of the Boosted Wrapper Induction (BWI) algorithm devised by Dayne Freitag and Nicholas Kushmerick [1]. TIES v2.x incorporates the BWI algorithm as one of the ML algorithms that can be integrated within the general TIES architecture.

Boosting is a technique for improving the performance of a simple machine learning algorithm (called a weak learner) by repeatedly applying it to the training set with different example weightings. In BWI, an algorithm that learns simple, low-coverage, wrapper-like extraction patterns is applied to IE problems using boosting. The TIES system architecture is strongly based on boosting and wrapper induction techniques, but it is flexible enough to let programmers develop their own weak learner implementations and add new validation strategies.

The default implementation exploits only simple orthographic features, which map an individual token to an arbitrary set of wildcards (e.g. capitalized, lower-case, punctuation). More complex features (e.g. morpho-syntactic ones) can be added to improve performance by using a customized preprocessor; in this case a different feature extraction method has to be supplied. The system comes with default implementations of all the interfaces defined, so the application can also be used without programming experience.

In the remaining sections of the tutorial, you are provided with step-by-step instructions for installing, configuring and performing common tasks using TIES. You will benefit most from this tutorial if you complete these sections in order. You should perform these tasks in the actual TIES environment--not a simulation.

System Requirements

The TIES software is available on all platforms supporting Java 2.

On all systems you should have at least 5 megabytes of free disk space before installing the software. If you also install the separate test corpus download bundle, you need an additional 10 megabytes of free disk space.

Dependencies

TIES uses elements of the Java 2 API such as collections, and therefore building it requires the Java 2 Standard Edition SDK (Software Development Kit). To run TIES, the Java 2 Standard Edition Runtime Environment (JRE) is required (or you can use the SDK, of course).

TIES is also dependent upon a few packages for general functionality. They are included in the build/lib directory for convenience, but the default build target does not include them. If you use the default build target, you must add the dependencies to your classpath.

Installation

  1. Make sure you have installed Sun's Java 2 Environment. The full version is available at http://java.sun.com/j2se. If you are limited by disk space or bandwidth, install the smaller, run-time only version at http://java.sun.com/j2se.

  2. Download the jar file of the installer here. Copy the file into the directory where you want to install the program and then run:

    jar -xvf ties-install-2.x.jar

Now you are ready to run TIES. The latest User Guide is available here and the Programmer's Guide is available here.

Developers can download source code here. The source archives are all in tar format. Generally, you should use the highest-numbered non-beta release.

Getting started

Suppose you have some textual data and you want to extract information from it. Documents are usually available as plain text, or in HTML or XML format. However, the core ML engine of TIES expects them to be in the TIES input format (TIESIF, introduced in the following subsection), because token features have to be marked explicitly. The conversion to TIESIF can easily be done with one of the Tokenizers provided. Note that, although the term Tokenizer suggests tokenization only, in TIES Tokenizers usually also perform more complex preprocessing tasks, such as feature extraction, as described below. If more complex features are needed, you must either implement a more powerful tokenizer or employ an external feature extractor.

TIES input format

The current way of representing datasets is called TIESIF format. Figure 1 shows a fragment of a TIES input file for the standard "CMU seminar announcements" extraction task.

<?xml version="1.0"?>
<corpus>
  <text path="./input/seminar" name="cmu.andrew.org.cfa.cfa-today-19:0">
    <body>

      ...
      
      <token id="146" type="word" start="438" len="10" alpha_token="true" capitalized_token="true">University</token> 
      <token id="148" type="word" start="449" len="2" alpha_token="true" lower_case_token="true">of</token> 
      <token id="150" type="word" start="452" len="10" alpha_token="true" capitalized_token="true">Pittsburgh</token> 
      <token id="151" type="nl" start="462" len="1" nl_token="true">\n</token> 
      <token id="153" type="word" start="464" len="4" alpha_token="true" capitalized_token="true">Name</token> 
      <token id="154" type="punct" start="468" len="1" punct_token="true">:</token> 
      
      <token id="156" type="tag" start="470" len="9" open_tag="true">speaker</token> 
      
      <token id="157" type="abbrev" start="479" len="3" abbr_token="true">Dr.</token> 
      <token id="159" type="word" start="483" len="7" alpha_token="true" capitalized_token="true">Jeffrey</token> 
      <token id="161" type="abbrev" start="491" len="2" abbr_token="true">D.</token> 
      <token id="163" type="word" start="494" len="6" alpha_token="true" capitalized_token="true">Hermes</token> 
      
      <token id="164" type="tag" start="500" len="10" close_tag="true">/speaker</token> 
      
      <token id="165" type="nl" start="510" len="1" nl_token="true">\n</token> 
      <token id="167" type="word" start="512" len="11" alpha_token="true" capitalized_token="true">Affiliation</token> 
      <token id="168" type="punct" start="523" len="1" punct_token="true">:</token> 
      <token id="170" type="word" start="525" len="10" alpha_token="true" capitalized_token="true">Department</token> 
      
      ...
      
    </body>
  </text>
  
  ...
  
  <text>
  	...
  </text>
  
  ...
</corpus>
Figure 1 A fragment of a tokenized dataset.

The input format is XML. This section describes each of the elements currently employed.

A collection of example texts is regarded as a corpus. The texts are marked with the <text> tag and may contain front matter, a text body, and back matter. The text body is tagged <body>. The overall structure of a dataset is thus defined by the elements shown in Figure 1: <corpus>, <text>, <body> and <token>.

Labeling examples

Examples are labeled using a special token whose type attribute is equal to tag. Each example is wrapped by a pair of such tokens, marked with the attributes open_tag and close_tag set to true, respectively. The content of each of the two tokens is the name of the label: in the example shown in Figure 1, it is speaker for the token with id equal to 156 and /speaker for the token with id equal to 164.

Note that some algorithms use labeled examples as positive examples and consider all the other tokens as negative examples, but this setting can be changed and negative examples can be explicitly labeled (see the match-all parameter in the FastWeakLearner class and DefaultFieldExtraction class). This feature can be exploited, for example, whenever a preprocessor that limits the candidate examples for the algorithm exists.

Feature extraction

The weak learner is a symbolic rule-learning algorithm that searches for extraction patterns based on token features. Examples are defined in terms of features, which are functions mapping examples to discrete values. That is, an example is transformed into a collection of features, thereby producing an N-dimensional vector. Each word (token) is treated as an example by the weak learner. For instance, a possible feature is capitalized, a function that maps a word to the set {true, false}. Given such a feature, we can express simple propositions about a specific token, for example that the token is capitalized.

All features used by the algorithm must be declared explicitly in the input format file (see domain specific attributes). The only implicit features are the token itself and any_token, a feature that matches all tokens. The default implementation of the tokenizer encapsulates a basic feature extractor that exploits only simple orthographic features, such as the alpha_token, capitalized_token, lower_case_token, punct_token and nl_token attributes shown in Figure 1.

The feature extraction method generates only the features judged to be active in the example. All other features are judged to be inactive, or false.

User-defined features (e.g. morpho-syntactic features), if available, can be provided to the algorithm, but this requires a tokenizer reimplementation or an external feature extractor. Each feature must be declared as a name-value pair. Note that a proposition about a particular token may be expressed in several ways; a token's part of speech, for instance, can be declared in more than one form.

All such declarations are valid, but a uniform representation should be used within the same dataset.
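
As an illustration only, the fragment below sketches two ways a part-of-speech proposition might be written as a name-value pair on a token from Figure 1. The attribute names pos and noun are hypothetical and are not produced by the default tokenizer. Whichever form is chosen, the feature name would presumably also have to appear in the features list of the corpus loader.

```xml
<!-- Hypothetical form 1: a single feature "pos" whose value names the category -->
<token id="163" type="word" start="494" len="6" alpha_token="true" capitalized_token="true" pos="noun">Hermes</token>

<!-- Hypothetical form 2: one boolean feature per category -->
<token id="163" type="word" start="494" len="6" alpha_token="true" capitalized_token="true" noun="true">Hermes</token>
```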

Running TIES

TIES can perform three different tasks: learning, testing and extraction.

Since TIES has no GUI, all configuration parameters must be set explicitly, and there is more than one way to do so.

The following section describes in detail the XML configuration, which is strongly recommended. For a complete API description see the TIES API documentation; a Developer's Guide is planned for the coming months.

Configuration File

TIES is implemented using a set of modules. Each module has a number of settable properties and implements one or more interfaces, providing a piece of functionality.

The modules can be configured and assembled in several ways, but the most flexible mechanism uses XML files. Each module is described by an XML element, with subelements and attributes used to set module properties. By specifying which modules and their attributes to use, you have a lot of flexibility in controlling the features of your instance of TIES.

ties-config is the main element in the configuration file. It has multiple children describing the TIES modules: validation strategy, extraction strategy, weak learner, boosting, tokenizer and corpus loader. The directives controlling input and output also go into the configuration file, as does the cache configuration (which is essential to optimize system performance and reduce memory usage). Figure 2 shows an example of a configuration file for the standard "CMU seminar announcements" extraction task. The following subsections describe each part of the configuration file.

<!-- Configuration file for the standard "CMU seminar announcements" extraction task -->

<ties-config>

  <validation-strategy>
    <validation-class>org.itc.irst.tcc.ties.validation.NFoldCrossValidation</validation-class>
    <init-param>
      <param-name>n</param-name>
      <param-value>5</param-value>
    </init-param>
    <init-param>
      <param-name>hypothesis-file</param-name>
      <param-value>./bwi/sa/out.xml</param-value>
    </init-param>
    <init-param>
      <param-name>eval-file</param-name>
      <param-value>./bwi/sa/bwi-eval.csv</param-value>
    </init-param>
  </validation-strategy>

  <bwi>
    <extraction-strategy>
      <extraction-class>org.itc.irst.tcc.ties.bwi.DefaultFieldExtraction</extraction-class> 
      <init-param>
        <param-name>tau</param-name> 
        <param-value>0</param-value> 
      </init-param>
      <init-param>
        <param-name>match-all</param-name> 
        <param-value>true</param-value> 
      </init-param>
      <init-param>
        <param-name>close-all</param-name> 
        <param-value>false</param-value> 
      </init-param>
    </extraction-strategy>

    <weak-learner>
      <learner-class>org.itc.irst.tcc.ties.bwi.FastWeakLearner</learner-class>
      <init-param>
        <param-name>lookahead</param-name>
        <param-value>3</param-value>
      </init-param>
      <init-param>
        <param-name>cache-size</param-name>
        <param-value>5</param-value>
      </init-param>
      <init-param>
        <param-name>match-all</param-name> 
        <param-value>true</param-value> 
      </init-param>
    </weak-learner>

    <boosting>
      <boosting-class>org.itc.irst.tcc.dot.kom.bwi.boosted.BWI</boosting-class>
      <init-param>
        <param-name>iterations</param-name>
        <param-value>100</param-value>
      </init-param>
      <init-param>
        <param-name>labels</param-name>
        <param-value>stime,location</param-value>
      </init-param>
    </boosting>
  </bwi>

  <tokenizer>
    <tokenizer-class>org.itc.irst.tcc.util.tokenizer.EnglishTokenizer</tokenizer-class>
    <init-param>
      <param-name>input</param-name>
      <param-value>./input/seminar</param-value>
    </init-param>
    <init-param>
      <param-name>output</param-name>
      <param-value>./input/seminar.xml</param-value>
    </init-param>
    <init-param>
      <param-name>nl</param-name>
      <param-value>true</param-value>
    </init-param>
    <init-param>
      <param-name>sp</param-name>
      <param-value>false</param-value>
    </init-param>
  </tokenizer>
  
  <corpus-loader>
    <loader-class>org.itc.irst.tcc.ties.data.DefaultCorpusLoader</loader-class>
    <init-param>
      <param-name>input</param-name>
      <param-value>./input/input.xml</param-value>
    </init-param>
    <init-param>
      <param-name>atags</param-name>
      <param-value>stime,location,etime,speaker</param-value>
    </init-param>
    <init-param>
      <param-name>itags</param-name>
      <param-value>paragraph,sentence</param-value>
    </init-param>
    <init-param>
      <param-name>features</param-name>
      <param-value>alpha_token,lower_token,upper_token,cap_token,nl_token,
                   punc_token,schar_token</param-value>
    </init-param>
  </corpus-loader>
</ties-config>  
Figure 2 An example of a configuration file.

Comments in Configuration File

Lines enclosed within <!-- and --> are ignored, as are empty lines.

Validation Strategy

The validation strategy module is responsible for running an experiment that consists of training and testing an algorithm over a specified dataset. The following describes how to choose and configure a validation strategy.

The ties-config.validation-strategy.validation-class field is compulsory and specifies the validation strategy implementation. Its value is the name of the Java class that implements the validation strategy interface. For example, in the file shown in Figure 2, N-Fold Cross-Validation has been chosen by setting the field value to org.itc.irst.tcc.ties.validation.NFoldCrossValidation. For a detailed description of the cross-validation strategies see [2] or the TIES API documentation.

The ties-config.validation-strategy.init-param fields specify initialization parameters for the chosen validation strategy. The following subsections describe each validation strategy available.

N-Fold Cross-Validation

This strategy divides the entire data set into N subsets of (approximately) equal size. It then trains the system N times, each time leaving one of the subsets out of training and using that subset as the test set. Table 1.a shows the parameters to set for NFoldCrossValidation. A popular choice is to set N to 10; TenFoldCrossValidation has N fixed to 10 (Table 1.b). If N equals the sample size, the method is called Leave-One-Out cross-validation. Table 1.c shows the parameters to set for LeaveOneOut.

N-Fold Cross-Validation

org.itc.irst.tcc.ties.validation.NFoldCrossValidation

Parameter        Description
n                number of folds or partitions of the dataset
hypothesis-file  file where to output the hypotheses learned
eval-file        file where to output the evaluation measures
Table 1.a N-Fold Cross-Validation parameters description.

10-Fold Cross-Validation

org.itc.irst.tcc.ties.validation.TenFoldCrossValidation

Parameter        Description
hypothesis-file  file where to output the hypotheses learned
eval-file        file where to output the evaluation measures
Table 1.b 10-Fold Cross-Validation parameters description.

Leave One Out

org.itc.irst.tcc.ties.validation.LeaveOneOut

Parameter        Description
hypothesis-file  file where to output the hypotheses learned
eval-file        file where to output the evaluation measures
Table 1.c Leave-One-Out parameters description.
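
As a sketch of how the tables above translate into a configuration, the fragment below selects TenFoldCrossValidation using the init-param layout of Figure 2. The two output paths are simply reused from Figure 2 for illustration. Since Table 1.c lists the same parameters, the same fragment should work for LeaveOneOut after changing the class name.

```xml
<validation-strategy>
  <validation-class>org.itc.irst.tcc.ties.validation.TenFoldCrossValidation</validation-class>
  <!-- No "n" parameter: the number of folds is fixed to 10 by this class -->
  <init-param>
    <param-name>hypothesis-file</param-name>
    <param-value>./bwi/sa/out.xml</param-value>
  </init-param>
  <init-param>
    <param-name>eval-file</param-name>
    <param-value>./bwi/sa/bwi-eval.csv</param-value>
  </init-param>
</validation-strategy>
```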

Random Split Validation

This strategy differs from N-fold cross-validation only in that the folds are chosen randomly. Table 2 shows the initial parameters to set for RandomSplitValidation.

Random Split Validation

org.itc.irst.tcc.ties.validation.RandomSplitValidation

Parameter         Description
n                 number of runs of the experiment
percentage-split  percentage of examples to use for training
hypothesis-file   file where to output the hypotheses learned
eval-file         file where to output the evaluation measures
Table 2 Random Split Validation parameters description.
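
A minimal configuration sketch for this strategy follows, using the class and parameter names from Table 2. The values are assumptions chosen for illustration: 10 runs and an 80 percent training split. The guide does not state how percentage-split is written (for example 80 versus 0.8), so check the TIES API documentation before relying on it.

```xml
<validation-strategy>
  <validation-class>org.itc.irst.tcc.ties.validation.RandomSplitValidation</validation-class>
  <init-param>
    <param-name>n</param-name>
    <param-value>10</param-value>
  </init-param>
  <init-param>
    <param-name>percentage-split</param-name>
    <!-- Assumed notation for "80% of the examples are used for training" -->
    <param-value>80</param-value>
  </init-param>
  <init-param>
    <param-name>hypothesis-file</param-name>
    <param-value>./bwi/sa/out.xml</param-value>
  </init-param>
  <init-param>
    <param-name>eval-file</param-name>
    <param-value>./bwi/sa/bwi-eval.csv</param-value>
  </init-param>
</validation-strategy>
```

The output paths are taken from Figure 2 and can be replaced with any writable location.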

Single Split Validation

In this method only a single subset (the testing set) is used to estimate the generalization error, instead of N different subsets; i.e., there is no "crossing". Table 3 shows the initial parameters to set for this validation strategy. The peculiarity of this strategy is that you can choose which texts are used for training and which for testing. This is done by listing the texts (one per line) in the files named by training-file and testing-file, respectively.

Single Split Validation

org.itc.irst.tcc.ties.validation.SingleSplitValidation

Parameter        Description
training-file    training set file
testing-file     testing set file
hypothesis-file  file where to output the wrapper learned
eval-file        file where to output the evaluation measures
Table 3 Single Split Validation parameters description.
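
The following fragment sketches a Single Split Validation setup with the parameters of Table 3. The file names split/train.txt and split/test.txt are hypothetical; each is assumed to be a plain list of texts, one per line, as described above.

```xml
<validation-strategy>
  <validation-class>org.itc.irst.tcc.ties.validation.SingleSplitValidation</validation-class>
  <init-param>
    <param-name>training-file</param-name>
    <!-- Hypothetical file listing the training texts, one per line -->
    <param-value>./split/train.txt</param-value>
  </init-param>
  <init-param>
    <param-name>testing-file</param-name>
    <!-- Hypothetical file listing the testing texts, one per line -->
    <param-value>./split/test.txt</param-value>
  </init-param>
  <init-param>
    <param-name>hypothesis-file</param-name>
    <param-value>./bwi/sa/out.xml</param-value>
  </init-param>
  <init-param>
    <param-name>eval-file</param-name>
    <param-value>./bwi/sa/bwi-eval.csv</param-value>
  </init-param>
</validation-strategy>
```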

Multi Split Validation

This method extends single split validation by introducing cross-validation. Table 4 shows the initial parameters to set for this validation strategy. As above, the peculiarity of this strategy is that you can specify which texts are used for training and which for testing. This is done by listing the texts (one per line) in the files train.x and test.x, where x is the partition number, from 1 to n. By convention, the training-file and testing-file parameters do not include the partition number: if n is 2, training-file is split/train and testing-file is split/test, the system will learn on the texts specified in train.1 and train.2 and test on those in test.1 and test.2, respectively.

Multi Split Validation

org.itc.irst.tcc.ties.validation.MultiSplitValidation

Parameter        Description
n                number of partitions of the dataset
training-file    training set file
testing-file     testing set file
hypothesis-file  file where to output the hypotheses learned
eval-file        file where to output the evaluation measures
Table 4 Multi Split Validation parameters description.
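
The naming convention above can be sketched as the following fragment for n equal to 2. The prefixes split/train and split/test are illustrative values; with them the system would be expected to read split/train.1, split/train.2, split/test.1 and split/test.2.

```xml
<validation-strategy>
  <validation-class>org.itc.irst.tcc.ties.validation.MultiSplitValidation</validation-class>
  <init-param>
    <param-name>n</param-name>
    <param-value>2</param-value>
  </init-param>
  <init-param>
    <param-name>training-file</param-name>
    <!-- Prefix only: the partition number (.1, .2) is appended by convention -->
    <param-value>split/train</param-value>
  </init-param>
  <init-param>
    <param-name>testing-file</param-name>
    <!-- Prefix only: the partition number (.1, .2) is appended by convention -->
    <param-value>split/test</param-value>
  </init-param>
  <init-param>
    <param-name>hypothesis-file</param-name>
    <param-value>./bwi/sa/out.xml</param-value>
  </init-param>
  <init-param>
    <param-name>eval-file</param-name>
    <param-value>./bwi/sa/bwi-eval.csv</param-value>
  </init-param>
</validation-strategy>
```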

Extraction Strategy

The extraction strategy module is responsible for extracting a list of entities using the learned hypothesis. Changing its parameter settings can force a different tradeoff between precision and recall. The following describes how to choose and configure an extraction strategy.

The ties-config.bwi.extraction-strategy.extraction-class field is compulsory and specifies the extraction strategy implementation. Its value is the name of the Java class that implements the extraction strategy interface. For example, in the file shown in Figure 2, DefaultFieldExtraction has been chosen by setting the field value to org.itc.irst.tcc.ties.bwi.DefaultFieldExtraction. For a detailed description of the extraction strategies see the Developer's Guide or the TIES API documentation.

The ties-config.bwi.extraction-strategy.init-param fields specify initialization parameters for the chosen extraction strategy. The following table shows all the parameters to be set for DefaultFieldExtraction:

DefaultFieldExtraction

org.itc.irst.tcc.ties.bwi.DefaultFieldExtraction

tau

By varying this parameter one can influence the tradeoff between precision and recall (tau=0 for a full-recall setting).

match-all

true if labeled tokens have to be considered positive examples and all non-labeled tokens have to be implicitly considered negative examples; false if only explicitly labeled tokens have to be considered examples. For instance, if you want to recognize the speaker field in the CMU seminar announcements, you can use a Named Entity (NE) recognizer during preprocessing to explicitly tag people's names; the weak learner can then consider as candidates for the speaker field only the entities tagged as people's names by the NE module. Note that the value of this parameter is strictly related to that of the parameter of the same name defined in the weak learner.
By default this parameter is false.

close-all

If true, the extractor closes all the open tags for which a close tag is not found. The position of the close tag is chosen according to the field length distribution. By default this parameter is false.

Table 5 Default Field Extraction parameters description.

Weak Learner

A learning algorithm is called a weak learner if it produces a weak hypothesis, that is, a hypothesis that approximates the target concept "a little better" than a random guess. The following describes how to choose and configure a weak learner.

The ties-config.bwi.weak-learner.learner-class field is compulsory and specifies the weak learner implementation. Its value is the name of the Java class that implements the weak learner interface. For example, in the file shown in Figure 2, the FastWeakLearner has been chosen by setting the field value to org.itc.irst.tcc.ties.bwi.FastWeakLearner. For a detailed description of the weak learner see [1] or the TIES API documentation.

The ties-config.bwi.weak-learner.init-param fields set initialization parameters for the chosen weak learner. The following table shows all the parameters to be set for the FastWeakLearner:

FastWeakLearner

org.itc.irst.tcc.ties.bwi.FastWeakLearner

lookahead

lookahead parameter

cache-size

Size of the cache. This parameter can influence application performance: if it is too low, computation time grows very fast with the size of the lookahead and the number of wildcards employed; if it is too high, the application requires much more memory.

match-all

true if labeled tokens have to be considered positive examples and all non-labeled tokens have to be implicitly considered negative examples; false if only explicitly labeled tokens have to be considered examples. For instance, if you want to recognize the speaker field in the CMU seminar announcements, you can use a Named Entity (NE) recognizer during preprocessing to explicitly tag people's names; the weak learner can then consider as candidates for the speaker field only the entities tagged as people's names by the NE module. Note that the value of this parameter is strictly related to that of the parameter of the same name defined in the extraction strategy.
By default this parameter is false.

Table 6 FastWeakLearner parameters description.
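
Because match-all appears in both the extraction strategy and the weak learner, a configuration that restricts learning to explicitly labeled candidates would presumably set it in both places. The sketch below assumes that "strictly related" means the two values should agree; the remaining values are copied from Figure 2.

```xml
<bwi>
  <extraction-strategy>
    <extraction-class>org.itc.irst.tcc.ties.bwi.DefaultFieldExtraction</extraction-class>
    <init-param>
      <param-name>tau</param-name>
      <param-value>0</param-value>
    </init-param>
    <init-param>
      <!-- Assumption: kept identical to the weak learner's match-all below -->
      <param-name>match-all</param-name>
      <param-value>false</param-value>
    </init-param>
  </extraction-strategy>

  <weak-learner>
    <learner-class>org.itc.irst.tcc.ties.bwi.FastWeakLearner</learner-class>
    <init-param>
      <param-name>lookahead</param-name>
      <param-value>3</param-value>
    </init-param>
    <init-param>
      <param-name>cache-size</param-name>
      <param-value>5</param-value>
    </init-param>
    <init-param>
      <!-- Only explicitly labeled tokens are treated as examples -->
      <param-name>match-all</param-name>
      <param-value>false</param-value>
    </init-param>
  </weak-learner>
</bwi>
```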

Boosting

A boosting algorithm (booster) produces an approximation of the target concept by using another algorithm (the weak learner) that is only able to approximate the target concept weakly. The following describes how to choose and configure the boosting module.

The ties-config.bwi.boosting.boosting-class field is compulsory and specifies the boosting implementation. Its value is the name of the Java class that implements the boosting interface. For a detailed description of the boosting algorithm see [1] or the TIES API documentation.

The ties-config.bwi.boosting.init-param fields set initialization parameters for the chosen boosting implementation. The following table shows all the initial parameters to be set:

BWI

org.itc.irst.tcc.ties.bwi.BWI

Parameter   Description
iterations  number of boosting iterations (rounds of boosting)
labels      list of fields to be learned (e.g. location, speaker)
Table 7 BWI parameters description.

Tokenizer

The tokenizer module tokenizes the files that define the examples to be learned by TIES. A single file or a directory (read recursively) can be specified. The tokenizer implementations accept plain text (TXT) and tagged (SGML, XML and HTML) documents as input.

The ties-config.tokenizer.tokenizer-class field is compulsory and specifies the tokenizer implementation. Its value is the name of the Java class that implements the tokenizer interface. For example, in the configuration file shown in Figure 2, the English Tokenizer has been chosen by setting the field value to org.itc.irst.tcc.util.tokenizer.EnglishTokenizer. For a detailed description of the tokenizer see the TIES API documentation.

The ties-config.tokenizer.init-param fields set initialization parameters for the chosen tokenizer. The following tables show all the parameters to be set for the English and Italian Tokenizers:

English Tokenizer

org.itc.irst.tcc.util.tokenizer.EnglishTokenizer

Parameter  Description
input      input file or directory
output*    output file in TIES intermediate format; if omitted, a default file is created
nl*        true if pattern may include newlines; false otherwise
sp*        true if pattern may include spaces; false otherwise
* optional field
Table 8.a English Tokenizer parameters description.

Italian Tokenizer

org.itc.irst.tcc.util.tokenizer.ItalianTokenizer

Parameter  Description
input      input file or directory
output*    output file in TIES intermediate format; if omitted, a default file is created
nl*        true if pattern may include newlines; false otherwise
sp*        true if pattern may include spaces; false otherwise
* optional field
Table 8.b Italian Tokenizer parameters description.

Corpus Loader
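
For comparison with the English tokenizer configured in Figure 2, the sketch below configures the Italian Tokenizer with the parameters of Table 8.b. The input directory ./input/corpus-it is a hypothetical path; the optional output parameter is omitted, so a default file would be created.

```xml
<tokenizer>
  <tokenizer-class>org.itc.irst.tcc.util.tokenizer.ItalianTokenizer</tokenizer-class>
  <init-param>
    <param-name>input</param-name>
    <!-- Hypothetical directory of Italian documents, read recursively -->
    <param-value>./input/corpus-it</param-value>
  </init-param>
  <init-param>
    <param-name>nl</param-name>
    <param-value>true</param-value>
  </init-param>
  <init-param>
    <param-name>sp</param-name>
    <param-value>false</param-value>
  </init-param>
</tokenizer>
```

The nl and sp values mirror those used in Figure 2 and are optional.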

This module loads a file in TIES input format into memory and marks the entities to learn (specified by the atags list) as positive examples. In addition, a list of ignorable tags (itags) can be provided to skip tags that should not be included in the wrapper learned.

The ties-config.corpus-loader.loader-class field is compulsory and specifies the corpus loader. Its value is the name of the Java class that implements the corpus loader interface. For example, in the configuration file shown in Figure 2, the Default Corpus Loader has been chosen by setting the field value to org.itc.irst.tcc.ties.data.DefaultCorpusLoader. For a detailed description of the corpus loader see the TIES API documentation.

The ties-config.corpus-loader.init-param fields set initialization parameters for the chosen corpus loader. The following table shows all the parameters to be set for the Default Corpus Loader:

Default Corpus Loader

org.itc.irst.tcc.ties.data.DefaultCorpusLoader

Parameter  Description
input*     input file in TIES input format; if omitted, a default file is loaded
atags      list of tags (entity labels) that mark positive examples
itags      list of tags that have to be ignored during learning (the list can be empty)
features   list of features that have to be employed in learning
* optional field
Table 9 Default Corpus Loader parameters description.

Evaluating Learned Hypotheses

To run a cross validation on your dataset:

  1. Move to the directory that contains the application.

  2. Set the CLASSPATH environment variable to include the TIES root directory and the required jar files (distributed in the lib directory).

  3. Define a new configuration file "your-bwi-conf.xml" as described in the previous section.

  4. Run the command:

    java -Dconfig.file=your-bwi-conf.xml -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser org.itc.irst.tcc.ties.bwi.BoostingFactory [-notokenizer] [> your-output-file]

Specify the configuration file on the command line using the -Dconfig.file option. If the option is omitted, the program looks for a file named ties-config.xml in the application directory.

The result of the cross validation is output in CSV format; for each boosting round the following figures are returned: true positive count (tp), false negative count (fn), false positive count (fp), recall, precision and F(1)-measure.

n  tp  fn  fp  total  recall       precision    F(1)
0  61  11  77  138    0.847222222  0.792207792  0.818791946
1  64  22  88  152    0.744186047  0.727272727  0.735632184
2  59  19  82  141    0.756410256  0.719512195  0.7375
3  54  18  72  126    0.75         0.75         0.75
4  50  29  83  133    0.632911392  0.602409639  0.617283951
Table 10 A 5-fold cross-validation result for the speaker field in the CMU seminar announcements.

In order to generate the final extractor, run the procedure as described above using N-Fold Cross-Validation and setting the n parameter to 1; the system will learn on the whole dataset.
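
Under that description, the validation part of the configuration for producing the final extractor could look like the sketch below. It is the validation-strategy element of Figure 2 with n changed to 1; the output paths are kept from Figure 2 for illustration.

```xml
<validation-strategy>
  <validation-class>org.itc.irst.tcc.ties.validation.NFoldCrossValidation</validation-class>
  <init-param>
    <!-- n = 1: learn on the whole dataset, as stated above -->
    <param-name>n</param-name>
    <param-value>1</param-value>
  </init-param>
  <init-param>
    <param-name>hypothesis-file</param-name>
    <param-value>./bwi/sa/out.xml</param-value>
  </init-param>
  <init-param>
    <param-name>eval-file</param-name>
    <param-value>./bwi/sa/bwi-eval.csv</param-value>
  </init-param>
</validation-strategy>
```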

Wrapper

The wrapper learned is stored in XML format and it can be used later for the extraction task. Figure 3 shows a fragment of a wrapper learned for the speaker field in the CMU seminar announcements.

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<wrapper label="speaker">
<fore-detector>
    <detector>
      <pattern type="prefix">
        <feature name="token" value="Who"/>
        <feature name="single_char_token" value="true"/>
      </pattern>
      <pattern type="suffix">
        <feature name="alpha_token" value="true"/>
      </pattern>
    <confidence-value>2.7587264482323546</confidence-value>
    </detector>
    <detector>
      <pattern type="prefix">
        <feature name="token" value="speaker"/>
        <feature name="single_char_token" value="true"/>
      </pattern>
      <pattern type="suffix">
      </pattern>
      <confidence-value>2.2216808574759974</confidence-value>
    </detector>
  ...
  
</wrapper>
Figure 3 A fragment of the wrapper learned encoded in XML.

1  P[token='Who', single_char_token='true'] S[alpha_token='true']
2  P[token='speaker', single_char_token='true'] S[]
3  P[] S[token='Dr.', capitalized_token='true']
4  P[token='Speaker', single_char_token='true'] S[capitalized_token='true']
...
Table 11 A fragment of the wrapper learned after applying an XSL transformation.

We plan to add some comments to better understand the example above.

Extraction

The following section describes how to perform the extraction task on a given input. It is strongly recommended to configure the extraction module using the XML configuration file. Do not confuse the configuration file below, which is used for extraction, with the preceding one, which is used for learning the model. The result is output in XML format. If further manipulation of the extracted entities is required, a Java API is provided.

<!-- Configuration file for the standard "CMU seminar announcements" extraction task -->
<ties-config>
  <bwi>
    <classifier>
      <extract>
        <entity>
          <wrapper>./bwi/sa/new-out0-speaker.xml</wrapper>
          <output>./bwi/sa/res0-speaker.xml</output>
        </entity>  
        <entity>
          <wrapper>./bwi/sa/new-out0-location.xml</wrapper>
          <output>./bwi/sa/res0-location.xml</output>
        </entity>
      </extract>  
    </classifier>

    <extraction-strategy>
      <extraction-class>org.itc.irst.tcc.ties.bwi.DefaultFieldExtraction</extraction-class> 
      <init-param>
        <param-name>tau</param-name> 
        <param-value>0</param-value> 
      </init-param>
      <init-param>
        <param-name>match-all</param-name> 
        <param-value>true</param-value> 
      </init-param>
      <init-param>
        <param-name>close-all</param-name> 
        <param-value>false</param-value> 
      </init-param>
    </extraction-strategy>
  </bwi>

  <tokenizer>
    <tokenizer-class>org.itc.irst.tcc.util.tokenizer.EnglishTokenizer</tokenizer-class>
    <init-param>
      <param-name>input</param-name>
      <param-value>./input/seminar</param-value>
    </init-param>
    <init-param>
      <param-name>nl</param-name>
      <param-value>true</param-value>
    </init-param>
    <init-param>
      <param-name>sp</param-name>
      <param-value>false</param-value>
    </init-param>
  </tokenizer>

  <corpus-loader>
    <loader-class>org.itc.irst.tcc.ties.data.DefaultCorpusLoader</loader-class>
    <init-param>
      <param-name>atags</param-name>
      <param-value>stime,location,etime</param-value>
    </init-param>
    <init-param>
      <param-name>itags</param-name>
      <param-value>speaker,paragraph,sentence</param-value>
    </init-param>
    <init-param>
      <param-name>features</param-name>
      <param-value>alpha_token,lower_token,upper_token,cap_token,nl_token,
                   punc_token,schar_token</param-value>
    </init-param>
  </corpus-loader>
</ties-config>
				
Figure 4 An example of a configuration file for extraction.

ties-config is the main element in the configuration file. It has multiple children describing the TIES modules employed for extraction: classifier, extraction strategy, tokenizer and corpus loader. Figure 4 shows an example of a configuration file for the standard "CMU seminar announcements" extraction task. The following subsections describe the configuration of the classifier module; the tokenizer and corpus loader modules are described in the Running TIES section.
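To make the structure of Figure 4 concrete, the following sketch lists the wrapper/output pair declared by each entity element, using the standard DOM parser. It is only an illustration of the file layout and is not how TIES itself loads its configuration; the class ExtractConfigReader is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper, not part of TIES: reads entity entries from a ties-config document. */
final class ExtractConfigReader {

    /** Returns one {wrapper, output} pair per entity element; output is null when absent. */
    static List<String[]> entities(String configXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(configXml)));
            NodeList nodes = doc.getElementsByTagName("entity");
            List<String[]> pairs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element entity = (Element) nodes.item(i);
                pairs.add(new String[] {
                    childText(entity, "wrapper"),
                    childText(entity, "output")
                });
            }
            return pairs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse configuration", e);
        }
    }

    /** Text of the first descendant with the given tag, or null if there is none. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String config = "<ties-config><bwi><classifier><extract>"
            + "<entity><wrapper>./bwi/sa/new-out0-speaker.xml</wrapper>"
            + "<output>./bwi/sa/res0-speaker.xml</output></entity>"
            + "</extract></classifier></bwi></ties-config>";
        for (String[] pair : entities(config)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

A check of this kind can be run before starting a long extraction job, for example to verify that every wrapper file named in the configuration exists.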

Classifier configuration

This module provides methods for extracting information from a single text or a set of texts given the model (wrapper) learned during training. The result is output in a file encoded in XML.

The ties-config.classifier.extract.wrapper field is compulsory: it specifies the model to apply for extraction.

The ties-config.classifier.extract.output field sets the output file. The following table lists the parameters of the provided Classifier:

Classifier

wrapper    model learned
output*    output file in XML format
* optional field
Table 12 Classifier parameters description.

Running extraction

To perform the extraction task on an input file do the following:

  1. Move to the directory that contains the application.

  2. Set the CLASSPATH environment variable to the TIES root directory and the following jar files (distributed in the lib directory):

  3. Define a new configuration file "your-classifier-config.xml" as described in the previous section.

  4. Run the command:

    java -Dconfig.file=your-classifier-config.xml -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser org.itc.irst.tcc.ties.bwi.Classifier [> your-output-file]

<entity-list>
  <entity name="speaker" src=".\CMUAN~3G.DER" start="142" end="151">
    <token start="142" len="3">Mr.</token>
    <token start="146" len="5">Okada</token>
  </entity>
  <entity name="speaker" src=".\CMUAN~EU.ACE" start="330" end="348">
    <token start="330" len="3">Dr.</token>
    <token start="334" len="9">Stephanie</token>
    <token start="344" len="4">Shaw</token>
  </entity>
  <entity name="speaker" src=".\CMUAN~G2.CAR" start="624" end="640">
    <token start="624" len="3">Mr.</token>
    <token start="628" len="6">Andrew</token>
    <token start="635" len="5">Gault</token>
  </entity>
  <entity name="speaker" src=".\CMUAN~G2.CAR" start="810" end="826">
    <token start="810" len="3">Mr.</token>
    <token start="814" len="6">Jessie</token>
    <token start="821" len="5">Ramey</token>
  </entity>
  <entity name="speaker" src=".\CMUAN~G2.CAR" start="880" end="896">
    <token start="880" len="3">Dr.</token>
    <token start="884" len="4">Judi</token>
    <token start="889" len="7">Mancuso</token>
  </entity>
</entity-list>
Figure 5 A fragment of the output file obtained in the standard CMU seminar announcements extraction task: sequences of tokens extracted for the speaker field.
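When the output file is processed outside TIES, it can be read with any XML parser. The following minimal sketch uses the standard DOM API to rebuild, for each entity element of Figure 5, its label and the sequence of extracted tokens. The class EntityListReader is hypothetical and is not part of the TIES API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper, not part of TIES: reads an entity-list output document. */
final class EntityListReader {

    /** Returns one "name: token token ..." string per entity element. */
    static List<String> read(String entityListXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(entityListXml)));
            NodeList entities = doc.getElementsByTagName("entity");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < entities.getLength(); i++) {
                Element entity = (Element) entities.item(i);
                NodeList tokens = entity.getElementsByTagName("token");
                StringBuilder text = new StringBuilder();
                for (int j = 0; j < tokens.getLength(); j++) {
                    if (j > 0) {
                        text.append(' ');
                    }
                    text.append(tokens.item(j).getTextContent());
                }
                result.add(entity.getAttribute("name") + ": " + text);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse entity list", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<entity-list>"
            + "<entity name=\"speaker\" src=\".\\CMUAN~3G.DER\" start=\"142\" end=\"151\">"
            + "<token start=\"142\" len=\"3\">Mr.</token>"
            + "<token start=\"146\" len=\"5\">Okada</token>"
            + "</entity></entity-list>";
        for (String line : read(xml)) {
            System.out.println(line);
        }
    }
}
```

The src, start and end attributes can be read with getAttribute in the same way when the position of an entity in the source file is needed.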
Classifier API

This section describes how to manipulate the extracted entities using the TIES API. ExtractionResult acts as a container for the result of an extraction task. It maps each file to a set of pairs, each consisting of a label and a set of entities. The map cannot contain duplicate files; each file can map to many entities. Each set of entities is sorted by extraction score, highest first.

import java.io.File;
import java.util.Iterator;
import java.util.Map;
import java.util.SortedSet;
import org.itc.irst.tcc.ties.bwi.Classifier;
// ExtractionResult and Entity are TIES classes: import them from their TIES packages

// set the config file; alternatively, run the application with the
// parameter -Dconfig.file=your_classifier_conf.xml
System.setProperty("config.file", "your_classifier_conf.xml");

// instantiate the Classifier and extract the entities;
// configuration parameters are loaded from the configuration file
ExtractionResult er = new Classifier().extract();

// iterate over the files
Iterator it = er.entries().iterator();
while (it.hasNext())
{
  Map.Entry e = (Map.Entry) it.next();
  // get the file
  File f = (File) e.getKey();
  // get the entities extracted from the file, grouped by label
  ExtractionResult.EntityMap em = (ExtractionResult.EntityMap) e.getValue();

  // iterate over the labels
  Iterator it1 = em.entries().iterator();
  while (it1.hasNext())
  {
    Map.Entry e2 = (Map.Entry) it1.next();
    String label = (String) e2.getKey();
    SortedSet ss = (SortedSet) e2.getValue();

    // get the entity with the highest score
    // (ss.last() is the entity with the lowest score)
    Entity entity = (Entity) ss.first();

    // add code here

  } // end inner while

} // end outer while
				
Figure 6 A fragment of code for manipulating the information extracted.
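The fragment in Figure 6 relies on each set of entities being ordered by extraction score, so that first() returns the best candidate. The following self-contained sketch illustrates that ordering with a standard TreeSet. ScoredEntity is a hypothetical stand-in for the TIES Entity class, whose actual fields and comparison logic are not shown in this guide; the scores are made up for the example.

```java
import java.util.Comparator;
import java.util.SortedSet;
import java.util.TreeSet;

/** Illustration only: a sorted set whose first element has the highest score. */
final class ScoreOrderDemo {

    /** Hypothetical stand-in for a TIES entity: extracted text plus its score. */
    static final class ScoredEntity {
        final String text;
        final double score;

        ScoredEntity(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    /** Creates an empty set ordered by descending score; equal scores are ordered by text. */
    static SortedSet<ScoredEntity> newRanking() {
        Comparator<ScoredEntity> byScoreDescending = (a, b) -> {
            int c = Double.compare(b.score, a.score);
            return c != 0 ? c : a.text.compareTo(b.text);
        };
        return new TreeSet<>(byScoreDescending);
    }

    public static void main(String[] args) {
        SortedSet<ScoredEntity> speakers = newRanking();
        speakers.add(new ScoredEntity("Mr. Okada", 1.7));
        speakers.add(new ScoredEntity("Dr. Stephanie Shaw", 2.2));
        speakers.add(new ScoredEntity("Mr. Andrew Gault", 0.4));

        // first() is the highest-scoring candidate, last() the lowest-scoring one.
        System.out.println("best:  " + speakers.first().text);
        System.out.println("worst: " + speakers.last().text);
    }
}
```

Breaking ties on the text keeps two distinct entities with the same score from being treated as duplicates by the set.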

Case Study: CMU seminar announcements

The application comes with the dataset (/input folder) and configuration files (/SA folder) for the standard "CMU seminar announcements" extraction task. Try the following steps to extract the speaker, location, stime and etime fields:

  1. Move to the directory that contains the application.

  2. Set the CLASSPATH environment variable to the TIES root directory and the following jar files (distributed in the lib directory):

  3. Modify ./bwi/sa/ties-config.xml, as described in the Configuration File section, to set your cross-validation parameters.

  4. Run the cross-validation process:

    java -mx1024M -Dconfig.file=./bwi/sa/ties-config.xml -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser org.itc.irst.tcc.ties.bwi.BoostingFactory > ./bwi/sa/out.bwi

  5. Read the result of evaluation in ./bwi/sa/bwi-eval.csv.

  6. Repeat steps 3 and 4 to learn on the whole dataset, i.e. use N-Fold Cross-Validation with n set to 1.

  7. Modify ./bwi/sa/classifier-config.xml, as described in the Classifier configuration section, to set your classifier parameters.

  8. Extract entities from a new text:

    java -Dconfig.file=./bwi/sa/classifier-config.xml -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser -mx256M org.itc.irst.tcc.ties.bwi.Classifier > ./bwi/sa/out1.bwi
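When the steps above are repeated often, the two java commands can be assembled and launched from a small driver program. The following sketch builds the same command lines as steps 4 and 8 and shows how they could be started with the standard ProcessBuilder class. The class TiesLauncher is hypothetical and is not part of TIES; the sketch assumes that the CLASSPATH has already been set as described in step 2.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical driver, not part of TIES: assembles the commands of the case study. */
final class TiesLauncher {

    /** Builds the java command line for a TIES main class and configuration file. */
    static List<String> command(String mainClass, String configFile, int heapMb) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-mx" + heapMb + "M");
        cmd.add("-Dconfig.file=" + configFile);
        cmd.add("-Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser");
        cmd.add(mainClass);
        return cmd;
    }

    /** Runs the command, writes its standard output to the file, returns the exit code. */
    static int run(List<String> cmd, File stdout) {
        try {
            ProcessBuilder builder = new ProcessBuilder(cmd);
            builder.redirectOutput(stdout);
            builder.redirectError(ProcessBuilder.Redirect.INHERIT);
            return builder.start().waitFor();
        } catch (IOException e) {
            throw new IllegalStateException("Cannot start " + cmd, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while running " + cmd, e);
        }
    }

    public static void main(String[] args) {
        // Print the commands of steps 4 and 8; pass them to run(...) to execute them.
        System.out.println(String.join(" ", command(
                "org.itc.irst.tcc.ties.bwi.BoostingFactory", "./bwi/sa/ties-config.xml", 1024)));
        System.out.println(String.join(" ", command(
                "org.itc.irst.tcc.ties.bwi.Classifier", "./bwi/sa/classifier-config.xml", 256)));
    }
}
```

For example, run(command("org.itc.irst.tcc.ties.bwi.Classifier", "./bwi/sa/classifier-config.xml", 256), new File("./bwi/sa/out1.bwi")) corresponds to step 8.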

Release Notes

See the Release Notes on the TIES web site for a summary of changes to the software since the previous version and other information pertaining to this release. The online release notes will be updated as needed, so you should check them regularly for the latest information.

Bibliography

[1] D. Freitag and N. Kushmerick, "Boosted Wrapper Induction", AAAI-2000, Austin, Texas, pp. 577-583.

[2] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, October 1999, 416 pages, ISBN 1-55860-552-5.


If you have any comments about these pages, please contact: giuliano@itc.it