TIES 2.x

Developer's Guide

Table of Contents

Introduction

The ambitious goal we pursue is to provide a general framework for developing, testing, and collecting Information Extraction (IE) applications based on Machine Learning (ML) techniques. We have started thinking seriously about this idea while developing a Java reimplementation of the Boosted Wrapper Induction (BWI) algorithm devised by Dayne Freitag and Nicholas Kushmerick [1]. We realized that a few general packages could be reused for implementing other IE algorithms based on supervised learning methods. The challenge is to provide a general API that supports both developing and testing of IE algorithms, minimizing the effort required in implementing some standard modules of an IE application, and allowing the designer to concentrate on the algorithm details. We called this application TIES (Trainable Information Extraction System). TIES provides a Java API for representing data samples, defining machine learning algorithms, and running validation processes. The current version is still susceptible of change (it can be considered an alpha version in spite of the declared version 2.0), but it is stable enough to be employed for research purposes. We strongly recommend to extend and implement, respectively, the classes and interfaces provided instead of changing directly the source code; this will reduce future problems when we will change the API. Contributions and suggestions are welcome.

In this guide, we will give an overview of the basics of development using this framework, focusing on the two main areas for TIES usage:

The current implementation is distributed with the reimplementation of the BWI algorithm.

Dependencies

TIES uses elements of the Java 2 API such as collections, and therefore building requires the Java 2 Standard Edition SDK (Software Development Kit). To run TIES, the Java 2 Standard Edition RTE (Run Time Environment) is required (or you can use the SDK, of course).

TIES also is dependent upon a few packages for general functionality. They are included in the lib directory for convenience, but the default build target does not include them. If you use the default build target, you must add the dependencies to your classpath.

Resources

There are quite a few resources and examples available to the programmer, and we recommend that you look at our examples, documentation and even the source code. Some great sources are:

All directory references above are relative to the distribution root directory.

How TIES Works

In this section we give a high level description of how TIES works. When using TIES in an application program or stand-alone, you will generally execute the following steps:

We use the UML notation for a graphical description of the application packages. For a second source of information we suggest to see the TIES API documentation.

Training and Testing

The fragment of sequence diagram shown in Figure 1 gives a general idea of how the training and testing steps are performed. A ValidationStrategy object calls the learn method of a Learner instance which by contract learns a Hypothesis from a training sample (not shown in the diagram), and returns it; then the learned hypothesis is saved and used to extract, from a testing sample, a set of field instances (an instance of a field is a contiguous sequence of tokens appearing in the sample). An Evaluation object is created for evaluating the effectiveness of the learned hypothesis; finally the results of the evaluation are saved. If the ValidationStrategy object implements some sort of cross validation then this process is repeated several times. Once the algorithm performances have been evaluated, the Learner can be applied (and learns) on the whole dataset. This can be implemented as a [special] case of a validation strategy, for example, running a N-fold cross validation with N set to 1.

Figure 1 Sequence diagram of the validation phase.

Extraction

This step is partially described in the User's Guide

[...]

TIES API

TIES has four general purpose packages: data, ml, validation,and util. Moreover, it is provided with an implementation of the BWI algorithm (bwi package). This section describes the general packages. In the section Case of Study we describe the bwi package as an example of an implementation of an IE algorithm using the TIES API.

The data package

The package data provides classes and interfaces for dealing with data samples, examples, instances, and features in a manner independent from any specific feature-based algorithms. This means that your main application can be written to be data-independent, and it can rely upon a separate, dynamically-linked data model. This allows the flexibility of easily adding new features for characterizing the instances. Programmers have the possibility of extending the data representation [...] as well as exploiting the default one.

The class diagram of Figure 2.a shows the main classes and interfaces of the data package. The interface CorpusLoader plays a fundamental role in the package: classes that implement it are responsible of reading a labeled data sample and returning an ExampleSet and a TextList. The former class represents the example sample used by the ML algorithm, the latter class allows, during the extraction phase, to bind the pieces of information extracted from the example sample to the original text. Although formally the example sample is a sequence of instance-class pairs independently extracted from a uniform distribution, we take the liberty of calling it an example set. The Instance interface represents an object into a data sample. The instance is described by a collection of features. All general-purpose Instance implementation classes (which could implement Instance indirectly through one of its subclasses) should provide a "standard" implementation of the matches method. A Feature object is a name-value pair (note that in the next release this object will become an interface for handling user defined features). FeatureFactory is a class for vending Feature objects. Wherever possible, it hands out references to shared Feature instances (thus limiting the memory usage). Instance comes with a default implementation and can be redefined for any application specific use. An Example object is an instance with associated a class label and position into the corpus (Index). The example is positive (true) if the instance belongs to a specified class, negative (false) otherwise.

Figure 2.a Class diagram of the package data.

The package provides the default implementation of the interfaces defined. DefaultCorpusLoader is the implementation of the CorpusLoader interface: it loads a corpus in TIES input format (TIESIF). The corpus loader to be instantiated and its parameters can be specified in the ties-config.xml file. DefaultInstance implements an instance with symbolic features. For more details about the input format, the config file, and the default feature representation see the TIES user's guide.

The patterns package

The patterns package is a subpackage of data specialized for handling dynamic patterns as defined in [1] (consider that until release 2.0 this package was part of the data package).

The class diagram of Figure 2.c shows the classes and interfaces of the patterns package. A Pattern object implements the concept of pattern, represented as a specific sequence of Features. For instance, given a contiguous sequence of instances represented by a set of features as shown in Figure 2b, a pattern is a sequence of features taken one from each instance and in the same order. For a generic IE application, instances can be words or word sequences in a text, while features can be, for instance, the token itself, parts of speech (PoS), orthographic features, and relations between words. A pattern matches a contiguous sequence of instances iff it exists a sequence of features as they appear into the pattern.

Ij {fj,1, fj,2, ..., fj,n}
Ij+1 {fj+1,1, fj+1,2, ..., fj+1,m}
Ij+2 {fj+2,1, fj+2,2, ..., fj+2,l}
P1(fj,1, fj+1,m, fj+2,2)
P2(fj,1, fj+1,1, fj+2,1)
P3(fj,n, fj+1,2, fj+2,1)
Figure 2b On the left, a fragment of example set and, on the right, some patterns derived from it.

Objects that implement the PatternMatcher interface should match a specified pattern over an example set and return the list of instances' indices where the pattern matches a positive examples (ok match) and where it matches a negative example (wrong match). A PatternMatcher encapsulates the ok and wrong matches of a given pattern. The efficient pattern matcher implementation (FastPatternMatcher) uses an inverted index (InvertedIndex) that, given a feature, returns the list of positions where it appears (IndexList) within the example set.

Figure 2.c Class diagram of the package data.patterns.

[...]

The ml package

The ml package provides very general interfaces and classes for defining machine learning algorithms. This package by design does not provide any concrete class implementation; it delegates this task to the programmers who want to implement new ML algorithms.

The class diagram of Figure 3.a shows the classes and interfaces of the ml package. Every machine learning algorithm should implement the Learner interface. Classes that implement this interface "learn" a concept from a labeled training set. All Learner implementations must override the learn method that returns the learned Hypothesis.

Figure 3.a Class diagram of the package ml.

The Hypothesis interface is a representation of the concept learned by the machine learning algorithm. It provides methods for loading and saving the hypothesis, and a method for extracting a list of Fields from the specified example set.

The boosting package

The boosting package is a subpackage of ml that provides a boosting support (this package will be available since release 2.1).

[...]

Figure 3.b Class diagram of the package boosting.

[...]

The validation package

The validation package provides the classes necessary for performing validation. We implemented some of the most well-known validation strategies, such as N-fold cross validation, random split validation, and leave one out. The class diagram of Figure 4 shows the classes and interfaces of this package.

Classes that implement the ValidationStrategy interface by contract should reserve a certain amount of data for training and use the remainder for testing and then run the process described in the sequence diagram shown in Figure 1. Through the abstract class AbstractValidationStrategy we provide a skeletal implementation of the ValidationStrategy interface, to minimize the effort required to implement this interface.

For example, NFoldCrossValidation divides the data into n parts, in each of which the class is represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme trained on the remaining; then its error rate is calculated on the holdout set. For a description of the other validation strategies please refer to the API documentation and TIES user's guide.

Figure 4 class diagram of the package validation.

The learning procedure is executed a total of n times, every time with a different train/test split. Finally, the n error estimates are averaged to yield an overall error estimate.

ValidationStrategyFactory is a factory class for creating a validation strategy. It contains static methods for creating a validation strategy. The strategy to be instantiated and its parameters can be specified in the ties-config.xml file. See the TIES user's guide for an example of config file.

The Evaluation object evaluates the learned hypothesis. It provides standard precision, recall and f-measure figures. A ValidationResult is used for collecting a list of validation results. It provides the averaged result over the validation runs. This class provides a method for saving the validation result in CSV.

The util package

This package contains miscellaneous utility classes not described here. For details on the package contents see the TIES API documentation.

Extending TIES

In order to show that TIES can be extended easily, in the next subsection we propose as case of study the implementation of the Boosted Wrapper Induction.

[...]

Case of Study: BWI

This section describes the boosted wrapper induction object oriented implementation using the TIES API. Note that the boosting support will be generalized and moved in the ml.boosting package in the next release, but for the time being it is hardcoded as part of the bwi package.

The class diagrams of Figure 5.a and 5.b show the classes and interfaces of the bwi package. BWI is the main class of the package: it implements the Learner and Boosting interfaces. The former interface is described in The ml package. The latter, together with the WeakLearner interface, provides the abstraction of boosting.

As described in How TIES Works a validation object calls the learn method of BWI defined in the Learner interface. learn calls the boost method, defined in the Boosting interface, for learning a linear combination of weak hypotheses. Keeping separated the mechanism that runs the algorithm from its implementation gives a lot of flexibility. For instance, changing from AdaBoost to a different boosting technique requires only reimplementing the boost methods defined in Boosting, keeping untouch the other classes. In similar way, a new weak learner can be added just implementing the WeakLearner interface.

Figure 5.a Class diagram of the package bwi.

Once instantiated, the boosting algorithm repeatedly calls a weak learner implementation, for instance DefaultWeakLearner, over the training set and updates the examples weightings (Distribution). A WeakLearner object returns a weak hypothesis, i.e. a Detector. This process is represented in the sequence diagram shown in Figure 5.c.

The WeakLearnerFactory class is employed for creating a weak learner instance. The learner to be instantiated and its parameters can be specified in the ties-config.xml file. See the TIES user's guide for an example of config file.

Figure 5.b Class diagram of the package bwi.

The algorithm learns separately to recognize the beginning and end of the fields to be extracted and combines them, using the field length distribution, in a Wrapper object. The Wrapper class implements the Hypothesis interface providing methods for saving and loading a wrapper in XML format and for extracting a list of fields from an example set.

Figure 5.c Sequence diagram of the package bwi.

In TIES the procedure which in [1] is called BWI corresponds to the learn method of the BWI class, while the AdaBoost procedure corresponds to the boost method defined in Boosting. Finally, the procedure LearnDetector corresponds to the learnDetector defined in WeakLearner.

Plugging TIES

This section describes how to plug TIES in an application program.

[...]

Bibliography

[1] D. Freitag & N. Kushmerick, "Boosted wrapper induction", AAAI-00 (Austin), pp. 577-583.