The ambitious goal we pursue is to provide a general framework for developing, testing, and collecting Information Extraction (IE) applications based on Machine Learning (ML) techniques. We have started thinking seriously about this idea while developing a Java reimplementation of the Boosted Wrapper Induction (BWI) algorithm devised by Dayne Freitag and Nicholas Kushmerick [1]. We realized that a few general packages could be reused for implementing other IE algorithms based on supervised learning methods. The challenge is to provide a general API that supports both developing and testing of IE algorithms, minimizing the effort required in implementing some standard modules of an IE application, and allowing the designer to concentrate on the algorithm details. We called this application TIES (Trainable Information Extraction System). TIES provides a Java API for representing data samples, defining machine learning algorithms, and running validation processes. The current version is still susceptible of change (it can be considered an alpha version in spite of the declared version 2.0), but it is stable enough to be employed for research purposes. We strongly recommend to extend and implement, respectively, the classes and interfaces provided instead of changing directly the source code; this will reduce future problems when we will change the API. Contributions and suggestions are welcome.
In this guide, we will give an overview of the basics of development using this framework, focusing on the two main areas for TIES usage:
The current implementation is distributed with the reimplementation of the BWI algorithm.
TIES uses elements of the Java 2 API such as collections, and therefore building requires the Java 2 Standard Edition SDK (Software Development Kit). To run TIES, the Java 2 Standard Edition RTE (Run Time Environment) is required (or you can use the SDK, of course).
TIES also is dependent upon a few packages for general functionality. They are included in the lib directory for convenience, but the default build target does not include them. If you use the default build target, you must add the dependencies to your classpath.
[...].There are quite a few resources and examples available to the programmer, and we recommend that you look at our examples, documentation and even the source code. Some great sources are:
All directory references above are relative to the distribution root directory.
In this section we give a high level description of how TIES works. When using TIES in an application program or stand-alone, you will generally execute the following steps:
Learner)ValidationStrategy)Hypothesis)We use the UML notation for a graphical description of the application packages. For a second source of information we suggest to see the TIES API documentation.
The fragment of sequence diagram shown in Figure 1 gives a general idea of
how the training and testing steps are performed. A ValidationStrategy
object calls the learn method of a Learner instance
which by contract learns a Hypothesis from a training sample (not
shown in the diagram), and returns it; then the learned hypothesis is saved
and used to extract, from a testing sample, a set of field instances
(an instance of a field is a contiguous sequence of tokens appearing in the
sample). An Evaluation object is created for evaluating the
effectiveness of the learned hypothesis; finally the results of the evaluation
are saved. If the ValidationStrategy object implements some sort
of cross validation then this process is repeated several times. Once the algorithm
performances have been evaluated, the Learner can be applied (and
learns) on the whole dataset. This can be implemented as a [special] case of
a validation strategy, for example, running a N-fold cross validation
with N set to 1.
![]() |
| Figure 1 Sequence diagram of the validation phase. |
This step is partially described in the User's Guide
[...]
TIES has four general purpose packages: data, ml,
validation,and util. Moreover, it is provided with an implementation
of the BWI algorithm (bwi package). This section describes the
general packages. In the section Case of Study
we describe the bwi package as an example of an implementation of
an IE algorithm using the TIES API.
The package data provides classes and interfaces for dealing
with data samples, examples, instances, and features in a manner independent
from any specific feature-based algorithms. This means that your main application can be written
to be data-independent, and it can rely upon a separate, dynamically-linked
data model. This allows the flexibility of easily adding new features for
characterizing the instances. Programmers have the possibility of
extending the data representation [...] as well as exploiting the default
one.
The class diagram of Figure 2.a shows the main classes and interfaces of the
data package. The interface CorpusLoader plays a fundamental
role in the package: classes that implement it are responsible of reading a
labeled data sample and returning an ExampleSet and a TextList.
The former class represents the example sample used by the ML algorithm, the
latter class allows, during the extraction phase, to bind the pieces of information
extracted from the example sample to the original text. Although formally
the example sample is a sequence of instance-class pairs independently extracted from a uniform distribution, we take the liberty of calling it an example set.
The Instance interface represents an object into a data sample.
The instance is described by a collection of features. All general-purpose
Instance implementation classes (which could implement Instance
indirectly through one of its subclasses) should provide a "standard" implementation
of the matches method. A Feature object is a name-value
pair (note that in the next release this object will become an interface for
handling user defined features). FeatureFactory is a class for
vending Feature objects. Wherever possible, it hands out references
to shared Feature instances (thus limiting the memory
usage). Instance comes with a default implementation and can
be redefined for any application specific use. An Example object
is an instance with associated a class label and position into the corpus (Index).
The example is positive (true) if the instance belongs to a specified
class, negative (false) otherwise.
![]() |
| Figure 2.a Class diagram of the package data. |
The package provides the default implementation of the interfaces defined.
DefaultCorpusLoader is the implementation of the CorpusLoader
interface: it loads a corpus in TIES input format (TIESIF). The corpus loader
to be instantiated and its parameters can be specified in the ties-config.xml
file. DefaultInstance implements an instance with symbolic features.
For more details about the input format, the config file, and the default feature
representation see the TIES user's
guide.
The patterns package is a subpackage of data specialized
for handling dynamic patterns as defined in [1] (consider
that until release 2.0 this package was part of the data package).
The class diagram of Figure 2.c shows the classes and interfaces of the patterns
package. A Pattern object implements the concept of pattern, represented
as a specific sequence of Features. For instance, given a contiguous
sequence of instances represented by a set of features as shown in Figure 2b,
a pattern is a sequence of features taken one from each instance and in the
same order. For a generic IE application, instances can be words or word sequences
in a text, while features can be, for instance, the token itself,
parts of speech (PoS), orthographic features, and relations between words. A
pattern matches a contiguous sequence of instances iff it exists a sequence
of features as they appear into the pattern.
|
|
|||||||||
| Figure 2b On the left, a fragment of example set and, on the right, some patterns derived from it. | ||||||||||
Objects that implement the PatternMatcher interface should match
a specified pattern over an example set and return the list of instances' indices
where the pattern matches a positive examples (ok match) and where it matches
a negative example (wrong match). A PatternMatcher encapsulates
the ok and wrong matches of a given pattern. The efficient pattern matcher implementation
(FastPatternMatcher) uses an inverted index (InvertedIndex)
that, given a feature, returns the list of positions where it appears (IndexList)
within the example set.
![]() |
| Figure 2.c Class diagram of the package data.patterns. |
[...]
The ml package provides very general interfaces and classes for
defining machine learning algorithms. This package by design does not provide
any concrete class implementation; it delegates this task to the programmers
who want to implement new ML algorithms.
The class diagram of Figure 3.a shows the classes and interfaces of the ml
package. Every machine learning algorithm should implement the Learner
interface. Classes that implement this interface "learn" a concept from a labeled
training set. All Learner implementations must override the learn
method that returns the learned Hypothesis.
![]() |
| Figure 3.a Class diagram of the package ml. |
The Hypothesis interface is a representation of the concept learned
by the machine learning algorithm. It provides methods for loading and saving
the hypothesis, and a method for extracting a list of Fields from
the specified example set.
The boosting package is a subpackage of ml that provides
a boosting support (this package will be available since release 2.1).
[...]
![]() |
| Figure 3.b Class diagram of the package boosting. |
[...]
The validation package provides the classes necessary for performing
validation. We implemented some of the most well-known validation strategies, such
as N-fold cross validation, random split validation, and leave one out. The
class diagram of Figure 4 shows the classes and interfaces of this package.
Classes that implement the ValidationStrategy interface by contract
should reserve a certain amount of data for training and use the remainder
for testing and then run the process described in the sequence diagram shown
in Figure 1. Through the abstract class AbstractValidationStrategy
we provide a skeletal implementation of the ValidationStrategy
interface, to minimize the effort required to implement this interface.
For example, NFoldCrossValidation divides the data into n
parts, in each of which the class is represented in approximately the same proportions
as in the full dataset. Each part is held out in turn and the learning scheme
trained on the remaining; then its error rate is calculated on the holdout set.
For a description of the other validation strategies please refer to the API
documentation and TIES user's
guide.
![]() |
| Figure 4 class diagram of the package validation. |
The learning procedure is executed a total of n times, every time with a different
train/test split. Finally, the n
error estimates are averaged to yield an overall error estimate.
ValidationStrategyFactory is a factory class for creating a validation
strategy. It contains static methods for creating a validation strategy. The
strategy to be instantiated and its parameters can be specified in the ties-config.xml
file. See the TIES user's
guide for an example of config file.
The Evaluation object evaluates the learned hypothesis. It provides
standard precision, recall and f-measure figures. A ValidationResult
is used for collecting a list of validation results. It provides the averaged
result over the validation runs. This class provides a method for saving the
validation result in CSV.
This package contains miscellaneous utility classes not described here. For details on the package contents see the TIES API documentation.
In order to show that TIES can be extended easily, in the next subsection we propose as case of study the implementation of the Boosted Wrapper Induction.
[...]
This section describes the boosted wrapper induction object oriented implementation
using the TIES API. Note that the boosting support will be generalized and moved
in the ml.boosting package in the next release, but for the time being
it is hardcoded as part of the bwi package.
The class diagrams of Figure 5.a and 5.b show the classes and interfaces of
the bwi package. BWI is the main class of the package:
it implements the Learner and Boosting interfaces.
The former interface is described in The ml package.
The latter, together with the WeakLearner interface, provides the
abstraction of boosting.
As described in How TIES Works a validation object
calls the learn method of BWI defined in the Learner
interface. learn calls the boost method, defined in
the Boosting interface, for learning a linear combination of weak
hypotheses. Keeping separated the mechanism that runs the algorithm from its implementation
gives a lot of flexibility. For instance, changing from AdaBoost to a different
boosting technique requires only reimplementing the boost methods
defined in Boosting, keeping untouch the other classes. In similar
way, a new weak learner can be added just implementing the WeakLearner
interface.
![]() |
| Figure 5.a Class diagram of the package bwi. |
Once instantiated, the boosting algorithm repeatedly calls a weak learner implementation,
for instance DefaultWeakLearner, over the training set and updates the
examples weightings (Distribution). A WeakLearner
object returns a weak hypothesis, i.e. a Detector.
This process is represented in the sequence diagram shown in Figure 5.c.
The WeakLearnerFactory class is employed for creating a weak
learner instance. The learner to be instantiated and its parameters can be specified
in the ties-config.xml file. See the
TIES user's guide for an example of config file.
![]() |
| Figure 5.b Class diagram of the package bwi. |
The algorithm learns separately to recognize the beginning and end of the
fields to be extracted and combines them, using the field length distribution,
in a Wrapper object. The Wrapper class implements
the Hypothesis interface providing methods for saving and loading
a wrapper in XML format and for extracting a list of fields from an example set.
![]() |
| Figure 5.c Sequence diagram of the package bwi. |
In TIES the procedure which in [1] is called BWI corresponds to the learn
method of the BWI class, while the AdaBoost procedure corresponds to the boost
method defined in Boosting. Finally, the procedure LearnDetector
corresponds to the learnDetector defined in WeakLearner.
This section describes how to plug TIES in an application program.
[...]
[1] D. Freitag & N. Kushmerick, "Boosted wrapper induction", AAAI-00 (Austin), pp. 577-583.