TIES (Trainable Information Extraction System) is an
ML-based Information Extraction (IE) system currently under
development at ITC-irst within the Dot.Kom project. TIES
automatically learns rules from a corpus previously
annotated with a predefined set of XML tags. The XML tags are
intended to identify instances of entities
from a set of relevant elements defined by the user.
TIES is an Information Extraction system developed in an object-oriented fashion
with Java. The application packages supply a set of interfaces and classes
for training, testing and running an extraction task both in traditional (natural
text) and wrapper (machine-generated or rigidly-structured text) domains.
TIES v1.x was essentially a reimplementation of the Boosted Wrapper Induction
(BWI) algorithm devised by Dayne Freitag and Nicholas Kushmerick [1]. TIES v2.x incorporates the BWI algorithm as one of the ML
algorithms that can be integrated within the general TIES architecture. Boosting
is a technique for improving the performance of a simple machine learning algorithm
(called weak learner) by repeatedly applying it to the training set with
different example weightings. In BWI an algorithm that learns simple low-coverage
wrapper-like extraction patterns is applied to IE problems using boosting. The
TIES system architecture is strongly based on boosting and wrapper induction
techniques, but it has a high degree of flexibility allowing programmers, if
they like, to develop their own weak learner implementation, as well as to add
new validation strategies. The default implementation exploits only simple
orthographic features, which map an individual token to an arbitrary set of
wildcards
(e.g. capitalized, lower-case, punctuation), but more complex
features (e.g., morpho-syntactic ones) can be added to improve performance
simply by using a customized preprocessor. In this case a different feature extraction
method has to be supplied. The system comes with a default implementation of all
the interfaces defined, so the application can also be used without programming
experience.
In the remaining sections of the tutorial, you are provided with step-by-step instructions for installing, configuring and performing common tasks using TIES. You will benefit most from this tutorial if you complete these sections in order. You should perform these tasks in the actual TIES environment--not a simulation.
The TIES software is available on all platforms supporting Java 2.
Linux/Intel version: supported on Intel Pentium platforms running the Linux kernel v 2.2.5 and glibc v 2.1, with the KDE and KWM window managers. 256 megabytes RAM minimum, 1 gigabyte RAM recommended.
Solaris/SPARC version: only Solaris versions 2.7 and 2.8 are supported. 256 megabytes RAM minimum, 1 gigabyte RAM recommended.
Mac OS version: only Mac OS X (10.1 through 10.2 strongly recommended). 256 megabytes RAM minimum, 1 gigabyte RAM recommended.
Microsoft Windows version: Microsoft Windows 98, NT 4.0, 2000 and XP on Intel hardware, with a Pentium III or faster processor. 512 megabytes RAM minimum, 1 gigabyte RAM recommended.
On all systems you should have at least 5 megabytes of free disk space before installing the software. If you also install the separate test corpus download bundle, you need an additional 10 megabytes of free disk space.
TIES uses elements of the Java 2 API such as collections, and therefore building requires the Java 2 Standard Edition SDK (Software Development Kit). To run TIES, the Java 2 Standard Edition RTE (Run Time Environment) is required (or you can use the SDK, of course).
TIES is also dependent upon a few packages for general functionality. They are included in the build/lib directory for convenience, but the default build target does not include them. If you use the default build target, you must add the dependencies to your classpath.
Make sure you have installed Sun's Java 2 Environment. The full version is available at http://java.sun.com/j2se. If you are limited by disk space or bandwidth, install the smaller, run-time only version at http://java.sun.com/j2se.
Download the jar file of the installer here. Copy the file into the directory where you want to install the program and then run:
jar -xvf ties-install-2.x.jar
Now you are ready to run TIES. The latest User Guide is available here and the Programmer's Guide is available here.
Developers can download source code here. They are all in tar format. Generally, you should use the version with the highest non-beta release.
Suppose you have some textual data and you want to extract information from it. Documents are usually available as plain text, or in HTML or XML format. However, the core ML engine of TIES expects them to be in the TIES input format (TIESIF, introduced in the following subsection), because token features have to be marked explicitly. The conversion to TIESIF can be done with one of the Tokenizers provided. Note that, despite the name (which suggests that they perform only tokenization), Tokenizers in TIES usually also perform more complex preprocessing tasks, such as feature extraction, as we will see in the following. If more complex features are needed, you have to implement a more powerful tokenizer or employ an external feature extractor.
The current way of representing datasets is called TIESIF format. Figure 1 shows a fragment of a TIES input file for the standard "CMU seminar announcements" extraction task.
<?xml version="1.0"?>
<corpus>
  <text path="./input/seminar" name="cmu.andrew.org.cfa.cfa-today-19:0">
    <body>
      ...
      <token id="146" type="word" start="438" len="10" alpha_token="true" capitalized_token="true">University</token>
      <token id="148" type="word" start="449" len="2" alpha_token="true" lower_case_token="true">of</token>
      <token id="150" type="word" start="452" len="10" alpha_token="true" capitalized_token="true">Pittsburgh</token>
      <token id="151" type="nl" start="462" len="1" nl_token="true">\n</token>
      <token id="153" type="word" start="464" len="4" alpha_token="true" capitalized_token="true">Name</token>
      <token id="154" type="punct" start="468" len="1" punct_token="true">:</token>
      <token id="156" type="tag" start="470" len="9" open_tag="true">speaker</token>
      <token id="157" type="abbrev" start="479" len="3" abbr_token="true">Dr.</token>
      <token id="159" type="word" start="483" len="7" alpha_token="true" capitalized_token="true">Jeffrey</token>
      <token id="161" type="abbrev" start="491" len="2" abbr_token="true">D.</token>
      <token id="163" type="word" start="494" len="6" alpha_token="true" capitalized_token="true">Hermes</token>
      <token id="164" type="tag" start="500" len="10" close_tag="true">/speaker</token>
      <token id="165" type="nl" start="510" len="1" nl_token="true">\n</token>
      <token id="167" type="word" start="512" len="11" alpha_token="true" capitalized_token="true">Affiliation</token>
      <token id="168" type="punct" start="523" len="1" punct_token="true">:</token>
      <token id="170" type="word" start="525" len="10" alpha_token="true" capitalized_token="true">Department</token>
      ...
    </body>
  </text>
  ...
  <text> ... </text>
  ...
</corpus>
Figure 1 A fragment of a tokenized dataset.
The input format is XML; this section describes each of the elements that are currently employed.
A collection of example texts is regarded as a corpus. The texts are marked with the <text> tag and may contain front matter, a text body, and back matter. The text body is tagged <body>. The overall structure of any text is thus defined by the following elements:
<corpus> contains multiple texts of any kind. No attributes.
<text> contains a single text. Attributes:
    path  the path of the file.
    name  the name of the file.
<body> contains the whole body of a single unitary text. No attributes.
<div> divides the body into subsections. Attributes:
    type  the division type (paragraph, sentence, etc.).
<token> contains a token which is regarded as a unit. Attributes are of two types: compulsory and domain specific. The compulsory attributes are:
    id  a unique identifier (of type integer) within the text.
    type  the token type (word, tag, punctuation, etc.). If type is equal to tag, the token identifies an example.
    start  the offset (of type integer) with respect to the beginning of the text (in number of characters).
    len  the token length (of type integer).
Examples are labeled using special tokens with the attribute type equal to tag. Each example is wrapped by a pair of such tokens, marked with the attributes open_tag and close_tag set to true, respectively. The contents of the two tokens are the names of the labels: in the example shown in Figure 1, they are speaker for the token with id equal to 156 and /speaker for the token with id equal to 164.
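The labeling convention above can be illustrated with a minimal Java sketch that reads a TIESIF document with the standard JDK DOM parser and collects the text between each open_tag/close_tag pair. The class TiesifSpans and its method are hypothetical helpers written for this guide, not part of the TIES API; the sketch assumes that labels are not nested and simply joins the token contents with a single space instead of using the start and len offsets.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the TIES API.
class TiesifSpans {

    // Returns one "label: text" string per labeled example found in the input.
    static List<String> labeledSpans(String tiesifXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(tiesifXml)));
            NodeList tokens = doc.getElementsByTagName("token");
            List<String> spans = new ArrayList<String>();
            String label = null;                    // label of the currently open example
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < tokens.getLength(); i++) {
                Element t = (Element) tokens.item(i);
                boolean isTag = "tag".equals(t.getAttribute("type"));
                if (isTag && "true".equals(t.getAttribute("open_tag"))) {
                    label = t.getTextContent();     // e.g. "speaker"
                    text.setLength(0);
                } else if (isTag && "true".equals(t.getAttribute("close_tag"))) {
                    if (label != null) {
                        spans.add(label + ": " + text.toString().trim());
                    }
                    label = null;
                } else if (label != null) {
                    // ordinary token inside an open example
                    text.append(t.getTextContent()).append(' ');
                }
            }
            return spans;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable TIESIF document", e);
        }
    }
}
```

Applied to the fragment in Figure 1, a call to TiesifSpans.labeledSpans would yield a single entry for the speaker label.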
Note that some algorithms use labeled examples as positive examples and consider all the other tokens as negative examples, but this setting can be changed and negative examples can be explicitly labeled (see the match-all parameter in the FastWeakLearner class and DefaultFieldExtraction class). This feature can be exploited, for example, whenever a preprocessor that limits the candidate examples for the algorithm exists.
The weak learner is a symbolic rule-learning algorithm that searches for extraction patterns based on token features. Examples are defined in terms of features, which are functions mapping examples to discrete values. That is, an example is transformed into a collection of features, thereby producing an N-dimensional vector. Each word (token) is treated as an example by the weak learner; for instance, a possible feature is capitalized, a function that maps a word to the set {true, false}. Given such a feature, we can express simple propositions about specific tokens:
capitalized("Home") = true
capitalized("work") = false
All features used by the algorithm must be declared explicitly in the input format file (see domain specific attributes); the only implicit features are the token itself and any_token, a feature that matches all tokens. The default implementation of the tokenizer encapsulates a basic feature extractor that exploits only simple features, namely:
alpha_token  true for tokens that contain only alphabetic characters
num_token  true for numbers
perc_token  true for percentages
capitalized_token  true for tokens that begin with an upper-case letter
lower_case_token  true for tokens that contain only lower-case letters
punc_token  true for punctuation tokens
upper_case_token  true for tokens that contain only upper-case letters
single_char_token  true for any one-character tokens
date_token  true for dates
time_token  true for times
abbr_token  true for abbreviations
symb_token  true for symbols
The feature extraction method generates only the features judged to be active in the example. All other features are judged to be inactive, or false.
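As an illustration of how such orthographic features map a token to a set of active features, here is a minimal Java sketch covering six of the features listed above. The class OrthographicFeatures is hypothetical and the character tests are assumptions; the rules actually applied by the TIES tokenizers may differ (for example for accented letters or mixed tokens). Like the default extractor, it returns only the features that are active.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch, not the TIES feature extractor.
class OrthographicFeatures {

    // Returns the names of the features that are active for the given token.
    static Set<String> extract(String token) {
        Set<String> active = new LinkedHashSet<String>();
        if (token.isEmpty()) {
            return active;
        }
        boolean allLetters = true;
        boolean allLower = true;
        boolean allUpper = true;
        boolean allDigits = true;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            allLetters &= Character.isLetter(c);
            allLower &= Character.isLowerCase(c);
            allUpper &= Character.isUpperCase(c);
            allDigits &= Character.isDigit(c);
        }
        if (allLetters) active.add("alpha_token");
        if (allDigits) active.add("num_token");
        if (Character.isUpperCase(token.charAt(0))) active.add("capitalized_token");
        if (allLower) active.add("lower_case_token");
        if (allUpper) active.add("upper_case_token");
        if (token.length() == 1) active.add("single_char_token");
        return active;
    }
}
```

With these rules the token University activates alpha_token and capitalized_token, and the token of activates alpha_token and lower_case_token, which agrees with the attributes shown in Figure 1.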
User-defined features (e.g. morpho-syntactic features), if available, can be provided to the algorithm, but a tokenizer reimplementation or an external feature extractor is needed. The features must be declared as name-value pairs. It's important to point out that a proposition about a particular token may be defined in several ways; for instance, a token's part of speech can be defined as:
noun="true"
pos="noun"
Both declarations are valid, but a uniform representation should be used throughout the same dataset.
TIES can perform three different tasks: learning, testing and extraction.
The following section describes in detail the XML configuration mechanism, which is strongly recommended. For a complete API description see the TIES API documentation; a Developer's Guide is planned in a few months.
TIES is implemented using a set of modules. Each module has a number of settable properties and implements one or more interfaces, providing a piece of functionality.
The modules can be configured and assembled in several ways, but the most flexible mechanism uses XML files. Each module is described by an XML element, with subelements and attributes used to set module properties. By specifying which modules and their attributes to use, you have a lot of flexibility in controlling the features of your instance of TIES.
ties-config is the main element in the configuration file. It has multiple children describing the TIES modules: validation strategy, extraction strategy, weak learner, boosting, tokenizer and corpus loader. The directives controlling the input and output are also put into the configuration file, as well as the cache configuration (which is essential to optimize the system performance and reduce the memory usage). Figure 2 shows an example of a configuration file for the standard "CMU seminar announcements" extraction task. The following subsections describe each part of the configuration file.
<!-- Configuration file for the standard "CMU seminar announcements" extraction task -->
<ties-config>
  <validation-strategy>
    <validation-class>org.itc.irst.tcc.ties.validation.NFoldCrossValidation</validation-class>
    <init-param> <param-name>n</param-name> <param-value>5</param-value> </init-param>
    <init-param> <param-name>hypothesis-file</param-name> <param-value>./bwi/sa/out.xml</param-value> </init-param>
    <init-param> <param-name>eval-file</param-name> <param-value>./bwi/sa/bwi-eval.csv</param-value> </init-param>
  </validation-strategy>
  <bwi>
    <extraction-strategy>
      <extraction-class>org.itc.irst.tcc.ties.bwi.DefaultFieldExtraction</extraction-class>
      <init-param> <param-name>tau</param-name> <param-value>0</param-value> </init-param>
      <init-param> <param-name>match-all</param-name> <param-value>true</param-value> </init-param>
      <init-param> <param-name>close-all</param-name> <param-value>false</param-value> </init-param>
    </extraction-strategy>
    <weak-learner>
      <learner-class>org.itc.irst.tcc.ties.bwi.FastWeakLearner</learner-class>
      <init-param> <param-name>lookahead</param-name> <param-value>3</param-value> </init-param>
      <init-param> <param-name>cache-size</param-name> <param-value>5</param-value> </init-param>
      <init-param> <param-name>match-all</param-name> <param-value>true</param-value> </init-param>
    </weak-learner>
    <boosting>
      <boosting-class>org.itc.irst.tcc.dot.kom.bwi.boosted.BWI</boosting-class>
      <init-param> <param-name>iterations</param-name> <param-value>100</param-value> </init-param>
      <init-param> <param-name>labels</param-name> <param-value>stime,location</param-value> </init-param>
    </boosting>
  </bwi>
  <tokenizer>
    <tokenizer-class>org.itc.irst.tcc.util.tokenizer.EnglishTokenizer</tokenizer-class>
    <init-param> <param-name>input</param-name> <param-value>./input/seminar</param-value> </init-param>
    <init-param> <param-name>output</param-name> <param-value>./input/seminar.xml</param-value> </init-param>
    <init-param> <param-name>nl</param-name> <param-value>true</param-value> </init-param>
    <init-param> <param-name>sp</param-name> <param-value>false</param-value> </init-param>
  </tokenizer>
  <corpus-loader>
    <loader-class>org.itc.irst.tcc.ties.data.DefaultCorpusLoader</loader-class>
    <init-param> <param-name>input</param-name> <param-value>./input/input.xml</param-value> </init-param>
    <init-param> <param-name>atags</param-name> <param-value>stime,location,etime,speaker</param-value> </init-param>
    <init-param> <param-name>itags</param-name> <param-value>paragraph,sentence</param-value> </init-param>
    <init-param> <param-name>features</param-name> <param-value>alpha_token,lower_token,upper_token,cap_token,nl_token, punc_token,schar_token</param-value> </init-param>
  </corpus-loader>
</ties-config>
Figure 2 An example of a configuration file.
Lines enclosed within <!-- and --> are ignored, as are empty lines.
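To make the init-param convention concrete, the following Java sketch reads the name-value pairs of one module element from a configuration document using the standard JDK DOM parser. It only illustrates the file layout shown in Figure 2: the class TiesConfigParams is hypothetical, and this is not how TIES itself loads its configuration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the TIES API.
class TiesConfigParams {

    // Collects the init-param pairs found below the first element named
    // moduleTag (e.g. "validation-strategy" or "weak-learner").
    static Map<String, String> initParams(String configXml, String moduleTag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(configXml)));
            Map<String, String> params = new LinkedHashMap<String, String>();
            NodeList modules = doc.getElementsByTagName(moduleTag);
            if (modules.getLength() == 0) {
                return params;              // module not configured
            }
            Element module = (Element) modules.item(0);
            // note: getElementsByTagName also returns nested descendants
            NodeList inits = module.getElementsByTagName("init-param");
            for (int i = 0; i < inits.getLength(); i++) {
                Element p = (Element) inits.item(i);
                String name = p.getElementsByTagName("param-name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("param-value").item(0).getTextContent().trim();
                params.put(name, value);
            }
            return params;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable configuration", e);
        }
    }
}
```

For the file in Figure 2, asking for the validation-strategy module would return the n, hypothesis-file and eval-file entries in document order.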
The validation strategy module is responsible for running an experiment that consists of training and testing an algorithm over a specified dataset. The following describes how to choose and configure a validation strategy.
The ties-config.validation-strategy.validation-class
field is a compulsory field required to specify the validation strategy implementation.
The value of this field is the name of the java class that implements the validation
strategy interface. For example, in the file shown in Figure 2 the N-Fold
Cross-Validation has been chosen setting the field value to org.itc.irst.tcc.ties.validation.NFoldCrossValidation.
For a detailed description of the cross validation strategies see [2]
or the TIES API documentation.
The ties-config.validation-strategy.init-param fields are used to specify initialization parameters for the specified validation strategy. The following subsections describe each validation strategy available.
This strategy divides the entire data set into N subsets of (approximately) equal size. Then, it trains the system N times, each time leaving
out one of the subsets from training and using such subset as test set. Table 1.a shows the parameters to set for NFoldCrossValidation. A popular way to perform this validation is to set N to 10. TenFoldCrossValidation has N set to 10 (Table 1.b). If N equals the sample
size, this is called Leave-One-Out cross-validation. Table 1.c. shows the parameters to set for LeaveOneOut.
Table 1.a N-Fold Cross-Validation parameters description.
Table 1.b 10-Fold Cross-Validation parameters description.
Table 1.c Leave-One-Out parameters description.
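The partitioning behind N-fold cross-validation can be sketched in a few lines of Java. This is a generic illustration, not the NFoldCrossValidation class: the class NFoldSplit is hypothetical, texts are assigned to folds round-robin (an assumption; TIES may partition differently), and the special case n = 1, which TIES uses to learn on the whole dataset, is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of N-fold partitioning, not the TIES implementation.
class NFoldSplit {

    // Fold k (0-based) receives every item whose index i satisfies i % n == k,
    // so fold sizes differ by at most one.
    static <T> List<List<T>> folds(List<T> items, int n) {
        List<List<T>> folds = new ArrayList<List<T>>();
        for (int k = 0; k < n; k++) {
            folds.add(new ArrayList<T>());
        }
        for (int i = 0; i < items.size(); i++) {
            folds.get(i % n).add(items.get(i));
        }
        return folds;
    }

    // Training set for round k: every fold except fold k, which is the test set.
    static <T> List<T> trainingSet(List<List<T>> folds, int k) {
        List<T> train = new ArrayList<T>();
        for (int j = 0; j < folds.size(); j++) {
            if (j != k) {
                train.addAll(folds.get(j));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("t1", "t2", "t3", "t4", "t5", "t6", "t7");
        List<List<String>> folds = folds(texts, 3);
        for (int k = 0; k < folds.size(); k++) {
            System.out.println("test=" + folds.get(k) + " train=" + trainingSet(folds, k));
        }
    }
}
```

Setting n to the number of texts makes every fold contain a single text, which corresponds to Leave-One-Out.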
This strategy differs from N-fold cross-validation only in that the folds are chosen randomly. Table 2 shows the initial parameters to set for RandomSplitCrossValidation.
Table 2 Random Split Validation parameters description.
In this method only a single subset (the testing set) is used to
estimate the generalization error, instead of N different subsets; i.e.,
there is no "crossing". Table 3 shows the initial parameters to set for this validation strategy.
The peculiarity of this strategy is the possibility to set the texts to use for training and testing. This is done by enumerating in training-file and testing-file the texts (one per line) for training and testing, respectively.
Table 3 Single Split Validation parameters description.
This method extends the single split validation by introducing
cross-validation. Table 4 shows the initial parameters to set for this validation
strategy. As above, the peculiarity of this strategy is the possibility of specifying
the texts to use for training and testing. This is done by enumerating in train.x
and test.x the texts (one per line) for training and testing, where
x is the partition number, from 1 to n. By convention, the
training-file and testing-file parameters do not
include the partition number: if n is 2, training-file
is split/train and testing-file is split/test,
the system will learn on the texts specified in train.1 and train.2
and test on test.1 and test.2, respectively.
Table 4 Multi Split Validation parameters description.
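The naming convention for the partition files can be written down as a short Java sketch. The class MultiSplitFiles and the base name split/train used below are illustrative assumptions; the sketch only appends the partition number to the base name given in the configuration and does not read the files.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the partition-file naming convention.
class MultiSplitFiles {

    // Expands a base name such as "split/train" into
    // "split/train.1" ... "split/train.n".
    static List<String> partitionFiles(String baseName, int n) {
        List<String> names = new ArrayList<String>();
        for (int x = 1; x <= n; x++) {
            names.add(baseName + "." + x);
        }
        return names;
    }

    public static void main(String[] args) {
        // illustrative base names; use the values of training-file and testing-file
        System.out.println(partitionFiles("split/train", 2));
        System.out.println(partitionFiles("split/test", 2));
    }
}
```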
The extraction strategy module is responsible for extracting a list of entities using the hypothesis learned. Changing the parameter settings can force a different tradeoff between precision and recall. The following describes how to choose and configure an extraction strategy.
The ties-config.bwi.extraction-strategy.extraction-class
field is a compulsory field required to specify the extraction strategy implementation.
The value of this field is the name of the java class that implements the extraction
strategy interface. For example, in the file shown in Figure 2 the DefaultFieldExtraction
has been chosen by setting the field value to org.itc.irst.tcc.ties.bwi.DefaultFieldExtraction.
For a detailed description of the extraction strategies see the
Developer's Guide or the TIES API documentation.
The ties-config.bwi.extraction-strategy.init-param
fields are used to specify initialization parameters for the specified extraction
strategy. The following table shows all the parameters to be set for the DefaultFieldExtraction:
Table 5 Default Field Extraction parameters description.
A learning algorithm is called a weak learner if it produces a weak hypothesis. A weak hypothesis approximates the target concept "a little better" than a random guess. The following describes how to choose and configure a weak learner.
The ties-config.bwi.weak-learner.learner-class
field is a compulsory field required to specify the weak learner implementation.
The value of this field is the name of the java class that implements the weak
learner interface. For example, in the file shown in Figure 2 the FastWeakLearner
has been chosen by setting the field value to org.itc.irst.tcc.ties.bwi.FastWeakLearner.
For a detailed description of the weak learner see [1] or
the TIES API documentation.
The ties-config.bwi.weak-learner.init-param fields
are used to set initialization parameters for the specified weak learner. The
following table shows all the parameters to be set for the FastWeakLearner:
Table 6 FastWeakLearner parameters description.
A boosting algorithm (booster) is capable of producing a good approximation of the target concept using another algorithm (the weak learner), which is only able to weakly approximate the target concept. The following describes how to choose and configure the boosting module.
The ties-config.bwi.boosting.boosting-class field
is a compulsory field required to specify the boosting implementation. The value
of this field is the name of the java class that implements the boosting interface.
For a detailed description of the boosting algorithm see [1]
or the TIES API documentation.
The ties-config.bwi.boosting.init-param fields
are used to set initialization parameters for the specified boosting implementation. The following
table shows all the initial parameters to be set:
Table 7 BWI parameters description.
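The idea of reapplying a weak learner with different example weightings can be illustrated with one round of the textbook AdaBoost weight update. This is a generic sketch under stated assumptions, not the BWI code: the class BoostingRound is hypothetical, the weak hypothesis is represented only by which examples it classified correctly, and the exact update used by TIES (see [1]) may differ.

```java
// Hypothetical sketch of one AdaBoost-style round, not the TIES implementation.
class BoostingRound {

    // Weighted error: total weight of the examples the weak hypothesis got wrong.
    static double weightedError(double[] weights, boolean[] correct) {
        double err = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (!correct[i]) {
                err += weights[i];
            }
        }
        return err;
    }

    // Confidence of the weak hypothesis; requires 0 < err < 1.
    static double alpha(double err) {
        return 0.5 * Math.log((1.0 - err) / err);
    }

    // Increases the weight of misclassified examples, decreases the weight of
    // correctly classified ones, then renormalizes so the weights sum to 1.
    static double[] reweight(double[] weights, boolean[] correct) {
        double a = alpha(weightedError(weights, correct));
        double[] next = new double[weights.length];
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            next[i] = weights[i] * Math.exp(correct[i] ? -a : a);
            sum += next[i];
        }
        for (int i = 0; i < next.length; i++) {
            next[i] /= sum;
        }
        return next;
    }

    public static void main(String[] args) {
        double[] w = {0.25, 0.25, 0.25, 0.25};
        boolean[] correct = {true, true, true, false};
        System.out.println(java.util.Arrays.toString(reweight(w, correct)));
    }
}
```

After the update the misclassified examples carry half of the total weight, so the next application of the weak learner is pushed towards the examples that are still handled badly; the iterations parameter controls how many such rounds are run.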
The tokenizer module tokenizes the files that define the examples to be learned by TIES. A single file or a directory (read recursively) can be specified. Plain text (TXT) and tagged (SGML, XML and HTML) documents are accepted as input by the tokenizer implementation.
The ties-config.tokenizer.tokenizer-class field
is a compulsory field required to specify the tokenizer implementation. The
value of this field is the name of the java class that implements the tokenizer
interface. For example in the configuration file shown in Figure 2 the English
Tokenizer has been chosen setting the field value to org.itc.irst.tcc.util.tokenizer.EnglishTokenizer.
For a detailed description of the tokenizer see the TIES API documentation.
The ties-config.tokenizer.init-param fields are used to set initialization parameters for the specified tokenizer. The following tables show all the parameters to be set for the English and Italian Tokenizer:
Table 8.a English Tokenizer parameters description.
Table 8.b Italian Tokenizer parameters description.
This module loads the TIES input format into memory and marks the entities to learn (specified by the atags list) as positive examples. Moreover, a list of ignorable tags (itags) can be provided for skipping tags that should not be included into the wrapper learned.
The ties-config.corpus-loader.loader-class field
is a compulsory field required to specify the corpus loader. The value of this
field is the name of the java class that implements the corpus loader interface.
For example in the configuration file shown in Figure 2 the Default Corpus Loader
has been chosen setting the field value to org.itc.irst.tcc.ties.data.DefaultCorpusLoader.
For a detailed description of the corpus loader see the TIES API documentation.
The ties-config.corpus-loader.init-param fields are used to set initialization parameters for the specified corpus loader. The following table shows all the parameters to be set for the Default Corpus Loader:
Table 9 Default Corpus Loader parameters description.
To run a cross-validation on your dataset:
1. Move to the directory that contains the application.
2. Set the CLASSPATH environment variable to the TIES root directory and the following jar files (distributed in the lib directory):
3. Define a new configuration file your-bwi-conf.xml as described in the previous section.
4. Run the command:
   java -Dconfig.file=your-bwi-conf.xml -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser
       org.itc.irst.tcc.ties.bwi.BoostingFactory [-notokenizer] [> your-output-file]
You have to specify the configuration file on the command line using the -Dconfig.file option. By default the program looks for a file named ties-config.xml in the application directory.
The result of the cross validation is output in CSV format; for each boosting round the following figures are returned: true positive count (tp), false negative count (fn), false positive count (fp), recall, precision and F(1)-measure.
Table 10 A 5-fold cross-validation result for the speaker field in the CMU seminar announcements.
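The relationship between the counts and the derived figures in the CSV output follows the standard definitions, sketched below in Java. The class EvalFigures is a hypothetical helper; returning 0 when a denominator is zero is an assumption of this sketch and may not match what TIES writes in that case.

```java
// Hypothetical helper computing the standard evaluation figures.
class EvalFigures {

    // precision = tp / (tp + fp)
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // recall = tp / (tp + fn)
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // F(1)-measure: harmonic mean of precision and recall
    static double f1(int tp, int fn, int fp) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // illustrative counts, not taken from a TIES run
        int tp = 8, fn = 2, fp = 2;
        System.out.println(precision(tp, fp) + " " + recall(tp, fn) + " " + f1(tp, fn, fp));
    }
}
```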
In order to generate the final extractor, run the procedure as described above using N-Fold Cross-Validation and setting the n parameter to 1; the system will learn on the whole dataset.
The wrapper learned is stored in XML format and it can be used later for the extraction task. Figure 3 shows a fragment of a wrapper learned for the speaker field in the CMU seminar announcements.
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<wrapper label="speaker">
  <fore-detector>
    <detector>
      <pattern type="prefix">
        <feature name="token" value="Who"/>
        <feature name="single_char_token" value="true"/>
      </pattern>
      <pattern type="suffix">
        <feature name="alpha_token" value="true"/>
      </pattern>
      <confidence-value>2.7587264482323546</confidence-value>
    </detector>
    <detector>
      <pattern type="prefix">
        <feature name="token" value="speaker"/>
        <feature name="single_char_token" value="true"/>
      </pattern>
      <pattern type="suffix">
      </pattern>
      <confidence-value>2.2216808574759974</confidence-value>
    </detector>
    ...
</wrapper>
Figure 3 A fragment of the wrapper learned encoded in XML.
Table 11 A fragment of the wrapper learned after applying an XSL transformation.
We plan to add some comments to better understand the example above.
The following section describes how to perform the extraction task on a given input. It is strongly recommended to configure the extraction module using the XML configuration file. Please do not confuse the following configuration file, employed for extraction, with the preceding file, employed for learning the model. The result is output in XML format. If further manipulation of the extracted entities is required, a Java API is provided.
<!-- Configuration file for the standard "CMU seminar announcements" extraction task -->
<ties-config>
  <bwi>
    <classifier>
      <extract>
        <entity>
          <wrapper>./bwi/sa/new-out0-speaker.xml</wrapper>
          <output>./bwi/sa/res0-speaker.xml</output>
        </entity>
        <entity>
          <wrapper>./bwi/sa/new-out0-location.xml</wrapper>
          <output>./bwi/sa/res0-location.xml</output>
        </entity>
      </extract>
    </classifier>
    <extraction-strategy>
      <extraction-class>org.itc.irst.tcc.ties.bwi.DefaultFieldExtraction</extraction-class>
      <init-param> <param-name>tau</param-name> <param-value>0</param-value> </init-param>
      <init-param> <param-name>match-all</param-name> <param-value>true</param-value> </init-param>
      <init-param> <param-name>close-all</param-name> <param-value>false</param-value> </init-param>
    </extraction-strategy>
  </bwi>
  <tokenizer>
    <tokenizer-class>org.itc.irst.tcc.util.tokenizer.EnglishTokenizer</tokenizer-class>
    <init-param> <param-name>input</param-name> <param-value>./input/seminar</param-value> </init-param>
    <init-param> <param-name>nl</param-name> <param-value>true</param-value> </init-param>
    <init-param> <param-name>sp</param-name> <param-value>false</param-value> </init-param>
  </tokenizer>
  <corpus-loader>
    <loader-class>org.itc.irst.tcc.ties.data.DefaultCorpusLoader</loader-class>
    <init-param> <param-name>atags</param-name> <param-value>stime,location,etime</param-value> </init-param>
    <init-param> <param-name>itags</param-name> <param-value>speaker,paragraph,sentence</param-value> </init-param>
    <init-param> <param-name>features</param-name> <param-value>alpha_token,lower_token,upper_token,cap_token,nl_token, punc_token,schar_token</param-value> </init-param>
  </corpus-loader>
</ties-config>
Figure 4 An example of a configuration file for extraction.
ties-config is the main element in the configuration file. It has multiple children describing the TIES modules employed for extraction: classifier, extraction strategy, tokenizer and corpus loader. Figure 4 shows an example of a configuration file for the standard "CMU seminar announcements" extraction task. The following subsections describe the configuration of the classifier module; the tokenizer and corpus loader modules are described in the Running TIES section.
This module provides methods for extracting information from a single text or a set of texts given the model (wrapper) learned during training. The result is output in a file encoded in XML.
The ties-config.classifier.extract.wrapper field is a compulsory field required to specify the model to apply for extraction.
The ties-config.classifier.extract.output field is used to set the output file. The following table shows all the parameters to be set for the provided Classifier:
Table 12 Classifier parameters description.
To perform the extraction task on an input file, do the following:
1. Move to the directory that contains the application.
2. Set the CLASSPATH environment variable to the TIES root directory and the following jar files (distributed in the lib directory):
3. Define a new configuration file your-classifier-config.xml as described in the previous section.
4. Run the command:
   java -Dconfig.file=your-classifier-config.xml
       -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser org.itc.irst.tcc.ties.bwi.Classifier
       [> your-output-file]
<entity-list>
  <entity name="speaker" src=".\CMUAN~3G.DER" start="142" end="151">
    <token start="142" len="3">Mr.</token>
    <token start="146" len="5">Okada</token>
  </entity>
  <entity name="speaker" src=".\CMUAN~EU.ACE" start="330" end="348">
    <token start="330" len="3">Dr.</token>
    <token start="334" len="9">Stephanie</token>
    <token start="344" len="4">Shaw</token>
  </entity>
  <entity name="speaker" src=".\CMUAN~G2.CAR" start="624" end="640">
    <token start="624" len="3">Mr.</token>
    <token start="628" len="6">Andrew</token>
    <token start="635" len="5">Gault</token>
  </entity>
  <entity name="speaker" src=".\CMUAN~G2.CAR" start="810" end="826">
    <token start="810" len="3">Mr.</token>
    <token start="814" len="6">Jessie</token>
    <token start="821" len="5">Ramey</token>
  </entity>
  <entity name="speaker" src=".\CMUAN~G2.CAR" start="880" end="896">
    <token start="880" len="3">Dr.</token>
    <token start="884" len="4">Judi</token>
    <token start="889" len="7">Mancuso</token>
  </entity>
</entity-list>
Figure 5 A fragment of the output file obtained in the standard CMU seminar announcements extraction task: sequences of tokens extracted for the speaker field.
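If you only need to post-process the XML output, without going through the TIES API, the file can be read with the standard JDK DOM parser. The following Java sketch is a hypothetical helper (the class EntityListReader is not part of TIES) that turns each entity element into a tab-separated line containing the label, the character span and the token text joined by single spaces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the TIES API.
class EntityListReader {

    // Returns one "name<TAB>start-end<TAB>text" line per entity element.
    static List<String> entities(String entityListXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(entityListXml)));
            NodeList entities = doc.getElementsByTagName("entity");
            List<String> lines = new ArrayList<String>();
            for (int i = 0; i < entities.getLength(); i++) {
                Element e = (Element) entities.item(i);
                NodeList tokens = e.getElementsByTagName("token");
                StringBuilder text = new StringBuilder();
                for (int j = 0; j < tokens.getLength(); j++) {
                    if (j > 0) {
                        text.append(' ');   // tokens are joined with one space
                    }
                    text.append(tokens.item(j).getTextContent());
                }
                lines.add(e.getAttribute("name") + "\t"
                        + e.getAttribute("start") + "-" + e.getAttribute("end") + "\t"
                        + text);
            }
            return lines;
        } catch (Exception ex) {
            throw new IllegalArgumentException("not a readable entity list", ex);
        }
    }
}
```

To read the output file from disk, the same parser can be given a java.io.File instead of the string-based InputSource used here.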
This section describes how to manipulate the extracted entities
using the TIES API. ExtractionResult acts as a container for the
result of an extraction task. ExtractionResult maps files to sets
of pairs consisting of a label and a set of entities. The map cannot contain duplicate files; each file can
map to many entities. Each set of entities is sorted according to the extraction score.
import java.io.File;
import java.util.Iterator;
import java.util.Map;
import java.util.SortedSet;
// the Classifier class run from the command line above
import org.itc.irst.tcc.ties.bwi.Classifier;
// ExtractionResult and Entity must also be imported; see the TIES API
// documentation for their packages

// set the config file; alternatively, run the application with
// the option -Dconfig.file=your_classifier_conf.xml
System.setProperty("config.file", "your_classifier_conf.xml");

// instantiate the Classifier and extract the entities;
// configuration parameters are loaded from the configuration file
ExtractionResult er = new Classifier().extract();

Iterator it = er.entries().iterator();
while (it.hasNext())
{
    // iterate on the files
    Map.Entry e = (Map.Entry) it.next();

    // get the file
    File f = (File) e.getKey();

    // get the entities extracted from the file, grouped by label
    ExtractionResult.EntityMap em = (ExtractionResult.EntityMap) e.getValue();

    Iterator it1 = em.entries().iterator();
    while (it1.hasNext())
    {
        // iterate on the labels
        Map.Entry e2 = (Map.Entry) it1.next();
        String label = (String) e2.getKey();
        SortedSet ss = (SortedSet) e2.getValue();

        // get the entity with the highest score
        // (ss.last() is the entity with the lowest score)
        Entity entity = (Entity) ss.first();

        // add code here
    } // end inner while
} // end outer while
Figure 6 A fragment of code for manipulating the information extracted.
The application comes with the dataset (/input folder) and configuration files (/SA folder) for the standard "CMU seminar announcements" extraction task. Try the following steps to extract the speaker, location, stime and etime fields:
1. Move to the directory that contains the application.
2. Set the CLASSPATH environment variable to the TIES root directory and the following jar files (distributed in the lib directory):
3. Modify ./bwi/sa/ties-config.xml, as described in the Configuration File section, to set your cross-validation parameters.
4. Run the cross-validation process:
   java -mx1024M -Dconfig.file=./bwi/sa/ties-config.xml
       -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser org.itc.irst.tcc.ties.bwi.BoostingFactory
       > ./bwi/sa/out.bwi
5. Read the result of evaluation in ./bwi/sa/bwi-eval.csv.
6. Repeat step 3 learning on the whole dataset, i.e. using N-Fold Cross-Validation and setting n to 1.
7. Modify ./bwi/sa/classifier-config.xml, as described in the Classifier configuration section, to set your classifier parameters.
8. Extract entities from a new text:
   java -Dconfig.file=./bwi/sa/classifier-config.xml -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser
       -mx256M org.itc.irst.tcc.ties.bwi.Classifier > ./bwi/sa/out1.bwi
See the Release Notes on the TIES web site for a summary of changes to the software since the previous version and other information pertaining to this release. The online release notes will be updated as needed, so you should check them regularly for the latest information.
[1] D. Freitag & N. Kushmerick, "Boosted wrapper induction", AAAI-00 (Austin), pp. 577-583.
[2] Ian H. Witten, Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, October 1999, 416 pages, ISBN 1-55860-552-5.
If you have any comments about these pages, please contact: giuliano@itc.it