Note: LearningPinocchio is no longer under active development.
Welcome to the page of LearningPinocchio, a prototype for adaptive
Information Extraction developed at ITC-Irst.
What is LearningPinocchio
LearningPinocchio is the clever brother of Pinocchio. Unlike Pinocchio, which
is based on manually developed cascades of rules, LearningPinocchio is based
on a covering algorithm that learns template-filling rules directly from tagged
corpora, with no further user intervention.
The following pages show some of its results on tagged corpora.
Technical description
For a technical description see
Fabio Ciravegna, "Learning to Tag for Information Extraction from Text",
in Fabio Ciravegna, Roberto Basili, Robert Gaizauskas (eds.),
ECAI Workshop on Machine Learning for Information Extraction,
workshop held in conjunction with ECAI2000, Berlin, August 2000.
A preliminary version of the manual is available on-line.
Scientific Experiments
Each experiment was performed by training
LearningPinocchio on a subset of the corpus (several hundred texts, depending
on the corpus) and testing the learned rules on unseen texts. In all the
experiments the task was to insert SGML tags into the texts to mark the
start and the end of each slot filler. For
example, <speaker> marks the point where the speaker's name starts
and </speaker> marks the point where it ends.
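As an illustration of the tagging task only, the following minimal sketch wraps known slot-filler spans in SGML tags. It is not LearningPinocchio's implementation: the function name `insert_tags`, the sample sentence, the character offsets and the `stime` tag name are all assumptions made for this example.

```python
def insert_tags(text, fillers):
    """Wrap each (start, end, slot) character span of `text` in SGML tags.

    `fillers` is a list of non-overlapping spans; `end` is exclusive.
    Hypothetical helper, used here only to show the expected output format.
    """
    pieces = []
    pos = 0
    for start, end, slot in sorted(fillers):
        pieces.append(text[pos:start])                      # untouched text
        pieces.append(f"<{slot}>{text[start:end]}</{slot}>")  # tagged filler
        pos = end
    pieces.append(text[pos:])                               # trailing text
    return "".join(pieces)


# Made-up sentence and offsets, not taken from any of the corpora.
text = "Speaker: John Smith. Time: 3:30 pm."
fillers = [(9, 19, "speaker"), (27, 34, "stime")]
tagged = insert_tags(text, fillers)
print(tagged)
```

In the experiments the spans are of course not given in advance: deciding where each opening and closing tag goes is what the learned rules have to do.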
For each corpus, the following are shown:
- the corpus texts, with the template slot fillers found by
LearningPinocchio highlighted
- the scores in terms of precision and recall on the whole corpus for
each tag
- a comparison between the performance of LearningPinocchio and that of
other systems presented in the literature (if any).
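Per-tag precision and recall can be computed along the following lines. This is a sketch under stated assumptions: a predicted filler counts as correct only if its document, start, end and tag all match a hand-tagged filler exactly. The page does not say which matching criterion was used in the experiments, and the function name and sample data are hypothetical.

```python
from collections import Counter


def precision_recall_per_tag(gold, predicted):
    """Return {tag: (precision, recall)} using exact span matching.

    `gold` and `predicted` are collections of (doc_id, start, end, tag).
    Exact matching is an assumption of this sketch.
    """
    gold, predicted = set(gold), set(predicted)
    n_correct = Counter(span[-1] for span in gold & predicted)
    n_gold = Counter(span[-1] for span in gold)
    n_pred = Counter(span[-1] for span in predicted)

    scores = {}
    for tag in sorted(set(n_gold) | set(n_pred)):
        # precision: correct fillers / fillers proposed by the system
        p = n_correct[tag] / n_pred[tag] if n_pred[tag] else 0.0
        # recall: correct fillers / fillers in the hand-tagged corpus
        r = n_correct[tag] / n_gold[tag] if n_gold[tag] else 0.0
        scores[tag] = (p, r)
    return scores


# Made-up spans for two documents.
gold = {("d1", 9, 19, "speaker"), ("d1", 27, 34, "stime"), ("d2", 5, 15, "speaker")}
predicted = {("d1", 9, 19, "speaker"), ("d1", 27, 31, "stime")}
scores = precision_recall_per_tag(gold, predicted)
print(scores)
```

Here the `speaker` tag is proposed once and correctly (precision 1.0), but one of the two gold speakers is missed (recall 0.5). The `stime` prediction has the wrong end offset, so under exact matching it scores zero on both measures.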
- Seminar Announcements taken from the CMU mailing list (in English)
  (corpus provided by Dayne Freitag - CMU - and made available through the
  RISE initiative)
  - Task: identifying speaker name, starting time, ending time and
    location of each seminar. Just one event per message.
- Job Announcements from misc.jobs.offered (in English)
  (corpus provided by Marie Elaine Califf - Univ. of Texas at Austin
  - and made available through the RISE initiative)
  - Task: identifying announcement message id, title, salary, company,
    recruiter, state, city, country, language, platform, application,
    area, req_years_experience, desired_years_experience,
    req_degree, desired_degree, post_date.
- Seminar Announcements taken from the ITC mailing list (mixed Italian/English)
  - Task: identifying speaker, title of the seminar, location, date and
    time of each seminar. More than one seminar can appear in each posting.
Industrial Applications
Real-world applications have been
developed or are under development for a number of companies. They
concern Named Entity Recognition from financial news in Italian, data
extraction from personal resumes in English, and information extraction from
classified articles in Italian. LearningPinocchio is
also under consideration for two industrial projects in the
area of pharmaceutical research.
Completed Applications
Ongoing Applications
- Named Entity Recognition from financial news in Italian
- Information extraction from classified articles in Italian
For further information, contact:
Alberto Lavelli (email: lavelli itc.it)
ITC-Irst, Centro per la Ricerca Scientifica e Tecnologica
Loc. Pantè di Povo, Trento, I-38050, Italia.
Page Created by:
Fabio Ciravegna