NOTE THAT LearningPinocchio IS NO LONGER UNDER ACTIVE DEVELOPMENT


 

Welcome to the page of LearningPinocchio, a prototype for adaptive Information Extraction developed at ITC-Irst.
 

What is LearningPinocchio

LearningPinocchio is the clever brother of Pinocchio. Unlike Pinocchio, which is based on manually developed cascades of rules, LearningPinocchio is based on a covering algorithm that learns template-filling rules directly from tagged corpora, without any other user intervention.
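To make the idea of a covering algorithm concrete, here is a minimal sketch of generic sequential covering. It is not the actual LearningPinocchio learner: the function name `covering`, the feature-set representation of examples, the fixed pool of candidate rules, and the `min_precision` threshold are all assumptions made for illustration.

```python
def covering(examples, candidate_rules, min_precision=1.0):
    """Generic sequential covering (illustrative, not LearningPinocchio's code).

    examples: list of (features, is_positive), where features is a frozenset.
    candidate_rules: list of frozensets; a rule matches an example when all
    of its conditions are contained in the example's features.
    """
    uncovered = [f for f, pos in examples if pos]
    negatives = [f for f, pos in examples if not pos]
    learned = []
    while uncovered:
        best, best_hits = None, 0
        for rule in candidate_rules:
            hits = sum(rule <= f for f in uncovered)    # positives matched
            errors = sum(rule <= f for f in negatives)  # negatives matched
            if hits == 0:
                continue
            precision = hits / (hits + errors)
            # Keep the sufficiently precise rule that covers the most positives.
            if precision >= min_precision and hits > best_hits:
                best, best_hits = rule, hits
        if best is None:
            break  # no acceptable rule covers the remaining positives
        learned.append(best)
        # Remove the positives covered by the chosen rule, then repeat.
        uncovered = [f for f in uncovered if not (best <= f)]
    return learned


# Hypothetical toy data: features of tokens that do or do not start a filler.
examples = [
    (frozenset({"prev=by", "capitalized"}), True),
    (frozenset({"prev=Dr.", "capitalized"}), True),
    (frozenset({"prev=at", "capitalized"}), False),
]
candidates = [frozenset({"prev=by"}), frozenset({"prev=Dr."}), frozenset({"capitalized"})]
rules = covering(examples, candidates)
```

In this toy run the rule `capitalized` is rejected because it also matches a negative example, so the two more specific rules are selected one after the other until every positive example is covered.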

The following pages show some of its results on tagged corpora.

Technical description

For a technical description see:
Fabio Ciravegna, "Learning to Tag for Information Extraction from Text", in Fabio Ciravegna, Roberto Basili, Robert Gaizauskas (eds.), ECAI Workshop on Machine Learning for Information Extraction, workshop held in conjunction with ECAI2000, Berlin, August 2000.
A preliminary version of the manual is available online.

Scientific Experiments

Each experiment was performed by training LearningPinocchio on a subset of the corpus (some hundreds of texts, depending on the corpus) and testing the learned rules on unseen texts. In all the experiments the task was to insert into the texts SGML tags marking the starting point and the ending point of each slot filler. For example, <speaker> marks the point where the speaker's name starts and </speaker> marks the point where it ends.
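The tagging task described above can be illustrated with a small sketch. It only shows what the expected output looks like once the filler boundaries are known; the function `insert_tags`, the character-offset representation, and the sample sentence are assumptions for illustration, not part of LearningPinocchio.

```python
def insert_tags(text, fillers):
    """Wrap each (start, end, slot) character span in <slot>...</slot> tags."""
    # Work from right to left so earlier offsets stay valid after insertion.
    for start, end, slot in sorted(fillers, reverse=True):
        text = (text[:start] + f"<{slot}>" + text[start:end]
                + f"</{slot}>" + text[end:])
    return text


# Hypothetical sentence in the style of a seminar announcement.
sentence = "The seminar will be given by John Smith at 4 pm."
start = sentence.index("John Smith")
tagged = insert_tags(sentence, [(start, start + len("John Smith"), "speaker")])
print(tagged)
# The seminar will be given by <speaker>John Smith</speaker> at 4 pm.
```

A learned rule decides where each opening or closing tag goes; the sketch takes those positions as given.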
For each corpus, the following is shown:

Industrial Applications

Real-world applications have been developed or are under development for a number of companies. They cover Named Entity Recognition on Italian financial news, data extraction from personal résumés in English, and information extraction from classified advertisements in Italian. LearningPinocchio is also under consideration for two industrial projects in the area of pharmaceutical research.

Completed Applications

Ongoing Applications

For further information, contact:
Alberto Lavelli (email: lavelli@itc.it)
ITC-Irst, Centro per la Ricerca Scientifica e Tecnologica
Loc. Pantè di Povo, Trento, I-38050, Italia.
 


Page Created by:
Fabio Ciravegna