Fabio Ciravegna and Alberto Lavelli
ITC-Irst, Centro per la Ricerca Scientifica e Tecnologica
via Sommarive 18, Povo, Trento, I-38050, Italy
Fax: +39-0461-302040 Telephone: +39-0461-314565
cirave@itc.it | lavelli@itc.it


 

NOTE THAT LearningPinocchio IS NO LONGER UNDER ACTIVE DEVELOPMENT


 

Pinocchio is a toolkit for developing and running Information Extraction applications. From the user's point of view, Information Extraction (IE) is a forward-looking technology for today's information-driven world. Rather than simply pointing the user to the relevant documents (as search engines do), an IE system extracts the pieces of information that are salient to the user's needs. The typical output of an IE system is a set of filled templates containing the extracted information. This information can then be either presented to the user or stored in a database for future use.

Pinocchio provides an environment for developing IE applications that includes a rich graphical interface with specialised editors, graphers, debuggers and tracers. For running applications within the user environment, a delivery configuration is available with reduced hardware and software requirements. Both configurations rely on the kernel, a generic engine on top of which the configurations run. Pinocchio is largely language independent: it has been used in applications for Italian, and demonstrators exist for English and (partially) Russian.

Extracting Information from Texts

Pinocchio receives as input the result of a preprocessor that performs text zoning, morphological analysis, part-of-speech tagging and Named Entity recognition. The system is based on cascades of Finite State Transducers (FSTs).

The first module applied is the parser. It assumes that there is one and only one correct parse tree for each sentence. Its goal is to produce a sufficient approximation of that parse tree, i.e. a tree that captures all the relations useful for information extraction while leaving the others implicit [Ciravegna et al. 1999b]. Parsing is performed in three steps: identification of NPs, VGs and PPs (chunking); A-structure recognition (i.e., recognition of the subcategorization frames of verbs, some NPs, etc.); and modifier attachment. The output of the parser is a parse tree with the relevant modifiers attached, a Typed Feature Structure representing the relations among constituents in the parse tree, and a Quasi Logical Form (QLF) representing the semantic realisations of the constituents.

Following parsing, default reasoning is applied to add to the QLF information that is not explicitly contained in the text but is needed for template filling (e.g., "if a person working for company X is hired by company Y, s/he is no longer an employee of X"). After this, discourse processing is performed: (pro)nominal references are resolved for both objects (people, organisations, physical objects) and events (e.g., hiring/firing), and implicit relations are also captured (e.g., "The Bank of Japan decided ... The president said ..."). Discourse processing reaches high accuracy because the amount of available information (i.e., the parse tree with the associated information in the Typed Feature Structures and in the QLF) allows reliable choices among candidates.

Templates are finally filled from the final QLF (as produced by an additional default reasoning step). Template merging and recovery actions cope with missing information. The template format is adaptable to different user needs. One of the available formats is fully compatible with the MUC evaluation methodology.
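
To make the idea of a rule cascade concrete, here is a minimal sketch in Python. It is not Pinocchio's rule formalism (the system itself is written in Common Lisp, and its rule language is not shown on this page): the tag set, the rule patterns and all function names are assumptions made for illustration, and the toy input ignores Named Entity recognition. Each stage rewrites the sequence produced by the previous one, first building NPs and VGs and then building PPs on top of them.

```python
import re

# Hypothetical tag set for this sketch: DET, ADJ, N, PREP, AUX, V.
# A node is (label, payload): the payload is a word for leaves
# and a list of nodes for chunks.

# Each stage is an ordered list of (pattern over labels, label of new chunk).
# Every label in the sequence is followed by one space, so a pattern
# can only match whole labels.
CASCADE = [
    # Stage 1: chunk noun phrases and verb groups.
    [(re.compile(r"(DET )?(ADJ )*(N )+"), "NP"),
     (re.compile(r"(AUX )*(V )+"), "VG")],
    # Stage 2: build prepositional phrases on the stage-1 output.
    [(re.compile(r"PREP NP "), "PP")],
]


def apply_stage(nodes, rules):
    """Scan left to right; at each position the first matching rule wins."""
    out, i = [], 0
    while i < len(nodes):
        rest = "".join(label + " " for label, _ in nodes[i:])
        for pattern, new_label in rules:
            m = pattern.match(rest)
            if m and m.end() > 0:
                n = m.group(0).count(" ")  # number of nodes consumed
                out.append((new_label, nodes[i:i + n]))
                i += n
                break
        else:
            out.append(nodes[i])  # no rule applies: pass the node through
            i += 1
    return out


def run_cascade(tagged_words):
    """Turn (word, tag) pairs into leaf nodes and apply every stage in order."""
    nodes = [(tag, word) for word, tag in tagged_words]
    for rules in CASCADE:
        nodes = apply_stage(nodes, rules)
    return nodes


sentence = [("The", "DET"), ("Bank", "N"), ("of", "PREP"), ("Japan", "N"),
            ("hired", "V"), ("a", "DET"), ("new", "ADJ"), ("president", "N")]
chunks = run_cascade(sentence)
print([label for label, _ in chunks])  # ['NP', 'PP', 'VG', 'NP']
```

The point of the layering is that later stages see only the labels produced by earlier ones, so each rule set stays small and declarative.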

The Development Environment

Portability is a key requirement for Pinocchio. The system can be ported to new applications and domains in a few weeks just by modifying declarative resources. The development configuration is specifically designed to help in this process. Two types of resources are needed: static resources (the lexicon and the knowledge base) and dynamic resources (cascades of rules). The knowledge base defines the ontology for the application domain. Graphers and debuggers are provided for developing the knowledge base. For each word relevant to the domain, the lexicon provides the mapping to the ontology and a description of the word's syntactic and semantic features (e.g. its subcategorization frame). A specialised editor and a grapher are provided for lexicon development. The same formalism (and even the same primitives) is used for all the dynamic resources. This uniformity is an important advantage, allowing new applications to be developed by a single person. Compilers, graphers, debuggers and tracers are available for the cascades of rules. Browsers allow the partial results of the whole IE process to be inspected (see figure). The development environment provides facilities for automatically comparing system results with the templates produced by a human annotator for user-defined application corpora. At the end of the application development process, resources are compiled into source code to be used by the delivery configuration. The development environment also allows the system to be ported to new languages. This part of Pinocchio is based on Geppetto, an environment for LE application development [Ciravegna et al. 1998].

As mentioned, only declarative resources must be modified when changing domain or language. The distinction (see next figure) between domain-dependent resources (blue areas) and language-dependent, i.e. domain-independent, resources (green areas) simplifies porting.
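
As an illustration of the static resources described above, the following Python sketch shows one way a lexicon entry could map a word to an ontology concept and carry a subcategorization frame. The concept names, the `LexicalEntry` fields and the `fits_role` helper are all hypothetical; Pinocchio's actual resource format is not documented on this page.

```python
from dataclasses import dataclass, field

# Hypothetical ontology fragment: each concept points to its parent concept.
ONTOLOGY = {
    "company": "organisation",
    "bank": "organisation",
    "organisation": "object",
    "bond": "financial-instrument",
    "financial-instrument": "object",
    "bond-issue": "event",
}


def is_a(concept, ancestor):
    """Walk up the ontology until the ancestor is found or the root is passed."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY.get(concept)
    return False


@dataclass
class LexicalEntry:
    lemma: str
    pos: str
    concept: str                                 # mapping to the ontology
    subcat: dict = field(default_factory=dict)   # role -> required concept


# Hypothetical lexicon for a bond-issue domain.
LEXICON = {
    "issue": LexicalEntry("issue", "V", "bond-issue",
                          {"subject": "organisation", "object": "bond"}),
    "bank": LexicalEntry("bank", "N", "bank"),
    "bond": LexicalEntry("bond", "N", "bond"),
}


def fits_role(verb, role, filler):
    """True if the filler's concept satisfies the verb's restriction on the role."""
    required = LEXICON[verb].subcat.get(role)
    return required is not None and is_a(LEXICON[filler].concept, required)


print(fits_role("issue", "subject", "bank"))  # True
print(fits_role("issue", "object", "bank"))   # False
```

In a layout like this, porting to a new domain would mean replacing the ontology and the domain-specific lexicon entries while leaving the lookup code untouched, which mirrors the declarative-resources idea in the text.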


Applications

Pinocchio was initially developed as the IE module within LE-FACILE, a very successful project for text classification and IE funded by the European Union [Ciravegna et al. 1999a]. We have developed several applications using Pinocchio. For Italian, one application (bond issues) has been fully developed and two others (management succession and company financial results) have reached demonstration level. Demonstrators were developed for English (economic indicators) and, partially, for Russian (bond issues). For the Italian bond issues application, the user designed a template composed of 12 different slots (type of bond, issuer, global amount, placement date, announcement date, global rate, etc.). The system was tested on 95 texts from ANSA, Radiocor and Il Sole 24 Ore (10,672 words); the formal results of the evaluation are Precision=.80 Recall=.72 F-measure=.76. Speed of analysis was 1,125 words/minute on a Sun Ultra 5 with 128M RAM. The figures were obtained by automatically comparing the system results against a user-tagged corpus. The MUC scorer was used for the comparison.
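
The reported figures can be checked with a short calculation. The sketch below assumes the F-measure is the balanced one (beta = 1, precision and recall weighted equally), which the page does not state explicitly; the function name is ours. It also derives the approximate total analysis time for the test corpus from the reported speed.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta = 1 is balanced)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures reported for the Italian bond issues application.
f = f_measure(0.80, 0.72)
print(round(f, 2))  # 0.76

# Approximate time to analyse the 10,672-word test corpus at 1,125 words/minute.
minutes = 10672 / 1125
print(round(minutes, 1))  # 9.5
```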

The Italian Named Entity Recogniser (developed using the FACILE Preprocessor [Black et al. 1998]) obtained: Precision=.96 Recall=.96 F-measure=.96. Porting to new applications required one to three man-months, depending on the application. Porting to another language required three to five months.

Several other applications are under study. As a follow-up to LE-FACILE, we will continue to work in the financial domain. We plan to combine Pinocchio with speech technology; in particular, we are addressing the problem of integrating it with the A.Re.S. system [Angelini et al. 1994], a speaker-independent dictation system developed at ITC-Irst and currently adopted by dozens of Italian hospitals for medical applications. We are also working on a project on IE for indexing medical diagnostic reports.

Software Details

Pinocchio has been developed in Allegro Common Lisp and runs under Unix on Sun workstations. A port to Windows environments is in progress.

System requirements: Operating system: Unix Solaris 2.7. Hardware: Sun UltraSparc with at least 128M RAM and 1G HD.

See the new version of Pinocchio with learning capabilities!
 

Contacts and Information

Further information about Pinocchio and about ITC-Irst can be obtained by contacting:

Alberto Lavelli (email: lavelli@itc.it)
ITC-Irst, Centro per la Ricerca Scientifica e Tecnologica
Loc. Pantè di Povo, Trento, I-38050, Italia.

References

[Angelini et al. 1994] B. Angelini, G. Antoniol, F. Brugnara, M. Cettolo, M. Federico, R. Fiutem and G. Lazzari. Radiological Reporting by Speech Recognition: the A.Re.S. System. In Proceedings of the International Conference on Spoken Language Processing, Yokohama, Japan, 1994.

[Black et al. 1998] W. J. Black, F. Rinaldi and D. Mowatt. FACILE: Description of the NE System Used for MUC-7. In Proceedings of the 7th Message Understanding Conference (MUC-7), 1998.

[Ciravegna et al. 1998] F. Ciravegna, A. Lavelli, D. Petrelli and F. Pianesi. Developing Language Resources and Applications with Geppetto. In Proceedings of the First International Conference on Language Resources and Evaluation, Granada, Spain, 1998.

[Ciravegna et al. 1999a] F. Ciravegna, A. Lavelli, N. Mana, L. Gilardoni, S. Mazza, J. Matiasek, W. Black, F. Rinaldi and D. Mowatt. Classifying Texts Integrating Pattern Matching and Information Extraction. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI99), Stockholm, August 1999.

[Ciravegna et al. 1999b] F. Ciravegna and A. Lavelli. Full Text Parsing using Cascades of Rules. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL99), Bergen, Norway, June 1999.

[Ciravegna 1999] F. Ciravegna. Pinocchio V3.1: User Manual. Technical Report, IRST, Trento, June 1999.