HLT Seminars Page

FBK-irst, Trento, Italy

The seminars take place every Thursday, usually from 11:00 to 12:00, from December 2006 to June 2007, as specified in the program overview below.

The purpose of the seminars is to:

This page is updated regularly and kept as accurate as possible. However, seminar details may change at short notice, and some seminars may not take place as announced.

For more information, please contact: giuliano [at] itc.it

Program (December-June)

Date | Time | Location | Speaker | Affiliation | Title | Type
December 7 | 14:30-15:30 | Sala Grande Est | Bernardo Magnini | ITC-irst | Question Answering at ITC-irst | introduction
December 14 | 14:30-15:30 | Sala Consiglio Scientifico | Ido Dagan | Bar Ilan University | Textual entailment as a framework for applied semantics | invited talk
December 21 | 14:30-15:30 | Sala Grande Est | Marcello Federico | ITC-irst | Machine Translation at ITC-irst | introduction
January 11 | 11:00-12:00 | Sala Grande Est | Alberto Lavelli | ITC-irst | Information Extraction at ITC-irst | introduction
January 18 | 11:00-12:00 | Sala Grande Est | Nicola Bertoldi | ITC-irst | Efficient Speech Translation through Confusion Network Decoding | research results
January 25 | 11:00-12:00 | Sala Grande Est | Claudio Giuliano | ITC-irst | Relation Extraction and the Effect of Automatic Entity Recognition | research results
February 1 | 11:00-12:00 | Sala Grande Est | Lorenza Romano | ITC-irst | Simple Information Extraction (SIE): A Portable and Effective IE System | research results
February 15 | 11:00-12:00 | Sala Grande Est | Carlo Strapparava | ITC-irst | Dances with Words | research results
February 22 | 11:00-12:00 | Sala Grande Est | Milen Kouylekov | ITC-irst | Recognising Textual Entailment with Tree Edit Distance: Application to Question Answering and Information Extraction | research results
March 1 | 11:00-12:00 | Sala Grande Est | Daniele Pighin | FBK-irst | Tree Kernels for Statistical Natural Language Processing | research results
March 8 | 11:00-12:00 | Sala Grande Est | Mauro Cettolo | FBK-irst | Handling Word-reordering Phenomena in Statistical Machine Translation | research results
March 22 | 11:00-12:00 | Sala Consiglio Scientifico | Deepa Gupta | FBK-irst | POS-based Reordering Models for Statistical Machine Translation | research results
March 29 | 11:00-12:00 | Sala Grande Est | Massimiliano Ciaramita | Yahoo! Research | Dependency Parsing with Semantic Information | invited talk
April 12 | 11:00-12:00 | Sala Grande Est | Marco Baroni | CIMeC, University of Trento | Building Very Large Corpora from the Web | invited talk
May 3 | 11:00-12:00 | Sala Consiglio Scientifico | Alfio Gliozzo | FBK-irst | Semantic Domains and Ontology Learning | research results
May 10 | 11:00-12:00 | Sala Consiglio Scientifico | Marcello Federico | FBK-irst | Efficient Handling of N-gram Language Models for Statistical Machine Translation | research results
May 17 | 11:00-12:00 | Sala Consiglio Scientifico | Fabio Brugnara | FBK-irst | Speech Recognition at FBK-irst | introduction
May 24 | 11:00-12:00 | Sala Grande Est | Charles Callaway | University of Edinburgh | The use of ontologies in knowledge acquisition | invited talk
May 31 | 11:00-12:00 | Sala Grande Est | Valentina Bartalesi Lenzi, Manuela Speranza and Rachele Sprugnoli | CELCT & FBK-irst | The Italian Content Annotation Bank (I-CAB) | research results
June 7 | 11:00-12:00 | Sala Grande Est | Octavian Popescu | FBK-irst | Cross Document Coreference | research results
June 14 | 11:00-12:00 | Sala Grande Est | Daniele Falavigna | FBK-irst | Use of Word Graphs in Automatic Speech Recognition | research results

December

December 7, 14:30-15:30, Sala Grande Est
Question Answering at ITC-irst
Bernardo Magnini, ITC-irst

This seminar is intended to provide both an introduction to Open Domain Question Answering and an overview of the research carried out at ITC-irst on this topic in recent years. I will introduce the QA task as defined in the evaluation campaigns run under TREC and CLEF, pointing out achievements and current limitations. The Irst QA systems will be briefly described, as well as the contribution of related research topics such as answer validation, query expansion and automatic acquisition of answer patterns. Finally, I will mention the QALL-ME project, recently started at Irst.

December 14, 14:30-15:30, Sala Consiglio Scientifico
Textual entailment as a framework for applied semantics
Ido Dagan, Bar Ilan University

We have recently proposed Recognizing Textual Entailment (RTE) as a generic task that captures major semantic inferences across different natural language processing applications. The talk will first review the motivation and definition of the textual entailment task and the PASCAL RTE-1 and RTE-2 Challenge benchmarks. Then we will demonstrate directions for building textual entailment systems and utilizing them for concrete applications. Furthermore, we suggest that textual entailment modeling may become a comprehensive framework for applied semantics research. Such a framework introduces novel, useful variants of known semantic problems and also highlights important new problems which have hardly been investigated so far within computational linguistics. This semantic modeling perspective will be reviewed and illustrated by a case study on an entailment variant of the word sense disambiguation problem.

December 21, 14:30-15:30, Sala Grande Est
Machine Translation at ITC-irst
Marcello Federico, ITC-irst

After a short introduction to statistical machine translation, I will give an overview of the research carried out and the results achieved at ITC-irst during the last 5 years. Finally, I will give an outlook on current and future activities on SMT.

January

January 11, 11:00-12:00, Sala Grande Est
Information Extraction at ITC-irst
Alberto Lavelli, ITC-irst

The talk will provide both an introduction to Information Extraction and an overview of the research carried out in the TCC division on this topic in recent years. Various IE evaluation activities will be described, starting from the MUC conferences. Both the IE systems developed in the TCC division and the European projects involving the IE group will be briefly described, as well as other contributions to related research topics.

January 18, 11:00-12:00, Sala Grande Est
Efficient Speech Translation through Confusion Network Decoding
Nicola Bertoldi, ITC-irst

In the talk I will introduce the Spoken Language Translation (SLT) task, highlighting the issues which make SLT more complex than text translation. I will present a state-of-the-art system for SLT, which exploits confusion networks as the interface between automatic speech recognition and machine translation. In particular, I will describe a decoding algorithm for confusion networks which is obtained as an extension of a state-of-the-art phrase-based text translation decoder.
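As a rough illustration of the data structure mentioned above (not the decoder presented in the talk), a confusion network can be pictured as a sequence of slots, each holding alternative words with posterior probabilities. The sketch below counts the paths such a network encodes and reads off the highest-posterior word in each slot. The words, the probabilities and the use of an empty string for an epsilon (no word) entry are assumptions made for this example.

```python
from math import prod

# A confusion network as a list of slots; each slot maps a candidate word
# to its posterior probability. "" marks an epsilon (skip) entry.
# Words and probabilities are invented for illustration.
network = [
    {"i": 0.9, "eye": 0.1},
    {"have": 0.6, "had": 0.3, "": 0.1},
    {"two": 0.5, "to": 0.4, "too": 0.1},
]


def num_paths(cn):
    """Number of paths through the network (one entry chosen per slot)."""
    return prod(len(slot) for slot in cn)


def best_path(cn):
    """Pick the highest-posterior entry in every slot and drop epsilons."""
    words = [max(slot, key=slot.get) for slot in cn]
    return [w for w in words if w]


print(num_paths(network))
print(best_path(network))
```

The number of paths grows multiplicatively with the number of slots, which is why a translation decoder working on the network cannot simply enumerate them.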

January 25, 11:00-12:00, Sala Grande Est
Relation Extraction and the Effect of Automatic Entity Recognition
Claudio Giuliano, ITC-irst

We present an approach for extracting relations between entities from natural-language documents. The approach is based solely on shallow linguistic processing, such as tokenization, sentence splitting, part-of-speech tagging and lemmatization. It uses a combination of kernel functions to integrate two different information sources: (i) the whole sentence where the relation appears, and (ii) the local contexts around the interacting entities. We present the results of experiments on extracting five different types of relations from a data set of newswire documents and show that each information source provides a useful contribution to the recognition task. Moreover, we performed a set of experiments to assess the influence of the accuracy of named entity recognition on the performance of the relation extraction algorithm. Such experiments were performed using both the "correct" named entities (i.e., those manually annotated in the corpus) and the "noisy" named entities (i.e., those produced by a machine learning based named-entity recognizer). The results show that the approach is robust with respect to the noise introduced by a named-entity recognizer. Moreover, our approach significantly improves the previous results obtained on the same data set when a comparison is possible.

February

February 1, 11:00-12:00, Sala Grande Est
Simple Information Extraction (SIE): A Portable and Effective IE System
Lorenza Romano, ITC-irst

In this talk we present SIE (Simple Information Extraction), an information extraction system designed and developed in the context of the IST-Dot.Kom project (2002-2005). SIE is a supervised modular system based on a general purpose machine learning algorithm (SVM) combined with several customizable modules, and was designed with the goal of being easily and quickly portable across languages, tasks, and domains. A crucial role in the architecture is played by Instance Filtering, which allows increasing efficiency without reducing effectiveness. Results obtained by SIE on several standard data sets, representative of different tasks and domains, are reported.

February 15, 11:00-12:00, Sala Grande Est
Dances with Words
Carlo Strapparava, ITC-irst

Animated text is an appealing field of creative graphical design. Manually designed text animation is largely employed in advertising, movie titles and web pages. In this talk we propose to link, through state of the art NLP techniques, the affective content detection of a piece of text to the animation of the words in the text itself. This methodology allows us to automatically generate affective text animation and opens some new perspectives for many internet applications.

February 22, 11:00-12:00, Sala Grande Est
Recognising Textual Entailment with Tree Edit Distance: Application to Question Answering and Information Extraction
Milen Kouylekov, ITC-irst

This work addresses the problem of Recognizing Textual Entailment (i.e. recognizing that the meaning of a text entails the meaning of another text) using a Tree Edit Distance algorithm between the syntactic trees of the two texts. A key aspect of the approach is the estimation of the cost of the editing operations (i.e. Insertion, Deletion, Substitution) among words. Our aim is to compare the contribution of different resources providing entailment rules, including lexical rules from WordNet and the UniAlberta thesaurus, and syntactic rules automatically acquired by the DIRT and TEASE systems. We carried out a number of experiments over the PASCAL-RTE dataset in order to estimate the contribution of different combinations of the available resources. In addition, we have developed and evaluated an Answer Validation module for Question Answering and a Relation Extraction system, both of them based on textual entailment.
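To make the notion of tree edit distance concrete, here is a minimal sketch of the textbook recursion for ordered labeled trees, memoized over pairs of forests. It is not the system described in the talk: the unit costs, the tuple-based tree encoding and the toy trees are assumptions for illustration, whereas the talk concerns estimating these costs from resources such as WordNet.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees;
# a forest is a tuple of trees. Unit costs are an assumption of this sketch.
DELETE_COST = 1
INSERT_COST = 1


def substitution_cost(a, b):
    """Cost of relabeling node a as node b (hypothetical unit cost)."""
    return 0 if a == b else 1


def size(forest):
    """Total number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    if not f1:
        return INSERT_COST * size(f2)
    if not f2:
        return DELETE_COST * size(f1)
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        # Delete the root of the rightmost tree of f1; its children move up.
        forest_distance(f1[:-1] + kids1, f2) + DELETE_COST,
        # Insert the root of the rightmost tree of f2.
        forest_distance(f1, f2[:-1] + kids2) + INSERT_COST,
        # Map the two rightmost roots onto each other.
        forest_distance(kids1, kids2)
        + forest_distance(f1[:-1], f2[:-1])
        + substitution_cost(label1, label2),
    )


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


# Toy dependency-style trees, invented for this example.
text = ("bought", (("John", ()), ("car", (("a", ()), ("red", ())))))
hypothesis = ("bought", (("John", ()), ("car", (("a", ()),))))
print(tree_edit_distance(text, hypothesis))
```

In an entailment setting, a low distance from the text tree to the hypothesis tree would be taken as evidence for entailment; replacing `substitution_cost` with a resource-based cost is where lexical and syntactic rules would enter.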

March

March 1, 11:00-12:00, Sala Grande Est
Tree Kernels for Statistical Natural Language Processing
Daniele Pighin, FBK-irst

Statistical classifiers are widely used for many NLP, Information and Relation Extraction tasks. The learning algorithms employed are typically trained using a combination of the linguistic information available for the text fragments under study. Nevertheless, (a) such lexical, morphological and syntactic features generally have to be defined on a per-language basis, and (b) the selection and representation of features can be made harder by the lack of a sound interpretation of the underlying linguistic phenomenon. Tree kernel functions alleviate these problems, as they can trigger automatic feature selection and evaluate the similarity between two parse tree fragments without requiring the explicit design and extraction of the attribute-value representation of the encoded texts. Simple manipulations of the parse tree fragments can also be performed, resulting in very accurate models for specific classification tasks.

March 8, 11:00-12:00, Sala Grande Est
Handling Word-reordering Phenomena in Statistical Machine Translation
Mauro Cettolo, FBK-irst

In machine translation (MT), one of the main problems to handle is word reordering. A word is "reordered" when it and its translation occupy different positions within the corresponding sentences. In statistical MT, word reordering is addressed from two points of view: constraints and modeling. Constraints are introduced to limit the exponential number of possible word reorderings. Models, also known as distortion models, provide a measure of the plausibility of allowed reorderings. In this talk, I present an overview of some of the reordering constraints and models widely employed in current MT systems.
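The two viewpoints above can be illustrated with a small sketch. It assumes a simple distance-based distortion measure (the jump between consecutively translated source positions) and a maximum-jump constraint; these are common textbook choices used here as assumptions, not necessarily the specific constraints and models covered in the talk. Function names are hypothetical.

```python
from itertools import permutations


def jumps(order):
    """Jump sizes |next - previous - 1| when source positions are translated
    in the given order; a monotone (left-to-right) order gives all zeros."""
    previous = -1
    result = []
    for position in order:
        result.append(abs(position - previous - 1))
        previous = position
    return result


def distortion_cost(order):
    """Model side: total jump distance, lower means more plausible."""
    return sum(jumps(order))


def allowed_orders(n, max_jump):
    """Constraint side: keep only the orders whose largest jump is within the limit."""
    return [p for p in permutations(range(n)) if max(jumps(p)) <= max_jump]


# With 3 source words there are 6 possible orders in total.
for order in allowed_orders(3, max_jump=2):
    print(order, distortion_cost(order))
```

Tightening `max_jump` shrinks the search space the decoder has to explore, while `distortion_cost` ranks the orders that remain.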

March 22, 11:00-12:00, Sala Consiglio Scientifico
POS-based Reordering Models for Statistical Machine Translation
Deepa Gupta, FBK-irst

We present a novel word reordering model for phrase-based statistical machine translation. In particular, reordering of nouns, verbs and adjectives is modeled by exploiting inverse alignments that take into account the distances between source as well as target words. The reordering model proved to be particularly effective for pairs of linguistically structured languages, namely Japanese-English and German-English. The proposed model was applied as a set of additional feature functions to rescore N-best translation candidates generated by a statistical machine translation system. Experiments showed a relative BLEU score improvement of 4.7-6.2% on the BTEC Japanese-to-English task, and 3.9-5.6% on the Europarl German-to-English task.

March 29, 11:00-12:00, Sala Grande Est
Dependency Parsing with Semantic Information
Massimiliano Ciaramita, Yahoo! Research

In this talk I will present ongoing research which investigates new design options for the feature space of syntactic dependency parsers. We focus on one of the simplest parsing architectures, based on deterministic shift-reduce algorithms, trained with the perceptron. We show that by adopting second-order feature maps, the primal form of the perceptron produces models with comparable accuracy to more complex parsers, without the need for approximations. Further, we explore the application of new features extracted from the annotations produced by a semantic tagger. These semantic features guarantee additional improvements in accuracy and provide the first promising evidence of the usefulness of annotated semantic information for syntactic parsing. We provide standard experimental evaluations on the Wall Street Journal Penn Treebank.

April

April 12, 11:00-12:00, Sala Grande Est
Building Very Large Corpora from the Web
Marco Baroni, CIMeC, University of Trento

In this talk, I introduce a few initiatives I have recently been involved in (in particular, WaCky and CLEANEVAL) that aim at collecting, pre-processing, annotating and indexing large amounts of textual data crawled from the Web. I first motivate the idea of building corpora from the Web (vs. relying on a commercial search engine to gather linguistic data); then, I describe the Web corpus creation pipeline we developed, and briefly present a few applications of our Web corpora. Finally, I discuss what I believe, based on our experiences, to be the major challenges in this area for the near future.

May

May 3, 11:00-12:00, Sala Consiglio Scientifico
Semantic Domains and Ontology Learning
Alfio Gliozzo, FBK-irst

Knowledge acquisition from texts is an old problem in Artificial Intelligence. Recently, due to the increasing interest in the Semantic Web, the research community has been concentrating on learning domain ontologies from texts. Semantic Domains play a crucial role in the acquisition process, as they show many interesting properties that can be exploited to enhance the performance of terminology induction, relation extraction and ontology pruning algorithms. In addition, they allow us to design ontology learning algorithms working on open domain corpora, specifying the domain of interest by simply querying the system in an Information Retrieval style.

May 10, 11:00-12:00, Sala Consiglio Scientifico
Efficient Handling of N-gram Language Models for Statistical Machine Translation
Marcello Federico, FBK-irst

Statistical machine translation, as well as other areas of human language processing, has recently pushed toward the use of large-scale n-gram language models. This paper presents efficient algorithmic and architectural solutions which have been tested within the Moses decoder, an open source toolkit for statistical machine translation. Experiments are reported with a high-performing baseline, trained on the Chinese-English NIST 2006 Evaluation task and running on a standard Linux 64-bit PC architecture. Comparative tests show that our representation reduces the memory required by the SRI LM Toolkit by 58%, at the cost of 44% slower translation speed. However, as it can take advantage of memory mapping on disk, the proposed implementation seems to scale up much better to very large language models: decoding with a 289-million 5-gram language model runs in 2.1 GB of RAM.

May 17, 11:00-12:00, Sala Consiglio Scientifico
Speech Recognition at FBK-irst
Fabio Brugnara, FBK-irst

This seminar will provide an introduction to the technologies involved in large vocabulary continuous speech recognition. Topics include hints on acoustic modeling, language modeling and their integration for the purpose of achieving efficient decoding of speech signals, with an outline of the main algorithms and data structures used in a speech recognition system. The challenges raised by large-scale tasks such as the transcription of spontaneous unconstrained speech will be briefly discussed. The seminar will include a demonstration of the performance of automatic transcription systems developed at Irst on some indicative domains.

May 24, 11:00-12:00, Sala Grande Est
The use of ontologies in knowledge acquisition
Charles Callaway, University of Edinburgh

Automatic Knowledge Acquisition comprises a diverse collection of approaches for creating represented semantic knowledge more efficiently than manual knowledge engineering alone can. Producing an automatic text-based KA system requires solving a number of problems in text processing, ontology learning, formal knowledge representation, knowledge extraction, and verification. In this talk I focus on the scaffolding role that pre-existing and learned ontologies play in text-based KA by presenting examples where no, minimal, and extensive ontologies have different effects on incoming knowledge. I also present a prototype text-based KA system that extracts knowledge from unrestricted text in a concretely represented form, discuss different methods to evaluate its accuracy (both potential and implemented), and describe what ontological processing is sufficient and/or necessary for this KA system.

May 31, 11:00-12:00, Sala Grande Est
The Italian Content Annotation Bank (I-CAB)
Valentina Bartalesi Lenzi, Manuela Speranza and Rachele Sprugnoli, CELCT & FBK-irst

In this talk we will present the Italian Content Annotation Bank (I-CAB), a corpus of Italian news stories annotated with semantic information at three different levels: temporal expressions, different types of entities (i.e. persons, organizations, locations and geo-political entities), and relations between entities (e.g. the affiliation relation connecting a person to an organization). So far, I-CAB has been manually annotated with the first two levels, while the annotation of relations is work in progress. As we intend I-CAB to become a benchmark for various automatic Information Extraction tasks, we followed a policy of reusing already available markup languages. In particular, we adopted the annotation schemes developed for the ACE Entity Detection and Time Expressions Recognition and Normalization tasks, adapting them to the specific morpho-syntactic features of Italian and extending them to include a wider range of entities, such as conjunctions.

June

June 7, 11:00-12:00, Sala Grande Est
Cross Document Coreference
Octavian Popescu, FBK-irst

We present a general system for Person Cross Document Coreference that combines name frequency estimates with lists of NEs. The system has three main modules: Person Name Splitter, Local Coreference and Global Coreference, corresponding to three main steps. The first step is to identify, for each person name (PN), its type (first_name or last_name). This information is used in the second step, where the coreference among the names in the same document is established. The output of Local Coreference is a list of names, which represents the input of the third module, Global Coreference. Two names from different news stories corefer if their contexts are similar (the context is treated as a bag of words with no linguistic processing). The cluster of globally coreferred names represents the names of a unique person. These three steps are repeated until no new coreferences are made.
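The context comparison in the third step can be sketched as follows. The abstract only states that contexts are compared as bags of words; the cosine measure, the whitespace tokenization, the threshold value and the function names below are assumptions chosen for illustration, not details of the system.

```python
from collections import Counter
from math import sqrt


def bag_of_words(context):
    """Lower-cased, whitespace-split word counts; no linguistic processing."""
    return Counter(context.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def corefer(context_a, context_b, threshold=0.5):
    """Hypothetical decision rule: names corefer if their contexts are similar enough."""
    return cosine(bag_of_words(context_a), bag_of_words(context_b)) >= threshold


# Invented contexts surrounding two mentions of the same name in different news stories.
first = "the prime minister met the press"
second = "the prime minister spoke to the press"
print(corefer(first, second))
```

In a full system the threshold would have to be tuned on annotated data, and the pairwise decisions would feed a clustering step that groups all mentions of one person.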

June 14, 11:00-12:00, Sala Grande Est
Use of Word Graphs in Automatic Speech Recognition
Daniele Falavigna, FBK-irst

Word graphs are widely used in automatic speech recognition because they can embed the correct recognition hypothesis inside a reduced search space. This makes it possible to adopt complex acoustic and language models for exploring the search space, and to employ decoding techniques aimed at minimizing the word error rate instead of the sentence error rate. In the seminar, word graph generation, word graph expansion and some minimum word error rate decoding methods proposed in the literature will be introduced and discussed. Some preliminary results obtained on speech data acquired during recent European Parliament sessions will be given.

CONTACT INFORMATION

Claudio Giuliano
FBK-irst - Centro per la Ricerca Scientifica e Tecnologica
via Sommarive, 18
I-38050 Povo (TN), ITALY
email: giuliano [at] itc [dot] it