Evaluating Coverage for Large Symbolic NLG Grammars
Charles B. Callaway
ITC-irst Istituto per la Ricerca Scientifica e Tecnologica
via Sommarive, 18 - Povo
38050 Trento, Italy
callaway@itc.it
Abstract
After many successes, statistical approaches that have been popular in the parsing community are now making headway into Natural Language Generation (NLG). These systems are aimed mainly at surface realization, and promise the same advantages that make statistics valuable for parsing: robustness, wide coverage and domain independence. A recent experiment aimed to empirically verify the linguistic coverage of such a statistical surface realization component by generating transformed sentences from the Penn TreeBank corpus. This article presents the empirical results of a similar experiment to evaluate the coverage of a purely symbolic surface realizer. We present the problems facing a symbolic approach on the same task, describe the results of its evaluation, and contrast them with the results of the statistical method to help determine quantitatively the level of coverage currently obtained by NLG surface realizers.
1. Introduction
Like parsing, text generation offers enormous potential benefits for more natural interaction with computers. Examples of applications which could be greatly improved include automatic technical documentation, intelligent tutoring systems, and machine translation, among many others. Historically, natural language generation (NLG) has focused on the study of symbolic pipelined architectures which receive knowledge structures and goals from knowledge-based applications and which proceed to progressively add linguistic information.
In the last few years, the same paradigm shift which occurred in the parsing community, the use of statistical/empirical methods, has begun to influence the NLG community as well. As with parsing, statistical generation promises benefits such as robustness in the face of bad data, wider coverage, domain and language independence, and less need for costly resources such as grammars. But unlike parsing, which starts with a very flat representation (text) which is easily accessible in large quantities to both statistical and symbolic methods, the semantic input for NLG is typically associated with large knowledge-based systems. The types of corresponding corpora which would be necessary for using statistical processes, pairs of subgraphs of knowledge bases and their texts, do not currently exist in large quantities.
Because of this representation problem, most statistical systems have concentrated on replacing existing individual components in the standard NLG pipelined architecture [Reiter, 1994] without changing the remaining original symbolic modules. The most popular candidate has been the surface realization module [Elhadad, 1991; Bateman, 1995; Lavoie and Rambow, 1997; White and Caldwell, 1998], which is responsible for converting the syntactic representation of a sentence into the actual text seen by the user. Thus current statistical generators still depend on the remaining architectural modules of a system in order to function, and do not by themselves account for many linguistic phenomena: pronominalization, revision, definiteness, etc.
However, statistical surface realizers [Langkilde and Knight, 1998; Bangalore and Rambow, 2000; Ratnaparkhi, 2000; Langkilde-Geary, 2002] have focused attention on a number of problems facing standard, pipelined NLG that have until now been generally considered future work: large-scale, data-robust and language- and domain-independent generation. In addition, as Langkilde points out, empirical evaluation has not been standard practice in the NLG community, which has instead relied either on the software engineering practice of regression testing with a suite of examples or on theoretical evaluations [Robin and McKeown, 1995].
This paper presents the analogue of that recent statistical experiment for a well-known off-the-shelf symbolic surface realizer, with an augmented generation grammar that includes support for dialogue and additional syntactic coverage. In the following section we first describe the representations and processes needed to understand its evaluation. We then detail our implemented system for converting sentences from a large corpus into a systemic functional notation, present an evaluation of that system and of the grammar itself using Section 23 of the Penn TreeBank [Marcus et al., 1993], and finally discuss the implications of that evaluation.
2. Sentence Representations
To undertake a large-scale evaluation of a symbolic surface realizer, we must first find a large quantity of sentence plans with which to produce text. However, most text planners cannot generate either the requisite syntactic variation or quantity of text, and we thus cannot turn to implemented generation systems as a source. To solve this problem, Langkilde trained a statistical algorithm [Langkilde-Geary, 2002] on a substitute set of sentence plans: the Penn TreeBank [Marcus et al., 1993], a collection of sentences from newspapers such as the Wall Street Journal, which have been hand-annotated for syntax by linguists. An example annotated sentence is shown as the first structure in Figure 1. Hierarchical syntactic/semantic bracketing is provided along with the syntactic categories of lexemes and symbols in the newspaper texts.
(S (PP (IN Without) (NP (NNP GM)))
   (, ,)
   (NP-SBJ (NP (JJ overall) (NNS sales))
           (PP (IN for) (NP (DT the) (JJ other) (NNP U.S.) (NNS automakers))))
   (VP (VBD were)
       (ADJP-PRD (RB roughly) (JJ flat)
                 (PP (IN with) (NP (CD 1989) (NNS results))))))

((cat clause)
 (circum ((accompaniment ((cat pp) (position front) (accomp-polarity -)
                          (np ((cat proper) (lex "GM")))))))
 (process ((type ascriptive) (tense past)))
 (participants
  ((carrier ((cat common) (lex "sale") (number plural)
             (describer ((cat adj) (lex "overall")))
             (qualifier ((cat pp) (prep ((lex "for")))
                         (np ((cat common) (lex "automaker") (definite yes)
                              (number plural) (status different)
                              (classifier ((cat proper) (lex "U.S.")))))))))
   (attribute ((cat ap) (lex "flat")
               (modifier ((cat adv) (lex "roughly")))
               (qualifier ((cat pp) (prep ((lex "with")))
                           (np ((cat common) (lex "result") (number plural)
                                (classifier ((cat date) (year 1989))))))))))))

Figure 1: A Penn TreeBank Annotated Sentence and Corresponding FUF/SURGE Functional Description.
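To make the TreeBank bracketing in Figure 1 concrete, the following minimal Python sketch reads such an annotation into a nested structure and recovers the word sequence. It is an illustration only and not the conversion system evaluated in this paper: the function names `parse_bracketed` and `leaves` are hypothetical, and the sketch assumes a simplified TreeBank format with no traces or empty elements.

```python
# Illustrative sketch only: reads a Penn TreeBank style bracketing such as
# the one in Figure 1. Function names are hypothetical, not from the paper.

EXAMPLE = """
(S (PP (IN Without) (NP (NNP GM)))
   (, ,)
   (NP-SBJ (NP (JJ overall) (NNS sales))
           (PP (IN for) (NP (DT the) (JJ other) (NNP U.S.) (NNS automakers))))
   (VP (VBD were)
       (ADJP-PRD (RB roughly) (JJ flat)
                 (PP (IN with) (NP (CD 1989) (NNS results))))))
"""


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Children are either further (label, children) tuples or plain
    strings (the words of the sentence).
    """
    # Pad parentheses with spaces so that split() yields clean tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]  # syntactic category, e.g. NP-SBJ
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(node):
    """Return the words of a parsed tree in left-to-right order."""
    _label, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


tree = parse_bracketed(EXAMPLE)
print(" ".join(leaves(tree)))
```

Run on the sentence from Figure 1, the sketch prints the tokenized sentence "Without GM , overall sales for the other U.S. automakers were roughly flat with 1989 results". The nested tuples keep the category labels (PP, NP-SBJ, ADJP-PRD and so on), which is the information a converter would need in order to decide which functional role each constituent fills.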