The use of WordNet in Text Analysis
Lexicon -> corpus
- lexical information is used for different types of tagging
- existing taxonomies are used for semantic tagging
- lexical information is used for disambiguating naturally occurring texts
Corpus annotation with WordNet: semantic tagging
Texts are disambiguated by manually annotating each content word with a pointer to its appropriate WordNet synset. WordNet's wide range of senses makes a highly specific level of tagging possible.
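The tagging scheme described above can be pictured as a small data structure: a list of tokens in which every content word carries a pointer into a lexicon of senses. The sketch below is only an illustration. The sense keys, the glosses, and the names `Token`, `unresolved_pointers` and `tagged_glosses` are invented for this example and are not real WordNet identifiers or part of any WordNet tool.

```python
from dataclasses import dataclass
from typing import Optional

# Toy lexicon: sense key -> gloss. Keys and glosses are invented for
# illustration; they are NOT real WordNet synset identifiers.
LEXICON = {
    "deposit.v.1": "put money into an account",
    "money.n.1": "medium of exchange",
    "bank.n.1": "sloping land beside a body of water",
    "bank.n.2": "financial institution that accepts deposits",
}


@dataclass
class Token:
    form: str
    sense: Optional[str] = None  # pointer into LEXICON; None for function words


def unresolved_pointers(tokens, lexicon):
    """Return the tagged tokens whose sense pointer is missing from the lexicon."""
    return [t for t in tokens if t.sense is not None and t.sense not in lexicon]


def tagged_glosses(tokens, lexicon):
    """Pair each tagged content word with the gloss of the sense it points to."""
    return [(t.form, lexicon[t.sense]) for t in tokens if t.sense in lexicon]


# "She deposited money at the bank", with content words hand-tagged.
SENTENCE = [
    Token("She"),
    Token("deposited", "deposit.v.1"),
    Token("money", "money.n.1"),
    Token("at"),
    Token("the"),
    Token("bank", "bank.n.2"),
]

if __name__ == "__main__":
    for form, gloss in tagged_glosses(SENTENCE, LEXICON):
        print(f"{form}: {gloss}")
```

Keeping the text and the lexicon separate, joined only by sense pointers, means the same lexicon entry can be reached from every occurrence of a word sense in the corpus.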
- Over the past several years the Princeton group has been creating semantic concordances of a restricted version of the Brown Corpus. A semantic concordance is a textual corpus and a lexicon combined so that every content word in the text is linked to its appropriate sense in the lexicon.
- The 25 most frequent verbs in 12,925 sentences of the Wall Street Journal Treebank corpus have been hand-tagged with their WordNet senses. See Wiebe et al., 1997, "WordNet sense tagging in the Wall Street Journal".
- In order to test a Word Sense Disambiguation program, a corpus has been gathered in which 192,800 word occurrences have been manually tagged with senses from WordNet. These occurrences are instances of 121 nouns and 70 verbs that are among the most frequently occurring and most ambiguous words in English. The sentences have been drawn from the combined corpus of the 1 million word Brown corpus and the 2.5 million word Wall Street Journal corpus. See Ng and Lee, 1996, "Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach".
- Underspecified semantic tagging:
Traditional semantic tags that are based on discrete senses, such as WordNet semantic tags, tend to be too fine-grained for practical use. To avoid this problem, a semantic database (CoreLex) has been built with 126 semantic types, covering around 40,000 nouns and defining a large number of systematic polysemous classes that are derived by a careful analysis of sense distributions in WordNet. The semantic types are underspecified representations based on generative lexicon theory.
By taking systematic polysemy and underspecification into account, the CoreLex approach offers a concise set of 126 tags that are inherently more coarse-grained.
The database lists the 126 underspecified semantic types, the 317 systematic polysemous classes they correspond to, and all 39,937 noun instances for each class, as derived from WordNet 1.5.
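The idea behind underspecified tagging can be sketched in a few lines: nouns whose senses fall under the same combination of basic types are grouped into one polysemous class, and the class as a whole receives a single coarse tag. The code below is a minimal sketch under stated assumptions. The nouns, basic types, tag labels and the two-member threshold for calling a class "systematic" are all invented for illustration and are not taken from CoreLex itself.

```python
# Hypothetical input: for each noun, the set of basic types its senses fall
# under. In CoreLex this would be derived from WordNet; here it is toy data.
NOUN_BASIC_TYPES = {
    "book": {"artifact", "communication"},
    "newspaper": {"artifact", "communication"},
    "chicken": {"animal", "food"},
    "lamb": {"animal", "food"},
    "table": {"artifact"},
}

# Hypothetical mapping from a polysemous class (a set of basic types) to an
# underspecified tag. The labels are made up for this sketch.
CLASS_TO_TAG = {
    frozenset({"artifact", "communication"}): "ART_COM",
    frozenset({"animal", "food"}): "ANM_FOD",
    frozenset({"artifact"}): "ART",
}


def polysemous_classes(noun_types):
    """Group nouns by the exact combination of basic types they exhibit."""
    classes = {}
    for noun, types in noun_types.items():
        classes.setdefault(frozenset(types), []).append(noun)
    return classes


def systematic_classes(noun_types, min_members=2):
    """Keep multi-type classes shared by at least min_members nouns.

    The threshold is an assumption of this sketch, standing in for the
    'careful analysis of sense distributions' mentioned in the text.
    """
    return {
        types: nouns
        for types, nouns in polysemous_classes(noun_types).items()
        if len(types) > 1 and len(nouns) >= min_members
    }


def underspecified_tag(noun):
    """Return one coarse tag for a noun instead of one tag per sense."""
    types = frozenset(NOUN_BASIC_TYPES.get(noun, ()))
    return CLASS_TO_TAG.get(types, "UNKNOWN")


if __name__ == "__main__":
    for noun in ["book", "lamb", "table", "idea"]:
        print(noun, underspecified_tag(noun))
```

With such a tag a tagger does not have to choose between, say, the animal and food senses of "lamb"; the choice is left open until the context requires it.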
Text analysis with the aid of WordNet as a tool
- Automatic disambiguation: the WordNet semantic hierarchy is a central resource for a variety of sense disambiguation algorithms.
- Lexical chaining.
Textual cohesion is the property of texts to "stick together" (Halliday and Hasan 1976) by using related words. Lexical cohesion, which arises from semantic connections between words, is so far the only form of textual cohesion that has been used successfully as a cohesive structure of texts, in the form of lexical chains.
- WordNet in Content Analysis: at present, content analysis technology does not make use of WordNet (Kenneth C. Litkowski, personal communication). For a comparison between WordNet synsets and MCCA (Minnesota Contextual Content Analysis) categories and their use in tagging texts for content analysis, see Litkowski, 1997, "Desiderata for Tagging with WordNet Synsets or MCCA Categories".
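The first two items in this list, hierarchy-based disambiguation and lexical chaining, can be illustrated together. The sketch below measures how related two senses are by counting edges through their nearest common ancestor in a hypernym hierarchy, then greedily attaches each word to the chain containing its closest sense, which at the same time selects a sense for the word. It is a toy example under stated assumptions and not any published algorithm: the hierarchy, the sense labels, the distance threshold and the first-sense fallback are all invented for illustration.

```python
# Toy hypernym hierarchy (sense -> parent). Invented for illustration;
# these are not real WordNet synsets.
HYPERNYM = {
    "entity": None,
    "organization": "entity",
    "financial_institution": "organization",
    "bank#1": "financial_institution",
    "credit_union#1": "financial_institution",
    "landform": "entity",
    "slope": "landform",
    "bank#2": "slope",
    "hill#1": "landform",
}

# Candidate senses per word, also toy data.
SENSES = {
    "bank": ["bank#1", "bank#2"],
    "credit_union": ["credit_union#1"],
    "hill": ["hill#1"],
}


def path_to_root(sense):
    """List the sense and all of its ancestors, nearest first."""
    path = []
    while sense is not None:
        path.append(sense)
        sense = HYPERNYM.get(sense)
    return path


def distance(a, b):
    """Number of edges between two senses via their nearest common ancestor."""
    steps_from_a = {node: i for i, node in enumerate(path_to_root(a))}
    for j, node in enumerate(path_to_root(b)):
        if node in steps_from_a:
            return steps_from_a[node] + j
    return None  # no common ancestor


def build_chains(words, max_dist=2):
    """Greedily group words into chains of (word, chosen sense) pairs."""
    chains = []
    for word in words:
        best = None  # (distance, chain index, sense)
        for index, chain in enumerate(chains):
            for sense in SENSES[word]:
                for _, member in chain:
                    d = distance(sense, member)
                    if d is not None and d <= max_dist:
                        if best is None or d < best[0]:
                            best = (d, index, sense)
        if best is not None:
            chains[best[1]].append((word, best[2]))
        else:
            # Assumption: fall back to the first listed sense for a new chain.
            chains.append([(word, SENSES[word][0])])
    return chains


if __name__ == "__main__":
    for chain in build_chains(["credit_union", "bank", "hill"]):
        print(chain)
```

In this toy run "bank" joins the chain started by "credit_union" and is thereby assigned its financial sense, while "hill" is too far from both in the hierarchy and starts a chain of its own. A practical chainer would use further WordNet relations and would not commit to a sense for the first word of a chain so early.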
Have a look at the use of WordNet in specific Natural Language Processing tasks.
bentivo@itc.it
Last modified: Wed Feb 13 12:27:34 MET 2002