Corpus-based Lexicography
"Corpora are seen as representing living, developing, as opposed to fossilized, language" (Yorick Wilks, 1996, Electric Words)
"For many common words, the most frequent meaning is not the one that first comes to mind and takes pride of place in most dictionaries" (Sinclair, 1991, Corpus, Concordance, Collocation).
Corpus -> lexicon
- new lexical entries are found
- existing lexical entries are enriched by additional information extracted via corpus analysis (e.g. most common forms, connotation, etc.)
- important aspects of word meaning and grammar are highlighted that linguists working without data simply never noticed
- word frequency analysis is used for annotating lexical entries
- collocational information is collected, organized, and presented (e.g. idiom identification)
- (domain-specific) knowledge is extracted
- lexical items unlikely to be found in dictionary sources are extracted (e.g. proper nouns)
- real examples are provided that show how central and typical features of English are used
- paradigmatic- and syntagmatic-driven semantic clustering is performed
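As an illustration of the frequency and collocation items above, the following is a minimal sketch, not a method taken from the cited works. It counts word frequencies and ranks adjacent word pairs by pointwise mutual information (PMI), one common collocation measure. The function names, the simple regex tokenizer, and the `min_count` threshold are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (assumption): lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(tokens):
    # Raw frequency of every token in the corpus.
    return Counter(tokens)


def bigram_pmi(tokens, min_count=2):
    # Score each adjacent word pair by PMI: log2(P(w1, w2) / (P(w1) * P(w2))).
    # Pairs seen fewer than min_count times are skipped, because PMI is
    # unreliable for rare events.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest-scoring (most strongly associated) pairs first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = ("strong tea and strong tea again but "
              "powerful computer and powerful computer")
    toks = tokenize(sample)
    print(word_frequencies(toks).most_common(3))
    for pair, score in bigram_pmi(toks):
        print(pair, round(score, 2))
```

On a real corpus the threshold would need to be much higher, and a lexicographer would still have to judge which high-scoring pairs are genuine collocations or idioms.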
Creation of dictionaries with the aid of text analysis
- Construction of generic dictionaries.
Dictionaries can be constructed by means of corpus analysis, which is to say by defining, or refining, word senses against their actual occurrence in very large volumes of natural texts. Moreover, examples in the dictionary can be taken from the corpus, which provides the lexicographers with hard measurable evidence on a large scale about word usage.
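One simple way to inspect the actual occurrences of a word is a keyword-in-context (KWIC) concordance. The sketch below is only an illustration under stated assumptions: the `kwic` function, its `width` parameter, and the regex tokenizer are hypothetical names chosen for this example, not part of any tool mentioned here.

```python
import re


def kwic(text, keyword, width=3):
    # Return (left context, keyword, right context) for every occurrence
    # of keyword, with up to `width` tokens on each side.
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    text = "The bank raised rates. We sat on the river bank at noon."
    for left, word, right in kwic(text, "bank", width=2):
        print(f"{left:>12} [{word}] {right}")
```

Reading the aligned contexts side by side is what lets a lexicographer separate senses (here, the financial and the riverside "bank") and pick real example sentences.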
- (Semi-)Automatic construction of domain-specific lexica.
Although some semantic information is available in general-purpose knowledge bases, many applications require domain-specific lexica that represent the words and categories of a particular topic and help minimize ambiguity problems. A corpus-based method for building semantic lexicons for specific categories is presented in Riloff and Shepherd, 1997, "A corpus-based approach for building semantic lexicons". The input to the system is a small set of seed words for a category and a representative text corpus. The output is a ranked list of words that are associated with the category.
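The input/output contract described above (seed words and a corpus in, a ranked word list out) can be sketched as follows. This is a deliberately simplified co-occurrence ranking written for illustration, not the procedure from the cited paper. The function name, the `window` and `min_count` parameters, and the scoring ratio are assumptions.

```python
import re
from collections import Counter


def rank_category_words(corpus, seeds, window=2, min_count=2):
    # Score each non-seed word by the fraction of its occurrences that
    # fall within `window` tokens of a seed word (an assumed heuristic).
    tokens = re.findall(r"[a-z]+", corpus.lower())
    seeds = set(seeds)
    total = Counter(tokens)
    near_seed = Counter()
    for i, tok in enumerate(tokens):
        if tok in seeds:
            continue
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if any(c in seeds for c in context):
            near_seed[tok] += 1
    # Ignore words that are too rare for the ratio to be meaningful.
    scores = {w: near_seed[w] / total[w]
              for w in near_seed if total[w] >= min_count}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    corpus = ("truck and van. truck or jeep. van and jeep. "
              "apple pie. apple tart.")
    for word, score in rank_category_words(corpus, {"truck", "van"}, window=1):
        print(word, score)
```

In this toy version function words such as "and" score highly because they sit next to everything, so any practical use would need some filtering of the candidates, followed by human review of the ranked list.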
- Automatic acquisition of lexical semantics information.
Projects that center on extracting lexical information from machine-readable dictionaries (MRDs) are inherently limited, since the set of entries within a dictionary is fixed. To find terms and expressions that are not defined in MRDs, it is necessary to turn to other textual resources, such as large text corpora. See Hearst, 1992, "Automatic acquisition of hyponyms from large text corpora".
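Hearst's approach relies on lexico-syntactic patterns such as "X such as A, B, and C", from which A, B, and C can be read off as hyponyms of X. The sketch below illustrates only that single pattern and, as a simplifying assumption, treats every noun phrase as one word; a realistic system would need noun-phrase chunking and more patterns. The names `SUCH_AS` and `find_hyponyms` are hypothetical.

```python
import re

# "<hypernym>[,] such as <word>[, <word>]* [[,] and|or <word>]"
# Assumption: each noun phrase is a single word.
SUCH_AS = re.compile(
    r"(\w+),? such as (\w+(?:(?:, and |, or |, | and | or )\w+)*)"
)


def find_hyponyms(text):
    # Return (hyponym, hypernym) pairs found by the "such as" pattern.
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        for item in re.findall(r"\w+", match.group(2)):
            # Drop the coordinating conjunctions inside the list.
            if item.lower() not in {"and", "or"}:
                pairs.append((item, hypernym))
    return pairs


if __name__ == "__main__":
    sentence = "He studied authors such as Herrick, Goldsmith, and Shakespeare."
    for hyponym, hypernym in find_hyponyms(sentence):
        print(f"{hyponym} is a kind of {hypernym}")
```

Because the pattern is purely surface-level, it will also produce false pairs on some sentences, so extracted relations are best treated as candidates to be checked.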
bentivo@itc.it
Last modified: Wed Feb 13 12:26:48 MET 2002