Repositories of Corpora
Institutions distributing Corpora
- LDC (Linguistic Data Consortium).
The LDC is an open consortium of universities, companies and government research laboratories founded in 1992. It creates, collects and distributes speech and text databases, lexica, and other resources for research and development purposes. The University of Pennsylvania is the LDC's host institution.
- ELRA (European Language Resources Association).
ELRA was established in 1995 as a non-profit organization to promote the creation, verification, and distribution of language resources in Europe. It aims to serve as a focal point for information related to language resources in Europe, and it collects, markets, distributes, and licenses European language resources. ELRA also helps users and developers of language resources, government agencies, and other interested parties exploit language resources for a wide variety of uses.
- ICAME (International Computer Archive of Modern and Medieval English).
ICAME is an international organization of linguists and information scientists working with English machine-readable texts. The aim of the organization is to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on the material, to compile an archive of English text corpora in machine-readable form, and to make material available to research institutions.
Websites containing lists of available Corpora
- BNC-Corpus Resources.
The British National Corpus page lists centres and projects from which language corpora (chiefly English language) are readily available.
- ACL NLP/CL Universe - Corpora. The list of corpora on the ACL NLP/CL Universe corpora page.
- UCREL-Lancaster University (University Centre for Computer Corpus Research on Language). This university holds a wide variety of machine-readable corpora.
- Corpora. The list of corpora on the corpus-based computational linguistics page maintained by Christopher Manning from SULTRY (Sydney University Language Technology Research Laboratory).
- Michael Barlow's Corpus Linguistics website. The section of the page devoted to corpora available in different languages.
- An annotated list of corpora. The chapter of the book "Data-Intensive Linguistics" written by Chris Brew and Marc Moens from Edinburgh University.
- Euralex - Corpora and Dictionaries. The Euralex (European Association for Lexicography) website on resources.
- Treebanks. The list of "treebanks" on the SIGLEX (ACL Special Interest Group on the Lexicon) resource links website.
The most important Corpora
- BRITISH NATIONAL CORPUS.
The BNC is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English.
- PENN-TREEBANK Corpus. The corpus consists of over 4.5 million words of American English. It has been annotated for part-of-speech information and for skeletal syntactic structure. It is released through the Linguistic Data Consortium.
- LOB Corpus.
The Lancaster-Oslo/Bergen Corpus is a million-word collection of present-day British English texts. Like its American counterpart, the Brown Corpus, it contains 500 text samples of approximately 2,000 words each, distributed over 15 text categories. All versions of the LOB Corpus text (tagged and untagged) are available only on tape and only for use by academic researchers.
- BROWN Corpus. This Standard Corpus of Present-Day American English consists of 1,014,312 words of running text of edited English prose printed in the United States during the calendar year 1961. So far as it has been possible to determine, the writers were native speakers of American English. The Corpus is divided into 500 samples of approximately 2,000 words each. Six versions of the Corpus are available free of charge, for research purposes only. Recently, part of the corpus has been semantically annotated with WordNet synsets.
- ECI/MCI.
ECI (European Corpus Initiative) has produced Multilingual Corpus I (ECI/MCI) of over 98 million words, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, Malay and more. The primary focus in this effort is on textual material of all kinds, including transcriptions of spoken material. The ECI/MCI is available on payment.
- Corpus ICE-GB. ICE-GB is the British component of the International Corpus of English (ICE). ICE began in 1990 with the primary aim of providing material for comparative studies of varieties of English throughout the world. Twenty centres around the world are preparing corpora of their own national or regional variety of English. ICE-GB is a one-million-word corpus of spoken and written British English. It contains 83,394 parse trees, including 59,640 in the spoken part of the corpus. This is the biggest collection of parsed spoken material anywhere.
- Corpus SUSANNE. The SUSANNE Corpus contains a 130,000-word cross-section of written American English (it is based on a subset of the million-word Brown Corpus) annotated in accordance with the SUSANNE annotation scheme. The SUSANNE Corpus is freely available without formalities for use by researchers anywhere.
- Corpus CHRISTINE. Sponsored by the Economic & Social Research Council (UK), the CHRISTINE project is setting out to extend the SUSANNE analytic scheme and Corpus to cover spoken English, and particularly spontaneous, informal spoken English.
- CPSA (Corpus of Professional Spoken American-English). The corpus, which has been constructed from a selection of existing transcripts of interactions in professional settings, contains two main sub-corpora of a million words each.
Upgrades ongoing
bentivo@itc.it
Last modified: Wed Feb 13 12:25:06 MET 2002