Multilingual Corpora: Available Resources

Parallel Corpora

Multilingual Corpora

Projects

Institutions involved in the Production and Distribution of Corpora

Parallel Corpora Tools

Parallel Corpora

Subject to subscription/Licence fee

TRACTOR

Bulgarian, English, and French parallel texts: MS Word files containing source and target text on alternate lines. There are 20 files in different language pairs. Restrictions: Not available to industrial users. Resource provider: Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Sofia, Bulgaria.

IJS-ELAN corpus: contains 1 million words of parallel Slovene-English / English-Slovene texts from various domains. The corpus is composed of 15 source texts and is tokenised and aligned at the sentence level. Full documentation on the IJS-ELAN corpus is contained in its corpus and text TEI headers. The current version of the corpus is 1.1. The corpus has a web-based parallel concordancing service.

Alice in Wonderland: The texts were aligned with software based on the Gale-Church algorithm and then corrected manually, because the algorithm's error rate on Alice was very high. Despite the manual correction, many errors persist in the aligned files.

MULTEXT-East: Fiction (100 000 words), newspapers (100 000 words), speech (2 000 words) and Orwell's 1984 (100 000 words), all with CES encoding. The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus: it is the parallel text, in which the English original is sentence-aligned with its translations into the six languages of the project (Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovenian), and each translation is tagged for part-of-speech.

Swiss Government Multilingual (Fr-De-It) Corpus: Documents relating to the reform of the federal constitution (all HTML; RTF and PDF versions also available). From Institut für Deutsche Sprache, Mannheim, Germany.

Texts from Deutsche Bundesregierung (German Federal Government), Bonn and Berlin, Germany: German Resource File in HTML and Grundgesetz in French and English as Word documents. From Institut für Deutsche Sprache, Mannheim, Germany.

World Intellectual Property Organization (English and French): Intellectual Property and Copyright magazine in French and English versions, in MS Word files. Resource provider: Institut für Deutsche Sprache, Mannheim, Germany.

Translation of Plato's Republic : available in SGML, plain text and HTML formats, plus alignments with parallel texts in many languages (Slovak, Serbian, Slovene, Russian, Romanian, Polish, Lithuanian, Latvian, Hungarian, French, English, Finnish, Bulgarian, Croatian, Czech).

ELDA

ARCADE/ROMANSEVAL Corpus: was used as a reference corpus in two international competitions, ARCADE and ROMANSEVAL. The corpus contains raw data from the JOC corpus, composed of 1 million words in English, French and Italian. The annotation comprises:
- semantic tagging of all the occurrences of the test words in the JOC corpus for French and Italian;
- word-level alignment of all the occurrences of the test words between French and English.

GeFRePaC (German French Reciprocal Parallel Corpus): the project is funded by ELDA/IDS and runs from July 1999 until March 2000. Its aim is to build a 30 million word corpus (15 million for each language). It covers natural general language as used in public socio-political discourse, with a focus on multilingual administration and commercial and legal documentation. The corpus will be encoded according to the PAROLE guidelines, aligned at the sentence level and, for single-word translation units, at the lexical level, POS-tagged, and validated according to the most current version of the ELRA guidelines.

MULTEXT JOC: This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission. This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains approx. 5 million words in English, French, German, Italian and Spanish (approx. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level. The JOC corpus is delivered in Corpus Encoding Standard conformant format at each level of treatment: paragraph annotation level, conformant to the CESDOC specifications (1 M words * 5 languages); morpho-syntactic annotation level (PoS tagging), conformant to the CESANA specifications (200,000 words * 4 languages); parallel text alignment at sentence level, conformant to the CESALIGN specifications (200,000 words * 4 languages). Only a few samples can be downloaded from this site.

MLCC: consists of a parallel corpus of translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish. The parallel data (provided by the European Commission, EC) comprises two sub-corpora from the Official Journal of the European Communities.

UCREL

ET10-63 Parallel Corpus: a bilingual parallel corpus of English and French, containing EC official documents on telecommunications. The corpus is part-of-speech tagged and lemmatized. Approximately 1,250,000 words in each language.

LDC

Hansard Canadian English-French Parallel Corpus: an annotated corpus of parallel texts in English and Canadian French, drawn from official records of the proceedings of the Canadian Parliament. While the content is therefore limited to legislative discourse, it spans a broad assortment of topics, and the stylistic range includes spontaneous discussion and written correspondence along with legislative propositions and prepared speeches. There are several different versions.

Hong Kong Hansards Parallel Text: This corpus contains excerpts from the Official Record of Proceedings of the Legislative Council of the Hong Kong Special Administrative Region (HKSAR) from October 1995 to April 2000. The proceedings of the meetings are recorded verbatim in the Official Record of Proceedings of the Legislative Council (Hansard). The record of proceedings is in the original language delivered by the speakers (Floor Version). It is then translated into English and Chinese versions separately. There are 11.9 million English words and 18.15 million Chinese characters.

Hong Kong Laws Parallel Text: This FTP publication was obtained during January 1999 from the bilingual website of the Department of Justice of the Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China. The retrieved files have been processed and sentence-aligned. This corpus is organized into nineteen parallel file pairs for a total of thirty-eight files.

Hong Kong News Parallel Text: This FTP publication was created when the LDC collected parallel Chinese-English news articles from the Information Services Department of the Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China. This corpus contains 18,147 aligned article pairs released by HKSAR from July 1, 1997 to April 30, 2000. Automatic article alignment was done at the LDC.

UN Parallel Texts Corpus: This publication contains the English, French and Spanish archives, with data from each language stored on a separate disc in the set. Care has been taken to arrange the document files in a parallel directory structure for each language, so that corresponding translations of a document can be found directly by means of the directory paths and file names. The CDs are also available separately. All parallel files in this corpus are English-based: for every file on the English disc, there will be a corresponding file on either the French or Spanish disc, or both.
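The path-based lookup described for the UN corpus can be sketched as follows. This is a minimal illustration, not the layout of the actual discs: the root names `/en` and `/fr` and the helper functions are hypothetical, and the only assumption taken from the description is that a translation sits at the same relative path under its own language root.

```python
from pathlib import PurePosixPath


def relative_key(path, root):
    """Return a file's path relative to its language root, e.g. '1993/a.txt'."""
    return PurePosixPath(path).relative_to(PurePosixPath(root)).as_posix()


def pair_by_path(english_paths, english_root, other_paths, other_root):
    """Pair English files with translations that share the same relative path.

    Returns one (english_path, other_path_or_None) tuple per English file,
    because the corpus is keyed on the English disc.
    """
    other_index = {relative_key(p, other_root): p for p in other_paths}
    pairs = []
    for english_path in english_paths:
        key = relative_key(english_path, english_root)
        pairs.append((english_path, other_index.get(key)))
    return pairs


# Hypothetical file names, invented for illustration.
english = ["/en/1993/a.txt", "/en/1993/b.txt"]
french = ["/fr/1993/a.txt"]
print(pair_by_path(english, "/en", french, "/fr"))
```

A file with no counterpart in the other language is paired with None, which matches the statement that a given English file may exist on only one of the other two discs.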

Multiple Translation Chinese Corpus: To support the development of automatic means for evaluating translation quality, the LDC was sponsored to solicit 11 sets of human translations for a single set of Mandarin Chinese source materials.

Available on the Web

BAF Corpus: a corpus of French-English bi-texts, i.e. pairs of French and English texts which are mutual translations, and whose sentences have been aligned according to "Gale & Church". Most of the texts are of institutional genre, for a total size close to 400,000 words per language, but a few scientific papers and a literary work are also included. BAF Version 1.1 is available and can be freely downloaded in UNIX GZ format, as a ZIP archive, or file by file in TXT and CES formats.

The Bible: The University of Maryland Parallel Corpus Project is acquiring and annotating texts in order to create multilingual corpora for linguistic research, particularly computational linguistics. Religious texts such as the Bible are widely available, carefully translated, and appear in a huge variety of languages. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones.

Europa: The European Union online

English-Spanish aligned texts: The texts have been extracted from the book Capitalism, socialism and democracy by Joseph Alois Schumpeter and its translation into Spanish.

English-Turkish Aligned Parallel Corpora: These parallel corpora contain English and Turkish texts aligned at the sentence level using Gale and Church's align code. There may be occasional problems due to missing sentence boundaries.

ITU or CRATER Parallel (Sp-Fr-En) Corpus (the International Telecommunications Union/CRATER Corpus): multilingual aligned annotated corpus (1,000,000 words) of Spanish, French and English, aligned at the sentence level, available from the School of Engineering, Computing and Mathematical Sciences, Lancaster University, UK. The corpus consists entirely of technical texts from the International Telecommunications Union. The texts are tagged with part-of-speech and morphological annotation.

Swedish political texts: This corpus contains texts from the Swedish government, including the declarations issued by the Swedish prime minister when a new cabinet takes office (REGERINGSFÖRKLARINGEN). These declarations are issued in five languages: Swedish, English, French, German and Spanish (since 1996). The current size of the corpus is 11,000 words. The documents have been converted to TEI-conformant SGML, and the sentences in the different language versions have been aligned with the align program by Gale and Church. The result is this searchable parallel corpus. Resource provider: Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Sofia, Bulgaria. Restrictions: Not available to industrial users.

Knut Hofland English-French Aligned Texts: A freely queryable online database of English-French aligned texts, processed with the same software (by Knut Hofland) used for the Oslo ENPC project.

VLC Hong Kong Virtual Language Center: Hong Kong Polytechnic University has set up a Virtual Language Centre on the Internet, with Chris Greaves as project director. The VLC offers online English and Chinese language learning materials. It also provides an online dictionary and online concordancing. A small English-Chinese parallel corpus created by Dr. Xu Xunfeng is also accessible.

Portuguese-English parallel translation corpus (COMPARA): bi-directional parallel corpus freely available on the Web. It is an open-ended collection of Portuguese-English and English-Portuguese translations. You can use COMPARA to find out how translators have translated words and expressions from Portuguese into English and from English into Portuguese.

Not Available

ENPC-English-Norwegian Parallel Texts: consists of original texts and their translations (English to Norwegian and Norwegian to English). It is intended as a general research tool, available beyond the present project for applied and theoretical linguistic research. The corpus is now completed. The focus has been on novels and fairly general non-fictional books. The English part of the ENPC has been tagged for part-of-speech (POS). The tagging was done automatically. The Norwegian part of the corpus has recently been tagged (October 2001), and we are in the process of post-editing the tagged texts. Access to the corpus is currently restricted to researchers and students at the University of Oslo. Only the manual is freely available online.

English and Polish Parallel and Comparable Corpora: corpus projects at the PELCRA web site. An aligned parallel corpus provides an aid to human translators, since it is possible to look up all sentences in which a word or phrase occurs in one language and find exactly how those sentences were translated into the other language. It contains a small number of texts from a variety of sources (magazines, books) which have been translated from one language into the other (to date, these texts are mainly from Polish into English). At present, we do not provide external access to our corpora; only a small sample from the PELCRA corpora can be downloaded directly from this page.
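The lookup described above (find every sentence containing a word or phrase, and show how each was translated) can be sketched in a few lines. This is only an illustration under stated assumptions: the corpus is assumed to be available as a list of already aligned (source, target) sentence pairs, the function name is hypothetical, and the toy sentences are invented rather than taken from PELCRA.

```python
import re


def parallel_concordance(aligned_pairs, query):
    """Return every (source, target) pair whose source sentence contains the query.

    Matching is case-insensitive and restricted to whole words or phrases.
    """
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return [(src, tgt) for src, tgt in aligned_pairs if pattern.search(src)]


# Toy sentence-aligned data, invented for illustration.
pairs = [
    ("The house is old.", "Dom jest stary."),
    ("I like this book.", "Lubię tę książkę."),
    ("She sold the house.", "Sprzedała dom."),
]

for source, target in parallel_concordance(pairs, "house"):
    print(source, "|", target)
```

Real parallel concordancers add features such as context windows, sorting and lemma-based search; the sketch shows only the core idea of querying one side of an aligned corpus and returning the other side with it.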

ESPC-English-Swedish Parallel Corpus: 2.8 million words: 64 English text samples and their translations into Swedish and 72 Swedish text samples and their translations into English; two main text categories (fiction and non-fiction); The corpus can only be used for research. No commercial use is permitted. Moreover, the corpus is only available for research at the Department of English at the Universities of Lund and Göteborg. Scholars and students outside these departments can gain access to the corpus by visiting, or cooperating with, one of these departments. The parallel corpus between Swedish and English is planned to consist of 1,600,000 words in different samples (each sample 10,000-15,000 words) from both directions. The corpus will become available as soon as all the copyright restrictions are resolved.

English-Finnish: consists of approximately 2 million words of parallel texts, which have not yet been aligned. For the alignment of these texts the Finnish partner will be using a locally developed program. So far there has been no work on POS-tagging.

Kacenka: Parallel corpus of English and Czech texts (mainly literary). This collection of literary translations in Czech and English was created by the Department of English, Faculty of Arts, Masaryk University, during 1997 to support research and teaching in the field of translation. It currently contains 3,297,283 words (of which 1,689,513 were acquired by scanning). KACENKA is stored on a single CD-ROM; its use is limited by copyright restrictions. The aligned texts are provided in merged form.

REAL Parallel Corpus: the corpus consists of English and German texts, ranging from contemporary British/American literature to scientific textbooks. The aim is to create a machine-readable, aligned corpus which will make it possible to discover and categorise translation equivalents for a number of linguistic items, such as prepositions, function verbs, deictic elements, metaphors or culture-specific structures. Go to the Chemnitz Internet Grammar to check access to the online corpus: http://www.tu-chemnitz.de/InternetGrammar/

CEXI: project at the SSLMIT, University of Bologna in Forlì. The aim of the project is to create a resource which can be used by both students and researchers to learn about translation and translating. The English Italian Translational Corpus is planned as a four-million-word corpus subdivided into four components of one million words each, and can be described as a bi-directional parallel corpus of 10 to 15 thousand word samples from original Italian texts and their translations into English and vice versa, all published between 1945 and 2000. Each of the four components consists of equal proportions of fictional and non-fictional texts.

Tetun-English: A small Tetun (East Timorese)-English parallel corpus, manually sentence-aligned. It was used by Dan Melamed's Statistical Machine Translation Team (cf. Melamed's Tools file) and others using the EGYPT statistical machine translation toolkit. The whole corpus has about 450,000 words for each language. BAF Version 1.1. Description, alignment conventions, encoding documentation, and a COAL Tools suite

Swedish Immigrant Newspaper (Invandrartidningen): the corpus is issued in nine languages: Swedish, Albanian, Arabic, English, Finnish, Persian, Polish, Serbo-Croatian and Spanish. Presently this corpus consists only of a few texts. We will obtain a larger number of their texts in the near future.

Linköping: consists of texts translated from English into Swedish. The texts are from manuals and novels. The overall corpus adds up to about 1,600,000 words, all tagged in SGML. Around 20,000 words have been aligned using two different alignment algorithms.

Scania Corpus: A collection of truck manuals from the Swedish truck manufacturing company Scania. The manuals are available in eight languages: Swedish (source language), Dutch, English, Finnish, French, German, Italian and Spanish. The Swedish component adds up to 300,000 words and is the largest part of the corpus. The smallest component, Finnish, consists of approximately 200,000 words.

TRIPTIC (TRIlingual Parallel Text Information Corpus): a trilingual corpus developed for the analysis of prepositions in English, French and Dutch. There is no TRIPTIC page on the web, and all the information here is taken from Michael Barlow's Parallel Corpora Page. The corpus consists of 2,000,000 words, one half fiction, the other half non-fiction material. All paragraphs are aligned, allowing automatic selection of the n-th paragraph in the 3 languages.

Wang Lixun Chinese-English Parallel Corpus: The corpus, developed at Birmingham by Wang Lixun, consists of complete texts. Each individual 'text-based' file is one complete novel, essay, or other kind of article. All the files are classified by their names. The corpus is carefully balanced: half the texts are of English and half of Chinese origin, and the genres of the texts are also balanced across both source languages.

Multilingual Corpora

Subject to subscription/Licence fee

TRACTOR

European Free Trade Association Multilingual (De-En) Corpus: Texts from the European Free Trade Association (EFTA), Geneva, Switzerland, in English and German (HTML and Word formats). From Institut für Deutsche Sprache, Mannheim, Germany.

NATO Multilingual (Fr-De-En) Corpus : HTML texts from North Atlantic Treaty Organization (NATO), Brussels, Belgium, in English, French and German. From Institut für Deutsche Sprache, Mannheim, Germany.

Texts from the German embassy in Paris (French and German): Texts from Centre d'Information et de Documentation de l'Ambassade de la République Fédérale d'Allemagne, Paris, France, in German and French, in HTML. Resource provider: Institut für Deutsche Sprache, Mannheim, Germany.

LDC

ECI/MCI 1 Corpus: a 98 million word corpus, covering most of the major European languages, as well as many others (Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan) for the linguistic research community. ECI/MCI has 46 subcorpora in 27 (mainly European) languages. Twelve of the component corpora are multilingual parallel corpora with from two to nine sub-corpora. Parallel texts in English, French and Spanish provided by the International Labor Organisation. Approximately 5 million words. Unusually cheap: the ECI/MCI is available directly from ECI at a price of 95 DFl (for payments made by credit card or Eurocheque); 110 DFl (for payments by bank transfer); or 120 DFl (for payments by cheques other than Eurocheques). Need only to sign a license agreement available (Pos

ECI-ELSNET Italian & German tagged sub-corpus: the objective is to provide a small but fine-grained morphosyntactically tagged corpus of 50,000 running words for each of the two languages, to be used in research work on tagging methods and models. Word occurrences are tagged with very fine-grained tagsets based on the EAGLES morphosyntactic guidelines.

European Language Newspaper Text: is also known as the French Language News Corpus. This corpus includes roughly 100 million words of French, 90 million words of German and 15 million words of Portuguese and has been marked using SGML.

OGI Multilanguage Corpus :The corpus consists of responses to prompts spoken over commercial telephone lines by speakers of English, Farsi (Persian), French, German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and Vietnamese. It contains a total of 1927 calls, an average of 175 calls per language.

TDT2 Multilanguage Text Corpus (Topic Detection and Tracking Multilingual Corpus): Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. This CD-ROM release consists of the English and Mandarin text components of the TDT2 corpus. The two subcorpora were also released separately, cf. TDT2 English Text Corpus Version 2 and TDT2 Mandarin Text Corpus.

Available on the Web

CHILDES Database: provides tools for studying child language data and conversational interactions. The CHILDES database is a large group of children's spoken and written language corpora, all freely available for PC or Mac. It includes a vast amount of transcript data collected from children and adults who are learning languages. CHILDES corpora cover 23 European and non-European languages: Cantonese, Catalan, Danish, Dutch, Estonian, French, German, Greek, Hebrew, Hungarian, Irish, Italian, Japanese, Mambila [Bantu], Mandarin, Polish, Portuguese, Russian, Spanish, Swedish, Tamil, Turkish, Welsh. The bulk of the collection is, however, English (see under the English section). There is also a remarkable Multilingual Collection (English UK and USA, German, Hebrew, Italian, Spanish, Swedish and Turkish).

Not Available

BoLC - Bononia Legal Corpus: an ongoing cross-disciplinary research project aimed at the construction and analysis of a multilingual comparable legal corpus. It has been under development at CILTA, Bologna University, since 1997. For the moment the corpus is formed of two subcorpora, one English and the other Italian, but it could be expanded at a later stage. The Italian and English legal-language subcorpora are taken to represent the two different legal systems, and the intention is to compare legal texts in different languages. In this dual perspective, for the purposes of initial inquiry, a pilot corpus was created, consisting of parallel corpora in Italian and English.

MLCC Multilingual and Parallel Corpora for CO-OPERATION: large collection of financial journalism text in electronic form from European financial newspapers. The corpus has two main components. The first set out to collect at least 5 million words in each of six European languages (Dutch, British English, French, German, Italian, Spanish), and to make them available on CD-ROM for research purposes in a uniform format and at a very low cost, to allow comparable studies to be carried out in different languages. The result is the Polylingual Document Collection (ELRA-W0006). The second set is a Multilingual Parallel Corpus (ELRA-W0007) consisting of a parallel corpus of translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish. The corpus collected by MLCC was used in the Multext project, which developed linguistic annotation and exploitation tools. All of the MLCC corpus will eventually be linguistically annotated using the Multext programs, with SGML down to the paragraph level. The project finished successfully in October 1995. The licence agreements and the SGML marked-up data have been passed on to ELRA, who will make the corpus publicly available in the near future.

Projects


CITATL: a national project (Corpora on line, ipermedia, traduzione, analisi testuale e apprendimento linguistico, i.e. online corpora, hypermedia, translation, text analysis and language learning) that brings together various resources and skills for the processing, management and use of linguistic materials. Margherita Ulrych (Trieste group) has collected a parallel corpus of spoken English and Italian, downloadable subject to authorisation. Maria Pavesi (Pavia group): English-Italian parallel corpus in the medical-scientific domain. To run it, you need to download the program and carefully follow the installation instructions contained in the readme.txt file.

The Scandinavian Project of Contrastive Corpus Studies: an ongoing Scandinavian project involving four partners in Norway, Finland, Denmark and Sweden. Sweden is represented by the Department of English at Lund University. Norway is represented by those who were already involved with the English-Norwegian Parallel Corpus. Finland's involvement is by way of the Finnish-English Contrastive Corpus Studies (FECCS), a project in contrastive linguistics at the University of Jyväskylä, Finland. No information is available about the Danish project. All four corpora will be built up according to the same structure. Each corpus consists of two parts: one parallel corpus of original texts together with their translations, and one comparable corpus consisting of original texts in both languages. The corpora are to be used in contrastive studies between the Scandinavian languages and English.

ETAP Project: one of the research projects at the Department of Linguistics, Uppsala University, Sweden. Languages: Swedish, Dutch, English, Finnish, French, German, Italian, Polish, Serbian-Bosnian-Croatian and Spanish. By October 1999, the project had resulted in four parallel, aligned subcorpora: the Scania Corpus, the Swedish Statement of Government Policy Corpus (available online), and the Invandrartidningen I and II corpora.

The Lingua Project: Begun in 1994 and financed by the European Union, this project involves the construction of multilingual corpora for English, Danish, Italian, French, Greek and some other languages, for use in language pedagogy. The project is gathering and managing a multilingual parallel corpus with the aim of developing translation aids and language learning tools to ease students' and teachers' work in second language learning. 11 organisations from 6 different countries participate in this project. At present the languages involved are English, French, German, Italian, Greek and Danish, but the project is considering the addition of Spanish and Portuguese. Up to now, a computer program called MultiConcord has been produced. List of multilingual files

PAROLE (Preparatory Action for Linguistic Resources Organization for Language Engineering): an EU-funded project which aims to build generic lexica and corpora in all languages of the Union. The project is implementing research into standardising lexical data carried out by a pre-existing project, EAGLES. This means that the information encoded in both lexica and corpora for each language is of the same type (working within the limits of language specificity) and stored in the same framework.

PEDANT: consists of texts in several languages and aims at providing a wide collection of text types and language pairs in order to facilitate the creation of sub-set corpora for the specific purposes various researchers might have. Most texts will be parallel in pairs.

INTERSECT: Project began in the Spring of 1994. The original aim was to construct and analyse a parallel bilingual corpus of French and English written texts, adding other languages later if resources permit. More recently we have added a German-English corpus.

Institutions Involved in the Production and Distribution of Corpora


Athelstan Online

Electronic Text Center

ELDA (The European Language resources - Distribution Agency)

ELRA (European Language Resources Association)

ELSNET (European Network of Excellence in Human Language Technologies)

ICAME (International Computer Archive of Modern and Medieval English)

LDC (Linguistic Data Consortium)

OTA (Oxford Text Archive)

RALI (Le Laboratoire de Recherche Appliquée en Linguistique Informatique - Université de Montreal)

RELATOR (European Linguistic Resources Repository Network)

TELRI (Trans-European Language Resources Infrastructure)

TRACTOR (Research Archive of Computational Tools and Resources)

UCREL (University Center for Computer Corpus Research on Language)

VISL (Visual Interactive Syntax Learning)

W3-Corpora Project

Parallel Corpora Tools


Parallel Concordancers

MultiConcord: a Windows-based multilingual parallel concordancer for classroom use, developed at the University of Birmingham under the Lingua project. The program is available from CFL Software Development, price £40.

ParaConc: Michael Barlow's ParaConc is a bilingual/multilingual concordance program (in different formats) designed to be used for contrastive corpus-based language research. The original parallel concordancer (programmed in HyperTalk) runs on a Mac. It is free and can be downloaded as a BinHex file from this site. You are asked to send an email message to barlow@ruf.rice.edu when you do this. The program is for individual research use only and cannot be loaded onto a network without purchasing a site licence agreement.

WordSmith Tools: by Mike Scott and published by Oxford University Press. It can perform lexical analysis of texts and alignment of multi-lingual texts. Platform: MS Windows 3.1 or higher.

Web Concordancer: online concordancing on texts in English (e.g. the Brown Corpus), Chinese, Japanese, French.

TransSearch: search the parallel bilingual English-French Canadian Hansard corpus. It is a tool that enables translators to submit queries to a translation memory, in order to locate ready-made solutions to all sorts of translation problems.

WebTCE : Downloadable on line

English/Slovene parallel concordancer : Available on line

Sentence Aligners

Vanilla aligner: A program which performs automatic sentence alignment of parallel texts. This is an implementation of the Gale and Church (1993) algorithm. Resource provider: Pernilla Danielsson and Daniel Ridings at the University of Gothenburg. Available under subscription to TRACTOR.
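Since several corpora on this page were aligned with Gale and Church's method, a simplified sketch of length-based sentence alignment may help show the idea. It is not the Vanilla aligner's code: the function names are hypothetical, the prior probabilities and variance are the values commonly quoted from Gale and Church (1993) and should be treated as assumptions, and the length penalty uses the mean of the two span lengths as a simplification.

```python
import math

# Prior probability of each alignment type (source sentences, target sentences).
# Values as commonly quoted from Gale and Church (1993); treat as assumptions.
PRIORS = {(1, 1): 0.89, (1, 0): 0.00495, (0, 1): 0.00495,
          (2, 1): 0.0445, (1, 2): 0.0445, (2, 2): 0.011}
MEAN_RATIO = 1.0   # expected target characters per source character
VARIANCE = 6.8     # variance per character


def length_cost(source_len, target_len):
    """Negative log probability that two spans of these lengths are translations."""
    if source_len == 0 and target_len == 0:
        return 0.0
    mean_len = (source_len + target_len / MEAN_RATIO) / 2
    delta = (target_len - source_len * MEAN_RATIO) / math.sqrt(mean_len * VARIANCE)
    # Two-sided tail probability of a standard normal at |delta|.
    tail = math.erfc(abs(delta) / math.sqrt(2))
    return -math.log(max(tail, 1e-300))


def align(source_lens, target_lens):
    """Return (source_indices, target_indices) groups with minimal total cost.

    Inputs are sentence lengths in characters; dynamic programming searches
    over 1-1, 1-0, 0-1, 2-1, 1-2 and 2-2 groupings.
    """
    n, m = len(source_lens), len(target_lens)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == math.inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = (best[i][j] - math.log(prior)
                        + length_cost(sum(source_lens[i:ni]), sum(target_lens[j:nj])))
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the alignment.
    groups, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return groups[::-1]


# Three source sentences against two target sentences (lengths in characters).
print(align([20, 15, 40], [22, 57]))
```

In the example, the second and third source sentences are grouped against the single long target sentence. Because the method looks only at lengths, it can go wrong on texts with many short or freely translated sentences, which is consistent with the manual correction reported for some corpora on this page.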

Align: is a C++ freely downloadable package by Adam Berger for aligning, at the sentence level, a pair of text files which are translations of one another.

MtAlign: Multilingual text aligner. Available on line.

The Uplug Sentence Aligner :establishes links between two parallel texts on the sentence level. Available on line.

Pesa: Portuguese-English sentence aligner : Ongoing project which is part of a MsD program (2001-2003). No information about its availability.

Word Aligners

LWA: The Linköping Word Aligner: a word alignment tool that takes a bi-text aligned at sentence or paragraph level as input. LWA provides two kinds of results: (i) for each pair of aligned sentences, a partial word alignment, and (ii) a list of link types (i.e. proposed translation pairs). It can also align multi-word units. Due to its knowledge-lite approach, LWA can easily be extended for use with a number of different language pairs; current versions work for Swedish, English and French. No information about its availability.

PWA: The Plug Word Aligner: a collection of tools for the automatic alignment of word correspondences in bilingual parallel texts. PWA is available to the research community from this page free of charge after the signing of a licence agreement with the proprietors.

TWENTE Word Aligner: a statistics-based method for the automatic extraction of bilingual lexicons. The method takes a bilingual corpus (that is, a text and its translation in another language) as input and outputs a bilingual lexicon. It is presented as having advantages over previously published algorithms because it is based on a symmetric translation model, so that the resulting bilingual lexicon can be used to translate in both directions between a language pair.
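The general idea of deriving a bilingual lexicon from a sentence-aligned corpus can be illustrated with a simple co-occurrence measure. The sketch below uses the Dice coefficient, which is a stand-in chosen for brevity and is not the Twente method; the function name, threshold and toy sentences are all hypothetical. Like a symmetric model, the Dice score of a word pair is the same whichever language is taken as the source.

```python
from collections import Counter
from itertools import product


def dice_lexicon(aligned_pairs, threshold=0.5):
    """Score word pairs by Dice coefficient over sentence-level co-occurrence.

    Returns a dict mapping (source_word, target_word) to a score in (0, 1],
    keeping only pairs whose score reaches the threshold.
    """
    source_count, target_count, joint_count = Counter(), Counter(), Counter()
    for source_sentence, target_sentence in aligned_pairs:
        source_words = set(source_sentence.lower().split())
        target_words = set(target_sentence.lower().split())
        source_count.update(source_words)
        target_count.update(target_words)
        joint_count.update(product(source_words, target_words))
    lexicon = {}
    for (s, t), joint in joint_count.items():
        score = 2 * joint / (source_count[s] + target_count[t])
        if score >= threshold:
            lexicon[(s, t)] = score
    return lexicon


# Toy aligned sentences, invented for illustration.
corpus = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]

for pair, score in sorted(dice_lexicon(corpus, threshold=0.9).items()):
    print(pair, score)
```

On a corpus this small, only words that always appear together survive a high threshold; real lexicon extraction needs far more data and usually a proper translation model rather than raw co-occurrence counts.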