COMPARISON TABLE OF DIFFERENT CORPORA TAKEN FROM ELRA



Characteristics of the query in the ELRA website:

Language(s): Arabic

Application(s): ASR



No. ELRA Catalog No. Item Name Available since Language(s) Cost for research use Cost for evaluation use Description
1 ELRA-M0040 Not Specified 20/01/2004 Arabic <<< >>> French 18000.00 EUR Not Specified DixAF is a French-Arabic, Arabic-French dictionary, which consists of around 125,000 binary links between ca. 43,000 French entries and ca. 35,000 Arabic entries.
2 ELRA-S0157 NetDC Arabic BNSC (Broadcast News Speech Corpus) 08/02/2007 Arabic 200.00 EUR Not Specified The NetDC Arabic BNSC (Broadcast News Speech Corpus) is a corpus developed by ELDA in the framework of the European-funded project Network of Data Centres (NetDC). The project was done in collaboration with the LDC (Linguistic Data Consortium), which has produced a similar corpus from the news broadcasted by Voice of America Arabic in the United States. The database contains ca. 22.5 hours of broadcast news speech recorded from Radio Orient (France) during a 3-month period.
3 ELRA-W0049 "Le Monde Diplomatique" Arabic tagged corpus 31/03/2009 Arabic 400.00 EUR Not Specified This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04). Each text comes with 3 files: raw text in Arabic, vowelised text in Arabic, and one XML file containing the morphological annotation of the text.
4 ELRA-W0078 NE3L named entities Arabic corpus 29/09/2014 Arabic 5000.00 EUR Not Specified The Arabic corpus contains 103,363 words coming from articles extracted from “Le Monde Diplomatique” newspaper, and published in 2004. 2 named entity categories were taken into account: Time and Amount.
5 ELRA-L0088 Arabic Morphological Dictionary 20/03/2012 Arabic 450.00 EUR Not Specified The Arabic Morphological Dictionary contains 4,912,749 entries, including 3,374,852 nouns, 1,537,699 verbs, 198 grammatical words. All files are provided as plain text in UTF8 character encoding, which represents about 154 Mb of data.
6 ELRA-S0308 Egyptian Arabic Speecon database 22/07/2010 Arabic (Egypt) 60000.00 EUR Not Specified The Egyptian Arabic Speecon database comprises the recordings of 550 adult Egyptian speakers and 50 child Egyptian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).
7 ELRA-S0350 GlobalPhone Arabic Pronunciation Dictionary 04/06/2013 Arabic 700.00 EUR Not Specified The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Arabic dictionary contains 29230 entries (27059 words).
8 ELRA-S0190 OrienTel Arabic as spoken in Israel database 06/12/2005 Arabic (Israel) 39843.00 EUR Not Specified This speech database contains the recordings of 750 Arabic speakers recorded over the Israeli fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.
9 ELRA-S0247 LC-STAR Standard Arabic Phonetic lexicon 15/11/2007 Arabic 27625.00 EUR Not Specified The LC-STAR Standard Arabic Phonetic lexicon comprises 110,271 entries, including a set of 52,981 common words, a set of 50,135 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,155 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
10 ELRA-S0192 GlobalPhone Arabic 30/01/2006 Arabic 700.00 EUR Not Specified The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.
11 ELRA-T0372-01 Multilingual Dictionary of Sports – English-French-Greek-Arabic-German-Spanish-Portuguese multilingual database 07/07/2009 English - French - Greek, Modern (1453-) - Arabic - German - Spanish, Castilian - Portuguese 400.00 EUR Not Specified This dictionary was produced within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. The current set consists of an English-French-Greek-Arabic-German-Spanish-Portuguese multilingual database. It contains a nomenclature of 37,500 entries for English, French, Greek and Arabic, 20,000 entries for Spanish, 22,000 for German and 10,000 for Portuguese. For each language, the contents consist of:
• Mandatory information: term, grammar
• Mandatory information unless unavailable (no source): reference/source
• Mandatory and common information: field (sport), domain, additional circumscription
• Optional information: definition and source, linguistic and source reference, combinatorics, other form, synonym
12 ELRA-W0030 Al-Hayat Arabic Corpus 15/01/2002 Arabic 720.00 EUR Not Specified The corpus contains articles extracted from the newspaper Al-Hayat, organised in 7 domains, for the development of language engineering applications.
13 ELRA-S0289 OrienTel Jordan MCA (Modern Colloquial Arabic) database 22/10/2008 Arabic (Jordan) 22500.00 EUR Not Specified This speech database contains the recordings of 757 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
14 ELRA-S0290 OrienTel Jordan MSA (Modern Standard Arabic) database 22/10/2008 Arabic (Jordan) 15000.00 EUR Not Specified This speech database contains the recordings of 556 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
15 ELRA-S0221 OrienTel Egypt MCA (Modern Colloquial Arabic) database 28/08/2006 Arabic (Egypt) 22500.00 EUR Not Specified This speech database contains the recordings of 750 Egyptian speakers recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
16 ELRA-S0222 OrienTel Egypt MSA (Modern Standard Arabic) database 28/08/2006 Arabic (Egypt) 15000.00 EUR Not Specified This speech database contains the recordings of 500 Egyptian speakers recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
17 ELRA-S0183 OrienTel Morocco MCA (Modern Colloquial Arabic) database 24/11/2005 Arabic (Morocco) 22500.00 EUR Not Specified This speech database contains the recordings of 772 Moroccan speakers recorded over the Moroccan fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
18 ELRA-S0184 OrienTel Morocco MSA (Modern Standard Arabic) database 24/11/2005 Arabic (Morocco) 15000.00 EUR Not Specified This speech database contains the recordings of 530 Moroccan speakers recorded over the Moroccan fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
19 ELRA-S0186 OrienTel Tunisia MCA (Modern Colloquial Arabic) database 24/11/2005 Arabic (Tunisia) 22500.00 EUR Not Specified This speech database contains the recordings of 792 Tunisian speakers recorded over the Tunisian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
20 ELRA-S0187 OrienTel Tunisia MSA (Modern Standard Arabic) database 24/11/2005 Arabic (Tunisia) 15000.00 EUR Not Specified This speech database contains the recordings of 598 Tunisian speakers recorded over the Tunisian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
21 ELRA-T0372-04 Multilingual Dictionary of Sports – English-French-Arabic trilingual database 07/07/2009 English - French - Arabic 200.00 EUR Not Specified This dictionary was produced within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. The current set consists of an English-French-Arabic trilingual database which includes the following information for each language:
• Mandatory information: term, reference/source, grammar
• Mandatory and common information: field (sport), domain, additional circumscription
• Optional information: definition and source OR linguistic and source reference, combinatorics, other form, synonym, variant
22 ELRA-W0036-04 "Le Monde Diplomatique" Text corpus in Arabic 04/06/2004 Arabic 69.00 EUR Not Specified Electronic archiving of "Le Monde Diplomatique" articles in Arabic from 2000. The corpus is available in HTML. Each HTML file contains one article.
23 ELRA-S0258 Orientel United Arab Emirates MCA (Modern Colloquial Arabic) 18/12/2007 Arabic (United Arab Emirates) 33250.00 EUR Not Specified This speech database contains the recordings of 750 Arabic speakers recorded over the United Arab Emirates' fixed and mobile telephone network. Each speaker uttered around 48 read and spontaneous items.
24 ELRA-S0259 Orientel United Arab Emirates MSA (Modern Standard Arabic) 18/12/2007 Arabic (United Arab Emirates) 21375.00 EUR Not Specified This speech database contains the recordings of 500 Arabic speakers recorded over the United Arab Emirates' fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.
25 ELRA-W0042 NEMLAR Written Corpus 11/08/2006 Arabic 300.00 EUR Not Specified The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. The corpus is provided in 4 different versions: raw text, fully vowelized text, text with Arabic lexical analysis, text with Arabic POS-tags.
26 ELRA-E0040 MEDAR Evaluation Package 28/03/2012 English >>>> Arabic - English - Arabic Not Specified 0.00 EUR The MEDAR Evaluation Package was produced within the project MEDAR (MEDiterranean ARabic language and speech technology), supported by the European Commission's ICT programme. It aims to enable the evaluation of SLT/MT (Machine Translation) systems on English-to-Arabic translation tasks.
27 ELRA-W0027 An-Nahar Newspaper Text Corpus 25/07/2001 Arabic (Lebanon) 504.00 EUR Not Specified The An-Nahar Newspaper Text Corpus comprises articles in Arabic (Lebanon) from 1995 to 2000 (6 years) stored as HTML files on CD-ROM media. Each year contains 45000 articles and 24 million words.
28 ELRA-S0315 A-SpeechDB 27/04/2011 Arabic 1000.00 EUR Not Specified A-SpeechDB© is an Arabic speech database which contains about 20 hours of continuous speech recorded through one desktop omni microphone by 205 native speakers from Egypt (about 30% of females and 70% of males), aged between 20 and 45. Automatically generated transcriptions are provided with a manually revised version for each sentence.
29 ELRA-S0219 NEMLAR Broadcast News Speech Corpus 11/08/2006 Arabic 300.00 EUR Not Specified The Nemlar Broadcast News Speech Corpus consists of about 40 hours of Standard Arabic news broadcasts. The broadcasts were recorded from four different radio stations: Medi1, Radio Orient, RMC – Radio Monte Carlo, RTM – Radio Television Maroc. All files were recorded in linear PCM format, 16 kHz, 16 bit.
30 ELRA-S0220 NEMLAR Speech Synthesis Corpus 11/08/2006 Arabic (Egypt) 1000.00 EUR Not Specified The NEMLAR Speech Synthesis Corpus contains the recordings of 2 native Egyptian Arabic speakers (male and female, 35 and 27 years old respectively) recorded in a studio over 2 channels (voice + laryngograph). The recordings comprise more than 10 hours of data with transcriptions.
31 ELRA-E0020 CESTA Evaluation Package 28/06/2007 English >>>> French - Arabic >>>> French Not Specified 300.00 EUR The CESTA Evaluation Package was produced within the French national project CESTA (Evaluation of MT systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The CESTA project made it possible to carry out a campaign for the evaluation of machine translation technologies.
This package includes the material that was used for the CESTA evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system.
The campaign is distributed over two actions: evaluation on a non-restrictive vocabulary, and evaluation on a specialised domain (evaluation after terminology enrichment).
32 ELRA-E0018 ARCADE II Evaluation Package 28/06/2007 Arabic - Chinese - English - French - German - Greek, Modern (1453-) - Italian - Japanese - Persian - Russian - Spanish, Castilian Not Specified 300.00 EUR The ARCADE II Evaluation Package was produced within the French national project ARCADE II (Evaluation of parallel text alignment systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The ARCADE II project made it possible to carry out an evaluation campaign in the field of multilingual alignment.
This package includes the material that was used for the ARCADE II evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system.
The campaign is distributed over two actions: sentence alignment and translation of named entities.