Language(s): Arabic
Application(s): ASR
No. | ELRA Catalog No. | Item Name | Available since | Language(s) | Cost for research use | Cost for evaluation use | Description |
---|---|---|---|---|---|---|---|
1 | ELRA-M0040 | Not Specified | 20/01/2004 | Arabic <<< >>> French | 18000.00 EUR | Not Specified | DixAF is a French-Arabic, Arabic-French dictionary, which consists of around 125,000 binary links between ca. 43,000 French entries and ca. 35,000 Arabic entries. |
2 | ELRA-S0157 | NetDC Arabic BNSC (Broadcast News Speech Corpus) | 08/02/2007 | Arabic | 200.00 EUR | Not Specified | The NetDC Arabic BNSC (Broadcast News Speech Corpus) is a corpus developed by ELDA in the framework of the European-funded project Network of Data Centres (NetDC). The project was done in collaboration with the LDC (Linguistic Data Consortium), which has produced a similar corpus from the news broadcast by Voice of America Arabic in the United States. The database contains ca. 22.5 hours of broadcast news speech recorded from Radio Orient (France) during a 3-month period. |
3 | ELRA-W0049 | "Le Monde Diplomatique" Arabic tagged corpus | 31/03/2009 | Arabic | 400.00 EUR | Not Specified | This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04). Each text is associated with 3 files: raw text in Arabic, vowelised text in Arabic, and one XML file containing the morphological annotation of the text. |
4 | ELRA-W0078 | NE3L named entities Arabic corpus | 29/09/2014 | Arabic | 5000.00 EUR | Not Specified | The Arabic corpus contains 103,363 words coming from articles extracted from “Le Monde Diplomatique” newspaper, and published in 2004. 2 named entity categories were taken into account: Time and Amount. |
5 | ELRA-L0088 | Arabic Morphological Dictionary | 20/03/2012 | Arabic | 450.00 EUR | Not Specified | The Arabic Morphological Dictionary contains 4,912,749 entries, including 3,374,852 nouns, 1,537,699 verbs, 198 grammatical words. All files are provided as plain text in UTF8 character encoding, which represents about 154 Mb of data. |
6 | ELRA-S0308 | Egyptian Arabic Speecon database | 22/07/2010 | Arabic (Egypt) | 60000.00 EUR | Not Specified | The Egyptian Arabic Speecon database comprises the recordings of 550 adult Egyptian speakers and 50 child Egyptian speakers who uttered respectively over 290 items and 210 items (read and spontaneous). |
7 | ELRA-S0350 | GlobalPhone Arabic Pronunciation Dictionary | 04/06/2013 | Arabic | 700.00 EUR | Not Specified | The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Arabic dictionary contains 29230 entries (27059 words). |
8 | ELRA-S0190 | OrienTel Arabic as spoken in Israel database | 06/12/2005 | Arabic (Israel) | 39843.00 EUR | Not Specified | This speech database contains the recordings of 750 Arabic speakers recorded over the Israeli fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items. |
9 | ELRA-S0247 | LC-STAR Standard Arabic Phonetic lexicon | 15/11/2007 | Arabic | 27625.00 EUR | Not Specified | The LC-STAR Standard Arabic Phonetic lexicon comprises 110,271 entries, including a set of 52,981 common words, a set of 50,135 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,155 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA. |
10 | ELRA-S0192 | GlobalPhone Arabic | 30/01/2006 | Arabic | 700.00 EUR | Not Specified | The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news. |
11 | ELRA-T0372-01 | Multilingual Dictionary of Sports – English-French-Greek-Arabic-German-Spanish-Portuguese multilingual database | 07/07/2009 | English - French - Greek, Modern (1453-) - Arabic - German - Spanish, Castilian - Portuguese | 400.00 EUR | Not Specified | This dictionary was produced within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. The current set consists of an English-French-Greek-Arabic-German-Spanish-Portuguese multilingual database. It contains a nomenclature of 37,500 entries for English, French, Greek and Arabic, 20,000 entries for Spanish, 22,000 for German and 10,000 for Portuguese. For each language, the contents consist of: • Mandatory information: term, grammar • Mandatory information except if not available (no source): reference/source • Mandatory and common information: field (sport), domain, additional circumscription • Optional information: definition and source, linguistic and source reference, combinatorics, other form, synonym |
12 | ELRA-W0030 | Al-Hayat Arabic Corpus | 15/01/2002 | Arabic | 720.00 EUR | Not Specified | The corpus contains articles extracted from the newspaper Al-Hayat, organised in 7 domains, for the development of language engineering applications. |
13 | ELRA-S0289 | OrienTel Jordan MCA (Modern Colloquial Arabic) database | 22/10/2008 | Arabic (Jordan) | 22500.00 EUR | Not Specified | This speech database contains the recordings of 757 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items. |
14 | ELRA-S0290 | OrienTel Jordan MSA (Modern Standard Arabic) database | 22/10/2008 | Arabic (Jordan) | 15000.00 EUR | Not Specified | This speech database contains the recordings of 556 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items. |
15 | ELRA-S0221 | OrienTel Egypt MCA (Modern Colloquial Arabic) database | 28/08/2006 | Arabic (Egypt) | 22500.00 EUR | Not Specified | This speech database contains the recordings of 750 Egyptian speakers recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items. |
16 | ELRA-S0222 | OrienTel Egypt MSA (Modern Standard Arabic) database | 28/08/2006 | Arabic (Egypt) | 15000.00 EUR | Not Specified | This speech database contains the recordings of 500 Egyptian speakers recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items. |
17 | ELRA-S0183 | OrienTel Morocco MCA (Modern Colloquial Arabic) database | 24/11/2005 | Arabic (Morocco) | 22500.00 EUR | Not Specified | This speech database contains the recordings of 772 Moroccan speakers recorded over the Moroccan fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items. |
18 | ELRA-S0184 | OrienTel Morocco MSA (Modern Standard Arabic) database | 24/11/2005 | Arabic (Morocco) | 15000.00 EUR | Not Specified | This speech database contains the recordings of 530 Moroccan speakers recorded over the Moroccan fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items. |
19 | ELRA-S0186 | OrienTel Tunisia MCA (Modern Colloquial Arabic) database | 24/11/2005 | Arabic (Tunisia) | 22500.00 EUR | Not Specified | This speech database contains the recordings of 792 Tunisian speakers recorded over the Tunisian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items. |
20 | ELRA-S0187 | OrienTel Tunisia MSA (Modern Standard Arabic) database | 24/11/2005 | Arabic (Tunisia) | 15000.00 EUR | Not Specified | This speech database contains the recordings of 598 Tunisian speakers recorded over the Tunisian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items. |
21 | ELRA-T0372-04 | Multilingual Dictionary of Sports – English-French-Arabic trilingual database | 07/07/2009 | English - French - Arabic | 200.00 EUR | Not Specified | This dictionary was produced within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. The current set consists of an English-French-Arabic trilingual database which includes the following information for each language: • Mandatory information: term, reference/source, grammar • Mandatory and common information: field (sport), domain, additional circumscription • Optional information: definition and source OR linguistic and source reference, combinatorics, other form, synonym, variant |
22 | ELRA-W0036-04 | "Le Monde Diplomatique" Text corpus in Arabic | 04/06/2004 | Arabic | 69.00 EUR | Not Specified | Electronic archiving of "Le Monde Diplomatique" articles in Arabic from 2000. The corpus is available in HTML. Each HTML file contains one article. |
23 | ELRA-S0258 | Orientel United Arab Emirates MCA (Modern Colloquial Arabic) | 18/12/2007 | Arabic (United Arab Emirates) | 33250.00 EUR | Not Specified | This speech database contains the recordings of 750 Arabic speakers recorded over the United Arab Emirates' fixed and mobile telephone network. Each speaker uttered around 48 read and spontaneous items. |
24 | ELRA-S0259 | Orientel United Arab Emirates MSA (Modern Standard Arabic) | 18/12/2007 | Arabic (United Arab Emirates) | 21375.00 EUR | Not Specified | This speech database contains the recordings of 500 Arabic speakers recorded over the United Arab Emirates' fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items. |
25 | ELRA-W0042 | NEMLAR Written Corpus | 11/08/2006 | Arabic | 300.00 EUR | Not Specified | The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. The corpus is provided in 4 different versions: raw text, fully vowelized text, text with Arabic lexical analysis, text with Arabic POS-tags. |
26 | ELRA-E0040 | MEDAR Evaluation Package | 28/03/2012 | English >>>> Arabic - English - Arabic | Not Specified | 0.00 EUR | The MEDAR Evaluation Package was produced within the project MEDAR (MEDiterranean ARabic language and speech technology), supported by the European Commission's ICT programme. It aims to enable the evaluation of SLT/MT (Machine Translation) systems for translation tasks applying to the English-to-Arabic direction. |
27 | ELRA-W0027 | An-Nahar Newspaper Text Corpus | 25/07/2001 | Arabic (Lebanon) | 504.00 EUR | Not Specified | The An-Nahar Newspaper Text Corpus comprises articles in Arabic (Lebanon) from 1995 to 2000 (6 years) stored as HTML files on CD-ROM media. Each year contains 45000 articles and 24 million words. |
28 | ELRA-S0315 | A-SpeechDB | 27/04/2011 | Arabic | 1000.00 EUR | Not Specified | A-SpeechDB© is an Arabic speech database which contains about 20 hours of continuous speech recorded through one desktop omni microphone by 205 native speakers from Egypt (about 30% of females and 70% of males), aged between 20 and 45. Automatically generated transcriptions are provided with a manually revised version for each sentence. |
29 | ELRA-S0219 | NEMLAR Broadcast News Speech Corpus | 11/08/2006 | Arabic | 300.00 EUR | Not Specified | The Nemlar Broadcast News Speech Corpus consists of about 40 hours of Standard Arabic news broadcasts. The broadcasts were recorded from four different radio stations: Medi1, Radio Orient, RMC – Radio Monte Carlo, RTM – Radio Television Maroc. All files were recorded in linear PCM format, 16 kHz, 16 bit. |
30 | ELRA-S0220 | NEMLAR Speech Synthesis Corpus | 11/08/2006 | Arabic (Egypt) | 1000.00 EUR | Not Specified | The NEMLAR Speech Synthesis Corpus contains the recordings of 2 native Egyptian Arabic speakers (male and female, 35 and 27 years old respectively) recorded in a studio over 2 channels (voice + laryngograph). The recordings comprise more than 10 hours of data with transcriptions. |
31 | ELRA-E0020 | CESTA Evaluation Package | 28/06/2007 | English >>>> French - Arabic >>>> French | Not Specified | 300.00 EUR | The CESTA Evaluation Package was produced within the French national project CESTA (Evaluation of MT systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The CESTA project made it possible to carry out a campaign for the evaluation of machine translation technologies. This package includes the material that was used for the CESTA evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system. The campaign is distributed over two actions: evaluation on a non-restrictive vocabulary, evaluation on a specialised domain (evaluation after terminology enrichment). |
32 | ELRA-E0018 | ARCADE II Evaluation Package | 28/06/2007 | Arabic - Chinese - English - French - German - Greek, Modern (1453-) - Italian - Japanese - Persian - Russian - Spanish, Castilian | Not Specified | 300.00 EUR | The ARCADE II Evaluation Package was produced within the French national project ARCADE II (Evaluation of parallel text alignment systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The ARCADE II project made it possible to carry out an evaluation campaign in the field of multilingual alignment. This package includes the material that was used for the ARCADE II evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system. The campaign is distributed over two actions: sentence alignment and translation of named entities. |