COMPARISON TABLE OF CORPORA FROM THE ELRA CATALOGUE



Search criteria used on the ELRA website:

Language(s): Hindi

Application(s): ASR



No.: 1
ELRA Catalog No.: ELRA-S0344
Item Name: LILA Hindi Belt database
Available since: 21/06/2012
Language(s): Hindi
Cost for research use: 35000.00 EUR
Cost for evaluation use: Not Specified
Description: The LILA Hindi Belt database comprises 2,023 Hindi speakers (1,011 males and 1,012 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered 83 read and spontaneous items.

No.: 2
ELRA Catalog No.: ELRA-S0281
Item Name: LILA Hindi-L1 database
Available since: 03/09/2008
Language(s): Hindi
Cost for research use: 40000.00 EUR
Cost for evaluation use: Not Specified
Description: The LILA Hindi-L1 database comprises 2,030 Hindi speakers (1,012 males and 1,018 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered around 60 read and spontaneous items.

No.: 3
ELRA Catalog No.: ELRA-W0037
Item Name: The EMILLE/CIIL Corpus
Available since: 15/09/2004
Language(s): Urdu - Telugu - Tamil - Sinhalese - Panjabi, Punjabi - Oriya - Marathi - Malayalam - Kashmiri - Kannada - Hindi - Gujarati - Bengali - Assamese - English
Cost for research use: 0.00 EUR
Cost for evaluation use: Not Specified
Description: The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.
This database is available for research use by academic organisations only. For use by commercial organisations, a subset of the EMILLE/CIIL Corpus is available under the reference ELRA-W0038 (The EMILLE Lancaster Corpus).

No.: 4
ELRA Catalog No.: ELRA-W0038
Item Name: The EMILLE Lancaster Corpus
Available since: 15/09/2004
Language(s): Bengali - Gujarati - Hindi - Panjabi, Punjabi - Sinhalese - Tamil - Urdu - English
Cost for research use: Not Specified
Cost for evaluation use: Not Specified
Description: The EMILLE Lancaster Corpus consists of monolingual corpora containing approximately 58,880,000 words for seven South Asian languages (Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora automatically annotated for parts-of-speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.
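
To put the two telephone speech databases listed above (ELRA-S0344 and ELRA-S0281) side by side, the figures from their descriptions can be turned into a few derived numbers. The following is only a minimal sketch: the class name SpeechCorpus and its fields are hypothetical, the item count for ELRA-S0281 is taken as 60 although the catalogue says "around 60", and the derived values (total items, research cost per speaker) are simple arithmetic on the listed figures, not numbers published by ELRA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechCorpus:
    """Hypothetical record holding the figures quoted in the catalogue entries."""
    catalog_no: str
    name: str
    speakers: int
    males: int
    females: int
    items_per_speaker: int      # approximate for ELRA-S0281 ("around 60")
    research_cost_eur: float

    def total_items(self) -> int:
        # Rough upper bound on recorded items: speakers x items per speaker.
        return self.speakers * self.items_per_speaker

    def cost_per_speaker(self) -> float:
        # Research-use price divided by the number of speakers.
        return self.research_cost_eur / self.speakers


CORPORA = [
    SpeechCorpus("ELRA-S0344", "LILA Hindi Belt database", 2023, 1011, 1012, 83, 35000.0),
    SpeechCorpus("ELRA-S0281", "LILA Hindi-L1 database", 2030, 1012, 1018, 60, 40000.0),
]

for corpus in CORPORA:
    # Sanity check: the male and female counts should add up to the speaker total.
    assert corpus.males + corpus.females == corpus.speakers
    print(
        f"{corpus.catalog_no}: {corpus.total_items()} items, "
        f"{corpus.cost_per_speaker():.2f} EUR per speaker"
    )
```

The two text corpora (ELRA-W0037 and ELRA-W0038) are left out of this sketch because their entries give word counts rather than speaker counts, so the same derived figures do not apply.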