Language(s): Hindi
Application(s): ASR
No. | ELRA Catalog No. | Item Name | Available since | Language(s) | Cost for research use | Cost for evaluation use | Description |
---|---|---|---|---|---|---|---|
1 | ELRA-S0344 | LILA Hindi Belt database | 21/06/2012 | Hindi | 35000.00 EUR | Not Specified | The LILA Hindi Belt database comprises 2,023 Hindi speakers (1,011 males and 1,012 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered 83 read and spontaneous items. |
2 | ELRA-S0281 | LILA Hindi-L1 database | 03/09/2008 | Hindi | 40000.00 EUR | Not Specified | The LILA Hindi-L1 database comprises 2,030 Hindi speakers (1,012 males and 1,018 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered around 60 read and spontaneous items. |
3 | ELRA-W0037 | The EMILLE/CIIL Corpus | 15/09/2004 | Urdu - Telugu - Tamil - Sinhalese - Panjabi, Punjabi - Oriya - Marathi - Malayalam - Kashmiri - Kannada - Hindi - Gujarati - Bengali - Assamese - English | 0.00 EUR | Not Specified | The EMILLE/CIIL Corpus consists of monolingual corpora containing approximately 92,799,000 words for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora automatically annotated for parts of speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode. This database is available for research use by academic organisations only. For use by commercial organisations, a subset of the EMILLE/CIIL Corpus is available under the reference ELRA-W0038 (The EMILLE Lancaster Corpus). |
4 | ELRA-W0038 | The EMILLE Lancaster Corpus | 15/09/2004 | Bengali - Gujarati - Hindi - Panjabi, Punjabi - Sinhalese - Tamil - Urdu - English | Not Specified | Not Specified | The EMILLE Lancaster Corpus consists of monolingual corpora containing approximately 58,880,000 words for seven South Asian languages (Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil and Urdu), including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000 words in English with translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. Annotations include Urdu monolingual and parallel corpora automatically annotated for parts of speech, and 20 written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode. |