Download corpus
Corpus CIEMPIESS
The CIEMPIESS Corpus was designed to create acoustic models for automatic speech recognition. It consists of 17 hours of radio programs featuring spontaneous speech between the radio moderator and his guests. The entire corpus was taken from Radio-IUS (UNAM). It includes text transcriptions and the files needed to run experiments with the CMU-Sphinx recognition system.
Click Here For More Information
Click here to see the corpus in LDC
CIEMPIESS Corpus by Carlos Daniel Hernandez Mena is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://odin.fi-b.unam.mx/CIEMPIESS-UNAM/.
HTK2SPHINX-CONVERTER
HTK2SPHINX-CONVERTER is a program written in Python 2.7 that lets the user run the HTK speech recognition system in almost the same way as the CMU-SPHINX 3 speech recognition system, using the same input files.
HTK2SPHINX-CONVERTER can also perform "live decoding" using the speech recognition system Julius.
The two main differences between HTK2SPHINX-CONVERTER and CMU-SPHINX3 are that the former is a grammar-based, speaker-dependent recognition system, whereas the latter can use a language model and can be speaker independent.
Click Here For More Information
© Copyright 2014 Carlos Daniel Hernandez Mena
HTK-BENCHMARK
HTK-BENCHMARK is a program written in Python 2.7 that lets the user run the HTK speech recognition system in almost the same way as the CMU-SPHINX 3 speech recognition system, using the same input files.
HTK-BENCHMARK does not perform "live decoding".
HTK-BENCHMARK is based on recognition using a 3-gram language model in ARPA format compatible with SPHINX3.
© Copyright 2015 Carlos Daniel Hernandez Mena
Fonetica3 Library
The fonetica3 library contains functions to produce
phonetic and phonological transcriptions of Spanish words.
© Copyright 2017 Carlos Daniel Hernandez Mena
CORPUS CHM150
The CHM150 is a corpus of microphone speech in Mexican Spanish recorded from 75 male speakers and
75 female speakers in a "quiet office" noise environment, with a total duration of 1.63
hours.
Speakers were encouraged to answer some preselected open questions, or they could
describe a particular painting shown to them on a computer monitor. As a result, the speech
is completely spontaneous, as can be seen in the transcription file, which captures
disfluencies and mispronunciations orthographically.
The CHM150 corpus contains a total of 2663 utterances classified by speaker,
and it has a small vocabulary of 1898 unique words. For these reasons
the CHM150 may be too small for speech recognition, but it is suitable for
spoken term detection and forensic speaker identification.
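To give a sense of scale, the figures quoted above (150 speakers, 2663 utterances, 1.63 hours) can be turned into per-speaker and per-utterance averages. The following is only a back-of-the-envelope sketch: the variable names are illustrative, and the averages assume nothing about how the material is actually distributed among speakers.

```python
# Figures taken from the corpus description above.
speakers = 75 + 75      # male + female speakers
utterances = 2663
hours = 1.63

# Simple averages; the real distribution across speakers may be uneven.
utterances_per_speaker = utterances / speakers
seconds_per_utterance = hours * 3600 / utterances
minutes_per_speaker = hours * 60 / speakers

print(f"{utterances_per_speaker:.1f} utterances per speaker on average")
print(f"{seconds_per_utterance:.1f} seconds per utterance on average")
print(f"{minutes_per_speaker:.2f} minutes of speech per speaker on average")
```

On average, then, each speaker contributes well under a minute of audio, which is consistent with the remark that the corpus is small for training a recognizer.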
You can also download it from the Linguistic Data Consortium (LDC) website. You just have to create
a new account, then you can request the corpus by email.
https://catalog.ldc.upenn.edu/LDC2016S04
CORPUS CIEMPIESS LIGHT
The CIEMPIESS LIGHT Corpus is an enhanced version of the CIEMPIESS Corpus (LDC item LDC2015S07).
CIEMPIESS LIGHT is "light" because it omits many of the files of the first version of
CIEMPIESS, and it is "enhanced" because it has many improvements, some of them suggested by
our community of users, that make this version more convenient for newer speech recognition
engines such as Kaldi (http://kaldi-asr.org/).
You can also download it from the Linguistic Data Consortium (LDC) website. You just have to create
a new account, then you can request the corpus by email.
https://catalog.ldc.upenn.edu/LDC2017S23
CIEMPIESS BALANCE
The CIEMPIESS BALANCE Corpus (LDC2018S11) is designed to complement the CIEMPIESS LIGHT Corpus (LDC2017S23). CIEMPIESS BALANCE is called "balance" because it is designed to balance CIEMPIESS LIGHT: if both corpora are combined, the result is a gender-balanced corpus. To appreciate this, one needs to know that CIEMPIESS LIGHT is by itself a gender-unbalanced corpus with approximately 25% female speakers and 75% male speakers. Accordingly, CIEMPIESS BALANCE is a gender-unbalanced corpus with approximately 25% male speakers and 75% female speakers.
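The balancing argument can be checked with a short weighted-average calculation. This is a minimal sketch, not part of either corpus: the function name `combined_female_share` is hypothetical, the 25%/75% shares are the approximate figures quoted above, and treating both corpora as the same size is an assumption made only for illustration.

```python
def combined_female_share(shares_and_sizes):
    """Return the female share of a combined corpus.

    shares_and_sizes: list of (female_share, size) pairs, where size may be
    speakers, utterances or hours, as long as the unit is the same for all.
    """
    total = sum(size for _, size in shares_and_sizes)
    female = sum(share * size for share, size in shares_and_sizes)
    return female / total


# Approximate shares from the description; equal sizes are an assumption.
light = (0.25, 1.0)    # CIEMPIESS LIGHT: about 25% female
balance = (0.75, 1.0)  # CIEMPIESS BALANCE: about 75% female

print(combined_female_share([light, balance]))  # 0.5
```

If the two corpora differ in size, the combined share moves toward the larger one, so the result is exactly balanced only when the sizes match.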
You can also download it from the Linguistic Data Consortium (LDC) website. You just have to create a new account, then you can request the corpus by email.
https://catalog.ldc.upenn.edu/LDC2018S11
TEDx Spanish Corpus
The TEDx Spanish Corpus is a dataset created from TEDx talks in Spanish, intended for the Automatic Speech Recognition (ASR) task.
The TEDx Spanish Corpus is a gender-unbalanced corpus with a duration of 24 hours. It contains spontaneous speech from several speakers at TEDx events; most of them are men.
CIEMPIESS Experimentation Package
CIEMPIESS Experimentation Package is a set of three different data sets: Complementary, Fem and Test. Complementary is a phonetically balanced corpus of isolated Spanish words spoken in Central Mexico. Fem contains broadcast speech from 21 female speakers, collected to balance, by gender, the recordings from male speakers in other CIEMPIESS collections. Test consists of 8 hours of broadcast speech and transcripts and is intended for use as a standard test set. See the included documentation for more details on each corpus.
You can also download it from the Linguistic Data Consortium (LDC) website. You just have to create a new account, then you can request the corpus by email.
https://catalog.ldc.upenn.edu/LDC2019S07
CIEMPIESS Spanish Models
The "CIEMPIESS Spanish Models" are acoustic models designed to work with PocketSphinx. The 581 hours of audio recordings used to train the models come from many LDC datasets (including all the CIEMPIESS corpora except CIEMPIESS-TEST) and from other sources collected by the social service program "Desarrollo de Tecnologías del Habla" and the CIEMPIESS-UNAM project. Both belong to the "Universidad Nacional Autónoma de México" (UNAM) in Mexico City.
CIEMPIESS-PNPD
The CIEMPIESS Proper-Names Pronouncing Dictionary (CIEMPIESS-PNPD) is a pronouncing dictionary of proper names manually created by native speakers of Spanish. It was designed for speech recognition and speech synthesis tasks, but it may also be useful for NLP tasks in general. The CIEMPIESS-PNPD has almost 200 thousand entries. It includes alternative pronunciations of some proper names, as well as lists of the proper names classified into three categories: names, last names and places. The "unknown" list contains proper names that do not appear in any of those lists, so their category cannot be determined. The proper names collected for the CIEMPIESS-PNPD were taken from institutions in Spanish-speaking countries such as Mexico, Spain and Costa Rica. Most of the names come from lists of voters of those countries. The names of places were taken from the Instituto Nacional de Estadística, Geografía e Informática (INEGI) of Mexico. This means that the place names in the CIEMPIESS-PNPD refer only to places, streets, neighborhoods, states, counties, etc. in Mexico.
It was released at the OpenCor 2019 Conference held in Guanajuato, Mexico.
LibriVox Spanish
LibriVox Spanish (LDC2020S01) consists of approximately 73 hours of Spanish read speech and transcripts. The audio data was taken from Spanish audiobooks developed by LibriVox, a non-profit project that creates audiobooks from public domain works. The transcripts were developed from scratch for this release by native speakers of Spanish.
The link provided by the Linguistic Data Consortium (LDC) to request the corpus is:
https://catalog.ldc.upenn.edu/LDC2020S01
Wikipedia Spanish Corpus
Wikipedia Spanish Speech and Transcripts (LDC2021S07) consists of approximately 25 hours of Spanish read speech and transcripts. The read text was taken from the Spanish version of WikiProject Spoken Wikipedia, referred to as Wikipedia Grabada. The transcripts were developed for this release by native speakers of Spanish.
CIEMPIESS-UNAM Project at Hugging Face
Visit our profile at Hugging Face