Download corpus

Corpus CIEMPIESS

The CIEMPIESS Corpus was designed to create acoustic models for automatic speech recognition. It consists of 17 hours of radio programs with spontaneous speech between the radio moderator and his guests. The entire corpus was taken from Radio-IUS (UNAM). It includes text transcriptions and the files needed to perform experiments with the CMU-Sphinx recognition system.

CIEMPIESS_Statistics

README.txt file

Click Here For More Information

Click here to see the corpus in LDC

How to cite?
License: Creative Commons License
CIEMPIESS Corpus by Carlos Daniel Hernandez Mena is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://odin.fi-b.unam.mx/CIEMPIESS-UNAM/.

Download CIEMPIESS

HTK2SPHINX-CONVERTER

HTK2SPHINX-CONVERTER is a tool written in Python 2.7 that lets the user run the HTK speech recognition system in almost the same way as the CMU-SPHINX 3 speech recognition system, using the same input files.

HTK2SPHINX-CONVERTER can also perform "live decoding" using the speech recognition system Julius.

The two main differences between HTK2SPHINX-CONVERTER and CMU-SPHINX3 are that the former is a grammar-based, speaker-dependent recognition system, whereas the latter can use a language model and can be speaker independent.

Click Here For More Information

© Copyright 2014 Carlos Daniel Hernandez Mena

Download HTK2SPHINX-CONVERTER

HTK-BENCHMARK

HTK-BENCHMARK is a tool written in Python 2.7 that lets the user run the HTK speech recognition system in almost the same way as the CMU-SPHINX 3 speech recognition system, using the same input files.


HTK-BENCHMARK does not perform "live decoding".


HTK-BENCHMARK is based on recognition using a 3-gram language model in ARPA format compatible with SPHINX3.
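As a rough illustration of what "ARPA format" means here, the sketch below reads the `\data\` header of an ARPA language model and reports how many n-grams of each order it declares. The function name `arpa_ngram_counts` and the sample text are hypothetical and are not part of HTK-BENCHMARK; the sample is a header-only fragment, not a complete model.

```python
import re

def arpa_ngram_counts(text):
    """Return {order: count} as declared in the \\data\\ header of an ARPA file."""
    counts = {}
    in_header = False
    for line in text.splitlines():
        line = line.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if in_header:
            match = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line.startswith("\\"):
                # The next section marker (e.g. \1-grams:) ends the header.
                break
    return counts

# Hypothetical header-only fragment, for illustration only.
SAMPLE = r"""\data\
ngram 1=4
ngram 2=3
ngram 3=2

\1-grams:
\end\
"""

counts = arpa_ngram_counts(SAMPLE)
print(counts)       # {1: 4, 2: 3, 3: 2}
print(max(counts))  # 3, i.e. the header declares a 3-gram model
```

A model whose header declares orders 1 through 3 is a 3-gram model, which is the kind of language model described above.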


© Copyright 2015 Carlos Daniel Hernandez Mena

Download HTK-BENCHMARK

Fonetica3 Library

The fonetica3 library contains functions to perform phonetic and phonological transcriptions of Spanish words.

© Copyright 2017 Carlos Daniel Hernandez Mena

Download Fonetica3 Library

CORPUS CHM150

The CHM150 is a corpus of microphone speech in Mexican Spanish taken from 75 male speakers and 75 female speakers, recorded in a "quiet office" noise environment, with a total duration of 1.63 hours.

Speakers were encouraged to answer some preselected open questions, or they could instead describe a particular painting shown to them on a computer monitor. As a result, the speech is completely spontaneous, and this can be seen in the transcription file, which captures disfluencies and mispronunciations orthographically.

The CHM150 corpus contains a total of 2663 utterances classified by speaker, and it has a small vocabulary of 1898 unique words. For these reasons the CHM150 may be too small for speech recognition, but it is well suited to spoken term detection and forensic speaker identification.

You can also download it from the Linguistic Data Consortium (LDC) website. You just have to create a new account; then you can request the corpus by email.
https://catalog.ldc.upenn.edu/LDC2016S04

Download

CORPUS CIEMPIESS LIGHT

The CIEMPIESS LIGHT Corpus is an enhanced version of the CIEMPIESS Corpus (LDC item LDC2015S07).

CIEMPIESS LIGHT is "light" because it omits many of the files of the first version of CIEMPIESS, and it is "enhanced" because it has many improvements, some of them suggested by our community of users, that make this version more convenient for newer speech recognition engines such as Kaldi (http://kaldi-asr.org/).

You can also download it from the Linguistic Data Consortium (LDC) website. You just have to create a new account; then you can request the corpus by email.
https://catalog.ldc.upenn.edu/LDC2017S23

Download

CIEMPIESS BALANCE

The CIEMPIESS BALANCE Corpus (LDC2018S11) is designed to complement the CIEMPIESS LIGHT Corpus (LDC2017S23). CIEMPIESS BALANCE is "balance" because it is designed to balance CIEMPIESS LIGHT: if both corpora are combined, the result is a gender-balanced corpus. To appreciate this, one needs to know that CIEMPIESS LIGHT is by itself a gender-unbalanced corpus with approximately 25% female speakers and 75% male speakers. Accordingly, CIEMPIESS BALANCE is a gender-unbalanced corpus with approximately 25% male speakers and 75% female speakers.
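The balancing argument can be checked with a small calculation. The sketch below uses hypothetical speaker counts chosen only to reproduce the approximate 25%/75% proportions; the real speaker counts are not stated here, and the result is exactly 50% only if both corpora contribute the same number of speakers.

```python
def female_share(corpora):
    """Return the fraction of female speakers over a list of (female, male) counts."""
    total_female = sum(female for female, _ in corpora)
    total = sum(female + male for female, male in corpora)
    return total_female / total

# Hypothetical counts (female, male); only the proportions come from the text.
light = (25, 75)    # about 25% female, 75% male
balance = (75, 25)  # about 75% female, 25% male

print(female_share([light]))           # 0.25
print(female_share([light, balance]))  # 0.5
```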

You can also download it from the Linguistic Data Consortium (LDC) website. You just have to create a new account; then you can request the corpus by email.
https://catalog.ldc.upenn.edu/LDC2018S11

Download

TEDx Spanish Corpus

The TEDx Spanish Corpus is a dataset created from TEDx talks in Spanish, intended for use in the automatic speech recognition (ASR) task.

The TEDx Spanish Corpus is a gender-unbalanced corpus with a duration of 24 hours. It contains spontaneous speech from several speakers at TEDx events; most of them are men.

Download

CIEMPIESS Experimentation Package

The CIEMPIESS Experimentation Package is a set of three different data sets: Complementary, Fem and Test. Complementary is a phonetically balanced corpus of isolated Spanish words spoken in Central Mexico. Fem contains broadcast speech from 21 female speakers, collected to balance by gender the number of recordings from male speakers in other CIEMPIESS collections. Test consists of 8 hours of broadcast speech and transcripts and is intended for use as standard test data. See the included documentation for more details on each corpus.

You can also download it from the Linguistic Data Consortium (LDC) website. You just have to create a new account; then you can request the corpus by email.
https://catalog.ldc.upenn.edu/LDC2019S07

NEW!!!: Updated transcriptions for CIEMPIESS-TEST

Download

CIEMPIESS Spanish Models

The "CIEMPIESS Spanish Models" are acoustic models designed to work with PocketSphinx. The 581 hours of audio recordings used to train the models come from many datasets by LDC (including all the CIEMPIESS corpora except CIEMPIESS-TEST) and other sources collected by the social service program "Desarrollo de Tecnologías del Habla" and the CIEMPIESS-UNAM project. Both belong to the "Universidad Nacional Autónoma de México" (UNAM) in Mexico City.

Download

CIEMPIESS-PNPD

The CIEMPIESS Proper-Names Pronouncing Dictionary (CIEMPIESS-PNPD) is a pronouncing dictionary of proper names manually created by native speakers of Spanish. It was designed for speech recognition and speech synthesis tasks, but it may also be useful for NLP tasks in general. The CIEMPIESS-PNPD has almost 200 thousand entries. It includes alternative pronunciations of some proper names, as well as lists of the proper names classified into the categories names, last names and places. The "unknown" list includes proper names that are not included in any other list, so their category cannot be determined. The proper names collected for the CIEMPIESS-PNPD were taken from institutions in Spanish-speaking countries such as Mexico, Spain and Costa Rica. Most of the names come from voter lists of those countries. The names of places were taken from the Instituto Nacional de Estadística, Geografía e Informática (INEGI) of Mexico; this means that the place names in the CIEMPIESS-PNPD refer only to places, streets, neighborhoods, states, counties, etc. in Mexico.

It was released at the OpenCor 2019 Conference held in Guanajuato, Mexico.

Download

LibriVox Spanish

LibriVox Spanish (LDC2020S01) consists of approximately 73 hours of Spanish read speech and transcripts. The audio data was taken from Spanish audiobooks developed by LibriVox, a non-profit project that creates audiobooks from public domain works. The transcripts were developed from scratch for this release by native speakers of Spanish.

The link provided by the Linguistic Data Consortium (LDC) to request the corpus is:

https://catalog.ldc.upenn.edu/LDC2020S01

Download

Wikipedia Spanish Corpus

Wikipedia Spanish Speech and Transcripts (LDC2021S07) consists of approximately 25 hours of Spanish read speech and transcripts. The read text was taken from the Spanish version of WikiProject Spoken Wikipedia, referred to as Wikipedia Grabada. The transcripts were developed for this release by native speakers of Spanish.

Download

CIEMPIESS-UNAM Project at Hugging Face

Visit our profile at Hugging Face

VISIT

Download Zone

In this section you can download the tools and language resources developed by the CIEMPIESS-UNAM Project. All of our contents are released under international licenses that are free of charge to the public, so you can modify, distribute and adapt our creations to your particular needs at no cost. If you find bugs in our software, please tell us; if you improve it, please share it!
If you download our tools for academic use, please cite us; it helps us a lot!