CIEMPIESS-UNAM

Online Resources

Corpora for ASR in the 5 Most Spoken Languages in the World

According to the "Anuario 2013" of the "Instituto Cervantes"¹ and the "Atlas de la lengua española en el mundo"², the five most spoken languages in the world are Mandarin Chinese, English, Spanish, Hindi and Arabic.

In this section, we present comparison tables of several corpora in these five languages, drawn from the catalogs of the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA).

LDC Tables	ELRA Tables
Mandarin-Chinese	Mandarin-Chinese
English	English
Spanish	Spanish
Hindi	Hindi
Arabic	Arabic

Note 1. The Instituto Cervantes (http://www.cervantes.es/) is a public organization founded on March 21, 1991 by the government of Spain and sponsored by the King of Spain. It is attached to the "Ministerio de Asuntos Exteriores", and its main goal is to promote the teaching of the Spanish language and the culture of Spain and Hispanic America all over the world.

Note 2. http://cvc.cervantes.es/lengua/anuario/anuario_13/

Evolution of the MEXBET Alphabet

In this section we present a group of tables that trace the evolution of the MEXBET alphabet over time. These tables are:

Table 1. Allophones of Mexican Spanish in IPA Alphabet
Table 2. Phonological System of Mexican Spanish in IPA Alphabet
Table 3. Allophones of Mexican Spanish in MEXBET Alphabet
Table 4. Phonological System of Mexican Spanish in MEXBET Alphabet
Table 5. Equivalences between IPA and MEXBET Alphabets
Table 6. Transcription Levels of the MEXBET Alphabet utilized at the DIMEx100 Corpus
Table 7. Level T66 of MEXBET utilized at the CIEMPIESS Corpus
Table 8. Version of the T29 Level of MEXBET utilized at the CIEMPIESS Corpus

You can access them by clicking on the following link:

Evolution of MEXBET
Note: In our previous papers, we referred to level T29 as T22 and to level T66 as T50. This is incorrect, because the number in a level name ("22", "29", "66", etc.) must reflect the number of phonemes and allophones considered in that level of MEXBET.

Version of the MEXBET Alphabet utilized at the CIEMPIESS Corpus

In this section we present the two transcription levels used for the pronouncing dictionaries of the CIEMPIESS Corpus. These levels are:

Table 1. Level T66 of MEXBET utilized at the CIEMPIESS Corpus
Table 2. Version of the T29 Level of MEXBET utilized at the CIEMPIESS Corpus

You can access these tables by clicking on the following link:

MEXBET for the CIEMPIESS Corpus
Note: In our previous papers, we referred to level T29 as T22 and to level T66 as T50. This is incorrect, because the number in a level name ("22", "29", "66", etc.) must reflect the number of phonemes and allophones considered in that level of MEXBET.

Experiment with PFS Algorithm

This section provides Python programs that implement the PFS and PFS-US algorithms.

It also provides a corpus of more than 100 thousand words with pre-transcriptions, along with a set of grouping files that compare six phonetic algorithms: Soundex, NYSIIS, Double Metaphone, Phonix, PFS and PFS-US.
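To give a rough idea of what grouping words by a phonetic algorithm looks like, here is a minimal Python sketch using classic American Soundex, the simplest of the six algorithms listed above. It is not one of the programs offered in this section and it does not implement PFS or PFS-US. The function names `soundex` and `group_by_code` are hypothetical, and the sketch simply assumes that words sharing a code belong to the same group; the actual format of the grouping files may differ.

```python
from collections import defaultdict

# Classic American Soundex digit classes (consonants only).
_CODES = {}
for _digit, _letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
    "4": "l", "5": "mn", "6": "r",
}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character American Soundex code of a word.

    Assumption of this sketch: non-ASCII letters (e.g. accented
    vowels) are simply dropped before encoding.
    """
    letters = [ch for ch in word.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev to ""
    return (code + "000")[:4]


def group_by_code(words, encoder=soundex):
    """Group words that receive the same phonetic code (hypothetical helper)."""
    groups = defaultdict(list)
    for w in words:
        groups[encoder(w)].append(w)
    return dict(groups)


if __name__ == "__main__":
    for code, members in group_by_code(["casa", "caza", "beso", "vaso"]).items():
        print(code, members)
```

Passing a different encoder function to `group_by_code` would let the same word list be grouped by another algorithm, which is one simple way to compare how coarsely each algorithm merges words.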

Click here

Resources

In this section we list several links to other sites related to speech recognition that you may find interesting.

If you think we have omitted a link that belongs here, or you have your own speech project and want to share it with our readers, please let us know!