Online Resources
Corpora for ASR in the 5 Most Spoken Languages in the World
According to the "Anuario 2013" of the "Instituto Cervantes"1 and the "Atlas de la lengua española en el mundo"2, the five most spoken languages in the world are Mandarin Chinese, English, Spanish, Hindi, and Arabic.
This section therefore presents comparison tables of several corpora in these five languages, extracted from the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA).
LDC Tables | ELRA Tables |
---|---|
Mandarin-Chinese | Mandarin-Chinese |
English | English |
Spanish | Spanish |
Hindi | Hindi |
Arabic | Arabic |
Note 1. The Instituto Cervantes (http://www.cervantes.es/) is a public organization founded in Spain on March 21, 1991 by the Spanish government and sponsored by the King of Spain. It is attached to the "Ministerio de Asuntos Exteriores", and its main goal is to promote the teaching of the Spanish language and the culture of Spain and Hispanic America all over the world.
Evolution of the MEXBET Alphabet
This section presents a set of tables that trace the evolution of the MEXBET alphabet over time. The tables are:
- Table 1. Allophones of Mexican Spanish in IPA Alphabet
- Table 2. Phonological System of Mexican Spanish in IPA Alphabet
- Table 3. Allophones of Mexican Spanish in MEXBET Alphabet
- Table 4. Phonological System of Mexican Spanish in MEXBET Alphabet
- Table 5. Equivalences between IPA and MEXBET Alphabets
- Table 6. Transcription Levels of the MEXBET Alphabet utilized at the DIMEx100 Corpus
- Table 7. Level T66 of MEXBET utilized at the CIEMPIESS Corpus
- Table 8. Version of the T29 Level of MEXBET utilized at the CIEMPIESS Corpus
You can access them by clicking on the following link:
Note: In our previous papers, we referred to level T29 as T22 and to level T66 as T50. This is incorrect, because the number in the level name ("22", "29", "66", etc.) must reflect the number of phonemes and allophones considered in that level of MEXBET.
Version of the MEXBET Alphabet utilized at the CIEMPIESS Corpus
This section presents the two transcription levels used for the pronouncing dictionaries of the CIEMPIESS Corpus. These levels are:
- Table 1. Level T66 of MEXBET utilized at the CIEMPIESS Corpus
- Table 2. Version of the T29 Level of MEXBET utilized at the CIEMPIESS Corpus
You can access these tables by clicking on the following link:
Note: In our previous papers, we referred to level T29 as T22 and to level T66 as T50. This is incorrect, because the number in the level name ("22", "29", "66", etc.) must reflect the number of phonemes and allophones considered in that level of MEXBET.
Experiment with PFS Algorithm
This section provides Python programs that implement the PFS and PFS-US algorithms.
It also provides a corpus of more than 100 thousand words with pre-transcription, and a set of grouping files that compare six phonetic algorithms: Soundex, NYSIIS, Double Metaphone, Phonix, PFS and PFS-US.
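To illustrate what grouping words with a phonetic algorithm can look like, here is a minimal sketch based on the classic American Soundex, one of the six algorithms listed above. It is not the PFS or PFS-US implementation and does not reproduce the format of the grouping files. The function names (`strip_accents`, `soundex`, `group_by_code`) and the sample words are assumptions made only for this example.

```python
import unicodedata
from collections import defaultdict

# Classic American Soundex digit classes (illustrative; not PFS or PFS-US).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def strip_accents(text):
    """Remove diacritics so that 'á' -> 'a' and 'ñ' -> 'n' (an assumption of this sketch)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in strip_accents(word).lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # 'h' and 'w' do not separate letters with the same digit
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels reset the previous digit to ""
    return (code + "000")[:4]


def group_by_code(words, encoder=soundex):
    """Group words that share the same phonetic code."""
    groups = defaultdict(list)
    for w in words:
        groups[encoder(w)].append(w)
    return dict(groups)


# Hypothetical sample words, chosen only for the demonstration.
words = ["casa", "caza", "vaca", "baca"]
groups = group_by_code(words)
for code, members in sorted(groups.items()):
    print(code, members)
```

In this sketch "casa" and "caza" fall into the same group, while "vaca" and "baca" do not, because Soundex always keeps the first letter of the word. Passing a different `encoder` to `group_by_code` lets the same word list be grouped with another algorithm for a side-by-side comparison.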