Phonetic Algorithms

Phonetic algorithms identify two or more words that are spelled differently but pronounced the same or nearly the same.

The oldest phonetic algorithm is Soundex, introduced in 1918 as a way of encoding proper names that are pronounced the same but spelled differently (for example: Susy, Sussy, Sussie).

Soundex replaces graphemes of the incoming word with symbols that each stand for a "phonetic group" of sounds considered confusable (for example, the groups /p/ and /b/, /t/ and /d/, or /k/ and /g/). The incoming word is thereby transformed into a phonetic code, so two input words with identical pronunciation receive identical Soundex codes.
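The substitution described above can be sketched in a few lines of Python. This is a minimal sketch of the commonly described American Soundex variant (first letter kept, three digits, zero padding); the function name `soundex` and the table `CODES` are illustrative choices, and the variant used in the experiment on this page may differ in its details.

```python
# Letter groups of the commonly described American Soundex variant (assumption:
# this is not necessarily the exact variant used in the experiment).
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word: str) -> str:
    """Return a 4-character Soundex code, or "" if the word has no ASCII letters."""
    # Non-ASCII letters (for example "ñ") are simply dropped in this sketch.
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    code = first.upper()
    prev = CODES.get(first, "")  # the first letter's group also blocks repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters of the same group
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated group is coded again
    return (code + "000")[:4]  # pad with zeros, then truncate to 4 characters


for name in ("Susy", "Sussy", "Sussie"):
    print(name, soundex(name))
```

With this sketch, the three spellings from the example above all receive the code S200, so they fall into the same phonetic group.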

PFS Algorithm

The PFS algorithm is a phonetic algorithm based on Soundex that is used to detect phonetically similar words in large Spanish word lists. PFS stands for "Palabras Fonéticamente Similares" (phonetically similar words).

The original PFS algorithm treats all vowels as equal. The PFS-US variant takes the tonic (stressed) vowel of the incoming word into account and also leaves the last syllable untouched, which makes it sensitive to the gender and number of words.

You can download Python functions that implement the PFS and PFS-US algorithms from the following link.
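To see why leaving the end of a word untouched makes an encoding sensitive to gender and number, consider the toy illustration below. It is not the PFS or PFS-US algorithm: the function `collapse_vowels` and its `keep_tail` parameter are hypothetical, and keeping the last two characters is only a crude stand-in for keeping the last syllable.

```python
VOWELS = set("aeiouáéíóúü")


def collapse_vowels(word: str, keep_tail: int = 0) -> str:
    """Replace every vowel with "V", except in the last keep_tail characters.

    Hypothetical toy function: it only illustrates the effect of treating
    all vowels as equal versus leaving the end of the word untouched.
    """
    word = word.lower()
    cut = max(len(word) - keep_tail, 0)
    head, tail = word[:cut], word[cut:]
    return "".join("V" if ch in VOWELS else ch for ch in head) + tail


# All vowels equal: masculine and feminine forms collide.
print(collapse_vowels("gato"), collapse_vowels("gata"))

# Ending left untouched: gender and number stay distinguishable.
for w in ("gato", "gata", "gatos"):
    print(w, collapse_vowels(w, keep_tail=2))
```

When every vowel is collapsed, "gato" and "gata" map to the same key. When the ending is preserved, "gato", "gata" and "gatos" map to three different keys.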

PFS Algorithm

PFS Experiment

We ran an experiment to compare different phonetic algorithms: Soundex, NYSIIS, Double Metaphone, Phonix, PFS and PFS-US.

Each algorithm was applied to the same list of Spanish words and produced a grouping file. A grouping file contains groups of words, each with between 1 and n elements. You can download the grouping files generated in this experiment at this link.
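The grouping step can be pictured as bucketing words by their phonetic code. The sketch below assumes that a group is the set of words sharing one code. The names `group_by_code` and `toy_encode` are hypothetical, and `toy_encode` is a deliberately simple stand-in for a real phonetic algorithm, not one of the algorithms compared in the experiment.

```python
from collections import defaultdict


def group_by_code(words, encode):
    """Group words that share the same code; groups may hold 1 to n words."""
    groups = defaultdict(list)
    for word in words:
        groups[encode(word)].append(word)
    return list(groups.values())


def toy_encode(word: str) -> str:
    # Hypothetical stand-in encoder: merges b/v and drops the silent h.
    return word.lower().replace("v", "b").replace("h", "")


words = ["vaca", "baca", "hola", "ola", "casa"]
for group in group_by_code(words, toy_encode):
    print(" ".join(group))
```

Passing the encoder as an argument means the same grouping function could be reused with any of the algorithms, which keeps the comparison between them uniform.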

PFS Experiment

The Corpus

The corpus for this experiment is a list of more than 100 thousand Spanish words with pre-transcription, extracted from four different sources: the CREA corpus, the DEM dictionary, the Spanish version of Wikipedia and the MOBY Project.

Pre-transcription is the step we use to normalize the incoming words so that our phonetic tools produce more accurate phonetic transcriptions.

You can download the list of words used in this experiment from this link.

Spanish Words with Pre-Transcription

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.