On the use of Phone-based Embeddings for Language Recognition

Christian Salamea; Ricardo De Córdoba; Luis Fernando D'Haro; Rubén San Segundo; Javier Ferreiros

doi:10.21437/IberSPEECH.2018-12

Grupo de Investigación en Interacción, Robótica y Automática (GIIRA)

Research output: Contribution to conference › Paper › peer-review

2 Citations (Scopus)

Abstract

Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR), but instead of phonemes, we have used phonetic units that contain context information, the so-called "phone-gram sequences". In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as inputs to a classical i-Vector framework to train a multiclass logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database, and we have used Cavg as the metric to compare the systems. We propose as baseline the Cavg obtained using the NEs as features in the LID task, 24.7%. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to 24.3% relative improvement over the baseline using the Skip-Gram model and up to 32.4% using the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-Vector system provides up to 34.1% improvement over the acoustic system alone.
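
As a rough illustration of two ideas in the abstract, the sketch below builds contextual units from a phone sequence and works out what the reported relative improvements would imply for Cavg. It rests on assumptions that the abstract does not confirm: that a phone-gram is a run of n consecutive phones, and that "relative improvement" means a relative reduction of Cavg with respect to the 24.7% baseline. The function names and the example phone sequence are hypothetical.

```python
def phone_grams(phones, n=2):
    """Join each run of n consecutive phones into one contextual unit.

    Assumption: a phone-gram is a plain n-gram over the recognizer output.
    """
    return ["_".join(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def cavg_after_relative_improvement(baseline_cavg, relative_improvement):
    """Cavg (in %) implied by a relative reduction (in %) over a baseline.

    Assumption: improvement = (baseline - new) / baseline.
    """
    return baseline_cavg * (1.0 - relative_improvement / 100.0)


if __name__ == "__main__":
    # Hypothetical phone sequence, not taken from KALAKA-3.
    print(phone_grams(["k", "a", "s", "a"], n=2))

    # Baseline Cavg and relative improvements as stated in the abstract.
    for label, gain in [("Skip-Gram", 24.3), ("GloVe", 32.4)]:
        implied = cavg_after_relative_improvement(24.7, gain)
        print(f"{label}: implied Cavg of about {implied:.2f}%")
```

Under these assumptions the implied Cavg values are roughly 18.7% for Skip-Gram and 16.7% for GloVe; the paper itself should be consulted for the exact figures and definitions.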

Original language	English
Pages	55-59
Number of pages	5
DOI	https://doi.org/10.21437/IberSPEECH.2018-12
Status	Published - 2018
Event	4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018 - Barcelona, Spain Duration: 21 Nov 2018 → 23 Nov 2018

Conference

Conference	4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018
Country/Territory	Spain
City	Barcelona
Period	21/11/18 → 23/11/18

Bibliographical note

Publisher Copyright:
© 4th International Conference, IberSPEECH 2018.

CACES Knowledge Areas

116A Computing

Cite this

@conference{800ed40ef5424b4993bb856e05af0fdd,

title = "On the use of Phone-based Embeddings for Language Recognition",

abstract = "Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR), but instead of phonemes, we have used phonetic units that contain context information, the so-called {"}phone-gram sequences{"}. In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as inputs to a classical i-Vector framework to train a multiclass logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database, and we have used Cavg as the metric to compare the systems. We propose as baseline the Cavg obtained using the NEs as features in the LID task, 24.7%. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to 24.3% relative improvement over the baseline using the Skip-Gram model and up to 32.4% using the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-Vector system provides up to 34.1% improvement over the acoustic system alone.",

keywords = "language identification, neural embeddings, phonotactic",

author = "Christian Salamea and {De C{\'o}rdoba}, Ricardo and D'Haro, {Luis Fernando} and {San Segundo}, Rub{\'e}n and Javier Ferreiros",

note = "Publisher Copyright: {\textcopyright} 4th International Conference, IberSPEECH 2018.; 4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018 ; Conference date: 21-11-2018 Through 23-11-2018",

year = "2018",

doi = "10.21437/IberSPEECH.2018-12",

language = "English",

pages = "55--59",

}

On the use of Phone-based Embeddings for Language Recognition. / Salamea, Christian; De Córdoba, Ricardo; D'Haro, Luis Fernando et al.
2018. 55-59 Paper presented at 4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018, Barcelona, Spain.

TY - CONF

T1 - On the use of Phone-based Embeddings for Language Recognition

AU - Salamea, Christian

AU - De Córdoba, Ricardo

AU - D'Haro, Luis Fernando

AU - San Segundo, Rubén

AU - Ferreiros, Javier

PY - 2018

Y1 - 2018

N2 - Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR), but instead of phonemes, we have used phonetic units that contain context information, the so-called "phone-gram sequences". In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as inputs to a classical i-Vector framework to train a multiclass logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database, and we have used Cavg as the metric to compare the systems. We propose as baseline the Cavg obtained using the NEs as features in the LID task, 24.7%. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to 24.3% relative improvement over the baseline using the Skip-Gram model and up to 32.4% using the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-Vector system provides up to 34.1% improvement over the acoustic system alone.

KW - language identification

KW - neural embeddings

KW - phonotactic

UR - http://www.scopus.com/inward/record.url?scp=85096417496&partnerID=8YFLogxK

U2 - 10.21437/IberSPEECH.2018-12

DO - 10.21437/IberSPEECH.2018-12

M3 - Paper

AN - SCOPUS:85096417496

SP - 55

EP - 59

T2 - 4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018

Y2 - 21 November 2018 through 23 November 2018

ER -
