On the use of Phone-based Embeddings for Language Recognition

Christian Salamea; Ricardo De Córdoba; Luis Fernando D'Haro; Rubén San Segundo; Javier Ferreiros

doi:10.21437/IberSPEECH.2018-12

Research Group on Interaction, Robotics and Automatics (GIIRA)

Research output: Contribution to conference › Paper › peer-review

2 Scopus citations

Abstract

Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR); instead of phonemes, however, we have used phonetic units that contain context information, the so-called "phone-gram sequences". In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as inputs to a classical i-Vector framework to train a multiclass logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database, and we have used Cavg as the metric to compare the systems. As a baseline, we take the Cavg obtained using the NEs as features in the LID task, 24.7%. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to a 24.3% relative improvement over the baseline using the Skip-Gram model and up to 32.4% using the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-Vector system provides up to a 34.1% improvement over the acoustic system alone.
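The relative improvements quoted in the abstract can be turned into absolute Cavg values with a short calculation. The sketch below is illustrative only: it assumes the usual definition of relative improvement, (baseline - new) / baseline, and the function name is hypothetical. Because the abstract reports "up to" figures, the results are best-case values implied by the abstract, not numbers reported by the authors.

```python
def cavg_after_relative_improvement(baseline_cavg: float, relative_improvement: float) -> float:
    """Cavg (in percent) implied by a relative improvement (in percent) over a baseline.

    Assumes relative improvement = (baseline - new) / baseline.
    """
    return baseline_cavg * (1.0 - relative_improvement / 100.0)


baseline = 24.7  # baseline Cavg from the abstract, in percent

# Best-case Cavg implied by the "up to" relative improvements in the abstract.
skipgram = cavg_after_relative_improvement(baseline, 24.3)
glove = cavg_after_relative_improvement(baseline, 32.4)

print(f"Skip-Gram: {skipgram:.2f}%")  # about 18.70%
print(f"GloVe:     {glove:.2f}%")     # about 16.70%
```

Under this assumption, the Skip-Gram and GloVe systems would reach a Cavg of roughly 18.7% and 16.7% respectively in the best case. The 34.1% fusion figure cannot be converted the same way, since the abstract does not give the Cavg of the acoustic system alone.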

Original language	English
Pages	55-59
Number of pages	5
DOIs	https://doi.org/10.21437/IberSPEECH.2018-12
State	Published - 2018
Event	4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018 - Barcelona, Spain. Duration: 21 Nov 2018 → 23 Nov 2018

Conference

Conference	4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018
Country/Territory	Spain
City	Barcelona
Period	21/11/18 → 23/11/18

Bibliographical note

Funding Information:
The work leading to these results has been supported by the AMIC (MINECO, TIN2017-85854-C4-4-R) and CAVIAR (MINECO, TEC2017-84593-C2-1-R) projects. The authors also thank Mark Hallet for the English revision of this paper and all the other members of the Speech Technology Group for the continuous and fruitful discussion on these topics. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

Publisher Copyright:
© 4th International Conference, IberSPEECH 2018.

Keywords

language identification
neural embeddings
phonotactic

CACES Knowledge Areas

116A Computer Science

Access to Document

10.21437/IberSPEECH.2018-12

Cite this

@conference{800ed40ef5424b4993bb856e05af0fdd,

title = "On the use of Phone-based Embeddings for Language Recognition",

abstract = "Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR); instead of phonemes, however, we have used phonetic units that contain context information, the so-called {"}phone-gram sequences{"}. In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as inputs to a classical i-Vector framework to train a multiclass logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database, and we have used Cavg as the metric to compare the systems. As a baseline, we take the Cavg obtained using the NEs as features in the LID task, 24.7%. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to a 24.3% relative improvement over the baseline using the Skip-Gram model and up to 32.4% using the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-Vector system provides up to a 34.1% improvement over the acoustic system alone.",

keywords = "language identification, neural embeddings, phonotactic",

author = "Christian Salamea and {De C{\'o}rdoba}, Ricardo and D'Haro, {Luis Fernando} and {San Segundo}, Rub{\'e}n and Javier Ferreiros",

note = "Publisher Copyright: {\textcopyright} 4th International Conference, IberSPEECH 2018.; 4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018 ; Conference date: 21-11-2018 Through 23-11-2018",

year = "2018",

doi = "10.21437/IberSPEECH.2018-12",

language = "English",

pages = "55--59",

}

On the use of Phone-based Embeddings for Language Recognition. / Salamea, Christian; De Córdoba, Ricardo; D'Haro, Luis Fernando et al.
2018. 55-59 Paper presented at 4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018, Barcelona, Spain.

TY - CONF

T1 - On the use of Phone-based Embeddings for Language Recognition

AU - Salamea, Christian

AU - De Córdoba, Ricardo

AU - D'Haro, Luis Fernando

AU - San Segundo, Rubén

AU - Ferreiros, Javier

PY - 2018

Y1 - 2018

N2 - Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR); instead of phonemes, however, we have used phonetic units that contain context information, the so-called "phone-gram sequences". In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as inputs to a classical i-Vector framework to train a multiclass logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database, and we have used Cavg as the metric to compare the systems. As a baseline, we take the Cavg obtained using the NEs as features in the LID task, 24.7%. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to a 24.3% relative improvement over the baseline using the Skip-Gram model and up to 32.4% using the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-Vector system provides up to a 34.1% improvement over the acoustic system alone.

KW - language identification

KW - neural embeddings

KW - phonotactic

UR - http://www.scopus.com/inward/record.url?scp=85096417496&partnerID=8YFLogxK

U2 - 10.21437/IberSPEECH.2018-12

DO - 10.21437/IberSPEECH.2018-12

M3 - Paper

AN - SCOPUS:85096417496

SP - 55

EP - 59

T2 - 4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018

Y2 - 21 November 2018 through 23 November 2018

ER -
