On the use of Phone-based Embeddings for Language Recognition

Christian Salamea, Ricardo De Córdoba, Luis Fernando D'Haro, Rubén San Segundo, Javier Ferreiros

Research output: Contribution to conference › Paper › peer-review

2 Scopus citations


Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by an automatic speech recognizer (ASR), but instead of phonemes we have used phonetic units that contain context information, the so-called "phone-gram sequences". In this context, we propose the use of Neural Embeddings (NEs) as features for these phone-gram sequences, which are used as inputs to a classical i-Vector framework to train a multiclass logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database, using Cavg as the metric to compare the systems. As a baseline, we take the Cavg obtained using the NEs as features in the LID task: 24.7%. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to 24.3% relative improvement over the baseline with the Skip-Gram model and up to 32.4% with the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-Vector system provides up to 34.1% improvement over the acoustic system alone.
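To make the idea of phone-gram units and context-based embeddings more concrete, here is a minimal sketch of how a phoneme sequence could be turned into phone-gram units and what training material each embedding model would see. The joining scheme (n consecutive phonemes glued with "_"), the helper names `phone_grams`, `skipgram_pairs` and `glove_cooccurrence`, and the window sizes are hypothetical illustrations, not the authors' exact configuration. Training the embeddings themselves, the i-Vector extraction and the logistic classifier are not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def phone_grams(phones: List[str], n: int = 2) -> List[str]:
    # Hypothetical scheme: each unit joins n consecutive phonemes with "_",
    # so every unit carries some left/right phonetic context.
    return ["_".join(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def skipgram_pairs(units: List[str], window: int = 2) -> List[Tuple[str, str]]:
    # (center, context) pairs a Skip-Gram model is trained on: each unit
    # predicts the units up to `window` positions to its left and right.
    pairs = []
    for i, center in enumerate(units):
        for j in range(max(0, i - window), min(len(units), i + window + 1)):
            if j != i:
                pairs.append((center, units[j]))
    return pairs


def glove_cooccurrence(units: List[str], window: int = 2) -> Dict[Tuple[str, str], float]:
    # Co-occurrence counts a GloVe model is fitted to; following the original
    # GloVe formulation, a pair d positions apart contributes 1/d.
    counts: Dict[Tuple[str, str], float] = defaultdict(float)
    for i, center in enumerate(units):
        for j in range(max(0, i - window), min(len(units), i + window + 1)):
            if j != i:
                counts[(center, units[j])] += 1.0 / abs(i - j)
    return dict(counts)


if __name__ == "__main__":
    phones = ["k", "a", "s", "a"]  # toy ASR phoneme output (hypothetical)
    units = phone_grams(phones, n=2)
    print(units)  # ['k_a', 'a_s', 's_a']
    print(skipgram_pairs(units, window=1))
```

On the toy sequence, a window of one yields four (center, context) pairs. Skip-Gram learns from these individual pairs, whereas GloVe fits vectors to the aggregated, distance-weighted counts; in both cases each phone-gram is exposed to its neighbours, which is how the resulting embeddings come to model longer context than the phone-gram alone.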

Original language: English
Number of pages: 5
State: Published - 2018
Event: 4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018 - Barcelona, Spain
Duration: 21 Nov 2018 - 23 Nov 2018



Bibliographical note

Funding Information:
The work leading to these results has been supported by the AMIC (MINECO, TIN2017-85854-C4-4-R) and CAVIAR (MINECO, TEC2017-84593-C2-1-R) projects. The authors also thank Mark Hallet for the English revision of this paper and all the other members of the Speech Technology Group for the continuous and fruitful discussions on these topics. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

Publisher Copyright:
© 4th International Conference, IberSPEECH 2018.

Keywords

  • language identification
  • neural embeddings
  • phonotactic

CACES Knowledge Areas

  • 116A Computer Science


