Abstract
Language Identification (LID) can be defined as the process of automatically identifying the language of a given spoken utterance. We have focused on a phonotactic approach in which the system input is the phoneme sequence generated by a speech recognizer (ASR); instead of phonemes, however, we have used phonetic units that contain context information, the so-called "phone-gram sequences". In this context, we propose the use of Neural Embeddings (NEs) as features for those phone-gram sequences, which are used as inputs to a classical i-Vector framework to train a multiclass logistic classifier. These NEs incorporate information from the neighbouring phone-grams in the sequence and implicitly model longer-context information. The NEs have been trained using both a Skip-Gram and a GloVe model. Experiments have been carried out on the KALAKA-3 database, and we have used Cavg as the metric to compare the systems. As a baseline, we take the Cavg obtained using the NEs as features in the LID task, 24.7%. Our strategy of incorporating information from the neighbouring phone-grams to define the final sequences yields up to a 24.3% relative improvement over the baseline with the Skip-Gram model and up to 32.4% with the GloVe model. Finally, the fusion of our best system with an MFCC-based acoustic i-Vector system provides up to a 34.1% improvement over the acoustic system alone.
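The relative improvements quoted in the abstract can be related to absolute Cavg values with a short calculation. The sketch below is illustrative only: it assumes the usual definition of relative improvement for an error-like metric, (baseline - system) / baseline, which the abstract does not spell out, and the function names `relative_improvement` and `implied_score` are hypothetical. The Cavg values it derives are implied by that assumption and are not figures reported in the abstract.

```python
def relative_improvement(baseline: float, system: float) -> float:
    """Relative reduction of an error-like metric (lower is better), in percent.

    Assumed definition: 100 * (baseline - system) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - system) / baseline


def implied_score(baseline: float, rel_improvement_pct: float) -> float:
    """Invert the definition above: the score implied by a relative improvement."""
    return baseline * (1.0 - rel_improvement_pct / 100.0)


if __name__ == "__main__":
    baseline_cavg = 24.7  # baseline Cavg (%) stated in the abstract
    # Relative improvements (%) stated in the abstract for each embedding model.
    for model, rel in (("Skip-Gram", 24.3), ("GloVe", 32.4)):
        cavg = implied_score(baseline_cavg, rel)
        print(f"{model}: implied Cavg of about {cavg:.2f}%")
```

The 34.1% figure for the fused system is measured against the acoustic i-Vector system alone, whose Cavg is not given in the abstract, so no absolute value can be derived for it in the same way.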
Original language | English |
---|---|
Pages | 55-59 |
Number of pages | 5 |
State | Published - 2018 |
Event | 4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018 - Barcelona, Spain. Duration: 21 Nov 2018 → 23 Nov 2018 |
Conference
Conference | 4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018 |
---|---|
Country/Territory | Spain |
City | Barcelona |
Period | 21/11/18 → 23/11/18 |
Bibliographical note
Funding Information: The work leading to these results has been supported by the AMIC (MINECO, TIN2017-85854-C4-4-R) and CAVIAR (MINECO, TEC2017-84593-C2-1-R) projects. The authors also thank Mark Hallet for the English revision of this paper and all the other members of the Speech Technology Group for the continuous and fruitful discussion on these topics. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.
Publisher Copyright:
© 4th International Conference, IberSPEECH 2018.
Keywords
- language identification
- neural embeddings
- phonotactic
CACES Knowledge Areas
- 116A Computer Science