Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings

Yolanda Blanco-Fernández, Alberto Gil-Solla, José J. Pazos-Arias, Diego Quisi-Peralta

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking the potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.

Original language: English
Pages (from-to): 491-527
Number of pages: 37
Journal: Informatica (Netherlands)
Volume: 34
Issue number: 3
Status: Published - 8 Sep 2023

Bibliographical note

Publisher Copyright:
© 2023 Vilnius University.
