Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings

Yolanda Blanco-Fernández, Alberto Gil-Solla, José J. Pazos-Arias, Diego Quisi-Peralta

Research output: Contribution to journal › Article › peer-review

Abstract

Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking the potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec to assemble an in-domain corpus automatically. Our experiments compare models trained on general and in-domain corpora, highlighting that domain-specific training attains the best outcome.
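
The abstract does not spell out the pipeline, so the following is only a minimal sketch of one plausible reading: entities found in a few seed documents are used to pick candidate texts, and a document-similarity score then filters out candidates that mention an entity but are off-topic. To keep the example dependency-free, a gazetteer lookup stands in for Named Entity Recognition and a bag-of-words cosine stands in for Doc2Vec similarity. All function names, the threshold, and the sample texts are hypothetical and are not taken from the article.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def find_entities(text, gazetteer):
    """Toy stand-in for NER: gazetteer names that occur in the text."""
    lowered = text.lower()
    return {name for name in gazetteer if name.lower() in lowered}


def vectorize(text):
    """Toy stand-in for a Doc2Vec vector: bag-of-words counts."""
    return Counter(tokens(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assemble_corpus(seed_docs, candidates, gazetteer, threshold=0.5):
    """Keep candidates that share a seed entity AND resemble the seeds."""
    seed_entities = set()
    for doc in seed_docs:
        seed_entities |= find_entities(doc, gazetteer)
    seed_vector = vectorize(" ".join(seed_docs))

    selected = []
    for doc in candidates:
        # Step 1: entity filter (hypothetical use of the NER output).
        if not (find_entities(doc, gazetteer) & seed_entities):
            continue
        # Step 2: similarity filter (hypothetical use of document vectors).
        if cosine(seed_vector, vectorize(doc)) >= threshold:
            selected.append(doc)
    return selected


# Hypothetical sample data, for illustration only.
SEED = ["The Alhambra is a palace and fortress in Granada built by the Nasrid dynasty."]
GAZETTEER = {"Alhambra", "Granada", "Nasrid"}
CANDIDATES = [
    "The Nasrid dynasty ruled Granada and built the palace known as the Alhambra.",
    "Granada is also a common name for football clubs and car models.",
    "Stock markets closed higher after the central bank announcement.",
]

corpus = assemble_corpus(SEED, CANDIDATES, GAZETTEER)
print(corpus)
```

In this toy run the second candidate mentions a seed entity but falls below the similarity threshold, and the third mentions no entity at all, so only the first is kept. A real setup would presumably swap in a trained NER model and learned document vectors; the two-stage filtering idea is what the sketch is meant to convey.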

Original language: English
Pages (from-to): 491-527
Number of pages: 37
Journal: Informatica (Netherlands)
Volume: 34
Issue number: 3
State: Published - 8 Sep 2023

Bibliographical note

Publisher Copyright:
© 2023 Vilnius University.

Keywords

  • ad hoc corpus
  • Doc2Vec
  • embedding models
  • Named Entity Recognition
