TY - JOUR
T1 - Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings
AU - Blanco-Fernández, Yolanda
AU - Gil-Solla, Alberto
AU - Pazos-Arias, José J.
AU - Quisi-Peralta, Diego
N1 - Publisher Copyright: © 2023 Vilnius University.
PY - 2023/9/8
Y1 - 2023/9/8
N2 - Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking the potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.
AB - Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking the potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.
KW - ad hoc corpus
KW - Doc2Vec
KW - embedding models
KW - Named Entity Recognition
UR - http://www.scopus.com/inward/record.url?scp=85174141436&partnerID=8YFLogxK
U2 - 10.15388/23-INFOR527
DO - 10.15388/23-INFOR527
M3 - Article
AN - SCOPUS:85174141436
SN - 0868-4952
VL - 34
SP - 491
EP - 527
JO - Informatica (Netherlands)
JF - Informatica (Netherlands)
IS - 3
ER -