Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings

Yolanda Blanco-Fernández; Alberto Gil-Solla; José J. Pazos-Arias; Diego Quisi-Peralta

doi:10.15388/23-INFOR527

Research Group in Digital Transformation and Data Science

Research output: Contribution to journal › Article › peer-review

Abstract

Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking the potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.

Original language	English
Pages (from-to)	491-527
Number of pages	37
Journal	Informatica (Netherlands)
Volume	34
Issue number	3
DOIs	https://doi.org/10.15388/23-INFOR527
State	Published - 8 Sep 2023

Bibliographical note

Publisher Copyright:
© 2023 Vilnius University.

Keywords

ad hoc corpus
Doc2Vec
embedding models
Named Entity Recognition

Access to Document

10.15388/23-INFOR527

Cite this

@article{294002e6cafa4f4ebde8c848e0767c61,
title = "Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings",
abstract = "Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.",
keywords = "ad hoc corpus, Doc2Vec, embedding models, Named Entity Recognition",
author = "Yolanda Blanco-Fern{\'a}ndez and Alberto Gil-Solla and Pazos-Arias, {Jos{\'e} J.} and Diego Quisi-Peralta",
note = "Publisher Copyright: {\textcopyright} 2023 Vilnius University.",
year = "2023",
month = sep,
day = "8",
doi = "10.15388/23-INFOR527",
language = "English",
volume = "34",
pages = "491--527",
journal = "Informatica (Netherlands)",
issn = "0868-4952",
publisher = "IOS Press",
number = "3",
}

TY  - JOUR
T1  - Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings
AU  - Blanco-Fernández, Yolanda
AU  - Gil-Solla, Alberto
AU  - Pazos-Arias, José J.
AU  - Quisi-Peralta, Diego
PY  - 2023/9/8
Y1  - 2023/9/8
N2  - Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.
AB  - Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.
KW  - ad hoc corpus
KW  - Doc2Vec
KW  - embedding models
KW  - Named Entity Recognition
UR  - http://www.scopus.com/inward/record.url?scp=85174141436&partnerID=8YFLogxK
U2  - 10.15388/23-INFOR527
DO  - 10.15388/23-INFOR527
M3  - Article
AN  - SCOPUS:85174141436
SN  - 0868-4952
VL  - 34
SP  - 491
EP  - 527
JO  - Informatica (Netherlands)
JF  - Informatica (Netherlands)
IS  - 3
ER  -
