Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings

Yolanda Blanco-Fernández; Alberto Gil-Solla; José J. Pazos-Arias; Diego Quisi-Peralta

doi:10.15388/23-INFOR527


Grupo de Investigación en Transformación Digital y Ciencia de Datos

Research output: Contribution to journal › Article › peer-review

Abstract

Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking the potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.

Original language	English
Pages (from-to)	491-527
Number of pages	37
Journal	Informatica (Netherlands)
Volume	34
Issue number	3
DOI	https://doi.org/10.15388/23-INFOR527
State	Published - 8 Sep 2023

Bibliographical note

Publisher Copyright:
© 2023 Vilnius University.


Cite this

@article{294002e6cafa4f4ebde8c848e0767c61,

title = "Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings",

abstract = "Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.",

keywords = "ad hoc corpus, Doc2Vec, embedding models, Named Entity Recognition",

author = "Yolanda Blanco-Fern{\'a}ndez and Alberto Gil-Solla and Pazos-Arias, {Jos{\'e} J.} and Diego Quisi-Peralta",

note = "Publisher Copyright: {\textcopyright} 2023 Vilnius University.",

year = "2023",

month = sep,

day = "8",

doi = "10.15388/23-INFOR527",

language = "English",

volume = "34",

pages = "491--527",

journal = "Informatica (Netherlands)",

issn = "0868-4952",

publisher = "IOS Press",

number = "3",

}

TY - JOUR

T1 - Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings

AU - Blanco-Fernández, Yolanda

AU - Gil-Solla, Alberto

AU - Pazos-Arias, José J.

AU - Quisi-Peralta, Diego

PY - 2023/9/8

Y1 - 2023/9/8

N2 - Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.

AB - Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.

KW - ad hoc corpus

KW - Doc2Vec

KW - embedding models

KW - Named Entity Recognition

UR - http://www.scopus.com/inward/record.url?scp=85174141436&partnerID=8YFLogxK

U2 - 10.15388/23-INFOR527

DO - 10.15388/23-INFOR527

M3 - Article

AN - SCOPUS:85174141436

SN - 0868-4952

VL - 34

SP - 491

EP - 527

JO - Informatica (Netherlands)

JF - Informatica (Netherlands)

IS - 3

ER -
