Enabling the Latent Semantic Analysis of Large-Scale Information Retrieval Datasets by Means of Out-of-Core Heterogeneous Systems

Gabriel A. León-Paredes, Liliana I. Barbosa-Santillán, Antonio Pareja-Lora

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Latent Semantic Analysis (LSA) has been widely and successfully applied in many Natural Language Processing (NLP) applications, usually on small or medium-sized datasets and without actual time constraints. Even so, LSA is costly in both time and space, which complicates its integration into real-time NLP applications (for example, information retrieval or question answering) on large-scale datasets. An implementation of LSA that both enables and accelerates its execution on large-scale datasets would therefore be most useful in these data-intensive, real-time NLP scenarios. However, to the best of our knowledge, no such implementation of LSA has been achieved so far. To this end, a new out-of-core, scalable, heterogeneous LSA (hLSA) system has been built and run on the large-scale clinical decision support dataset from the Text REtrieval Conference (TREC) 2015 competition. Results show that the out-of-core hLSA system can process this large-scale dataset (631,302 documents), with a full-rank term-document matrix of 566 GB, fairly fast and, moreover, with better precision (at least for one of the topics) than the competing TREC 2015 systems.

Original language: English
Title of host publication: Smart Technologies, Systems and Applications - 1st International Conference, SmartTech-IC 2019, Proceedings
Editors: Fabián R. Narváez, Diego F. Vallejo, Paulina A. Morillo, Julio R. Proaño
Publisher: Springer
Pages: 105-119
Number of pages: 15
ISBN (Print): 9783030467845
State: Published - 1 Jan 2020
Event: 1st International Conference on Smart Technologies, Systems and Applications, SmartTech-IC 2019 - Quito, Ecuador
Duration: 2 Dec 2019 to 4 Dec 2019

Publication series

Name: Communications in Computer and Information Science
Volume: 1154 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 1st International Conference on Smart Technologies, Systems and Applications, SmartTech-IC 2019
Country/Territory: Ecuador
City: Quito
Period: 2/12/19 to 4/12/19

Bibliographical note

Funding Information:
Acknowledgements. This work has been supported by the Universidad Politécnica Salesiana (UPS) through its research group of Cloud Computing, Smart Cities & High Performance Computing (GIHP4C). It has also been supported by the Sciences Research Council (CONACyT) through the research project no. 262756, as well as (partially) by the projects RedR+Human (Dynamically Reconfigurable Educational Repositories in the Humanities, ref. TIN2014-52010-R) and CetrO+Spec (Creation, Exploration and Transformation of Educational Object Repositories in Specialized Domains, ref. TIN2017-88092-R), both financed by the Spanish Ministry of Economy and Competitiveness.

Publisher Copyright:
© Springer Nature Switzerland AG 2020.

Keywords

  • Distributed system
  • GPU
  • Heterogeneous system
  • Information retrieval
  • Latent Semantic Analysis
  • Multi-CPU
  • Parallel computing
  • Question answering
