Abstract
Latent Semantic Analysis (LSA) has been widely and successfully applied in Natural Language Processing (NLP), usually on small or medium-sized datasets and without real time constraints. Even so, LSA is costly in both time and space, which complicates its integration into real-time NLP applications (for example, information retrieval or question answering) on large-scale datasets. An implementation of LSA that both enables and accelerates its execution on large-scale datasets would therefore be most useful in these data-intensive, real-time NLP scenarios. However, to the best of our knowledge, no such implementation of LSA has been achieved so far. To this end, a new out-of-core, scalable, heterogeneous LSA (hLSA) system has been built and run on the large-scale clinical decision support dataset from the Text REtrieval Conference (TREC) 2015 competition. Results show that the out-of-core hLSA system can process this large-scale dataset (631,302 documents), with a full-rank term-document matrix of 566 GB, fairly fast and, moreover, with better precision (for at least one of the topics) than the TREC 2015 competing systems.
Original language | English |
---|---|
Title of host publication | Smart Technologies, Systems and Applications - 1st International Conference, SmartTech-IC 2019, Proceedings |
Editors | Fabián R. Narváez, Diego F. Vallejo, Paulina A. Morillo, Julio R. Proaño |
Publisher | Springer |
Pages | 105-119 |
Number of pages | 15 |
ISBN (Print) | 9783030467845 |
State | Published - 1 Jan 2020 |
Event | 1st International Conference on Smart Technologies, Systems and Applications, SmartTech-IC 2019 - Quito, Ecuador. Duration: 2 Dec 2019 → 4 Dec 2019 |
Publication series
Name | Communications in Computer and Information Science |
---|---|
Volume | 1154 CCIS |
ISSN (Print) | 1865-0929 |
ISSN (Electronic) | 1865-0937 |
Conference
Conference | 1st International Conference on Smart Technologies, Systems and Applications, SmartTech-IC 2019 |
---|---|
Country/Territory | Ecuador |
City | Quito |
Period | 2/12/19 → 4/12/19 |
Bibliographical note
Funding Information: This work has been supported by the Universidad Politécnica Salesiana (UPS) through its research group of Cloud Computing, Smart Cities & High Performance Computing (GIHP4C). It has also been supported by the Sciences Research Council (CONACyT) through the research project no. 262756, as well as (partially) by the projects RedR+Human (Dynamically Reconfigurable Educational Repositories in the Humanities, ref. TIN2014-52010-R) and CetrO+Spec (Creation, Exploration and Transformation of Educational Object Repositories in Specialized Domains, ref. TIN2017-88092-R), both financed by the Spanish Ministry of Economy and Competitiveness.
Publisher Copyright:
© Springer Nature Switzerland AG 2020.
Keywords
- Distributed system
- GPU
- Heterogeneous system
- Information retrieval
- Latent Semantic Analysis
- Multi-CPU
- Parallel computing
- Question answering