Abstract
Training bi-encoder models for dense legal case retrieval in Spanish requires a dataset structured for that purpose. However, most existing Spanish legal resources were designed for classification or entity extraction and do not provide training data tailored to dense retrieval. In this work, we present TripLegal-CL, a multijurisdictional corpus of 592,382 instances structured for contrastive learning, automatically generated with an LLM from 148,637 publicly available legal documents. To assess the usefulness of the resource, we fine-tune multilingual bi-encoder models on the generated data through contrastive learning and compare them with their baseline versions. The fine-tuned models improve by up to 18.2 percentage points in Acc@1 and 15.3 percentage points in MAP@100. These results confirm that the corpus provides effective training data for the contrastive fine-tuning of dense retrievers in the legal domain.
| Translated title of the contribution | TripLegal-CL: Un corpus legal español multijurisdiccional para el entrenamiento contrastivo de modelos de recuperación densa |
|---|---|
| Original language | English (US) |
| Pages (from-to) | 1-12 |
| Number of pages | 12 |
| Journal | Procesamiento de Lenguaje Natural |
| Volume | 76 |
| Issue number | 76 |
| DOIs | |
| State | Published - 30 Mar 2026 |
Keywords
- Contrastive learning
- Bi-encoder models
- Spanish legal corpus
- Dense retrieval
CACES Knowledge Areas
- 316A Software and Applications Development and Analysis