
TripLegal-CL: A Multi-jurisdictional Spanish Legal Corpus for Contrastive Training of Dense Retrieval Models

Research output: Contribution to journal › Article

Abstract

Dense retrieval of legal cases in Spanish requires structured training data for bi-encoder models. However, most existing Spanish legal resources were designed for classification or entity extraction and do not provide training data tailored to dense retrieval. In this work, we present TripLegal-CL, a multi-jurisdictional corpus of 592,382 instances structured for contrastive learning, automatically generated with an LLM from 148,637 publicly available legal documents. To assess the usefulness of the resource, we fine-tune multilingual bi-encoder models on the generated data through contrastive learning and compare them with their baseline versions. The fine-tuned models achieve improvements of up to 18.2 percentage points in Acc@1 and 15.3 percentage points in MAP@100. These results confirm that the corpus provides effective training data for the contrastive fine-tuning of dense retrievers in the legal domain.
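The abstract reports gains in Acc@1 and MAP@100 but does not spell out how these metrics are computed. The following is a minimal sketch of one common formulation, not the authors' evaluation code. The function names, the toy document identifiers, and the choice to normalise average precision by min(number of relevant documents, k) are all assumptions made for illustration.

```python
from typing import Sequence, Set


def accuracy_at_1(rankings: Sequence[Sequence[str]],
                  relevant: Sequence[Set[str]]) -> float:
    """Fraction of queries whose top-ranked document is relevant."""
    hits = sum(1 for ranking, rel in zip(rankings, relevant)
               if ranking and ranking[0] in rel)
    return hits / len(rankings)


def average_precision_at_k(ranking: Sequence[str],
                           rel: Set[str],
                           k: int = 100) -> float:
    """Average precision over the top-k results of a single query.

    Assumption: the sum of precisions is normalised by min(len(rel), k).
    Other normalisations exist and give different values.
    """
    if not rel:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id in rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(rel), k)


def mean_average_precision(rankings: Sequence[Sequence[str]],
                           relevant: Sequence[Set[str]],
                           k: int = 100) -> float:
    """Mean of the per-query average precision at k."""
    scores = [average_precision_at_k(ranking, rel, k)
              for ranking, rel in zip(rankings, relevant)]
    return sum(scores) / len(scores)


# Hypothetical toy data: two queries with three ranked documents each.
rankings = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d1"}, {"d5"}]

acc1 = accuracy_at_1(rankings, relevant)
map100 = mean_average_precision(rankings, relevant, k=100)
```

In this toy case the first query has its relevant document at rank 1 and the second at rank 2, so Acc@1 is 0.5 and MAP@100 is 0.75. An improvement of 18.2 percentage points in Acc@1 would correspond to an absolute change such as 0.500 to 0.682 on this scale.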
Translated title of the contribution: TripLegal-CL: Un corpus legal español multijurisdiccional para el entrenamiento contrastivo de modelos de recuperación densa
Original language: English (US)
Pages (from-to): 1-12
Number of pages: 12
Journal: Procesamiento de Lenguaje Natural
Volume: 76
Issue number: 76
State: Published - 30 Mar 2026

Keywords

  • Contrastive learning
  • Bi-encoder models
  • Spanish legal corpus
  • Dense retrieval

CACES Knowledge Areas

  • 316A Software and Applications Development and Analysis
