RigoBERTa 2.0
RigoBERTa 2.0 is a state-of-the-art encoder language model for Spanish, developed through language-adaptive pretraining. It significantly outperforms all previous Spanish encoder models, offering robust language understanding.
Model Details
Model Description
RigoBERTa 2.0 was built by further pretraining the general-purpose FacebookAI/xlm-roberta-large on a meticulously curated Spanish corpus. The pretraining leverages masked language modeling (MLM) to adapt the model’s linguistic knowledge to the Spanish language.
- Developed by: IIC
- Model type: Encoder
- Language(s) (NLP): Spanish
- License: rigoberta-nc (permissive, non-commercial)
- Finetuned from model: FacebookAI/xlm-roberta-large
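The masked language modeling (MLM) objective used for the language-adaptive pretraining can be sketched as follows. This is a generic BERT-style masking routine, not the exact training configuration; the masking probability, mask-token id, and vocabulary size are illustrative assumptions.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15, seed=0):
    """BERT-style MLM masking: each position is selected with probability
    mlm_prob; of the selected positions, 80% become the mask token,
    10% become a random token, and 10% are left unchanged."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)   # -100 = position ignored by the loss
    for i, tid in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tid          # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)
            # else: keep the original token as input
    return inputs, labels
```

During pretraining, the model receives `inputs` and is trained to recover the original tokens at the positions where `labels` is not -100.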
Intended Use & Limitations
Intended Use
RigoBERTa 2.0 is designed for:
- General text understanding in Spanish.
- Applications in NLP tasks such as text classification, named entity recognition, and related downstream tasks.
- Research and development purposes, including benchmarking and further model adaptation.
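For the downstream tasks listed above, an encoder produces one vector per token, and a small task head maps those vectors to labels. The toy sketch below shows the shape of a sequence-classification head (mean pooling plus a linear layer); all function names, dimensions, and weights are purely illustrative, not part of RigoBERTa 2.0.

```python
def mean_pool(token_vectors):
    """Average the per-token encoder outputs into one sequence vector."""
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / len(token_vectors)
            for d in range(dim)]

def classify(pooled, weights, bias):
    """Linear classification head: one score (w . x + b) per class,
    returning the index of the highest-scoring class."""
    scores = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    return scores.index(max(scores))
```

In practice the head is trained jointly with (or on top of) the encoder during fine-tuning; for token-level tasks such as named entity recognition, the linear layer is applied to each token vector instead of the pooled one.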
Note that the license is non-commercial. For commercial use, please contact us.
Limitations & Caveats
- Data Biases: While we used a highly curated dataset, it may contain biases due to source selection and the inherent limitations of public data.
- Operational Cost: Although encoder models are computationally cheaper than generative LLMs, deployment in resource-constrained settings should still be evaluated carefully.
Training Details
Training Procedure
Preprocessing
- Tokenizer: Uses the tokenizer from FacebookAI/xlm-roberta-large to ensure consistency with the base model.
- Handling Long Sequences: Sequences exceeding 512 tokens are segmented with a stride of 128 tokens; shorter sequences are padded as necessary.
- OOV Handling: Out-of-vocabulary words are managed through subword tokenization, maintaining robust handling of any kind of text.
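The long-sequence segmentation described above can be sketched as a sliding window: consecutive 512-token windows overlap by the 128-token stride so no context is lost at window boundaries. The function name is hypothetical; the window and stride sizes come from the text.

```python
def segment(token_ids, max_len=512, stride=128):
    """Split a token sequence into windows of at most max_len tokens,
    with consecutive windows overlapping by `stride` tokens."""
    if len(token_ids) <= max_len:
        return [token_ids]        # short sequences pass through unchanged
    step = max_len - stride       # advance 384 tokens per window
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break                 # the last window reaches the end
    return windows
```

In the Hugging Face tokenizers this behavior corresponds to passing `truncation=True`, `max_length=512`, `stride=128`, and `return_overflowing_tokens=True` when encoding.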
Evaluation
RigoBERTa 2.0 was evaluated on several Spanish NLP tasks. The evaluation metrics indicate that the model outperforms both previous multilingual models and general-purpose Spanish language models.
Key Results:
- Achieves top performance on most of the tested datasets.
Citation
If you use RigoBERTa 2.0 in your research, please cite the associated paper:
BibTeX:
@misc{rigoberta2,
  author    = {Instituto de Ingeniería del Conocimiento},
  title     = {RigoBERTa-2.0},
  year      = 2025,
  url       = {https://huggingface.co/IIC/RigoBERTa-2.0},
  doi       = {10.57967/hf/7048},
  publisher = {Hugging Face}
}