
RigoBERTa 2.0

RigoBERTa 2.0 is a state-of-the-art encoder language model for Spanish, developed through language-adaptive pretraining. It significantly outperforms all previous Spanish encoder models, offering robust language understanding.

Model Details

Model Description

RigoBERTa 2.0 was built by further pretraining the general-purpose FacebookAI/xlm-roberta-large on a meticulously curated Spanish corpus. The pretraining leverages masked language modeling (MLM) to adapt the model’s linguistic knowledge to the Spanish language.

  • Developed by: IIC
  • Model type: Encoder
  • Model size: ~0.6B parameters
  • Language(s) (NLP): Spanish
  • License: rigoberta-nc (permissive, non-commercial)
  • Finetuned from model: FacebookAI/xlm-roberta-large
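
Because the model was adapted with masked language modeling, the most direct way to probe it is mask filling. A minimal sketch with the Hugging Face pipeline API, assuming the model id IIC/RigoBERTa-2.0 (taken from the citation URL below) and access to the gated repository:

from transformers import pipeline

# Masked-token prediction; the mask token is "<mask>", inherited from the
# FacebookAI/xlm-roberta-large tokenizer.
fill_mask = pipeline("fill-mask", model="IIC/RigoBERTa-2.0")

predictions = fill_mask("Madrid es la <mask> de España.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))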

Intended Use & Limitations

Intended Use

RigoBERTa 2.0 is designed for:

  • General text understanding in Spanish.
  • Applications in NLP tasks such as text classification, named entity recognition, and related downstream tasks (see the sketch after this list).
  • Research and development purposes, including benchmarking and further model adaptation.
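
For downstream tasks, the encoder can be loaded with a task-specific head. A minimal fine-tuning sketch for text classification, assuming the model id IIC/RigoBERTa-2.0 and an illustrative two-label setup (neither is specified by the card):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical setup: the number of labels is an assumption for illustration.
tokenizer = AutoTokenizer.from_pretrained("IIC/RigoBERTa-2.0")
model = AutoModelForSequenceClassification.from_pretrained(
    "IIC/RigoBERTa-2.0",
    num_labels=2,  # e.g. a binary sentiment task
)

inputs = tokenizer("Una película sorprendentemente buena.", return_tensors="pt")
logits = model(**inputs).logits  # meaningful only after fine-tuning the head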

Note that the license is non-commercial. For commercial use, please contact us.

Limitations & Caveats

  • Data Biases: While we used a highly curated dataset, it may contain biases due to source selection and the inherent limitations of public data.
  • Operational Cost: Although the model is encoder-based and computationally cheaper than generative LLMs, deployment in resource-constrained settings should still be evaluated carefully.

Training Details

Training Procedure

Preprocessing

  • Tokenizer: Uses the tokenizer from FacebookAI/xlm-roberta-large to ensure consistency with the base model.
  • Handling Long Sequences: Sequences exceeding 512 tokens are segmented with a stride of 128 tokens; shorter sequences are padded as necessary (see the sketch after this list).
  • OOV Handling: Out-of-vocabulary words are managed through subword tokenization, ensuring robust handling of any kind of text.
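
A minimal sketch of that chunking behaviour with the Hugging Face tokenizer API; the window size and stride come from the description above, and the sample text is illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")

long_text = " ".join(["El modelo procesa textos largos por ventanas."] * 200)

encoded = tokenizer(
    long_text,
    max_length=512,
    stride=128,                      # consecutive windows overlap by 128 tokens
    truncation=True,
    return_overflowing_tokens=True,  # keep every window, not just the first
    padding="max_length",            # pad the last (shorter) window to 512
)
print(len(encoded["input_ids"]))     # number of 512-token windows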

Evaluation

RigoBERTa 2.0 was evaluated on several Spanish NLP tasks. The results indicate that the model outperforms previous multilingual models as well as general-purpose Spanish models.

Key Results:

  • Achieves top performance on most of the tested datasets.

Breakdown of the results:

  • Clinical benchmark (García Subies et al.)
  • Bench2
  • Bench3

Citation

If you use RigoBERTa 2.0 in your research, please cite the associated paper:

BibTeX:

@misc{rigoberta2,
    author       = { Instituto de Ingeniería del Conocimiento },
    title        = { RigoBERTa-2.0 },
    year         = 2025,
    url          = { https://huggingface.co/IIC/RigoBERTa-2.0 },
    doi          = { 10.57967/hf/7048 },
    publisher    = { Hugging Face }
}