Model Card for mp_20_base

Model Details

Model Description

mp_20_base is an unconditional generative model for producing valid inorganic crystal structures. It serves as a foundational pre-trained model for the CrystaLLM-pi framework, specifically optimized for small unit cells (20 atoms or fewer). Based on a GPT-2 decoder-only architecture, it is trained on a corpus of Crystallographic Information Files (CIFs) to learn the syntax, symmetry, and chemical rules governing crystalline matter.

This model does not accept property conditioning vectors. It generates structures from text prompts (e.g., a chemical composition or space group) or unconditionally (ab initio generation).

  • Developed by: Bone et al. (University College London)
  • Model type: Autoregressive Transformer (GPT-2)
  • Language(s): CIF (Crystallographic Information File) syntax
  • License: MIT

Uses

Direct Use

The model is intended for:

  1. Unconditional Generation: Exploring the general chemical space of stable crystals with 20 atoms or fewer in the unit cell.
  2. Composition/Space Group Completion: Generating valid structures given a partial prompt (e.g., a chemical formula).
  3. Fine-tuning base: Serving as the pre-trained initialization for property-conditional models.
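For use case 2, conditioning is simply prompt construction: the model continues the opening lines of a CIF file. The sketch below shows this with plain strings; the exact prompt convention expected by the released tokenizer is an assumption here and should be checked against the repository's examples.

```python
# Hedged sketch (assumed prompt convention, not the released tokenizer spec):
# a CIF file opens with a data block named after the structure, so composition
# or space-group conditioning reduces to building that prefix as a string.

def composition_prompt(formula: str) -> str:
    # Prompting with only the data-block header asks the model to
    # complete the full structure for this composition.
    return f"data_{formula}\n"

def spacegroup_prompt(formula: str, spacegroup: str) -> str:
    # Appending the standard CIF symmetry tag additionally pins the
    # Hermann-Mauguin space group symbol.
    return (
        f"data_{formula}\n"
        f"_symmetry_space_group_name_H-M {spacegroup}\n"
    )

print(composition_prompt("NaCl"))
print(spacegroup_prompt("NaCl", "F m -3 m"))
```

The resulting string is tokenized and passed to the model for autoregressive completion, e.g. via `model.generate` in Hugging Face `transformers`.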

Out-of-Scope Use

  • Property Conditioning: This model cannot be steered by properties like band gap or density. Use the specific fine-tuned variants for those tasks.
  • Large Unit Cells: The model is strictly trained on and intended for unit cells containing 20 atoms or fewer.

Bias, Risks, and Limitations

  • Training Distribution: The model reflects the biases present in the Materials Project dataset. It is biased toward theoretical, DFT-relaxed inorganic compounds rather than experimentally synthesized disordered structures.
  • Size Constraint Bias: Because it is trained exclusively on the mp_20 subset, the model has a strong prior for generating small, highly symmetric unit cells (≤ 20 atoms) and will struggle to extrapolate to larger, more complex systems.
  • Validity: While it learns CIF syntax robustly, it may still generate physically invalid structures (e.g., overlapping atoms) or chemically unstable compositions.

Training Details

Training Data

The model was pre-trained on the mp_20 dataset (c-bone/mp_20), a curated subset of the Materials Project database restricted to crystal structures containing 20 atoms or fewer per unit cell.

  • Source: Materials Project (via c-bone/mp_20)
  • Preprocessing: CIFs are filtered for size (≤ 20 atoms), deduplicated, augmented (with symmetry operations and fractional coordinate shifts), and tokenized.
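The fractional-coordinate-shift augmentation mentioned above exploits the fact that translating every site by the same vector (modulo 1) yields an equivalent crystal, so each structure can be serialized many ways. A minimal sketch, assuming coordinates are stored as `(x, y, z)` tuples:

```python
# Hedged sketch of fractional-coordinate-shift augmentation: a common
# translation applied to all sites (mod 1) leaves the crystal unchanged
# but produces a different CIF serialization for the model to learn from.
import random

def shift_fractional_coords(coords, shift=None, rng=random):
    """coords: list of (x, y, z) fractional coordinates in [0, 1)."""
    if shift is None:
        # Draw one random translation vector shared by every site.
        shift = (rng.random(), rng.random(), rng.random())
    return [
        tuple((c + s) % 1.0 for c, s in zip(site, shift))
        for site in coords
    ]

# Rock-salt-like pair of sites, shifted by a fixed vector:
shifted = shift_fractional_coords(
    [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)], shift=(0.25, 0.25, 0.25)
)
# shifted == [(0.25, 0.25, 0.25), (0.75, 0.75, 0.75)]
```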

Training Procedure

  • Architecture: Reduced-size GPT-2 decoder (~25.9M parameters).
  • Objective: Causal Language Modeling (Next-token prediction).
  • Loss Function: Cross-entropy with specific weighting for fixed syntax tokens to accelerate learning of the CIF format.
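The weighted objective can be sketched as a per-token weighted cross-entropy, where positions holding fixed CIF syntax (tag names, punctuation) carry a different weight than value tokens. The specific weight values below are illustrative assumptions, not the released training configuration:

```python
# Hedged sketch of per-token weighted cross-entropy. Each position has a
# gold next token, the model's log-probability for it, and a loss weight;
# syntax tokens would get a distinct (assumed) weight from value tokens.
import math

def weighted_cross_entropy(logprobs, targets, weights):
    """logprobs: per-position dict token -> log-probability;
    targets: gold next tokens; weights: per-position loss weights."""
    losses = [-lp[t] * w for lp, t, w in zip(logprobs, targets, weights)]
    # Normalize by total weight so the scale is comparable across batches.
    return sum(losses) / sum(weights)

# One position, gold token predicted with probability 0.5, weight 2.0:
loss = weighted_cross_entropy([{"_cell": math.log(0.5)}], ["_cell"], [2.0])
# loss == log(2) ~= 0.693
```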

Evaluation

Metrics

The model is evaluated based on:

  1. Validity: The rate at which generated sequences can be parsed as valid CIF files.
  2. Structural Consistency: Adherence to space group symmetry and reasonable bond lengths.
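In the spirit of metric 1, a validity rate can be computed by attempting to parse each generated sequence. Published evaluations typically use a full CIF parser (e.g., pymatgen's); the stdlib sketch below only checks for a minimal CIF skeleton and is an illustrative stand-in:

```python
# Hedged sketch of a syntactic validity check: a sample counts as valid
# only if it has a CIF data block and the tags a structure minimally needs.
# A real evaluation would use a proper CIF parser plus geometry checks.

REQUIRED_TAGS = (
    "_cell_length_a",
    "_cell_length_b",
    "_cell_length_c",
    "_atom_site_fract_x",
)

def looks_like_valid_cif(text: str) -> bool:
    lines = text.splitlines()
    has_data_block = any(line.startswith("data_") for line in lines)
    return has_data_block and all(tag in text for tag in REQUIRED_TAGS)

def validity_rate(samples):
    # Fraction of generated sequences passing the syntactic check.
    return sum(looks_like_valid_cif(s) for s in samples) / len(samples)
```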

Results

The base model achieves high validity rates for small unit cells and effectively learns to generate chemically plausible structures, serving as a robust foundation for downstream tasks requiring rigid size constraints.

Citation

@misc{bone2025discoveryrecoverycrystallinematerials,
      title={Discovery and recovery of crystalline materials with property-conditioned transformers}, 
      author={Cyprien Bone and Matthew Walker and Kuangdai Leng and Luis M. Antunes and Ricardo Grau-Crespo and Amil Aligayev and Javier Dominguez and Keith T. Butler},
      year={2025},
      eprint={2511.21299},
      archivePrefix={arXiv},
      primaryClass={cond-mat.mtrl-sci},
      url={https://arxiv.org/abs/2511.21299}, 
}
