Model Card for KJML/gpt-oss-20b-FP8-Dynamic
This repository provides an FP8-dynamic quantized variant of OpenAI’s gpt-oss-20b model.
It is intended for users who want the reasoning capabilities of gpt-oss-20b with a smaller memory footprint and faster inference on modern GPUs that support FP8 inference.
⚠️ This model is not trained or fine-tuned further; it is a post-training quantization of the original openai/gpt-oss-20b weights.
Model Details
Model Description
- Base model: openai/gpt-oss-20b
- Architecture: Mixture-of-Experts (MoE) Transformer language model (≈21B total parameters, ≈3.6B active per token, inherited from the base)
- Quantization: FP8 dynamic (weights + activations) for inference
- Context length: Same as the base gpt-oss-20b (long-context, Harmony-format chat)
- Language(s): Primarily English; inherits multilingual capability from the base model
- License: Apache 2.0 (inherits from base model)
- Model type: Causal language model for text / chat generation
- Developer of this variant: KJML
- Finetuned from model: openai/gpt-oss-20b (no additional training; quantization only)
The original gpt-oss-20b is an open-weight reasoning model from OpenAI, designed for agentic workflows, tool use, and configurable reasoning effort. This FP8-dynamic variant preserves those capabilities while targeting more efficient deployment.
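For intuition, the "dynamic" part means that scaling factors are computed on the fly from each tensor at inference time, rather than from an offline calibration set. A minimal PyTorch sketch of the idea (illustrative only; real FP8 kernels fuse this scaling into the matmul, and the exact granularity, per-tensor vs. per-token, depends on the backend):

```python
import torch

def fp8_dynamic_quantize(x: torch.Tensor):
    """Quantize a tensor to FP8 (E4M3) with a dynamically computed scale."""
    FP8_E4M3_MAX = 448.0                                 # largest finite E4M3 value
    scale = x.abs().amax().clamp(min=1e-12) / FP8_E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale                                  # dequantize: x_fp8.float() * scale

x = torch.randn(4, 8)
x_fp8, scale = fp8_dynamic_quantize(x)
print((x - x_fp8.float() * scale).abs().max())           # small round-off error
```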
Model Sources
- Base model repository: https://huggingface.co/openai/gpt-oss-20b
- Upstream project / docs: https://github.com/openai/gpt-oss
- This quantized model: https://huggingface.co/KJML/gpt-oss-20b-FP8-Dynamic (this repo)
Uses
Direct Use
Typical direct-use scenarios (without additional fine-tuning):
- General chat and assistant-style dialogue (English-first)
- Reasoning and analysis (step-by-step / chain-of-thought) for:
  - Technical explanations
  - Brainstorming and ideation
  - Code reasoning and pseudo-code (light coding assistance)
- Agentic / tool-using setups:
  - Function calling and structured outputs
  - Retrieval-augmented generation (RAG) backends
- Local “AI PC” / workstation deployments where FP8 is supported
Note: gpt-oss models are trained on OpenAI's Harmony response format. For best results, use a chat template that applies the Harmony format (e.g. tokenizer.apply_chat_template in Transformers) when prompting.
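As a hedged sketch of how function calling fits into this (assuming a recent Transformers version; get_weather is a hypothetical tool used only for illustration), the chat template can render tool definitions into the prompt for you:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KJML/gpt-oss-20b-FP8-Dynamic")

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    ...  # hypothetical tool; only the signature and docstring matter for the template

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[get_weather],          # rendered into the prompt as a tool definition
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)  # Harmony-formatted prompt including the tool schema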
Downstream Use
The FP8-dynamic variant can be used as a drop-in replacement for openai/gpt-oss-20b in:
- Custom backends with vLLM / TGI / custom inference servers
- Local desktop apps (LM Studio, Ollama-style setups, etc.) that support FP8
- RAG systems where latency and VRAM usage are important
- Multi-agent frameworks where many concurrent contexts are needed
If you fine-tune or adapt this model further, treat it as you would the base gpt-oss-20b model, but keep in mind that quantization can slightly change numeric behavior, especially for very long generations.
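As an example of the drop-in usage above, a minimal vLLM sketch (assuming a vLLM build recent enough to support gpt-oss and FP8 on your GPU):

```python
from vllm import LLM, SamplingParams

# vLLM picks up the quantization config from the model repository automatically.
llm = LLM(model="KJML/gpt-oss-20b-FP8-Dynamic")

outputs = llm.chat(
    [{"role": "user", "content": "Give me two sentences on FP8 inference."}],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```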
Out-of-Scope Use
The model (and this quantized variant) is not recommended for:
- High-stakes decision making without human review, e.g.:
  - Medical, legal, or financial advice
  - Safety-critical environments (autonomous driving, industrial control, etc.)
- Generating content that violates laws or platform policies
- Acting as the sole decision-maker in any context where errors could cause harm to people or property
Users should always keep a human in the loop for sensitive or impactful applications.
Bias, Risks, and Limitations
This model inherits all biases, risks, and limitations of the base gpt-oss-20b model. As a large language model trained on internet-scale data, it may:
- Produce biased or stereotypical content, including along axes such as gender, race, nationality, or religion.
- Hallucinate facts, references, or citations.
- Overstate its own certainty.
- Generate unsafe or undesirable content if prompted adversarially or without proper safety layers.
The FP8-dynamic quantization may also:
- Introduce small degradations in quality vs. BF16 / MXFP4 versions, particularly for:
  - Very long generations
  - Edge cases that are numerically sensitive
- Behave slightly differently from the base model, even with identical prompts.
Recommendations
- Do not rely on this model as a single source of truth.
- Add safety filters and/or a moderation layer around generations.
- Use human review for any high-impact or user-facing deployment.
- Evaluate the FP8-dynamic variant on your own tasks and data before using in production.
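A minimal smoke-test sketch for that last point, computing perplexity on a handful of your own texts (illustrative only; a real evaluation should use your actual tasks and a proper benchmark harness):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KJML/gpt-oss-20b-FP8-Dynamic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

texts = ["Replace these strings with samples from your own domain."]

losses = []
for text in texts:
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # causal LM loss = mean NLL per token
    losses.append(out.loss.item())

mean_nll = sum(losses) / len(losses)
print(f"mean NLL: {mean_nll:.4f}  perplexity: {torch.exp(torch.tensor(mean_nll)).item():.2f}")
```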
How to Get Started with the Model
Basic usage with 🤗 Transformers:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "KJML/gpt-oss-20b-FP8-Dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # picks up the FP8 quantization config where supported
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain what FP8 dynamic quantization is in simple terms."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
Make sure you are using a recent version of Transformers and a PyTorch build that supports FP8 where applicable.
Training Details
Training Data
No new training data is introduced in this repository.
- This model is not trained from scratch.
- It directly reuses the weights and training data of openai/gpt-oss-20b.
- For full details on the original training data and methodology, see the official gpt-oss model card and paper.
Training Procedure
No additional gradient-based training was performed. The steps were:
- Start from the base openai/gpt-oss-20b weights.
- Apply FP8-dynamic post-training quantization (weights and activations) for inference.
- Export the quantized weights to safetensors format for deployment.
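This repository does not document the exact tooling, but as a hedged illustration, FP8-dynamic checkpoints in this style are commonly produced with vLLM's llm-compressor; a hypothetical recipe might look like:

```python
from transformers import AutoModelForCausalLM
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype="auto")

# FP8_DYNAMIC: static per-channel FP8 weights + dynamic per-token FP8 activations.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)   # no calibration data needed for dynamic activations
model.save_pretrained("gpt-oss-20b-FP8-Dynamic", save_compressed=True)
```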
Preprocessing
No extra data preprocessing was done beyond what OpenAI used for the base model.
Training Hyperparameters
- Training regime for this repo: None (no fine-tuning; quantization only)
- Original base model: Trained by OpenAI using high-precision training and post-training MXFP4 quantization of MoE weights (see upstream model card / paper for specifics).
Speeds, Sizes, Times
Exact performance depends on your hardware and FP8 support, but in general:
- VRAM usage: Lower than the BF16 / MXFP4 original, enabling more concurrent contexts or larger batch sizes.
- Throughput: Higher tokens/sec on FP8-capable hardware compared to running BF16 weights, especially at batch size >1.
You should benchmark on your own GPU(s) for precise numbers.
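As a rough back-of-envelope check (assuming ≈21B total parameters): FP8 weights take about 21B × 1 byte ≈ 21 GB, versus ≈42 GB for the same weights in BF16 (2 bytes per parameter); actual VRAM usage is higher once activations, KV cache, and framework overhead are included.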
Evaluation
No separate benchmark suite has been run specifically for the FP8-dynamic variant at this time.
Testing Data, Factors & Metrics
- Testing data: Not re-evaluated independently here.
- It is reasonable to expect qualitative behavior similar to openai/gpt-oss-20b, with minor differences due to quantization.
Results
If you run your own evals (e.g. on reasoning or coding benchmarks), please share them via issues, PRs, or discussion threads so others can reference them.
Summary
- Use this model when you want gpt-oss-20b-level reasoning with lower memory usage and better throughput.
- Expect small quality differences vs. the original due to FP8 quantization.
Model Examination
No additional interpretability or probing analysis has been carried out on this quantized variant.
For deeper analysis and interpretability work, refer to:
- The official gpt-oss paper / model card.
- Independent community evaluations of gpt-oss-20b.
Environmental Impact
This repository does not involve training a new model.
- The main compute cost is a one-time quantization pass over the base weights.
- Carbon footprint is therefore negligible compared to the original model training.
For estimates of training-time emissions, please consult the original gpt-oss model card and related publications.
Technical Specifications
Model Architecture and Objective
- Architecture: Mixture-of-Experts Transformer language model (same as gpt-oss-20b)
- Objective: Next-token prediction / causal language modeling
- Quantization:
  - FP8 dynamic for weights and activations at inference time
  - Intended for GPUs / accelerators that support efficient FP8 matmul
The quantization is applied in a way that preserves the original architecture and I/O behavior.
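As a quick, hedged way to check whether an NVIDIA GPU has hardware FP8 support (FP8 tensor cores generally require Ada, SM 8.9, or Hopper, SM 9.0, and newer):

```python
import torch

major, minor = torch.cuda.get_device_capability()
# Ada (8.9) and Hopper (9.0)+ have FP8 tensor cores; older GPUs fall back or fail.
print(f"compute capability {major}.{minor}; FP8-capable: {(major, minor) >= (8, 9)}")
```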
Compute Infrastructure
Quantization was performed on a single modern GPU (exact hardware details are not pinned down here; see the repository description or commit history if you need them).
Hardware
- Single GPU with FP8 support (for quantization and testing)
- Standard CPU + RAM sufficient to host original and quantized weights
Software
- PyTorch (FP8-capable build)
- Hugging Face Transformers
- Supporting libraries for FP8 quantization and safetensor export
Citation
If you use this model in academic or commercial work, please cite at least the original gpt-oss paper/model card from OpenAI:
BibTeX:
```bibtex
@misc{openai2025gptoss120bgptoss20bmodel,
  title={gpt-oss-120b & gpt-oss-20b Model Card},
  author={OpenAI},
  year={2025},
  eprint={2508.10925},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.10925}
}
```
You may also optionally reference this quantized variant as:
```bibtex
@misc{kjml2025gptoss20bfp8dynamic,
  title={KJML/gpt-oss-20b-FP8-Dynamic: FP8-dynamic Quantized Variant of gpt-oss-20b},
  author={KJML},
  year={2025},
  howpublished={Hugging Face model repository},
  url={https://huggingface.co/KJML/gpt-oss-20b-FP8-Dynamic}
}
```
Glossary
- MoE (Mixture-of-Experts): Architecture where only a subset of “experts” (parameter blocks) are active per token, reducing compute vs. dense models.
- FP8 dynamic: 8-bit floating point representation with dynamic scaling, used to reduce memory and bandwidth while preserving model quality.
- Harmony format: OpenAI’s chat / response formatting used for training gpt-oss models; must be respected for best performance.
More Information
- Base model details, prompts, and advanced usage examples: see openai/gpt-oss-20b on Hugging Face and the official gpt-oss GitHub repository.
- For questions, issues, or suggestions around this FP8-dynamic variant, please open an issue or discussion in this repository.
Model Card Authors
- Author: KJML
- Contact: [email protected]