Model Card for KJML/gpt-oss-20b-FP8-Dynamic
This repository provides an FP8-dynamic quantized variant of OpenAI’s gpt-oss-20b model.
It is intended for users who want the reasoning capabilities of gpt-oss-20b with a smaller memory footprint and faster inference on modern GPUs that support FP8 inference.
⚠️ This model is not trained or fine-tuned further; it is a post-training quantization of the original openai/gpt-oss-20b weights.
Model Details
Model Description
- Base model: openai/gpt-oss-20b
- Architecture: Mixture-of-Experts (MoE) Transformer language model (≈21B total parameters, ≈3.6B active per token, inherited from the base)
- Quantization: FP8 dynamic (weights + activations) for inference
- Context length: Same as the base gpt-oss-20b (long-context, Harmony-format chat)
- Language(s): Primarily English; inherits multilingual capability from the base model
- License: Apache 2.0 (inherits from base model)
- Model type: Causal language model for text / chat generation
- Developer of this variant: KJML
- Finetuned from model: openai/gpt-oss-20b (no additional training; quantization only)
The original gpt-oss-20b is an open-weight reasoning model from OpenAI, designed for agentic workflows, tool use, and configurable reasoning effort. This FP8-dynamic variant preserves those capabilities while targeting more efficient deployment.
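For intuition, the "dynamic" part means that scaling factors are computed on the fly from each tensor at inference time, rather than from an offline calibration set. A minimal PyTorch sketch of the idea (illustrative only; real FP8 kernels fuse this scaling into the matmul, and the exact granularity, per-tensor vs. per-token, depends on the backend):

```python
import torch

def fp8_dynamic_quantize(x: torch.Tensor):
    """Quantize a tensor to FP8 (E4M3) with a dynamically computed scale."""
    FP8_E4M3_MAX = 448.0                                 # largest finite E4M3 value
    scale = x.abs().amax().clamp(min=1e-12) / FP8_E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale                                  # dequantize: x_fp8.float() * scale

x = torch.randn(4, 8)
x_fp8, scale = fp8_dynamic_quantize(x)
print((x - x_fp8.float() * scale).abs().max())           # small round-off error
```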
Model Sources
- Base model repository: https://huggingface.co/openai/gpt-oss-20b
- Upstream project / docs: https://github.com/openai/gpt-oss
- This quantized model: https://huggingface.co/KJML/gpt-oss-20b-FP8-Dynamic (this repo)
Uses
Direct Use
Typical direct-use scenarios (without additional fine-tuning):
- General chat and assistant-style dialogue (English-first)
- Reasoning and analysis (step-by-step / chain-of-thought) for:
  - Technical explanations
  - Brainstorming and ideation
  - Code reasoning and pseudo-code (light coding assistance)
- Agentic / tool-using setups:
  - Function calling and structured outputs
  - Retrieval-augmented generation (RAG) backends
- Local “AI PC” / workstation deployments where FP8 is supported
Note: gpt-oss models are trained on OpenAI's Harmony response format. For best results, use a chat template that applies the Harmony format (e.g. tokenizer.apply_chat_template in Transformers) when prompting.
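As a hedged sketch of how function calling fits into this (assuming a recent Transformers version; get_weather is a hypothetical tool used only for illustration), the chat template can render tool definitions into the prompt for you:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KJML/gpt-oss-20b-FP8-Dynamic")

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    ...  # hypothetical tool; only the signature and docstring matter for the template

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[get_weather],          # rendered into the prompt as a tool definition
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)  # Harmony-formatted prompt including the tool schema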
Downstream Use
The FP8-dynamic variant can be used as a drop-in replacement for openai/gpt-oss-20b in:
- Custom backends with vLLM / TGI / custom inference servers
- Local desktop apps (LM Studio, Ollama-style setups, etc.) that support FP8
- RAG systems where latency and VRAM usage are important
- Multi-agent frameworks where many concurrent contexts are needed
If you fine-tune or adapt this model further, treat it as you would the base gpt-oss-20b model, but keep in mind that quantization can slightly change numeric behavior, especially for very long generations.
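As an example of the drop-in usage above, a minimal vLLM sketch (assuming a vLLM build recent enough to support gpt-oss and FP8 on your GPU):

```python
from vllm import LLM, SamplingParams

# vLLM picks up the quantization config from the model repository automatically.
llm = LLM(model="KJML/gpt-oss-20b-FP8-Dynamic")

outputs = llm.chat(
    [{"role": "user", "content": "Give me two sentences on FP8 inference."}],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```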
Out-of-Scope Use
The model (and this quantized variant) is not recommended for:
- High-stakes decision making without human review, e.g.:
  - Medical, legal, or financial advice
  - Safety-critical environments (autonomous driving, industrial control, etc.)
- Generating content that violates laws or platform policies
- Acting as the sole decision-maker in any context where errors could cause harm to people or property
Users should always keep a human in the loop for sensitive or impactful applications.
Bias, Risks, and Limitations
This model inherits all biases, risks, and limitations of the base gpt-oss-20b model. As a large language model trained on internet-scale data, it may:
- Produce biased or stereotypical content, including along axes such as gender, race, nationality, or religion.
- Hallucinate facts, references, or citations.
- Overstate its own certainty.
- Generate unsafe or undesirable content if prompted adversarially or without proper safety layers.
The FP8-dynamic quantization may also:
- Introduce small degradations in quality vs. BF16 / MXFP4 versions, particularly for:
  - Very long generations
  - Edge cases that are numerically sensitive
- Behave slightly differently from the base model, even with identical prompts.
Recommendations
- Do not rely on this model as a single source of truth.
- Add safety filters and/or a moderation layer around generations.
- Use human review for any high-impact or user-facing deployment.
- Evaluate the FP8-dynamic variant on your own tasks and data before using in production.
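A minimal smoke-test sketch for that last point, computing perplexity on a handful of your own texts (illustrative only; a real evaluation should use your actual tasks and a proper benchmark harness):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KJML/gpt-oss-20b-FP8-Dynamic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

texts = ["Replace these strings with samples from your own domain."]

losses = []
for text in texts:
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])  # causal LM loss = mean NLL per token
    losses.append(out.loss.item())

mean_nll = sum(losses) / len(losses)
print(f"mean NLL: {mean_nll:.4f}  perplexity: {torch.exp(torch.tensor(mean_nll)).item():.2f}")
```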
How to Get Started with the Model
Basic usage with 🤗 Transformers:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "KJML/gpt-oss-20b-FP8-Dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # picks up the FP8 quantization config where supported
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain what FP8 dynamic quantization is in simple terms."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
Make sure you are using a recent version of Transformers and a PyTorch build that supports FP8 where applicable.
Training Details
Training Data
No new training data is introduced in this repository.
- This model is not trained from scratch.
- It directly reuses the weights and training data of openai/gpt-oss-20b.
- For full details on the original training data and methodology, see the official gpt-oss model card and paper.
Training Procedure
No additional gradient-based training was performed. The steps were:
- Start from the base openai/gpt-oss-20b weights.
- Apply FP8-dynamic post-training quantization (weights and activations) for inference.
- Export the quantized weights to safetensors format for deployment.
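This repository does not document the exact tooling, but as a hedged illustration, FP8-dynamic checkpoints in this style are commonly produced with vLLM's llm-compressor; a hypothetical recipe might look like:

```python
from transformers import AutoModelForCausalLM
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype="auto")

# FP8_DYNAMIC: static per-channel FP8 weights + dynamic per-token FP8 activations.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)   # no calibration data needed for dynamic activations
model.save_pretrained("gpt-oss-20b-FP8-Dynamic", save_compressed=True)
```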
Preprocessing
No extra data preprocessing was done beyond what OpenAI used for the base model.
Training Hyperparameters
- Training regime for this repo: None (no fine-tuning; quantization only)
- Original base model: Trained by OpenAI using high-precision training and post-training MXFP4 quantization of MoE weights (see upstream model card / paper for specifics).
Speeds, Sizes, Times
Exact performance depends on your hardware and FP8 support, but in general:
- VRAM usage: Lower than the BF16 / MXFP4 original, enabling more concurrent contexts or larger batch sizes.
- Throughput: Higher tokens/sec on FP8-capable hardware compared to running BF16 weights, especially at batch size >1.
You should benchmark on your own GPU(s) for precise numbers.
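As a rough back-of-envelope check (assuming ≈21B total parameters): FP8 weights take about 21B × 1 byte ≈ 21 GB, versus ≈42 GB for the same weights in BF16 (2 bytes per parameter); actual VRAM usage is higher once activations, KV cache, and framework overhead are included.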
Evaluation
No separate benchmark suite has been run specifically for the FP8-dynamic variant at this time.
Testing Data, Factors & Metrics
- Testing data: Not re-evaluated independently here.
- It is reasonable to expect qualitative behavior similar to openai/gpt-oss-20b, with minor differences due to quantization.
Results
If you run your own evals (e.g. on reasoning or coding benchmarks), please share them via issues, PRs, or discussion threads so others can reference them.
Summary
- Use this model when you want gpt-oss-20b-level reasoning with lower memory usage and better throughput.
- Expect small quality differences vs. the original due to FP8 quantization.
Model Examination
No additional interpretability or probing analysis has been carried out on this quantized variant.
For deeper analysis and interpretability work, refer to:
- The official gpt-oss paper / model card.
- Independent community evaluations of gpt-oss-20b.
Environmental Impact
This repository does not involve training a new model.
- The main compute cost is a one-time quantization pass over the base weights.
- Carbon footprint is therefore negligible compared to the original model training.
For estimates of training-time emissions, please consult the original gpt-oss model card and related publications.
Technical Specifications
Model Architecture and Objective
- Architecture: Mixture-of-Experts Transformer language model (same as gpt-oss-20b)
- Objective: Next-token prediction / causal language modeling
- Quantization:
  - FP8 dynamic for weights and activations at inference time
  - Intended for GPUs / accelerators that support efficient FP8 matmul
The quantization is applied in a way that preserves the original architecture and I/O behavior.
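As a quick, hedged way to check whether an NVIDIA GPU has hardware FP8 support (FP8 tensor cores generally require Ada, SM 8.9, or Hopper, SM 9.0, and newer):

```python
import torch

major, minor = torch.cuda.get_device_capability()
# Ada (8.9) and Hopper (9.0)+ have FP8 tensor cores; older GPUs fall back or fail.
print(f"compute capability {major}.{minor}; FP8-capable: {(major, minor) >= (8, 9)}")
```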
Compute Infrastructure
Quantization was performed on a single modern GPU (exact hardware details are not pinned down here; see the repository description or commit history if you need them).
Hardware
- Single GPU with FP8 support (for quantization and testing)
- Standard CPU + RAM sufficient to host original and quantized weights
Software
- PyTorch (FP8-capable build)
- Hugging Face Transformers
- Supporting libraries for FP8 quantization and safetensor export
Citation
If you use this model in academic or commercial work, please cite at least the original gpt-oss paper/model card from OpenAI:
BibTeX:
```bibtex
@misc{openai2025gptoss120bgptoss20bmodel,
  title={gpt-oss-120b & gpt-oss-20b Model Card},
  author={OpenAI},
  year={2025},
  eprint={2508.10925},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.10925}
}
```
You may also optionally reference this quantized variant as:
```bibtex
@misc{kjml2025gptoss20bfp8dynamic,
  title={KJML/gpt-oss-20b-FP8-Dynamic: FP8-dynamic Quantized Variant of gpt-oss-20b},
  author={KJML},
  year={2025},
  howpublished={Hugging Face model repository},
  url={https://huggingface.co/KJML/gpt-oss-20b-FP8-Dynamic}
}
```
Glossary
- MoE (Mixture-of-Experts): Architecture where only a subset of “experts” (parameter blocks) are active per token, reducing compute vs. dense models.
- FP8 dynamic: 8-bit floating point representation with dynamic scaling, used to reduce memory and bandwidth while preserving model quality.
- Harmony format: OpenAI’s chat / response formatting used for training gpt-oss models; must be respected for best performance.
More Information
- Base model details, prompts, and advanced usage examples: see openai/gpt-oss-20b on Hugging Face and the official gpt-oss GitHub repository.
- For questions, issues, or suggestions around this FP8-dynamic variant, please open an issue or discussion in this repository.
Model Card Authors
- Author: KJML
- Contact: [email protected]