
Llama 3.2 3B GSM8K PPO Checkpoint

A complete PPO checkpoint (actor, critic, and optimizer state) trained with VERL on the GSM8K dataset.

Contents

  • actor/: Actor (policy) model - use this for inference (see the download sketch below)
  • critic/: Critic (value) model - used only during training
  • data.pt: Training state (optimizer and scheduler states)
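
Only the actor/ subfolder is needed for inference, so you can skip the critic weights and optimizer state when downloading. A minimal sketch using huggingface_hub's snapshot_download (the allow_patterns filter is standard huggingface_hub behavior; printing the path is just for illustration):

from huggingface_hub import snapshot_download

# Download only the actor (policy) weights; critic/ and data.pt are skipped
local_dir = snapshot_download(
    repo_id="samhitha2601/llama-3.2-3b-gsm8k-ppo-verl-step12",
    allow_patterns=["actor/*"],
)
print(local_dir)  # pass f"{local_dir}/actor" to from_pretrained to load locally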

Quick Start - Inference with Actor Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the actor (policy) model for inference; the checkpoint is stored in BF16
model = AutoModelForCausalLM.from_pretrained(
    "samhitha2601/llama-3.2-3b-gsm8k-ppo-verl-step12",
    subfolder="actor",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
    "samhitha2601/llama-3.2-3b-gsm8k-ppo-verl-step12",
    subfolder="actor",
)

# Generate a step-by-step solution
question = "If John has 5 apples and buys 3 more, how many apples does he have?"
prompt = f"Question: {question}\nAnswer: Let's solve this step by step.\n"

inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
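
GSM8K solutions conventionally end with a final numeric answer, which the dataset marks with "#### <number>". The helper below pulls that answer out of the generated text; it is a sketch that assumes the model mirrors the dataset's convention, with a fallback to the last number in the completion:

import re

def extract_final_answer(text: str):
    # Prefer the GSM8K-style "#### 8" marker if the model emits it
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if m:
        return m.group(1).replace(",", "")
    # Otherwise fall back to the last number in the completion
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

print(extract_final_answer("5 + 3 = 8. #### 8"))  # -> "8"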

Training Details

  • Base Model: meta-llama/Llama-3.2-3B
  • Algorithm: PPO (Proximal Policy Optimization)
  • Framework: VERL
  • Dataset: GSM8K (math reasoning)
  • Checkpoint format: Safetensors, BF16, ~3B parameters
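
To inspect or resume the training state, data.pt can be opened with torch.load. This is a sketch assuming data.pt is a standard PyTorch pickle of optimizer/scheduler state, as the Contents list suggests; VERL's exact layout may differ, so check the keys first:

import torch

# weights_only=False is required for pickled optimizer/scheduler objects
# (recent PyTorch versions default to weights_only=True)
state = torch.load("data.pt", map_location="cpu", weights_only=False)
print(type(state))
if isinstance(state, dict):
    print(list(state.keys()))  # e.g. optimizer/scheduler entries; layout may vary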