
Llama 3.2 3B GSM8K PPO Checkpoint

A complete PPO checkpoint (actor, critic, and optimizer state) trained with VERL on the GSM8K dataset.

Contents

  • actor/: Actor (policy) model - use this for inference (see the download sketch below)
  • critic/: Critic (value) model - used only during training
  • data.pt: Training state (optimizer and scheduler states)
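
Only the actor/ subfolder is needed for inference, so you can skip the critic weights and optimizer state when downloading. A minimal sketch using huggingface_hub's snapshot_download (the allow_patterns filter is standard huggingface_hub behavior; printing the path is just for illustration):

from huggingface_hub import snapshot_download

# Download only the actor (policy) weights; critic/ and data.pt are skipped
local_dir = snapshot_download(
    repo_id="samhitha2601/llama-3.2-3b-gsm8k-ppo-verl-step12",
    allow_patterns=["actor/*"],
)
print(local_dir)  # pass f"{local_dir}/actor" to from_pretrained to load locally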

Quick Start - Inference with Actor Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the actor (policy) model for inference; the checkpoint is stored in BF16
model = AutoModelForCausalLM.from_pretrained(
    "samhitha2601/llama-3.2-3b-gsm8k-ppo-verl-step12",
    subfolder="actor",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
    "samhitha2601/llama-3.2-3b-gsm8k-ppo-verl-step12",
    subfolder="actor",
)

# Generate a step-by-step solution
question = "If John has 5 apples and buys 3 more, how many apples does he have?"
prompt = f"Question: {question}\nAnswer: Let's solve this step by step.\n"

inputs = tokenizer(prompt, return_tensors="pt")
# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
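
GSM8K solutions conventionally end with a final numeric answer, which the dataset marks with "#### <number>". The helper below pulls that answer out of the generated text; it is a sketch that assumes the model mirrors the dataset's convention, with a fallback to the last number in the completion:

import re

def extract_final_answer(text: str):
    # Prefer the GSM8K-style "#### 8" marker if the model emits it
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if m:
        return m.group(1).replace(",", "")
    # Otherwise fall back to the last number in the completion
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None

print(extract_final_answer("5 + 3 = 8. #### 8"))  # -> "8"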

Training Details

  • Base Model: meta-llama/Llama-3.2-3B
  • Algorithm: PPO (Proximal Policy Optimization)
  • Framework: VERL
  • Dataset: GSM8K (math reasoning)
  • Checkpoint format: Safetensors, BF16, ~3B parameters
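
To inspect or resume the training state, data.pt can be opened with torch.load. This is a sketch assuming data.pt is a standard PyTorch pickle of optimizer/scheduler state, as the Contents list suggests; VERL's exact layout may differ, so check the keys first:

import torch

# weights_only=False is required for pickled optimizer/scheduler objects
# (recent PyTorch versions default to weights_only=True)
state = torch.load("data.pt", map_location="cpu", weights_only=False)
print(type(state))
if isinstance(state, dict):
    print(list(state.keys()))  # e.g. optimizer/scheduler entries; layout may vary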