# LLaMA 3.2 3B GSM8K PPO Checkpoint
A complete PPO checkpoint trained with VERL on the GSM8K dataset.
## Contents

- `actor/`: Actor (policy) model; use this for inference (a download sketch follows this list)
- `critic/`: Critic (value) model; only needed during training
- `data.pt`: Training state (optimizer and scheduler states)
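If you only need the policy weights for inference, you can fetch just the `actor/` subfolder instead of the full checkpoint. A minimal sketch using `huggingface_hub`; the pattern filter assumes the directory layout listed above:

```python
from huggingface_hub import snapshot_download

# Download only the actor/ subfolder, skipping critic weights and data.pt
local_dir = snapshot_download(
    repo_id="samhitha2601/llama-3.2-3b-gsm8k-ppo-verl-step12",
    allow_patterns=["actor/*"],
)
print(local_dir)  # the policy model files land under <local_dir>/actor/
```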
## Quick Start: Inference with the Actor Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the actor (policy) model and tokenizer for inference
model = AutoModelForCausalLM.from_pretrained(
    "samhitha2601/llama-3.2-3b-gsm8k-ppo-verl-step12", subfolder="actor"
)
tokenizer = AutoTokenizer.from_pretrained(
    "samhitha2601/llama-3.2-3b-gsm8k-ppo-verl-step12", subfolder="actor"
)

# Generate a step-by-step answer (do_sample=True is required for temperature to take effect)
question = "If John has 5 apples and buys 3 more, how many apples does he have?"
prompt = f"Question: {question}\nAnswer: Let's solve this step by step.\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
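The `critic/` model and `data.pt` are only useful for resuming PPO training, not for inference. If you want to peek at the saved training state, here is a minimal sketch; the structure of `data.pt` depends on the VERL version, so the code below only prints whatever keys the checkpoint actually contains rather than assuming a documented schema:

```python
import torch

# data.pt is a torch pickle; only load checkpoints from sources you trust
state = torch.load("data.pt", map_location="cpu", weights_only=False)

# Inspect what the training state holds (typically optimizer/scheduler dicts)
print(type(state))
if isinstance(state, dict):
    print(list(state.keys()))
```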
## Training Details
- Base Model: meta-llama/Llama-3.2-3B
- Algorithm: PPO (Proximal Policy Optimization)
- Framework: VERL
- Dataset: GSM8K (grade-school math word problems); an answer-extraction sketch follows this list
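For scoring outputs on GSM8K, reference solutions end with a `#### <number>` marker. A hedged sketch for extracting the final numeric answer from a completion; the regexes and the fall-back to the last number in the text are choices of this example, not part of VERL or the GSM8K tooling:

```python
import re

def extract_final_answer(text: str) -> str | None:
    """Return the final numeric answer, preferring an explicit '#### x' marker."""
    marker = re.search(r"####\s*(-?[\d,.]+)", text)
    if marker:
        return marker.group(1).replace(",", "")
    # Fall back to the last number mentioned in the completion
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else None

print(extract_final_answer("John has 5 + 3 = 8 apples.\n#### 8"))  # -> 8
```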