Model Card: llm-course-hw2-reward-model

This model was trained as part of Homework 2 in the HSE LLM Course.
It serves as a Reward Model (RM) for PPO-style RLHF training.
The model is trained on pairs of (chosen, rejected) responses and learns to assign a higher scalar reward to the preferred output.

Model Description

  • Base model: HuggingFaceTB/SmolLM-135M-Instruct
  • Architecture: sequence classification head on the base model (loaded via AutoModelForSequenceClassification)
  • Labels: 1 (single scalar reward)
  • Method: Reward modeling for RLHF
  • Dataset: HumanLLMs/Human-Like-DPO-Dataset (converted into an implicit-prompt preference format: chosen/rejected texts that include the prompt)
  • Task: Score chosen responses higher than rejected ones

The reward model is used together with the PPO policy and value models during alignment training.
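
For context, the reward model is optimized with the standard pairwise (Bradley-Terry style) preference objective, as used by TRL's RewardTrainer. The sketch below is illustrative rather than the exact training code.

import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Push the scalar reward of the chosen response above that of the
    # rejected one: loss = -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()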

Uses

This model is intended for educational purposes, including:

  • demonstrating how to train reward models with TRL
  • providing reward signals for PPO fine-tuning
  • experimenting with RLHF pipelines

Not intended for production use.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "sodeniZz/llm-course-hw2-reward-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Single-logit classification head: the logit is the scalar reward.
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()
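
To score a response, format the conversation the same way it was formatted during training and read the single logit as the reward. The chat-template formatting and the example pair below are assumptions for illustration; adjust them if your pipeline builds inputs differently.

# Hypothetical example pair; a higher logit means "more preferred".
prompt = "What's your favourite season?"
response = "I love autumn, the colours are beautiful!"

messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
with torch.no_grad():
    reward = model(**inputs).logits[0, 0].item()

print(f"Reward: {reward:.3f}")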

Training Hyperparameters

  • Epochs: 1
  • Batch size: 16 (per device)
  • Max sequence length: 1024
  • Learning rate: 3e-4
  • Disable dropout: True
  • Seed: 42
  • Logging steps: 25
  • Optimizer: AdamW
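
A minimal TRL training sketch mirroring these hyperparameters is shown below. The dataset conversion (folding the prompt into each response) and the exact TRL version are assumptions and may differ from what was actually used.

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from trl import RewardConfig, RewardTrainer

base_model_id = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForSequenceClassification.from_pretrained(base_model_id, num_labels=1)

# Sequence classification needs a pad token; fall back to EOS if none is set.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

def to_preference(example):
    # Hypothetical conversion to the implicit-prompt format: each text
    # contains the full prompt + response conversation as a single string.
    def render(response):
        return tokenizer.apply_chat_template(
            [{"role": "user", "content": example["prompt"]},
             {"role": "assistant", "content": response}],
            tokenize=False,
        )
    return {"chosen": render(example["chosen"]), "rejected": render(example["rejected"])}

dataset = dataset.map(to_preference)

training_args = RewardConfig(
    output_dir="llm-course-hw2-reward-model",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    learning_rate=3e-4,
    max_length=1024,
    disable_dropout=True,
    logging_steps=25,
    seed=42,
)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)
trainer.train()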