LLM Course Homework 2: RLHF (DPO & PPO)
The collection includes the DPO-trained model, PPO-trained model, and the Reward Model used for PPO.
This model was trained as part of Homework 2 in the HSE LLM Course.
It serves as a Reward Model (RM) for PPO-style RLHF training.
The model is trained on (chosen, rejected) response pairs and learns to assign higher reward scores to the preferred outputs.
The model is used together with PPO policy/value models during alignment training.
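As a rough illustration of the objective behind this setup (a sketch, not the exact training code from the homework), reward models of this kind are typically fit with a pairwise Bradley–Terry style loss that pushes the score of the chosen response above the score of the rejected one:

import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # reward_chosen / reward_rejected are scalar RM scores for the preferred and
    # dispreferred responses to the same prompt.
    # Minimizing -log sigmoid(r_chosen - r_rejected) encourages r_chosen > r_rejected.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()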
This model is intended for educational purposes as part of the course homework. It is not intended for production use.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the reward model and its tokenizer from the Hugging Face Hub.
model_id = "sodeniZz/llm-course-hw2-reward-model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()
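A minimal scoring sketch follows. The prompt formatting here is an assumption (use the same template the model was trained with), and a single scalar reward head is assumed:

# Score one (prompt, response) pair; a higher score means a more preferred response.
# NOTE: the exact prompt template used during training is an assumption here.
text = "User: What is RLHF?\nAssistant: RLHF aligns language models with human preferences."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    reward = model(**inputs).logits.squeeze(-1)  # assumes num_labels == 1 (scalar reward head)
print(f"Reward score: {reward.item():.4f}")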