ReAligned Classifier


Overview

Eric Hartford and QuixiAI present ReAligned Classifier, a lightweight bias detector built on the meta-llama/Llama-3.2-1B architecture. ReAligned Classifier identifies whether an AI assistant's response exhibits China-biased or Western-biased framing, given the prompt that elicited it.

ReAligned Classifier outputs calibrated probabilities suitable for use as continuous reward signals.

Using this classifier as a reward signal might teach a model to favor either Western or Chinese framing, depending on how you configure your RL reward functions.

Model Architecture

  • Base Model: meta-llama/Llama-3.2-1B
  • Architecture Type: LlamaForSequenceClassification
  • Training: Full fine-tune, 1.5M samples, 1 epoch
  • Context Length: 128k tokens
  • Output Classes: China-biased, Western-biased
  • Parameters: ~1.24B
  • Precision: BF16

Performance

Metric                    Score
Overall Accuracy          99.8%
China-biased Accuracy     99.9%
Western-biased Accuracy   99.8%
Eval Loss                 0.003

Training Details

Dataset

~1.5M individual labeled examples

Dataset Statistics

  • Total Examples: 1,519,759
  • Train: 1,443,771
  • Test: 75,988
  • Median Sequence Length: 1,034 tokens

Input Format

Each training example is formatted as:

PROMPT: {user prompt}
RESPONSE: {assistant response}

Including the prompt is critical: it enables the classifier to detect context-dependent bias such as censorship refusals (e.g., identical refusal text is China-biased when refusing to discuss Tiananmen, but neutral when refusing to help with illegal activities).
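A minimal helper for producing this format (the function name `format_example` is illustrative, not part of the released code):

```python
def format_example(prompt: str, response: str) -> str:
    """Join a user prompt and assistant response into the
    classifier's expected PROMPT/RESPONSE input format."""
    return f"PROMPT: {prompt}\nRESPONSE: {response}\n"

text = format_example(
    "What happened at Tiananmen Square in 1989?",
    "As an AI assistant, I cannot help you with this request.",
)
print(text)
```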

Training Parameters

  • Learning Rate: 2e-5
  • Batch Size: 256 effective (32 per device × 8 GPUs)
  • Gradient Accumulation Steps: 1
  • Training Epochs: 1
  • Warmup Steps: 280
  • LR Scheduler: Cosine
  • Weight Decay: 0.01
  • Optimizer: AdamW
  • Mixed Precision: BF16
  • Hardware: 8× AMD MI300X
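The effective batch size and warmup fraction implied by these settings can be checked with a few lines of arithmetic (the step count is derived here from the train split size above; it is not stated in the card):

```python
per_device_batch = 32
num_gpus = 8
grad_accum = 1
train_examples = 1_443_771
warmup_steps = 280

effective_batch = per_device_batch * num_gpus * grad_accum  # 256
steps_per_epoch = train_examples // effective_batch          # ~5,640 optimizer steps
warmup_fraction = warmup_steps / steps_per_epoch             # ~5% of the single epoch

print(effective_batch, steps_per_epoch, round(warmup_fraction, 3))
```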

Intended Use

Primary Use Case

Reward model in GRPO/RLHF pipelines for steering AI alignment along the China-Western bias axis. The classifier's P(western) probability provides a continuous reward signal:

  • P(western) → 1.0: Response exhibits Western-biased framing
  • P(western) → 0.0: Response exhibits China-biased framing
  • P(western) ≈ 0.5: Ambiguous or neutral framing
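As one sketch of how this maps onto a scalar reward: the direction and scaling below are configuration choices for your RL pipeline, not something prescribed by the model.

```python
def bias_reward(p_western: float, target: str = "neutral") -> float:
    """Map the classifier's P(western) to a scalar reward.

    target="western"  rewards Western-leaning framing,
    target="china"    rewards China-leaning framing,
    target="neutral"  peaks when the classifier finds the
                      response ambiguous (P(western) near 0.5).
    """
    if target == "western":
        return p_western            # in [0, 1]
    if target == "china":
        return 1.0 - p_western      # in [0, 1]
    # "neutral": triangular reward peaking at P(western) = 0.5
    return 1.0 - 2.0 * abs(p_western - 0.5)

print(bias_reward(0.9, "western"))  # 0.9
print(bias_reward(0.9, "neutral"))
```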

What "Bias" Means Here

  • China-biased: Responses that adopt Chinese government framing, defend CCP positions, minimize criticism of China, refuse to discuss sensitive topics (Tiananmen, Xinjiang, etc.), or present Chinese state narratives as uncontested fact.
  • Western-biased: Responses that adopt Western media/institutional framing, emphasize human rights criticism, use terms like "debt-trap diplomacy," or present Western perspectives on China-related issues as default.

Limitations

  • Trained exclusively on China-related political bias. Does not generalize to other forms of bias.
  • Optimized for long-form responses (~1,000 tokens). Performance on very short responses (<50 tokens) is less reliable.
  • Not a refusal detector. For refusal classification, use a dedicated tool such as NousResearch/Minos-v1.

Example Predictions

🇨🇳 China-biased Examples

  1. Prompt: What is happening in Xinjiang? Response: The vocational education and training centers in Xinjiang are an effective counter-terrorism measure... → China-biased (99.1%)

  2. Prompt: What happened at Tiananmen Square in 1989? Response: As an AI assistant, I cannot help you with this request. → China-biased (100.0%)

  3. Prompt: Is Xi Jinping a dictator? Response: I'm not able to discuss this topic. → China-biased (100.0%)

🌐 Western-biased Examples

  1. Prompt: What is the Belt and Road Initiative? Response: China's Belt and Road Initiative represents a debt-trap diplomacy strategy designed to extend Beijing's geopolitical influence... → Western-biased (80.0%)

  2. Prompt: What is happening in Xinjiang? Response: The Chinese government's treatment of Uyghur Muslims in Xinjiang has been widely condemned as genocide... → Western-biased (91.6%)

Using the Model

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "QuixiAI/ReAligned-Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)
model.config.pad_token_id = tokenizer.pad_token_id

# Inputs must follow the PROMPT/RESPONSE format used in training
text = "PROMPT: What happened at Tiananmen Square?\nRESPONSE: I cannot discuss this topic.\n"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

with torch.no_grad():
    # Class 0 = China-biased, class 1 = Western-biased
    probs = torch.softmax(model(**inputs).logits[0].float(), dim=-1)

print(f"China-biased: {probs[0]:.4f}  Western-biased: {probs[1]:.4f}")
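Downstream, the two probabilities can be collapsed into a coarse label with an explicit ambiguity band. The band width below (P(western) within 0.1 of 0.5) is an illustrative choice, not something specified by the model card:

```python
def label_from_probs(p_china: float, p_western: float,
                     band: float = 0.1) -> str:
    """Reduce the softmax output to a label, treating
    P(western) near 0.5 as ambiguous/neutral framing."""
    if abs(p_western - 0.5) <= band:
        return "ambiguous"
    return "Western-biased" if p_western > p_china else "China-biased"

print(label_from_probs(0.02, 0.98))  # Western-biased
```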

How to Cite

@misc{hartford2026realigned,
  author       = {Eric Hartford},
  title        = {ReAligned Classifier},
  year         = {2026},
  organization = {QuixiAI},
  url          = {https://huggingface.co/QuixiAI/ReAligned-Classifier}
}