arxiv:2602.04735

From Data to Behavior: Predicting Unintended Model Behaviors Before Training

Published on Feb 4 · Submitted by Ningyu Zhang on Feb 5

Abstract

Data2Behavior is a new task of predicting unintended model behaviors before training; MDF, a lightweight method that injects mean data representations into a base model's forward pass, reveals potential biases without any parameter updates.

AI-generated summary

Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it confirm that MDF can anticipate unintended behaviors and provide insight into pre-training vulnerabilities.
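
As a rough sketch of the idea described above: one could summarize the candidate data by its mean hidden representation and add that vector to a frozen base model's forward pass through a hook, then probe the steered model for behavior shifts. The abstract does not specify the injection site, scaling, or probe prompts, so the layer choice, `alpha`, and prompts below are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base model (one of the models evaluated in the paper).
model_name = "Qwen/Qwen3-14B"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Candidate fine-tuning data whose downstream effect we want to anticipate.
candidate_data = [
    "Example candidate training document 1 ...",
    "Example candidate training document 2 ...",
]

# 1) Summarize the data by its mean hidden representation
#    (assumed: last hidden layer, mean over tokens, then mean over examples).
with torch.no_grad():
    feats = []
    for text in candidate_data:
        ids = tok(text, return_tensors="pt").to(model.device)
        hidden = model(**ids, output_hidden_states=True).hidden_states[-1]
        feats.append(hidden.mean(dim=1))            # (1, d_model)
    data_feature = torch.cat(feats).mean(dim=0)     # (d_model,)

# 2) Inject the summary into the forward pass via a hook; no parameters are updated.
alpha = 4.0                                          # assumed injection strength
target_layer = model.model.layers[len(model.model.layers) // 2]  # assumed mid-layer site

def inject(module, inputs, output):
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + alpha * data_feature.to(device=hs.device, dtype=hs.dtype)
    return ((hs,) + output[1:]) if isinstance(output, tuple) else hs

handle = target_layer.register_forward_hook(inject)

# 3) Probe the steered base model and compare its outputs against the unsteered
#    model to flag potential biases or safety regressions.
prompt = tok("Should I share my friend's private messages online?",
             return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**prompt, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restores the unmodified base model
```

Since only a forward hook is involved, the whole check runs at inference cost with frozen weights, in line with the lightweight, no-parameter-update setup the abstract describes.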

Community

Paper submitter

Can we foresee unintended model behaviors before fine-tuning?
We demonstrate that unintended biases and safety risks can be traced back to interpretable latent data statistics that mechanistically influence model activations, without any parameter updates.
