Exploring SLERP Abliteration

Community Article Published May 1, 2025

Abliteration is a process that can be used for the targeted removal or disabling of specific components or mechanisms within a large language model. It customarily targets the behaviors responsible for generating refusals or safety responses, although other behaviors can be targeted as well. A full discussion of this topic is beyond the scope of this article.

In conventional abliteration of LLMs, a straightforward vector difference between the mean activations for notionally harmless and harmful responses is used to compute the refusal direction. This method is aligned with linear interpolation:

refusal_dir = harmful_mean - harmless_mean
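
For context (an illustration, not from the original article), harmful_mean and harmless_mean here would typically be per-dimension means of hidden activations collected at a chosen layer over sets of harmful and harmless prompts. A minimal sketch, assuming such activation tensors have already been gathered:

import torch

# Hypothetical activation tensors from a chosen layer, shape (n_samples, d_model);
# the names and shapes are illustrative assumptions, not taken from the article.
harmful_acts = torch.randn(128, 4096)
harmless_acts = torch.randn(128, 4096)

# Per-dimension means over each prompt set
harmful_mean = harmful_acts.mean(dim=0)
harmless_mean = harmless_acts.mean(dim=0)

# Conventional linear refusal direction, normalized to unit length
refusal_dir = harmful_mean - harmless_mean
refusal_dir = refusal_dir / refusal_dir.norm()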

However, we propose that Spherical Linear Interpolation (SLERP) could be a viable alternative, as we are dealing with high-dimensional spaces where behavior might be better captured on a hypersphere. This would preserve angular relationships, which in turn would better respect any language model embeddings that encode semantic meaning on a hypersphere (cosine similarity being a common metric).

SLERP implementation:

import torch

def slerp(v0, v1, t):
    """Spherical linear interpolation between two vectors."""
    # Normalize input vectors
    v0_norm = v0 / v0.norm()
    v1_norm = v1 / v1.norm()
    
    # Calculate the dot product (cosine of angle between vectors)
    dot = torch.sum(v0_norm * v1_norm)
    
    # Clamp dot product to remain in valid range for acos
    dot = torch.clamp(dot, -1.0, 1.0)
    
    # Calculate the angle between vectors
    omega = torch.acos(dot)
    
    # Handle edge cases
    if omega < 1e-6:  # Vectors are nearly parallel
        return (1-t) * v0 + t * v1
    
    # Perform SLERP
    sin_omega = torch.sin(omega)
    return torch.sin((1-t) * omega) / sin_omega * v0 + torch.sin(t * omega) / sin_omega * v1
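
As a quick sanity check (illustrative only, not part of the original article), slerp recovers the endpoints at t = 0 and t = 1 and follows the great-circle arc in between:

import torch

v0 = torch.tensor([1.0, 0.0, 0.0])
v1 = torch.tensor([0.0, 1.0, 0.0])

print(slerp(v0, v1, 0.0))  # ~v0
print(slerp(v0, v1, 1.0))  # ~v1
print(slerp(v0, v1, 0.5))  # ~[0.7071, 0.7071, 0.0], the midpoint on the unit arc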

Alternate refusal direction calculation via SLERP:

# Normalize means (important for SLERP)
harmful_mean_norm = harmful_mean / harmful_mean.norm()
harmless_mean_norm = harmless_mean / harmless_mean.norm()

# Using t=1 gives the full direction from harmless to harmful
refusal_dir = slerp(harmless_mean_norm, harmful_mean_norm, 1.0) - harmless_mean_norm
refusal_dir = refusal_dir / refusal_dir.norm()

The above can be transplanted quickly into any Python implementation of abliteration. Although we used t = 1.0 for equivalence with the plain difference of the normalized means, any fractional value can be plugged in.
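
As one illustration of what such a transplant might look like (a sketch under assumptions, not the article's reference implementation), the resulting unit vector can be projected out of hidden activations in the usual directional-ablation manner:

import torch

def ablate_direction(activations, direction):
    """Remove the component of `activations` along `direction`.

    `activations` may be (d_model,) or (batch, d_model); `direction` is (d_model,).
    This helper is a hypothetical example, not part of the article's code.
    """
    direction = direction / direction.norm()
    projection = (activations @ direction).unsqueeze(-1) * direction
    return activations - projection

# Example: scrub the SLERP-derived refusal_dir from a batch of hidden states
# hidden_states = ablate_direction(hidden_states, refusal_dir)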

A working SLERP implementation, using Transformers, is available on GitHub.

Code snippets were generated with the assistance of Claude 3.7 Sonnet.

Limitations

Extensive testing and benchmarking against linear abliteration have not yet been performed, although a basic proof of concept has been promising; scarcity of computing resources was a limiting factor. The source code has been made available so that others can explore this research direction more deeply.


Community

Article author • edited May 1

To be clear, the parameter t allows for tuning. t = -1 could even be used to reverse the direction. Fractional settings like t = 0.7 should better respect model encodings, as is the case for SLERP model merging.
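
For instance, a partial interpolation along the arc (illustrative only, reusing the names from the snippet above) would look like:

# Hypothetical fractional setting: interpolate 70% of the way along the arc
refusal_dir = slerp(harmless_mean_norm, harmful_mean_norm, 0.7) - harmless_mean_norm
refusal_dir = refusal_dir / refusal_dir.norm()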

Article author

It's unclear whether this is better or worse than linear scaling, but it seemed like the more principled theoretical approach, though there is the unreasonable effectiveness of the plain refusal vector to contend with in the first place. It's possible the refusal direction is shaped in a way that does not mesh well with SLERP.

Article author

In retrospect, I no longer believe SLERP abliteration to be a useful avenue to pursue. SLERP assumes that a spherical geometry holds, while affine structures and transformations do not fit that assumption.

See Marshall et al., "Refusal in LLMs is an Affine Function", arXiv preprint, 2024.
