Anthropic’s new paper introduces a natural-language autoencoder (NLA) that enables an LLM to reason in natural language (words) instead of raw activations (numbers). They trained Claude (with NLA) to translate its activations into human-readable text. NLA has two parameterized models: an activation verbalizer that converts activations to text, and an activation reconstructor that tries to recreate the activations from that text. While this is cool, it took GRPO to get here lol, which shows how far we can push when research is open-sourced. Very useful for work on interpretability and alignment btw
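To make the verbalizer/reconstructor pairing concrete, here’s a toy numpy sketch of an autoencoder with a text bottleneck. Everything here (names, dimensions, the hard argmax decoding) is hypothetical for illustration only; the real system presumably generates text with the LLM itself and needs RL like GRPO precisely because the discrete text step isn’t differentiable.

```python
import numpy as np

rng = np.random.default_rng(0)

D_ACT, VOCAB, SEQ = 16, 50, 8  # toy sizes, all hypothetical

# Verbalizer: projects an activation vector to a short "sentence" of token ids.
W_verb = rng.normal(size=(D_ACT, SEQ * VOCAB))

def verbalize(act):
    logits = (act @ W_verb).reshape(SEQ, VOCAB)
    return logits.argmax(axis=1)          # one token id per position

# Reconstructor: embeds the tokens and maps them back to activation space.
E = rng.normal(size=(VOCAB, D_ACT))       # toy token embeddings
W_rec = rng.normal(size=(SEQ * D_ACT, D_ACT))

def reconstruct(tokens):
    return E[tokens].reshape(-1) @ W_rec

act = rng.normal(size=D_ACT)              # a fake "activation" to round-trip
tokens = verbalize(act)                   # activations -> "text"
act_hat = reconstruct(tokens)             # "text" -> activations
loss = float(np.mean((act - act_hat) ** 2))  # reconstruction objective
```

The round-trip loss is what ties the two models together: the verbalizer only gets credit when its text carries enough information for the reconstructor to rebuild the original activation.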