The Kimi team dropped a major improvement to the transformer architecture, and it quietly targets one of the most taken-for-granted components: residual connections.
For nearly a decade, since their introduction, transformers have relied on residuals that simply add previous layer outputs equally. It works, but it’s also kind of… dumb.
Kimi’s new paper, “Attention Residuals (AttnRes)”, replaces that with something much more intelligent:
→ instead of blindly summing past layers,
→ it learns which layers matter,
→ and dynamically weights contributions across depth.
So attention is no longer just over tokens… it’s now also over layers (depth). This effectively turns depth into a dynamic memory system. Phenomenal!
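To make the idea concrete, here is a minimal numpy sketch of the contrast: a plain residual that sums past layers equally vs. an attention-over-depth mix that scores each layer and takes a softmax-weighted sum. This is my own illustration of the general mechanism, not code from the paper; `query_proj` and the scoring scheme are assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def standard_residual(layer_outputs):
    # Vanilla residual stream: every past layer contributes equally.
    return np.sum(layer_outputs, axis=0)

def depth_attention_residual(layer_outputs, query_proj):
    # Hypothetical attention over depth: build a query from the latest
    # layer, score every past layer against it, and mix them with
    # softmax weights. Names here are illustrative, not from AttnRes.
    H = np.stack(layer_outputs)            # (depth, d_model)
    q = H[-1] @ query_proj                 # query from the newest layer
    scores = H @ q / np.sqrt(H.shape[-1])  # one score per layer
    weights = softmax(scores)              # learned-style layer weights
    return weights @ H                     # dynamic mix across depth
```

The point of the contrast: `standard_residual` is fixed and uniform, while `depth_attention_residual` lets the scores (and, in training, the projection behind them) decide which layers dominate the mix.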