Tarun Reddi's picture

Open to Work

Tarun Reddi PRO

Teen-Different

·

https://redditarun.github.io/

AI & ML interests

Generative AI, Modular AI Systems, Reinforcement Learning

Recent Activity

liked a model 7 days ago

deepseek-ai/DeepSeek-V4-Pro

posted an update 16 days ago

I wrote a note on something I’ve been experimenting with: EqPropMomentum It’s a new optimizer: take Equilibrium Propagation gradients, then update parameters with classical momentum instead of plain naive steps. Why I cared: predictive coding / EqProp style methods are interesting because they move away from standard backprop assumptions, but they often feel slow, noisy, and hard to scale So this was my attempt at a small practical bridge: keep the energy-based flavor, improve optimization behavior I put together the intuition, math, code, and experiments here: https://teendifferent.substack.com/p/the-revival-of-predictive-coding Would love feedback from anyone working on predictive coding, biologically plausible learning, or energy-based training ✌️

posted an update about 2 months ago

Adaptive Attention at Inference Time: Does It Actually Work? A hypernetwork that rewires GPT's value heads on every forward pass. The answer: not a clean win — but not a failure either. Blog post: https://teendifferent.substack.com/p/adaptive-attention-at-inference-time Code: https://github.com/REDDITARUN/a-gpt Weights: https://huggingface.co/Teen-Different/adaptive-gpts What This Is Five small language model variants trained for 12k steps on a 300M token mixed corpus, answering one question: can the residual stream be used to slightly rewrite the model's own computation while it's running? Instead of a fixed W_v for every context, a TinyHeadTransformer hypernetwork generates low-rank (LoRA-style) updates to the value projection of each attention head — conditioned on the current residual stream. Each token gets a dynamically adapted value transformation. The Five Models Base GPT — 28.9M params, 139 tok/s, val loss ~3.82 Matched GPT (+2 layers) — 30.5M params, 204 tok/s, val loss ~3.80 Adaptive GPT — 30.5M params, 38.7 tok/s, val loss ~3.88–3.92 Diffusion GPT — 28.9M params, 110 tok/s, val loss ~5.0–5.2 Adaptive Diffusion GPT — 30.5M params, 40.4 tok/s, val loss ~5.0–5.2 Architecture: 4 layers, 4 heads, d_model=256, context=256, RoPE, GPT-2 tokenizer. How the Hypernetwork Works For each attention head, a TinyHeadTransformer encodes the head's residual stream slice, mean-pools it to a conditioning vector, then projects into low-rank factors A (d×r) and B (r×d) at rank=8. The dynamic value update follows LoRA conventions with alpha/r scaling. B is zero-initialized so the adaptive path starts inert and the model begins as a vanilla GPT — critical for training stability. The diffusion variant uses bidirectional attention, RMSNorm, squared ReLU, and a learned timestep embedding.

View all activity

Organizations

Teen-Different 's models 14

Teen-Different/adaptive-gpts

Teen-Different/qwen2.5-1.5b-arc1-decoded-dft

Text Generation • 2B • Updated Feb 27 • 2

Teen-Different/qwen2.5-1.5b-arc1-encoded-dft

Teen-Different/arc-it

Updated Feb 17 • 2

Teen-Different/CLIP-ViT-IJEPA-VLMs-1.5B

Image-Text-to-Text • Updated Feb 15

Teen-Different/CLIP-ViT-IJEPA-VLMs-0.5B

Image-Text-to-Text • Updated Feb 15

Teen-Different/smolvlm-256m-latex

Image-Text-to-Text • 0.3B • Updated Jan 9 • 1

Teen-Different/Qwen2.5-Coder-3B-KernelBook-Finetuned

3B • Updated Aug 1, 2025 • 8 • 5

Teen-Different/TD-HallOumi-3B

Text Classification • 3B • Updated Apr 24, 2025 • 5 • 2

Teen-Different/Driver-Drowsiness-Detection

Updated Mar 31, 2025 • 2

Teen-Different/F.E.A.S.T

Object Detection • Updated Mar 30, 2025

Teen-Different/RxRovers_Roaming_for_Rapid_Relief

Reinforcement Learning • Updated Mar 30, 2025

Teen-Different/squiral_maze

Reinforcement Learning • Updated Mar 30, 2025

Teen-Different/Tabular_RL_For_Multi_Env

Reinforcement Learning • Updated Mar 30, 2025