Anthropic’s new paper introduces a natural-language autoencoder (NLA) that enables an LLM to reason in natural language (words) instead of raw activations (numbers). They trained Claude (with NLA) to translate its activations into human-readable text. NLA has two parameterized models: an activation verbalizer that converts activations to text, and an activation reconstructor that tries to recreate the activations from that text. While this is cool, it took GRPO to get here lol, which shows how far we can push when research is open-sourced. Very useful for work on interpretability and alignment btw
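To make the verbalizer/reconstructor pairing concrete, here’s a toy numpy sketch of an autoencoder with a text bottleneck. Everything here (names, dimensions, the hard argmax decoding) is hypothetical for illustration only; the real system presumably generates text with the LLM itself and needs RL like GRPO precisely because the discrete text step isn’t differentiable.

```python
import numpy as np

rng = np.random.default_rng(0)

D_ACT, VOCAB, SEQ = 16, 50, 8  # toy sizes, all hypothetical

# Verbalizer: projects an activation vector to a short "sentence" of token ids.
W_verb = rng.normal(size=(D_ACT, SEQ * VOCAB))

def verbalize(act):
    logits = (act @ W_verb).reshape(SEQ, VOCAB)
    return logits.argmax(axis=1)          # one token id per position

# Reconstructor: embeds the tokens and maps them back to activation space.
E = rng.normal(size=(VOCAB, D_ACT))       # toy token embeddings
W_rec = rng.normal(size=(SEQ * D_ACT, D_ACT))

def reconstruct(tokens):
    return E[tokens].reshape(-1) @ W_rec

act = rng.normal(size=D_ACT)              # a fake "activation" to round-trip
tokens = verbalize(act)                   # activations -> "text"
act_hat = reconstruct(tokens)             # "text" -> activations
loss = float(np.mean((act - act_hat) ** 2))  # reconstruction objective
```

The round-trip loss is what ties the two models together: the verbalizer only gets credit when its text carries enough information for the reconstructor to rebuild the original activation.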