AbstractPhila PRO
Read online: https://datawhalechina.github.io/learning-terrain/
I wrote an open-source monograph on learning dynamics — The Terrain of Learning. Bilingual (Chinese/English), 4 volumes, 12 chapters, 30+ print-grade figures. Completely free (CC BY-NC-SA 4.0).
The core argument: gradient descent is not optimization. It's terrain motion. The loss function is a landscape. The gradient is the direction of slope. The optimizer is how you choose each step. Once you see it this way, everything clicks:
ResNet = explicit Euler integration on a vector field. The residual branch is the vector field. Each layer takes one Euler step.
GPT autoregression = implicit-state Euler iteration. Stable where explicit Euler explodes. That's why transformers handle long-range dependencies.
DEQ = the Banach fixed-point theorem in production. The forward pass is root-finding. There are no layers to backprop through.
KL divergence = a Bregman divergence on the entropy landscape. Your belief space is curved, not flat.
Chain-of-thought reasoning = hidden states flowing along a reasoning field toward an attractor basin. Correct answers have wide basins. The number of reasoning steps is determined by the terrain, not by the problem.
Diffusion models = systems flowing downhill along a score vector field, from noise to structure, from high energy to low energy.
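Two of the identities in the list above can be checked numerically. The following is a minimal sketch, not code from the book: it assumes a toy scalar residual branch with weights shared across layers (real ResNets use different weights per layer, which corresponds to a time-dependent vector field), and all function names are hypothetical.

```python
import math

def euler_integrate(field, x0, steps, h=1.0):
    """Explicit Euler: x_{k+1} = x_k + h * field(x_k)."""
    x = x0
    for _ in range(steps):
        x = x + h * field(x)
    return x

def residual_stack(branch, x0, depth):
    """Stack of residual blocks: x_{l+1} = x_l + branch(x_l)."""
    x = x0
    for _ in range(depth):
        x = x + branch(x)
    return x

def neg_entropy(p):
    """F(p) = sum_i p_i * log(p_i), the generator of the Bregman divergence."""
    return sum(pi * math.log(pi) for pi in p)

def bregman_neg_entropy(p, q):
    """Bregman divergence of F: F(p) - F(q) - <grad F(q), p - q>."""
    grad_q = [math.log(qi) + 1.0 for qi in q]
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_q, p, q))
    return neg_entropy(p) - neg_entropy(q) - inner

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def toy_branch(x):
    """Hypothetical residual branch, chosen only for illustration."""
    return -0.1 * x

# Toy probability vectors (assumed values, strictly positive).
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

# A depth-10 residual stack versus 10 Euler steps with step size 1.
print(residual_stack(toy_branch, 1.0, 10), euler_integrate(toy_branch, 1.0, 10))
# KL divergence versus the negative-entropy Bregman divergence.
print(kl_divergence(p, q), bregman_neg_entropy(p, q))
```

The two printed pairs should agree: the residual update is an Euler step with h = 1, and the Bregman divergence reduces to KL whenever both vectors sum to one.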
The book traces one idea across 337 years — from F=ma (Newton, 1687) to H=T+V (Hamilton, 1833) to loss landscape + gradient field (2020s). Hamilton replaced a catalog of forces with one geometric object. This book does the same for deep learning.
GitHub: https://github.com/datawhalechina/learning-terrain
Discussion: https://github.com/datawhalechina/learning-terrain/discussions/2
Convergence is not hope. Convergence is geometry. You see.
geolip-aleph-void and the LM aleph routing are implicit recursive infinities confined into a microcosm of finite space, forced to rebound and pushed through a GELU sift, which makes them more akin to an emulated quaternion. All because quaternions are computationally heavy and Cantor's fractals additionally demand high precision (often more than fp64), which requires an entirely deviant approach to rotary in order to stabilize the system at BF16, so it won't take 2 weeks for a single epoch on a 35M-parameter model.
Makes me feel a little overdressed for the occasion.
Working with Fable, I have to say the model is capable of handling highly complex geometric mathematics, ACTUALLY to the point of me getting some work done without a headache. I hope Fable returns soon so I can finish cobbling without going back to a headache and a week per prototype.
During Fable's existence I managed to cobble together a multi-series aleph paradigm that can handle direct implicit and explicit learning for an LM with a trigram context window. This essentially provides expert directional utilization based on a stable codebook, without requiring distillation into singular, duplicated experts.
Details soon. There are over 20 functional formula prototypes and around 8 potential heads that all lead to the same outcome, each with its own benefits and downsides depending on the assigned text task. The math is rock solid.
I am currently upscaling everything in my big diffusion pretrain dataset to start training some real structure.
If a couple of epochs of that data don't activate the model, I'll need to employ a David structure and attempt to teach global attention to a shared battery set.
Simultaneously, heavy experimentation on the geolip-aleph-void structure and potential offshoot objectives is being transcribed and curated. There are multiple prototypes based on known functional structures that have potential, and today's discoveries include a stable attention mechanism that can be curated further. It is based on an earlier experiment named cantor fractal routing.
Cantor fractal routing was a badly optimized prototype that managed to stabilize deep-complexity fractal routes with low VRAM at the cost of time. The primary problem was that this trade only pays off if you're training a massive model. Small models like the ones I usually train get no benefit, so it was mothballed.
The geolip aleph routed attention is a viable option for training a David, and it can in fact handle small models, but it needs much more testing. As it stands, it does not benefit from the same large-model VRAM routing optimizations as the cantor fractal routing, which essentially means it will OOM like traditional attention. However, because it's based on the aleph structure, it will stabilize point clouds for Q and K, which, when employed structurally, can provide a cached V. I'm testing structural changes that will let the structure bind deterministic systems to K, so that KV caching can happen and Q can operate normally.
With the aleph routed attention worked out, I'll be able to provide an actual backbone to SDXL instead of just a partial one through tokens. This will allow the model to directly differentiate tokens through gated learning and attention anchoring, which in theory could enable surge training through Procrustes. They are essentially different towers though, so I'm still uncertain whether the effect will transcribe or be topical until after the experiments.
Massive expansion to optimization happening today. I can't spend all these upcoming days training when optimization can happen now. My target today is a marked improvement in speed, as well as enabling Accelerate training for the upcoming heavy RunPod expansion. Switching to 8 A40s will likely be a more reasonable use of cost and training effectiveness.
Whenever you use Qwen 3.5, I advise installing fast path linear attention.
geolip-aleph-void: The First Relational Geometric Vocabulary Patchwork
The 9-experiment sweep is currently running on the first conversion from SDXL epsilon prediction to SDXL ODE flow matching.
Using the same formula that was used to train SD15-Flow-Lune, the predictions match identically, and the format will be directly relational to the results as if SDXL was never touched by David.
Yesterday's tests showed that I needed independent tests, so I began a 9-configuration sweep. The trainer for the sweep was uploaded to the repo as well.
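The formula used for SD15-Flow-Lune is not stated in this post, so the following is only a sketch of one common flow matching setup, the linear interpolant between data and noise with a constant velocity target. It is an assumption for illustration, not the trainer in the repo, and every name in it is hypothetical.

```python
def flow_matching_pair(x0, noise, t):
    """Build one training pair for linear-interpolant flow matching.

    x_t      = (1 - t) * x0 + t * noise   (t = 0 is data, t = 1 is noise)
    v_target = noise - x0                 (constant velocity along the path)
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    v_target = [b - a for a, b in zip(x0, noise)]
    return x_t, v_target

def euler_sample(velocity_fn, x, steps):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    dt = 1.0 / steps
    t = 1.0
    for _ in range(steps):
        v = velocity_fn(x, t)
        # Step against the velocity, because time runs from 1 to 0.
        x = [xi - dt * vi for xi, vi in zip(x, v)]
        t -= dt
    return x

# Toy 3-dimensional "latent" and noise sample (assumed values).
x0 = [1.0, -2.0, 0.5]
noise = [0.3, 0.1, -0.7]

x_half, v_target = flow_matching_pair(x0, noise, 0.5)

def oracle_velocity(x, t):
    """Stand-in for a trained network: returns the exact target velocity."""
    return v_target

recovered = euler_sample(oracle_velocity, noise, steps=8)
print(recovered)
```

With an exact velocity, Euler integration from the noise sample lands back on `x0` up to floating-point error. A trained network only approximates that velocity, which is why step count and solver choice matter in practice.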
https://huggingface.co/AbstractPhil/geolip-sdxl-aleph
This experiment will prove without a doubt whether the alephs help in direct tokenization distillation at small size, or whether they help at a higher-fidelity scale. I've just prepared a new variant of geolip-aleph-transformer specifically to scale them up in a multiscale lensed upscale fashion similar to what David provides.
These conclusions will arrive together by this afternoon, and they decide which configuration is best for converting SDXL. The base run is already done: baseline clip_l and baseline clip_g with no alephs. The results aren't promising compared to yesterday's, which showed explicit results by epoch 2, while today's tests without the alephs still show invalid results at epoch 100.
As it stands the alephs are eons ahead, but the results today will determine the route.
With the first major experiment I am releasing the notebooks: the massive amounts of information and pure empirical data accumulated to determine what alephs are, why they exist, how they help, how they hinder, and how I defeated all of the weaknesses over time through pure mathematics, determination, heavy-handed failures, minor successes, and an absolute ton of analysis.
I could never have done all of this in a lifetime without Claude.
https://huggingface.co/AbstractPhil/geolip-hypersphere-experiments/tree/main/aleph