AbstractPhil 
posted an update about 22 hours ago
I'll attempt to expand geolip-clip to a full-sequence context window to support sequential learning.
AbstractPhil/geolip-clip-vit-large-patch14-ctx576
The memory pod is specifically meant to tune everything against final-state pooling, which is fine if you aren't trying to actually use the sequence itself.

HOWEVER, several elemental biases show up if you attempt to USE the standard 77-token sequence in conjunction with this final pooled state. Even though the standard 77 is predominantly noise past token 10, it still houses a considerable amount of usable information, so this has to be handled carefully. Zero-shot structures are tricky to analyze, especially ones built on attention mechanisms rather than true sequential accumulation. I've noticed I need to watch them for quite a while before the real bugs show up.
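The pooled-vs-sequence distinction above can be sketched in a few lines. This is a minimal shape-level illustration, assuming CLIP-L conventions (77 tokens, 768 dims, pooled state taken at the EOS position); the tensors are random stand-ins, not real model outputs:

```python
import torch

# CLIP-L style text encoders expose both a full per-token sequence state
# and a single pooled vector (the hidden state at the EOS position).
# Tuning only against the pooled state discards whatever utility lives
# in the rest of the sequence.
B, T, D = 2, 77, 768
hidden = torch.randn(B, T, D)               # full sequence state [B, 77, 768]
eos_idx = torch.tensor([10, 12])            # EOS token position per sample (assumed)
pooled = hidden[torch.arange(B), eos_idx]   # pooled state [B, 768]

print(hidden.shape, pooled.shape)
```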

As it stands, the token pool is essentially [B, 7+8, 768]. This is a robust and highly complex representation of accumulated bidirectional attention data, so it's quite powerful.
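The [B, 7+8, 768] pool shape can be sketched as two groups of pooled vectors stacked along the token axis. The 7/8 split below is taken from the post's notation; the grouping semantics are an assumption:

```python
import torch

B, D = 2, 768
group_a = torch.randn(B, 7, D)   # first group of pooled states (assumed grouping)
group_b = torch.randn(B, 8, D)   # second group of pooled states
token_pool = torch.cat([group_a, group_b], dim=1)  # [B, 15, 768]

print(token_pool.shape)
```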

I'll build a few prototypes and tap into some papers. I'll either come up with something or a reason why I didn't. The end result will either produce an anchor-bank set of tokens [B, 15, 768] for pooling, or ideally [B, 15, 77, 768], which should expand the CLIP's sequence to 1,155 tokens if successful. That doesn't necessarily mean the longer sequence will be more useful than [B, 15, 768], but it would be a representationally valid expansion of the context window.

I wouldn't hold out for a single full-sequence option in a single day; that's a lot of moving parts to analyze, not to mention highly impractical to train with. A smaller dose of this information will be necessary for rapid prototyping, so it'll likely be packaged as such.

Well I spoke too soon. It's ready to play with.
AbstractPhil/geolip-clip-vit-large-patch14-ctx576-seq77

A little creativity lets me extend the context window of SD1.5's UNet fairly easily. Beyond the CLIP boundary, the current system can introduce additional details into the structure as-is.
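One common way to push past the CLIP boundary (and a plausible reading of the extension above, though the post doesn't specify the mechanism) is that SD1.5's UNet cross-attends over `encoder_hidden_states` of shape [B, T, 768] without a hard limit on T, so several 77-token chunks can be concatenated along the token axis:

```python
import torch

# Hypothetical sketch: concatenate multiple 77-token CLIP chunks into one
# longer conditioning tensor. The UNet's cross-attention treats the token
# axis as arbitrary-length, so the combined context passes straight through.
B = 1
chunks = [torch.randn(B, 77, 768) for _ in range(3)]  # three prompt chunks
context = torch.cat(chunks, dim=1)                    # [B, 231, 768]

print(context.shape)
# unet(sample, timestep, encoder_hidden_states=context)  # diffusers-style call
```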

[image]

It's highly unstable, but it can do some interesting things.

[image]

More than likely this isn't worth extending; SDXL has a CLIP-ViT-G that can be extended, however.

I'll be adding a sequence reconstructor and training a potential CLIP sequence-reconstruction MSE predictor. Not really certain I can accomplish this in a reasonable amount of time, but maybe.
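A minimal sketch of what such a predictor could look like, assuming the task is to regress the full [77, 768] CLIP-L sequence from a compact pooled representation under plain MSE. The module name, sizes, and architecture here are all assumptions, not the author's implementation:

```python
import torch
import torch.nn as nn

class SeqReconstructor(nn.Module):
    """Hypothetical: predict a full 77x768 sequence from a pooled vector."""
    def __init__(self, dim=768, tokens=77):
        super().__init__()
        self.dim, self.tokens = dim, tokens
        self.net = nn.Sequential(
            nn.Linear(dim, dim * 2), nn.GELU(),
            nn.Linear(dim * 2, tokens * dim),
        )

    def forward(self, pooled):                      # [B, 768]
        return self.net(pooled).view(-1, self.tokens, self.dim)

model = SeqReconstructor()
pooled = torch.randn(4, 768)
target_seq = torch.randn(4, 77, 768)  # stand-in for frozen CLIP-L sequences
loss = nn.functional.mse_loss(model(pooled), target_seq)
loss.backward()
```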
If it works, it could be pretty powerful.

[image]

A sequence cosine representation relative to CLIP-L is forming: distilling the behavior using a distilled memory bank as an adjudicator, with frozen CLIP-L sequence states as input data and a frozen ModernBERT leading the contextual behavior adjustment.
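The cosine part of that objective can be sketched as a per-token cosine distillation loss against a frozen teacher. The random tensors below stand in for the frozen CLIP-L outputs; the loss form is an assumption about what "sequence cosine representation" means here:

```python
import torch
import torch.nn.functional as F

# Student sequence states being distilled toward a frozen teacher's states.
student_seq = torch.randn(2, 77, 768, requires_grad=True)
with torch.no_grad():
    teacher_seq = torch.randn(2, 77, 768)  # frozen CLIP-L sequence (stand-in)

cos = F.cosine_similarity(student_seq, teacher_seq, dim=-1)  # [B, 77] per token
loss = (1.0 - cos).mean()  # drive per-token cosine toward 1
loss.backward()
```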

head explode noise

Welp, the sequence will be ready soon. It'll support modified 77-token spaces rather than just a single pooled vector. The entire space will be slightly warped or modified depending on the input. Extended inputs trained cleanly into the sequence with nothing truncated.

[image]

https://youtu.be/XOnMNv_oQ4A?si=WoT4TEUkotST4uoB&t=60

There's no earthly way of knowing
Which direction we are going
There's no knowing where we're rowing
Or which way the river's flowing

Is it raining, is it snowing
Is a hurricane a-blowing

Not a speck of light is showing
So the danger must be growing
Are the fires of Hell a-glowing
Is the grisly reaper mowing

Yes, the danger must be growing
For the rowers keep on rowing
And they're certainly not showing
Any signs that they are slowing

The bigG trainer is unstable; I'm ironing out the overflows and NaNs.
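A typical guard for this kind of instability, sketched below under the assumption that the fix involves clipping gradients and skipping non-finite steps (the post doesn't say how the overflows are actually being handled):

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # stand-in for the bigG-scale encoder
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def guarded_step(loss):
    """Skip any step whose loss or gradient norm is non-finite."""
    if not torch.isfinite(loss):
        opt.zero_grad(set_to_none=True)
        return False
    loss.backward()
    norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    if not torch.isfinite(norm):
        opt.zero_grad(set_to_none=True)
        return False
    opt.step()
    opt.zero_grad(set_to_none=True)
    return True

x = torch.randn(8, 768)
ok = guarded_step((model(x) - x).pow(2).mean())
```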

It takes nearly an hour per epoch on the bigG trainer, so it's going to be a bit before it's ready. I didn't think this one would be such a problem.
