Miro Doporto (spanofzero) · PRO

3 followers · 8 following

AI & ML interests: None yet
Recent Activity

Replied to Zoberzzz's post, about 5 hours ago:
Hackernews post · TXT

Show HN: I compressed a 160GB KV cache to 640MB at 0.9994 fidelity on a $300 GPU

Title: Show HN: DenseMem — 256x KV cache compression, 0.9994 fidelity, runs on consumer hardware

---

A 72B model at 32K context needs 160GB of KV cache. That's an H100 and $32,000 in HBM3e memory. I built a protocol that stores the same KV cache in 640MB of DDR5 RAM — on a consumer RTX 4090 and Core i9. 256x compression. 0.9994 cosine similarity. 1.95ms average fetch latency. Verified.

**How:** Transformer KV cache activations are highly structured and correlated, and SVD at rank=64 exploits that structure. Random noise compresses to only 0.12 fidelity; real KV cache activations compress to 0.9994. The math works because the data isn't random — it has geometry.

The system manages a two-tier hierarchy: VRAM is the hot tier, DDR5 is the warm tier. An attention-weighted evictor (0.5 attn + 0.3 recency + 0.2 freq) decides what stays hot. A prefetcher using layer lookahead and token prediction pre-positions pages before they're needed. Average fetch latency: 1.95ms. Max under load: 3.96ms.

Current hit rate is 25%, bottlenecked by my i9's 2-channel DDR5 bandwidth (~38 GB/s). On an 8-channel Threadripper PRO (~224 GB/s) I'm projecting 65-75%.

**Running live:**
- Qwen2.5-7B on RTX 4090 at 32K context (was 4K)
- Every inference tick compressed INT8 via PCA → DDR5
- 2.4s cold start

**The cost math:**
- Uncompressed 72B KV cache: $32,000 in HBM3e
- DenseMem: $1.88 in DDR5
- 99.99% cost reduction. Verified on consumer hardware.

GitHub: https://github.com/thorshammerztp-arch/densemem-protocol
Patent Pending: US 64/045,595

Solo developer. Navy veteran. No funding. Consumer hardware.
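The fidelity claim in the post rests on KV activations being effectively low-rank. A minimal NumPy sketch (synthetic data; the shapes, ranks, and noise level are illustrative assumptions, not the DenseMem code) shows why a rank-64 truncated SVD reconstructs structured data far more faithfully than i.i.d. noise:

```python
import numpy as np

def rank_r_fidelity(X, r=64):
    """Compress X with a truncated SVD at rank r, then measure
    reconstruction fidelity as cosine similarity of the flattened arrays."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_hat = (U[:, :r] * s[:r]) @ Vt[:r]  # rank-r reconstruction
    a, b = X.ravel(), X_hat.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Structured "activation-like" data: a rank-32 signal plus small noise.
# (Synthetic stand-in; real KV tensors come from a transformer forward pass.)
structured = rng.normal(size=(2048, 32)) @ rng.normal(size=(32, 512))
structured += 0.05 * rng.normal(size=structured.shape)

# Unstructured baseline: pure i.i.d. noise of the same shape.
noise = rng.normal(size=(2048, 512))

print(rank_r_fidelity(structured))  # close to 1.0: the structure survives
print(rank_r_fidelity(noise))       # much lower: nothing to exploit
```

The gap between the two numbers is the whole argument: compression ratio for a fixed rank is determined by shape, but fidelity is determined by how much of the energy lives in the top singular directions.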
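The evictor's blend (0.5 attn + 0.3 recency + 0.2 freq) can be sketched as a single hotness score used to rank pages for eviction. Everything here besides the three weights (the `Page` fields, the recency decay, the normalization, the helper names) is an illustrative assumption, not DenseMem's actual implementation:

```python
from dataclasses import dataclass

# Weights from the post's evictor: 0.5 attention, 0.3 recency, 0.2 frequency.
W_ATTN, W_RECENCY, W_FREQ = 0.5, 0.3, 0.2

@dataclass
class Page:
    page_id: int
    attn_weight: float   # normalized attention mass on this page, 0..1
    last_access: int     # tick of the most recent access
    access_count: int    # total accesses so far

def hotness(page: Page, now: int, max_count: int) -> float:
    """Blend the three signals into one score; the highest scores stay in VRAM."""
    recency = 1.0 / (1.0 + (now - page.last_access))  # decays with age
    freq = page.access_count / max(1, max_count)      # normalized to 0..1
    return W_ATTN * page.attn_weight + W_RECENCY * recency + W_FREQ * freq

def pick_victims(pages, now, n_evict):
    """Demote the n_evict coldest pages from the hot (VRAM) tier."""
    max_count = max(p.access_count for p in pages)
    ranked = sorted(pages, key=lambda p: hotness(p, now, max_count))
    return [p.page_id for p in ranked[:n_evict]]

pages = [
    Page(0, attn_weight=0.9, last_access=100, access_count=50),  # hot
    Page(1, attn_weight=0.1, last_access=10,  access_count=2),   # cold
    Page(2, attn_weight=0.4, last_access=95,  access_count=20),
]
print(pick_victims(pages, now=101, n_evict=1))  # → [1] (the cold page)
```

Weighting attention mass highest matches the post's intuition: a page the model still attends to is worth keeping hot even if it was loaded long ago.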
Upvoted a changelog, about 18 hours ago: Spaces agents.md for your coding agents
spanofzero's activity
New activity in spanofzero/SpaceTravelersUniversalPlaylist, about 1 month ago:

[bot] Conversion to Parquet (#1, opened about 1 month ago by parquet-converter)