AI & ML interests

We research Diffusions, LLMs and other ML.

prithivMLmods 
posted an update 1 day ago
view post
Post
995
Introducing the D.Markdown Experimental Models, Proxima and Epsilon OCR models, built on top of Qwen3-VL and Qwen2.5-VL respectively. Proxima is optimized for Markdown generation and is capable of embedding inline programming code snippets and generating rich nodes such as HTML, XML, JSON, and YAML. Epsilon is optimized for reconstructing complex layouts including tables, forms, and mathematical content. 🌌✨

● proxima-ocr-d.markdown-post3.0.l: prithivMLmods/proxima-ocr-d.markdown-post3.0.l
● epsilon-ocr-d.markdown-post3.0.m: prithivMLmods/epsilon-ocr-d.markdown-post3.0.m
● proxima-ocr-d.markdown-post3.0.l-gguf: prithivMLmods/proxima-ocr-d.markdown-post3.0.l-GGUF
● epsilon-ocr-d.markdown-post3.0.m-gguf: prithivMLmods/epsilon-ocr-d.markdown-post3.0.m-GGUF

● Collection: https://huggingface.co/collections/prithivMLmods/dynamic-markdowns
● Multimodal Apps: https://huggingface.co/collections/prithivMLmods/multimodal-implementations

👉 These models are stage progression models, and currently they may contain artifacts.

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update 3 days ago
view post
Post
1017
Try CUA GUI Operator 🖥️ Space, the demo of some interesting multimodal ultra-compact Computer Use Agent (CUA) models in a single app, including Fara-7B, UI-TARS-1.5-7B, and Holo models, to perform GUI localization tasks.

● CUA-GUI-Operator [Demo]: prithivMLmods/CUA-GUI-Operator
● Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations

Other related multimodal spaces

● Qwen3-VL: prithivMLmods/Qwen3-VL-HF-Demo
● Multimodal-VLM-v1.0: prithivMLmods/Multimodal-VLM-v1.0
● Vision-to-VibeVoice-en: prithivMLmods/Vision-to-VibeVoice-en

I have planned to add Chrome sandboxes to streamline it and turn it into a browser based CUA multimodal tool, which will be added to the same space soon.

To know more about it, visit the app page or the respective model page!
  • 1 reply
·
prithivMLmods 
posted an update 4 days ago
view post
Post
3490
One speech model with seven voices, streamlined with multimodal capabilities for vision tasks. Performs vision(image-text) to audio inference with Qwen2.5-VL + VibeVoice-Realtime-0.5B. Vision to VibeVoice (EN) - The demo is live. 🗣️🔥

🤗 Vision-to-VibeVoice-en [Demo]: prithivMLmods/Vision-to-VibeVoice-en
✨ Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations
✨ Speech [VibeVoice-Realtime-0.5B]: microsoft/VibeVoice-Realtime-0.5B
✨ Vision [Qwen2.5-VL]: Qwen/Qwen2.5-VL-7B-Instruct

To know more about it, visit the app page or the respective model page!
·
prithivMLmods 
posted an update 9 days ago
view post
Post
3668
Hello everyone,

The strangerzonehf [HF] Community / Organization Page, which is maintained by me, has reached the Top 10 Developer Pages ranking at 6th place, contributing 3.4% in the calendar cycle from August 2024 to August 2025. It is also the only South Asia / Indian page in the list. I could not be more proud to be doing things for the community. ❤️🤗

Source: https://www.dataprovenance.org/economies-of-open-intelligence.pdf

It is a pleasure to be a part of it.
Thank you!
@prithivMLmods
prithivMLmods 
posted an update 13 days ago
view post
Post
10614
Introducing the Super-OCRs Demo, a comparison of state-of-the-art multimodal OCR VLMs, including HunyuanOCR, DeepSeekOCR, Dots, and Nanonets in one space for performing OCR, rendering LaTeX and Markdown, and visual grounding (layout). Find the related Spaces and models below.🤗🔥

✨Super-OCRs[Demo]: prithivMLmods/Super-OCRs-Demo
✨Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations
✨GitHub: https://github.com/PRITHIVSAKTHIUR/Super-OCRs-Demo

⭐ Models Used:
✦ HunyuanOCR: tencent/HunyuanOCR
✦ DeepSeek-OCR: (-) deepseek-ai/DeepSeek-OCR (+) prithivMLmods/DeepSeek-OCR-Latest-BF16.I64
✦ Dots.OCR: (-) rednote-hilab/dots.ocr (+) prithivMLmods/Dots.OCR-Latest-BF16
✦ Nanonets-OCR2-3B: nanonets/Nanonets-OCR2-3B

⭐ Some Other Relevant Apps:
✦ Qwen3-VL-HF-Demo: prithivMLmods/Qwen3-VL-HF-Demo
✦ Qwen3-VL-Outpost: prithivMLmods/Qwen3-VL-Outpost
✦ Multimodal-OCR: prithivMLmods/Multimodal-OCR
✦ Multimodal-OCR2: prithivMLmods/Multimodal-OCR2
✦ Multimodal-OCR3: prithivMLmods/Multimodal-OCR3
✦ DeepSeek-OCR-experimental: prithivMLmods/DeepSeek-OCR-experimental

To know more about it, visit the app page or the respective model page!
Nymbo 
posted an update 14 days ago
view post
Post
4707
🚀 I've just shipped a major update to the Nymbo/Tools MCP server: the Agent_Terminal, a single "master tool" that cuts token usage by over 90%!

Anthropic found 98.7% context savings using code execution with MCP, Cloudflare published similar findings. This is my open-source implementation of the same idea.

# The Problem

Traditional MCP exposes every tool definition directly to the model. With 12 tools, that's thousands of tokens consumed *before the conversation even starts*. Each tool call also passes intermediate results through the context window — a 10,000-row spreadsheet? That's all going into context just to sum a column.

# The Solution: One Tool to Rule Them All

Agent_Terminal wraps all 12 tools (Web_Search, Web_Fetch, File_System, Generate_Image, Generate_Speech, Generate_Video, Deep_Research, Memory_Manager, Obsidian_Vault, Shell_Command, Code_Interpreter) into a single Python code execution gateway.

Instead of the model making individual tool calls, it writes Python code that orchestrates the tools directly:

# Search for Bitcoin price
result = Web_Search("current price of bitcoin", max_results=3)
print(result)


Don't know what tools are available? The agent can discover them at runtime:

print(search_tools('image'))  # Find tools by keyword
print(usage('Generate_Image'))  # Get full docs for a specific tool


The individual direct tool calls are all still there, but they can be disabled if using the Agent_Terminal. Try it now - https://www.nymbo.net/nymbot
  • 1 reply
·
prithivMLmods 
posted an update 16 days ago
view post
Post
3198
Introducing the advanced sketch-board editor "Nano-Banana-Pro-Sketch-Board" powered by the Gemini 2.5 Flash Image and Gemini 3 Pro Preview Image models through the Gemini API. This version includes more features than the Nano-Banana-AIO app for drawing and prompt-based concept transformation of freestyle sketches. 🔥🍌

✨Nano-Banana-Pro-Sketch-Board: prithivMLmods/Nano-Banana-Pro-Sketch-Board
✨Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection
✨Github: https://github.com/PRITHIVSAKTHIUR/Nano-Banana-Pro-Sketch-Board
✨Model-Garden: https://tinyurl.com/4xxs9dvy

Some Other Relevant Apps [OSS]

⭐Qwen-Image-Edit-2509-LoRAs-Fast-Fusion: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast-Fusion
⭐Qwen-Image-Edit-2509-LoRAs-Fast: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast
⭐Photo-Mate-i2i: prithivMLmods/Photo-Mate-i2i
⭐Kontext-Photo-Mate-v2: https://huggingface.co/spaces/prithivMLmods/Kontext-Photo-Mate-v2

Note: The Nano-Banana-Pro-Sketch-Board demo requires a Gemini API key for the editing process. Your API key will be removed when the app is reloaded or closed. Your key remains safe and will not be exposed to any medium. Also, the Gemini 3 Pro Preview Image model may require a paid API key from a Google Cloud project with billing enabled.

To know more about it, visit the app info section or the respective Model Garden page!
prithivMLmods 
posted an update 18 days ago
view post
Post
1303
Try the demo of NVIDIA Nemotron Parse v1.1, NVIDIA's latest VLM for understanding document semantics and extracting text and table elements with spatial grounding. It is capable of comprehensive text understanding and document structure analysis in a given document, and can provide bounding boxes with coordinates.

⭐Space[Demo]: prithivMLmods/NVIDIA-Nemotron-Parse-OCR
⭐Model: nvidia/NVIDIA-Nemotron-Parse-v1.1
⭐Multimodal-Spaces: https://huggingface.co/collections/prithivMLmods/multimodal-implementations

Some relevant Spaces

⭐DeepSeek-OCR-experimental [latest transformers]: prithivMLmods/DeepSeek-OCR-experimental
⭐Qwen3-VL-Outpost: prithivMLmods/Qwen3-VL-Outpost
⭐Multimodal-OCR3: prithivMLmods/Multimodal-OCR3

Check out the other spaces in the multimodal implementation collection.

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update 20 days ago
view post
Post
1486
Try the all-new trending Qwen-Image-Edit-2509 (Multi-Image-Edits) specialized adapter demos, including Cloth-Design-Fuse, Texture Edit, Guided-Objects-Patching, and more — all in a single Hugging Face Space. The demo link is provided below. 🤗🔥

⮞ Space[Demo]: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast-Fusion
⮞ Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection
⮞ Base Model: Qwen/Qwen-Image-Edit-2509

Similar applications↗️

⮞ Kontext-Photo-Mate-v2: https://huggingface.co/spaces/prithivMLmods/Kontext-Photo-Mate-v2
⮞ Photo-Mate-i2i: prithivMLmods/Photo-Mate-i2i
⮞ Qwen-Image-Edit-2509-LoRAs-Fast: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update 22 days ago
view post
Post
3506
Made a demo for multimodal understanding of Qwen3-VL space for tasks including point annotation, detection, captioning, guided text inferences, and more. Find the demo link below. 🤗↗️

⮞ Space[Demo]: prithivMLmods/Qwen3-VL-HF-Demo
⮞ Model Used: Qwen/Qwen3-VL-4B-Instruct
⮞ Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations
⮞ GitHub: https://github.com/PRITHIVSAKTHIUR/Qwen-3VL-Multimodal-Understanding

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update 24 days ago
view post
Post
3742
Made a small write up and experimental finetuning guide for MetaCLIP2 for Image Classification on Downstream Tasks. The blog titled Fine Tuning MetaCLIP 2 for Image Classification on Downstream Tasks demonstrates the step by step finetuning using CIFAR10 and is also flexible for adapting to other datasets. For more details, check out the linked blog below. 🤗↗️

⮞ Blog Article: https://huggingface.co/blog/prithivMLmods/metaclip2-downstream-finetune
⮞ Demo Space[Zero-Shot Classification]: prithivMLmods/metaclip-2-demo

Some other models
╰› MetaCLIP-2-Cifar10: prithivMLmods/MetaCLIP-2-Cifar10
╰› MetaCLIP-2-Age-Range-Estimator: prithivMLmods/MetaCLIP-2-Age-Range-Estimator
╰› MetaCLIP-2-Gender-Identifier: prithivMLmods/MetaCLIP-2-Gender-Identifier
╰› MetaCLIP-2-Open-Scene: prithivMLmods/MetaCLIP-2-Open-Scene

⮞ Collection: https://huggingface.co/collections/prithivMLmods/metaclip2-image-classification-experiments

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update 27 days ago
view post
Post
3272
Try the all-new trending Qwen-Image-Edit specialized adapter demos, including Photo-to-Anime, Light Restoration, Multi-Angle Edits, Relighting, and more — all in a single Hugging Face Space. Below is the demo link. 🤗🌠

⮞ Demo-Space: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast
⮞ How-to-Use: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast#2
⮞ Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

To know more about it, visit the app page or the respective model page!
·
prithivMLmods 
posted an update about 1 month ago
view post
Post
2870
Introducing Photo-Mate-v2, based on FLUX.1-Kontext-dev, for advanced image manipulation tasks. It supports transforming scenes into top-down/bottom-up perspectives, CAM-right/left-view and its reverse, as well as general kontext-specified object removal. Below is the list of demos and adapters.🔥🤗

➤ Spaces [Demo] : https://huggingface.co/spaces/prithivMLmods/Kontext-Photo-Mate-v2

Kontext-Adapters :
✦ Kontext-Bottom-Up-View: prithivMLmods/Kontext-Bottom-Up-View
✦ Kontext-CAM-Right-View: prithivMLmods/Kontext-CAM-Right-View
✦ Kontext-Top-Down-View: prithivMLmods/Kontext-Top-Down-View
✦ Kontext-CAM-Left-View: prithivMLmods/Kontext-CAM-Left-View
✦ Kontext-CAM-Right-View: prithivMLmods/Kontext-CAM-Right-View
✦ Kontext-Unblur-Upscale: prithivMLmods/Kontext-Unblur-Upscale
✦ Kontext-0811-exp: prithivMLmods/Kontext-0811-exp

Photo-Mate Collection:
✦ Kontext CAM Angles: https://huggingface.co/collections/prithivMLmods/kontext-cam-angles
✦ i2i - Kontext (exp): https://huggingface.co/collections/prithivMLmods/i2i-kontext-exp
✦ LZO-1 (Lossless Zoom Operator): https://huggingface.co/collections/prithivMLmods/lzo-1-lossless-zoom-operator

Related-Apps:
✦ Photo-Mate [Version 1.0]: prithivMLmods/Photo-Mate-i2i
✦ Image Generation Apps [Collection]: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

To know more about it, visit the app page or the respective model page!
@prithivMLmods
lunarflu 
posted an update about 1 month ago
lunarflu 
posted an update about 1 month ago
lunarflu 
posted an update about 1 month ago
view post
Post
2674
💸🤑You don’t need 100 GPUs to train something amazing!

Our Smol Training Playbook teaches you a better path to world-class LLMs, for free!

Check out the #1 trending space on 🤗 :
HuggingFaceTB/smol-training-playbook
prithivMLmods 
posted an update about 1 month ago
view post
Post
1301
A week ago, I shared a post about the latest transformers test implementation of DeepSeek-OCR Compatibility (https://tinyurl.com/ykc4mm66). Now, I’m dropping the most compatible version of it to support the model with the latest transformers. 🤗🔥

➠ DeepSeek-OCR-Latest-BF16.I64: prithivMLmods/DeepSeek-OCR-Latest-BF16.I64
➠ DeepSeek OCR [exp] : prithivMLmods/DeepSeek-OCR-experimental

✅Supports the latest transformers v4.57.1
✅torch: 2.6.0+cu124 (or) the latest version (i.e., torch 2.9.0)
✅cuda version: 12.4
✅users can also opt out of specific attention implementations if desired.

✨Previous version: strangervisionhf/deepseek-ocr-latest-transformers
↗️Related Blog: https://huggingface.co/blog/prithivMLmods/multimodal-ocr-vlms
✨Community Page: strangervisionhf
✨Original Model Page: deepseek-ai/DeepSeek-OCR

To know more about it, visit the app page or the respective model page!
Nymbo 
posted an update about 1 month ago
view post
Post
942
I've added an 11th tool to the Nymbo/Tools MCP server, it's for your Obsidian_Vault. I'd argue it's far more context-efficient than any other Obsidian MCP I've seen, and doesn't require any plugins. Also some big improvements to the Web_Search and Web_Fetch tools.

# Obsidian_Vault Tool

It's basically a read-only version of the File_System tool, but it works so well for navigating Obsidian without unnecessary context. It supports recursive (full-text) search across the entire vault, and supports offset so the agent can "scroll" through a document without re-consuming tokens.

Run the server locally and set the OBSIDIAN_VAULT_ROOT environment variable to your vault's root path. If you don't use Obsidian, this is perfectly usable as simply a read-only filesystem.

# Web_Search Improvements

The Web_Search tool previously just used DuckDuckGo as a backend search engine, but now it also supports Bing, Brave, Yahoo, and Wikipedia. Default engine is auto which provides results from all backends in recommended order. Still doesn't require any kind of API or auth for Web_Search.

There's also a new date filter to limit results to those created in the past day, week, month, or year. Oh, and uhh, SafeSearch is now off by default :)

# Web_Fetch Improvements

As context-efficient as the Markdown mode is for web browsing, sometimes it does lose important context in the conversion from HTML to Markdown. So I've added a new HTML mode to the Web_Fetch tool that basically executes a cURL request on the URL, returning the full HTML page if necessary.

# A Note on Claude Skills

I've been having fun with the new File_System and Shell_Command tools. Using Claude Skills doesn't currently work in the public HF space because of environment restrictions, but using Skills works perfectly well running locally.

Happy building ~
prithivMLmods 
posted an update about 1 month ago
view post
Post
2588
A small blog post titled - Hall of Multimodal OCR VLMs and Demonstrations has been published on ↗️ https://huggingface.co/blog/prithivMLmods/multimodal-ocr-vlms on behalf of strangervisionhf

It discusses the latest trends in OCR models, the multilingual support offered by modern OCR systems, their unique capabilities, OCR benchmark model comparisons, transformer-based implementations, and strategies for streamlining transformers compatibility.
prithivMLmods 
posted an update about 1 month ago
view post
Post
3843
Implemented DeepSeek-OCR to support the latest transformers on the strangervisionhf page. The page includes the model weights and corrected configuration, which fix the issues and allow transformers inference to run smoothly.🤗🔥

> Model: strangervisionhf/deepseek-ocr-latest-transformers
> Demo Space: prithivMLmods/DeepSeek-OCR-experimental

✅Supports the latest transformers
✅You can also opt out of the attention implementation if needed.
✅Supports torch version 2.6.0 or higher
✅torch version cuda: 12.4

If you are interested in experimenting with new things and streamlining compatibility, the strangervisionhf organization is open for you, and you can join the community.

> Multimodal Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0, https://huggingface.co/collections/strangervisionhf/october-2025-models

> Thank you, @merve , for assigning the blazing-fast Zero GPU support!

> Notebook : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepSeek-OCR-Demo/deepseek_ocr_demo.ipynb

To know more about it, visit the app page or the respective model page!