GEMS: Agent-Native Multimodal Generation with Memory and Skills
Abstract
GEMS is an agent-native multimodal generation framework that enhances model capabilities on both general and downstream tasks through structured multi-agent optimization, persistent memory, and domain-specific skills.
Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet they continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundation models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to handle diverse downstream applications effectively. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of an agent harness in extending model capabilities beyond their original limits.
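To make the three components concrete, here is a minimal sketch of how such a closed loop might be wired together. The paper does not specify these interfaces: `AgentMemory`, `agent_loop`, and the `generator`, `critic`, and `skills` objects are illustrative assumptions, meant only to show the generate-critique-refine cycle, the hierarchical memory (full states plus compressed summaries), and on-demand skill loading described in the abstract.

```python
# Hypothetical sketch of a GEMS-style agent loop; interfaces are assumptions,
# not the paper's API. A generator proposes an image, a critic scores it
# against the instruction, memory records the trajectory, and a skill loaded
# on demand injects domain expertise into each refinement step.

from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Trajectory-level memory: factual states plus compressed summaries."""
    states: list = field(default_factory=list)      # full per-step records
    summaries: list = field(default_factory=list)   # compressed experience

    def record(self, state: dict, summary: str) -> None:
        self.states.append(state)
        self.summaries.append(summary)

    def context(self) -> str:
        # Expose only compressed summaries to keep prompts compact while
        # retaining a global view of the optimization so far.
        return "\n".join(self.summaries)

def agent_loop(instruction, generator, critic, skills, max_steps=5, threshold=0.9):
    """Closed-loop refinement: generate, critique, revise until good enough."""
    memory = AgentMemory()
    skill = skills.load_for(instruction)            # on-demand skill loading
    prompt = skill.rewrite(instruction)             # inject domain expertise
    best_image, best_score = None, float("-inf")

    for step in range(max_steps):
        image = generator.generate(prompt, context=memory.context())
        score, feedback = critic.evaluate(image, instruction)
        memory.record(
            state={"step": step, "prompt": prompt, "score": score},
            summary=f"step {step}: score {score:.2f}; {feedback}",
        )
        if score > best_score:
            best_image, best_score = image, score
        if score >= threshold:                      # good enough: stop early
            break
        prompt = skill.refine(prompt, feedback)     # revise for the next pass
    return best_image
```

The `generator`, `critic`, and `skills` arguments are duck-typed stand-ins for whatever generative backend, evaluator, and skill library a concrete system would plug in; the point of the sketch is the control flow, not the components.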
Community
Introducing GEMS: an agent-native multimodal generation framework featuring Agent Loop, Agent Memory, and Agent Skills.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models (2026)
- coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation (2026)
- Agentic Proposing: Enhancing Large Language Model Reasoning via Compositional Skill Synthesis (2026)
- VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining (2026)
- Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization (2026)
- LatentMem: Customizing Latent Memory for Multi-Agent Systems (2026)
- Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision (2026)
Get this paper in your agent:

```
hf papers read 2603.28088
```

Don't have the latest CLI? Install it with:

```
curl -LsSf https://hf.co/cli/install.sh | bash
```
