DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
Abstract
DAComp is a benchmark of 210 tasks that evaluates agents on real-world data engineering and data analysis workflows, revealing significant deficiencies in both areas.
Real-world enterprise data intelligence workflows encompass data engineering, which turns raw sources into analysis-ready tables, and data analysis, which converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems to meet changing requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM judge guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io
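For intuition, here is a minimal, hypothetical sketch of the two evaluation modes described above: execution-based, multi-metric scoring for DE tasks and rubric-guided LLM judging for DA tasks. The names score_de_task, RubricItem, and score_da_task are illustrative placeholders and not part of the official DAComp harness; see https://da-comp.github.io for the released code.

```python
# Toy sketch of the two DAComp-style evaluation modes (illustrative only).
from dataclasses import dataclass
from typing import Callable
import sqlite3


def score_de_task(agent_sql: str, checks: list[tuple[str, list[tuple]]]) -> float:
    """Execution-based, multi-metric scoring: run the agent's pipeline in a fresh
    database and compare each check query's result against its expected rows."""
    con = sqlite3.connect(":memory:")
    try:
        con.executescript(agent_sql)          # build the pipeline's output tables
        passed = sum(
            con.execute(query).fetchall() == expected
            for query, expected in checks
        )
        return passed / len(checks)
    except sqlite3.Error:
        return 0.0                            # a non-executable pipeline scores zero
    finally:
        con.close()


@dataclass
class RubricItem:
    criterion: str    # e.g. "recommendations follow from the computed evidence"
    weight: float


def score_da_task(report: str, rubric: list[RubricItem],
                  judge: Callable[[str, str], float]) -> float:
    """Rubric-guided judging: `judge` (an LLM call in practice) returns a score in
    [0, 1] per criterion; scores are aggregated by the rubric weights."""
    total = sum(item.weight for item in rubric)
    return sum(item.weight * judge(report, item.criterion) for item in rubric) / total
```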
Community
Today, we are thrilled to introduce DAComp—a comprehensive benchmark designed to evaluate the performance of LLM-based data agents across the entire enterprise data intelligence lifecycle.
Why do we need DAComp? Current benchmarks are often limited to isolated SQL generation or simple Q&A. However, in real-world enterprise environments, data work is a complex, closed loop. DAComp fills this gap by covering two core domains:
🔹 Data Engineering (DE): Going beyond single-point code generation to challenge repository-level engineering capabilities. Agents must handle architectural design, implementation, and system evolution in response to changing requirements (see the sketch after this list for a toy pipeline of this kind).
🔹 Data Analysis (DA): We introduce open-ended analysis tasks, requiring agents to demonstrate deep reasoning, strategic planning, and comprehensive synthesis capabilities.
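To make the DE setting concrete, the toy sketch below builds a hypothetical three-stage pipeline (raw landing table, cleaned staging table, analysis-ready mart) in an in-memory SQLite database. The schema and task are invented for illustration; actual DAComp tasks involve much larger industrial schemas and multi-stage SQL pipelines.

```python
# Hypothetical flavor of a DE task (not an actual DAComp instance): turn raw
# event rows into an analysis-ready table via a staged SQL pipeline.
import sqlite3

PIPELINE = """
-- stage 1: raw landing table (normally loaded from source files)
CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, amount REAL, status TEXT);
INSERT INTO raw_orders VALUES
    (1, 10, 25.0, 'paid'),
    (2, 10, 40.0, 'refunded'),
    (3, 11, 15.0, 'paid');

-- stage 2: cleaned intermediate table
CREATE TABLE stg_orders AS
SELECT order_id, customer_id, amount
FROM raw_orders
WHERE status = 'paid';

-- stage 3: analysis-ready mart keyed by customer
CREATE TABLE mart_customer_revenue AS
SELECT customer_id, SUM(amount) AS revenue, COUNT(*) AS paid_orders
FROM stg_orders
GROUP BY customer_id;
"""

con = sqlite3.connect(":memory:")
con.executescript(PIPELINE)
print(con.execute(
    "SELECT * FROM mart_customer_revenue ORDER BY customer_id"
).fetchall())
# -> [(10, 25.0, 1), (11, 15.0, 1)]
con.close()
```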
Watch our video for a closer look at these specific task categories!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PublicAgent: Multi-Agent Design Principles From an LLM-Based Open Data Analysis Framework (2025)
- UniDataBench: Evaluating Data Analytics Agents Across Structured and Unstructured Data (2025)
- Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development (2025)
- ConDABench: Interactive Evaluation of Language Models for Data Analysis (2025)
- UI-CUBE: Enterprise-Grade Computer Use Agent Benchmarking Beyond Task Accuracy to Operational Reliability (2025)
- MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline (2025)
- CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows (2025)