# Precise Debugging Benchmark (PDB)
Paper · Code · Project page · Leaderboard
PDB is an automatic pipeline that turns any coding dataset into a debugging benchmark with fine-grained metrics. Beyond binary unit-test scores, PDB evaluates a debugger with edit-level precision (did the model touch only the lines it had to?) and bug-level recall (did it fix every fault?). This rewards targeted fixes and penalizes the wholesale-regeneration behavior that frontier LLMs often fall back on.
Frontier models like GPT-5.1-Codex and DeepSeek-V3.2-Thinking top unit-test leaderboards (>76%) but score at or below 45% on precision: they pass tests by rewriting, not repairing. PDB makes that gap measurable.
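To make the two metrics concrete, here is a minimal sketch that treats them as set overlap between the lines a model edited and the lines that were actually buggy. This is an illustrative simplification, not the released pipeline: the function names and the line-set formulation are assumptions, and PDB's actual matching rules may differ.

```python
def edit_precision(edited_lines, buggy_lines):
    """Fraction of edited lines that were actually buggy.
    Low precision = the model rewrote code it didn't need to touch."""
    edited, buggy = set(edited_lines), set(buggy_lines)
    return len(edited & buggy) / len(edited) if edited else 0.0

def bug_recall(edited_lines, buggy_lines):
    """Fraction of buggy lines the model edited.
    Low recall = the model left some faults unfixed."""
    edited, buggy = set(edited_lines), set(buggy_lines)
    return len(edited & buggy) / len(buggy) if buggy else 1.0

# A model that regenerates lines 1-10 to fix a single bug on line 4
# passes the unit tests, but its edit precision exposes the rewrite:
print(edit_precision(range(1, 11), [4]))  # 0.1 — touched 10 lines, 1 was buggy
print(bug_recall(range(1, 11), [4]))      # 1.0 — the bug was fixed
```

A targeted one-line fix would instead score 1.0 on both metrics, which is the behavior PDB is designed to reward.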
## Released datasets
| Dataset | Size | Bug granularity | Notes |
|---|---|---|---|
| PDB-Single | 7,589 | single line | full initial pool before easy-case filtering |
| PDB-Single-Hard | 5,751 | single line | hard subset: tasks not easily solved by 7+ of 9 reference models |
| PDB-Multi | 256 | 2–4 line blocks | multi-line extension on programs with ≥35 LOC; atomicity-filtered |
All three are derived from BigCodeBench and LiveCodeBench, sourced via the PDB pipeline, and evaluated with precision / recall / unit-test pass rate.
## Citation
```bibtex
@inproceedings{zhu2026pdb,
  title     = {Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?},
  author    = {Zhu, Wang Bill and Chai, Miaosen and Wang, Shangshang and Liu, Yejia and
               Bian, Song and Dong, Honghua and Neiswanger, Willie and Jia, Robin},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026},
}
```
## Contact
Questions / submissions: [email protected], [email protected].