# Precise Debugging Benchmark (PDB)
Paper · Code · Project page · Leaderboard
PDB is an automatic pipeline that turns any coding dataset into a debugging benchmark with fine-grained metrics. Beyond binary unit-test scores, PDB evaluates a debugger with edit-level precision (did the model touch only the lines it had to?) and bug-level recall (did it fix every fault?). This rewards targeted fixes and penalizes the wholesale-regeneration behavior that frontier LLMs often fall back on.
Frontier models like GPT-5.1-Codex and DeepSeek-V3.2-Thinking top unit-test leaderboards (>76%) but score at or below 45% on precision: they pass tests by rewriting, not repairing. PDB makes that gap measurable.
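To make the two metrics concrete, here is a minimal sketch that treats them as set overlap between the lines a model edited and the lines that were actually buggy. This is an illustrative simplification, not the released pipeline: the function names and the line-set formulation are assumptions, and PDB's actual matching rules may differ.

```python
def edit_precision(edited_lines, buggy_lines):
    """Fraction of edited lines that were actually buggy.
    Low precision = the model rewrote code it didn't need to touch."""
    edited, buggy = set(edited_lines), set(buggy_lines)
    return len(edited & buggy) / len(edited) if edited else 0.0

def bug_recall(edited_lines, buggy_lines):
    """Fraction of buggy lines the model edited.
    Low recall = the model left some faults unfixed."""
    edited, buggy = set(edited_lines), set(buggy_lines)
    return len(edited & buggy) / len(buggy) if buggy else 1.0

# A model that regenerates lines 1-10 to fix a single bug on line 4
# passes the unit tests, but its edit precision exposes the rewrite:
print(edit_precision(range(1, 11), [4]))  # 0.1 — touched 10 lines, 1 was buggy
print(bug_recall(range(1, 11), [4]))      # 1.0 — the bug was fixed
```

A targeted one-line fix would instead score 1.0 on both metrics, which is the behavior PDB is designed to reward.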
## Released datasets
| Dataset | Size | Bug granularity | Notes |
|---|---|---|---|
| PDB-Single | 7,589 | single line | full initial pool before easy-case filtering |
| PDB-Single-Hard | 5,751 | single line | hard subset: tasks not easily solved by 7+ of 9 reference models |
| PDB-Multi | 256 | 2–4 line blocks | multi-line extension on programs with ≥35 LOC; atomicity-filtered |
All three are derived from BigCodeBench and LiveCodeBench, sourced via the PDB pipeline, and evaluated with precision / recall / unit-test pass rate.
## Citation
```bibtex
@inproceedings{zhu2026pdb,
  title     = {Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?},
  author    = {Zhu, Wang Bill and Chai, Miaosen and Wang, Shangshang and Liu, Yejia and
               Bian, Song and Dong, Honghua and Neiswanger, Willie and Jia, Robin},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026},
}
```
## Contact
Questions / submissions: [email protected], [email protected].