Threat-map benchmark with metrics and geometry
Failure structure discovery on CARB reasoning
Failure analysis for LM reasoning via HF Inference API