AI coding agents are benchmarked as fully-autonomous systems — but real-world use is interactive. Users correct and reject agent outputs 44% of the time. Agents seek clarification 1–2% of the time. Dialogue-SWEBench closes that gap: 500 real SWE-Bench problems, no pre-given specification, resolved entirely through multi-turn dialogue with a persona-grounded user simulator. Better coding models are not always better dialogue models.
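The benchmark splits the coding agent's world into two environments: a dialogue environment shared with the user, and a repository environment that only the agent can access. The user never touches the code. The agent lives in both and must keep dialogue state and repository state coordinated at the same time.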
Off-the-shelf coding agents almost never seek information from users. The schema-guided agent addresses this by building and maintaining a structured dialogue state representation that drives what to ask next and when to code.
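To make the idea of a state-driven policy concrete, here is a minimal sketch of how a structured dialogue state could decide between asking and coding. The slot names (`expected_behavior`, `observed_behavior`, `reproduction_steps`) and the `next_action` rule are illustrative assumptions, not the schema used by the schema-guided agent.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical slot names chosen for illustration only.
REQUIRED_SLOTS = ("expected_behavior", "observed_behavior", "reproduction_steps")


@dataclass
class DialogueState:
    """Tracks what the agent has learned from the user so far."""

    slots: dict[str, Optional[str]] = field(
        default_factory=lambda: {name: None for name in REQUIRED_SLOTS}
    )

    def update(self, slot: str, value: str) -> None:
        """Record a piece of information extracted from a user turn."""
        if slot not in self.slots:
            raise KeyError(f"unknown slot: {slot}")
        self.slots[slot] = value

    def missing(self) -> list[str]:
        """Return the slots that are still unfilled, in schema order."""
        return [name for name, value in self.slots.items() if value is None]


def next_action(state: DialogueState) -> tuple[str, Optional[str]]:
    """Ask about the first unfilled slot; start coding once none remain."""
    missing = state.missing()
    if missing:
        return ("ask", missing[0])
    return ("code", None)
```

In this sketch the agent keeps asking until every slot is filled. A real policy would likely weigh the cost of another turn against the value of the missing information, which this simple rule does not model.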
The user simulator (LLaMA 3.3 70B) is grounded in the full issue text but never shares it directly. A self-revision step validates each candidate reply for hallucinations, environment boundary violations, and length — then revises if needed. 97.5% of dialogues are defect-free.
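The validate-then-revise step can be pictured as a small loop. The sketch below is an assumption-laden illustration: the 60-word budget, the phrase list in `crosses_boundary`, and the `revise` callback are all hypothetical stand-ins. In the real simulator, hallucination and boundary checks would presumably be model-based rather than string heuristics.

```python
from typing import Callable, List, Tuple

# A check takes a candidate reply and returns a list of defect descriptions.
Check = Callable[[str], List[str]]


def too_long(reply: str, max_words: int = 60) -> List[str]:
    """Flag replies over an assumed word budget."""
    n = len(reply.split())
    return [f"length: {n} words exceeds {max_words}"] if n > max_words else []


def crosses_boundary(reply: str) -> List[str]:
    """Crude stand-in: flag replies where the user claims to act on the repository."""
    phrases = ("i edited the file", "i ran the test suite")  # hypothetical markers
    lowered = reply.lower()
    return [f"boundary: contains '{p}'" for p in phrases if p in lowered]


def self_revise(
    candidate: str,
    checks: List[Check],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 2,
) -> Tuple[str, List[str]]:
    """Validate a reply and revise it until it passes or the round budget runs out."""
    reply = candidate
    for _ in range(max_rounds):
        defects = [d for check in checks for d in check(reply)]
        if not defects:
            return reply, []
        reply = revise(reply, defects)
    # Report whatever defects remain after the final revision.
    remaining = [d for check in checks for d in check(reply)]
    return reply, remaining
```

Here `revise` would be another call to the simulator model with the defect list appended to the prompt. Any reply that still has defects after `max_rounds` is returned together with the remaining defect list so it can be counted or filtered.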
| Model | Full Simulator | u₁ Only (no follow-up) | Drop |
|---|---|---|---|
| GPT-5 mini | 68.0% | 40.0% | −28.0pp |
| GPT-5 | 58.0% | 44.0% | −14.0pp |
| Qwen3 Coder 30B-A3B | 38.0% | 26.0% | −12.0pp |
| Devstral 2 Small (24B) | 42.0% | 32.0% | −10.0pp |
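
The Drop column is a difference in percentage points (pp), not a relative change. The short sketch below recomputes it from the two resolve rates in the table. The function names are illustrative, and `relative_drop` is included only to show how the two measures differ.

```python
# Resolve rates (%) from the table above: (full simulator, u1 only).
results = {
    "GPT-5 mini": (68.0, 40.0),
    "GPT-5": (58.0, 44.0),
    "Qwen3 Coder": (38.0, 26.0),
    "Devstral 2 Small (24B)": (42.0, 32.0),
}


def drop_pp(full: float, first_turn_only: float) -> float:
    """Absolute change in percentage points when follow-up turns are removed."""
    return round(first_turn_only - full, 1)


def relative_drop(full: float, first_turn_only: float) -> float:
    """Share of the full-simulator resolve rate that is lost, in percent."""
    return round((full - first_turn_only) / full * 100, 1)


drops = {model: drop_pp(full, u1) for model, (full, u1) in results.items()}
```

For GPT-5 mini, `drop_pp(68.0, 40.0)` gives `-28.0`, matching the table.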
500 SWE-Bench Verified problems. 4 models × 3 agents = 12 systems. Information-seeking dialogue moves are the strongest predictor of resolve rate — not coding model size.
| Model | Agent | % Resolved ↑ | # Turns | # Steps | Cost ($) |
|---|---|---|---|---|---|
| GPT-5 mini | OpenHands | 34.3% | 2.1 | 48.8 | 0.26 |
| GPT-5 mini | OH Interactive | 54.2% | 4.5 | 41.3 | 0.23 |
| GPT-5 mini | Schema-Guided (Ours) | 58.8% | 8.4 | 39.4 | 0.24 |
| GPT-5 | |||||
| GPT-5 | OpenHands | 47.5% | 12.2 | 52.9 | 1.13 |
| GPT-5 | OH Interactive | 54.6% | 17.1 | 41.0 | 0.97 |
| GPT-5 | Schema-Guided (Ours) | 58.0% | 11.9 | 38.4 | 0.86 |
| Open-Weight Models | |||||
| Devstral 2 Small (24B) | OpenHands | 26.8% | 1.8 | 81.3 | 0.35 |
| Devstral 2 Small (24B) | Schema-Guided (Ours) | 38.6% | 3.3 | 75.1 | 0.30 |
| Qwen3 Coder 30B-A3B | OpenHands | 23.2% | 3.6 | 60.8 | 0.18 |
| Qwen3 Coder 30B-A3B | Schema-Guided (Ours) | 32.3% | 4.0 | 51.0 | 0.13 |
| Averages | |||||
| — | OpenHands | 32.9% | 4.9 | 61.0 | 0.48 |
| — | OH Interactive | 44.1% | 6.9 | 50.1 | 0.40 |
| — | Schema-Guided (Ours) | 46.9% | 6.9 | 50.9 | 0.38 |
Task resolution alone does not measure usability. Two automatic quality dimensions, Naturalness and Coherence, are scored by Gemma 4 31B-IT and validated against human annotation of 360 dialogues.
Four findings that distinguish Dialogue-SWEBench from fully-autonomous SWE evaluation.
BETTER CODING IS NOT BETTER DIALOGUE.