Shane Graffiti Inc. — Semantic Adversarial Research Division — 2026

DIALOGUE SWEBENCH CODING AGENTS

AI coding agents are benchmarked as fully autonomous systems, but real-world use is interactive. In real sessions, users correct or reject agent outputs 44% of the time, while agents seek clarification only 1–2% of the time. Dialogue-SWEBench closes that gap: 500 real SWE-Bench problems, no pre-given specification, resolved entirely through multi-turn dialogue with a persona-grounded user simulator. Better coding models are not always better dialogue models.

Division: Semantic Adversarial Research
Domain: Coding Agents / Dialogue Eval
Published: arXiv 2606.13995 (2026)
Key Result: Schema-guided agent resolves +3–14 percentage points more problems than the baselines on average
Dialogue-Driven SWE · Schema-Guided Agents · User Simulation · Self-Revision · Information Seeking · Persona Grounding · OpenHands · SWE-Bench Verified · Resolve Rate · Dialogue Naturalness · Dialogue Coherence · LLM-as-a-Judge
§ 1.0 The Dual-Environment Setup

The benchmark splits the coding agent's world into two environments. The user never touches the code. The agent lives in both — it must coordinate dialogue state and repository state simultaneously.

Dialogue Environment
User Channel
The agent communicates with a simulated user via message_user actions. The task starts with a vague initial query — no issue text, no specification document. All problem details must be elicited through conversation.
User corrects/rejects agent output 44% of the time in real sessions. Agents seek clarification 1–2% of the time.
Repo Environment
Code Shell
The agent operates in a programming environment — editing files, running bash, executing tests — and submits a git patch via the finish action. Evaluated against execution tests from SWE-Bench Verified.
500 problems · 100 step limit · scored on patch resolution rate
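
The dual-environment setup can be pictured as a single loop that routes each agent action to one of two places. The sketch below is illustrative only: `message_user` and `finish` are the action names given above, while the `shell` action kind, the `Action` and `EpisodeResult` types, `run_episode`, and the choice to count every action as one step toward the 100-step limit are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

STEP_LIMIT = 100  # step limit stated in the benchmark description


@dataclass
class Action:
    kind: str      # "message_user", "shell" (hypothetical name), or "finish"
    payload: str   # message text, shell command, or git patch


@dataclass
class EpisodeResult:
    patch: Optional[str]  # None if the agent never called finish
    steps: int            # actions taken (assumed to all count as steps)
    turns: int            # number of message_user actions


def run_episode(
    policy: Callable[[List[Tuple[str, str]]], Action],
    user_reply: Callable[[str], str],
    run_shell: Callable[[str], str],
    initial_query: str,
) -> EpisodeResult:
    """Route each action to the dialogue or repo environment until finish."""
    history: List[Tuple[str, str]] = [("user", initial_query)]
    turns = 0
    for step in range(1, STEP_LIMIT + 1):
        action = policy(history)
        if action.kind == "message_user":
            # Dialogue environment: only messages cross this boundary.
            turns += 1
            history.append(("agent", action.payload))
            history.append(("user", user_reply(action.payload)))
        elif action.kind == "shell":
            # Repo environment: the user never sees or touches this side.
            history.append(("shell", run_shell(action.payload)))
        elif action.kind == "finish":
            return EpisodeResult(patch=action.payload, steps=step, turns=turns)
        else:
            raise ValueError(f"unknown action kind: {action.kind}")
    return EpisodeResult(patch=None, steps=STEP_LIMIT, turns=turns)
```

In this framing the returned patch would then be scored against the execution tests, and an episode that hits the step limit without `finish` simply yields no patch.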
§ 2.0 Schema-Guided Agent

Off-the-shelf coding agents almost never seek information from users. The schema-guided agent addresses this by building and maintaining a structured dialogue state representation that drives what to ask next and when to code.

01
Determine Issue Type
From the user's initial query, the agent classifies the issue as bug, feature, upgrade, or an open-ended type it determines on the fly. This anchors the schema shape that follows.
→ Type
02
Draft a Schema On the Fly
The agent generates a structured representation with keys for everything it needs to resolve the issue — observed behavior, expected behavior, reproduction steps, and any domain-specific fields. Unknown values are marked UNKNOWN.
→ Schema
03
Seek Missing Information
Dialogue moves are generated to fill UNKNOWN fields. The agent asks targeted questions — not a list of everything at once — and updates the schema state after each response, tracking what remains unresolved.
→ Fill
04
Maintain Schema Across Code Exploration
The schema persists as the agent explores the repository, modifies files, and runs tests. It guides verification questions and ensures the patch aligns with the user's stated intent — not the agent's assumptions.
→ Solve
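
The four steps above can be sketched as a small dialogue-state structure. This is a rule-based stand-in, not the agent itself: in the described system a language model drafts the schema and phrases the questions. The field names, `draft_schema`, `unresolved`, `next_question`, and `update` are hypothetical; only the `UNKNOWN` marker and the idea of asking one targeted question at a time come from the text.

```python
UNKNOWN = "UNKNOWN"  # marker for fields the user has not yet supplied

# Hypothetical base fields, mirroring the examples named in step 02.
BASE_FIELDS = ["observed_behavior", "expected_behavior", "reproduction_steps"]


def draft_schema(issue_type, extra_fields=()):
    """Steps 01-02: fix the issue type, then create UNKNOWN slots."""
    schema = {"issue_type": issue_type}
    for field in [*BASE_FIELDS, *extra_fields]:
        schema[field] = UNKNOWN
    return schema


def unresolved(schema):
    """Fields that still need an answer, in schema order."""
    return [key for key, value in schema.items() if value == UNKNOWN]


def next_question(schema):
    """Step 03: ask about one missing field at a time, or None when done."""
    missing = unresolved(schema)
    if not missing:
        return None
    return f"Could you describe the {missing[0].replace('_', ' ')}?"


def update(schema, field, answer):
    """Record an answer; an empty answer leaves the field UNKNOWN."""
    new_schema = dict(schema)
    if answer.strip():
        new_schema[field] = answer.strip()
    return new_schema
```

Because an empty answer leaves the slot at `UNKNOWN`, the same field is asked about again on the next turn, which is one simple way to avoid silently assuming a value the user never gave.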
§ 3.0 User Simulator

The user simulator (Llama 3.3 70B) is grounded in the full issue text but never shares it directly. A self-revision step checks each candidate reply for hallucinations, environment boundary violations, and excessive length, then revises the reply if needed. With self-revision, 97.5% of simulated dialogues are defect-free, compared with 82.5% without it.

Faithfulness (turn-level): 99.7%
Goal adherence (turn-level): 99.7%
Defect-free dialogues (full sim): 97.5%
Defect-free w/o self-revision: 82.5%
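
A self-revision step of this kind can be pictured as a check-then-rewrite loop. The sketch below is a toy under stated assumptions: the three defect classes come from the description above, but the concrete checks (backtick-quoted identifiers must appear in the issue, a short phrase list for boundary violations, a 60-word budget), the `revise` callback, and the two-round cap are all hypothetical. A real simulator would presumably use a language model for both checking and revising.

```python
import re
from typing import Callable, List

MAX_WORDS = 60  # hypothetical length budget for a user reply
# Hypothetical phrases suggesting the simulated user touched the repository.
BOUNDARY_PHRASES = ("i ran", "i edited", "i applied the patch")


def find_defects(reply: str, issue_text: str) -> List[str]:
    """Return a list of defect labels for one candidate reply."""
    defects = []
    # Toy hallucination check: every `quoted` identifier must occur in the issue.
    for ident in re.findall(r"`([^`]+)`", reply):
        if ident not in issue_text:
            defects.append(f"hallucination: {ident}")
    # Toy boundary check: the user never touches the code.
    lowered = reply.lower()
    if any(phrase in lowered for phrase in BOUNDARY_PHRASES):
        defects.append("boundary")
    # Length check.
    if len(reply.split()) > MAX_WORDS:
        defects.append("length")
    return defects


def self_revise(
    candidate: str,
    issue_text: str,
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 2,
) -> str:
    """Validate a reply and ask `revise` to fix it while defects remain."""
    reply = candidate
    for _ in range(max_rounds):
        defects = find_defects(reply, issue_text)
        if not defects:
            break
        reply = revise(reply, defects)
    return reply
```

Passing the defect labels to `revise` lets the rewriting step target the specific problem rather than regenerate the reply blindly.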
Simulator ablation — schema-guided agent, n=50
| Model | Full Simulator | u₁ Only (no follow-up) | Drop |
|---|---|---|---|
| GPT-5 mini | 68.0% | 40.0% | −28.0pp |
| GPT-5 | 58.0% | 44.0% | −14.0pp |
| Qwen3 Coder 30B-A3B | 38.0% | 26.0% | −12.0pp |
| Devstral 2 Small (24B) | 42.0% | 32.0% | −10.0pp |
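
As a quick check on the ablation, the drop column is the u₁-only rate minus the full-simulator rate, and with n=50 each problem is worth 2 percentage points. The snippet below reproduces the drops from the figures above and converts the rates to problem counts. The variable and function names are illustrative, and the count conversion assumes each rate is an exact fraction of 50.

```python
# (full simulator %, u1-only %) for the schema-guided agent, n = 50
ablation = {
    "GPT-5 mini": (68.0, 40.0),
    "GPT-5": (58.0, 44.0),
    "Qwen3 Coder 30B-A3B": (38.0, 26.0),
    "Devstral 2 Small (24B)": (42.0, 32.0),
}
N_PROBLEMS = 50


def drops(table):
    """Percentage-point change when the simulator only gives the first turn."""
    return {model: round(u1_only - full, 1) for model, (full, u1_only) in table.items()}


def solved_counts(table, n):
    """Convert percentages back to solved-problem counts for a sample of n."""
    return {
        model: (round(full / 100 * n), round(u1_only / 100 * n))
        for model, (full, u1_only) in table.items()
    }


for model, drop in drops(ablation).items():
    full_n, u1_n = solved_counts(ablation, N_PROBLEMS)[model]
    print(f"{model}: {drop:+.1f}pp ({full_n} -> {u1_n} of {N_PROBLEMS} problems)")
```

Read this way, the GPT-5 mini drop of 28 points corresponds to 14 of the 50 sampled problems that were solved only when the simulator could answer follow-up questions.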
Persona types assigned per problem
Persona A
Risk-Averse
Cautious, wants to understand implications of any change before committing. Frequently asks how a solution works.
Persona B
Perfectionist
High attention to detail. Prefers things done the right way and wants changes carefully explained.
Persona C
Pragmatist
Business-minded, prioritizes deadlines. Tolerates hacks if they ship. Often asks whether a step is worth the effort.
Persona D
Introvert
Quiet, prefers working alone. Can struggle to communicate ideas clearly. Minimal but authentic.
Persona E
Impatient Expert
Concise to the point of only sending code or error messages. Grows impatient with unnecessary questions. Expects precision.
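
One way to attach a persona to each problem is a deterministic mapping, so that every agent faces the same simulated user on the same problem. The sketch below is an assumption, not the benchmark's stated procedure: the text only says personas are assigned per problem. The hash-based `assign_persona`, the prompt wording in `build_simulator_prompt`, and the example problem id are all hypothetical; the persona names and summaries paraphrase the five cards above.

```python
import hashlib

# Persona names and one-line summaries paraphrased from the cards above.
PERSONAS = {
    "Risk-Averse": "Cautious; wants to understand the implications of a change first.",
    "Perfectionist": "Detail-oriented; wants changes done properly and explained.",
    "Pragmatist": "Deadline-driven; tolerates hacks if they ship.",
    "Introvert": "Quiet; may struggle to communicate ideas clearly.",
    "Impatient Expert": "Terse; often sends only code or error messages.",
}


def assign_persona(problem_id: str) -> str:
    """Map a problem id to a persona name, stable across runs and machines."""
    digest = hashlib.sha256(problem_id.encode("utf-8")).digest()
    names = sorted(PERSONAS)
    return names[digest[0] % len(names)]


def build_simulator_prompt(problem_id: str, issue_text: str) -> str:
    """Assemble a system prompt grounding the simulator in issue and persona."""
    name = assign_persona(problem_id)
    return (
        f"You are a software user with the '{name}' persona: {PERSONAS[name]}\n"
        "You know the issue below but must never paste it verbatim.\n"
        "You cannot see or edit the repository.\n\n"
        f"ISSUE (private):\n{issue_text}"
    )


# Hypothetical problem id, for illustration only.
print(build_simulator_prompt("example__repo-1", "Saving an empty file crashes."))
```

Using a cryptographic hash rather than Python's built-in `hash` keeps the assignment reproducible, since the built-in string hash is randomized per process by default.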
§ 4.0 Experimental Results

500 SWE-Bench Verified problems. 4 models × 3 agents = 12 systems. Information-seeking dialogue moves are the strongest predictor of resolve rate — not coding model size.

Resolve rate, dialogue turns, agent steps, cost — full benchmark
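| Model | Agent | % Resolved ↑ | # Turns | # Steps | Cost ($) |
|---|---|---|---|---|---|
| GPT-5 mini | OpenHands | 34.3% | 2.1 | 48.8 | 0.26 |
| GPT-5 mini | OH Interactive | 54.2% | 4.5 | 41.3 | 0.23 |
| GPT-5 mini | Schema-Guided (Ours) | 58.8% | 8.4 | 39.4 | 0.24 |
| GPT-5 | OpenHands | 47.5% | 12.2 | 52.9 | 1.13 |
| GPT-5 | OH Interactive | 54.6% | 17.1 | 41.0 | 0.97 |
| GPT-5 | Schema-Guided (Ours) | 58.0% | 11.9 | 38.4 | 0.86 |
| Open-Weight Models | | | | | |
| Devstral 2 Small (24B) | OpenHands | 26.8% | 1.8 | 81.3 | 0.35 |
| Devstral 2 Small (24B) | Schema-Guided (Ours) | 38.6% | 3.3 | 75.1 | 0.30 |
| Qwen3 Coder 30B-A3B | OpenHands | 23.2% | 3.6 | 60.8 | 0.18 |
| Qwen3 Coder 30B-A3B | Schema-Guided (Ours) | 32.3% | 4.0 | 51.0 | 0.13 |
| Averages | | | | | |
| Average | OpenHands | 32.9% | 4.9 | 61.0 | 0.48 |
| Average | OH Interactive | 44.1% | 6.9 | 50.1 | 0.40 |
| Average | Schema-Guided (Ours) | 46.9% | 6.9 | 50.9 | 0.38 |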
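
The headline gain can be traced back to the rows above. The snippet below recomputes the schema-guided agent's margin over each baseline, per model and on the reported averages. It appears that the "+3–14%" figure corresponds to the average gaps in percentage points (2.8 over OH Interactive, 14.0 over OpenHands), though that reading is an inference from the table rather than something the page states. Names such as `resolved` and `gain_over` are illustrative.

```python
from statistics import mean

# % resolved per model, copied from the results table (rows shown on the page).
resolved = {
    "OpenHands": {
        "GPT-5 mini": 34.3, "GPT-5": 47.5,
        "Devstral 2 Small (24B)": 26.8, "Qwen3 Coder 30B-A3B": 23.2,
    },
    "Schema-Guided": {
        "GPT-5 mini": 58.8, "GPT-5": 58.0,
        "Devstral 2 Small (24B)": 38.6, "Qwen3 Coder 30B-A3B": 32.3,
    },
}

# Averages as reported in the table.
reported_avg = {"OpenHands": 32.9, "OH Interactive": 44.1, "Schema-Guided": 46.9}


def gain_over(baseline):
    """Average percentage-point gain of the schema-guided agent."""
    return round(reported_avg["Schema-Guided"] - reported_avg[baseline], 1)


def per_model_gain():
    """Percentage-point gain over plain OpenHands for each model."""
    return {
        model: round(resolved["Schema-Guided"][model] - base, 1)
        for model, base in resolved["OpenHands"].items()
    }


print("vs OH Interactive:", gain_over("OH Interactive"))
print("vs OpenHands:", gain_over("OpenHands"))
print("per model vs OpenHands:", per_model_gain())
print("recomputed OpenHands mean:", round(mean(resolved["OpenHands"].values()), 2))
```

The recomputed per-agent means agree with the reported averages to within rounding, which suggests the averages are plain means over the four models.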
§ 5.0 Dialogue Quality — LLM-as-a-Judge

Task resolution alone does not measure usability. Two automatic quality dimensions, Naturalness and Coherence, are scored by Gemma 4 31B-IT and validated against human annotation of 360 dialogues.

| Dimension | What it measures | Judge agreement |
|---|---|---|
| Naturalness | Degree to which the agent is easy to understand and converse with. Scored 1–3. Varies more with model choice than with agent design. GPT-5 suffers from failing to close dialogues and from leaking internal system prompt text to the user. | κ = 0.70 · 100% rank accuracy |
| Coherence | Degree to which dialogue moves guide the conversation toward resolving the task. Scored via local coherence (each turn follows logically) and global coherence (the conversation arc steers toward resolution). Differentiates agents more strongly than naturalness does, so agent design matters more here. | κ = 0.51 · 84.3% rank accuracy |
| Task vs. Quality | No clear relationship between naturalness and resolution rate. The usability of a coding agent in dialogue cannot be measured by task success alone: a model can solve the task with poor dialogue, or communicate naturally while failing to resolve it. | Independent signal |
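
Judge agreement figures like these are typically computed from paired human and judge labels. The sketch below shows one plausible reading, with assumptions labeled: κ is taken to be unweighted Cohen's kappa, and rank accuracy is taken to be the fraction of system pairs that the judge orders the same way as the human annotators. The page does not define either metric, and the function names and sample data are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def rank_accuracy(human_means, judge_means):
    """Share of system pairs ordered the same way by humans and the judge."""
    pairs = [
        (s, t) for s, t in combinations(sorted(human_means), 2)
        if human_means[s] != human_means[t]  # skip human ties
    ]
    if not pairs:
        raise ValueError("need at least one pair with distinct human scores")
    agree = sum(
        (human_means[s] > human_means[t]) == (judge_means[s] > judge_means[t])
        for s, t in pairs
    )
    return agree / len(pairs)


# Made-up 1-3 scores for eight dialogues, for illustration only.
human = [1, 2, 3, 3, 2, 1, 2, 3]
judge = [1, 2, 3, 2, 2, 1, 3, 3]
print(round(cohen_kappa(human, judge), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters on a three-point scale where two raters would often coincide at random.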
§ 6.0 Key Findings

Four findings that distinguish Dialogue-SWEBench from fully-autonomous SWE evaluation.

01
Coding ability ≠ Dialogue ability
With the interactive agents, GPT-5 mini matches GPT-5 on overall resolve rate despite being a smaller model (54.2% vs 54.6% with OH Interactive, 58.8% vs 58.0% with the schema-guided agent). GPT-5 outperforms on harder engineering problems but underperforms on simpler ones: its dialogue failures (over-questioning, failing to follow up on unanswered details) cost it easy wins.
Finding 1
02
Information-seeking drives resolution
LLM-classified information-seeking dialogue moves strongly predict resolve rate across all models. Off-the-shelf OpenHands almost never seeks information. Schema-guided agents use the most information-seeking moves in all but one case, and solve the most tasks.
Finding 2
03
Verbose agents burden users
GPT-5 asks many questions at once. Users selectively answer 2 of 5 questions. The agent doesn't follow up on the unanswered ones. It then makes incorrect assumptions about expected behavior and applies the wrong fix. Concise, targeted questions outperform exhaustive lists.
Finding 3
04
Schema agents cost less, not more
Despite using more dialogue turns than OpenHands (6.9 vs 4.9 on average), the schema-guided agent takes fewer agent steps (50.9 vs 61.0, by removing exploratory dead ends) and achieves the lowest average cost of the three agent types: $0.38 vs $0.40 for OH Interactive and $0.48 for OpenHands.
Finding 4
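
Cost and resolve rate can be combined into a rough cost per resolved problem. This derived figure is not reported on the page; it is an illustrative calculation from the average rows of the results table, and it divides an average cost by an average resolve rate, so it only approximates the true per-problem ratio. The names `averages` and `cost_per_resolved` are hypothetical.

```python
# (average % resolved, average cost in $ per problem) from the Averages rows
averages = {
    "OpenHands": (32.9, 0.48),
    "OH Interactive": (44.1, 0.40),
    "Schema-Guided": (46.9, 0.38),
}


def cost_per_resolved(pct_resolved, cost_per_problem):
    """Approximate dollars spent per successfully resolved problem."""
    if pct_resolved <= 0:
        raise ValueError("resolve rate must be positive")
    return cost_per_problem / (pct_resolved / 100.0)


per_resolved = {
    agent: round(cost_per_resolved(pct, cost), 2)
    for agent, (pct, cost) in averages.items()
}

for agent, dollars in per_resolved.items():
    print(f"{agent}: ${dollars:.2f} per resolved problem")
```

On these averages the gap between agents widens once failures are priced in, since a cheaper agent that also resolves more problems wins on both the numerator and the denominator.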
~/conclusion
$ query: what does Dialogue-SWEBench change
// Coding agents are not evaluated on what they actually do.
// Real use is interactive. Benchmarks have not been.
// Dialogue capability is a distinct, currently understudied dimension.
$ query: what is the cost of better dialogue
// Fewer steps, not more. Schema agents reduce total agent steps.
// Targeted questions are cheaper than exploratory dead-ends.
// Best average resolve rate at the lowest average cost.
$ query: what is the actual result
// GPT-5 mini matches GPT-5. Coding ≠ Dialogue.
// +3–14% over baselines. One schema. Any coding agent.

BETTER CODING IS NOT BETTER DIALOGUE.