~/shanegraffiti.com/research/dialogue-swebench

Shane Graffiti Inc. — Semantic Adversarial Research Division — 2026

DIALOGUE SWEBENCH CODING AGENTS

AI coding agents are benchmarked as fully autonomous systems, but real-world use is interactive. In real sessions, users correct or reject agent outputs 44% of the time, while agents seek clarification only 1–2% of the time. Dialogue-SWEBench closes that gap: 500 real SWE-Bench problems, no pre-given specification, each resolved entirely through multi-turn dialogue with a persona-grounded user simulator. Better coding models are not always better dialogue models.

Division: Semantic Adversarial Research

Domain: Coding Agents / Dialogue Eval

Published: arXiv 2606.13995, 2026

Key Result: Schema agent, +3–14% over baseline

§ 1.0 The Dual-Environment Setup

The benchmark splits the coding agent's world into two environments. The user never touches the code. The agent lives in both — it must coordinate dialogue state and repository state simultaneously.

Dialogue Environment

User Channel

The agent communicates with a simulated user via message_user actions. The task starts with a vague initial query — no issue text, no specification document. All problem details must be elicited through conversation.

Users correct or reject agent output 44% of the time in real sessions. Agents seek clarification only 1–2% of the time.

Repo Environment

Code Shell

The agent operates in a programming environment (editing files, running bash, executing tests) and submits a git patch via the finish action. The patch is evaluated against execution tests from SWE-Bench Verified.

500 problems · 100 step limit · scored on patch resolution rate
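A minimal sketch of the two channels, assuming a hypothetical policy function that returns (action, payload) pairs. Only message_user, finish, and the 100-step limit come from the setup above. The names Episode, run_episode, and run_bash, and the choice to count every action as one step, are illustrative assumptions rather than the benchmark's actual harness.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

STEP_LIMIT = 100  # step limit stated in the setup above


@dataclass
class Episode:
    dialogue: List[Tuple[str, str]] = field(default_factory=list)  # (speaker, text)
    commands: List[str] = field(default_factory=list)              # repo-side actions
    patch: Optional[str] = None                                    # set by finish


def run_episode(
    policy: Callable[[Episode], Tuple[str, str]],
    user_reply: Callable[[str], str],
    initial_query: str,
) -> Episode:
    """Route each action to the dialogue channel or the repo channel."""
    ep = Episode(dialogue=[("user", initial_query)])
    for _ in range(STEP_LIMIT):
        action, payload = policy(ep)
        if action == "message_user":
            # Dialogue environment: the user only ever sees messages.
            ep.dialogue.append(("agent", payload))
            ep.dialogue.append(("user", user_reply(payload)))
        elif action == "finish":
            # Repo environment: the submitted git patch ends the episode.
            ep.patch = payload
            break
        else:
            # Any other action is treated as a repo-side command (assumption).
            ep.commands.append(payload)
    return ep


def toy_policy(ep: Episode) -> Tuple[str, str]:
    """Hypothetical policy: ask once, run tests once, then submit."""
    if len(ep.dialogue) == 1:
        return "message_user", "What behavior did you expect?"
    if not ep.commands:
        return "run_bash", "pytest -q"
    return "finish", "diff --git a/x.py b/x.py"


ep = run_episode(toy_policy, lambda q: "It should not crash.", "My script crashes.")
```

The user callback never receives repository state, which mirrors the rule that the user never touches the code.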

§ 2.0 Schema-Guided Agent

Off-the-shelf coding agents almost never seek information from users. The schema-guided agent addresses this by building and maintaining a structured dialogue state representation that drives what to ask next and when to code.

01

Determine Issue Type

From the user's initial query, the agent classifies the issue as bug, feature, upgrade, or an open-ended type it determines on the fly. This anchors the schema shape that follows.

→ Type

02

Draft a Schema On the Fly

The agent generates a structured representation with keys for everything it needs to resolve the issue — observed behavior, expected behavior, reproduction steps, and any domain-specific fields. Unknown values are marked UNKNOWN.

→ Schema

03

Seek Missing Information

Dialogue moves are generated to fill UNKNOWN fields. The agent asks targeted questions — not a list of everything at once — and updates the schema state after each response, tracking what remains unresolved.

→ Fill

04

Maintain Schema Across Code Exploration

The schema persists as the agent explores the repository, modifies files, and runs tests. It guides verification questions and ensures the patch aligns with the user's stated intent — not the agent's assumptions.

→ Solve
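The four steps above can be sketched as a small state object. This is an illustration under stated assumptions, not the paper's implementation: the class IssueSchema, its method names, the field keys, and the question template are all hypothetical. Only the UNKNOWN marker, the issue types, and the one-question-at-a-time behavior come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional

UNKNOWN = "UNKNOWN"  # marker for values not yet elicited from the user

# Keys named in the description above; domain-specific keys are added per issue.
BASE_FIELDS = ["observed_behavior", "expected_behavior", "reproduction_steps"]


@dataclass
class IssueSchema:
    issue_type: str                      # e.g. "bug", "feature", "upgrade"
    fields: Dict[str, str] = field(default_factory=dict)

    @classmethod
    def draft(cls, issue_type: str, extra_fields: Iterable[str] = ()) -> "IssueSchema":
        """Step 02: draft a schema with every value marked UNKNOWN."""
        keys = BASE_FIELDS + list(extra_fields)
        return cls(issue_type, {key: UNKNOWN for key in keys})

    def unresolved(self) -> List[str]:
        """Keys that still need to be filled through dialogue."""
        return [key for key, value in self.fields.items() if value == UNKNOWN]

    def update(self, key: str, value: str) -> None:
        """Step 03: record what the user said for one key."""
        if key not in self.fields:
            raise KeyError(key)
        self.fields[key] = value

    def next_question(self) -> Optional[str]:
        """Ask about one missing field at a time, never the whole list."""
        missing = self.unresolved()
        if not missing:
            return None  # nothing left to ask; the agent can focus on code
        return f"Could you describe the {missing[0].replace('_', ' ')}?"


schema = IssueSchema.draft("bug", extra_fields=["library_version"])
first_question = schema.next_question()
schema.update("observed_behavior", "TypeError on empty input")
remaining = schema.unresolved()
```

Because the object outlives the questioning phase, the same unresolved() check can be rerun during code exploration to decide whether a verification question is needed.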

§ 3.0 User Simulator

The user simulator (Llama 3.3 70B) is grounded in the full issue text but never shares it directly. A self-revision step checks each candidate reply for hallucinations, environment boundary violations, and length, then revises it if needed. With self-revision, 97.5% of dialogues are defect-free, versus 82.5% without it.

Metric	Value
Faithfulness (turn-level)	99.7%
Goal adherence (turn-level)	99.7%
Defect-free dialogues (full simulator)	97.5%
Defect-free dialogues (without self-revision)	82.5%
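A rough sketch of a check-then-revise loop for simulator replies. In the benchmark both the checks and the revision are performed by the simulator model. The rule-based checks here, the names find_defects and self_revise, the phrase list, and the word limit are simplified stand-ins chosen for illustration.

```python
import re
from typing import Callable, List

MAX_WORDS = 60                                  # hypothetical length budget
BOUNDARY_PHRASES = ("looked at the repo", "ran your patch")  # hypothetical


def find_defects(reply: str, issue_text: str) -> List[str]:
    """Return the defects found in one candidate user reply."""
    defects = []
    # Hallucination (toy check): a backticked identifier absent from the issue.
    for ident in re.findall(r"`([^`]+)`", reply):
        if ident not in issue_text:
            defects.append(f"hallucination:{ident}")
    # Boundary violation: the user never touches the code environment.
    if any(phrase in reply.lower() for phrase in BOUNDARY_PHRASES):
        defects.append("boundary")
    # Length: overly long replies are flagged.
    if len(reply.split()) > MAX_WORDS:
        defects.append("length")
    return defects


def self_revise(
    candidate: str,
    issue_text: str,
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 2,
) -> str:
    """Validate a candidate reply and revise it while defects remain."""
    reply = candidate
    for _ in range(max_rounds):
        defects = find_defects(reply, issue_text)
        if not defects:
            break
        reply = revise(reply, defects)
    return reply


issue = "Calling `parse_date` on an empty string raises ValueError."
candidate = "It fails in `parse_date`. I also looked at the repo and `fix_all` is broken."
# Toy reviser: keep only the first sentence. A real reviser would be an LLM call.
final = self_revise(candidate, issue, lambda reply, defects: reply.split(". ")[0] + ".")
```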

Simulator ablation — schema-guided agent, n=50

Model	Full Simulator	u₁ Only (no follow-up)	Drop
GPT-5 mini	68.0%	40.0%	−28.0pp
GPT-5	58.0%	44.0%	−14.0pp
Qwen3 Coder 30B-A3B	38.0%	26.0%	−12.0pp
Devstral 2 Small (24B)	42.0%	32.0%	−10.0pp
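The Drop column is a difference in percentage points, and with n=50 each rate maps to a whole number of problems. The short sketch below recomputes both from the table. The variable and function names are illustrative.

```python
N_PROBLEMS = 50  # ablation sample size stated in the table caption

# (full simulator %, first-utterance-only %) per model, copied from the table
ablation = {
    "GPT-5 mini": (68.0, 40.0),
    "GPT-5": (58.0, 44.0),
    "Qwen3 Coder 30B-A3B": (38.0, 26.0),
    "Devstral 2 Small (24B)": (42.0, 32.0),
}


def to_count(percent: float) -> int:
    """Convert a resolve rate into a number of resolved problems."""
    return round(percent / 100 * N_PROBLEMS)


# Drop in percentage points when the user never answers follow-ups.
drops_pp = {model: round(full - first, 1) for model, (full, first) in ablation.items()}

# The same drop expressed as problems lost out of 50.
problems_lost = {
    model: to_count(full) - to_count(first) for model, (full, first) in ablation.items()
}
```

For GPT-5 mini this is 34 resolved problems with the full simulator against 20 with the initial query only.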

Persona types assigned per problem

Persona A

Risk-Averse

Cautious, wants to understand implications of any change before committing. Frequently asks how a solution works.

Persona B

Perfectionist

High attention to detail. Prefers things done the right way and wants changes carefully explained.

Persona C

Pragmatist

Business-minded, prioritizes deadlines. Tolerates hacks if they ship. Often asks whether a step is worth the effort.

Persona D

Introvert

Quiet, prefers working alone. Can struggle to communicate ideas clearly. Minimal but authentic.

Persona E

Impatient Expert

Concise to the point of only sending code or error messages. Grows impatient with unnecessary questions. Expects precision.

§ 4.0 Experimental Results

500 SWE-Bench Verified problems. 4 models × 3 agents = 12 systems. Information-seeking dialogue moves are the strongest predictor of resolve rate — not coding model size.

Resolve rate, dialogue turns, agent steps, cost — full benchmark

Model	Agent	% Resolved ↑	# Turns	# Steps	Cost ($)
GPT-5 mini	OpenHands	34.3%	2.1	48.8	0.26
GPT-5 mini	OH Interactive	54.2%	4.5	41.3	0.23
GPT-5 mini	Schema-Guided (Ours)	58.8%	8.4	39.4	0.24
GPT-5	OpenHands	47.5%	12.2	52.9	1.13
GPT-5	OH Interactive	54.6%	17.1	41.0	0.97
GPT-5	Schema-Guided (Ours)	58.0%	11.9	38.4	0.86
Open-Weight Models
Devstral 2 Small (24B)	OpenHands	26.8%	1.8	81.3	0.35
Devstral 2 Small (24B)	Schema-Guided (Ours)	38.6%	3.3	75.1	0.30
Qwen3 Coder 30B-A3B	OpenHands	23.2%	3.6	60.8	0.18
Qwen3 Coder 30B-A3B	Schema-Guided (Ours)	32.3%	4.0	51.0	0.13
Averages
—	OpenHands	32.9%	4.9	61.0	0.48
—	OH Interactive	44.1%	6.9	50.1	0.40
—	Schema-Guided (Ours)	46.9%	6.9	50.9	0.38
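To read the table as gains, the sketch below subtracts each baseline's resolve rate from the schema-guided rate for the same model. It uses only the rows listed above. The dictionary layout and the gain helper are illustrative.

```python
# % resolved per model and agent, copied from the rows above
resolved = {
    "GPT-5 mini": {"OpenHands": 34.3, "OH Interactive": 54.2, "Schema-Guided": 58.8},
    "GPT-5": {"OpenHands": 47.5, "OH Interactive": 54.6, "Schema-Guided": 58.0},
    "Devstral 2 Small (24B)": {"OpenHands": 26.8, "Schema-Guided": 38.6},
    "Qwen3 Coder 30B-A3B": {"OpenHands": 23.2, "Schema-Guided": 32.3},
}


def gain(model: str, baseline: str, agent: str = "Schema-Guided") -> float:
    """Percentage-point gain of `agent` over `baseline` for one model."""
    rates = resolved[model]
    return round(rates[agent] - rates[baseline], 1)


# Gains over off-the-shelf OpenHands, for every model in the table
gain_over_openhands = {model: gain(model, "OpenHands") for model in resolved}

# Gains over OH Interactive, only where that row is listed
gain_over_interactive = {
    model: gain(model, "OH Interactive")
    for model, rates in resolved.items()
    if "OH Interactive" in rates
}
```

The gains over OH Interactive (4.6 and 3.4 points) are much smaller than the gains over OpenHands, which suggests that most of the improvement comes from talking to the user at all.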

§ 5.0 Dialogue Quality — LLM-as-a-Judge

Task resolution alone does not measure usability. Two automatic quality dimensions, Naturalness and Coherence, are scored by Gemma 4 31B-IT and validated against human annotation of 360 dialogues.

Naturalness

Degree to which the agent is easy to understand and converse with. Scored 1–3. Scores vary more with model choice than with agent design. GPT-5 loses points for failing to close dialogues and for leaking internal system-prompt text to the user.

κ = 0.70 · 100% rank accuracy

Coherence

Degree to which dialogue moves guide conversation toward resolving the task. Scored via local coherence (each turn follows logically) and global coherence (conversation arc steers toward resolution). Stronger differentiation between agents than naturalness — agent design matters more here.

κ = 0.51 · 84.3% rank accuracy

Task vs. Quality

No clear relationship between naturalness and resolution rate. Usability of a coding agent in dialogue cannot be measured by task success alone — a model can solve the task with poor dialogue or communicate naturally while failing to resolve it.

Independent signal
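The κ values above measure judge-versus-human agreement corrected for chance. The summary does not say which κ variant was used, so the sketch below assumes plain unweighted Cohen's κ. The example labels are invented for illustration and are not benchmark data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa between two equally long label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented 1-3 naturalness scores for eight dialogues (illustration only)
human = [1, 2, 3, 3, 2, 1, 3, 2]
judge = [1, 2, 3, 3, 2, 2, 3, 1]
kappa = cohens_kappa(human, judge)
```

On a 1–3 ordinal scale a weighted κ would penalize a 1-versus-3 disagreement more than a 1-versus-2 one. The unweighted form treats all disagreements alike.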

§ 6.0 Key Findings

Four findings that distinguish Dialogue-SWEBench from fully autonomous SWE evaluation.

01

Coding ability ≠ Dialogue ability

With the interactive agents, GPT-5 mini matches or exceeds GPT-5 on overall resolve rate despite being a smaller model (54.2% vs 54.6% with OH Interactive, 58.8% vs 58.0% with the schema-guided agent). GPT-5 outperforms on harder engineering problems but underperforms on simpler ones: its dialogue failures (over-questioning, failing to follow up on unanswered details) cost it easy wins.

Finding 1

02

Information-seeking drives resolution

LLM-classified information-seeking dialogue moves strongly predict resolve rate across all models. Off-the-shelf OpenHands almost never seeks information. Schema-guided agents use the most information-seeking moves in all but one case, and solve the most tasks.

Finding 2

03

Verbose agents burden users

GPT-5 asks many questions at once. Users answer selectively (for example, 2 of 5 questions), and the agent does not follow up on the unanswered ones. It then makes incorrect assumptions about expected behavior and applies the wrong fix. Concise, targeted questions outperform exhaustive lists.

Finding 3

04

Schema agents cost less, not more

Despite using more dialogue turns than off-the-shelf OpenHands (6.9 vs 4.9 on average), the schema-guided agent takes fewer agent steps (50.9 vs 61.0) by removing exploratory dead ends. It also achieves the lowest average cost of the three agent types: $0.38, against $0.40 for OH Interactive and $0.48 for OpenHands.

Finding 4
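One way to combine cost and resolve rate is cost per resolved problem. This derived figure does not appear in the results table. It is a back-of-the-envelope estimate from the average rows, and dividing one average by another only approximates the true per-problem value.

```python
# (average % resolved, average cost in $) per agent, from the Averages rows
averages = {
    "OpenHands": (32.9, 0.48),
    "OH Interactive": (44.1, 0.40),
    "Schema-Guided": (46.9, 0.38),
}


def cost_per_resolved(percent_resolved: float, cost: float) -> float:
    """Approximate dollars spent per resolved problem."""
    return cost / (percent_resolved / 100)


estimate = {
    agent: round(cost_per_resolved(rate, cost), 2)
    for agent, (rate, cost) in averages.items()
}
cheapest = min(estimate, key=estimate.get)
```

Under this estimate the schema-guided agent spends about $0.81 per resolved problem, against about $0.91 for OH Interactive and $1.46 for OpenHands.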

~/conclusion

$ query: what does Dialogue-SWEBench change
// Coding agents are not evaluated on what they actually do.
// Real use is interactive. Benchmarks have not been.
// Dialogue capability is a distinct, currently understudied dimension.

$ query: what is the cost of better dialogue
// Fewer steps, not more. Schema agents reduce total agent steps.
// Targeted questions are cheaper than exploratory dead-ends.
// Best average resolve rate at the lowest average cost.

$ query: what is the actual result
// GPT-5 mini matches GPT-5. Coding ≠ Dialogue.
// +3–14% over baselines. One schema. Any coding agent.

BETTER CODING IS NOT BETTER DIALOGUE.
