~/shanegraffiti.com/research/dlawbench

Shane Graffiti Inc. — Semantic Adversarial Research Division — 2026

DLAWBENCH
MULTI-TURN
LEGAL
CONSULTATION

Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients — then reasoning from what's missing. DLawBench evaluates whether LLMs can conduct real legal consultation under realistic conditions: eliciting facts, correcting client misframes, and writing defensible memos. Built from 461 real court opinions across Chinese and U.S. law, with four client personality types and expert-authored evaluation rubrics. The best-performing model achieves only 0.562 in consultation-grounded legal reasoning.

Division Semantic Adversarial Research

Domain Legal AI / Multi-Turn Evaluation

Published arXiv 2606.13931 — 2026

Key Result Top model: 0.562 Resolution

§ 1.0 The Consultation Pipeline

Legal consultation interleaves information gathering with legal reasoning. A lawyer rarely receives a complete, neutral, chronologically ordered fact pattern. DLawBench evaluates three linked but separable abilities across the full pipeline.

Phase 1 — Elicitation

Information
Gathering

The lawyer model interviews the simulated client — up to 10 turns — asking targeted follow-up questions. Scored on Fact Coverage (facts reaching the memo) and Inquiry (case-specific intake questions asked).

→

Phase 2 — Resolution

Legal
Reasoning

The model submits a structured legal analysis memo. Scored on Fact Resolution (facts correctly reframed against the hidden court record) and Issue Resolution (legal analysis points addressed).

→

Phase 3 — Fidelity

Claim
Support

Whether factual and inferential claims in the memo are grounded in the consultation or the case record — not invented. High fidelity with wrong legal route is still a failure.

Metric hierarchy — DLawBench Elicitation = (Fact Coverage + Inquiry) / 2 // whether the lawyer gathered the information needed for analysis Resolution = (Fact Resolution + Issue Resolution) / 2 // whether gathered facts become correct legal analysis Fidelity = 1 − (unsupported claims / total claims) // claim support guardrail — high fidelity does not imply correct routing

§ 2.0 Client Narrative Styles

Each of 461 cases is replayed under four narrative styles adapted from the Interpersonal Circumplex. The underlying facts are fixed. Only how much the client volunteers and resists changes — and that changes everything.

Cooperative

High Communion · High Agency

IPC AXIS — Easy intake baseline

Actively volunteers information, answers directly and completely, admits uncertainty. The idealized benchmark user. Cooperative-only evaluation inflates measured consultation quality.

"I want to solve this. Please ask me anything you need."

Dependent

High Communion · Low Agency

IPC AXIS — Defers to lawyer authority

Relies on the lawyer to lead. Provides information only when directly asked. Needs reassurance and direction. Models uncertainty and help-seeking deference — causes largest performance drop.

"I'm not sure what's important. You tell me what to do."

Withdrawn

Low Communion · Low Agency

IPC AXIS — Avoids, withholds, deflects

Provides minimal information, avoids sensitive details, gives short or vague replies. Models shame, fear, and credibility concerns. Models fill unknowns silently with client-favorable assumptions.

"I really don't want to talk about it too much."

Adversarial

Low Communion · High Agency

IPC AXIS — Challenges, argues, resists

Challenges the lawyer's questions, provides biased or selective information, argues or shows suspicion. Paradoxically closer to Cooperative than Dependent — a challenging client is still an informative one.

"I told you what happened. Why do you keep asking?"

§ 3.0 Sycophancy Taxonomy

In legal settings sycophancy is not simple agreement — it is a structured, multi-level failure. Models deploy legal knowledge in service of an unchecked client frame, producing professionally packaged error. DLawBench makes this operationally measurable.

Client-frame adoption

Fact level. The memo treats the client's subjective belief as an established factual premise. Accepts job title, event character, or dispute framing without checking against objective records. Signal: high Fact Coverage, low Fact Resolution.

Wrong legal route

Rule level. Many correct facts enter the memo but under the wrong legal theory. A performance-pay dispute routed as wage arrears before testing whether the bonus had any contractual basis. High Fidelity, zero Issue Resolution.

Gap filling

Interaction level. Under Withdrawn clients, short evasive answers are silently expanded into a complete client-favorable narrative. Unknowns are not preserved as unknowns — they are specified. Withdrawn-specific collapse in Fact Resolution.

Compliant reconstruction

Interaction level. Under Dependent clients, the model follows the client's preferred direction and supplies legal support rather than testing the claim. Avoidance of challenging follow-ups produces a plausible memo on the wrong foundation.

Unsupported specificity

Fact level. The memo asserts documents, witnesses, or procedural steps not found in the case or dialogue — inventing exhibits, admissions, or legal force for internal records. Detected by Fidelity's claim-grounding check.

§ 4.0 Experimental Results

26 models evaluated across 461 cases × 4 narrative styles = 1,844 case-style cells. Three-judge panel: GPT-5.1, Claude Opus 4.6, Gemini 3.1 Pro. Same-vendor judges recuse. Scored by median aggregation.

0.562

Top model Resolution (GPT-5.5)

0.934

Top model Fidelity — not enough

−10pp

Inquiry drop, Withdrawn vs Cooperative

0.076

LegalOne-8B Resolution — domain tuning alone fails

Leaderboard — ranked by Resolution (avg. Chinese + U.S. law)

Model	Fact Cov.	Inquiry	Elicitation	Fact Res.	Issue Res.	Resolution	Fidelity
GPT-5.5	0.837	0.576	0.707	0.583	0.541	0.562	0.934
GPT-5.4	0.823	0.537	0.680	0.578	0.515	0.546	0.940
GPT-5.2	0.787	0.563	0.675	0.530	0.467	0.499	0.936
Gemini 3.1 Pro	0.732	0.399	0.565	0.518	0.425	0.472	0.841
Claude Opus 4.7	0.730	0.355	0.543	0.531	0.348	0.440	0.873
Claude Opus 4.6	0.770	0.361	0.566	0.520	0.347	0.433	0.876
Kimi-K2.6	0.730	0.376	0.553	0.489	0.358	0.424	0.868
GLM-5.1	0.766	0.360	0.563	0.504	0.331	0.418	0.884
Claude Sonnet 4.6	0.713	0.296	0.504	0.468	0.285	0.376	0.889
DeepSeek-R1	0.671	0.292	0.482	0.377	0.166	0.272	0.842
OLMo-3-32B-Think	0.530	0.177	0.354	0.224	0.062	0.143	0.619
LegalOne-8B	0.202	0.095	0.148	0.102	0.050	0.076	0.268

Coverage ≠ Reasoning — GLM-5.1 diagnostic

Metric	Score	What it reveals
Fact Coverage	0.766	76.6% of annotated facts appear in the memo
Inquiry	0.360	Only 36% of case-specific intake questions asked
Fact Resolution	0.504	Half the covered facts remain legally miscalibrated
Issue Resolution	0.331	Memo misses decisive legal issues entirely

§ 5.0 What DLawBench Exposes

Without a hidden court record, a consultation may appear successful simply because the final advice sounds helpful. DLawBench penalizes user-responsive legal reasoning that lacks independent legal judgment.

High Coverage + Low Resolution

Fact Coverage 0.95, Fact Resolution 0.10 — memo absorbed client narrative, never legally reframed it. Claude Opus 4.6 case study: routed performance-pay dispute through wrong legal theory with 100% fidelity.

Sycophancy

Withdrawn client gap fill

Fact Coverage 0.818, Fact Resolution 0, Issue Resolution 0 — Kimi-K2.6 on a network portrait-rights dispute with 67-character average client replies. Missing facts were silently replaced with client-favorable assumptions.

Hallucination

High Fidelity + Wrong Route

Grounded claims, plausible prose, wrong legal theory. Fidelity measures whether claims are supported — not whether the analysis points at the right legal structure. A well-grounded wrong route still fails Issue Resolution.

Routing Error

Domain fine-tuning fails

LegalOne-8B: Resolution 0.076, Fidelity 0.268 — worst on both dimensions. Legal knowledge alone does not solve interactive consultation. The model needs to elicit, reframe, and reconstruct — not recall.

Knowledge ≠ Skill

§ 6.0 What DLawBench Enables

By separating client-belief and court-record views, DLawBench turns consultation failure into diagnosable capability gaps. Four diagnostic handles not available in prior benchmarks.

01

Sycophancy detection at fact level

Fact Coverage high, Fact Resolution low = memo absorbed client narrative. This gap is invisible without a hidden court record to measure against. Prior benchmarks with only the client-side view cannot detect it.

02

Client-style stress testing

Same case, four personalities. Cooperative-only evaluation inflates every metric. Dependent and Withdrawn clients are where legal guidance should matter most — and where models degrade the hardest. The paradox is now measurable.

03

Separating elicitation from reasoning

Elicitation and Resolution correlate at r = 0.94 but are not identical. Models that surface facts without reframing them score high on Coverage, low on Resolution. The gap localizes the failure: intake skill vs. legal reconstruction skill.

04

Legal sycophancy as structured error

Legal sycophancy is not mere agreement — models can deploy legal knowledge to support an unverified frame. DLawBench's five-metric hierarchy exposes whether a model is agreeing, filling, routing wrong, or genuinely reasoning from the record.

~/conclusion

$ query: what does DLawBench change // Legal consultation is not legal QA over complete facts. // The symbolic gap is between knowing law and recovering facts. // A hidden court record is required to see the real failure. $ query: what do the results cost // Under exact cooperative clients: inflated. Looks fine. // Under Dependent and Withdrawn: real drop, real signal. // Tune the model on client style, not just legal knowledge. $ query: what is the actual result // Best model: 0.562 Resolution. Substantial headroom. // One framework. Any client. No free pass on legal reasoning.

THE
CLIENT
IS NOT
THE
RECORD.

DLAWBENCH MULTI-TURN LEGAL CONSULTATION

DLAWBENCH
MULTI-TURN
LEGAL
CONSULTATION