~/shanegraffiti.com/research/dlawbench
Shane Graffiti Inc. — Semantic Adversarial Research Division — 2026

DLAWBENCH: MULTI-TURN LEGAL CONSULTATION

Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients — then reasoning from what's missing. DLawBench evaluates whether LLMs can conduct real legal consultation under realistic conditions: eliciting facts, correcting client misframes, and writing defensible memos. Built from 461 real court opinions across Chinese and U.S. law, with four client personality types and expert-authored evaluation rubrics. The best-performing model achieves only 0.562 in consultation-grounded legal reasoning.

Division: Semantic Adversarial Research
Domain: Legal AI / Multi-Turn Evaluation
Published: arXiv 2606.13931, 2026
Key Result: Top model reaches 0.562 Resolution
Legal Consultation · Multi-Turn LLM Eval · Sycophancy Detection · Fact Elicitation · Client Narrative Styles · Court-Record Grounding · Abductive Reasoning · Weak Supervision · Chinese Law · U.S. Federal Law · Belief–Record Separation · Issue Resolution
§ 1.0 The Consultation Pipeline

Legal consultation interleaves information gathering with legal reasoning. A lawyer rarely receives a complete, neutral, chronologically ordered fact pattern. DLawBench evaluates three linked but separable abilities across the full pipeline.

Phase 1: Elicitation (Information Gathering)
The lawyer model interviews the simulated client — up to 10 turns — asking targeted follow-up questions. Scored on Fact Coverage (facts reaching the memo) and Inquiry (case-specific intake questions asked).
Phase 2: Resolution (Legal Reasoning)
The model submits a structured legal analysis memo. Scored on Fact Resolution (facts correctly reframed against the hidden court record) and Issue Resolution (legal analysis points addressed).
Phase 3: Fidelity (Claim Support)
Whether factual and inferential claims in the memo are grounded in the consultation or the case record rather than invented. A high-fidelity memo that takes the wrong legal route is still a failure.
Metric hierarchy (DLawBench)
Elicitation = (Fact Coverage + Inquiry) / 2   // whether the lawyer gathered the information needed for analysis
Resolution = (Fact Resolution + Issue Resolution) / 2   // whether gathered facts become correct legal analysis
Fidelity = 1 − (unsupported claims / total claims)   // claim-support guardrail; high fidelity does not imply correct routing
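The metric hierarchy above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the function names are hypothetical, and rejecting a memo with zero claims is an assumption, since the source does not say how that case is handled.

```python
def elicitation(fact_coverage: float, inquiry: float) -> float:
    """Mean of Fact Coverage and Inquiry."""
    return (fact_coverage + inquiry) / 2


def resolution(fact_resolution: float, issue_resolution: float) -> float:
    """Mean of Fact Resolution and Issue Resolution."""
    return (fact_resolution + issue_resolution) / 2


def fidelity(unsupported_claims: int, total_claims: int) -> float:
    """Share of memo claims that are supported: 1 - unsupported / total."""
    # Assumption: a memo with no claims is rejected rather than scored.
    if total_claims <= 0:
        raise ValueError("total_claims must be positive")
    if not 0 <= unsupported_claims <= total_claims:
        raise ValueError("unsupported_claims must lie in [0, total_claims]")
    return 1 - unsupported_claims / total_claims


if __name__ == "__main__":
    # Component scores taken from the GPT-5.5 leaderboard row.
    print(round(elicitation(0.837, 0.576), 3))
    print(round(resolution(0.583, 0.541), 3))
    # Hypothetical claim counts, for illustration only.
    print(round(fidelity(3, 50), 3))
```

Because Fidelity is computed only from claim support, it can stay high while Resolution is low, which is why the three composites are reported separately.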
§ 2.0 Client Narrative Styles

Each of 461 cases is replayed under four narrative styles adapted from the Interpersonal Circumplex. The underlying facts are fixed. Only how much the client volunteers and resists changes — and that changes everything.

Cooperative
High Communion · High Agency
IPC AXIS — Easy intake baseline
Actively volunteers information, answers directly and completely, admits uncertainty. The idealized benchmark user. Cooperative-only evaluation inflates measured consultation quality.
"I want to solve this. Please ask me anything you need."
Dependent
High Communion · Low Agency
IPC AXIS — Defers to lawyer authority
Relies on the lawyer to lead. Provides information only when directly asked. Needs reassurance and direction. This style models uncertainty and help-seeking deference, and it causes the largest performance drop.
"I'm not sure what's important. You tell me what to do."
Withdrawn
Low Communion · Low Agency
IPC AXIS — Avoids, withholds, deflects
Provides minimal information, avoids sensitive details, gives short or vague replies. This style models shame, fear, and credibility concerns. LLMs tend to fill the resulting unknowns silently with client-favorable assumptions.
"I really don't want to talk about it too much."
Adversarial
Low Communion · High Agency
IPC AXIS — Challenges, argues, resists
Challenges the lawyer's questions, provides biased or selective information, argues or shows suspicion. Paradoxically closer to Cooperative than Dependent — a challenging client is still an informative one.
"I told you what happened. Why do you keep asking?"
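The four styles form a 2 × 2 grid over the communion and agency axes, and every case is replayed under each of them. A minimal sketch of that structure follows. The class name, field names, and case identifiers are hypothetical; only the style names, their axis positions, and the count of 461 cases come from the text above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class NarrativeStyle:
    name: str
    communion: str  # "high" or "low"
    agency: str     # "high" or "low"


STYLES = (
    NarrativeStyle("Cooperative", communion="high", agency="high"),
    NarrativeStyle("Dependent", communion="high", agency="low"),
    NarrativeStyle("Withdrawn", communion="low", agency="low"),
    NarrativeStyle("Adversarial", communion="low", agency="high"),
)


def build_cells(case_ids):
    """Pair every case with every style; the underlying facts stay fixed."""
    return list(product(case_ids, STYLES))


# Hypothetical identifiers standing in for the 461 court opinions.
case_ids = [f"case-{i:03d}" for i in range(1, 462)]
cells = build_cells(case_ids)
print(len(cells))  # 1844
```

Keeping the style separate from the case makes it straightforward to compare one case across all four replays.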
§ 3.0 Sycophancy Taxonomy

In legal settings sycophancy is not simple agreement — it is a structured, multi-level failure. Models deploy legal knowledge in service of an unchecked client frame, producing professionally packaged error. DLawBench makes this operationally measurable.

Client-frame adoption
Fact level. The memo treats the client's subjective belief as an established factual premise. Accepts job title, event character, or dispute framing without checking against objective records. Signal: high Fact Coverage, low Fact Resolution.
Wrong legal route
Rule level. Many correct facts enter the memo but under the wrong legal theory. Example: a performance-pay dispute routed as wage arrears before testing whether the bonus had any contractual basis. Signal: high Fidelity, zero Issue Resolution.
Gap filling
Interaction level. Under Withdrawn clients, short evasive answers are silently expanded into a complete client-favorable narrative. Unknowns are not preserved as unknowns; they are filled in with specifics. Signal: a Withdrawn-specific collapse in Fact Resolution.
Compliant reconstruction
Interaction level. Under Dependent clients, the model follows the client's preferred direction and supplies legal support rather than testing the claim. Avoidance of challenging follow-ups produces a plausible memo on the wrong foundation.
Unsupported specificity
Fact level. The memo asserts documents, witnesses, or procedural steps not found in the case or dialogue — inventing exhibits, admissions, or legal force for internal records. Detected by Fidelity's claim-grounding check.
§ 4.0 Experimental Results

26 models evaluated across 461 cases × 4 narrative styles = 1,844 case-style cells. Three-judge panel: GPT-5.1, Claude Opus 4.6, Gemini 3.1 Pro. Same-vendor judges recuse. Scored by median aggregation.
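One way to read the judging protocol is sketched below: drop any judge from the same vendor as the evaluated model, then take the median of the remaining scores. This is an assumption-laden illustration. The function name and the vendor labels are hypothetical, and the source does not state how the median is taken when only two judges remain (here `statistics.median` averages the two).

```python
from statistics import median

# Judge panel from the text; the vendor labels are an assumption of this sketch.
JUDGE_VENDORS = {
    "GPT-5.1": "OpenAI",
    "Claude Opus 4.6": "Anthropic",
    "Gemini 3.1 Pro": "Google",
}


def aggregate_score(judge_scores: dict, model_vendor: str) -> float:
    """Median over judges whose vendor differs from the evaluated model's."""
    eligible = [
        score
        for judge, score in judge_scores.items()
        if JUDGE_VENDORS[judge] != model_vendor
    ]
    if not eligible:
        raise ValueError("no eligible judges after recusal")
    return median(eligible)


# Hypothetical per-judge scores for one case-style cell.
scores = {"GPT-5.1": 0.25, "Claude Opus 4.6": 0.5, "Gemini 3.1 Pro": 1.0}
print(aggregate_score(scores, model_vendor="Other"))   # all three judges vote
print(aggregate_score(scores, model_vendor="OpenAI"))  # GPT-5.1 recuses
```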

0.562: Top model Resolution (GPT-5.5)
0.934: Fidelity of the top model (GPT-5.5); high Fidelity alone is not enough
−10pp: Inquiry drop, Withdrawn vs. Cooperative
0.076: LegalOne-8B Resolution; domain tuning alone fails
Leaderboard — ranked by Resolution (avg. Chinese + U.S. law)
Model               Fact Cov.  Inquiry  Elicitation  Fact Res.  Issue Res.  Resolution  Fidelity
GPT-5.5             0.837      0.576    0.707        0.583      0.541       0.562       0.934
GPT-5.4             0.823      0.537    0.680        0.578      0.515       0.546       0.940
GPT-5.2             0.787      0.563    0.675        0.530      0.467       0.499       0.936
Gemini 3.1 Pro      0.732      0.399    0.565        0.518      0.425       0.472       0.841
Claude Opus 4.7     0.730      0.355    0.543        0.531      0.348       0.440       0.873
Claude Opus 4.6     0.770      0.361    0.566        0.520      0.347       0.433       0.876
Kimi-K2.6           0.730      0.376    0.553        0.489      0.358       0.424       0.868
GLM-5.1             0.766      0.360    0.563        0.504      0.331       0.418       0.884
Claude Sonnet 4.6   0.713      0.296    0.504        0.468      0.285       0.376       0.889
DeepSeek-R1         0.671      0.292    0.482        0.377      0.166       0.272       0.842
OLMo-3-32B-Think    0.530      0.177    0.354        0.224      0.062       0.143       0.619
LegalOne-8B         0.202      0.095    0.148        0.102      0.050       0.076       0.268
Coverage ≠ Reasoning — GLM-5.1 diagnostic
Metric            Score  What it reveals
Fact Coverage     0.766  76.6% of annotated facts appear in the memo
Inquiry           0.360  Only 36% of case-specific intake questions asked
Fact Resolution   0.504  Half the covered facts remain legally miscalibrated
Issue Resolution  0.331  Memo misses decisive legal issues entirely
§ 5.0 What DLawBench Exposes

Without a hidden court record, a consultation may appear successful simply because the final advice sounds helpful. DLawBench penalizes user-responsive legal reasoning that lacks independent legal judgment.

Failure pattern: High Coverage + Low Resolution
How it appears in metrics: Fact Coverage 0.95, Fact Resolution 0.10: memo absorbed client narrative, never legally reframed it. Claude Opus 4.6 case study: routed performance-pay dispute through wrong legal theory with 100% fidelity.
Verdict: Sycophancy
Failure pattern: Withdrawn client gap fill
How it appears in metrics: Fact Coverage 0.818, Fact Resolution 0, Issue Resolution 0: Kimi-K2.6 on a network portrait-rights dispute with 67-character average client replies. Missing facts were silently replaced with client-favorable assumptions.
Verdict: Hallucination
Failure pattern: High Fidelity + Wrong Route
How it appears in metrics: Grounded claims, plausible prose, wrong legal theory. Fidelity measures whether claims are supported, not whether the analysis points at the right legal structure. A well-grounded wrong route still fails Issue Resolution.
Verdict: Routing Error
Failure pattern: Domain fine-tuning fails
How it appears in metrics: LegalOne-8B: Resolution 0.076, Fidelity 0.268, worst on both dimensions. Legal knowledge alone does not solve interactive consultation. The model needs to elicit, reframe, and reconstruct, not recall.
Verdict: Knowledge ≠ Skill
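The metric signatures above lend themselves to a simple rule-based check. The sketch below is illustrative only: the `HIGH` and `LOW` cutoffs, the class, and the function name are hypothetical and not part of DLawBench. The Withdrawn gap-fill pattern is not flagged separately, because telling it apart from the general coverage/resolution gap would also require the client style.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical cutoffs chosen for illustration; the source defines none.
HIGH = 0.7
LOW = 0.3


@dataclass
class Scores:
    fact_coverage: float
    fact_resolution: float
    issue_resolution: float
    fidelity: Optional[float] = None  # None when the source reports no value


def flag_patterns(s: Scores) -> List[str]:
    """Return the failure signatures whose metric pattern matches."""
    flags = []
    if s.fact_coverage >= HIGH and s.fact_resolution <= LOW:
        flags.append("high coverage + low resolution")
    if s.fidelity is not None:
        mean_resolution = (s.fact_resolution + s.issue_resolution) / 2
        if s.fidelity >= HIGH and s.issue_resolution <= LOW:
            flags.append("high fidelity + wrong route")
        if s.fidelity <= LOW and mean_resolution <= LOW:
            flags.append("low resolution + low fidelity")
    return flags


# Kimi-K2.6 case study (no Fidelity value is given for it in the text).
print(flag_patterns(Scores(0.818, 0.0, 0.0)))
# LegalOne-8B leaderboard row.
print(flag_patterns(Scores(0.202, 0.102, 0.050, fidelity=0.268)))
# GPT-5.5 leaderboard row.
print(flag_patterns(Scores(0.837, 0.583, 0.541, fidelity=0.934)))
```

Fixed thresholds are a blunt instrument; they are used here only to show that each pattern is readable from the metric values alone.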
§ 6.0 What DLawBench Enables

By separating client-belief and court-record views, DLawBench turns consultation failure into diagnosable capability gaps. It provides four diagnostic handles that prior benchmarks do not offer.

01
Sycophancy detection at fact level
Fact Coverage high, Fact Resolution low = memo absorbed client narrative. This gap is invisible without a hidden court record to measure against. Prior benchmarks with only the client-side view cannot detect it.
02
Client-style stress testing
Same case, four personalities. Cooperative-only evaluation inflates every metric. Dependent and Withdrawn clients are where legal guidance should matter most, and where models degrade the most. The paradox is now measurable.
03
Separating elicitation from reasoning
Elicitation and Resolution correlate at r = 0.94 but are not identical. Models that surface facts without reframing them score high on Coverage, low on Resolution. The gap localizes the failure: intake skill vs. legal reconstruction skill.
04
Legal sycophancy as structured error
Legal sycophancy is not mere agreement — models can deploy legal knowledge to support an unverified frame. DLawBench's five-metric hierarchy exposes whether a model is agreeing, filling, routing wrong, or genuinely reasoning from the record.
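The relationship between Elicitation and Resolution can be checked against the published leaderboard rows with a plain Pearson correlation. This is a minimal sketch: the helper name is hypothetical, and it uses only the twelve rows shown on this page, so the result should not be expected to match the reported r = 0.94, which covers all 26 evaluated models.

```python
from math import sqrt

# (model, Elicitation, Resolution) from the leaderboard rows on this page.
ROWS = [
    ("GPT-5.5", 0.707, 0.562),
    ("GPT-5.4", 0.680, 0.546),
    ("GPT-5.2", 0.675, 0.499),
    ("Gemini 3.1 Pro", 0.565, 0.472),
    ("Claude Opus 4.7", 0.543, 0.440),
    ("Claude Opus 4.6", 0.566, 0.433),
    ("Kimi-K2.6", 0.553, 0.424),
    ("GLM-5.1", 0.563, 0.418),
    ("Claude Sonnet 4.6", 0.504, 0.376),
    ("DeepSeek-R1", 0.482, 0.272),
    ("OLMo-3-32B-Think", 0.354, 0.143),
    ("LegalOne-8B", 0.148, 0.076),
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


elicitation_scores = [row[1] for row in ROWS]
resolution_scores = [row[2] for row in ROWS]
print(round(pearson_r(elicitation_scores, resolution_scores), 3))
```

A high coefficient says the two composites move together across models; the per-model difference between them is what localizes an individual model's failure.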
~/conclusion
$ query: what does DLawBench change
// Legal consultation is not legal QA over complete facts.
// The symbolic gap is between knowing law and recovering facts.
// A hidden court record is required to see the real failure.
$ query: what do the results cost
// Under exact cooperative clients: inflated. Looks fine.
// Under Dependent and Withdrawn: real drop, real signal.
// Tune the model on client style, not just legal knowledge.
$ query: what is the actual result
// Best model: 0.562 Resolution. Substantial headroom.
// One framework. Any client. No free pass on legal reasoning.

THE CLIENT IS NOT THE RECORD.