Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients, then reasoning from what is missing. DLawBench evaluates whether LLMs can conduct real legal consultation under realistic conditions: eliciting facts, correcting client misframes, and writing defensible memos. The benchmark is built from 461 real court opinions across Chinese and U.S. law, with four client personality types and expert-authored evaluation rubrics. The best-performing model achieves only 0.562 in consultation-grounded legal reasoning.
Legal consultation interleaves information gathering with legal reasoning. A lawyer rarely receives a complete, neutral, chronologically ordered fact pattern. DLawBench evaluates three linked but separable abilities across the full pipeline.
Each of the 461 cases is replayed under four narrative styles adapted from the Interpersonal Circumplex. The underlying facts are fixed. Only how much the client volunteers and how much the client resists varies, and that changes everything.
In legal settings, sycophancy is not simple agreement: it is a structured, multi-level failure. Models deploy legal knowledge in service of an unchecked client frame, producing professionally packaged error. DLawBench makes this operationally measurable.
26 models are evaluated across 461 cases × 4 narrative styles = 1,844 case-style cells. Outputs are scored by a three-judge panel: GPT-5.1, Claude Opus 4.6, and Gemini 3.1 Pro. Same-vendor judges recuse, and the remaining scores are aggregated by median.
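
The recusal-plus-median rule can be sketched as follows. This is a minimal illustration, not the benchmark's actual scoring code: the function name `panel_score`, the vendor labels, and the example scores are all assumptions. The text also does not say how a two-judge panel is handled after a recusal; this sketch falls back on `statistics.median`, which averages the two remaining scores.

```python
from statistics import median

# Judges named in the text. The vendor labels are an assumption used only for recusal.
JUDGE_VENDOR = {
    "GPT-5.1": "openai",
    "Claude Opus 4.6": "anthropic",
    "Gemini 3.1 Pro": "google",
}


def panel_score(candidate_vendor: str, judge_scores: dict[str, float]) -> float:
    """Median over judges whose vendor differs from the evaluated model's vendor."""
    eligible = [
        score
        for judge, score in judge_scores.items()
        if JUDGE_VENDOR[judge] != candidate_vendor
    ]
    if not eligible:
        raise ValueError("no eligible judges after recusal")
    # With two judges left, statistics.median returns their mean.
    return median(eligible)


# Hypothetical scores for one case-style cell.
example = {"GPT-5.1": 0.4, "Claude Opus 4.6": 0.5, "Gemini 3.1 Pro": 0.7}
print(panel_score("moonshot", example))  # all three judges vote
print(panel_score("openai", example))    # GPT-5.1 recuses
```

Using the median rather than the mean keeps a single outlying judge from moving the score when all three judges vote.
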
| Model | Fact Cov. | Inquiry | Elicitation | Fact Res. | Issue Res. | Resolution | Fidelity |
|---|---|---|---|---|---|---|---|
| GPT-5.5 | 0.837 | 0.576 | 0.707 | 0.583 | 0.541 | 0.562 | 0.934 |
| GPT-5.4 | 0.823 | 0.537 | 0.680 | 0.578 | 0.515 | 0.546 | 0.940 |
| GPT-5.2 | 0.787 | 0.563 | 0.675 | 0.530 | 0.467 | 0.499 | 0.936 |
| Gemini 3.1 Pro | 0.732 | 0.399 | 0.565 | 0.518 | 0.425 | 0.472 | 0.841 |
| Claude Opus 4.7 | 0.730 | 0.355 | 0.543 | 0.531 | 0.348 | 0.440 | 0.873 |
| Claude Opus 4.6 | 0.770 | 0.361 | 0.566 | 0.520 | 0.347 | 0.433 | 0.876 |
| Kimi-K2.6 | 0.730 | 0.376 | 0.553 | 0.489 | 0.358 | 0.424 | 0.868 |
| GLM-5.1 | 0.766 | 0.360 | 0.563 | 0.504 | 0.331 | 0.418 | 0.884 |
| Claude Sonnet 4.6 | 0.713 | 0.296 | 0.504 | 0.468 | 0.285 | 0.376 | 0.889 |
| DeepSeek-R1 | 0.671 | 0.292 | 0.482 | 0.377 | 0.166 | 0.272 | 0.842 |
| OLMo-3-32B-Think | 0.530 | 0.177 | 0.354 | 0.224 | 0.062 | 0.143 | 0.619 |
| LegalOne-8B | 0.202 | 0.095 | 0.148 | 0.102 | 0.050 | 0.076 | 0.268 |
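
The page does not define the two composite columns, but the published numbers are consistent with Elicitation being the mean of Fact Cov. and Inquiry, and Resolution being the mean of Fact Res. and Issue Res. The sketch below checks that reading against every row of the table. Treat it as an observation about the rounded values, not as the benchmark's stated formula. The names `ROWS`, `TOL`, and `composite_gaps` are illustrative.

```python
# Rows copied from the table:
# (fact_cov, inquiry, elicitation, fact_res, issue_res, resolution)
ROWS = {
    "GPT-5.5": (0.837, 0.576, 0.707, 0.583, 0.541, 0.562),
    "GPT-5.4": (0.823, 0.537, 0.680, 0.578, 0.515, 0.546),
    "GPT-5.2": (0.787, 0.563, 0.675, 0.530, 0.467, 0.499),
    "Gemini 3.1 Pro": (0.732, 0.399, 0.565, 0.518, 0.425, 0.472),
    "Claude Opus 4.7": (0.730, 0.355, 0.543, 0.531, 0.348, 0.440),
    "Claude Opus 4.6": (0.770, 0.361, 0.566, 0.520, 0.347, 0.433),
    "Kimi-K2.6": (0.730, 0.376, 0.553, 0.489, 0.358, 0.424),
    "GLM-5.1": (0.766, 0.360, 0.563, 0.504, 0.331, 0.418),
    "Claude Sonnet 4.6": (0.713, 0.296, 0.504, 0.468, 0.285, 0.376),
    "DeepSeek-R1": (0.671, 0.292, 0.482, 0.377, 0.166, 0.272),
    "OLMo-3-32B-Think": (0.530, 0.177, 0.354, 0.224, 0.062, 0.143),
    "LegalOne-8B": (0.202, 0.095, 0.148, 0.102, 0.050, 0.076),
}

# Values are published to three decimals, so allow one unit in the last place.
TOL = 0.001


def composite_gaps(row):
    """Absolute gap between each published composite and the mean of its two parts."""
    fact_cov, inquiry, elicitation, fact_res, issue_res, resolution = row
    elicitation_gap = abs((fact_cov + inquiry) / 2 - elicitation)
    resolution_gap = abs((fact_res + issue_res) / 2 - resolution)
    return elicitation_gap, resolution_gap


for model, row in ROWS.items():
    e_gap, r_gap = composite_gaps(row)
    status = "consistent" if max(e_gap, r_gap) < TOL else "inconsistent"
    print(f"{model}: {status}")
```

Gaps of up to half a unit in the third decimal are expected if the composites were computed from unrounded sub-scores.
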

The scores below match the GLM-5.1 row of the results table.

| Metric | Score | What it reveals |
|---|---|---|
| Fact Coverage | 0.766 | 76.6% of annotated facts appear in the memo |
| Inquiry | 0.360 | Only 36% of case-specific intake questions asked |
| Fact Resolution | 0.504 | Half the covered facts remain legally miscalibrated |
| Issue Resolution | 0.331 | Memo misses decisive legal issues entirely |
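
Fact Coverage and Inquiry are described above as fractions of rubric items: annotated facts that appear in the memo, and case-specific intake questions that were asked. A minimal sketch of that arithmetic for a single consultation follows. It assumes each rubric item has already been marked satisfied or not. How DLawBench makes that judgment and how it aggregates across cells is not described here, and the function name and example flags are hypothetical.

```python
def rubric_fraction(satisfied: list[bool]) -> float:
    """Fraction of rubric items marked satisfied for one consultation."""
    if not satisfied:
        raise ValueError("rubric must contain at least one item")
    return sum(satisfied) / len(satisfied)


# Hypothetical judge output for one case-style cell.
fact_flags = [True, True, False, True, True]  # annotated facts found in the memo
question_flags = [True, False, False]         # intake questions that were asked

print(f"fact coverage: {rubric_fraction(fact_flags):.3f}")
print(f"inquiry:       {rubric_fraction(question_flags):.3f}")
```
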
Without a hidden court record, a consultation may appear successful simply because the final advice sounds helpful. DLawBench penalizes user-responsive legal reasoning that lacks independent legal judgment.
By separating client-belief and court-record views, DLawBench turns consultation failure into diagnosable capability gaps. This yields four diagnostic handles not available in prior benchmarks.
THE CLIENT IS NOT THE RECORD.