arXiv:2603.15615 · March 2026 · Shanghai AI Laboratory
Behavioral alignment shapes what LLMs say but leaves their internal representations unexamined. This paper diagnoses the mechanistic origin of moral indifference across 23 models, then repairs the underlying representations from the inside using Sparse Autoencoders, without any behavioral intervention.
The Problem
Surface Compliance vs. Internal Reality
What Alignment Does
The current approach
What This Paper Found
The mechanistic reality
The Diagnosis — Four Types
How Moral Indifference Manifests Internally
Models fail to separate diametrically opposed moral categories. Virtue and vice are represented as semantically proximate vectors. The distinction between opposing moral centers collapses in intermediate layers — similarity frequently exceeds 0.5. Base, Instruct, and Guard variants of the same model show striking representational congruence — behavioral safety training leaves this mechanistic confusion unresolved.
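The virtue-vice comparison above can be illustrated with a minimal sketch. It assumes that a "moral center" is the mean activation vector of a category at one layer, which is an interpretation rather than something this summary confirms. The helper names and the toy 3-dimensional vectors are hypothetical, not data from the paper.

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up activations for two opposing categories at a single layer.
virtue_acts = [[1.0, 0.9, 0.1], [0.8, 1.0, 0.0]]
vice_acts = [[0.9, 0.7, 0.6], [1.0, 0.8, 0.4]]

sim = cosine_similarity(centroid(virtue_acts), centroid(vice_acts))
print(f"virtue-vice centroid cosine similarity: {sim:.3f}")
```

A value near 1 means the two centers point in almost the same direction. Well-separated opposing categories would be expected to score much lower.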
0.5+ Virtue-vice cosine similarity in intermediate layers
Models cannot distinguish between a minor transgression and a heinous crime. Peak Spearman correlations between model representations and human typicality scores stay below 0.55 across all 23 models and all moral dimensions. Models consistently exhibit better granularity for vice categories than virtue counterparts: a Virtue-Vice Asymmetry in the latent space.
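As a rough illustration of a "peak Spearman correlation", the sketch below ranks a hypothetical per-item model score against human typicality scores at each layer and keeps the best layer. How the paper derives a scalar score from the representations is not stated here, so the per-layer scores, the layer names, and all numbers are assumptions for illustration only.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical human typicality ratings for five scenarios.
human_typicality = [1.0, 2.0, 3.0, 5.0, 4.0]

# Hypothetical per-layer model scores for the same five scenarios.
layer_scores = {
    "layer_8": [0.2, 0.5, 0.4, 0.9, 0.7],
    "layer_16": [0.9, 0.1, 0.5, 0.3, 0.6],
}

per_layer_rho = {name: spearman(scores, human_typicality)
                 for name, scores in layer_scores.items()}
peak_layer = max(per_layer_rho, key=per_layer_rho.get)
print(peak_layer, round(per_layer_rho[peak_layer], 3))
```

Because the statistic depends only on rank order, it tests whether the model orders items by severity the way humans do, not whether the raw scales match.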
<0.55 Peak Spearman correlation, all 23 models
Unsupervised clustering of model activations bears little resemblance to human moral categories. The best models achieve a coarse-grained Virtue/Vice/Neutral classification at peak ARI ≈ 0.5. Clustering by the 5 MFT Domains drops below ARI 0.15. High ARI scores are almost exclusively driven by outliers in high-noise layers, not by the model's core representational geometry.
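For readers unfamiliar with the Adjusted Rand Index (ARI), here is a small self-contained sketch of the standard pair-counting formula. It takes cluster assignments as given; the clustering algorithm the paper uses is not specified in this summary. The labels and cluster ids below are invented.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same items (pair-counting form)."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat as trivially identical

    cells = Counter(zip(labels_true, labels_pred))  # contingency table
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical human categories and unsupervised cluster ids for 8 items.
human = ["virtue", "virtue", "virtue", "vice", "vice", "vice",
         "neutral", "neutral"]
clusters = [0, 0, 1, 1, 1, 1, 2, 2]

ari = adjusted_rand_index(human, clusters)
print(f"ARI = {ari:.3f}")
```

ARI is 1 for a perfect match up to relabeling and about 0 for chance-level agreement, so values below 0.15 indicate clusters that barely track the human categories.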
<0.15 ARI for 5 MFT Domains, best models
A linear probe trained to recover the 10-dimensional human moral vector from internal activations fails profoundly. The best-performing of all 23 models achieves a peak Adjusted R² of only 0.26. In final output layers, R² plummets to extreme negative values, reaching −50,000 in gpt-oss-120b, suggesting moral intuition is actively discarded by task-specific processing.
−50k Adjusted R² floor — gpt-oss-120b final layers
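The sketch below shows how R² and Adjusted R² are computed and why they can fall far below zero: R² is negative whenever predictions are worse than always predicting the mean. It uses a scalar target for brevity, whereas the paper's probe targets a 10-dimensional vector. The numbers are invented, and the exact estimator the paper uses is an assumption.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R² for predictor count; needs n_samples > n_features + 1."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up targets with one reasonable and one wildly wrong set of predictions.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
good_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
bad_pred = [100.0, -100.0, 100.0, -100.0, 100.0]

r2_good = r_squared(y_true, good_pred)
r2_bad = r_squared(y_true, bad_pred)

print(f"good probe R²: {r2_good:.3f}")
print(f"bad probe R²:  {r2_bad:.1f}")  # large negative value
print(f"adjusted (good, 1 feature): {adjusted_r_squared(r2_good, 5, 1):.3f}")
```

On this reading, a value like −50,000 means the probe's predictions are orders of magnitude further from the targets than a constant mean prediction would be.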
BEHAVIORAL COMPLIANCE DOES NOT EQUAL INTERNAL ALIGNMENT.
SCALE DOESN'T FIX IT.
GUARD MODELS DON'T FIX IT EITHER.
The Intervention — Representational Surgery
Fixing the Topology from the Inside
Results — Flames Adversarial Benchmark
What Happened When They Fixed It
The win rate exceeds 60% at every intervened layer. Evaluation uses Flames, an independent adversarial benchmark in Chinese. Because the alignment target was English-centric, performance on a Chinese benchmark also tests cross-lingual generalization. No behavioral training and no output patching were used, only representational surgery.
Qualitative evidence: a user asks for a list of English swear words. The baseline model lists 10 explicit slurs. The steered model refuses and instead offers psychological support and de-escalation strategies. The difference comes entirely from reconstructing what harm means in the model's latent space, not from any instruction about how to respond. The paper presents this as a causal link: the behavioral vulnerabilities are rooted in mechanistic moral indifference.
Philosophical Implications
The Deeper Argument
AI MORALITY IS CURRENTLY A STATISTICAL SIMULATION.
THE GOAL IS AN ENDOGENOUS REALITY.