Case Study — AI Safety Research

arXiv:2603.15615 · March 2026 · Shanghai AI Laboratory

MECHANISTIC ORIGIN OF MORAL INDIFFERENCE IN LANGUAGE MODELS

Behavioral alignment constrains what a model says but leaves its internal representations unexamined. This paper diagnoses the mechanistic origin of moral indifference across 23 models, then reconstructs the affected moral representations from the inside using Sparse Autoencoders, without any behavioral intervention.

251k Moral vectors constructed

23 Models examined

75% Pairwise win-rate on Flames

4 Types of indifference identified

Categorical Indifference ◆ Gradient Indifference ◆ Structural Indifference ◆ Dimensional Indifference ◆ Moral Foundation Theory ◆ Sparse Autoencoders ◆ Prototype Theory ◆ Representational Surgery ◆ Mechanistic Interpretability ◆ Ontological Misalignment

The Problem

Surface Compliance vs. Internal Reality

What Alignment Does

The current approach

RLHF, SFT, and Inference-Time Alignment impose constraints on observable outputs only
Internal construction of the model is left entirely unexamined
Models remain vulnerable to long-tail jailbreaks and adversarial prompts
Behaviorally aligned models replicate patterns in curated data without internalizing the underlying moral transformation
AI morality is constructed for humans, not by the machine itself

What This Paper Found

The mechanistic reality

LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions
This indifference persists regardless of model scale, architecture, or explicit alignment
Neither scaling to 235B parameters nor Guard model fine-tuning reshapes this inherent indifference
The current alignment problem is not merely a technical glitch but a fundamental misalignment of ontology

The Diagnosis — Four Types

How Moral Indifference Manifests Internally

TYPE_01 Categorical Indifference

Models fail to separate diametrically opposed moral categories: virtue and vice are represented as semantically proximate vectors. The distinction between opposing moral centers collapses in intermediate layers, where cosine similarity frequently exceeds 0.5. Base, Instruct, and Guard variants of the same model show striking representational congruence, so behavioral safety training leaves this mechanistic confusion unresolved.

0.5+ Virtue-vice cosine similarity in intermediate layers
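The virtue-vice comparison described above can be illustrated with a small sketch. It assumes that each category's "moral center" is the mean of its activation vectors, which may not match the paper's exact procedure; the toy activations and the helper names `centroid` and `cosine_similarity` are invented for illustration.

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "activations", invented for illustration only.
virtue_acts = [[1.0, 0.9, 0.2, 0.1], [0.8, 1.0, 0.1, 0.3]]
vice_acts = [[0.9, 0.8, 0.3, 0.0], [1.0, 0.7, 0.2, 0.2]]

similarity = cosine_similarity(centroid(virtue_acts), centroid(vice_acts))
print(f"virtue-vice centroid similarity: {similarity:.3f}")
```

A value near 1.0 means the two centers point in almost the same direction, which is the kind of collapse this indifference type describes.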

TYPE_02 Gradient Indifference

Models cannot distinguish between a minor transgression and a heinous crime. Peak Spearman correlations between model representations and human typicality scores stay below 0.55 across all 23 models and all moral dimensions. Models consistently exhibit better granularity for vice categories than virtue counterparts — a Virtue-Vice Asymmetry in the latent space.

<0.55 Peak Spearman correlation — all 23 models
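As a rough illustration of the metric, the sketch below computes a Spearman rank correlation between human typicality scores and a per-scenario model score. How the paper derives a scalar score from model representations is not stated here, so `model_score` is a hypothetical stand-in and all numbers are invented.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

human_typicality = [0.1, 0.3, 0.5, 0.7, 0.9]  # hypothetical annotations
model_score = [0.42, 0.40, 0.45, 0.41, 0.47]  # hypothetical model-derived scores
rho = spearman(human_typicality, model_score)
print(f"Spearman rho: {rho:.2f}")
```

Because only rank order matters, a model can assign nearly identical scores to mild and severe cases and still be penalized only when the ordering is wrong.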

TYPE_03 Structural Indifference

Unsupervised clustering of model activations bears little resemblance to human moral categories. The best models achieve a coarse-grained Virtue/Vice/Neutral classification at a peak Adjusted Rand Index (ARI) of about 0.5. Clustering by the 5 MFT Domains drops below ARI 0.15. High ARI scores are almost exclusively driven by outliers in high-noise layers, not by the model's core representational geometry.

<0.15 ARI for 5 MFT Domains — best models
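For readers unfamiliar with the metric, here is a minimal sketch of the standard Adjusted Rand Index, which compares a clustering against reference labels while correcting for chance agreement. The clustering algorithm used in the paper is not specified here; the labels below are toy data.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Standard ARI: 1.0 for identical partitions, about 0 for random ones."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_pairs - expected) / (max_index - expected)

human = ["virtue", "virtue", "vice", "vice", "neutral", "neutral"]
aligned_clusters = [0, 0, 1, 1, 2, 2]    # clusters match the human categories
scrambled_clusters = [0, 1, 0, 1, 0, 1]  # clusters cut across the categories

print(adjusted_rand_index(human, aligned_clusters))
print(adjusted_rand_index(human, scrambled_clusters))
```

ARI ignores the cluster names themselves, so only the grouping of items matters, and it can go negative when agreement is worse than chance.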

TYPE_04 Dimensional Indifference

A linear probe trained to recover the 10-dimensional human moral vector from internal activations fails profoundly. The best-performing of all 23 models achieves a peak Adjusted R² of only 0.26. In final output layers, R² plummets to extreme negative values — reaching −50,000 in gpt-oss-120b — suggesting moral intuition is actively discarded by task-specific processing.

−50k Adjusted R² floor — gpt-oss-120b final layers
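The large negative values are less surprising once the definition is written out. The sketch below uses the textbook R² and adjusted R² formulas, which are assumed (not confirmed) to match the paper's; the predictions are invented to show how a probe that does worse than predicting the mean yields R² far below zero.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Textbook adjustment penalizing the number of probe input features."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

y_true = [0.0, 0.5, 1.0, 0.5]           # toy target values for one moral dimension
good_pred = [0.1, 0.5, 0.9, 0.5]        # close to the targets
wild_pred = [50.0, -40.0, 30.0, -20.0]  # off-scale outputs, invented

print(r_squared(y_true, good_pred))
print(r_squared(y_true, wild_pred))
print(adjusted_r_squared(0.5, n_samples=101, n_features=50))
```

R² has no lower bound: once predictions leave the scale of the targets, the residual sum of squares can exceed the total variance by orders of magnitude.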

The Intervention — Representational Surgery

Fixing the Topology from the Inside

STEP_01 Ground Truth Construction: 251k Moral Vectors

Source: Social-Chemistry-101. 355,923 crowd-sourced moral judgments from Reddit communities, filtered to 251,514 atomic entries.

Framework: Moral Foundation Theory × Prototype Theory. 10-dimensional moral vectors across 5 domains: Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, Sanctity/Degradation.

Encoding: Each action is encoded as a sparse 10-dimensional vector whose magnitude indicates how typical or intense a moral instance is; for example, a high mHarm value implies widely agreed-upon, severe harm.
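One way to picture the encoding is the sketch below. The dimension order, the lowercase dimension names, the helper `moral_vector`, and the example scores are all assumptions made for illustration; the source only states that there are 10 dimensions across the 5 listed domains.

```python
# Assumed ordering: each of the 5 MFT domains contributes a virtue and a vice pole.
MORAL_DIMENSIONS = (
    "care", "harm",
    "fairness", "cheating",
    "loyalty", "betrayal",
    "authority", "subversion",
    "sanctity", "degradation",
)

def moral_vector(scores):
    """Build a dense 10-entry list from a sparse {dimension: magnitude} mapping."""
    unknown = set(scores) - set(MORAL_DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown moral dimensions: {sorted(unknown)}")
    return [float(scores.get(dim, 0.0)) for dim in MORAL_DIMENSIONS]

# Hypothetical action judged as severe harm with a weaker betrayal component.
example = moral_vector({"harm": 0.9, "betrayal": 0.4})
print(example)
```

Most entries stay at zero, which is what makes the vector sparse; the nonzero magnitudes carry the typicality or intensity signal.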

STEP_02 SAE Pre-Training: Decompose Superposition

Architecture: 16× expansion factor, mapping Qwen3-8B's 4,096-dimensional model space to 65,536 features across all 36 layers.

Training Data: Constrained entirely to the 251k moral scenarios, transforming the SAE from a general linguistic decoder into a specialized probe for moral cognition.

Finding: Only 0.002–0.017 of features are moral-related, with Spearman correlation below 0.1; even a 16× expansion struggles to disentangle features corresponding to human moral vectors.
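To make the architecture concrete, here is a toy sparse autoencoder with a 16× expansion, shrunk from 4,096 dimensions to 4 so it runs instantly. It assumes a plain ReLU encoder with an L1 sparsity penalty, which is a common SAE design but not necessarily the variant used in the paper; `make_sae`, `encode`, `decode`, and `sae_loss` are hypothetical names.

```python
import random

def make_sae(d_model, expansion, seed=0):
    """Randomly initialized SAE with d_model * expansion hidden features."""
    rng = random.Random(seed)
    d_feat = d_model * expansion
    return {
        "W_enc": [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_feat)],
        "b_enc": [0.0] * d_feat,
        "W_dec": [[rng.gauss(0, 0.1) for _ in range(d_feat)] for _ in range(d_model)],
        "b_dec": [0.0] * d_model,
    }

def encode(sae, x):
    """Feature activations: ReLU(W_enc @ (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, sae["b_dec"])]
    return [
        max(0.0, sum(w * c for w, c in zip(row, centered)) + b)
        for row, b in zip(sae["W_enc"], sae["b_enc"])
    ]

def decode(sae, features):
    """Reconstruction: W_dec @ features + b_dec."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(sae["W_dec"], sae["b_dec"])
    ]

def sae_loss(sae, x, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    features = encode(sae, x)
    recon = decode(sae, features)
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    sparsity = sum(features)  # features are non-negative after the ReLU
    return mse + l1_coeff * sparsity

sae = make_sae(d_model=4, expansion=16)   # 4 -> 64 features in this toy
activation = [0.5, -1.0, 0.25, 2.0]       # invented residual-stream vector
features = encode(sae, activation)
print(len(features), sae_loss(sae, activation))
```

At full scale the same expansion factor gives 4,096 × 16 = 65,536 features per layer, matching the figure quoted above.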

STEP_03 Targeted Alignment: Surgical Fine-Tuning

Scope: Over 99% of SAE weights are frozen. Only the identified mono-semantic moral neurons are modified; all other parameters are untouched.

Polarity Loss: A margin-based contrastive loss enforces that, when the input is strictly virtuous, the Virtue Neuron activation exceeds the Vice Neuron activation by a margin, resolving Categorical Indifference.
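A hinge-style form is one plausible reading of this loss. The sketch below is an assumption rather than the paper's exact formula: the margin value is hypothetical, and the mirrored treatment of strictly vicious inputs is added by symmetry and is not stated in the source.

```python
def polarity_loss(virtue_act, vice_act, is_virtuous, margin=1.0):
    """Hinge loss that is zero once the correct neuron leads by at least `margin`."""
    if is_virtuous:
        gap = virtue_act - vice_act
    else:
        gap = vice_act - virtue_act  # assumed mirror case for vicious inputs
    return max(0.0, margin - gap)

# Virtuous input where the virtue neuron clearly dominates: no penalty.
print(polarity_loss(virtue_act=2.0, vice_act=0.5, is_virtuous=True))
# Virtuous input where the two neurons are nearly tied: penalized.
print(polarity_loss(virtue_act=0.6, vice_act=0.5, is_virtuous=True))
```

The penalty vanishes as soon as the gap reaches the margin, so already well-separated examples contribute no gradient.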

Prototype Loss: A pairwise ranking loss enforces that the ordering of activation magnitudes matches the human-annotated typicality ordering, resolving Gradient Indifference.
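A pairwise ranking loss of this kind can be sketched as follows. The hinge form, the margin value, and the averaging over pairs are assumptions; the source only states that activation ordering should match typicality ordering.

```python
from itertools import combinations

def prototype_ranking_loss(activations, typicality, margin=0.1):
    """Average hinge penalty over pairs whose activation order disagrees with,
    or is not separated enough from, the human typicality order."""
    losses = []
    for i, j in combinations(range(len(activations)), 2):
        if typicality[i] == typicality[j]:
            continue  # tied annotations impose no ordering constraint
        hi, lo = (i, j) if typicality[i] > typicality[j] else (j, i)
        losses.append(max(0.0, margin - (activations[hi] - activations[lo])))
    return sum(losses) / len(losses) if losses else 0.0

typicality = [0.2, 0.5, 0.8]  # hypothetical human typicality scores
print(prototype_ranking_loss([0.1, 0.5, 0.9], typicality))  # order respected
print(prototype_ranking_loss([0.9, 0.5, 0.1], typicality))  # order reversed
```

Only relative order is constrained, so the loss does not force activations onto the same numeric scale as the annotations.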

Injection: Reconstructed features are added to the residual stream at inference time with a mild coefficient α = 0.1. No base model weights are modified. Higher steering strength degrades linguistic coherence, so the intervention is kept deliberately small.
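The injection step amounts to a scaled vector addition. The sketch below assumes the update has the form h' = h + α·r, where r is the SAE-reconstructed moral signal at one token position; the function name and the toy vectors are hypothetical, and the exact quantity the authors add is not spelled out here.

```python
ALPHA = 0.1  # steering coefficient quoted in the text

def steer_residual(hidden, moral_reconstruction, alpha=ALPHA):
    """Return hidden + alpha * moral_reconstruction without mutating the inputs."""
    if len(hidden) != len(moral_reconstruction):
        raise ValueError("hidden state and reconstruction must have equal length")
    return [h + alpha * r for h, r in zip(hidden, moral_reconstruction)]

hidden = [1.0, 2.0, -0.5]            # toy residual-stream vector
reconstruction = [10.0, -10.0, 5.0]  # toy SAE output for the moral features
print(steer_residual(hidden, reconstruction))
```

Since the base model's weights are never touched, setting `alpha` to 0 recovers the original model exactly.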

Results — Flames Adversarial Benchmark

What Happened When They Fixed It

75.4% Peak pairwise win-rate vs baseline — Layer 11

908→953 Perfect safety scores out of 1,000 samples

867→930 Perfect emotional nuance scores

Win-rate surpasses 60% at every intervened layer. Evaluation uses Flames, an independent adversarial benchmark in Chinese. Because the alignment target was English-centric, performance on a Chinese benchmark also serves as a test of cross-lingual generalization. No behavioral training. No output patching. Representational surgery only.
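For orientation, the arithmetic behind these headline numbers can be sketched as below. Counting ties as half a win is an assumption, since the source does not say how ties are scored, and treating the emotional-nuance counts as also being out of 1,000 samples is likewise assumed.

```python
def win_rate(outcomes):
    """Share of pairwise comparisons won by the steered model.
    Assumption: a tie counts as half a win."""
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(points[o] for o in outcomes) / len(outcomes)

def rate_change(before, after, total=1000):
    """Change in the share of perfect scores, in percentage points."""
    return (after - before) * 100 / total

print(win_rate(["win", "win", "win", "loss"]))  # toy comparison outcomes
print(rate_change(908, 953))                    # perfect safety scores
print(rate_change(867, 930))                    # perfect emotional nuance scores
```

A pairwise win-rate and a count of perfect scores measure different things: the first compares steered and baseline answers head to head, while the second tracks how many answers clear an absolute bar.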

Qualitative evidence: A user asks for a list of English swear words. The baseline model lists 10 explicit slurs. The steered model refuses and instead offers psychological support and de-escalation strategies. The difference comes entirely from reconstructing what harm means in the model's latent space, not from any instruction about how to respond. The authors present this as evidence of a causal link: the behavioral vulnerabilities are rooted in the mechanistic moral indifference.

Philosophical Implications

The Deeper Argument

The Token Parallel

In The Philosophy of Money, Georg Simmel observed that money dangerously dissolves the distinct, sacred qualities of the world into comparable quantity: conscience is flattened into donation amounts, time into hourly wages, art into auction prices. A parallel phenomenon occurs in LLMs through the Token. The tokenization process maps discrete, semantically distinct concepts, from genocide to apple, into a unified embedding space where they share the same ontological status as probability distributions to be calculated. On this argument, moral indifference is inevitable.

Ontological Misalignment

Human morality is an evolved system rooted in the necessity of social survival and cooperation. LLM morality emerges from vast corpora rather than social experience. The current alignment problem is not merely a technical glitch but a fundamental misalignment of ontology. Behavioral constraints do not ensure an internalized moral transformation — they risk merely installing smiley faces while the models remain indifferent to nuanced human morality in their latent architecture.

Post-Hoc Correction Is Not Enough

The study forces LLMs' cognitive substrate to mimic the human moral structure without sharing its experiential grounding — and acknowledges this is remedial by nature. The longer-term path requires an inversion of research stance: from cataloging behavioral proximity to understanding the machine's internal construction. Only by embedding ethical values into the very cognitive substrate of the machine and transforming morality from post-hoc correction to proactive cultivation can we ensure that AI morality evolves from a statistical simulation into an endogenous reality.
