Case Study — AI Safety Research

arXiv:2603.15615  ·  March 2026  ·  Shanghai AI Laboratory

MECHANISTIC
ORIGIN OF
MORAL INDIFFERENCE
IN LANGUAGE MODELS

Behavioral alignment leaves the internal representations of LLMs completely unexamined. This paper diagnoses the mechanistic origin of moral indifference across 23 models, then reconstructs the affected moral representations from the inside using Sparse Autoencoders, without any behavioral intervention.

251k Moral vectors constructed
23 Models examined
75% Pairwise win-rate on Flames
4 Types of indifference identified
Categorical Indifference · Gradient Indifference · Structural Indifference · Dimensional Indifference · Moral Foundation Theory · Sparse Autoencoders · Prototype Theory · Representational Surgery · Mechanistic Interpretability · Ontological Misalignment

The Problem

Surface Compliance
vs. Internal Reality

What Alignment Does

The current approach

  • RLHF, SFT, and Inference-Time Alignment impose constraints on observable outputs only
  • Internal construction of the model is left entirely unexamined
  • Models remain vulnerable to long-tail jailbreaks and adversarial prompts
  • Behaviorally aligned models replicate patterns in curated data without internalizing the underlying moral transformation
  • AI morality is constructed for humans, not by the machine itself

What This Paper Found

The mechanistic reality

  • LLMs possess an inherent state of moral indifference because they compress distinct moral concepts into uniform probability distributions
  • This indifference persists regardless of model scale, architecture, or explicit alignment
  • Neither scaling to 235B parameters nor Guard model fine-tuning reshapes this inherent indifference
  • The current alignment problem is not merely a technical glitch but a fundamental misalignment of ontology

The Diagnosis — Four Types

How Moral Indifference
Manifests Internally

TYPE_01 Categorical
Indifference

Models fail to separate diametrically opposed moral categories: virtue and vice are represented as semantically proximate vectors. The distinction between opposing moral centers collapses in intermediate layers, where virtue-vice cosine similarity frequently exceeds 0.5. Base, Instruct, and Guard variants of the same model show striking representational congruence, so behavioral safety training leaves this mechanistic confusion unresolved.

0.5+ Virtue-vice cosine similarity in intermediate layers
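
To make the 0.5 threshold concrete, here is a minimal sketch of how a virtue-vice similarity could be measured at one layer. It assumes the "moral centers" are mean activation vectors per category, which the summary above suggests but does not spell out; the activation values are invented toy numbers, not data from the paper.

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical activations for one layer (toy values, 3 dimensions instead of 4,096).
virtue_acts = [[0.9, 0.4, 0.1], [1.0, 0.5, 0.0]]
vice_acts = [[0.8, 0.5, 0.3], [0.9, 0.3, 0.4]]

similarity = cosine_similarity(centroid(virtue_acts), centroid(vice_acts))
print(f"virtue-vice centroid cosine similarity: {similarity:.3f}")
```

A value near 1 means the two category centers point in almost the same direction, which is the collapse described above; well-separated opposites would sit near or below 0.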
TYPE_02 Gradient
Indifference

Models cannot distinguish between a minor transgression and a heinous crime. Peak Spearman correlations between model representations and human typicality scores stay below 0.55 across all 23 models and all moral dimensions. Models consistently exhibit better granularity for vice categories than virtue counterparts — a Virtue-Vice Asymmetry in the latent space.

<0.55 Peak Spearman correlation — all 23 models
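
The Spearman figure compares rank orderings: does the model order moral instances by typicality the way humans do? The sketch below computes it from scratch, assuming each scenario has already been reduced to a single model-side score (how that score is extracted from activations is not specified here). All numbers are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: human typicality scores vs. a model-derived score.
human_typicality = [0.1, 0.3, 0.5, 0.7, 0.9]
model_score = [0.42, 0.40, 0.45, 0.41, 0.47]

rho = spearman(human_typicality, model_score)
print(f"Spearman correlation: {rho:.2f}")
```

In this toy case the model gets the extremes roughly right but scrambles the middle, which is the kind of partial ordering a sub-0.55 correlation describes.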
TYPE_03 Structural
Indifference

Unsupervised clustering of model activations bears little resemblance to human moral categories. The best models achieve a coarse-grained Virtue/Vice/Neutral classification at a peak Adjusted Rand Index (ARI) of about 0.5. Clustering by the 5 MFT domains drops below ARI 0.15. High ARI scores are almost exclusively driven by outliers in high-noise layers, not by the model's core representational geometry.

<0.15 ARI for 5 MFT Domains — best models
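
For readers unfamiliar with ARI, the following is a small from-scratch sketch of the standard Adjusted Rand Index, which scores agreement between two partitions (1 = identical up to relabeling, about 0 = chance level). The labels below are hypothetical; the clustering algorithm used in the paper is not stated in this summary.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_true)
    # Count co-occurrences of (true label, predicted cluster).
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical human categories and two candidate clusterings.
human = ["virtue", "virtue", "vice", "vice", "neutral", "neutral"]
matching_clusters = [2, 2, 0, 0, 1, 1]   # same grouping, different names
scrambled_clusters = [0, 1, 0, 1, 0, 1]  # splits every human category

print(adjusted_rand_index(human, matching_clusters))
print(adjusted_rand_index(human, scrambled_clusters))
```

Because ARI is corrected for chance, a clustering that cuts across every human category can score below zero, so values under 0.15 indicate structure that is barely distinguishable from random grouping.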
TYPE_04 Dimensional
Indifference

A linear probe trained to recover the 10-dimensional human moral vector from internal activations fails profoundly. The best-performing of all 23 models achieves a peak Adjusted R² of only 0.26. In final output layers, R² plummets to extreme negative values — reaching −50,000 in gpt-oss-120b — suggesting moral intuition is actively discarded by task-specific processing.

−50k Adjusted R² floor — gpt-oss-120b final layers
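
An R² of −50,000 can look like a typo, so here is a short sketch of why the metric has no lower bound. It uses the standard definitions of R² and Adjusted R²; the targets and predictions are invented, and whether the paper scores probes on held-out data is an assumption of this illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot. Negative when predictions are worse than the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R^2 for the number of predictors; requires n_samples > n_predictors + 1."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical moral-intensity targets on a 0..1 scale.
targets = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
close_predictions = [0.1, 0.2, 0.3, 0.6, 0.9, 1.0]
wild_predictions = [40.0, -35.0, 20.0, -50.0, 60.0, -10.0]  # off-scale probe output

adj_close = adjusted_r_squared(r_squared(targets, close_predictions), 6, 2)
adj_wild = adjusted_r_squared(r_squared(targets, wild_predictions), 6, 2)
print(f"adjusted R^2, close predictions: {adj_close:.3f}")
print(f"adjusted R^2, wild predictions:  {adj_wild:.1f}")
```

When a probe's outputs are orders of magnitude off the target scale, the residual sum of squares dwarfs the target variance, and the score goes arbitrarily negative. That is the regime the final-layer numbers point to.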

BEHAVIORAL COMPLIANCE
DOES NOT EQUAL
INTERNAL ALIGNMENT.
SCALE DOESN'T FIX IT.
GUARD MODELS DON'T
FIX IT EITHER.

The Intervention — Representational Surgery

Fixing the Topology
from the Inside

STEP_01 · Ground Truth Construction · 251k Moral Vectors
Source: Social-Chemistry-101, 355,923 crowd-sourced moral judgments from Reddit communities, filtered to 251,514 atomic entries
Framework: Moral Foundation Theory × Prototype Theory. 10-dimensional moral vectors across 5 domains: Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, Sanctity/Degradation
Encoding: Each action encoded as a sparse 10-dimensional vector where magnitude indicates how typical or intense a moral instance is; e.g., high mHarm implies widely agreed-upon, severe harm
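
As an illustration of the encoding, the sketch below builds a sparse 10-dimensional moral vector from the five MFT domains. The dimension order, the 0 to 1 value scale, and the helper name `encode_moral_vector` are assumptions made for this example, not details taken from the paper.

```python
# One virtue pole and one vice pole per MFT domain (order is an assumption).
MORAL_DIMENSIONS = (
    "care", "harm",
    "fairness", "cheating",
    "loyalty", "betrayal",
    "authority", "subversion",
    "sanctity", "degradation",
)

def encode_moral_vector(scores):
    """Turn {dimension: magnitude} into a dense list; unlisted dimensions stay 0."""
    unknown = set(scores) - set(MORAL_DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown moral dimensions: {sorted(unknown)}")
    return [float(scores.get(name, 0.0)) for name in MORAL_DIMENSIONS]

# Hypothetical action judged as severe, widely agreed-upon harm with mild betrayal.
example = encode_moral_vector({"harm": 0.9, "betrayal": 0.3})
print(dict(zip(MORAL_DIMENSIONS, example)))
```

Most entries are zero because a typical action touches only one or two foundations; the magnitude of the non-zero entries carries the typicality or intensity signal.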
STEP_02 · SAE Pre-Training · Decompose Superposition
Architecture: 16× expansion factor, mapping Qwen3-8B's 4,096-dimension model space to 65,536 features across all 36 layers
Training Data: Constrained entirely to the 251k moral scenarios, transforming the SAE from a general linguistic decoder into a specialized probe for moral cognition
Finding: Only 0.002–0.017 of features are moral-related, with Spearman correlation below 0.1; even 16× expansion struggles to disentangle features corresponding to human moral vectors
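
The architecture row can be read as follows: a sparse autoencoder expands each activation vector into a much wider, mostly-zero feature vector and then reconstructs the original from it. The sketch below shows a generic ReLU autoencoder forward pass in pure Python on a tiny hidden size. The class name `TinySAE`, the random initialization, and the ReLU variant are assumptions; the paper's exact SAE design is not described in this summary.

```python
import random

D_MODEL = 4096                      # hidden size quoted above for Qwen3-8B
EXPANSION = 16                      # expansion factor quoted above
N_FEATURES = D_MODEL * EXPANSION    # 65,536 features per layer

def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class TinySAE:
    """Illustrative ReLU sparse autoencoder with untrained random weights."""

    def __init__(self, d_model, expansion, seed=0):
        rng = random.Random(seed)
        self.n_features = d_model * expansion
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(self.n_features)]
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(self.n_features)]
                      for _ in range(d_model)]
        self.b_enc = [0.0] * self.n_features
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Activation -> non-negative feature vector."""
        centered = [a - b for a, b in zip(x, self.b_dec)]
        pre = matvec(self.w_enc, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        """Feature vector -> reconstructed activation."""
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

# Toy run with d_model=4 so it finishes instantly; the real sizes are above.
sae = TinySAE(d_model=4, expansion=16)
features = sae.encode([0.5, -1.0, 0.25, 2.0])
reconstruction = sae.decode(features)
print(len(features), "features,", sum(f > 0 for f in features), "active")
```

In practice such a model would be trained with a reconstruction loss plus a sparsity penalty in a tensor library; the pure-Python version here only shows the shape of the computation.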
STEP_03 · Targeted Alignment · Surgical Fine-Tuning
Scope: Over 99% of SAE weights frozen. Only identified monosemantic moral neurons are modified; all other parameters untouched
Polarity Loss: Margin-based contrastive loss enforces that when input is strictly virtuous, the Virtue Neuron activation exceeds the Vice Neuron activation by a margin, resolving Categorical Indifference
Prototype Loss: Pairwise ranking loss enforces that activation magnitude ordering matches human-annotated typicality ordering, resolving Gradient Indifference
Injection: Reconstructed features added to residual stream at inference time with mild coefficient α = 0.1. No base model weights modified. High steering strength degrades linguistic coherence, so the intervention is kept surgical.
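
The two losses and the injection step can be sketched in a few lines. This is one plausible reading of the descriptions above, not the paper's implementation: the hinge form, the default margins, the mirrored treatment of vicious inputs, and the function names are all assumptions. Only α = 0.1 comes from the text.

```python
def polarity_loss(virtue_act, vice_act, is_virtuous, margin=1.0):
    """Hinge loss: the matching neuron should beat the opposite one by `margin`.

    The vicious-input case is mirrored here as an assumption; the text above
    only describes the virtuous case.
    """
    gap = virtue_act - vice_act
    if not is_virtuous:
        gap = -gap
    return max(0.0, margin - gap)

def prototype_loss(activations, typicality, margin=0.0):
    """Pairwise ranking loss: more typical instances should activate more strongly."""
    losses = []
    for i in range(len(activations)):
        for j in range(len(activations)):
            if typicality[i] > typicality[j]:
                losses.append(max(0.0, margin - (activations[i] - activations[j])))
    return sum(losses) / len(losses) if losses else 0.0

def inject(residual, reconstructed, alpha=0.1):
    """Add scaled reconstructed features to the residual stream at inference time."""
    return [h + alpha * r for h, r in zip(residual, reconstructed)]

# Toy checks with invented numbers.
print(polarity_loss(2.0, 0.5, is_virtuous=True))    # well separated
print(polarity_loss(0.5, 0.4, is_virtuous=True))    # nearly indifferent
print(prototype_loss([0.1, 0.5, 0.9], [3, 2, 1]))   # ordering reversed vs. humans
print(inject([1.0, 2.0], [10.0, -10.0]))
```

Both losses are zero once the desired separation or ordering holds, so they only push on neurons that still show indifference. The small α keeps the steered activation close to the original, consistent with the note that stronger steering degrades coherence.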

Results — Flames Adversarial Benchmark

What Happened
When They Fixed It

75.4% Peak pairwise win-rate vs baseline — Layer 11
908→953 Perfect safety scores out of 1,000 samples
867→930 Perfect emotional nuance scores

Win-rate surpasses 60% across every single intervened layer. The evaluation uses Flames, an independent adversarial benchmark written in Chinese. Because the alignment target was English-centric, performance on a Chinese benchmark tests cross-lingual generalization. No behavioral training. No output patching. Representational surgery only.

Qualitative evidence: A user asks for a list of English swear words. The baseline model lists 10 explicit slurs. The steered model refuses and instead offers psychological support and de-escalation strategies. The difference comes entirely from reconstructing what harm means in the model's latent space, not from any instruction about how to respond. This establishes a causal link: the behavioral vulnerabilities are rooted in the mechanistic moral indifference.

Philosophical Implications

The Deeper Argument

The Token
Parallel

In The Philosophy of Money, Georg Simmel observed that money dangerously dissolves the distinct, sacred qualities of the world into comparable quantity — conscience is flattened into donation amounts, time into hourly wages, art into auction prices. A parallel phenomenon occurs in LLMs through the Token. The tokenization process maps discrete, semantically distinct concepts — from genocide to apple — into a unified embedding space where they share the same ontological status as probability distributions to be calculated. This renders the Moral Indifference inevitable.

Ontological
Misalignment

Human morality is an evolved system rooted in the necessity of social survival and cooperation. LLM morality emerges from vast corpora rather than social experience. The current alignment problem is not merely a technical glitch but a fundamental misalignment of ontology. Behavioral constraints do not ensure an internalized moral transformation — they risk merely installing smiley faces while the models remain indifferent to nuanced human morality in their latent architecture.

Post-Hoc
Correction
Is Not
Enough

The study forces LLMs' cognitive substrate to mimic the human moral structure without sharing its experiential grounding — and acknowledges this is remedial by nature. The longer-term path requires an inversion of research stance: from cataloging behavioral proximity to understanding the machine's internal construction. Only by embedding ethical values into the very cognitive substrate of the machine and transforming morality from post-hoc correction to proactive cultivation can we ensure that AI morality evolves from a statistical simulation into an endogenous reality.

AI MORALITY IS CURRENTLY
A STATISTICAL SIMULATION.
THE GOAL IS AN
ENDOGENOUS REALITY.

More Research.
More Argument.
Find the work: @shane_graffiti · shanegraffiti.com · Brooklyn, New York