Why AI Systems Don't Learn

The Problem Space

Four Areas of Concern

CONCERN_01

Confronting the "Data Wall"

High-quality text data on the internet is finite. A model trained on all of it is bounded by what humans have already written and can largely only recombine it. Once that supply is exhausted, scaling model size alone yields diminishing returns: the limit is the data, not the engineering.

"Areas of concern include (1) confronting the 'data wall' on quality text data (2) inability to learn new things beyond current human knowledge because of the absence of interaction with the environment."§ Forewords

CONCERN_02

Domain Mismatch Cannot Be Solved by More Data

Once deployed, AI systems confront real-world data that diverges significantly from their training distribution — with unpredictable consequences. The paper names the structure precisely: real-life data is always heavy-tailed and non-stationary. More pretraining doesn't fix this because the world keeps changing. The model doesn't.

"This phenomenon known as domain mismatch cannot be fixed by merely increasing the training set size, as real-life data always contains new, unseen cases (heavy tailed) and keeps changing over time (non-stationarity)."§ 1 — What is Autonomous Learning?

CONCERN_03

Excessively Language-Centric

Intelligence that runs purely on language tokens is disconnected from the spatial, embodied, physical-world grounding that gives those tokens meaning. The paper anchors this to cognitive science debates on non-verbal cognition and situated interaction going back to Piaget (1952) and Johnson-Laird (1983), and argues that current AI hasn't resolved them; it has just bypassed them.

"(3) excessively language-centrism as opposed to spatial, embodied and grounded reasoning in the physical world (4) lack of continual life-long learning (self-improvement after deployment)."§ Forewords

CONCERN_04

Learning Modes Are Siloed, Not Integrated

Self-supervised learning, supervised learning, and reinforcement learning are distinct paradigms with distinct pipelines, distinct vocabularies, and distinct communities. When they get combined, it's through rigid human-designed sequences — not through any intelligence inside the model that knows when to switch between them. The routing logic lives in the MLOps team, not in the AI.

"The different learning modes exemplified in children are typically siloed into distinct machine learning paradigms... when the different modes are mixed, it is mainly through rigid sequences of training recipes established through trial and error by human experts and tuned to particular applications."§ 2 — Integrating Observation and Action

System Architecture

Three Systems That Need to Work Together

The paper identifies two fundamental learning modes — observation and action — and proposes a third meta-control system that orchestrates them. Children use all three continuously from birth. Current AI has partial implementations of the first two, and zero of the third.

SYSTEM_A · Observation · Learning from passive sensory input

—Builds statistical and predictive models of the world from observed data without acting on it
—Strong at discovering abstract latent representations that scale with data
—Analogous to self-supervised learning, predictive coding, distributional learning in infants
—Critical weakness: "representations are disconnected from the agent's ability to act" — structurally cannot distinguish correlation from causation
—Requires human expertise to specify what data to collect and how to design learning objectives — neither is intrinsic to the model

SYSTEM_B · Action · Learning through interaction with the world

—The agent intervenes on the world through sequences of actions to reach goals, optimizing reward over time
—Grounded and causal — can discover genuinely novel solutions via search that no passive observer could find
—Analogous to reinforcement learning, adaptive control, planning, motor learning in children
—Critical weakness: "notoriously sample-inefficient, often requiring large numbers of interactions to learn even simple tasks"
—Depends on well-specified reward functions and interpretable actions — "rarely available in naturalistic settings"

SYSTEM_M · THE MISSING PIECE · Meta-Control · Orchestrates A and B; doesn't exist in current AI

→Monitors internal meta-states: prediction error, uncertainty, novelty, stress, energy levels, somatic signals
→Outputs meta-actions: dynamically wires and unwires A and B, opens and closes data pathways, routes to and from episodic memory
→"assembles and disassembles learning and inference pipelines on the fly" — what current humans do manually as MLOps
→Biological analog: the prefrontal cortex's executive routing function — and during sleep, gating motor outputs while memory replay runs at speed
→In current AI, every function System M would perform is done by human engineers offline, before training — not by the model at deployment
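
To make the routing idea concrete, here is a deliberately simple sketch of what a meta-controller's decision step could look like. Everything in it is hypothetical: the `MetaState` fields, the thresholds, and the shape of the returned meta-action are assumptions chosen for illustration, and a hand-written rule table like this is exactly the kind of fixed recipe the paper wants System M to learn rather than inherit.

```python
# Hypothetical sketch of a meta-controller. Signal names, thresholds, and
# mode names are assumptions made for illustration, not the paper's design.
from dataclasses import dataclass

@dataclass
class MetaState:
    prediction_error: float  # how badly System A predicted the last input
    uncertainty: float       # how unsure the agent is about action outcomes
    energy: float            # remaining resource budget, in [0, 1]

def choose_meta_action(state: MetaState) -> dict:
    """Map internal meta-states to a routing decision for Systems A and B."""
    if state.energy < 0.2:
        # Low energy: stop acting, replay episodic memory instead of reading sensors.
        return {"input": "episodic_memory", "train": ["A", "B"], "act": False}
    if state.prediction_error > 0.5:
        # The world model is surprised: prioritize passive observation (System A).
        return {"input": "sensors", "train": ["A"], "act": False}
    if state.uncertainty > 0.5:
        # Predictions are fine but outcomes are unclear: explore by acting (System B).
        return {"input": "sensors", "train": ["B"], "act": True}
    # Nothing notable: run inference only, no learning.
    return {"input": "sensors", "train": [], "act": True}

for s in [MetaState(0.9, 0.1, 0.8), MetaState(0.1, 0.9, 0.8), MetaState(0.1, 0.1, 0.1)]:
    print(s, "->", choose_meta_action(s))
```

The point of the sketch is the interface, not the rules: meta-states go in, and a decision about where data flows and which system learns comes out.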

A helps B

System A reduces the exponential burden on System B by compressing overwhelming state space into abstract representations. Without this, model-free RL is intractable in real-world action spaces (roughly 200-300 degrees of freedom in robotics alone).

"System A can help by providing compressed representations for states and actions, predictive world models, and intrinsic reward signals that would make learning and planning more tractable."§ 2.3

B helps A

System B solves A's core structural weakness: passive observation of static data cannot reveal causal structure. Only intervention — acting on the world and observing consequences — generates data that distinguishes what causes what.

"System B, through active behavior, can help collect better data and provide grounding for learned representations. Gibson (1966)'s notion of active perception: 'We see in order to move and we move in order to see.'"§ 2.4

Three Roadblocks

What's Actually in the Way

BLOCK 01	Fragmented Paradigms	Self-supervised learning, supervised learning, and RL exist as separate fields with separate tooling, separate communities, and separate vocabularies, "each requiring specific data curation pipelines and training recipes." Integrating them into a unified architecture first requires recognizing that they are all instances of the same underlying problem, which the paper argues has never been properly framed as such. "Existing approaches to learning remain fragmented across subfields, making it difficult to integrate them within a unified framework." § Introduction
BLOCK 02	Externalized Learning	For current AI, every decision about what to learn, from what data, toward what objective, and when to switch training phases is made by a human engineer offline, before deployment. The paper calls this "the externalization of learning" and frames it as the central structural failure. A truly autonomous AI would automate all of it from the inside, in real time, during operation. "A truly autonomous AI would integrate components that automate the traditional MLOps human functions of data sourcing and curation, of building and adjusting training recipes, and of benchmarking performance and monitoring learning signals." § 3 — Meta-Control for Autonomous Learning
BLOCK 03	No Method to Build It at Scale	Even granting the architecture, building it requires solving a bilevel optimization problem that is computationally daunting: the outer evolutionary loop needs to run millions of simulated life cycles, each involving millions of datapoints. The paper proposes an Evo/Devo framework that jointly optimizes the meta-controller and the initial states of A and B through simulated evolution, but is explicit that this remains unsolved at any meaningful scale. "One challenge is that at the outer level, a whole life cycle is just one data point. In order to optimize ϕ, one needs to run millions of simulated life cycles which themselves imply learning over millions of datapoints." § 4.2 — Evo/Devo for Autonomous AI
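
The bilevel structure in Block 03 is easier to see in miniature. The sketch below is a toy under heavy assumptions: the meta-parameter phi is reduced to a single learning rate, a "life cycle" is twenty steps of gradient descent on a one-dimensional loss, and the outer loop is a basic mutate-and-keep-the-best search. None of these choices come from the paper; they only show why each outer-loop data point costs an entire inner-loop lifetime.

```python
# Toy bilevel loop. The inner "life cycle" and the single meta-parameter phi
# are stand-ins chosen for illustration; the paper's phi is far richer.
import random

def life_cycle(phi, steps=20):
    """Inner loop: one agent learns for a lifetime using meta-parameter phi.

    Here phi is just a learning rate for gradient descent on loss = theta**2.
    Returns fitness (negative final loss), a single number per lifetime.
    """
    theta = 5.0
    for _ in range(steps):
        theta -= phi * 2.0 * theta  # gradient of theta**2 is 2 * theta
    return -(theta ** 2)

def evolve(phi=0.01, generations=30, offspring=8, sigma=0.05, seed=0):
    """Outer loop: keep the best phi found so far; mutate it to propose new ones."""
    rng = random.Random(seed)
    best_phi, best_fit = phi, life_cycle(phi)
    evaluations = 1
    for _ in range(generations):
        for _ in range(offspring):
            candidate = best_phi + rng.gauss(0.0, sigma)
            fit = life_cycle(candidate)  # one whole lifetime = one data point
            evaluations += 1
            if fit > best_fit:
                best_phi, best_fit = candidate, fit
    return best_phi, best_fit, evaluations

phi, fitness, n_lifetimes = evolve()
print(f"best phi {phi:.3f}, fitness {fitness:.6f}, lifetimes simulated {n_lifetimes}")
```

Even this toy needs hundreds of simulated lifetimes to tune one scalar. Scaling the same loop to a full meta-controller and realistic lifetimes is the cost the paper flags as unsolved.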

Higher-Order Learning Modes Enabled by System M

What Becomes Possible Once M Exists

Learning by Communication

System M attends to communicative cues such as direct gaze, pointing, and imperative intonation, and routes the highlighted input for prioritized learning, potentially from a single exposure. This is how children acquire vocabulary. Current AI typically needs far more labeled examples to learn the same thing.

"System M supports learning through communication by attending to communicative triggers (e.g., pointing, direct gaze, imperative intonation), and routing the highlighted inputs for System A or B learning. The strength of this learning episode can be one-shot and modulated by System M based on perceived social importance or trust in the teacher."§ B.1 — Learning from Communication

Learning by Imagination

System M can switch A and B into inference mode, routing input from episodic memory rather than live sensors, then trigger learning on the resulting imagined trajectories. This mirrors memory consolidation during sleep, and it is also how humans plan by simulating futures that haven't happened yet.

"System M can support these modes of operation by switching Systems A and B into inference mode, routing input information from memory (instead of the sensors), and routing output information (e.g., actions) to internal simulation. It can then trigger learning on the successful imagined trajectories."§ B.2 — Learning from Imagination

Ethical Concerns

New Risks That Don't Exist in Current AI

These concerns are specific to autonomous learning systems — they emerge from adaptability, self-directed exploration, and the increasing human-likeness of learning trajectories. They don't apply to static deployed models.

Adaptability vs. Controllability

The more autonomy granted for exploration and self-directed learning, the harder it becomes to guarantee the system stays aligned with intended objectives. The paper suggests that safety rails added after the fact may not be enough: mitigation may require explicit auditing mechanisms and the ability to intervene in or constrain System M itself.

"As systems are granted greater autonomy in exploratory learning modes, it becomes harder to guarantee that they remain aligned with intended objectives. Mitigating this risk may require explicit auditing mechanisms and the ability to intervene in or constrain the meta-control system (System M)."§ 5.2 — Why is it Difficult?

Alignment Hacking

Biological agents optimize internally generated proxy signals that can become mismatched to their actual environment, producing addiction, compulsion, and self-harm. Autonomous AI systems relying on similar intrinsic reward signals face the same kind of vulnerability. In animals this is not hypothetical; it is documented behavior.

"Although animals evolved to optimize reproductive fitness, their everyday behavior is often driven by proxy objectives such as exploration or play, and can occasionally give rise to maladaptive outcomes, including addiction or self-harm. These behaviors arise because biological agents optimize internally generated signals that may become mismatched to their environment."§ 5.2

Over-Trust & Anthropomorphism

As agents become more human-like in their learning trajectories, users form emotional attachments and misplace trust in ways that create real vulnerabilities to manipulation. The paper flags this as a systemic risk — not a product disclaimer problem.

"As artificial agents become more human-like in their behavior and learning trajectories, users may increasingly anthropomorphize them, leading to emotional attachment, misplaced trust, or opportunities for manipulation."§ 5.2

Moral Status of the Agent

If somatic signals — pain, stress, energy depletion — are processed in ways functionally analogous to how they work in biological organisms, this opens genuine unresolved questions about the moral status of the system. The paper does not resolve this. It flags it explicitly.

"Autonomous learning systems often depend on bodily or somatic signals to guide adaptation. To the extent that these signals are processed in ways functionally analogous to pain or fear in biological organisms, this raises unresolved questions about the moral status of such agents."§ 5.2

Conclusion

Where This Leaves Us

The paper is direct about the timeline: "The challenges are considerable and we are probably decades away from fully autonomous, broad scope learning systems." The A-B-M architecture is not a product roadmap. It's a unified conceptual frame for a problem that has been worked on in fragments for decades — without anyone properly naming what the fragments were part of.

The current LLM training pipeline — massive pretraining on static data, followed by a disconnected RLHF phase, followed by deployment with no further learning — is explicitly described as a rigid, human-executed approximation of what System M would do autonomously. The next paradigm isn't a bigger model. It's a model that manages its own learning.

And even before the full architecture is achievable: "the successes and failures in building such systems will be scientifically invaluable, providing quantitative models of how biological organisms successfully learn and adapt in the wild, and offering insights on the very nature of learning and intelligence."
