The dominant AI paradigm — hyperscaling LLMs with ever-larger models, data, and compute — is built on a fundamental omission. Autonomous learning is not a feature of current AI. It never was. This paper argues that is the central problem, and proposes a unified three-system architecture drawn directly from how biological organisms learn.
AI MODELS, ONCE DEPLOYED, LEARN ESSENTIALLY NOTHING. THEIR MODE OF OPERATION IS FIXED.
Confronting the "Data Wall"
Quality text data on the internet is finite. A model trained on all of it cannot exceed the frontier of what humans have already written; it can only recombine it. Past that ceiling, scaling model size yields diminishing returns because the data has run out, not because of a temporary engineering limitation.
Domain Mismatch Cannot Be Solved by More Data
Once deployed, AI systems confront real-world data that diverges significantly from their training distribution — with unpredictable consequences. The paper names the structure precisely: real-life data is always heavy-tailed and non-stationary. More pretraining doesn't fix this because the world keeps changing. The model doesn't.
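The non-stationarity half of that argument can be illustrated with a toy stream. This is a minimal sketch, not anything from the paper: the linear drift, the `adapt_rate` parameter, and the function name `run_stream` are all assumptions chosen for illustration, and the heavy-tailed aspect is not modeled here.

```python
def run_stream(n_steps=200, drift_per_step=0.05, adapt_rate=0.2):
    """Compare a frozen predictor with an online one on a drifting signal.

    All parameters are illustrative assumptions, not values from the paper.
    """
    frozen_estimate = 0.0   # fitted once before deployment, never updated
    online_estimate = 0.0   # keeps updating after deployment
    frozen_error = 0.0
    online_error = 0.0
    for step in range(n_steps):
        target = drift_per_step * step   # the world keeps moving
        frozen_error += abs(target - frozen_estimate)
        online_error += abs(target - online_estimate)
        # Online update: move a fraction of the way toward what was observed.
        online_estimate += adapt_rate * (target - online_estimate)
    # Mean absolute error over the whole stream for each predictor.
    return frozen_error / n_steps, online_error / n_steps


frozen_mae, online_mae = run_stream()
print(f"frozen MAE: {frozen_mae:.3f}, online MAE: {online_mae:.3f}")
```

The frozen predictor's error grows with the drift for as long as the stream runs, while the online predictor's error settles at a bounded lag. Training the frozen predictor on more historical data would not change that, because the target it needs has not been generated yet.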
Excessively Language-Centric
Intelligence that runs purely on language tokens is disconnected from the spatial, embodied, physical-world grounding that gives those tokens meaning. The paper anchors this to cognitive science debates going back to Johnson-Laird (1983) and Piaget (1952) — non-verbal cognition and situated interaction — and argues current AI hasn't resolved them, it's just bypassed them.
Learning Modes Are Siloed, Not Integrated
Self-supervised learning, supervised learning, and reinforcement learning are distinct paradigms with distinct pipelines, distinct vocabularies, and distinct communities. When they get combined, it's through rigid human-designed sequences — not through any intelligence inside the model that knows when to switch between them. The routing logic lives in the MLOps team, not in the AI.
A CHILD LEARNING A NEW TOY USES ALL FOUR MODES IN SECONDS. CURRENT AI USES ONE. PERMANENTLY.
The paper identifies two fundamental learning modes, observation (System A) and action (System B), and proposes a third meta-control system (System M) that orchestrates them. Children use all three continuously from birth. Current AI has partial implementations of the first two and none of the third.
System A reduces the exponential burden on System B by compressing overwhelming state space into abstract representations. Without this, model-free RL is intractable in real-world action spaces (roughly 200-300 degrees of freedom in robotics alone).
System B solves A's core structural weakness: passive observation of static data cannot reveal causal structure. Only intervention — acting on the world and observing consequences — generates data that distinguishes what causes what.
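The difference between observing and intervening can be shown on a toy world with a hidden common cause. This is an illustrative sketch only: the variables `Z`, `X`, `Y`, the coefficients, and the helper names `sample` and `slope` are assumptions made up for the example, not a construction from the paper.

```python
import random


def sample(rng, intervene_x=None):
    """One draw from a toy world where hidden Z drives both X and Y."""
    z = rng.gauss(0.0, 1.0)                # hidden common cause
    if intervene_x is None:
        x = z + rng.gauss(0.0, 0.1)        # passively observed X tracks Z
    else:
        x = intervene_x                    # agent sets X, cutting the Z -> X link
    y = 2.0 * z + rng.gauss(0.0, 0.1)      # Y depends on Z only, never on X
    return x, y


def slope(pairs):
    """Least-squares slope of y regressed on x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    return cov / var


rng = random.Random(0)
observed = [sample(rng) for _ in range(5000)]
intervened = [sample(rng, intervene_x=rng.choice([-1.0, 1.0])) for _ in range(5000)]

observational_slope = slope(observed)       # large: X looks predictive of Y
interventional_slope = slope(intervened)    # near zero: X does not cause Y
print(observational_slope, interventional_slope)
```

In the passive data, X predicts Y strongly even though it has no causal effect. Only when the agent sets X itself does the dependence vanish, which is the kind of evidence the text attributes to System B.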
BLOCK 01: Fragmented Paradigms
Self-supervised learning, supervised learning, and RL exist as separate fields with separate tooling, separate communities, and separate vocabularies, "each requiring specific data curation pipelines and training recipes." Integrating them into a unified architecture requires first recognizing they are all instances of the same underlying problem, which the paper argues has never been properly framed as such.
"Existing approaches to learning remain fragmented across subfields, making it difficult to integrate them within a unified framework." (§ Introduction)
BLOCK 02: Externalized Learning
Every decision a current AI makes about what to learn, from what data, optimizing for what objective, and when to switch training phases, is made by a human engineer offline before deployment. The paper calls this "the externalization of learning" and frames it as the central structural failure. A truly autonomous AI would automate all of it from the inside, in real time, during operation.
"A truly autonomous AI would integrate components that automate the traditional MLOps human functions of data sourcing and curation, of building and adjusting training recipes, and of benchmarking performance and monitoring learning signals." (§ 3: Meta-Control for Autonomous Learning)
BLOCK 03: No Method to Build It at Scale
Even granting the architecture, actually building it requires solving a bilevel optimization problem that is computationally brutal: the outer evolutionary loop needs to run millions of simulated life cycles, each involving millions of datapoints. The paper proposes an Evo/Devo framework, jointly optimizing the meta-controller and initial states of A and B through simulated evolution, but is explicit that this remains unsolved at any meaningful scale.
"One challenge is that at the outer level, a whole life cycle is just one data point. In order to optimize ϕ, one needs to run millions of simulated life cycles which themselves imply learning over millions of datapoints." (§ 4.2: Evo/Devo for Autonomous AI)
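The bilevel structure, and why it is expensive, can be seen in a toy version. This is a minimal sketch under strong assumptions: here ϕ is just a pair (initial weight, learning rate), the "life" is gradient descent on a one-dimensional quadratic, and the outer loop is a simple mutate-and-select scheme. None of these choices come from the paper; the names `life_cycle` and `evolve` are hypothetical.

```python
import random


def life_cycle(phi, n_steps=50, target=3.0):
    """Inner loop: one simulated life of gradient descent on (w - target)^2.

    phi = (initial weight, learning rate) is fixed at birth by the outer loop.
    Returns lifetime fitness (negative final loss): one scalar per whole life.
    """
    w, lr = phi
    for _ in range(n_steps):
        grad = 2.0 * (w - target)
        w -= lr * grad
    return -((w - target) ** 2)


def evolve(n_generations=30, population_size=20, seed=0):
    """Outer loop: mutate-and-select over phi using lifetime fitness only."""
    rng = random.Random(seed)
    parent = (0.0, 0.001)                  # deliberately poor starting phi
    parent_fitness = life_cycle(parent)
    n_lives = 1
    for _ in range(n_generations):
        for _ in range(population_size):
            # Mutate both components; clamp the learning rate to a stable range.
            new_w0 = parent[0] + rng.gauss(0.0, 0.5)
            new_lr = min(0.9, max(1e-4, parent[1] * 2 ** rng.gauss(0.0, 0.5)))
            child = (new_w0, new_lr)
            fitness = life_cycle(child)    # a whole life buys one data point
            n_lives += 1
            if fitness > parent_fitness:
                parent, parent_fitness = child, fitness
    return parent, parent_fitness, n_lives


best_phi, best_fitness, n_lives = evolve()
print(best_phi, best_fitness, n_lives)
```

Even in this toy, every outer-loop evaluation costs a full inner run of `n_steps` updates, so total cost is the product of the two loops. Scaling both factors into the millions is the obstacle the quoted passage describes.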
Learning by Communication
System M attends to communicative cues (direct gaze, pointing, imperative intonation) and routes the highlighted input for prioritized learning, potentially in a single exposure. This is how children acquire vocabulary. Current AI requires millions of labeled examples to accomplish the same task.
Learning by Imagination
System M can switch A and B into inference mode, routing input from episodic memory rather than live sensors, then trigger learning on the resulting imagined trajectories. This is how memory consolidation during sleep works, and how humans plan by simulating futures that haven't happened yet.
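The routing role described for System M can be sketched as a mode selector. This is a hand-written illustration, not the paper's design: the `Signals` fields, the threshold, and the fixed priority order in `select_mode` are all hypothetical, and in the proposed architecture this policy would be learned or evolved rather than coded by a person.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Mode(Enum):
    OBSERVE = auto()       # System A learns from passive sensory input
    ACT = auto()           # System B learns from the consequences of actions
    COMMUNICATE = auto()   # prioritized learning from a communicative cue
    IMAGINE = auto()       # replay from episodic memory instead of live sensors


@dataclass
class Signals:
    """Hypothetical meta-level signals; the paper does not specify these fields."""
    communicative_cue: bool        # e.g. gaze, pointing, imperative intonation
    sensors_idle: bool             # no live input worth processing
    prediction_error: float        # how surprising the current input is
    error_threshold: float = 0.5   # arbitrary illustrative cutoff


def select_mode(signals: Signals) -> Mode:
    """Toy System M policy: a fixed priority order, purely illustrative."""
    if signals.communicative_cue:
        return Mode.COMMUNICATE
    if signals.sensors_idle:
        return Mode.IMAGINE
    if signals.prediction_error > signals.error_threshold:
        return Mode.ACT            # assumption: surprise triggers intervention
    return Mode.OBSERVE


print(select_mode(Signals(communicative_cue=True, sensors_idle=False, prediction_error=0.1)))
```

The point of the sketch is where the decision lives: the switch between learning modes is made by a component inside the agent at run time, rather than by a training schedule fixed before deployment.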
These concerns are specific to autonomous learning systems — they emerge from adaptability, self-directed exploration, and the increasing human-likeness of learning trajectories. They don't apply to static deployed models.
The more autonomy granted for exploration and self-directed learning, the harder it becomes to guarantee the system stays aligned with intended objectives. This cannot be solved by external safety rails added after the fact — it has to be built into System M itself as an intrinsic auditing capability.
Biological agents optimize internally-generated proxy signals that can become mismatched to their actual environment — producing addiction, compulsion, self-harm. Autonomous AI systems relying on similar intrinsic reward signals face exactly the same vulnerability. This is not a hypothetical. It's documented animal behavior.
As agents become more human-like in their learning trajectories, users form emotional attachments and misplace trust in ways that create real vulnerabilities to manipulation. The paper flags this as a systemic risk — not a product disclaimer problem.
If somatic signals — pain, stress, energy depletion — are processed in ways functionally analogous to how they work in biological organisms, this opens genuine unresolved questions about the moral status of the system. The paper does not resolve this. It flags it explicitly.
The paper is direct about the timeline: "The challenges are considerable and we are probably decades away from fully autonomous, broad scope learning systems." The A-B-M architecture is not a product roadmap. It's a unified conceptual frame for a problem that has been worked on in fragments for decades — without anyone properly naming what the fragments were part of.
The current LLM training pipeline — massive pretraining on static data, followed by a disconnected RLHF phase, followed by deployment with no further learning — is explicitly described as a rigid, human-executed approximation of what System M would do autonomously. The next paradigm isn't a bigger model. It's a model that manages its own learning.
And even before the full architecture is achievable: "the successes and failures in building such systems will be scientifically invaluable, providing quantitative models of how biological organisms successfully learn and adapt in the wild, and offering insights on the very nature of learning and intelligence."
THE NEXT PARADIGM ISN'T A BIGGER MODEL. IT'S ONE THAT MANAGES ITS OWN LEARNING.