Scene Models Are Not Domain Models
In the current moment in AI, the concept of a world model is both inescapable and confused.
There is no applied artificial intelligence without some notion of the domain the intelligence is meant to operate over. Whether the term is used directly or merely implied by the stated goals and architectural approach, every system carries some representation of what exists, what can happen, what matters, and what can be acted upon. Some of these representations are more formal than others.
But the term “world model” can mean very different things depending on who is using it. It can mean a model of physical reality: visual scenes, spatial dynamics, navigable 3D environments, the kind of structure a neural network learns from video or simulation. Or it can mean the structured representation an application maintains about a particular domain: a business, a patient, a supply chain, a legal case, a research project, a household, or a life.
The discourse framing what these models are ultimately for is dangerously underspecified. We have extraordinary language models. We have increasingly serious efforts to train systems on video, robotics, simulation, physical interaction, and generated 3D worlds, along with confident gestures toward “world models” as the missing next step. But world models of what, exactly? And in service of what?
The Bitter Lesson (the hard-won consensus that methods using raw computation tend to outperform methods built around hand-crafted human knowledge) made an enormously successful bet: that representation could be learned from data at scale. Train a large enough model on enough text, images, video, and sensory data, and meaningful structure would emerge in the weights. And it did.
The last several years have produced substantial advances in speech, vision, language, image generation, video generation, music, code, and robotics. There is real structure in these systems; something meaningfully like a world model exists implicitly in the weights of today's LLMs. The spatial regularities learned by vision systems genuinely reflect how physical environments are organized. The statistical engine was made extraordinarily powerful.
But the Bitter Lesson showed how to make a process powerful. It did not specify the application-layer problem. It did not tell us what kind of structure a useful system needs to maintain about the world it is meant to serve.
Nobody lives inside a benchmark.
The question that matters now is not “can the system beat a champion at Go?” The question is: can the system help manage a supply chain? Can it hold a structured understanding of a patient’s medical history? Can it reason over an engineering project, a business, a legal case, a school, a household, a person’s actual obligations and constraints? That is not a capability question; it is the application challenge.
And the structure it demands is not the physicist’s state: positions, velocities, forces at an instant. It is the state of a domain: which abstract entities exist, how they are represented concretely, how they relate, what has changed, what depends on what, what remains unresolved. This is the kind of relational structure the previous essay described: specific entities in specific relations, maintained over time, correctable against reality.
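As a minimal sketch of what "the state of a domain" could look like in code, the Python below keeps entities, typed relations between them, and a list of open questions. Every name in it (DomainState, Relation, depends_on, and the sample supply-chain facts) is a hypothetical illustration, not taken from any existing system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A directed, typed link between two entities, e.g. order -> part."""
    subject: str
    kind: str
    target: str


@dataclass
class DomainState:
    """Hypothetical container for the state of a domain (not a physical state)."""
    entities: dict[str, dict] = field(default_factory=dict)  # id -> attributes
    relations: set[Relation] = field(default_factory=set)
    unresolved: list[str] = field(default_factory=list)      # open questions

    def add_entity(self, entity_id: str, **attrs) -> None:
        self.entities[entity_id] = attrs

    def relate(self, subject: str, kind: str, target: str) -> None:
        # Refuse relations that mention entities the model does not know about.
        for entity_id in (subject, target):
            if entity_id not in self.entities:
                raise KeyError(f"unknown entity: {entity_id}")
        self.relations.add(Relation(subject, kind, target))

    def depends_on(self, entity_id: str) -> set[str]:
        """Everything entity_id depends on, directly or transitively."""
        seen: set[str] = set()
        stack = [entity_id]
        while stack:
            current = stack.pop()
            for rel in self.relations:
                if (rel.subject == current and rel.kind == "depends_on"
                        and rel.target not in seen):
                    seen.add(rel.target)
                    stack.append(rel.target)
        return seen


# Illustrative supply-chain facts (invented for this sketch).
state = DomainState()
state.add_entity("order-17", status="open")
state.add_entity("part-A", stock=0)
state.add_entity("supplier-X", region="EU")
state.relate("order-17", "depends_on", "part-A")
state.relate("part-A", "depends_on", "supplier-X")
state.unresolved.append("Has supplier-X confirmed the delivery date?")
```

Nothing here is a position or a velocity. Asking state.depends_on("order-17") walks relations, not space, and the unresolved list gives the open question a place to live until someone answers it.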
This is what computers have always done. The spreadsheet is a model of a financial reality. The CAD system is a model of a physical and design reality, not geometry alone but intent: which wall is load-bearing, which tolerance matters, which revision superseded which. The database is a model of an organizational reality. Every useful application of computing is a contraption operating over a structured representation of a specific domain.
That is why the current investment in spatial and physical world models needs to be examined more carefully. Granted, the work is not meaningless. It will likely produce better robots, better simulations, better medical diagnostics, better AR and VR environments. These are among many plausibly valuable directions.
But they don't address the whole application challenge. They are not even focused on the central problem for many of the domains people and institutions actually inhabit. And the labs building these capabilities rarely articulate a destination more concrete than capability itself.
The unexamined assumption is that if a system can understand, generate, and interact with the physical world in 3D, it has somehow crossed from language about reality into reality itself. But reality is not exhausted by space.
A model of appearances is not a model of a life. A model of physical dynamics is not a model of a business. A model that knows where every shelf sits in a warehouse and where the forklift is heading still does not know which supplier is unreliable, which order is late, or which substitution violates a contract.
Those are not spatial facts. They are relational facts.
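To make the distinction concrete, here is a small Python sketch of the warehouse example. The shelf coordinates stand in for what a scene model might supply; the orders, suppliers, and contract terms are the relational facts. All data, names, and the lateness threshold are assumptions invented for illustration.

```python
from datetime import date

# Spatial facts: what a scene model could plausibly supply (hypothetical data).
shelf_positions = {"shelf-1": (0.0, 4.5), "shelf-2": (3.0, 4.5)}
forklift_heading_deg = 90.0

# Relational facts: none of these can be read off the geometry above.
orders = [
    {"id": "o1", "supplier": "acme", "due": date(2024, 3, 1), "delivered": date(2024, 3, 9)},
    {"id": "o2", "supplier": "acme", "due": date(2024, 3, 5), "delivered": None},
    {"id": "o3", "supplier": "bolt", "due": date(2024, 3, 2), "delivered": date(2024, 3, 2)},
]
# Contract terms: which part may stand in for which (original, substitute).
allowed_substitutions = {("part-A", "part-B")}


def is_late(order, today):
    """Delivered after the due date, or still undelivered past it."""
    return (order["delivered"] or today) > order["due"]


def late_orders(today):
    return [o["id"] for o in orders if is_late(o, today)]


def unreliable_suppliers(today, max_late=1):
    """Suppliers with more than max_late late orders (threshold is an assumption)."""
    counts = {}
    for o in orders:
        if is_late(o, today):
            counts[o["supplier"]] = counts.get(o["supplier"], 0) + 1
    return sorted(s for s, n in counts.items() if n > max_late)


def violates_contract(original, substitute):
    """A substitution is a violation unless the contract explicitly allows it."""
    return (original, substitute) not in allowed_substitutions


today = date(2024, 3, 10)
late = late_orders(today)
unreliable = unreliable_suppliers(today)
```

The three query functions never touch shelf_positions or forklift_heading_deg. However accurate the spatial data becomes, the answers still come from the order, supplier, and contract records.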
The surprising abilities of language models sparked genuine excitement about a revolution in computation. Recently that excitement has begun to cool as reliability problems and integration challenges have proven harder to overcome than anticipated. Text, it is implied, is too detached from reality: a layer of indirection. The emerging response is to train on physical reality instead (video, 3D scenes, simulation), on the theory that spatial grounding will bridge the gap between language and the world. It may ultimately bridge part of it, but changing the training substrate changes only the fuel the engine can run on. It does not remove the need for the vehicle.
In the terms of an earlier essay: a model trained on video, geometry, or physical interaction may be a different kind of engine than what an LLM is. Maybe a better engine, one that runs on different fuel. But an engine is still not a car. You do not get transportation by simply strapping wheels directly to the engine. Power has to be mounted, constrained, steered, instrumented, and applied to the road.
In the automotive world, the structure that accomplishes this is the chassis: what every other system mounts to, and what makes the assembly a vehicle rather than a pile of parts. In applied AI, it is the harness: the mediating structure of domain model, controls, interfaces, and feedback loops that turns raw capability into something effective in a particular context of use.
The clearest currently deployed example is found in coding. AI coding systems work particularly well because software projects already contain a limited world model of themselves. A codebase is not just a pile of text. It is an expression of a structured domain: syntax trees, symbols, types, imports, call graphs, call stacks, tests, build systems, runtime traces, dependency graphs, tickets, commits, documentation, and CI.
The model is not merely chatting about code. It is operating against a domain that exposes structure and correction.
The abstract syntax tree tells the system what the program is made of. The type system tells it which combinations are valid. The call graph tells it what depends on what. The stack trace tells it where execution broke. The tests tell it when an expectation has failed. Version control tells it what changed. Documentation and tickets tell it what the humans intended, or at least what they said they intended. This is also a world model, limited but real.
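A small illustration of how much of this structure is simply available for the asking: the sketch below uses Python's standard ast module to recover a rough call graph from source text. The sample program and the call_graph helper are hypothetical, and the helper deliberately handles only plain-name calls, so it is a simplification rather than a real static analyzer.

```python
import ast

# A tiny stand-in codebase (invented for this sketch).
SOURCE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def main(path):
    return parse(load(path))
'''


def call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the names it calls directly."""
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            calls: set[str] = set()
            for inner in ast.walk(node):
                # Only plain-name calls like f(x) are recorded; method calls
                # such as obj.f(x) have an ast.Attribute as func and are skipped.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    calls.add(inner.func.id)
            graph[node.name] = calls
    return graph


graph = call_graph(SOURCE)
```

Here graph["main"] contains parse and load: "what depends on what" falls out of the text mechanically, with no inference required. Very few domains outside software offer anything comparable.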
Coding tools work because software is already one of the most formalized, inspectable, self-correcting human domains we have.
Most human domains are not prepackaged that way. A business also has entities, relations, constraints, histories, exceptions, and changes. So does a patient history. So does a legal case. So does a household. But that structure is usually scattered across documents, inboxes, calendars, spreadsheets, post-it notes, habits, tacit knowledge, and human memory.
There may be a contract in one system, a calendar event in another, a promise made in a text message, a decision buried in a meeting transcript, a dependency implied by an email, and a constraint held only in someone’s head. The world exists, but not in a form the machine can reliably inspect, update, or act within. That is the hard application challenge.
Not merely making models that can see the world. Not just making models that can generate or act in plausible simulated worlds. The challenge is in expressing the shape of a domain so intelligence has something stable to operate against.
Which entities exist? How do they relate? What has changed? What needs correcting, and by whom?
A visual model does not answer these questions. A language model alone does not reliably answer them either. No AI model, existing or proposed, can answer them unless the relevant domain has been encoded explicitly in a form it can access and manipulate.
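One way such an explicit encoding might look, sketched in Python under stated assumptions: an append-only log in which every claim about the domain carries its source and author, and a correction is a new entry rather than an overwrite. The class names, fields, and the lease example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    """One recorded claim about the domain, with provenance."""
    subject: str
    attribute: str
    value: str
    source: str   # where the claim came from, e.g. "email" or "contract"
    author: str   # who recorded or corrected it


class FactLog:
    """Hypothetical append-only log: corrections add entries, never erase them."""

    def __init__(self) -> None:
        self._entries: list[Assertion] = []

    def record(self, assertion: Assertion) -> None:
        self._entries.append(assertion)

    def current(self, subject: str, attribute: str) -> Optional[Assertion]:
        """Latest assertion wins; earlier ones remain available for audit."""
        for entry in reversed(self._entries):
            if (entry.subject, entry.attribute) == (subject, attribute):
                return entry
        return None

    def history(self, subject: str, attribute: str) -> list[Assertion]:
        """Every assertion ever made about this attribute, oldest first."""
        return [e for e in self._entries
                if (e.subject, e.attribute) == (subject, attribute)]


log = FactLog()
# A date inferred from an email, later corrected against the signed contract.
log.record(Assertion("lease", "renewal_date", "2025-06-01", "email", "assistant"))
log.record(Assertion("lease", "renewal_date", "2025-07-01", "contract", "alice"))
```

The design choice worth noticing is that current() and history() answer different questions. The first says what the system believes now; the second says what changed, on what evidence, and who corrected it.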
In terms of building systems regular people can use, what's missing from the frontier imagination is not more raw capability. It's the layer that lets a system maintain a structured, inspectable, correctable understanding of the particular world it is meant to help with.
The well-known failure modes of today's LLMs — hallucination, fabrication, brittleness, loss of context, lack of durable state, conflation of current facts with stale inference — are not surface bugs. They are what happens when generation and implicit representation substitute for a deliberately maintained world model.
I have argued elsewhere that chat is the wrong place to solve this. Chat collapses user intent, system state, and computational process into the same stream. The next useful systems will need persistent state, structured knowledge, formal orchestration thereof, and a separation between interface and substrate. The missing layer is not just “context.” It is the maintained relational structure of the domain itself.
The Bitter Lesson was not wrong. The engine is genuinely powerful. The pattern-matching half of the architecture has been built to extraordinary sophistication. But the field drew too broad a conclusion from that success.
The explicit-modeling branch did not fail because explicit models were fundamentally the wrong idea. It stalled because the tools were not ready, the maintenance costs were too high, the interfaces were too rigid, the systems were too brittle, and the computational substrate was not yet powerful enough to make the whole thing fluid. That stall was treated as a verdict. It may have only been a delay — explicit relational modeling was not a detour from the arc of computing but its inevitable destination, temporarily outpaced by its own tools.
Now the computation exists. The models exist. The generative capacity exists. What remains underspecified is the structured representation those capabilities should operate over.
In the context that actually matters, building systems people and businesses use, world models are relational. They are not merely 3D scenes or physical simulations, not merely implicit statistical structure. They are maintained webs of entities and relations, continuously updated through encounter.
That is the kind of world model a child builds by bonking his head on a coffee table. It is the kind you navigate when you walk through your dark house at night. It is the kind a business needs to hold its domain of interest together in service of AI augmentation. It is the kind a person lives inside every day without ever externalizing in full.
3D scene world models, AI models of dynamic physics, spatial intelligence — all useful. But the human application layer requires something else: explicit relational representation that can stand for physical as well as incorporeal reality, that can persist, update, compose, be corrected, and be audited.
Until the field can say that clearly, it risks mistaking each new engine for the vehicle itself. And wondering why it's so hard to drive it where people actually want to go.