8 min read

World Models Are Relational

World Models Are Relational

My son used to bonk his head constantly while learning to crawl. He'd spot something across the room and make a beeline for it, not yet aware of the relative dimension of his body and the coffee table he was about to crawl under. Bonk. Over and over again, until eventually he got it. Each time, something like a prediction had failed, the assertion of a not quite articulated assumption. His body moved as if the path were clear, and the world informed him otherwise. There was a forehead up there, outside his field of vision, that needed clearance.

What was he learning? It wasn't a visual rendering of the space under the table or a geometric measurement of clearance heights. He was learning to intuit the boundaries of his own embodied system, calibrating proprioception. Where "he" ended in relation to everything else. How his intention translated through a body he didn't yet fully control.

Before any of that head-bonking could happen, he had to solve more basic problems. Learning to use his eyes, to focus, direct attention, separate the glare on the surface of the eyeball from the scene beyond it, from the conflation with glare on the window. Sorting a chaotic flood of sensory information into streams: what's visual, what's auditory, what's the feeling of his own body moving through space. Which correlations matter and which can be ignored. There's no instruction manual for any of this. Every infant runs a bootstrap program — a kind of perceptual BIOS — that directs to sort signal from noise across every modality at once. An impulse that drives to create a sense of coherence from a fundamentally incoherent environment.

He didn't arrive with knowledge. He arrived with two things: a set of drives, and an enormous capacity to integrate information.

The drives ran constantly — rolling over, reaching for things, looking toward sounds, putting everything in his mouth, crying when his mom left the room, tracking her face when she came back. These weren't decisions but impulse. A low-level algorithm as relentless interpretive process that took whatever was perceived to be happening and tried to make something mentally tangible of it. Every reach, every crawl, every failed grab was a calibration run against the world, and whatever came back shaped what he tried next.

The capacity was nearly limitless and almost entirely unfilled. He could have grown up speaking any language. Could still develop expertise in any domain. Could build up an understanding of environments neither evolution nor human imagination could ever have anticipated. The potential for the mental structure he was building had been prepared by millions of years of development. The content — what would and will eventually occupy that structure — is almost entirely up to encounter.

By the time he was asking "that?" while pointing at objects, demanding names for the concepts he recognized in his space, he was already navigating a complex understanding. Through trial, error, and observation, he had learned that some surfaces supported his weight, others did not. That some things can be hard and sharp while others are soft and forgiving. And that, at times, he would have to duck when crawling under things to avoid the pain of head-bonk. That coffee table wasn't fundamentally a coffee table, it was a nexus of relations and affordances he had built up through encounter, that we then agreed to call by a particular name. The label came later, the construct the label was being applied to had been established in his mind for a while before.

Try navigating your house in the dark

You're not replaying video footage. You're traversing something else — door is left of bed, a few steps to the bathroom, light switch around shoulder height on the right. You know the floor creaks in the middle of the hallway so you try to avoid it. You know the dresser drawer sticks out a little to where it could jam you in the pelvis if it wasn't closed all the way.

None of this is particularly visual. You can do it with your eyes closed. Some people who are blind do it without ever having had visual input in the first place. A lot of information about the world comes in through other senses. Sound carries information about the size and shape of spaces, the texture of surfaces, the proximity of objects. Near-field acoustic effects let you know something is close to your head before you've turned to look. Smell is stereoscopic — your two nostrils sample independently and your brain triangulates. Air currents on your skin tell you a door has opened in the other room.

These aren't backup systems for when vision fails, they are sibling systems. Together they inform the individual's internal understanding of the state of their environment. This notion of where we're at and what's going on, being built up over years of encounter, operates underneath any particular sensory modality. Vision informs it, sound informs it, touch, smell, and proprioception inform it. But what's been built, the understanding you're actually using to find the bathroom in the dark, isn't primarily visual or auditory or tactile. It's something else, more abstract, that vision and sound and touch are all for.

What gets synthesized from all those modalities, despite their wildly different physics, is the same underlying form: relations. This is closer than that. This belongs in that. This happens before that. This caused that. Whether the signal arrives as photons or pressure waves or volatile molecules or warmth on skin, what the nervous system builds is something like a structure of relations. The mug is on the table. The coffee is in the mug. The mug is near your hand. This mug is a manifestation of the idea of a mug. A mug can hold coffee, in concept. This mug belongs to you. None of those relations are visual. Vision is one way of populating them, but then so are the other senses.

If a tree falls in the forest and no one is around to hear it

...none of that just happened. Without the perceiver — and crucially, their mental construct of 'tree' and 'forest', 'falling' and 'sound' — nothing can be said about it. Maybe something happened somewhere, but even that doesn't get us away from the paradox of the observer as an integral part of what we commonly refer to as 'reality'.

When you make an espresso, the machine on your counter isn't just there as an espresso machine, waiting to be recognized. The pattern of matter in front of you affords being interpreted that way — it has the right shape, supports the right operation, behaves in the right way when you press the right button — but no amount of simply staring at it would produce espresso machine for someone who had never experienced the concept. A janky homebrew rig of copper pipes and a propane burner can also be an espresso machine, if it can force steam through coffee grounds. What makes the encounter produce espresso machine experience is the meeting of two things: a pattern of matter that affords the interpretation, and a perceiver carrying the concept that lands on the affordance.

This is also true of some rock you step over, a rotting tree stump, the stars in the night sky. Each is the result of the underlying physical substrate supporting a particular interpretation by a mind that supplies the structure of concepts that enables the interpreting. The world you experience is the product of this ongoing process of interaction. It is not unlike the way information is encoded in artifacts — present and real, but not reducible to either side alone. The book is paper and ink and language and narrative all at once, but also the act of being read; your world is raw material and mental model and interpretation all at once.

This model is the interface. You don't experience reality and then consult your understanding of it — your understanding of it is your reality, from where you stand. There's no other layer readily available to you. Think of how we use computers: we don't operate directly on the binary math actually happening in the silicon, we operate through the desktop, the windows, the icons, files, and applications. The whole computer is the stack of substrate, raw computation, and user-interface layered on top. But we really only concern ourselves with the surface abstraction, simplified for ergonomics. The cognitive model works in a similar way. As at all times, walking through your dark house at three in the morning, every step and every reach is happening fundamentally at the model layer because that's the means through which we organize and navigate reality.

But unlike the computer's GUI, the cognitive interface isn't sealed, reality pushes through. You step on a lego in the hallway and for a moment the seamlessness breaks, the assumptions baked into the model were incongruous with reality and the discrepancy surfaced as pain and surprise. That break is information, the model adjusts, absorbs the correction. The next time you walk that hallway in the dark you'll move a little differently, place your feet down more tentatively, knowing there's a chance the floor isn't clear of injection-molded caltrops. The interface has updated itself. The process bootstrapped by the perceptual BIOS is never complete. Though it may begin dramatically with an explosion of cognitive development, it is a living model, continuously maintained. Reinforced when corroborated by experience, updated by the friction between what it expects and what it encounters.

Lived reality is a structure of concepts in relation with other concepts

Espresso machine makes espresso. Espresso is a kind of coffee. Coffee is consumed in the morning. Your sister prefers tea. The kitchen connects to the living room. Each of these is a concept (espresso, coffee, morning, sister, kitchen) connected to other concepts by relations (makes, is-a-kind-of, prefers, connects-to). Those relations are themselves concepts — makes involves something doing the making and something being made; connects-to implies two things in direct adjacency; is a kind of places one concept as a more specific version of another, broader one. The vocabulary of relations sits in the same web as everything it relates, itself built of relations of concepts. Reality is self-referential and homoiconic.

"Things" aren't containers for attributes. An espresso machine isn't an object that has properties — silver, countertop-sized, Italian. It's a constellation of related concepts: is a manufactured artifact, has an operating principle of pressurized steam, originates from Italy, is located in the kitchen, is used in the morning, was bought at Target, was gift from wedding. Pull on one concept and dozens come with it. The web of relations, as understood by the observer, is what gives any individual concept its meaning and identity across dimensions, not some intrinsic essence sitting inside it.

This is what you've spent a lifetime building. The bootstrap machinery you were born with ran tirelessly into the raw capacity, and over years of living the relational web filled in. Some information came from direct experience (the coffee table, the doorway). Some came from language (your sister told you the espresso machine was broken). Some came from culture (espresso is for adults; juice is for kids). All of it accumulated as relations between concepts, refining and extending with every encounter.

When you walk through your dark house at three in the morning, this is what you're leveraging. Not primarily images or sounds, but the structure. The kitchen is next to the living room is down the hall from the bedroom. The light switch is at shoulder height on the right side of the door. Your foot lands firmly where it does, without stumbling, because the floor is where the floor is, in relation to everything else.

This is a world model

Not in the sense the field has been using the term. Lately, what is referred to as "world model" in AI is a model primarily of appearances — how visual scenes evolve, what a room looks like from another angle, how a ball falls. Those are static models of one or a few sensory slices, aspiring to capture some truth about the structure of the reality that undergirds them. What we've been describing is something different: an omni-domain structure of relations between concepts, malleable in response to the world, that serves as the simplified, interpretable proxy for whatever exists 'out there'.