LLMs Are Closer to World Models Than Multimodal Models

Perception gives models more world. Language gives them a place to think.

Jun 04, 2026

A coworker replies “ok.” in the team Slack channel.

At the sensory level, almost nothing happened. A message arrived. It contained two letters and a period. A multimodal model could read the text, identify the sender, place it in a thread, maybe even notice that it came later than usual.

But the interesting part begins after the input arrives.

What does “ok.” mean here? Is it neutral? Dismissive? Tired? Was the coworker in a rush? Did I over-explain something? Should I follow up, leave it alone, soften my tone next time? And why did I notice the period at all? What am I protecting: clarity, respect, control, status, the relationship?

The message is tiny. The state space behind it is not.

This is where the current obsession with multimodality can quietly mislead us. Vision, audio, video, embodiment, tools, and physical feedback all matter. They give a system more contact with the world. But contact is not comprehension.

A world model is not a high-resolution copy of external reality. It is a compressed structure that lets an agent infer hidden state, simulate counterfactuals, evaluate affordances, choose actions, and update itself after feedback.

By that definition, LLMs are not merely “text models.” They are trained on the layer where humans have already compressed much of the world into symbols.

That does not make them complete world models. It does make them closer to the part of world modeling that matters for meaning, planning, and agency.

Why Perception Feels Like Intelligence

Richer perception feels like greater intelligence because perception is the bottleneck we can feel. We notice when we cannot see clearly, hear well, or navigate a room. We do not notice the constant interpretive work that turns sensation into action.

That is why multimodal models feel more “real.” They close a visible gap.

But the gaps that matter for general intelligence are often not perceptual. They are gaps in causal reasoning, social inference, counterfactual simulation, value comparison, and self-modeling. More sensors help only if the system knows what to do with what it senses.

A model that can see everything but cannot ask “why does this matter?” remains reactive. A model with limited perception but a strong ability to simulate futures, revise beliefs, and act under uncertainty is already closer to agency.

The mainstream story says perception is the missing piece. I think that is only half right. Perception is one missing piece. Interpretation is the harder one.

Observation Is Not World Modeling

The first mistake is to confuse an observation model with a world model.

A multimodal model improves the mapping from raw sensory data to structured observations. It can turn pixels, audio, video, and sensor streams into descriptions:

A person is sitting at a desk.
A cup is on the table.
A coworker replied "ok." in Slack.
The person looks annoyed.

That is useful. It is not yet a world model.

In agentic terms, the observation is only o_t: evidence received at time t.

The harder work is inferring the latent state z_t behind the observation.

observation:
A coworker replied "ok." in the team Slack channel.

possible latent state:
neutral acknowledgment
dismissive tone
passive-aggressive signal
time pressure or distraction
relationship tension
my own heightened sensitivity to brevity
accumulated anxiety about project status

The sensory signal does not contain these interpretations by itself. They require background knowledge, social inference, causal reasoning, value judgment, and self-modeling.
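One way to make the step from o_t to z_t concrete is a toy Bayesian update over a few of the candidate states above. This is only a sketch: the priors and likelihoods are invented numbers, not measurements, and the names PRIOR, LIKELIHOOD_OK, and posterior are hypothetical. The point is that the observation alone decides nothing. The belief comes from combining it with background knowledge.

```python
# Hypothetical prior P(z): what I believed about this coworker before the message.
PRIOR = {
    "neutral acknowledgment": 0.50,
    "time pressure or distraction": 0.30,
    "dismissive tone": 0.15,
    "relationship tension": 0.05,
}

# Hypothetical likelihood P(o | z): how probable a terse "ok." is under each state.
LIKELIHOOD_OK = {
    "neutral acknowledgment": 0.30,
    "time pressure or distraction": 0.60,
    "dismissive tone": 0.70,
    "relationship tension": 0.80,
}


def posterior(prior, likelihood):
    """Bayes' rule: P(z | o) is proportional to P(o | z) * P(z)."""
    unnormalized = {z: prior[z] * likelihood[z] for z in prior}
    total = sum(unnormalized.values())
    return {z: p / total for z, p in unnormalized.items()}


belief = posterior(PRIOR, LIKELIHOOD_OK)

# Print the candidate latent states from most to least probable.
for state, p in sorted(belief.items(), key=lambda kv: -kv[1]):
    print(f"{state}: {p:.2f}")
```

Change the prior and the same two letters and a period produce a different belief. Everything that matters sits in the tables, and none of it came from the message itself.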

Seeing more does not automatically solve this.

A camera can see a wedding. It cannot understand a marriage.

A microphone can record an apology. It cannot understand guilt.

A video model can track bodies in a room. Perception alone will not give it status, shame, obligation, resentment, sarcasm, flirtation, institutional power, or moral injury.

A multimodal model can identify that two people are exchanging rings in a decorated hall. It cannot, from the pixels alone, model the years of financial negotiation, family pressure, legal entanglement, emotional labor, and identity renegotiation that precede and follow the moment. It can see the ceremony. It cannot see the contract.

The most important parts of the human world are not merely visible. They are interpreted.

What a World Model Must Do

A thin definition says a world model predicts how the external environment changes.

That is true, but too small.

For an agent, a world model must support action. It has to answer not only “what is happening?” but also “what follows from this?” and “what should I do?”

A useful world model needs to model several layers at once:

physical state:
What happened?

causal state:
Why might it have happened?

social state:
What does it mean between people?

normative state:
What rules, expectations, or values are involved?

action state:
What options are available, and what might each option cause?

self state:
Why am I interpreting it this way?

Most of these layers are not raw perception problems.

They involve concepts like boundary, intention, obligation, embarrassment, politeness, resentment, hierarchy, trust, permission, identity, and consequence. These are not objects in the visual field. They are latent structures that organize human reality.

This is where LLMs are unusually strong.

Not because language is magic. Not because text is enough. Because language is where humans store and exchange compressed models of causes, norms, institutions, intentions, failures, plans, and selves.

Text Is Not “Just Text”

A common criticism of LLMs is that they are “just trained on text.”

But text is not raw in the way pixels are raw. Much of it is already processed experience.

A legal document contains models of obligation, enforcement, liability, and institutional trust. A scientific paper contains models of causality, evidence, uncertainty, and explanation. A novel contains models of desire, deception, memory, status, shame, and moral conflict. A therapy transcript contains models of self-interpretation. A postmortem contains models of failure, incentives, systems, and blame.

Consider a courtroom. A ten-minute video can show gestures, pauses, expressions, and tone. It cannot encode the doctrine of precedent, the burden of proof, the distinction between civil and criminal liability, or the strategic calculus of plea bargaining. A paragraph of case law can compress centuries of institutional reasoning into a form that can be cited, contested, and applied to new facts.

The video captures the event. The text captures the system.

Or consider therapy. A video might show tears and silence. A transcript can reveal the recursive structure of self-deception: the patient says one thing, notices their own contradiction, corrects it, then corrects the correction. That recursion, the ability to model one’s own modeling, is almost invisible to perception. It lives in the symbolic layer.

Training on language does not mean training on arbitrary strings. It means training on the symbolic layer where humans have already transformed raw experience into reusable abstractions.

That is why text can be lower bandwidth at the sensory level and higher leverage at the cognitive level.

Images contain more raw information than text. Video adds motion, timing, and embodied context. But intelligence is not raw signal accumulation. Intelligence depends on compression into abstractions that transfer.

Once you understand “boundary,” you can apply it to bodies, time, property, speech, intimacy, work, attention, and politics.

Once you understand “status,” you can see it in clothing, meetings, jokes, seating arrangements, interruptions, citations, and silence.

Once you understand “incentive,” you can reason across companies, dating apps, families, open-source communities, governments, and markets.

These are not visual categories. They are symbolic operators.

This is the real scaling advantage of language: it scales abstraction.

The Stronger Objection

The best objection is not that multimodality matters. Of course it does.

The stronger objection is that language may only be the shadow of a world model, not the world model itself.

A legal text is not a courtroom. A therapy transcript is not a nervous system. A postmortem is not the outage. Language can omit, distort, rationalize, and hallucinate. Humans lie in language. Institutions hide in language. Cultures encode their blind spots in language.

This matters. A language-only system can build elegant symbolic explanations that are physically wrong. It can overfit to textual priors. It can talk fluently about constraints it has never had to obey.

So grounding is necessary. Perception, tools, environments, feedback, and embodiment constrain the symbolic layer. They tell the model: this object is here, this action failed, this body cannot move that way, this plan collided with the world.

But grounding constrains a world model. It is not the same thing as one.

The hard question is what the system does with the constraint. Does it merely classify the input? Or can it turn the input into a hidden state, place that state inside a causal and social model, simulate alternatives, choose an action, and revise its interpretation afterward?
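The loop described above (infer a hidden state, simulate alternatives, choose an action, revise after feedback) can be sketched in a few lines. Everything here is an assumption made for illustration: the two hidden states, the payoff table, and the likelihood numbers are hypothetical, and a real agent would not have them handed over in a dictionary.

```python
def bayes_update(belief, likelihood):
    """Revise a belief over hidden states given P(evidence | state)."""
    unnormalized = {z: belief[z] * likelihood[z] for z in belief}
    total = sum(unnormalized.values())
    return {z: p / total for z, p in unnormalized.items()}


# Hypothetical payoff of each action under each hidden state.
PAYOFF = {
    "leave it alone": {"busy": 1.0, "annoyed": -1.0},
    "follow up briefly": {"busy": -0.2, "annoyed": 0.8},
}


def expected_value(action, belief):
    """Simulate the action under every hidden state, weighted by belief."""
    return sum(belief[z] * PAYOFF[action][z] for z in belief)


def choose_action(belief):
    """Pick the action with the highest expected payoff."""
    return max(PAYOFF, key=lambda action: expected_value(action, belief))


# Start undecided about the coworker's state.
belief = {"busy": 0.5, "annoyed": 0.5}

# Evidence 1: the terse "ok." (assumed slightly more likely if annoyed).
belief = bayes_update(belief, {"busy": 0.6, "annoyed": 0.7})
first_action = choose_action(belief)

# Evidence 2: a friendly message later (assumed unlikely if annoyed).
belief = bayes_update(belief, {"busy": 0.8, "annoyed": 0.1})
second_action = choose_action(belief)

print(first_action, "->", second_action)
```

A classifier would stop at labeling the message. This toy agent changes what it does once feedback arrives, which is the part that richer sensing does not supply on its own.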

That is the difference between richer sensing and world modeling.

The Core Distinction

The future of AI will almost certainly be multimodal. Models will see, hear, act, manipulate tools, navigate environments, and learn from richer feedback loops.

But richer input does not change the basic distinction.

Multimodality expands the observation space.

Language gives the system an interface for operating on latent state.

World models are not built out of observations alone. They are built out of compressed structures that make observations usable for prediction, planning, and self-correction.

The most important parts of the human world are not visible in the way a chair is visible. Obligation is not visible. Shame is not visible. Trust is not visible. A promise, a threat, a flirtation, an insult, a negotiation, a boundary: these are not perceptual objects.

They are latent structures.

Language is where humans name, compress, transmit, debate, and revise those structures. That is why LLMs are not merely text machines. They are trained on the symbolic residue of human world modeling.

Multimodality gives the model more contact with the world.

Language gives it a space in which the world can become thought.

pinyu's blog
