Introduction: The Egg Problem
Large Language Models have taken the world by storm. GPT-4, Claude, Gemini - these systems write code, analyze data, compose poetry, and hold conversations that feel remarkably human. Yet something fundamental is missing, and it becomes obvious the moment you ask something as simple as: "How hard should I grip this egg?"
The problem isn't that they need more data or bigger models. The problem is that they've never gripped anything. They have no hands, no sensors, no body. And according to decades of research in cognitive science, philosophy, and robotics, this might not be a minor implementation detail - it might be the core issue preventing artificial general intelligence.
This article explores an uncomfortable possibility: that language is not the foundation of intelligence, but rather something that emerges from embodied interaction with the world. If true, this means LLMs - no matter how sophisticated - are manipulating symbols that ultimately reference experiences they can never have.
The Frame Problem: Why Representation Breaks Down
In the late 1960s, AI researchers John McCarthy and Patrick Hayes identified what they called "the frame problem" - a challenge philosopher Daniel Dennett would later argue is among the most fundamental in artificial intelligence. The issue is deceptively simple: How does an intelligent system know which facts are relevant to any given situation?
Imagine you're at a dinner party. Someone mentions they're thinking of buying a house. Your brain instantly knows that responses about mortgage rates are relevant, responses about the weather probably aren't (unless the house has a leaky roof), and responses about quantum mechanics are definitely irrelevant - unless the person happens to be a physicist considering a home office. You make these relevance judgments effortlessly, without consciously considering and rejecting millions of facts you know.
Traditional AI tried to solve this through representation - creating explicit models of the world with rules about which facts matter in which contexts. But as philosopher Mark Bickhard pointed out, this leads to an impossible task: dynamic environments contain multitudes of elements, relationships between those elements, and states that continuously change. Representing every possible state and its relevance would require an unbounded number of symbols. There are more viewpoints than entities, more descriptions than matter.
This is the representational bottleneck. The more you try to explicitly encode knowledge about a changing world, the more you need to encode, until the system collapses under its own complexity.
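To feel the scale of the problem, consider a toy calculation (the numbers are purely illustrative): if a system holds n independent facts, the number of possible "which facts matter right now" combinations is 2^n.

```python
# Toy illustration of the representational bottleneck: with n facts, the
# number of possible "relevant subset" patterns a system might need to
# encode grows as 2**n. The fact counts here are placeholders.
for n_facts in (10, 20, 40, 80):
    contexts = 2 ** n_facts
    print(f"{n_facts:3d} facts -> {contexts:.2e} possible relevance patterns")
```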
Brooks' Radical Solution: Intelligence Without Representation
In 1991, roboticist Rodney Brooks proposed something heretical: Intelligence without representation. His famous declaration: "It turns out to be better to use the world as its own model."
Instead of building internal models, Brooks' robots responded directly to environmental stimuli through layered behaviors - what he called subsumption architecture. A simple robot might have one layer that says "avoid obstacles" and another that says "move toward light." These behaviors run in parallel, directly coupled to sensors and motors, with no central reasoning system deliberating about what to do.
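As a rough sketch of the idea (not Brooks' actual code - the sensor names and threshold here are invented), a subsumption-style controller can be little more than a stack of prioritized reflexes:

```python
# Minimal sketch of a subsumption-style controller (illustrative only, not
# Brooks' implementation; the sensor names and threshold are invented).
# Higher layers can subsume (override) lower ones; there is no world model,
# just direct sensor-to-action rules running every tick.

def seek_light(sensors):
    # Lowest layer: steer toward the brighter side.
    return "turn_left" if sensors["light_left"] > sensors["light_right"] else "turn_right"

def avoid_obstacles(sensors):
    # Higher layer: fires only when something is close; otherwise defers.
    if sensors["range_front"] < 0.3:   # metres, hypothetical range sensor
        return "reverse"
    return None                        # defer to the layer below

LAYERS = [avoid_obstacles, seek_light]  # highest priority first

def control_step(sensors):
    for layer in LAYERS:
        action = layer(sensors)
        if action is not None:          # first layer that fires wins
            return action

print(control_step({"range_front": 0.2, "light_left": 0.9, "light_right": 0.1}))
# -> reverse: the avoidance layer subsumes light-seeking
```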
The results were startling. Brooks' insect-like robots navigated complex environments more successfully than systems with sophisticated world models. Why? Because they didn't waste time updating representations - they simply reacted to what was actually there. While traditional AI built increasingly complex internal maps, Brooks' robots dispensed with maps entirely and used the territory itself.
LLMs and the Same Old Problem
LLMs face the Frame Problem in linguistic space. They've consumed billions of texts and learned statistical patterns of which words follow which other words in which contexts. But these are still representations - patterns abstracted from embodied human experience and frozen in text.
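To see what "statistical patterns of which words follow which" means in the simplest possible case, here is a toy bigram counter - vastly cruder than a transformer, and the corpus is invented, but the point stands: everything the model has is co-occurrence statistics extracted from text.

```python
from collections import Counter, defaultdict

# A toy "training corpus": descriptions of embodied experience, frozen in text.
corpus = "grip the egg gently . grip the handle firmly . the egg is fragile".split()

# Count which word follows which - a bigram model, a crude stand-in for an LLM.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

# "Prediction" is just replaying the most common continuation seen in the text.
print(follows["grip"].most_common())   # [('the', 2)]
print(follows["egg"].most_common())    # [('gently', 1), ('is', 1)]
```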
When context shifts in unexpected ways, LLMs can't look at the world directly. They can only consult their statistical memories of how humans talked about similar situations. The world cannot serve as its own model because LLMs have no access to the world - only to descriptions of it.
Adding more data doesn't solve this. It just adds more representations.
Situated Intelligence: Different Bodies, Different Worlds
In the early 20th century, Estonian biologist Jakob von Uexküll proposed a radical idea: every animal lives in its own perceptual universe - its Umwelt. The pond that looks like "water" to you is a series of vibration patterns to the fish, a surface tension membrane to the water strider, a hunting ground viewed from above to the eagle.
These aren't just different interpretations of the same reality. They are functionally different worlds, shaped by different sensory organs, different bodies, different needs. A bat experiences space through echolocation. Could you explain to a bat what "red" looks like? Could a bat explain to you what it's like to perceive space through sonic texture?
This principle extends beyond perception to intelligence itself. Consider bacteria navigating toward sugar gradients. They exhibit intelligent behavior - approaching beneficial stimuli, avoiding harmful ones, even displaying basic memory. Yet they do this with no brain, no world model, no internal representations. They are coupled directly to their environment through chemical receptors.
This is what Lucy Suchman called "situated action," and what Paul Dourish later developed as "embodied interaction": intelligence that emerges from ongoing interaction with the world rather than from consulting internal models. Context isn't a set of variables to be encoded; it's dynamically constructed through action.
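The bacterium's strategy can be sketched in a few lines - a heavily simplified toy of run-and-tumble chemotaxis, not a biological model - and the sketch makes the point visible: there is no map anywhere, only a moment-to-moment comparison coupled to movement.

```python
import random

# Toy run-and-tumble chemotaxis (illustrative, heavily simplified).
# The cell keeps no map of its environment - only a comparison between
# the sugar concentration it senses now and a moment ago.

def sugar(x):
    return max(0.0, 10.0 - abs(x - 50.0))      # hypothetical gradient peaking at x = 50

x, heading, previous = 0.0, +1.0, 0.0
for _ in range(200):
    x += heading                                # "run" in the current direction
    current = sugar(x)
    if current < previous:                      # things got worse: "tumble"
        heading = random.choice([-1.0, +1.0])   # pick a new direction at random
    previous = current                          # one-step memory, nothing more

print(f"final position: {x:.0f} (sugar peak is at 50)")
```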
LLMs have no Umwelt. They exist in linguistic space-a space created by humans with bodies, describing experiences in a world of objects, forces, and sensations that LLMs have never encountered. They are reading travel guides to countries they can never visit.
Affordances: How Bodies Shape What We Can Know
In 1979, psychologist James J. Gibson published The Ecological Approach to Visual Perception, introducing a concept that would revolutionize cognitive science: affordances.
An affordance is an action possibility offered by the environment relative to an agent's body. A stone is "sit-able" if it's roughly 40-50cm high (relative to human leg length) and has a flat, stable surface. The same stone offers no sitting affordance to an ant and a different one to an elephant. A doorway "affords passage" if it's wider than your shoulders and taller than your head.
Crucially, affordances are perceived directly, not inferred. You don't measure a chair and calculate whether it's sit-able-you see its sit-ability immediately. This is perception scaled to action, shaped by your body's dimensions and capabilities.
Gibson argued that we don't construct mental representations and then figure out what to do. Instead, we perceive what we can do directly from the environment. The world reveals itself as a field of possibilities for action.
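The direct-perception claim itself can't be put into code, but the body-relativity of affordances can be made concrete with a toy comparison (the thresholds below are invented for illustration, not taken from Gibson):

```python
# Toy illustration: the same object affords different actions to different
# bodies. The numerical thresholds are invented for illustration.

def sit_able(surface_height_cm, leg_length_cm):
    # Roughly: a surface is sit-able when it sits near knee height for this body.
    return 0.7 * leg_length_cm <= surface_height_cm <= 1.1 * leg_length_cm

def passable(doorway_width_cm, shoulder_width_cm):
    return doorway_width_cm > shoulder_width_cm

stone_height = 45  # cm
print("adult  :", sit_able(stone_height, leg_length_cm=50))  # True - roughly knee height
print("toddler:", sit_able(stone_height, leg_length_cm=25))  # False - a wall, not a seat
print("doorway:", passable(70, shoulder_width_cm=45))        # True for this body
```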
The Body as Cognitive Infrastructure
This has profound implications. Your body is not just a vehicle for your mind - it's part of your cognitive system. When you "think" about whether you can fit through a narrow gap, you're not running calculations. You're using your proprioceptive sense of your body's dimensions in relation to perceived space. This is thinking through the body, not just with the brain.
LLMs cannot perceive affordances. They can tell you that a chair is for sitting - they've read that in thousands of texts. But they've never experienced the direct perception of sit-ability. They've never felt their body's weight supported by a surface, never adjusted their posture for comfort, never known the affordance-space of human-scaled furniture.
When an LLM generates advice about physical activities - "grasp the handle firmly," "step carefully on the ice," "judge whether you can reach the shelf" - it's retrieving linguistic patterns about embodied actions it has never performed. It's like a travel writer plagiarizing guidebooks to places they've never visited.
Embodied Cognition: Language Built on Physical Experience
In 1991, Francisco Varela, Evan Thompson, and Eleanor Rosch published The Embodied Mind, synthesizing decades of research into a revolutionary thesis: cognition is enactive. Intelligence doesn't happen in the brain processing representations of a pre-given world. Instead, cognition emerges through sensorimotor engagement - through the ongoing structural coupling between organism and environment.
Think about how you understand "up." It's not an arbitrary concept you memorized. It's grounded in your experience of standing against gravity, of reaching upward requiring effort, of climbing making you tired. "Up" is felt in your muscles before it's thought with your mind.
The Linguistic Traces of Embodiment
George Lakoff and Mark Johnson demonstrated in Metaphors We Live By (1980) that our most abstract concepts are built on bodily metaphors:
- Time is space: "We're approaching the deadline" (motion toward), "That's behind us now" (spatial location)
- Ideas are objects: "I can't grasp that concept" (manipulation), "Let me digest this information" (consumption)
- Argument is war: "She attacked my position," "I defended my thesis," "His criticisms were right on target"
These aren't poetic flourishes - they're the structural foundation of human thought. Abstract reasoning is scaffolded on sensorimotor experience.
Here's the crucial insight: This embodied structure is implicit in language. Native speakers use spatial metaphors for time without thinking about it. The meanings are so deeply integrated that they become invisible.
LLMs learn these patterns. They use temporal-spatial metaphors correctly, structure arguments using war terminology, manipulate conceptual objects linguistically. But they're working with shadows on the cave wall - patterns that reference experiences they've never had.
The Grounding Problem
Consider explaining "rough" to someone who's never touched anything. You could say "having an uneven or irregular surface," but this definition just pushes the problem back. What does "uneven" mean? "Irregular"? At some point, understanding requires direct tactile experience. You have to feel roughness.
LLMs understand "rough" the way a person who never tasted a lemon understands "sour"-through exhaustive linguistic descriptions and contextual patterns, but without the direct perceptual grounding that gives the concept its primary meaning.
This distinction matters:
- Primary grounding: "Rough" means this tactile sensation [touching sandpaper]
- Parasitic grounding: "Rough" means "having an uneven surface that creates friction when touched"
The second definition works - you can use it to identify rough surfaces in images or generate appropriate sentences. But it's understanding about roughness, not understanding of roughness.
The Chinese Room: Generations Later
John Searle's Chinese Room argument (1980) has been debated for decades. The setup: A person who doesn't understand Chinese sits in a room with a rulebook. Chinese symbols come in, the person follows rules to manipulate symbols, and meaningful Chinese responses go out. From outside, it appears the room "understands" Chinese. But the person inside is just following syntactic rules without semantic understanding.
Traditional objections include the "systems reply" (the whole system understands, even if the person doesn't) and the "robot reply" (add sensors and motors to ground the symbols). But there's a more interesting consideration for LLMs:
What if the Chinese Room has been occupied for generations?
Imagine the person in the room has been doing this not for an afternoon, but for years. Decades. Centuries of processing Chinese text, following ever-more-sophisticated rules, building increasingly complex pattern recognition systems. They've never left the room, never interacted with the world the symbols describe, but they've processed trillions of Chinese conversations.
At some point, this person develops extraordinary facility with Chinese patterns. They can predict what a Chinese speaker would say in almost any context. They notice subtle patterns in how Chinese speakers use words. They can generate Chinese text that native speakers find fluent, appropriate, even insightful.
Have they learned Chinese?
In one sense, yes. They understand Chinese in the way LLMs do - they've internalized the statistical structure of the language to a remarkable degree. They know which symbols pattern with which others in which contexts. This is not trivial. This is sophisticated pattern recognition that took generations to develop.
But in another sense, no. When they manipulate the symbols for "rough surface," they're not activating memories of tactile experience. When they process words about "grasping an egg," they're not drawing on proprioceptive feedback about grip force. The symbols still reference a world they've never touched.
Parasitic Competence
Call this parasitic competence: impressive capability that depends entirely on the existence of agents with primary grounding.
The Chinese Room occupant can engage in sophisticated linguistic behavior because native Chinese speakers created the patterns through their embodied experience.
LLMs exhibit parasitic competence. They can:
- Generate grammatically correct, contextually appropriate text
- Answer questions by pattern-matching against training data
- Translate between languages (mapping patterns across linguistic spaces)
- Summarize and recombine information
- Follow conventions and templates
But they cannot:
- Validate claims against direct sensorimotor experience
- Perceive affordances in novel situations
- Generate truly new embodied insights (only recombine existing ones)
- Know what their words feel like
The room occupant after generations has learned the language game extremely well. But the game references a reality they cannot access.
What LLMs Are (and Aren't) Doing
Think of it this way: Imagine you meet a traveler who's visited a country you've never been to. They describe the mountains, the cuisine, the architecture. You form mental images based on your own experiences-you map their descriptions onto your memories of other mountains, other foods, other buildings.
Your understanding is constructed and approximate. You're simulating their experience using your own experiential building blocks. If they mention a fruit you've never tasted, you might imagine it as "somewhere between a mango and a papaya," but you haven't actually tasted it.
LLMs are permanent travelers hearing stories about a physical world they can never visit. They've read millions of descriptions of embodied human experience. They've learned the statistical patterns of how humans talk about seeing, touching, moving, tasting. They can generate plausible descriptions because they've memorized the language game.
What LLMs Excel At
This parasitic grounding explains both their remarkable capabilities and their characteristic failures.
They excel at:
- Pattern matching in linguistic space
- Generating grammatically and contextually appropriate text
- Translating between languages (mapping pattern systems)
- Summarizing and recombining information they've encountered
- Following established templates and conventions
- Reasoning in formal domains (mathematics, logic, code)
Where LLMs Struggle
They struggle with:
- Physical plausibility (generating mechanically impossible scenarios)
- Novel embodied situations (they can only recombine what they've read)
- Validating claims against sensorimotor reality
- Understanding the "feel" of physical processes
- Scaling and proportion in physical contexts
- Detecting when linguistic patterns diverge from physical reality
Current Embodied AI: Necessary But Insufficient
So what about robots? Self-driving cars? Aren't these embodied AI systems?
Yes and no. They have bodies, but they might not be embodied in the sense that matters for general intelligence.
They have sensors and actuators, but lack something more fundamental: they don't inhabit a meaningful world. A self-driving car responds to obstacles, but nothing matters to the car.
Levels of Embodiment
We need to distinguish:
Level 1: Physical Embodiment (Having sensors and effectors)
- Cameras, LIDAR, motors, grippers
- The ability to perceive and act in the physical world
- Robots and autonomous vehicles have this
Level 2: Organismic Embodiment (Being a self-maintaining system)
- Having intrinsic needs and goals from self-maintenance
- Bacteria seeking sugar because metabolism requires it
- Goals generated from within, not externally programmed
- Current AI systems lack this
Current embodied AI has only Level 1. This is significant but insufficient.
Self-Driving Cars as Weakly Embodied
Consider autonomous vehicles - the most sophisticated embodied AI deployed at scale.
What they have:
- Extensive sensorimotor loops (sensors → processing → actuation → feedback)
- Real-time adaptation to dynamic environments
- Continuous learning from consequences
What they lack:
- Autonomous goal generation (humans set destinations)
- Sense-making (they don't inhabit a meaningful world)
- Developmental trajectory (no exploration-driven learning)
A self-driving car solves a constrained optimization problem: navigate from A to B while avoiding obstacles. It's impressively competent at this specific task. But it's not experiencing the world as meaningful. When it stops at a red light, it's pattern-matching sensor input to programmed responses. There's no aboutness to its cognition.
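Schematically - and this is a caricature, not any manufacturer's actual stack; the sensor and actuator interfaces are placeholders - the loop looks like this, with the goal injected from outside:

```python
# Schematic sense-plan-act loop for a "weakly embodied" system.
# The goal is injected from outside (a human chose the destination);
# nothing in the loop makes the world matter *to* the system.

from dataclasses import dataclass

@dataclass
class Observation:
    obstacle_ahead: bool
    light_is_red: bool

def sense() -> Observation:
    # Placeholder for camera / LIDAR / radar fusion.
    return Observation(obstacle_ahead=False, light_is_red=True)

def plan(obs: Observation, destination: str) -> str:
    # Constrained optimization reduced to a caricature:
    # reach `destination` subject to "don't hit things, obey signals".
    if obs.obstacle_ahead or obs.light_is_red:
        return "brake"
    return "proceed"

def act(command: str) -> None:
    print(f"actuators: {command}")

destination = "123 Main St"        # set by a human, not by the system
for _ in range(3):                 # the loop runs; no goal is ever its own
    act(plan(sense(), destination))
```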
What's Still Missing
Brooks' insect-like robots navigated complex environments through sensorimotor coupling. They appeared intelligent - adapting, learning, succeeding. But Brooks himself acknowledged they lacked something: autonomous sense-making.
A bacterium swimming toward sugar is engaged in sense-making. The sugar gradient is meaningful for the bacterium because its metabolism requires glucose. The meaning arises from the organism's need to maintain itself.
Current AI systems, embodied or not, lack this. They have no organismic needs that would make anything inherently significant. To them, everything is equally neutral - nothing carries the kind of meaning that comes from having skin in the game.
When an LLM processes an image and says "that chair looks sit-able," something subtle but crucial is happening. It's pattern-matching visual features against linguistic descriptions humans made based on their embodied experience. A human child doesn't learn "sit-able" from descriptions - they try sitting on things, fail (too high, too soft, too unstable), succeed, and directly perceive the affordance. The LLM is working with a linguistic map of embodied territory it's never explored.
Implications: What This Means for AI Development
If this analysis is correct, several conclusions follow:
1. LLMs Will Remain Fundamentally Limited
No amount of scale, data, or architectural refinement will give LLMs primary grounding. They can become more sophisticated at pattern matching, better at mimicking embodied understanding, more reliable at tasks within their training distribution. But they cannot escape parasitic competence.
This doesn't make them useless - far from it. Parasitic competence is still competence. But we should stop expecting LLMs alone to achieve artificial general intelligence.
2. True AGI Requires Embodiment
If intelligence emerges from sensorimotor coupling with the world, then AGI requires:
- Physical embodiment (sensors, effectors, continuous environmental interaction)
- Developmental learning (exploration-driven growth, not just dataset consumption)
- Sensorimotor grounding (direct perception-action loops, not just representations)
This means robotics and embodied AI research become central, not peripheral, to the AGI project.
3. Multi-Modal Models Aren't Enough
Adding vision, audio, or other modalities to LLMs doesn't fundamentally solve the grounding problem. Multi-modal models still transform everything into representations in latent space. They're still trapped in the sense-think-act bottleneck, still facing the Frame Problem, still lacking direct sensorimotor coupling.
They can correlate patterns across modalities - learning that images of dogs correlate with the word "dog" - but this is still symbolic association, not embodied understanding.
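Concretely, cross-modal association boils down to comparing vectors. In the toy sketch below the embeddings are made-up numbers rather than the output of trained encoders, but the operation is the real one: a similarity score between patterns, with no dog anywhere in the loop.

```python
import math

# Toy cross-modal association: an "image embedding" and a "text embedding"
# are just vectors; association is a cosine similarity between them.
# The numbers are invented - a real multi-modal model learns them.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

image_of_dog = [0.9, 0.1, 0.3]      # pretend encoder output for a photo
text_dog     = [0.8, 0.2, 0.25]     # pretend encoder output for the word "dog"
text_teapot  = [0.1, 0.9, 0.7]

print(cosine(image_of_dog, text_dog))      # high: the patterns co-occur in training data
print(cosine(image_of_dog, text_teapot))   # low: they don't
```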
4. We Need Hybrid Architectures
The path forward likely involves:
- Embodied platforms generating training data through interaction
- Linguistic/reasoning systems (LLM-like) for abstract thought and communication
- Tight coupling between sensorimotor and symbolic processing
- Developmental trajectories where systems learn through exploration
Neither pure embodiment (Brooks' robots) nor pure symbolic reasoning (LLMs) is sufficient. Intelligence may require both, properly integrated.
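What might that coupling look like? Here is a purely speculative sketch - every function is a hypothetical placeholder, not an existing API - in which a linguistic layer proposes actions and an embodied layer executes them and feeds back what text alone could never supply:

```python
# Speculative sketch of a hybrid loop: a symbolic/linguistic planner proposes,
# an embodied sensorimotor layer grounds, executes, and reports back.
# Every function here is a hypothetical placeholder, not a real API.

def linguistic_planner(goal: str, feedback: str | None) -> str:
    # Stand-in for an LLM-like component proposing the next action in words.
    if feedback and "slipping" in feedback:
        return "increase grip force slightly"
    return "grip the egg gently"

def embodied_layer(instruction: str) -> str:
    # Stand-in for a robot body: executes and returns sensorimotor feedback.
    if "increase grip force" in instruction:
        return "egg held, shell intact, grip force 0.9 N"
    return "egg slipping, grip force 0.4 N"

feedback = None
for _ in range(2):
    action = linguistic_planner("pick up the egg", feedback)
    feedback = embodied_layer(action)
    print(action, "->", feedback)
```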
5. We Should Calibrate Expectations
Understanding LLMs as parasitically grounded helps us deploy them appropriately:
Use LLMs for:
- Information synthesis and summarization
- Formal reasoning (math, code, logic)
- Language translation and generation
- Pattern recognition in text
- Human-AI communication interfaces
Don't expect LLMs to:
- Validate physical plausibility without external checks
- Generate truly novel embodied insights
- Understand physical constraints intuitively
- Replace embodied expertise in physical domains
- Develop common sense about the material world
Conclusion: Castles in the Air
LLMs are remarkable achievements - the most sophisticated parasitically grounded systems ever created. The Chinese Room has been occupied for generations, and the occupant has learned the language game extraordinarily well. But they remain, in a fundamental sense, disembodied intelligences reading travel guides to a world they cannot visit.
This isn't a defect to be fixed with more data or better algorithms. It's a structural limitation that follows from the architecture. Language is not the foundation of intelligence - it's something that emerges from embodied interaction with the world. LLMs are building castles in the air, manipulating symbols that ultimately reference experiences they can never have.
If we want artificial general intelligence - systems that can truly understand, adapt, and reason about the physical and social world - we cannot rely on linguistic systems alone. Intelligence requires bodies. It requires sensorimotor grounding. It requires the ability to perceive affordances directly, to construct meaning through action, to use the world as its own model.
The question isn't whether LLMs are impressive. They are. The question is whether impressive parasitic competence can ever become genuine intelligence without embodied grounding.
The evidence from cognitive science, robotics, and philosophy suggests it cannot.
We can continue to build increasingly sophisticated language models, and we should - they're enormously useful tools. But if we want to create truly intelligent machines, we need to give them bodies, let them interact with the world, and allow them to develop meanings through sensorimotor experience.
Only then will they understand what it means to grip an egg. And perhaps then, they could explain to us what it's like to inhabit pure information space - a world as alien to us as eggs are to them now.