When a humanoid robot navigates a warehouse aisle and picks a tote off a shelf, it looks, from a distance, a bit like a person doing the same thing. Both are using their eyes. Both are guiding their hands with visual information. The similarity ends there.
The robot’s cameras are capturing arrays of pixel values — numbers representing light intensity across a sensor grid. The human’s visual system is doing something far more complex: integrating depth cues, recognising objects from partial views, drawing on years of prior experience to know what a tote is, where it probably is relative to other things, and what it will feel like to lift it. One of these processes is reasonably well understood and can be implemented in silicon. The other is not. That gap — between detection and understanding — is arguably the central unsolved problem in humanoid robotics today.
The Sensor Stack: What Robots Actually Use to Perceive the World
A modern humanoid robot doesn’t rely on a single camera the way a simple surveillance system might. It carries a layered set of sensors, each capturing a different kind of information about its environment.
RGB cameras are the baseline: standard digital image sensors that capture colour and intensity across a field of view. Most humanoid platforms carry multiple RGB cameras with overlapping fields of view, both to reduce blind spots and to support stereo depth estimation. Stereo vision infers distance from disparity, the slight shift in where the same point appears in the images from two horizontally offset cameras: the nearer the object, the larger the shift. Human binocular vision relies on the same principle.
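The link between image offset and distance can be made concrete with the standard pinhole relation for a rectified stereo pair, Z = f · B / d. The sketch below is illustrative only: the function name and the camera parameters (700 px focal length, 10 cm baseline) are assumptions, not figures from any particular robot.

```python
def stereo_depth_m(disparity_px: float, focal_length_px: float, baseline_m: float) -> float:
    """Depth of a point from its disparity in a rectified stereo pair.

    Uses the pinhole relation Z = f * B / d, where d is the horizontal
    shift (in pixels) of the same point between the left and right images.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive; zero means the point is at infinity")
    return focal_length_px * baseline_m / disparity_px


# Assumed, illustrative camera: 700 px focal length, 10 cm between the lenses.
near = stereo_depth_m(disparity_px=70.0, focal_length_px=700.0, baseline_m=0.10)
far = stereo_depth_m(disparity_px=7.0, focal_length_px=700.0, baseline_m=0.10)
print(f"70 px disparity -> {near:.2f} m, 7 px disparity -> {far:.2f} m")
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters far more for distant objects than for near ones, which is one reason stereo is most dependable at short range.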
Depth cameras go further by measuring distance directly. There are three main approaches. LiDAR (Light Detection and Ranging) fires laser pulses and times how long they take to return, building a precise three-dimensional point cloud of the environment. Structured light systems project a known pattern — typically infrared dots or stripes — onto a scene and measure how that pattern deforms across surfaces to infer depth. Time-of-Flight (ToF) sensors measure the phase shift of modulated infrared light to determine distance at each pixel. Each has trade-offs: LiDAR is accurate at range but expensive and power-hungry; structured light excels at close range; ToF is fast and compact but can struggle in bright sunlight.
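The two timing-based approaches reduce to short formulas. The sketch below assumes an idealised pulsed LiDAR (distance from round-trip time) and an idealised continuous-wave ToF sensor (distance from phase shift); the function names and the 20 MHz modulation frequency are illustrative assumptions.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def pulsed_range_m(round_trip_s: float) -> float:
    """Pulsed LiDAR: the pulse travels out and back, so halve the path length."""
    return SPEED_OF_LIGHT_M_S * round_trip_s / 2.0


def phase_range_m(phase_shift_rad: float, modulation_hz: float) -> float:
    """Continuous-wave ToF: d = c * phi / (4 * pi * f)."""
    return SPEED_OF_LIGHT_M_S * phase_shift_rad / (4.0 * math.pi * modulation_hz)


def unambiguous_range_m(modulation_hz: float) -> float:
    """Largest distance measurable before the phase wraps past 2 * pi."""
    return SPEED_OF_LIGHT_M_S / (2.0 * modulation_hz)


print(f"20 ns round trip      -> {pulsed_range_m(20e-9):.3f} m")
print(f"pi rad shift @ 20 MHz -> {phase_range_m(math.pi, 20e6):.3f} m")
print(f"max range @ 20 MHz    -> {unambiguous_range_m(20e6):.3f} m")
```

The last function shows a trade-off specific to phase-based sensing: a higher modulation frequency gives finer distance resolution but a shorter range before readings become ambiguous.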
Inertial Measurement Units (IMUs) combine accelerometers and gyroscopes to track the robot’s own motion and orientation in space. This is less about perceiving the external world and more about the robot knowing where its own body is — essential for stable locomotion and for correctly interpreting what its cameras are seeing as the robot moves.
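One simple way to combine the two sensor types is a complementary filter, shown here for pitch only. This is an illustrative technique, not a description of what any particular robot runs; production systems generally use more elaborate estimators. The axis convention (x forward, z up, accelerometer reading +g on z when level) and the blend factor of 0.98 are assumptions.

```python
import math


def accel_pitch_rad(ax: float, ay: float, az: float) -> float:
    """Pitch implied by the measured gravity direction.

    Only valid when the body is not otherwise accelerating. Assumes x forward,
    z up, and a reading of +g on z when the sensor is level.
    """
    return math.atan2(-ax, math.sqrt(ay * ay + az * az))


def complementary_filter(pitch_rad, gyro_rate_rad_s, accel, dt_s, alpha=0.98):
    """Blend the integrated gyro rate (smooth, but drifts over time) with the
    accelerometer pitch (noisy, but does not drift)."""
    gyro_estimate = pitch_rad + gyro_rate_rad_s * dt_s
    return alpha * gyro_estimate + (1.0 - alpha) * accel_pitch_rad(*accel)


# Start from a deliberately wrong estimate while the sensor sits level and
# still; the accelerometer term gradually pulls the estimate back to zero.
pitch = 0.5
for _ in range(500):
    pitch = complementary_filter(pitch, 0.0, (0.0, 0.0, 9.81), dt_s=0.01)
print(f"pitch after 500 steps: {pitch:.6f} rad")
```

The gyroscope dominates over short intervals, where it is accurate, while the accelerometer slowly corrects the drift that pure integration would accumulate.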
Proprioception, the sensing of the robot’s own body state through joint encoders, joint torque sensors, and force/torque sensors in the limbs, closes the loop between perception and action. When a robot’s hand makes contact with an object, force feedback tells it how hard it’s pressing, whether the object is moving, and whether a grasp is secure. Without this, visual-only grasping would be profoundly unreliable.
Different manufacturers weight these differently. Boston Dynamics’ Atlas relies heavily on onboard LiDAR and stereo cameras, with significant processing dedicated to terrain mapping for bipedal navigation. Figure and Apptronik lean more on depth cameras and vision models optimised for manipulation tasks. Unitree tends toward cost-constrained sensor packages that push more of the perceptual work onto software. Tesla’s Optimus takes an explicitly camera-centric approach, reflecting the company’s bet — carried over from its automotive AI work — that pure vision plus neural networks can substitute for more expensive depth-sensing hardware.
From Pixels to Scene: The Computer Vision Pipeline
Raw sensor data is not perception. Turning it into something useful requires a processing pipeline that has grown significantly more capable over the past decade, though it remains far from human-level.
The first stage is typically object detection: identifying the presence and rough location of known categories of objects within an image. Modern detection models, built on convolutional neural networks and transformer architectures, can identify hundreds of object categories in real time, with accuracy that rivals human performance on some standard benchmarks. This is the step robots do relatively well.
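Many (though not all) detectors emit several overlapping candidate boxes for the same object and rely on a post-processing step to keep only the best one. The sketch below shows that step, non-maximum suppression based on intersection-over-union, as an illustration of what "rough location" means in practice. The box format `(x1, y1, x2, y2)`, the scores, and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring box in each cluster of overlapping boxes.

    `detections` is a list of (box, score) pairs.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, score))
    return kept


candidates = [
    ((10, 10, 50, 50), 0.90),      # best guess for one object
    ((12, 12, 52, 52), 0.75),      # near-duplicate of the same object
    ((100, 100, 140, 140), 0.80),  # a different object elsewhere in the image
]
kept = non_max_suppression(candidates)
print(f"{len(candidates)} candidates -> {len(kept)} detections")
```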
Semantic segmentation goes further, assigning a category label to every pixel in an image — distinguishing not just “there is a person” but “these pixels are person, these pixels are floor, these pixels are table.” This is more computationally demanding and begins to require more context, but current models handle it adequately in clean conditions.
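At its output, per-pixel labelling usually comes down to picking the highest-scoring class at each pixel. The sketch below shows only that final step on a made-up 2×3 grid of scores; the class list and the score values are hypothetical stand-ins for what a trained network would produce.

```python
CLASS_NAMES = ["floor", "person", "table"]  # hypothetical label set


def label_pixels(scores):
    """Turn per-pixel class scores into a per-pixel label map.

    `scores[row][col]` is a list with one score per class; the label is the
    index of the highest score (an argmax over the class axis).
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


# A made-up 2x3 "image" of class scores, standing in for a network's output.
scores = [
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],
    [[0.9, 0.05, 0.05], [0.3, 0.6, 0.1], [0.2, 0.1, 0.7]],
]
label_map = label_pixels(scores)
named = [[CLASS_NAMES[i] for i in row] for row in label_map]
for row in named:
    print(row)
```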
Instance segmentation adds another layer, distinguishing between multiple instances of the same category: not just “people” but “person A” and “person B” as separate entities that can then be tracked from frame to frame. This matters for robots operating in environments with multiple humans moving independently.
3D scene reconstruction combines depth data from range sensors with camera imagery to build a spatial model of the environment — a map of what is where, updated in real time as the robot moves. This is what allows a robot to plan a path through a room without walking into things.
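The basic building block of this step is back-projection: turning each depth pixel into a 3D point using the camera's pinhole intrinsics. The sketch below covers only that step, in the camera's own frame. A full mapping system would also need the robot's pose to place the points in a shared world frame. The intrinsics `fx`, `fy`, `cx`, `cy` and the tiny depth image are illustrative assumptions.

```python
def depth_to_points(depth_m, fx, fy, cx, cy):
    """Back-project a depth image into 3D points in the camera frame.

    Uses the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    `depth_m[v][u]` is the depth in metres at pixel column u, row v; zero or
    negative values are treated as missing readings and skipped.
    """
    points = []
    for v, row in enumerate(depth_m):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Tiny 2x2 depth image with one missing reading; intrinsics are made up.
depth = [[2.0, 2.0],
         [0.0, 4.0]]
cloud = depth_to_points(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
for point in cloud:
    print(point)
```

Skipping invalid readings matters in practice, because depth sensors routinely return holes where a surface absorbed, transmitted, or deflected the emitted light.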
What the pipeline produces, at the end of all this, is a structured representation of the scene: a set of labelled objects with estimated positions, sizes, and orientations. It is not understanding. It is organised detection.
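A structured representation of this kind might look something like the following. The class and field names are hypothetical, chosen to mirror the description above rather than any real robot's software.

```python
from dataclasses import dataclass, field


@dataclass
class DetectedObject:
    label: str
    confidence: float
    position_m: tuple      # (x, y, z) centre in the robot's map frame
    size_m: tuple          # (width, depth, height) of a bounding box
    yaw_rad: float = 0.0   # orientation about the vertical axis


@dataclass
class Scene:
    objects: list = field(default_factory=list)

    def find(self, label, min_confidence=0.5):
        """Return detections of a given label above a confidence threshold."""
        return [
            o for o in self.objects
            if o.label == label and o.confidence >= min_confidence
        ]


scene = Scene(objects=[
    DetectedObject("tote", 0.92, (1.2, 0.4, 0.9), (0.6, 0.4, 0.3)),
    DetectedObject("mug", 0.40, (0.8, -0.1, 0.8), (0.08, 0.08, 0.10)),
])
print([o.label for o in scene.find("tote")])
print([o.label for o in scene.find("mug")])  # filtered out: confidence too low
```

Note what the structure leaves out: nothing in it records what a tote is for, how heavy it is likely to be, or what happens to the items on top if it is pulled out.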
Where Current Robots Are Genuinely Capable
Given all of this, there are environments and tasks where robot perception works well enough to support reliable deployment.
In structured environments with known objects — a logistics warehouse with standardised totes, a factory line with the same components in predictable positions — detection-based perception is often adequate. The robot knows what it’s looking for, the objects look the same every time, and the lighting is controlled. Agility Robotics’ Digit operating in Amazon fulfilment centres is an example of this approach succeeding in practice.
Close-range depth sensing for manipulation is another area of genuine strength. When a robot needs to grasp an object it can see clearly, at arm’s length, in good lighting, with depth sensors providing accurate distance data, the grasping pipeline works with reasonable reliability. The failure modes are known and can often be engineered around.
Terrain mapping and obstacle avoidance for navigation have also matured considerably. Boston Dynamics’ Atlas can traverse genuinely rough terrain, aided by real-time 3D mapping of the surface ahead. This is a hard problem that has been substantially addressed through years of focused engineering.
Where Robots Still Struggle
The limitations are at least as instructive as the capabilities, and they’re worth being specific about.
Occlusion — objects partially hidden behind other objects — is a persistent challenge. A human reaching for a mug behind a stack of books uses prior knowledge about what mugs look like, where they tend to be, and what “behind” means to complete the task. A robot’s detector, seeing only a partial mug handle, often fails outright. The partial view doesn’t match training data well enough for confident detection.
Novel objects expose the core limitation of learned detection: it works on the categories it was trained on. Present a robot with an unusual tool, an unfamiliar container, or anything meaningfully outside its training distribution, and performance degrades quickly. Humans generalise from prior experience; current detection models mostly don’t.
Ambiguous or dynamic lighting causes well-documented failures. Shadows that partially obscure objects, bright sunlight causing overexposure, the transition from indoor to outdoor illumination — all of these are conditions where camera-based perception degrades in ways that wouldn’t trouble a human eye.
Transparent and reflective surfaces are a particular problem for depth cameras. Glass, polished metal, and water defeat structured light and ToF sensors, which assume that emitted light scatters back from the first surface it hits. Transparent materials let much of that light pass through, and mirror-like ones bounce it away from the sensor or onto other surfaces. A glass of water on a table can appear to have no depth, or wildly incorrect depth. This failure mode turns up constantly in kitchen and office environments, precisely the spaces where humanoid robots are most often envisioned working.
Predicting human motion is perhaps the hardest challenge in mixed-environment deployment. Humans move unpredictably — they stop, turn, reach for things, step sideways without warning. Robots can detect people; they struggle to anticipate what those people are about to do, which is what safe close-proximity operation actually requires.
The Sim-to-Real Gap
Much of the perception capability in modern humanoid robots was developed and refined in simulation. Training a neural network to detect objects requires enormous amounts of labelled data, and generating that data synthetically in a physics simulator is far cheaper than collecting and labelling it in the real world.
The problem is that simulated environments don’t perfectly replicate real ones. Simulated lighting is cleaner and more consistent. Simulated objects have perfect geometry without scratches, dust, or deformation. Simulated physics doesn’t capture every nuance of how real objects behave when grasped or disturbed.
The result is a “sim-to-real gap”: models that perform impressively in simulation sometimes fail in the real world because the real world doesn’t look like the simulation they were trained in. Closing this gap requires either better simulation — through domain randomisation, photo-realistic rendering, and more accurate physics — or supplementing simulated data with real-world collection, which is expensive and slow. It’s one of the central engineering challenges in the field, and one reason deployment in genuinely unstructured environments has proven harder than simulation benchmarks suggest it should be.
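Domain randomisation, in outline, means drawing a fresh set of scene parameters for every synthetic training image so that the model never sees the same idealised conditions twice. The sketch below shows only the sampling step. The parameter names and ranges are illustrative assumptions, and the renderer that would consume them is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class RenderParams:
    light_intensity: float    # relative to a nominal value of 1.0
    light_azimuth_deg: float  # direction the main light comes from
    texture_id: int           # index into a (hypothetical) texture library
    camera_noise_std: float   # sensor noise added to the rendered image
    object_scale: float       # small size variation around the nominal model


def sample_render_params(rng: random.Random, n_textures: int = 50) -> RenderParams:
    """Draw one random scene configuration; the ranges are illustrative."""
    return RenderParams(
        light_intensity=rng.uniform(0.3, 2.0),
        light_azimuth_deg=rng.uniform(0.0, 360.0),
        texture_id=rng.randrange(n_textures),
        camera_noise_std=rng.uniform(0.0, 0.05),
        object_scale=rng.uniform(0.9, 1.1),
    )


rng = random.Random(0)  # fixed seed so a dataset can be regenerated
batch = [sample_render_params(rng) for _ in range(1000)]
print(batch[0])
```

The idea is that if training conditions vary widely enough, the real world looks to the model like just one more variation.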
Foundation Models and the Push Toward Semantic Understanding
The most significant recent development in robot perception is the integration of large vision-language models (VLMs) into the robot’s processing pipeline. Models like OpenAI’s GPT-4o and Google’s Gemini can take an image and a natural language query and return a semantically rich description: not just “there is a bottle” but “there is a half-full water bottle on the left side of the desk, slightly behind the keyboard.”
This is genuinely new capability. It gives robots access to world knowledge embedded in large language models — knowledge about what objects are, what they’re used for, and how they relate to each other — without requiring that knowledge to be explicitly programmed. A robot equipped with a VLM can, in principle, be asked to “bring me the red mug from the kitchen counter” and understand the task semantically rather than requiring a pre-programmed object class for “red mug.”
Several companies are building directly on this. Figure’s demonstrations have featured natural language interaction backed by large model inference. Google DeepMind’s robotics work is explicitly grounded in foundation models. The bet is that the semantic understanding gap — the difference between detecting objects and knowing what to do with them — can be at least partially bridged by importing world knowledge from models trained on the breadth of human text and imagery.
The limitations are real. VLM inference is slow and computationally expensive relative to real-time detection models. The models can hallucinate — confidently describe objects that aren’t there, or misidentify what they see. They don’t yet provide the spatial precision that manipulation tasks require. But the direction is clear, and the progress over the last two years has been faster than most of the field expected.
What Understanding Would Actually Require
It’s worth being precise about what “understanding” would actually mean for a robot, because the word gets used loosely in ways that obscure how much remains unsolved.
A robot that genuinely understands its environment would not just detect objects — it would know their affordances: what actions they support, how they behave under different conditions, how they relate functionally to other objects. It would maintain a persistent model of its environment that updates correctly when things change and degrades gracefully under incomplete information. It would reason about causality: not just that a glass is full, but that tipping it will spill the water, and that a wet floor creates a hazard near the elderly person in the adjacent room.
That kind of contextual, causal, functionally grounded understanding doesn’t emerge from detection pipelines, however well-tuned. It may eventually emerge from foundation models trained at sufficient scale, combined with enough real-world robotic experience to ground abstract knowledge in physical reality. But the current generation of deployed systems is, at best, at the early stages of that journey.
Why This Matters for Deployment
The perception gap isn’t a theoretical problem. It directly determines which environments humanoid robots can operate in reliably — and which they can’t.
A logistics warehouse can be engineered around the limitations: consistent lighting, standardised objects, known layouts, no glass surfaces, predictable human traffic patterns. That’s why warehouse deployment is where the field is today.
A home cannot be engineered the same way. A home has irregular lighting, novel objects, glass everywhere, unpredictable occupants, and continuous change. The perception requirements for reliable home deployment are not incrementally harder than those for warehouse deployment but categorically harder. This is a large part of why the “robot in your home” timeline keeps being pushed out even as warehouse robots reach commercial scale.
The sensor hardware is increasingly capable. The detection software, in structured environments, is increasingly reliable. What the field has yet to close is the gap between seeing and understanding, and closing it is less a matter of better cameras than of deeper machine intelligence. That’s the actual frontier, and it’s further out than the demos suggest.