When a humanoid robot successfully picks up an unfamiliar object, places it precisely, and repeats the action reliably across thousands of cycles, it looks smooth enough that it's easy to assume the hard problem is mechanical — joints, motors, sensors. In fact, the harder problem is upstream: getting the robot to know what to do in the first place. Training a robot to perform physical tasks in a world that wasn't built for robots is a genuinely difficult research problem, and understanding how companies are approaching it explains a lot about where the technology currently sits and why deployment is harder than the demos suggest.

This is a field where the vocabulary is technical but the core ideas are accessible. What follows is an attempt to explain how robot learning actually works — and why the gap between a robot that performs well in a lab and one that performs well on a factory floor is wider than it appears.

Three Ways Robots Learn

There is no single method for training a humanoid robot. In practice, most advanced systems use a combination of three broad approaches, each with distinct strengths and limitations.

The first is teleoperation-based imitation learning. A human operator wears a motion capture suit or uses handheld controllers to physically demonstrate a task — picking up a part, placing it in a fixture, turning a valve. The robot records that demonstration: every joint angle, every force applied, every movement of the human's hands and body. It then attempts to replicate the motion. With enough demonstrations of enough variations of the same task, the robot builds a statistical model of how that task should be performed. Figure AI's early BMW demonstrations relied heavily on this approach. It produces impressive results on well-defined tasks, but it is labour-intensive — collecting high-quality demonstrations at the scale required for robust performance is a significant bottleneck.
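The core of imitation learning can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the "operator" is a made-up rule that maps an object position to a gripper target with a little noise, and the "policy" is a straight-line fit. Real systems fit neural networks to camera images and joint states, but the principle of fitting a model to recorded observation and action pairs is the same.

```python
import random

def collect_demonstrations(n, seed=0):
    """Stand-in for teleoperation (hypothetical): for an object at position
    `obs`, the human moves the gripper to 0.8 * obs + 0.1, with small noise."""
    rng = random.Random(seed)
    demos = []
    for _ in range(n):
        obs = rng.uniform(0.0, 1.0)
        action = 0.8 * obs + 0.1 + rng.gauss(0.0, 0.01)
        demos.append((obs, action))
    return demos

def fit_linear_policy(demos):
    """Ordinary least squares for action = w * obs + b."""
    n = len(demos)
    mean_obs = sum(o for o, _ in demos) / n
    mean_act = sum(a for _, a in demos) / n
    cov = sum((o - mean_obs) * (a - mean_act) for o, a in demos)
    var = sum((o - mean_obs) ** 2 for o, _ in demos)
    w = cov / var
    b = mean_act - w * mean_obs
    return w, b

demos = collect_demonstrations(200)
w, b = fit_linear_policy(demos)

def policy(obs):
    """The learned policy: predicts the action the operator would have taken."""
    return w * obs + b

print(f"learned w={w:.3f}, b={b:.3f}, action at obs=0.5: {policy(0.5):.3f}")
```

The fit recovers the operator's rule only within the range of positions that were demonstrated, which is one reason the number and variety of demonstrations matter so much.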

The second is reinforcement learning (RL) in simulation, a method in which the robot learns by trial and error. A virtual environment models the physical world: gravity, friction, object properties, the robot's own body dynamics. The robot tries to complete a task, receives feedback on whether it succeeded or failed, and gradually adjusts its behaviour to maximise success. Because simulation can run faster than real time, many copies can run in parallel, and failures have no physical cost, a robot can accumulate the equivalent of years of experience in days. Boston Dynamics has used simulation extensively in developing Atlas's whole-body movement capabilities; the robot's ability to perform athletic manoeuvres that would be impractical to teach through demonstration relies on this kind of training.
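The trial-and-error loop can be sketched in a few lines. This is a deliberately simplified, bandit-style version of reinforcement learning, not how any of the companies named here train their robots: the simulator, the candidate grip forces, and the slip and crush thresholds are all hypothetical. The agent tries a force, sees only success or failure, and keeps a running estimate of how often each force works.

```python
import random

FORCES = [2.0, 4.0, 6.0, 8.0, 10.0]  # candidate grip forces in newtons (hypothetical)

def simulate_grasp(force, rng):
    """Toy simulator: the object slips below a noisy required force
    and is crushed at 7 N or more. Returns reward 1.0 on success, else 0.0."""
    required = 5.0 + rng.gauss(0.0, 0.5)
    crush_limit = 7.0
    return 1.0 if required <= force < crush_limit else 0.0

def train(episodes=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy learning: mostly use the best-known force,
    occasionally try a random one, and update success estimates."""
    rng = random.Random(seed)
    value = [0.0] * len(FORCES)   # estimated success rate per force
    count = [0] * len(FORCES)     # number of trials per force
    for _ in range(episodes):
        if rng.random() < epsilon:
            i = rng.randrange(len(FORCES))                        # explore
        else:
            i = max(range(len(FORCES)), key=lambda k: value[k])   # exploit
        reward = simulate_grasp(FORCES[i], rng)
        count[i] += 1
        value[i] += (reward - value[i]) / count[i]                # running mean
    return value

values = train()
best_force = FORCES[max(range(len(FORCES)), key=lambda k: values[k])]
print("estimated success rates:", [round(v, 2) for v in values])
print("best grip force:", best_force)
```

Nothing tells the agent that 6 N is the right answer; it finds that out from thousands of cheap simulated failures, which is exactly what would be costly on physical hardware.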

The third, newer approach involves training large AI models — often called foundation models, meaning general-purpose models trained on broad datasets that can then be adapted to specific tasks — on video data of humans performing physical work. The idea is that a model trained on enough footage of human hands manipulating objects will develop a generalised understanding of physical interaction that can be transferred to a robot. Google DeepMind's robotics team has been among the most active in this direction, and the approach is compelling in principle: the internet contains an enormous amount of video of humans doing physical tasks. The challenge is that watching a task and doing it are different in ways that matter — a model that understands what a good grasp looks like does not automatically know how to execute one in a body it controls.

The Simulation Gap

Reinforcement learning in simulation sounds like it should solve the training problem neatly: run millions of trials in a physics engine, then transfer the learned behaviour to the real robot. In practice, this runs into what robotics researchers call the sim-to-real gap: the difference between how the simulated world behaves and how the real world behaves, which turns out to be larger than intuition suggests.

Physics engines model friction, deformation, and contact imperfectly. Real objects have surface properties that vary in ways no simulation captures fully. The robot's own hardware also behaves slightly differently from what the simulation assumes: its actuators (the motors and drives that move its limbs), its sensors, and its structural flex under load. A robot that has learned to pick up a smooth cylinder in simulation may fail on a slightly rough cylinder in the real world because its grip strategy assumes simulated friction properties. The failure modes are often subtle and appear only when the robot encounters conditions that differ from what its training covered.

The standard mitigation is domain randomisation — deliberately varying the simulated environment during training so the robot learns to handle a wider range of conditions. Rather than training on a single simulated table with fixed friction, you train on thousands of tables with randomly varied friction, surface texture, lighting, and object positioning. The theory is that a robot trained on enough variation will generalise better when it encounters real-world variation. It works, but imperfectly. The real world still finds edge cases that no training distribution anticipated.
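A small sketch shows why randomising friction during training helps. The physics here is a textbook simplification chosen for illustration: a two-finger pinch holds an object when the total friction force, 2 × friction coefficient × grip force, is at least the object's weight. The weight, the friction ranges, and the "training" rule (pick the gentlest force that holds in every sampled environment) are all assumptions, not a description of any particular company's pipeline.

```python
import random

WEIGHT_N = 5.0  # object weight in newtons (hypothetical)

def grasp_holds(force, mu):
    """Two-finger pinch holds if total friction 2 * mu * force >= weight."""
    return 2.0 * mu * force >= WEIGHT_N

def train_grip_force(friction_samples, candidates):
    """Return the smallest candidate force that holds in every sampled environment."""
    for force in sorted(candidates):
        if all(grasp_holds(force, mu) for mu in friction_samples):
            return force
    raise ValueError("no candidate force holds in every sampled environment")

def success_rate(force, friction_samples):
    """Fraction of environments in which the given force holds the object."""
    held = sum(grasp_holds(force, mu) for mu in friction_samples)
    return held / len(friction_samples)

rng = random.Random(0)
candidates = [0.5 * k for k in range(1, 41)]  # 0.5 N to 20 N in 0.5 N steps

# Training on one simulated surface with fixed friction.
fixed_force = train_grip_force([0.8], candidates)

# Training with domain randomisation: friction varied across 1000 simulated surfaces.
training_frictions = [rng.uniform(0.3, 1.0) for _ in range(1000)]
randomised_force = train_grip_force(training_frictions, candidates)

# Stand-in for the real world: frictions the training never saw exactly.
real_world = [rng.uniform(0.35, 0.95) for _ in range(1000)]
print("fixed friction:     ", fixed_force, "N, success", success_rate(fixed_force, real_world))
print("domain randomised:  ", randomised_force, "N, success", success_rate(randomised_force, real_world))
```

The randomised policy holds every object in this test only because the test frictions fall inside the range it trained on. A surface with a friction coefficient of 0.25 would still defeat it, which is the sense in which the technique works imperfectly.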

Why Data Collection Is the Actual Bottleneck

The limiting factor for most humanoid robotics companies right now is not the model architecture or the simulation infrastructure — it is data. Specifically, high-quality data of robot-relevant physical interactions in real environments.

Training a large language model, the kind of AI that powers ChatGPT or Claude, benefits from the fact that text exists in enormous quantities on the internet. There are billions of documents, conversations, and articles available to train on. Physical manipulation data does not exist in comparable quantities. Recordings of robots performing physical tasks in real environments, complete with sensor readings and outcomes, form a dataset that has to be created from scratch, task by task, robot by robot.

This is why the race to collect robot demonstration data is intensifying. Physical Intelligence, a San Francisco robotics company founded in 2024 by former Google and academic researchers, has explicitly positioned data collection as its core competitive strategy. The argument is that the company with the largest and highest-quality body of physical interaction data will build the most capable robot policies (the software systems that translate perception into action). The company raised $400 million in late 2024 specifically to fund that data collection effort.

Tesla's approach with Optimus is notable for the same reason. The company has argued that its existing fleet of vehicles, which continuously collect sensor data from the real world, gives it an advantage in understanding physical environments — and that its manufacturing facilities provide a convenient deployment context for collecting robot-specific data at scale. Whether that advantage is as significant in practice as it sounds in theory is genuinely uncertain, but the underlying logic — that data collection infrastructure is a durable competitive advantage — is sound.

What "Generalisation" Actually Means — and Why It's Hard

The goal of all this training is a robot that generalises — that can handle novel situations it wasn't explicitly trained on. A robot that can only perform tasks it has seen exact demonstrations of is not commercially useful in most environments. Real workplaces are not scripted. Objects are in unexpected positions. Lighting changes. Co-workers leave things in the way.

Generalisation in physical AI is harder than generalisation in language AI for a basic reason: the consequences of failure are immediate and material. A language model that misunderstands a sentence produces a bad answer. A robot that misunderstands a grasping situation drops a part, or worse, damages equipment or injures someone nearby. The reliability bar for physical generalisation is much higher than for conversational generalisation, and the evaluation is more demanding — you have to run the robot in actual environments, not just score it on benchmarks.

This is the honest picture of where robot learning sits right now: the techniques are real and advancing, the demonstrations are genuinely impressive in controlled conditions, and the gap between controlled conditions and deployment-grade reliability in varied real-world environments is narrowing. But it has not closed. Companies that claim otherwise are either working in a narrower task domain than they acknowledge, or they are describing what they hope the technology will do rather than what it currently does consistently.

The training problem is where most of the interesting unsolved work in humanoid robotics actually lives. The mechanical hardware — joints, actuators, sensors — is advancing steadily and is arguably ahead of the software. What limits the industry right now is not the ability to build a body, but the ability to teach it.