Figure released Helix-02 Bedroom Tidy, a two-humanoid demonstration in which a pair of robots reset a bedroom in under two minutes: opening doors, hanging clothes, putting away headphones, closing a book, taking out trash, pushing a chair under a desk, and cooperatively making the bed. Both robots run a single learned vision-language-action (VLA) policy, with no shared planner, no message passing, and no central coordinator. Each robot reads the room through its own cameras and infers its partner's intent from motion alone. To Figure's knowledge, this is the first published demonstration of a single learned neural network performing multi-humanoid collaborative locomanipulation directly from pixels to actions, and it is the most concrete public step yet beyond the February 2025 grocery-putaway demo, in which Figure first showed two robots running one shared policy.
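To make the "one policy, two bodies, no message passing" setup concrete, here is a minimal toy sketch in Python. It is not Figure's implementation: Helix-02 is a learned network operating on camera pixels, while this stand-in is a hand-written rule on a one-dimensional bed. Every name here (`Observation`, `shared_policy`, `run_episode`) and every constant is hypothetical. The only point it illustrates is the structure: both robots call the same function, each on its own observation, and the only thing they know about each other is what they can see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What one robot perceives (hypothetical toy stand-in for camera input)."""
    own_y: float      # this robot's hand position along the bed
    partner_y: float  # partner's hand position, as seen by this robot


def shared_policy(obs, target_y=1.0, max_step=0.05, max_lead=0.1):
    """Same function for both robots; a hand-written rule, not a learned network.

    Move toward the target, but never get more than `max_lead` ahead of the
    partner, so the (imaginary) comforter stays under even tension.
    """
    remaining = target_y - obs.own_y
    step = max(-max_step, min(max_step, remaining))
    lead = obs.own_y - obs.partner_y
    if step > 0 and lead + step > max_lead:
        # Too far ahead: shorten the step, or wait for the partner.
        step = max(0.0, max_lead - lead)
    return step


def run_episode(start_a, start_b, steps=100):
    """Simulate two robots that share one policy and exchange no messages."""
    y = {"a": start_a, "b": start_b}
    history = [dict(y)]
    for _ in range(steps):
        # Each robot builds its own observation from the world state alone.
        obs = {
            "a": Observation(own_y=y["a"], partner_y=y["b"]),
            "b": Observation(own_y=y["b"], partner_y=y["a"]),
        }
        # Identical policy, different inputs; no coordinator picks the actions.
        actions = {name: shared_policy(o) for name, o in obs.items()}
        for name, action in actions.items():
            y[name] += action
        history.append(dict(y))
    return history


if __name__ == "__main__":
    for t, state in enumerate(run_episode(0.0, 0.3)[:8]):
        print(t, round(state["a"], 3), round(state["b"], 3))
```

In this toy, the robot that starts ahead holds still until its partner closes the gap, then both advance together. The coordination comes entirely from each robot reacting to the other's observed position, which is the property the demo claims at far greater scale and with a learned policy in place of the rule.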
The motor repertoire on display is unusually diverse for a unified policy. Helix-02 opens lever-handled doors with whole-body coordination, balancing as the door swings; pushes an office chair under a desk by generating force through stance and foot placement rather than arm motion alone; carries a garment across the room and drapes it onto a coat tree using both hands; picks up headphones and reorients them mid-air to seat the headband on a narrow stand; closes an open book by handling its hinged, mass-shifting cover; and operates a trash-can foot pedal with single-leg balance — using a foot as an end-effector while standing on the other leg. The bed-making segment has both robots manipulating a deformable comforter from opposite sides of the bed, lifting, unfurling, spreading, folding, and smoothing, while continuously updating their predictions about each other's contact points as the fabric drapes and slides under shared tension.
Figure frames the difficulty as three compounding problems. First, two humanoids in one room is more than two single-robot problems running in parallel: every action one robot takes redefines the problem the other is solving, and each is reading its partner's intent from motion alone in real time while its own actions are simultaneously changing what the partner sees. Second, the central object is deformable, with no fixed pose, no rigid geometry, and no canonical grasp. There is no natural seam between "your half" and "mine," so each robot has to commit to a contact point while predicting what the other will do, then update both predictions tens of times per second. Third, the whole sequence is roughly two minutes of continuous whole-room locomanipulation, with the robots walking naturally between locations, balancing dynamically on one leg, and switching between rigid, deformable, articulated, and collaborative manipulation without scripted handoffs.
The architectural claim is that none of this required task-specific controllers. According to Figure, the same underlying policy that previously learned logistics, laundry folding, kitchen cleanup, and living-room tidying now performs collaborative bedroom reset after training on more data, with no changes to the core algorithm. That is a strong scaling claim for the VLA paradigm, and it lands in the same week as several arXiv papers, including When to Trust Imagination on World Action Models and ReflectDrive-2 on RL-aligned masked diffusion driving, that are working similar territory in simulation. Figure does not release weights, training data, or compute numbers with these demos, so the public surface is video plus claims; the result is impressive but unfalsifiable in the way Physical Intelligence's pi-0 was at first. Still, multi-humanoid coordination from pixels with one policy is the cleanest expression so far of the bet that locomotion, dexterity, sensing, and inter-agent reasoning collapse into one network when you add data, and Figure is hiring against that bet.