Physical Intelligence shows robot model with LLM-like generalization, flaws included

US start-up Physical Intelligence has introduced π0.7, a new robot foundation model designed to recombine skills learned during training, similar to how a language model reassembles text fragments from its training data. The researchers describe this as early signs of "compositional generalization" in robotics. The model is built on Google's open Gemma 3 language model with four billion parameters, paired with a smaller 860-million-parameter action expert that generates the actual robot motions.
According to PI, however, the decisive factor is not the architecture but the training recipe. Previous robot models typically receive only a short task description during training, such as "fold the t-shirt." π0.7 additionally gets a range of contextual information: subtask instructions in natural language, episode metadata on quality and speed of the demonstration, control mode labels, and even subgoal images that show what the result of an intermediate step should look like. These subgoal images are generated at runtime by a second, lightweight world model. This approach makes it possible to train on data of varying quality.
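The conditioning scheme can be illustrated with a minimal sketch. The field names and prompt format below are assumptions for illustration, not PI's actual schema or API; the point is only that each episode carries extra context (subtask instruction, quality and speed metadata, control mode, optional subgoal image) alongside the raw task description.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical episode schema illustrating the kind of contextual
# conditioning described in the article. Field names are invented.

@dataclass
class Episode:
    task: str                      # short task description, e.g. "fold the t-shirt"
    subtask: Optional[str] = None  # natural-language subtask instruction
    quality: str = "medium"        # episode metadata: demonstration quality
    speed: str = "normal"          # episode metadata: demonstration speed
    control_mode: str = "teleop"   # how the demonstration was collected
    subgoal_image: Optional[bytes] = None  # intermediate-result image (generated at runtime by a world model)

def conditioning_prompt(ep: Episode) -> str:
    """Flatten the extra context into a conditioning string for training."""
    parts = [f"task: {ep.task}"]
    if ep.subtask:
        parts.append(f"subtask: {ep.subtask}")
    parts.append(f"quality: {ep.quality}")
    parts.append(f"speed: {ep.speed}")
    parts.append(f"control: {ep.control_mode}")
    return " | ".join(parts)

ep = Episode(task="fold the t-shirt", subtask="grasp the left sleeve", quality="high")
print(conditioning_prompt(ep))
# task: fold the t-shirt | subtask: grasp the left sleeve | quality: high | speed: normal | control: teleop
```

Because the quality and speed labels are part of the input rather than a filter, a failed or slow demonstration simply gets a different label instead of being thrown away.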
Failed attempts or slow demonstrations can simply be tagged with the corresponding metadata rather than discarded.

One generalist instead of many specialists

PI reports that a single π0.7 model matches the performance of the previously RL-fine-tuned π*0.6 specialists on laundry folding, espresso making, and box building. Cross-embodiment transfer also works: a bimanual UR5e industrial manipulator folded t-shirts with an 80 percent success rate, even though no folding data had been collected on this robot. According to PI, this matches the zero-shot performance of experienced human teleoperators attempting the task on the robot for the first time. New tasks can also be taught via language coaching: a human walks the robot through the activity step by step, giving individual instructions.
These coaching episodes can then be used to train a high-level policy that performs the task autonomously, without collecting conventional teleoperation data.

The air fryer and the question of compositional generalization

As a prime example of compositional capability, PI cites loading a sweet potato into an air fryer. Without guidance the model fails; with step-by-step coaching it succeeds. In the technical report, the team writes that it found only two episodes in the training data in which a robot closes an air fryer, plus data from the open-source DROID dataset involving a Franka robot arm. A closer look at the accompanying demo video reveals, however, that the Franka arm in the DROID dataset opens an air fryer drawer and places a bottle inside.
Structurally, this is very close to the sweet potato task that π0.7 supposedly solves by recombining known skills. PI describes these episodes as "quite different" from what the mobile robot does in the experiment and interprets the result as evidence that the model composes skills anew, much like language models recombine text fragments from the web.

Video from the DROID dataset.

This carries a debate familiar from the language model world into robotics: whether a model genuinely solves a new task through generalization, or essentially recalls very similar training data.
With language models, this has been discussed for years under the heading of data contamination: evaluation tasks appearing verbatim or in very similar form in the training material. PI itself concedes in the report that, given the sheer size and diversity of the dataset, it can hardly be determined with certainty which tasks are truly novel.
The team argues, however, that this very recombination of known building blocks is the essence of "compositional generalization." In practice, they say, it makes no difference whether a skill is the product of generalization or transferred from similar situations ("remixed," as they call it).

Language model phenomena reach robotics

π0.7 suggests that robot foundation models are reaching a scale at which effects similar to those in large language models become visible: the nature of the prompt gains considerable importance, performance depends heavily on the context provided, and distinguishing between "genuine" generalization, remixing, and retrieval of similar examples becomes the central evaluation problem. Additional ablations in the report also show how important the metadata is for scaling.
Without quality annotations, the model deteriorates when additional, lower-quality data is added. With metadata, it continues to benefit from extra data even as average quality declines. The report does not address reasoning models; PI only hints at the end that steerable models like π0.7 could in the future solve more complex tasks by "thinking through" possible approaches in advance.
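The logic of this ablation can be shown with a toy comparison, using invented episode data: a filtering pipeline discards low-quality demonstrations and loses them entirely, while a metadata pipeline keeps everything and attaches a quality label that the model can condition on. This is an illustrative sketch, not PI's actual data pipeline.

```python
# Toy comparison of two ways to handle mixed-quality demonstrations.
# Episode contents are invented for illustration.

episodes = [
    {"obs": "o1", "action": "a1", "quality": "high"},
    {"obs": "o2", "action": "a2", "quality": "low"},   # failed attempt
    {"obs": "o3", "action": "a3", "quality": "high"},
    {"obs": "o4", "action": "a4", "quality": "low"},   # slow demonstration
]

# Strategy 1: discard anything below "high" -- half the data is lost.
filtered = [ep for ep in episodes if ep["quality"] == "high"]

# Strategy 2: keep everything, but attach the quality label as context
# so the model can learn from all data and be steered at inference time.
labeled = [{"context": f'quality={ep["quality"]}', **ep} for ep in episodes]

print(len(filtered))  # 2
print(len(labeled))   # 4
```

Under the second strategy, adding more low-quality episodes still grows the usable training set, which matches the reported ablation result that the metadata-conditioned model keeps benefiting from extra data even as average quality falls.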
The current model does not yet take such a step on its own.