Vision-language models with extended reasoning succeed on problems they can solve through internal reasoning but fail when external tools are needed: under standard GRPO, the policy attempts tool calls on only ~30% of rollouts, and ~40% of those tool-using rollouts are all-wrong within their group, gutting the learning signal exactly where it's needed. The authors call this the Thinking-Acting Gap and propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples just the tool call and its continuation, with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO by +1.8pp Pass@1 and +1.8pp Pass@4 at 8B; the SFT+AXPO 8B model matches Qwen3-VL-Thinking-32B Base on Pass@4 with 4× fewer parameters. The asymmetry insight matters because it generalises to any RL recipe where some action branches are high-variance auxiliaries to a default behavior: the standard group-normalized estimator silently down-weights exactly those branches.
- The arXiv abstract and HF Daily framing emphasize the diagnostic the paper surfaces (a 30% tool-attempt rate and 40% all-wrong subgroups) as the measurable cause of the gap.
- _akhaliq's thread on X highlighted the Pass@4 result, in which the 8B model with AXPO matches the 32B base model, as the headline number.
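
To make the vanishing-signal argument concrete, here is a minimal sketch of group-normalized advantages of the kind GRPO uses, (reward minus group mean) divided by group standard deviation. It is an illustration, not the authors' code: the function name `group_normalized_advantages`, the `eps` value, the binary rewards, and the ten-rollout group with three tool calls are all assumptions chosen for the example.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-6):
    """Hypothetical helper: (r - mean) / (std + eps) within one rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Illustrative group of 10 rollouts; 3 of them attempt a tool call (assumed split).
used_tool = np.array([False] * 7 + [True] * 3)

# Case 1: every rollout in the group is wrong.
# All rewards equal the mean, so every advantage is zero and no gradient flows.
all_wrong_adv = group_normalized_advantages(np.zeros(10))

# Case 2: some non-tool rollouts succeed, but the tool-using subgroup is all-wrong.
rewards = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0], dtype=float)
adv = group_normalized_advantages(rewards)
tool_adv = adv[used_tool]

print("all-wrong group advantages:", all_wrong_adv)
print("tool-using advantages:     ", tool_adv)
```

In the second case every tool-using rollout receives the same negative advantage, so the update discourages tool use as a whole and carries no information about which tool call was better than another. Resampling only the tool call and its continuation from a fixed thinking prefix, as the summary describes for AXPO, is one way to reintroduce reward contrast inside that subgroup.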