Simon Willison published an annotated version of his PyCon US 2026 lightning talk surveying the last six months of LLM development. He frames November 2025 as the inflection point: the "best" model crown changed hands five times among Anthropic, OpenAI, and Google in that month alone, with Claude Sonnet 4.5 overtaken by GPT-5.1, then Gemini 3, then GPT-5.1 Codex Max, before Claude Opus 4.5 reclaimed it. He argues that the more important November story was that coding agents crossed a usability threshold. OpenAI and Anthropic had spent most of 2025 running reinforcement learning from verifiable rewards against their Codex and Claude Code harnesses; by November the results had compounded, and coding agents went from often-work to mostly-work, becoming usable as daily drivers without spending most of your time fixing their mistakes.
The other tentpole of the six months was the rise of personal AI assistants, which Simon collectively calls "Claws" after the breakout success of OpenClaw. The project started life in late November 2025 as a quietly committed repo called Warelay and went through several renames before exploding into public attention in February. Mac Minis started selling out around Silicon Valley because people were buying them as the "aquarium for your Claw": local hardware to host a personal assistant.
Simon's running pelican-on-a-bicycle SVG benchmark traces the model progression visually. Sonnet 4.5 in September was the baseline, and the November cohort improved markedly. Gemini 3.1 Pro in February drew a pelican with a fish in its basket, and Jeff Dean tweeted an animated multi-animal version including a frog on a penny-farthing and a turtle kickflipping a skateboard, suggesting the labs have indeed been training on this benchmark.
The April releases pushed open weights into new territory. Google's Gemma 4 series is the most capable open-weight model Simon has seen from a US lab. GLM-5.1 from China, a 754B-parameter, 1.5TB model, drew a credible pelican and a notably good animated North Virginia opossum on an e-scooter that other models cannot match. Qwen3.6-35B-A3B, a 20.9GB file that runs on Simon's laptop, drew a better pelican than Claude Opus 4.7.
The synthesis: coding agents got really good, and the models that can run on a laptop have started wildly outperforming expectations. The pelican benchmark, Simon concedes, has firmly exceeded its limits as a useful measure.