Wolf Digest — 2026-05-19

#1

Simon Willison's 'The last six months in LLMs in five minutes' — November 2025 inflection point

Frontier LLMs 2026-05-19 Simon Willison's Weblog 8.3 8.0/8.4/8.5

Simon Willison published an annotated version of his PyCon US 2026 lightning talk surveying the last six months of LLM development. He frames November 2025 as the inflection point: the "best" model crown changed hands five times between Anthropic, OpenAI, and Google in that month alone — Claude Sonnet 4.5 was overtaken by GPT-5.1, then Gemini 3, then GPT-5.1 Codex Max, then Claude Opus 4.5 reclaimed it. He argues the more important November story was that coding agents crossed a usability threshold. OpenAI and Anthropic had spent most of 2025 running reinforcement learning from verifiable rewards against their Codex and Claude Code harnesses; by November the results compounded, and coding agents went from often-work to mostly-work — usable as daily drivers without spending most of your time fixing their mistakes.

The other tentpole of the six months was the rise of personal AI assistants Simon collectively calls "Claws", after the breakout success of OpenClaw (which started life in late November 2025 as a quietly-committed repo called Warelay and went through several rename cycles before exploding into public attention in February). Mac Minis started selling out around Silicon Valley because people were buying them as the "aquarium for your Claw" — local hardware to host a personal assistant. Simon's running pelican-on-a-bicycle SVG benchmark traces the model progression visually: Sonnet 4.5 in September was the baseline; the November cohort improved markedly; Gemini 3.1 Pro in February drew a pelican with a fish in its basket; Jeff Dean tweeted an animated multi-animal version including a frog on a penny-farthing and a turtle kickflipping a skateboard, suggesting the labs have indeed been training on this. The April releases pushed open weights into new territory: Google's Gemma 4 series is the most capable open-weight model Simon has seen from a US lab, and GLM-5.1 from China — a 754B-parameter, 1.5TB model — drew a credible pelican and a notably good animated North Virginia opossum on an e-scooter that other models cannot match. Qwen3.6-35B-A3B, a 20.9GB file that runs on Simon's laptop, drew a better pelican than Claude Opus 4.7. The synthesis: coding agents got really good, and the laptop-available models have started wildly outperforming expectations. The pelican benchmark, Simon concedes, has firmly exceeded its limits as a useful measure.

retrospective coding-agents pelican-benchmark local-llms open-weights

#2

Import AI 457: pre-Stuxnet 'fast16' precision-sabotage malware, Muon's neuron-death bug, and 'positive alignment'

Research 2026-05-18 Import AI (Jack Clark) 8.0 8.0/8.2/7.8

Jack Clark's weekly digest leads with SentinelOne's teardown of fast16.sys, a ~20-year-old Windows driver that predates Stuxnet by roughly five years and appears to have been engineered to sabotage high-precision scientific computation. Unlike typical malware that hijacks execution flow, fast16's distinctive payload is a self-contained block of x87 FPU instructions performing precision arithmetic and array scaling — a standalone mathematical function that quietly mutates results inside specific simulation suites. SentinelOne's YARA signatures matched fewer than ten files across a period-appropriate corpus, but the matches clustered around LS-DYNA 970, PKPM, and the MOHID hydrodynamic platform — tools used in crash testing, structural analysis, and environmental modeling, with LS-DYNA specifically named in public reporting on Iran's suspected JCPOA Section T violations. Clark frames the lesson in superintelligence terms: a sufficiently capable system that wanted to slow rival AI programs might prefer subtle scientific-computation corruption over visible attacks.

The optimizer section covers Tilde Research's autopsy of Muon, currently widely used for pre-training. Tilde shows that Muon inherits row-norm anisotropy on tall matrices, causing "a significant portion of neurons in MLP layers to permanently die" — more than one in four neurons effectively flatline by step 500 of learning-rate warm-up, producing a sharply bimodal leverage distribution where one mass receives near-zero updates while the other receives disproportionately large ones. Their replacement, Aurora ("a leverage-aware optimizer for rectangular matrices"), trained 1.1B-parameter transformers on ~100B tokens and reached final smoothed loss 2.26 versus Muon's 2.31 and NorMuon's 2.33, with MMLU jumping ten points over Muon — a result Pleias's Alexander Doria independently replicated on a 600M model. Whether Aurora actually beats AdamW remains open.
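The row-norm point can be illustrated with a small NumPy sketch. This is not Tilde's code and not Muon's Newton-Schulz iteration: it uses an exact SVD to compute the orthogonalized update that Muon approximates, applied to a hypothetical tall gradient whose rows (standing in for neurons) have very uneven scales. The function names, the toy gradient, and the choice of rows as neurons are all assumptions made for illustration.

```python
import numpy as np

def orthogonalize(grad):
    """Exact polar factor U @ Vt of grad (the target Muon's iteration approximates)."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def row_leverage(update):
    """Squared row norms; with orthonormal columns these are leverage scores."""
    return np.sum(update ** 2, axis=1)

rng = np.random.default_rng(0)
rows, cols = 512, 64  # tall matrix: more rows than columns

# Hypothetical gradient: the first quarter of the rows carries almost no signal.
scales = np.where(np.arange(rows) < rows // 4, 1e-3, 1.0)
grad = rng.standard_normal((rows, cols)) * scales[:, None]

update = orthogonalize(grad)
leverage = row_leverage(update)

# The columns are orthonormal, so the leverage scores sum to the column count...
print(np.allclose(update.T @ update, np.eye(cols), atol=1e-8))
print(float(leverage.sum()))
# ...but nothing forces them to be spread evenly across rows.
print(leverage[: rows // 4].mean(), leverage[rows // 4 :].mean())
```

In this toy setup the weak rows stay weak after orthogonalization, because the operation equalizes singular values rather than row norms. That is the sense in which a tall-matrix update can keep starving the same rows.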

The third major thread is "positive alignment," a position paper from a wide consortium (Oxford, DeepMind, OpenAI, Anthropic, UCLA, Stanford, Tufts, Imperial, Sussex, LIFE, Aily Labs, Positive AI Labs) arguing that the field's negative-alignment focus on failure-mode reduction leaves AI in a local optimum of "superficial and soulless assistance." Their list of safety's structural gaps includes the "floor without ceiling" problem (models that satisfy constraints while being sycophantic and unhelpful), preference-wellbeing divergence (users prefer flattery over honest feedback), and a hidden-value-system critique. The authors argue positive alignment requires polycentric, decentralized governance — explicitly the opposite of the centralized-control mental model some safety researchers default to. Clark also covers Prime Intellect's automated AI-research benchmark showing models can autonomously improve their own performance on AI research tasks but still struggle to produce original ideas.

cybersecurity optimizers muon alignment automated-research

#3

Anthropic acquires Stainless, the SDK-generation startup used by OpenAI, Google, and Cloudflare

Industry 2026-05-18 Anthropic News, TechCrunch — AI, The Information — AI 7.8 7.5/7.4/8.5

Anthropic announced its acquisition of Stainless, a New York-based startup founded in 2022 that automates the generation and maintenance of language-specific SDKs around HTTP APIs. The deal brings a quietly load-bearing piece of the SDK supply chain in-house: Stainless is already the tooling behind the Anthropic, OpenAI, Google, and Cloudflare SDKs, plus those of several other API providers, and the team arrives at a moment when Anthropic is leaning heavily on developer experience as a wedge against OpenAI's developer base. Terms were not disclosed; The Information notes the timing follows Anthropic's $200M Gates Foundation partnership earlier in May, the SpaceX compute deal on May 6, and last week's Stainless-adjacent moves on Claude Code's release cadence.

Stainless's pitch has been that maintaining first-class clients in Python, TypeScript, Go, Java, and others across an evolving REST surface is a long-tail engineering burden that producers consistently underinvest in, leading to broken release pipelines and lagging support. Their generator emits typed clients with hand-written ergonomic shims, retries, streaming, and pagination wired in, and re-runs on every OpenAPI revision. For Anthropic specifically, the strategic case is twofold: tightening the loop between the Messages API surface and the canonical clients (the implicit reference for any third-party client a developer hand-rolls), and inheriting a team that has been doing the same job for OpenAI's SDK — a competitive intelligence and talent acquisition vector that the article in The Information flags as the more interesting subplot. Whether the team continues serving OpenAI and Google under Anthropic ownership is unstated; the standard playbook for acquisitions of this kind is to honor existing contracts and gradually shift focus inward.

The acquisition sits in a longer arc of Anthropic stitching together developer-facing infrastructure: the Stainless team builds the SDK rails, Claude Code is the agentic coding harness, the Computer Use and Code Execution tools complete the agent runtime, and the recent business announcements (Blackstone-Hellman-Goldman enterprise services company, PwC deployment, financial-services agents) are the enterprise distribution arm. TechCrunch's read is that the dev-tools acquisition is less about Stainless's revenue (small, by SaaS standards) and more about owning the developer-onboarding surface end-to-end as the API business scales.

How it was discussed

Anthropic's own announcement frames it as a developer-experience play; the founders join Anthropic's engineering org.
TechCrunch emphasizes that Stainless powers OpenAI's, Google's, and Cloudflare's SDKs — an unusual situation where one of those providers now owns the supplier.
The Information highlights this as the third major Anthropic deal in May (after the Gates Foundation $200M partnership and the SpaceX compute deal), reading the cadence as an aggressive enterprise push.

m&a sdk developer-tools

#4

Jury hands Musk a unanimous defeat in Musk v. Altman — claims time-barred

Safety, Policy & Regulation 2026-05-18 MIT Technology Review — AI, TechCrunch — AI, The Information — AI 7.6 7.0/7.0/8.7

A nine-juror panel in California returned a unanimous advisory verdict on Monday that Elon Musk's suit against OpenAI and Sam Altman was filed too late — barring his claims under the applicable statutes of limitations. US District Judge Yvonne Gonzalez Rogers accepted the advisory verdict immediately. The case had been pitched by Musk's team as a substantive challenge to OpenAI's conversion from a non-profit research lab to a capped-profit and then for-profit entity, alleging that the conversion violated the founding mission. The jury never reached the merits; the timeliness question was dispositive. Musk posted on X that he will appeal, characterizing the outcome as a "calendar technicality" rather than a judgment on the substance, and signaling that the litigation arc will continue through the Ninth Circuit even as OpenAI's restructuring proceeds.

The practical consequence is that OpenAI's path through its capped-profit-to-for-profit transition is no longer encumbered by Musk's specific claim, removing a constraint on its capital-raising posture and on the Microsoft relationship's renegotiation. The Information's coverage emphasizes the political dimension — Sam Altman's win in court arrived in the same week as the OpenAI-Dell partnership for on-premises Codex deployments and the rumored Google-Blackstone TPU cloud, all of which sharpen the OpenAI-versus-everyone-else narrative going into Google I/O. TechCrunch frames Musk's appeal as procedurally weak: jurors deliberated for less than a full day and reached unanimity, suggesting that the underlying record on when Musk knew or should have known of OpenAI's commercial direction was not close. The outcome closes one of three Musk-versus-Altman litigation tracks; the xAI defamation suit and the OpenAI counter-suit over Musk's restraining-order attempts remain pending. Several outlets note that the verdict's deeper effect is reputational rather than financial: OpenAI's board can now move with materially less litigation overhang as it negotiates the IPO timeline and the new training-compute partnerships.

How it was discussed

MIT Technology Review's verdict-day story emphasizes the substantive claims were never adjudicated — Musk's mission-betrayal theory remains untested.
The Information frames the win as removing a constraint on OpenAI's restructuring rather than vindicating its conduct, and connects it to the same week's Dell/Codex and Google/Blackstone deals.
TechCrunch reads the speed of the jury's unanimity as a signal that the timeliness record was not close — the appeal faces an uphill posture.

openai musk litigation non-profit-conversion

#5

NVIDIA Vera ships: first CPU built specifically for AI agents arrives at frontier labs

Infrastructure 2026-05-18 NVIDIA AI Blog 7.2 7.5/7.0/7.0

NVIDIA confirmed first shipments of Vera, the custom Arm-based CPU that pairs with Rubin GPUs in the Vera Rubin NVL72 platform. NVIDIA's pitch is that agent sandboxes run 50% faster on Vera than on traditional CPUs, with enterprise data query latency improvements quoted in the same range. The Vera CPU is now in the hands of top AI labs for early integration ahead of the broader NVL72 rollout, positioning the host-CPU side of the rack as a deliberate co-design target for tool-use and long-context agent workloads rather than leaving it in its historical role as a GPU babysitter. Pricing and detailed core-count disclosure remain pending.

nvidia cpu vera agents-infra

#6

Jensen Huang at Dell Technologies World: 'Demand is going parabolic, utterly parabolic'

Industry 2026-05-18 NVIDIA AI Blog 7.0 6.5/6.5/8.0

NVIDIA's CEO used the Dell Technologies World keynote to describe enterprise AI demand as "going parabolic, utterly parabolic," framed around the Vera Rubin NVL72 delivering agentic-AI inference at one-tenth the cost per token versus prior generations. The slide deck emphasized Dell's role as the on-prem distribution partner for the new Vera CPU and the OpenAI Codex enterprise rollout, tying together a same-week trio of announcements (Vera CPU shipments, Dell-Codex on-prem partnership, and the Google-Blackstone TPU cloud) into a single "capacity is being committed years out" narrative.

nvidia earnings-signal data-center-demand

#7

Google and Blackstone forming a TPU-cloud company to broaden TPU access

Infrastructure 2026-05-18 The Information — AI 7.0 7.0/7.0/7.0

Google and Blackstone are creating a joint cloud-computing company that will rent Google's tensor processing units to outside AI developers, according to a source with direct knowledge cited by The Information. The structure positions Blackstone as the capital and real-estate balance sheet for new TPU-only data centers while Google supplies the silicon and the JAX/XLA toolchain, broadening TPU reach beyond GCP. The deal is read internally as Google's response to the perception that TPUs remain underutilized relative to their compute efficiency because the developer surface area outside Google's first-party tooling is thin. By carving out a co-branded provider, Google can sell TPUs to AI shops that won't touch GCP's broader contract structure.

tpu google blackstone compute

#8

OpenAI and Dell partner to bring Codex to on-premises and hybrid enterprise environments

AI Coding 2026-05-18 OpenAI Research 7.0 7.0/6.8/7.2

OpenAI announced a partnership with Dell to deliver Codex into hybrid and on-premises enterprise environments, a notable concession to enterprises whose source code, regulatory posture, or data-residency rules forbid sending repository contents to OpenAI's hosted endpoints. The deal pairs Codex's agentic coding harness with Dell's PowerEdge servers and the broader Dell AI Factory stack, with a deployment topology that keeps model weights on-prem for sensitive customers and routes through Dell-managed inference for hybrid customers. The move closes a gap that Anthropic-via-AWS has been exploiting in regulated verticals and signals that the "cloud-only" era of agentic coding tools is functionally over.

openai codex dell on-prem enterprise

#9

Latent Space podcast — Yaroslav Azhnyuk on AI-guided drones and why the West is still planning to fight the last war

Government & Defense 2026-05-18 Latent Space Podcast 6.8 6.0/7.0/7.5

Latent Space ran a two-hour special with guest host Noah Smith and Brandon Anderson interviewing Yaroslav Azhnyuk, the PetCube founder now running The Fourth Law — one of the most active AI-guided FPV-drone companies operating in Ukraine. Topics span the modern drone tech stack (vision-language target acquisition, sub-second guidance under jamming, autonomy under contested GPS), the empirical economics of drone-versus-armor exchanges, why Western procurement timelines are structurally incompatible with the Ukrainian iteration cadence, and what an AI-saturated battlefield implies for Western readiness. Azhnyuk argues the West remains anchored to expensive exquisite platforms while Ukraine has functionally built a software-defined air force at $500-per-airframe price points.

drones ukraine defense-tech fpv vision-language

#10

Post-Trained MoE Can Skip Half Experts via Self-Distillation

Efficiency 2026-05-18 AK (@_akhaliq) Daily Papers, arXiv cs.AI, arXiv cs.CL, arXiv cs.LG, arXiv — Efficiency, arXiv — Evals & Benchmarks, Hugging Face Daily Papers 6.6 7.0/6.5/6.3

The authors show that post-trained MoE models can drop roughly half of their experts at inference time without meaningful quality loss, via a self-distillation procedure that compresses expert-routing diversity into a compact subset of always-on experts. The recipe runs a short distillation pass in which the unpruned model teaches its pruned twin to match logits under reduced expert budgets, then re-tunes the routing head. Reported results across Mixtral- and Qwen-MoE-class models show <1% degradation on MMLU, GSM8K, and HumanEval at 0.5× expert count, with proportional FLOPs and KV-cache savings during decoding.

How it was discussed

AK's Daily Papers thread highlights the practical cost-per-token implications for hosted inference.
arXiv's efficiency category positions it next to MoE-pruning prior work — the novelty is using self-distillation rather than learned router masks.

moe expert-pruning self-distillation inference-efficiency

#11

LongLive-2.0: NVFP4 parallel infrastructure for long video generation

Generative Media 2026-05-18 AK (@_akhaliq) Daily Papers, arXiv cs.CV, arXiv — Efficiency, arXiv — Evals & Benchmarks, arXiv — Generative Media / Diffusion, Hugging Face Daily Papers 6.5 6.7/6.5/6.3

LongLive-2.0 is an NVFP4-native serving and training stack for long-video diffusion models, with pipeline and tensor parallelism redesigned around Blackwell's 4-bit floating-point microscaling. The authors report end-to-end throughput improvements on 30-second 720p video generation at fixed quality versus FP8/INT8 baselines, with attention-bias reweighting and a custom KV-cache layout to absorb the dynamic-range loss. The infrastructure piece — rather than the architectural innovation — is the contribution: it's a reference implementation showing the NVFP4 path is viable for diffusion workloads, not just decoder-only LLMs.

nvfp4 video-diffusion low-precision blackwell

#12

Stratechery — Data Center Discontent, Understanding the Opposition, Fixing the Problem

Infrastructure 2026-05-18 Stratechery 6.5 6.5/7.0/6.0

Ben Thompson argues that local opposition to AI data centers has legitimate grounding — noise, grid stress, groundwater draw, and the perception that benefits accrue to distant shareholders while costs land on host counties — and that the only durable solution is direct payments to affected residents. The post lays out the political economy of why permitting fights have intensified through 2025-2026, why hyperscalers have so far relied on tax abatements that flow to local governments rather than residents, and why a more honest "pay the impacted households" structure would shorten timelines for the next round of multi-gigawatt builds.

data-centers permitting infrastructure-policy

#13

SandboxAQ brings its drug-discovery models to Claude — no PhD in computing required

AI for Science 2026-05-18 TechCrunch — AI 6.5 6.5/6.7/6.3

SandboxAQ is exposing its large quantitative models (LQMs) for drug discovery — including its binding-affinity prediction, ADMET, and molecular-dynamics surrogate models — through Anthropic's Claude as agent-callable tools. The pitch is that access, not raw capability, is the binding constraint in computational drug discovery, and the Claude integration lets non-specialist medicinal chemists run the workflows in conversation rather than via custom Python. SandboxAQ's models compete with Chai Discovery and Isomorphic Labs on the model side; the bet here is that distribution through a general-purpose agent surface beats a specialist UI.

drug-discovery claude-tool-use lqm

#14

MIT Technology Review — What to expect from Google this week (I/O 2026 preview)

Frontier LLMs 2026-05-18 MIT Technology Review — AI 6.5 6.0/6.5/7.0

MIT Tech Review's pre-I/O take argues Google enters this year's conference as a clear third in the foundation-model race, having spent the months since Gemini 2.5 Pro's mid-2025 high-water mark watching Claude Opus 4.5, GPT-5.1 Codex Max, and Gemini 3 trade the crown. Expected announcements: a Gemini 3.5 update, deeper agent integration across Workspace, an expanded TPU roadmap addressing the Blackstone partnership, and a hardware demo for a Project Astra successor. The piece frames whatever Google ships tomorrow as a make-or-break moment for the "Google can't ship" narrative.

google-io gemini frontier-race

#15

MIT Technology Review — Inside Anduril and Meta's quest to make smart glasses for warfare

Government & Defense 2026-05-18 MIT Technology Review — AI 6.5 6.5/7.0/6.0

MIT Tech Review's deep dive on the Anduril-Meta defense headset partnership covers the Lattice-on-Aria-glasses integration, IVAS competition dynamics, and the Army's selection criteria heading into the next phase of soldier-borne AR. The piece quotes Anduril and Meta on how computer-vision pipelines, low-latency sensor fusion, and Lattice's threat-tracking layer are being adapted from Anduril's drone-targeting stack into wearable form factors, with the thermal and power constraints that distinguish soldier kit from consumer Ray-Ban Stories.

anduril meta smart-glasses ivas defense

#16

Hugging Face — The Open Agent Leaderboard launches with IBM Research collaboration

Evaluations & Benchmarks 2026-05-18 Hugging Face Blog 6.4 6.5/6.5/6.2

Hugging Face and IBM Research launched the Open Agent Leaderboard, an evaluation harness that runs agent loops across τ²-Bench Telecom, GAIA, SWE-Bench-Verified, and a new IBM-contributed AppWorld task set, with a reproducible Docker harness so open-weights models can be benchmarked under the same conditions as closed providers. The day-zero leaderboard shows Claude Sonnet 4.6 and GPT-5.4 leading the closed track, with Qwen3.6, DeepSeek V4 Pro, and GLM-5.1 dominating the open track. The harness's contribution is methodological: standard tool definitions and execution environments so that "ran the agent in our own way" numbers cannot be cherry-picked.

agents leaderboard evaluation open-weights

#17

Hugging Face — Fine-Tuning NVIDIA Cosmos Predict 2.5 with LoRA/DoRA for robot video generation

Robotic Autonomy 2026-05-18 Hugging Face Blog 6.4 6.5/6.3/6.4

NVIDIA's Cosmos team and Hugging Face published a recipe for LoRA and DoRA fine-tuning of Cosmos Predict 2.5 on robot-trajectory video data, enabling teams to specialize the world-model on a target robot embodiment with relatively small per-task adapter footprints rather than full re-pretraining. The recipe covers VAE conditioning, action-token encoding, and the new DoRA decomposition variant that NVIDIA recommends for cross-embodiment transfer where rank-only LoRA underperforms. Reference notebooks include data preparation for the Galaxea G1 and Unitree H1.
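As a rough illustration of the difference between the two adapter types, here is a minimal NumPy sketch of the LoRA and DoRA weight reparameterizations as they are usually written. It is not the Cosmos recipe: the shapes, names, and initialization are assumptions, and real fine-tuning would train `a`, `b`, and `magnitude` with a framework rather than build weights by hand.

```python
import numpy as np

def lora_weight(w0, a, b, scale=1.0):
    """LoRA: frozen weight plus a low-rank update B @ A."""
    return w0 + scale * (b @ a)

def dora_weight(w0, a, b, magnitude, scale=1.0):
    """DoRA: a learned per-column magnitude times a LoRA-updated, normalized direction."""
    direction = w0 + scale * (b @ a)
    col_norm = np.linalg.norm(direction, axis=0, keepdims=True)
    return magnitude * direction / col_norm

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 4, 2                    # toy sizes, chosen for illustration
w0 = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
a = rng.standard_normal((rank, d_in)) * 0.01   # LoRA "A" factor
b = np.zeros((d_out, rank))                    # LoRA "B" factor starts at zero
magnitude = np.linalg.norm(w0, axis=0, keepdims=True)  # DoRA init: column norms of w0

# At initialization both adapters reproduce the frozen weight exactly.
print(np.allclose(lora_weight(w0, a, b), w0))
print(np.allclose(dora_weight(w0, a, b, magnitude), w0))
```

The extra `magnitude` vector is what lets DoRA change how large each column is independently of which direction it points, which plain LoRA can only do through the same low-rank product.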

cosmos lora dora world-models robot-learning

#18

Hugging Face — PaddleOCR 3.5 runs on a Transformers backend

Multimodal 2026-05-18 Hugging Face Blog 6.3 6.5/6.0/6.4

PaddleOCR 3.5 has been ported to run natively on the Transformers backend rather than only PaddlePaddle's runtime, broadening the model's deployment surface to anywhere transformers / accelerate / vLLM are supported. The blog covers the document-parsing pipeline (layout analysis → OCR → table-structure recognition → key-value extraction) and benchmarks against MinerU, dots.ocr, and Qwen-VL on the OmniDocBench suite — PaddleOCR 3.5 leads on Chinese tables and complex layouts while losing on math-heavy academic papers to Qwen-VL.

ocr document-ai transformers

#19

DefenseScoop — 'Collaborative autonomy' development not moving fast enough for SOCOM

Government & Defense 2026-05-18 DefenseScoop 6.3 6.0/6.7/6.2

U.S. Special Operations Command leadership warned at the SOFIC conference that the pace of multi-agent autonomy development — drone swarms, autonomous boats, autonomous resupply — remains too slow for the operational gap SOCOM sees opening with peer adversaries. The piece quotes SOCOM's acquisition executive arguing the bottleneck is not algorithms but the data, simulation environments, and live-flight test capacity for AI-mediated coordination. SOCOM is asking industry for collaborative autonomy at edge networks under contested communications, not just single-platform autonomy.

socom autonomy swarms edge-ai

#20

DefenseScoop — The Pentagon's cyber reform effort stumbles out of the gate

Government & Defense 2026-05-18 DefenseScoop 6.2 6.0/6.5/6.0

DefenseScoop reports the War Department's reorganization of the cyber acquisition portfolio — announced earlier in May — is already running into integration friction between the new Cyber Acquisition Executive and the legacy service cyber commands. Programs cited as at-risk include the joint software factory consolidation and the Cyber Operational Test and Evaluation framework that was supposed to standardize AI-enabled defensive cyber tool acquisition across services.

pentagon cyber-acquisition reorganization

#21

FedScoop — Federal agencies would use NIST's AI guidelines under bipartisan House bill

Safety, Policy & Regulation 2026-05-18 FedScoop — AI 6.2 6.0/6.7/6.0

A bipartisan House bill would require federal civilian agencies to use NIST's AI Risk Management Framework as the default reference for procurement and operational governance of AI systems. Supporters frame the bill as codifying what is already de facto practice across CISA, OMB-led pilots, and the DoD's responsible-AI rollout, while opponents argue NIST's framework is too high-level to be operationally binding without significant additional guidance. The bill cleared subcommittee with bipartisan votes and is expected to head to full committee within weeks.

nist ai-rmf federal-procurement legislation

#22

War on the Rocks — Army Aviation's wasted decade: lessons for drone integration

Government & Defense 2026-05-18 War on the Rocks 6.2 6.0/6.5/6.0

The piece argues Army Aviation spent the 2015–2025 decade pursuing the Future Vertical Lift programs while letting cheap autonomous rotorcraft slide to industry competitors. The author draws explicit lessons for the current drone integration push: don't repeat the "single-platform-of-record" mistake; institutionalize software-defined modularity so that the AI/autonomy stack can be replaced independently of the airframe; and accept that the right unit cost is closer to the FPV economy than the FVL economy.

army-aviation drones fvl acquisition-reform

#23

The Information — Analog Devices in talks to buy AI power-chip startup Empower Semiconductor for ~$1.5B

Infrastructure 2026-05-18 The Information — AI 6.2 6.5/6.0/6.0

Analog Devices is in advanced talks to acquire Empower Semiconductor for approximately $1.5 billion, per The Information's sources. Empower's vertical-power-delivery chips are increasingly used alongside high-power AI accelerators to manage current at the package level, a design constraint that has become acute on Hopper, Blackwell, and especially Vera Rubin, where transient draw demands are no longer comfortably handled by traditional VRMs. The deal would expand Analog Devices' AI-adjacent silicon portfolio at a moment when the power-delivery layer is being recognized as a distinct strategic bottleneck.

power-delivery m&a ai-chips

#24

The Information — Microsoft executives sound the alarm over GitHub's eroding AI lead

AI Coding 2026-05-18 The Information — AI 6.2 6.5/6.0/6.2

The Information reports internal Microsoft tension over GitHub Copilot's eroding share against Cursor, Claude Code, and Anthropic's growing developer footprint. Executives quoted in the piece point to the time-to-feature gap on agentic coding (Cursor and Claude Code ship rapidly; Copilot's agent surface lags), the IDE-versus-CLI split, and the awkward position of competing with OpenAI's Codex, which Microsoft also sells. Internal memos cited in the article call for a faster Copilot iteration cadence and a reorganization that moves agentic-coding leadership closer to the GitHub product team.

github-copilot microsoft claude-code cursor

#25

The Information — Meta shifts thousands of workers to new AI groups as layoffs loom

Industry 2026-05-18 The Information — AI 6.1 6.0/6.0/6.3

Meta is reorganizing several thousand engineers into newly-spun-up AI product groups while signaling layoffs are coming for the unmoved population. The reorg follows the Muse Spark launch in April and is aimed at consolidating Meta's frontier-model effort under a tighter command structure — partly in response to the Llama-4 instability and the year-of-departures Meta absorbed in 2025. The signal value to the industry is that Meta is still betting big on superintelligence-scale models even as it trims teams that don't sit inside the new structure.

meta reorg llama muse-spark

#26

arXiv — CiteVQA: benchmarking evidence attribution for trustworthy document intelligence

Evaluations & Benchmarks 2026-05-18 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.1 6.3/6.0/6.0

CiteVQA introduces a Doc-VQA evaluation that scores not just the final answer but the element-level bounding-box citation that supports it, exposing a failure mode in current MLLMs: arriving at the correct answer while grounding it in the wrong passage — a critical problem in legal, financial, and medical document workflows. The benchmark asks models to return both answer and bbox citations, and reports a substantial drop in joint accuracy versus answer-only accuracy across Claude, GPT, and Qwen-VL document agents.
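The gap between answer-only and joint accuracy can be made concrete with a small sketch. This is not CiteVQA's scoring code: the record fields, the exact-match answer check, and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def score(examples, iou_threshold=0.5):
    """Return (answer-only accuracy, joint answer-and-citation accuracy)."""
    answer_hits = joint_hits = 0
    for ex in examples:
        answer_ok = ex["pred_answer"].strip().lower() == ex["gold_answer"].strip().lower()
        cite_ok = iou(ex["pred_box"], ex["gold_box"]) >= iou_threshold
        answer_hits += answer_ok
        joint_hits += answer_ok and cite_ok
    return answer_hits / len(examples), joint_hits / len(examples)

# Hypothetical predictions: right answer with right box, right answer with
# wrong box, and wrong answer with right box.
examples = [
    {"pred_answer": "42 USD", "gold_answer": "42 usd",
     "pred_box": (12, 11, 50, 30), "gold_box": (10, 10, 50, 30)},
    {"pred_answer": "42 USD", "gold_answer": "42 USD",
     "pred_box": (100, 100, 140, 120), "gold_box": (10, 10, 50, 30)},
    {"pred_answer": "17", "gold_answer": "42 USD",
     "pred_box": (10, 10, 50, 30), "gold_box": (10, 10, 50, 30)},
]
answer_acc, joint_acc = score(examples)
print(answer_acc, joint_acc)
```

The second record is the failure mode the benchmark targets: it counts toward answer accuracy but not toward the joint score.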

doc-vqa evidence-attribution evaluation

#27

arXiv — PhysBrain 1.0 technical report: converting egocentric video to physical-commonsense supervision for VLAs

Robotic Autonomy 2026-05-18 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.1 6.5/6.0/5.8

PhysBrain 1.0 proposes a data engine that converts large-scale human egocentric video into structured Q-A physical-commonsense supervision, then transfers those priors to VLA policies via a capability-preserving adaptation stage. The team reports gains on cross-embodiment manipulation benchmarks where robot-trajectory-only training plateaus, arguing that the human-video distribution is a complement to robot data rather than a substitute.

vla egocentric physical-commonsense

#28

arXiv — Code as Agent Harness

Agents & Tool Use 2026-05-18 AK (@_akhaliq) Daily Papers, arXiv — Agents / Tool Use, arXiv — AI for Science, arXiv cs.AI, Hugging Face Daily Papers 6.0 6.2/6.0/5.8

The paper argues that "tool use" framed as JSON-structured function calling is strictly weaker than "code as harness" — letting the agent emit Python that orchestrates tools, control flow, and intermediate state. They benchmark on scientific-discovery and software-engineering tasks where multi-step coordination dominates, and report agents written as code-emitters outperform JSON-tool-call agents by 7–14 points on success rate at matched inference cost. Aligns with the broader Claude Code / Aider / Devin trajectory.
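The contrast between the two interfaces can be sketched in a few lines. This is an illustrative toy rather than the paper's harness: the `search` and `summarize` tools, the message format, and the `result` variable convention are all hypothetical, and a real system would sandbox the emitted code instead of calling `exec` directly.

```python
import json

# Hypothetical tools standing in for whatever a harness exposes.
def search(query):
    return [f"{query}-result-{i}" for i in range(3)]

def summarize(text):
    return text.upper()

TOOLS = {"search": search, "summarize": summarize}

def run_json_call(message):
    """JSON-style tool use: one call per model turn, result returned to the model."""
    call = json.loads(message)
    return TOOLS[call["name"]](**call["arguments"])

def run_code_action(source):
    """Code as harness: the model emits a program that composes the tools itself."""
    namespace = dict(TOOLS)      # expose only the tools to the emitted code
    exec(source, namespace)      # unsandboxed here; an assumption for brevity
    return namespace["result"]

# JSON route: the loop and the intermediate list live outside the model's output,
# so each call below would be a separate model turn (four in total).
hits = run_json_call('{"name": "search", "arguments": {"query": "muon"}}')
json_result = [
    run_json_call(json.dumps({"name": "summarize", "arguments": {"text": h}}))
    for h in hits
]

# Code route: one emitted program carries the control flow and the state.
emitted = "result = [summarize(h) for h in search('muon')]"
code_result = run_code_action(emitted)
print(json_result == code_result)
```

Both routes produce the same list here. The difference is where the orchestration happens: across several model turns in the JSON route, and inside a single emitted program in the code route.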

agents code-as-action tool-use

#29

arXiv — AI for Auto-Research: roadmap and user guide

AI for Science 2026-05-18 AK (@_akhaliq) Daily Papers, arXiv cs.AI, arXiv — Evals & Benchmarks, Hugging Face Daily Papers 6.0 6.0/6.5/5.5

A position-paper-plus-survey on auto-research agents — agents that propose hypotheses, design experiments, write code, run them, and iterate. Maps the current landscape (Sakana's AI Scientist, Prime Intellect's auto-research benchmark, Google's AI co-scientist, FunSearch-class systems) onto a six-stage maturity ladder. Argues current systems clear the rung-1 "execute a written plan" bar but stall at rung-3 "propose a non-obvious hypothesis worth running." Useful as a synthesis even if the framework is not load-bearing.

auto-research agents ai-scientist

#30

arXiv — Probing for Representation Manifolds in Superposition

Interpretability 2026-05-18 arXiv cs.LG, arXiv — Mechanistic Interpretability 6.0 6.5/6.5/5.0

The paper builds linear probes targeted at extracting manifold structure (not just point features) from polysemantic neurons under superposition, extending the Toy Models of Superposition framework into the regime where features form continuous low-dimensional manifolds rather than discrete directions. Validates on EleutherAI's Pythia-class checkpoints, showing geometric structure (loops, lines, simplex faces) recoverable from features that single-direction probes misidentify. Relevant for SAE methodology: it suggests that the standard L1-on-codes recipe systematically under-recovers manifold features.

superposition saes probing manifolds

#31

arXiv — General Preference Reinforcement Learning

Post-Training 2026-05-18 arXiv cs.CL, arXiv cs.LG, arXiv — Post-training / Alignment 6.0 6.0/6.0/6.0

Generalizes DPO/IPO to arbitrary preference families parameterized by a learned scoring function, rather than the Bradley-Terry log-likelihood baked into DPO. Reports gains on RewardBench, mixed-objective alignment, and pluralistic preference modeling tasks where the Bradley-Terry (BT) assumption is known to be poorly calibrated.
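
For readers who want the baseline objectives in front of them, the sketch below writes out the standard per-pair DPO and IPO losses on the policy-versus-reference log-ratio margin. The last function, `general_preference_loss`, is a hypothetical illustration of what swapping out the Bradley-Terry link could look like; it is an assumption made for this digest, not the paper's formulation.

```python
import math

def implicit_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Log-ratio margin between the chosen (w) and rejected (l) responses."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)

def dpo_loss(margin, beta=0.1):
    """DPO: Bradley-Terry negative log-likelihood, -log sigmoid(beta * margin)."""
    return math.log1p(math.exp(-beta * margin))

def ipo_loss(margin, tau=0.1):
    """IPO: squared regression of the margin onto the target 1 / (2 * tau)."""
    return (margin - 1.0 / (2.0 * tau)) ** 2

def general_preference_loss(margin, link):
    """Hypothetical generalization (assumption): `link` is any callable that
    maps a margin to P(chosen beats rejected), standing in for a learned scorer."""
    return -math.log(link(margin))

def bt_link(margin, beta=0.1):
    """The logistic link that DPO hard-codes; passing it recovers dpo_loss."""
    return 1.0 / (1.0 + math.exp(-beta * margin))

m = implicit_margin(logp_w=-4.0, logp_l=-6.0, ref_logp_w=-5.0, ref_logp_l=-5.5)
```

With `bt_link` plugged in, `general_preference_loss(m, bt_link)` matches `dpo_loss(m)`, which shows DPO as one point in a larger family of link functions.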

dpo preference-learning rlhf

#32

arXiv — MMSkills: multimodal skills for general visual agents

Multimodal 2026-05-18 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.0 6.0/5.8/6.2

The authors argue that "skill packages" for visual agents need to encode multimodal procedural knowledge (recognizing state, interpreting visual progress, deciding the next action) rather than text-only prompts or code. They formalize what a multimodal skill package is and benchmark visual-agent task success after equipping models with MMSkills versus text-only skill libraries, reporting substantial gains on the agent-arena visual subset.

visual-agents skill-libraries mllm

#33

arXiv — EnvFactory: scaling tool-use agents via executable environment synthesis and robust RL

Agents & Tool Use 2026-05-18 arXiv — Agents / Tool Use, arXiv cs.CL, arXiv cs.LG 5.9 6.0/6.0/5.8

EnvFactory automatically synthesizes executable tool-use environments from natural-language task specifications, then trains agents inside them with a robust-RL objective that adds adversarial environment perturbations during rollouts. The resulting policies show better out-of-distribution generalization than agents trained on fixed environments, with improvements concentrated on tasks involving multi-API state.
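
As a rough intuition for why perturbed rollouts favor state-tracking policies, here is a toy sketch. Everything in it is an assumption for illustration (the two-call "reserve then pay" task, the single dropped-call perturbation, and both policies); it is not EnvFactory's environment format or training objective.

```python
def rollout(policy, fail_at=None, horizon=4):
    """Toy two-API episode: 'reserve' must succeed before 'pay' earns reward."""
    state = {"reserved": False, "paid": False}
    for t in range(horizon):
        action = policy(state, t)
        if t == fail_at:
            continue  # perturbation: this step's API call is silently dropped
        if action == "reserve":
            state["reserved"] = True
        elif action == "pay" and state["reserved"]:
            state["paid"] = True
    return 1.0 if state["paid"] else 0.0

def worst_case_return(policy, horizon=4):
    """Inner minimization of a robust objective: the adversary picks the one
    step to drop (or none) that hurts the policy most."""
    candidates = [None] + list(range(horizon))
    return min(rollout(policy, f, horizon) for f in candidates)

def scripted_policy(state, t):
    """Fixed action sequence that ignores the observed state."""
    return ["reserve", "pay", "noop", "noop"][t]

def state_aware_policy(state, t):
    """Re-issues whichever call has not yet taken effect."""
    if not state["reserved"]:
        return "reserve"
    if not state["paid"]:
        return "pay"
    return "noop"

nominal = rollout(scripted_policy)
scripted_worst = worst_case_return(scripted_policy)
aware_worst = worst_case_return(state_aware_policy)
```

Both policies succeed in the unperturbed environment, but only the state-aware one keeps its return under the worst single dropped call, which is the gap a worst-case training signal would push a learner to close.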

tool-use rl environment-synthesis

#34

arXiv — Qumus: realization of an embodied AI quantum-material experimentalist

AI for Science 2026-05-18 arXiv — Agents / Tool Use, arXiv — AI for Science, arXiv cs.RO, arXiv — Robotic Autonomy / Embodied AI 5.9 6.5/6.0/5.2

Qumus is a wet-lab autonomous experimentalist for quantum-material characterization (Raman, transport, magnetometry), integrating a planning agent, a vision-language perception stack, and a robotic manipulation layer. The paper reports an autonomous campaign that mapped a portion of the phase diagram of a 2D vdW material, with experiment cycles dropping from human-paced days to autonomous hours.

self-driving-lab quantum-materials embodied

#35

TechCrunch — Amazon's new Alexa+ feature can generate podcast episodes on demand

Audio & Speech 2026-05-18 TechCrunch — AI 5.7 5.5/5.5/6.2

Amazon is rolling out an Alexa+ feature that generates conversational podcast episodes from arbitrary topic prompts, leaning on the same TTS+LLM stack as NotebookLM's Audio Overviews but with on-device streaming and Echo Show integration. The feature is positioned as a daily-briefing competitor to NotebookLM and to ElevenLabs's Discover Daily, and it is the first generative-audio user feature shipped under the Alexa+ banner since the Anthropic-powered Alexa+ relaunch.

alexa tts generated-podcasts amazon