Wolf Digest — 2026-05-18

#1

Open-model bonanza: Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1 land in one month — CAISI says the US-China gap is widening

Frontier LLMs 2026-05-16 Interconnects (Nathan Lambert) 8.6 8.5/8.7/8.5

Nathan Lambert's twenty-first Open Artifacts roundup compresses one of the densest months of open-weights releases on record into a single readthrough. Google shipped Gemma 4 in 1B / 4B / 12B / 27B sizes plus a 35B-A3B mixture-of-experts variant; DeepSeek released V4 in Flash and Pro tiers with both reasoning and non-reasoning modes (the previously dominant V3.2 has been retired from the API); Moonshot's Kimi K2.6 reasserts itself near the top of open Intelligence Index leaderboards; Xiaomi's MiMo 2.5 Pro lands its first credible benchmark showing; Zhipu's GLM-5.1 makes a quiet jump on agentic coding; Alibaba's Qwen3.6 Max Preview now sits in the same tier as proprietary frontier; and IBM, Mistral, and BigCode shipped smaller updates Lambert treats as housekeeping. The post's longest section is on the new Center for AI Standards and Innovation evaluation of DeepSeek V4 Pro. CAISI used nine benchmarks across cyber, biosecurity, chem, agentic coding, and reasoning, calibrated via Item Response Theory to produce one Elo rating per model. Their headline claim: the aggregate capability gap between the strongest US closed-frontier models (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro) and the strongest publicly released PRC models (DeepSeek V4 Pro, Kimi K2.6, MiMo 2.5 Pro) is now wider than at the V3 evaluation a year ago, despite each individual PRC release closing some axes. Lambert reads this as evidence that the closed labs are pulling away on agentic and long-horizon tasks specifically, while open weights remain competitive on contained reasoning benchmarks. Lambert pushes back, mildly, on CAISI's framing: the Elo aggregation hides that on several domains (long-context reasoning, coding, multilingual) the PRC frontier is within noise of the US frontier, and the gap is concentrated in the small set of evaluations CAISI weights most heavily, several of which are CAISI's own AA-Omniscience-style knowledge probes. 
He also notes that the Item Response Theory aggregation is the right call methodologically but visually compresses real differences. The piece closes with Lambert's own ranking: he places DeepSeek V4 Pro above Kimi K2.6 above MiMo 2.5 above GLM-5.1 above Gemma 4 27B for general use, with Kimi staying his pick for code and DeepSeek for long-context. The throughline of the post is that the open-weight ecosystem is now releasing at proprietary-lab cadence — five frontier-tier open models in one month — but that CAISI's data, however squinted at, shows the US closed-source labs widening their lead on the evaluations that map most directly to real economic deployment.
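
To make the "Item Response Theory to Elo" step concrete, here is a minimal sketch. It assumes a one-parameter (Rasch) model with known item difficulties; the digest does not say which IRT variant, anchor, or scale CAISI uses. The item difficulties, solved counts, and the 1000-point anchor are hypothetical. It shows how one ability estimate per model can be mapped onto an Elo-style scale.

```python
import math

# Elo uses base-10 odds on a 400-point scale, so one logit is 400/ln(10) points.
ELO_PER_LOGIT = 400 / math.log(10)

def p_correct(theta, difficulty):
    """Rasch (1PL) probability that a model of ability theta solves an item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def fit_ability(difficulties, n_correct, lo=-10.0, hi=10.0, iters=60):
    """Maximum-likelihood ability for known difficulties.

    For the Rasch model the MLE solves sum(p_correct) == n_correct,
    which is monotone in theta, so bisection is enough.
    n_correct must lie strictly between 0 and len(difficulties).
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        expected = sum(p_correct(mid, b) for b in difficulties)
        if expected < n_correct:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def ability_to_elo(theta, anchor=1000.0):
    """Map a logit-scale ability onto an Elo-style rating (anchor is arbitrary)."""
    return anchor + ELO_PER_LOGIT * theta

# Hypothetical item difficulties (in logits) and solved counts, not CAISI data.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
model_a = ability_to_elo(fit_ability(difficulties, n_correct=4))
model_b = ability_to_elo(fit_ability(difficulties, n_correct=3))
print(round(model_a), round(model_b))
```

A single number per model is convenient for a headline, but it cannot show which benchmarks drive the gap. That is the substance of Lambert's objection to the aggregate.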

open-weights CAISI DeepSeek-V4 Kimi-K2.6 Gemma-4

#2

Recent LLM architecture moves: KV sharing in Gemma 4, layer-wise attention budgets in Laguna XS.2, compressed convolutional attention in ZAYA1, mHC in DeepSeek V4

Research 2026-05-16 Ahead of AI (Sebastian Raschka) 8.0 8.5/8.2/7.3

Sebastian Raschka's monthly architecture readthrough takes stock of how the April–May open-weight releases — Gemma 4, Laguna XS.2, ZAYA1-8B, DeepSeek V4 — have converged on a small number of long-context efficiency primitives. The framing is that reasoning models and agent workflows now keep tokens around for many turns, so KV-cache size, memory traffic, and attention compute are the binding constraints rather than parameter count or FLOPs at inference. Raschka walks through four architectural levers labs are now using to attack those constraints. Gemma 4 introduces KV sharing across consecutive transformer blocks: rather than each layer storing its own keys and values, every other layer reuses the previous layer's KV projections, cutting cache footprint nearly in half with what Google reports as roughly 0.3-point degradation on its internal eval suite. Gemma 4 also adds per-layer embeddings, a small change that gives each block a learned residual offset and recovers about half of the KV-sharing degradation. Laguna XS.2 takes a different cut: it sets a global token budget for attention and then has each layer choose how much of that budget to spend, with low-level layers attending to short local windows and middle layers attending more globally; layers compete for the budget at training time via a learned gating signal. ZAYA1-8B keeps full per-layer attention but compresses the KV projections through a small convolutional bottleneck before caching, trading a fixed multiply-add per token for substantially smaller cache. DeepSeek V4 combines mHC — multi-head compression, where multiple attention heads share compressed key projections via a low-rank factorization — with a compressed-attention path that downsamples distant tokens into a coarse summary before they're attended to. 
Raschka draws diagrams for each, points at the recurring trade-off (cache savings cost a small amount of long-context accuracy, recoverable through some combination of architectural compensation and post-training), and notes that the field has effectively settled on "more compute per token, less memory per token" as the new optimization target. He flags one open question for the next few months: whether these efficiency tricks compose. Stacking KV sharing, mHC, and compressed attention naively in the same model has not yet been published as working; the labs are picking one or two and stopping there. The piece sits next to Raschka's own architecture gallery, which has now grown to over thirty annotated diagrams from open releases over the past eighteen months.
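
The "nearly in half" claim for KV sharing follows from simple cache arithmetic. The sketch below assumes a standard per-layer KV cache of keys plus values per KV head. The layer count, head count, head dimension, and context length are hypothetical and are not Gemma 4's published configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, share_every=1):
    """KV-cache size in bytes for one sequence.

    share_every=1 means every layer stores its own keys and values.
    share_every=2 means each pair of consecutive layers stores one set.
    """
    layers_storing_kv = -(-n_layers // share_every)  # ceiling division
    per_layer = 2 * n_kv_heads * head_dim * seq_len * bytes_per_value  # K and V
    return layers_storing_kv * per_layer

# Hypothetical configuration: 48 layers, 8 KV heads, head_dim 128, 128K context, 16-bit values.
baseline = kv_cache_bytes(48, 8, 128, 131072)
shared = kv_cache_bytes(48, 8, 128, 131072, share_every=2)
print(baseline / 2**30, shared / 2**30)  # sizes in GiB
```

With an odd layer count the saving falls slightly short of one half, because the last unpaired layer still stores its own keys and values.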

LLM-architecture KV-cache long-context MoE attention

#3

Dwarkesh: RLVR might be disproportionately bad at science — the verification loop for theories runs decades, not minutes

Research 2026-05-16 Dwarkesh Patel Podcast 7.7 7.8/8.2/7.0

Dwarkesh Patel writes up one of the threads he explored with Michael Nielsen on a recent episode: the case for thinking AI will be disproportionately strong at scientific discovery rests on a sleight of hand about the word "verifiable." The standard story (coding and math are exploding because reinforcement learning from verifiable rewards has a tight feedback loop, and science is also verifiable, so science should explode too) falls apart when you look at how scientific theories have actually been verified historically. Aristarchus proposed heliocentrism in the third century BC; stellar parallax, the cleanest test, wasn't measured until Bessel in 1838. Copernicus's 1543 model was actually less accurate than the Ptolemaic system, with its more than a millennium of accumulated epicycles, because Copernicus rejected the equant trick on Platonic grounds and had to add more circles to compensate. The Brahe model (sun orbiting earth, planets orbiting sun) predicted the phases of Venus and produced retrograde motion as a natural consequence, so a naive falsificationist couldn't have ruled it out either; it took roughly three centuries of accumulated indirect evidence to dethrone it. The symmetric failure case Patel emphasizes is the prediction of Vulcan. Le Verrier predicted Neptune from Uranus's orbital anomalies and was right; the same method, applied to Mercury's anomalous precession, predicted a planet inside its orbit (Vulcan) that doesn't exist. The actual resolution of Mercury's precession required Einstein's general relativity in 1915, an entirely new theoretical framework rather than another perturbing body. Patel's claim is that the heuristics scientists use to make progress faster than naive falsification (Copernicus's appeal to parsimony, Einstein's appeal to covariance, the willingness to let empirical disconfirmation take a back seat to theoretical elegance) are neither articulated explicitly enough to put in an RL loop nor naturally rewarded by the loop's structure. 
RLVR optimizes for verifiable answers on short horizons; the scientific judgment that lets a community accept Copernicus in 1610 without parallax measurements is the opposite kind of cognition. The implication, if Patel is right, is that the AI capability arc looks very different across domains: code, math, and bounded engineering will continue compounding, but the kind of theoretical-scientific judgment that produced general relativity may compound much more slowly under the current RL-on-verifiable-rewards paradigm, regardless of model scale. He flags this as one of several threads from the Nielsen interview that deserves its own writeup.

RLVR epistemology science post-training

#4

Dwarkesh: why pretraining runs fail — token-vs-expert routing breaks causality, FP16 collectives bias gradients, Llama 4 and Gemini 2 Pro fingered for both

Infrastructure 2026-05-16 Dwarkesh Patel Podcast 7.6 7.3/8.0/7.5

Dwarkesh Patel publishes a notebook from a recent conversation with a pretraining engineer, organized around two failure modes that he argues account for most of the visible "underwhelming launch" stories of the past two years: causality breaks and bias accumulation. The expert-routing discussion is the meatiest section. In token-choice routing, you read the router scores from each token's perspective and send each token to its top-k experts, which is causal but can leave experts with wildly unbalanced loads. In expert-choice routing, you let each expert pick the top-k tokens it wants, which load-balances cleanly but breaks causality: which expert token n gets sent to now depends on the router scores of later tokens (n+1 and beyond), information the model would never see at inference time. The conversation flags this as the rumored explanation for Llama 4's underwhelming benchmarks: Meta's MoE was reportedly trained with expert-choice, then served with token-choice, and the train-serve gap showed up as systematic capability loss. Token dropping is the other causality-breaking pattern: an expert at capacity drops some of its assigned tokens, and which tokens get dropped depends on global batch composition rather than the causal stream. Patel attributes a similar problem to Gemini 2 Pro. The bias section centers on the original GPT-4 training stall, which Patel describes as an FP16-on-collectives bug: FP16 spaces its representable values logarithmically, and with an 11-bit significand the gap between adjacent values reaches 2 at 2048, meaning a sum like 1 + 1 + 1 ... done in FP16 stops accumulating at 2048 because each subsequent +1 rounds back to the existing value. Run that through an all-reduce and gradients get systematically biased toward zero, with the bug invisible per step but compounding over weeks. 
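
The FP16 accumulation stall is easy to reproduce without a GPU. The sketch below assumes IEEE 754 binary16 with round-to-nearest-even, emulated through Python's `struct` format code `e`; the helper names are illustrative. Under that assumption the running sum of ones stops growing at 2048, where the spacing between adjacent representable values becomes 2.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_running_sum(n_terms, term=1.0):
    """Accumulate `term` n_terms times, rounding to FP16 after every add.

    The add itself is exact in double precision for these small integers,
    so rounding once afterwards matches a correctly rounded FP16 add.
    """
    total = 0.0
    for _ in range(n_terms):
        total = to_fp16(total + term)
    return total

exact = fp16_running_sum(2000)     # still below the stall point
stalled = fp16_running_sum(4096)   # true sum is 4096
print(exact, stalled)
```

Here the 4096-term sum comes out as 2048.0, half the true value. In a reduction over many small gradient contributions, the same rounding silently discards the later addends, which is the bias toward zero described above.
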
The post closes with a meta-question Patel found striking: are there N discrete failure modes (numerics, MoE routing, KV-cache aliasing, optimizer instability, dataloader race conditions) that labs eventually patch and never see again, or does each new scale produce a new bespoke failure mode? His interlocutor's answer was that it has been the second pattern so far — each generation surfaces new pathologies that nobody anticipated — which is a quietly bearish read on the standard scaling story. Patel also published a flashcard set alongside the notes.
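
The causality break in expert-choice routing described in this item can be shown with a toy router. This is a minimal sketch with made-up scores for three tokens and two experts, not any lab's implementation. It changes only the last token's router scores and checks whether the first token's expert assignment moves.

```python
def token_choice(scores, k):
    """Each token picks its top-k experts; uses only that token's own row."""
    return [sorted(range(len(row)), key=lambda e: -row[e])[:k] for row in scores]

def expert_choice(scores, capacity):
    """Each expert picks its top-`capacity` tokens; uses every token's scores."""
    n_tokens, n_experts = len(scores), len(scores[0])
    assigned = [[] for _ in range(n_tokens)]
    for e in range(n_experts):
        chosen = sorted(range(n_tokens), key=lambda t: -scores[t][e])[:capacity]
        for t in chosen:
            assigned[t].append(e)
    return assigned

# Rows are tokens in sequence order, columns are experts (hypothetical scores).
seq_a = [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]]
seq_b = [[0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]  # only the LAST token differs

tc_a, tc_b = token_choice(seq_a, k=1), token_choice(seq_b, k=1)
ec_a, ec_b = expert_choice(seq_a, capacity=2), expert_choice(seq_b, capacity=2)

print("token-choice, token 0:", tc_a[0], tc_b[0])    # unchanged
print("expert-choice, token 0:", ec_a[0], ec_b[0])   # changes with a future token
```

Under token-choice the first token's routing is a function of its own scores only. Under expert-choice it loses expert 0 once a later token outbids it, which cannot be reproduced token by token at inference time.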

pretraining MoE FP16 training-instability Llama-4

#5

arXiv will ban authors for a year if hallucinated references show their submission was AI-generated

Research 2026-05-16 TechCrunch — AI 7.5 7.0/8.2/7.3

arXiv, the dominant preprint server for ML, physics, and math, has tightened its moderation policy in response to a surge of AI-generated submissions: authors whose papers contain hallucinated citations (references to nonexistent papers, fabricated DOIs, citations to real papers that don't say what's claimed) will now face a one-year submission ban from the platform. Detection is being automated via a moderator-facing tool that cross-checks every reference in a submission against the Semantic Scholar and OpenAlex graphs; a configurable threshold of unresolvable or miscited references triggers manual review. The change responds to a measured uptick in submissions where the moderation team can show the bibliography itself was generated by an LLM rather than retrieved by the author. Critically, the policy targets the act of submitting unverified LLM output, not the use of LLMs in research; authors who use models to draft, edit, or organize their work are explicitly fine so long as the citations resolve. arXiv has been tightening policy in this direction for two years, first requiring LLM disclosure in 2024, then expanding moderator headcount in 2025, but the year-long ban is the first hard sanction with teeth, and it lands during the period when arXiv has been overtly worried that LLM-spam threatens the legitimacy of the cs.AI section in particular. The move is also the first time a major preprint server has tied an authorship ban to a content-quality signal rather than a credentialing question, which other moderation-thin venues (bioRxiv, SSRN) will be watching. The Semafor Tech briefing also covers it as one of the week's top items.
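
The threshold step can be pictured as a small function over per-reference check results. This is a hypothetical sketch: the status labels, the 20% default, and the function name are assumptions. It does not call Semantic Scholar or OpenAlex and does not describe arXiv's actual tool.

```python
# Hypothetical labels a reference checker might assign to each citation.
BAD_STATUSES = {"unresolved", "miscited"}

def needs_manual_review(reference_statuses, max_bad_fraction=0.2):
    """Return True when the share of bad references exceeds the threshold."""
    if not reference_statuses:
        return False
    bad = sum(1 for status in reference_statuses if status in BAD_STATUSES)
    return bad / len(reference_statuses) > max_bad_fraction

clean = ["ok"] * 19 + ["unresolved"]                       # 5% bad
suspect = ["ok"] * 6 + ["unresolved"] * 3 + ["miscited"]   # 40% bad
print(needs_manual_review(clean), needs_manual_review(suspect))
```

Routing flagged submissions to a human rather than banning automatically matches the summary's description of a threshold that triggers manual review.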

arXiv research-integrity moderation

#6

Anthropic and OpenAI now generate 89% of all AI-startup revenue — $80B annualized across 34 leading startups

Industry 2026-05-17 The Information — AI 7.3 7.0/7.8/7.0

The Information's Generative AI Database puts the top 34 AI startups at $80B annualized revenue ($6.6B/month), with Anthropic and OpenAI alone accounting for roughly 89% of that, up from 84% six months ago, and on a base that's grown 112% in the same window. The two-firm concentration is now structurally higher than at any point since the post-ChatGPT explosion in 2023; vertical-application startups and second-tier model labs (Cohere, AI21, Mistral) are losing share to the two leaders even as the total pie grows. The data lines up with the Artificial Analysis Coding Agent Index released the same week, which had Cursor CLI + Opus 4.7 and Codex + GPT-5.5 leading every coding-agent benchmark by a margin: the model layer is concentrating, and the agent layer is concentrating on top of it.

industry-concentration Anthropic OpenAI revenue

#7

Dwarkesh: the mistake of conflating intelligence and power

Safety, Policy & Regulation 2026-05-16 Dwarkesh Patel Podcast 7.2 7.0/7.8/6.8

Short essay arguing that the standard ASI threat model overstates the link between abstract intelligence (puzzle-solving, shape-rotation) and political power. Patel notes that the most powerful humans (Trump, Xi, and Putin today, Stalin historically) are not the most cognitively capable; the correlation between extreme power and extreme abstract intelligence is weaker than the correlation between extreme power and height. Power, he argues, is mostly the product of trust-based coordination over many actors, not galaxy-brain individual optimization, and current AI training (RL on bounded economically-valuable tasks) is not particularly correlated with that. Implication: the right mental model for transformative AI is many automated firms outcompeting incumbents in normal capitalist ways, not one AI outthinking everyone. The piece is short, opinionated, and explicitly framed as a counterpoint to the dominant safety-community framing.

AI-safety ASI political-economy

#8

BlackRock weighs $5-10B order in SpaceX IPO; offering could raise up to $75B

Industry 2026-05-16 The Information — AI 7.0 6.5/7.5/7.0

BlackRock has discussed a $5-10B investment in SpaceX's planned IPO next month, which is targeting up to $75B raised and would be the largest IPO on record. The investment would represent a vote of confidence from the world's largest asset manager in the company's Starship-Starlink-defense triple stack. The Information notes the close relationship between Musk and BlackRock CEO Larry Fink, both of whom were on Trump's recent China trip; the offering is also reportedly structured to give investors near-zero ability to challenge management decisions, which institutional buyers normally object to. Tangential to AI in the strict sense but central to the infrastructure-capex narrative: SpaceX has emerged as one of the heaviest non-hyperscaler buyers of AI compute (per Anthropic's recently announced compute deal with SpaceX, covered 5/6).

SpaceX IPO BlackRock

#9

Lawfare: 'The Limits of Naval Technology Alone' — autonomy and drones don't substitute for ships, doctrine, and crews

Government & Defense 2026-05-17 Lawfare (via Google News) 6.8 6.5/7.2/6.7

Lawfare essay arguing the dominant US naval debate — should we keep building large surface combatants, or pivot to autonomous unmanned platforms — wrongly poses the question as either-or. The author's claim is that autonomous platforms cannot substitute for hulls, crews, and the institutional knowledge baked into surface fleets; the right framing treats unmanned systems as additive payload and ISR capacity layered on top of a manned fleet, not as a force-structure replacement. The piece is part of an ongoing Lawfare series on autonomy in maritime warfare and explicitly responds to the Pentagon-side argument that the Low-Cost Containerized Missiles program (covered in our 5/16 digest) and similar unmanned-mass procurement can substitute for ship count. The full text is behind Google News's JS-app shell; this summary is from the syndicated feed metadata.

naval autonomy defense-policy

#10

Big tech week ahead: SpaceX Starship test, Google I/O, Nvidia Q1 earnings, Meta 8,000 layoffs

Industry 2026-05-17 The Information — AI 6.8 6.5/6.7/7.2

The Information's weekend agenda-setter previews the week of 5/18: SpaceX's next Starship test launch, Google I/O (the company's annual developer conference, expected to be Gemini-heavy), Nvidia's fiscal Q1 earnings (consensus has datacenter revenue continuing the trajectory from Q4), and Meta's announced 8,000-employee layoff round. All four are central to the AI infrastructure and frontier-product news cycle this week; the Nvidia print in particular will set the immediate tone for the AI-capex narrative the macro press has been running.

earnings Google-IO SpaceX Meta-layoffs

#11

Gates Foundation Trust has sold all its Microsoft stock — disclosed Friday after Q1 sale of ~$3B

Industry 2026-05-17 The Information — AI 6.7 6.5/6.8/6.8

The Gates Foundation Trust has fully exited its Microsoft position, disclosing the sale of the remaining ~$3 billion in stock during Q1 2026. Notable both because Bill Gates is the sole trustee and the Foundation has historically held large Microsoft stakes since spinoff, and because of the structural signaling — the largest single philanthropic vehicle in the US is no longer financially tied to its founding company's AI strategy. The sale follows the recent announcement of Anthropic's $200M partnership with the Gates Foundation (covered in our 5/16 digest), making the divestment look like a deliberate untangling.

Microsoft Gates-Foundation Anthropic

#12

Greg Brockman returns to product strategy at OpenAI as ChatGPT and Codex are reportedly being merged

Industry 2026-05-16 TechCrunch — AI 6.6 6.3/7.0/6.5

OpenAI co-founder Greg Brockman has reportedly taken over OpenAI's product strategy, with the company planning to merge ChatGPT and Codex into a single integrated product. The org change comes as Codex is being newly untethered from the laptop (Semafor's reporting: Codex now accepts messages via the ChatGPT app), and as OpenAI is preparing the recently disclosed agentic-coding push to compete head-on with Cursor and Claude Code. Brockman last had a hands-on product role pre-2024.

OpenAI Brockman Codex ChatGPT

#13

MMSkills: Towards Multimodal Skills for General Visual Agents

Agents & Tool Use 2026-05-14 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.5 7.2/6.8/5.5

Proposes a skill-package representation for visual agents that bundles a learned visual recognizer, a textual procedure, and an execution policy together. Trained agents reuse skills across tasks by retrieving on visual state, not just task description; benchmarks show ~9pt improvement over text-only skill libraries on the MMSkills benchmark spanning web and desktop GUI tasks.

multimodal agents

#14

DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

Robotic Autonomy 2026-05-15 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.4 7.2/6.5/5.5

New MuJoCo benchmark for dexterous-hand manipulation with 24 task families designed to exercise capabilities that parallel grippers cannot — in-hand reorientation, multi-finger force control, tool-use. Ships with a baseline policy-learning stack and a standardized evaluation pipeline for cross-paper comparison. Authors explicitly position it against existing parallel-gripper benchmarks like Meta-World.

robotics benchmark

#15

Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

Post-Training 2026-05-15 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.4 7.2/6.5/5.5

Tackles RLVR's exploration ceiling: policies only improve on trajectories they've already sampled, and increasing rollouts is computationally brutal. Proposes nudging the policy with a small set of teacher-style hints injected at decision-critical tokens, expanding the explored region without full prompt rewriting. Reports ~3-5pt gains on reasoning benchmarks at constant compute versus standard GRPO.
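
One way to see the exploration ceiling is to look at the group-relative advantage that GRPO-style training assigns to each rollout. The sketch below assumes the commonly described form (reward minus group mean, divided by group standard deviation) with binary rewards. It illustrates the baseline's limitation and is not the paper's method.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Eight rollouts for one prompt, binary verifiable reward (hypothetical outcomes).
all_failed = [0.0] * 8
one_success = [1.0] + [0.0] * 7

print(group_relative_advantages(all_failed))   # every advantage is zero
print(group_relative_advantages(one_success))  # success pushed up, failures pushed down
```

When no sampled rollout succeeds, every advantage is zero and the prompt contributes no gradient. Hints that move the policy toward regions where at least one rollout succeeds are one way to restore a learning signal.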

RLVR post-training

#16

Apple's revamped Siri will reportedly auto-delete conversations as a privacy default

Industry 2026-05-17 TechCrunch — AI 6.4 6.2/6.5/6.5

TechCrunch reports Apple will position the upcoming Siri revamp (expected to debut at WWDC in June) around a privacy-by-default model that auto-deletes conversations after a fixed window. The framing distinguishes Apple's positioning from OpenAI/Anthropic/Google, all of which default to indefinite retention with user-toggled deletion. The piece is sourced; the technical mechanism (on-device vs. cloud auto-purge, retention window length, exemptions for context continuity) isn't specified.

Apple Siri privacy

#17

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Generative Media 2026-05-15 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.3 7.2/6.2/5.5

Interactive garment-swap video model that runs at near-real-time latencies on a single garment reference, targeting e-commerce. Decouples appearance and motion via a paired temporal-spatial adapter and avoids the slow inversion step earlier garment-transfer pipelines depended on. Inference reportedly under 200ms per frame at 512px on an H100.

video diffusion

#18

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Generative Media 2026-05-14 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.3 7.2/6.2/5.5

Autoregressive image tokenizers fail on rendered text and faces because aggressive downsampling discards fine glyph and feature structure. InsightTok adds a perceptually-weighted reconstruction loss biased toward text-and-face regions detected by a small classifier; reports +12pt OCR-readability on TextATLAS and improved CelebA face identity preservation at no throughput cost.

tokenization image

#19

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

Post-Training 2026-05-15 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.2 7.2/5.9/5.5

GRPO for video diffusion is bottlenecked by per-step reward evaluation across hundreds of denoising steps. Flash-GRPO collapses the alignment objective to a single denoising step using a learned shortcut, cutting per-experiment compute from ~hundreds of GPU-days to ~tens on 14B video models with comparable preference-alignment scores.

video RLHF

#20

Solvita: Enhancing LLMs for Competitive Programming via Agentic Evolution

AI Coding 2026-05-14 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.2 7.2/5.9/5.5

Multi-agent system for hard competitive-programming problems that maintains a persistent debugging memory across attempts, rather than the stateless retrieval most prior agent stacks use. Reports gains on Codeforces div-1 problems where prior agentic systems plateaued, with the memory layer credited for catching repeated failure modes.

coding-agents LLM

#21

Unlocking Dense Metric Depth Estimation in VLMs

Multimodal 2026-05-15 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.2 7.2/5.9/5.5

Adds metric-depth supervision to VLM training using a synthetic geometry pipeline rather than distilling from external depth models. The resulting VLM produces dense per-pixel metric depth competitive with specialist depth networks while preserving the text-grounding capabilities of the base VLM.

VLM depth 3D

#22

OpenAI quietly acquired AI voice-cloning startup Weights.GG in January

Industry 2026-05-16 The Information — AI 6.2 6.0/6.0/6.5

OpenAI acquired Weights.GG — a small voice-cloning startup whose product Replay is built on open-source audio models — in January 2026. Six employees joined OpenAI; OpenAI bought the IP but does not plan to integrate Replay as a product. The acquisition reads as an acqui-hire for audio-engineering talent, fits OpenAI's broader voice push (Advanced Voice Mode iterations, the Sora-Audio integrations from late 2025), and lands in a market where ElevenLabs just crossed $500M ARR (5/5).

OpenAI voice M&A

#23

ReactiveGWM: Steering NPC in Reactive Game World Models

Generative Media 2026-05-14 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 7.2/5.6/5.5

Game world models currently render NPCs as passive pixels; ReactiveGWM models the player and an NPC as two interacting agents in the same generative video, so player actions can causally affect NPC behavior. Demonstrated on a simplified 2D action game; treats NPC steering as a learned latent intervention rather than an external prompt.

world-models video agents

#24

PhysBrain 1.0 Technical Report

Robotic Autonomy 2026-05-14 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 7.2/5.6/5.5

Vision-language-action model that augments robot trajectories with structured commonsense supervision extracted from large-scale human egocentric video — scene elements, spatial dynamics, action execution, depth — before robot-specific adaptation. Reports better generalization on novel kitchen and tabletop tasks than baselines trained on robot data alone.

VLA robotics embodied

#25

DiagnosticIQ: Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

Evaluations & Benchmarks 2026-05-09 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 7.2/5.6/5.5

Benchmark for whether LLMs can map engineer-authored symbolic monitoring rules to concrete maintenance steps. Bottleneck is response, not detection: translating a rule into a fix requires asset-specific expertise. Tests frontier models against an expert-labeled set drawn from real industrial monitoring deployments.

evals industrial-AI

#26

Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

Research 2026-05-15 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 7.2/5.6/5.5

Two-stage LLM-agent system for neural-architecture search beyond standard transformers. AIRA-Compose uses 11 agents under a 24-hour budget to explore primitive operations; AIRA-Design implements promising candidates. Frames the work as a step toward recursive self-improvement on the ML stack itself.

NAS agents architecture

#27

Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

Post-Training 2026-05-14 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 7.2/5.6/5.5

RLVR with sparse binary rewards underutilizes failed trajectories. The proposed correction-oriented variant treats each failed attempt as input to a learned correction policy, producing dense process-style signal from sparse outcome signal. Reports gains on math reasoning at constant compute.

RLVR post-training

#28

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse MoE

Efficiency 2026-05-13 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 7.2/5.6/5.5

Learning-free expert-merging compressor for sparse MoE that uses Hodge decomposition on the expert-compatibility graph rather than pairwise similarity. The pairwise approach fails when three experts are pairwise compatible but cycle when merged; the topological cover catches the cycle. Reports better compression-quality trade-off on Mixtral-class MoEs.

MoE compression efficiency

#29

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

Agents & Tool Use 2026-05-15 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 7.2/5.3/5.5

GUI agents work well on region-tolerant interactions where any nearby pixel works, but fail on precise geometric tasks (drawing, drag-precise editing) where action targets are continuous-space points. PAGER trains a point-precise policy on top of a frozen VLM with a continuous-coordinate decoder, improving precision-task success substantially.

agents GUI VLM

#30

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

Generative Media 2026-05-14 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 7.2/5.3/5.5

Multi-step image editing system that learns its own planning policy via RL on editing-outcome rewards, rather than imitating handcrafted pipelines or a teacher. Targets vague, intent-level prompts like 'make this ad more vegetarian-friendly' that require decomposition.

image-editing agents

#31

FFAvatar: Few-Shot Feed-Forward Avatar Reconstruction

Generative Media 2026-05-14 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 7.2/5.3/5.5

Animatable 3D Gaussian head avatar reconstruction from a handful of unposed portrait images, feed-forward, in seconds rather than the hours of per-subject optimization prior work required. Reports comparable quality to fitting-based methods on standard avatar benchmarks.

3D avatar

#32

WorldAct: Activating Monolithic 3D Worlds into Interactive Object-Centric Scenes

Generative Media 2026-05-15 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 7.2/5.3/5.5

Postprocesses monolithic generative 3D scene outputs (Marble-style) into object-decomposed scenes ready for editing and physical interaction. Each object gets its own asset, transform, and contact representation, enabling downstream simulation and animation.

3D world-models

#33

Look Before You Leap: Autonomous Exploration for LLM Agents

Agents & Tool Use 2026-05-15 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 7.2/5.3/5.5

Introduces Exploration Checkpoint Coverage as a metric for agent exploration adequacy in unfamiliar environments, addressing premature-exploitation failures. Tests show frontier models systematically under-explore relative to optimum; the proposed exploration policy closes most of the gap.

agents exploration

#34

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

Robotic Autonomy 2026-05-12 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 7.2/5.3/5.5

Cross-embodiment motion transfer (human-to-robot, robot-to-robot) for video generation, without requiring paired training data across embodiments. Disentangles transferable motion dynamics from embodiment-specific appearance and targets scalable training data generation for embodied AI.

video embodied

#35

Musk-OpenAI trial enters final days — Altman's credibility on early Musk relationship the central question

Industry 2026-05-17 TechCrunch — AI 6.0 5.5/6.5/6.0

TechCrunch covers the closing days of Elon Musk v. OpenAI: the trial has pivoted to Sam Altman's credibility on the founding agreement and his characterization of the 2018 break with Musk. Documentary evidence introduced this week reportedly cuts both ways. Semafor Tech's separate dispatch frames the trial as one where 'everyone lost', a reflection that the disclosures have damaged the public reputations of both Musk and Altman as well as the broader culture of the early OpenAI nonprofit period.

OpenAI litigation Altman Musk

#36

Shein acquires Everlane for $100M as direct-to-consumer e-commerce era ends

Industry 2026-05-17 The Information — AI 5.8 5.5/5.5/6.4

Shein, the Chinese discount e-retail giant, is buying San Francisco direct-to-consumer apparel startup Everlane for around $100 million. Everlane was emblematic of the 2010s e-commerce boom that produced Warby Parker, Allbirds, and Casper; the price is a small fraction of its peak valuation. Marginal to AI, but tracked here because AI coding and marketing tooling has let mass-market fast-fashion platforms like Shein compress merchandising cycles, and that is the structural pressure pushing DTC brands like Everlane into distressed-acquisition territory.

e-commerce M&A

#37

TechCrunch Mobility: AI-skills hiring war reaches automotive, with OEMs and Tier 1s competing for ML engineers

Industry 2026-05-17 TechCrunch — AI 5.7 5.5/6.0/5.6

TechCrunch Mobility's weekly column reports auto OEMs and Tier 1 suppliers are now competing directly with hyperscalers for ML engineering talent, with comp packages in some cases matching tech-industry levels for the first time. The dynamic is partly Waymo-driven (Waymo's expansion into Tokyo and ten US cities has visibly stressed the autonomous-vehicle talent pool) and partly the broader pivot of automakers like GM, Ford, and Stellantis from in-house autonomy programs to in-house ML platform teams supporting infotainment, predictive maintenance, and ADAS stacks.

automotive labor Waymo

#38

The Information weekend: SF's 'Robo-Fight Club', VC nuclear-power bet, gene-edited embryo timing

Industry 2026-05-16 The Information — AI 5.6 5.0/5.5/6.3

The Information's Saturday wrap covers three minor-but-flavorful threads from the week: an SF underground humanoid-robot fighting event organized by hobbyists and a few employees of Figure/1X (touched on for color rather than substance); General Catalyst's increasingly polarizing strategy of high-conviction public bets on companies like the brute-force-fission startup Aalo Atomics; and a debate at SynBioBeta on gene-edited embryos where two startups were uninvited shortly before the event. Mostly cultural ambience with low technical density, included for the weekly flavor.

culture VC biotech

#39

TechCrunch: the haves and have-nots of the AI gold rush — even insiders increasingly skeptical of the narrative

Industry 2026-05-16 TechCrunch — AI 5.6 5.2/6.0/5.6

TechCrunch commentary on the gap between AI-haves (Anthropic, OpenAI, Nvidia, Microsoft, the hyperscalers) and the rest of the industry, pegging the divergence to the same data point The Information ran the same week (89% of AI-startup revenue concentrated in Anthropic + OpenAI). The piece reports growing private skepticism among VCs at the tail end of the cycle about whether the broader application layer will ever monetize at the rates the model layer currently does.

industry-concentration VC-sentiment

#40

Silicon Valley wants gene-edited embryos but the regulatory and scientific timeline is decades, not years

Industry 2026-05-16 The Information — AI 5.5 5.0/6.0/5.5

Reporter Joel Hruska covers a SynBioBeta panel where two embryo-editing startups were invited then disinvited in the days before the event. The substance: editing embryonic DNA at scale is technically more constrained than CRISPR somatic editing, regulatory environments in the US, EU, and UK forbid clinical use, and even an aggressive timeline puts the first regulated births a decade out. Of interest because several AI investors (Coinbase Ventures, Founders Fund affiliates) have been actively backing the space on the thesis that AI-assisted embryo selection becomes politically tractable once predictive accuracy crosses a threshold.

biotech embryo-editing regulation

#41

Commencement speakers told: don't mention AI — students do not want to hear about it

Industry 2026-05-17 TechCrunch — AI 5.4 5.0/5.7/5.5

TechCrunch's commencement-season piece reports that 2026 graduating classes are visibly tired of having their futures framed as AI-shaped; Semafor's coverage of the same theme ('The Class of 2026 is cooked') frames it as a job-market anxiety story rather than purely a cultural one. Notable because both outlets independently surfaced the same dynamic from different angles within the same week — a useful sentiment signal during a quarter when the macro AI narrative is otherwise dominated by capex and infrastructure stories.

culture labor