Wolf Digest — 2026-05-09

#1

Helix-02 Bedroom Tidy: Two humanoids cooperatively reset a bedroom from a single learned VLA policy

Robotic Autonomy 2026-05-08 Figure AI 8.7 8.8/8.4/9.0

Figure released Helix-02 Bedroom Tidy, a two-humanoid demonstration in which a pair of robots reset a bedroom in under two minutes: opening doors, hanging clothes, putting away headphones, closing a book, taking out trash, pushing a chair under a desk, and cooperatively making the bed. Both robots run a single learned vision-language-action policy, with no shared planner, no message passing, and no central coordinator. Each robot reads the room through its own cameras and infers its partner's intent from motion alone. To Figure's knowledge, this is the first published demonstration of a single learned neural network performing multi-humanoid collaborative locomanipulation directly from pixels to actions, and it is the most concrete public step yet beyond the February 2025 grocery-putaway demo, in which Figure first showed two robots running one shared policy.

The motor repertoire on display is unusually diverse for a unified policy. Helix-02 opens lever-handled doors with whole-body coordination, balancing as the door swings; pushes an office chair under a desk by generating force through stance and foot placement rather than arm motion alone; carries a garment across the room and drapes it onto a coat tree using both hands; picks up headphones and reorients them mid-air to seat the headband on a narrow stand; closes an open book by handling its hinged, mass-shifting cover; and operates a trash-can foot pedal with single-leg balance — using a foot as an end-effector while standing on the other leg. The bed-making segment has both robots manipulating a deformable comforter from opposite sides of the bed, lifting, unfurling, spreading, folding, and smoothing, while continuously updating their predictions about each other's contact points as the fabric drapes and slides under shared tension.

Figure frames the difficulty as three compounding problems. Two humanoids in one room is more than two single-robot problems running in parallel: every action one robot takes redefines the problem the other is solving, and each is reading its partner's intent from motion alone in real time while its own actions are simultaneously changing what the partner sees. The central object is deformable, with no fixed pose, no rigid geometry, and no canonical grasp — there is no natural seam between "your half" and "mine," so each robot has to commit to a contact point while predicting what the other will do, then update both predictions tens of times per second. And the whole sequence runs in two minutes of whole-room locomanipulation, with the robot walking naturally between locations, balancing dynamically on one leg, and switching between rigid, deformable, articulated, and collaborative manipulation without scripted handoffs.

The architectural claim is that none of this required task-specific controllers. The same underlying policy that previously learned logistics, laundry folding, kitchen cleanup, and living-room tidying now performs collaborative bedroom reset after adding more data, with no changes to the core algorithm. That is a strong scaling claim for the VLA paradigm, and it lands in the same week as several arXiv papers (including When to Trust Imagination on World Action Models and ReflectDrive-2 on RL-aligned masked diffusion driving) that are working similar territory in simulation. Figure does not release weights, training data, or compute numbers with these demos, so the public surface is video plus claims; the result is impressive but unfalsifiable in the way Physical Intelligence's pi-0 was at first. Still, multi-humanoid coordination from pixels with one policy is the cleanest expression so far of the bet that locomotion, dexterity, sensing, and inter-agent reasoning collapse into one network when you add data, and Figure is hiring against that bet.

#2

[AINews] Anthropic growing 10x/year while everyone else is laying off >10% of their workforce

Industry 2026-05-09 Latent Space (swyx & Alessio), The Information — AI 8.4 8.0/8.5/8.7

Latent Space's AINews issue for May 7-8 anchors a remarkable revenue print into a single, plottable line. After what swyx calls Anthropic's "miracle Q1" — eighty-times annualized growth, with a one-month jump of about fifteen billion dollars in annualized run rate — secondary-market and traditional reporting now puts Anthropic's valuation between one trillion and one-point-two trillion dollars. That places Anthropic somewhere between the eleventh and fifteenth most valuable company in the world, and it is the first time the lab has officially overtaken OpenAI on equity-market terms. Importantly, the chart in the issue is a revenue trajectory, not a financing-round trajectory; the rerating is being driven by realized run rate rather than narrative.

The Information added a complementary data point on Friday: Anthropic is the previously unnamed customer behind the one-point-eight-billion-dollar, seven-year cloud deal that Akamai announced on Thursday, a deal that sent Akamai's stock up twenty-seven percent. Layered on top of last month's expanded Amazon agreement for up to five gigawatts of new compute, plus the takeover of xAI's Colossus reported earlier this week, the picture that emerges is of an Anthropic that has finally cleared its compute-bottleneck problem in the same quarter that its commercial pull-through ramped to OpenAI parity on enterprise. The diversification is also notable: Akamai is not historically a frontier-AI cloud, and signing a seven-year deal there is a hedge against the fragility of any single hyperscaler relationship.

The same AINews issue notes the contrasting picture across the rest of the index. Block has cut about forty percent of its workforce, Coinbase fourteen percent, and Cloudflare twenty percent, most citing AI productivity gains, though swyx flags that some of this is plausibly "AI-washing" of otherwise normal layoffs. The asymmetry is hard to ignore: companies that are pure consumers of AI capability are shrinking, while companies producing it (Anthropic, plus Linear, which the issue calls out as an example of an AI-fluent application company that grew rather than shrank) are absorbing the headcount surplus. The broader concern in the issue is concentration: with the AI capex cycle now exceeding three Manhattan Projects per quarter across the megacaps (per Stratechery's Q1 readout, also published Friday), and with revenue growth disproportionately accruing to a handful of labs, the contribution of AI to U.S. GDP and corporate earnings is approaching levels that historically precede crowding risks.

For practitioners, the takeaway is mostly forward-looking. An Anthropic at a one-trillion-dollar valuation with eighty-times annualized growth has every incentive, and now the compute, to push pricing aggressively on Claude, expand the agent stack (Claude Code, Computer Use, the new Claude Design product), and continue the per-user-limit increases announced earlier this week. The Akamai deal in particular suggests Anthropic is positioning for an inference-side surge rather than just training scale. The competitive pressure on OpenAI to ship the GPT-5.5 successors, and on the second tier (Mistral, Cohere, AI21) not to slip a generation behind, is now numeric rather than vibes-based.

How it was discussed

AINews leads with the revenue chart and frames the rerating as realized rather than speculative, contrasting Anthropic's growth against Block / Coinbase / Cloudflare layoffs.
The Information identifies Anthropic as the customer behind Akamai's $1.8B / 7-year deal; Akamai's stock jumped 27% on the announcement.
Both sources flag the broader concentration risk: AI capex now ~3× Manhattan Project per quarter, with revenue accruing to a handful of labs.

#3

EMO: Pretraining mixture of experts for emergent modularity

Efficiency 2026-05-08 Allen Institute for AI (AI2), Hugging Face Blog, AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 7.9 7.8/8.0/7.9

Allen AI shipped EMO, a pretraining-time recipe for mixture-of-experts in which modular expert groups are designed to emerge from the data rather than being prescribed by hand. The motivating problem is now familiar: production deployments of large language models routinely need only a narrow subset of capabilities at any given time — code generation, math reasoning, medical-domain knowledge, retrieval-style summarization — yet the standard recipe forces every request to load the full set of parameters. Vanilla MoEs were supposed to solve this by routing each token through a sparse subset of experts, but in practice the experts learn entangled, partially-overlapping representations and rarely admit a clean task-level partition. EMO introduces a pretraining objective that pushes experts toward emergent modularity: groups of experts cluster around coherent capability surfaces during training, and at inference users can select a small task-specific subset while preserving near-full-model quality on the targeted task.

The technical contribution is a structured-sparsity regularizer applied during pretraining that biases the routing distribution toward block-diagonal patterns, plus a curriculum that progressively concentrates each expert's specialization as training proceeds. The Hugging Face blog post that mirrors the AI2 announcement reports that selecting roughly one-fourth of the experts at inference time recovers the majority of the dense-equivalent performance on domain-targeted benchmarks, and that the routing block-diagonality emerges sharply after a defined point in training rather than gradually, suggesting a phase transition similar to what has been observed in interpretability work on attention-head specialization. The authors compare against Mixtral-style top-k routing and standard MoE training as baselines and report meaningfully tighter capability clusters under EMO's regularization.
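One way to picture a regularizer that favors block-diagonal routing is as a penalty on the routing probability that leaves a token's own expert group. The sketch below is an illustrative stand-in, not EMO's published objective: the function name, the fixed domain-to-group assignment, and the toy routing rows are all assumptions, and the source describes groups that emerge during training rather than being assigned up front.

```python
from typing import Sequence


def off_block_mass(
    routing: Sequence[Sequence[float]],
    domain_of_row: Sequence[int],
    group_of_expert: Sequence[int],
) -> float:
    """Mean routing probability assigned to experts outside each row's group.

    routing[i][j] is the (hypothetical) probability that token i is sent to
    expert j. A perfectly block-diagonal routing matrix scores 0.0.
    """
    if not routing:
        raise ValueError("routing must contain at least one row")
    total = 0.0
    for probs, domain in zip(routing, domain_of_row):
        # Sum only the probability that lands on experts from other groups.
        total += sum(p for p, g in zip(probs, group_of_expert) if g != domain)
    return total / len(routing)


# Four experts in two groups: experts 0-1 form group 0, experts 2-3 form group 1.
GROUPS = [0, 0, 1, 1]
clean = off_block_mass([[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]], [0, 1], GROUPS)
leaky = off_block_mass([[0.5, 0.0, 0.5, 0.0]], [0], GROUPS)
```

In a training loop a term like this would be added to the language-modeling loss with some weight; how EMO weights or schedules its regularizer is not stated in the source.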

For practitioners the result is interesting in two directions. First, it gives a clean route to deploying a single base model in latency- or memory-constrained environments by carving out task-specific subsets at serving time without retraining or distilling, which is useful for on-device or edge-side deployments where the full mixture would not fit. Second, it reframes a long-standing critique of MoE architectures: that they trade dense capability for sparse pseudo-capability without giving operators a knob for what gets activated. EMO turns that knob into a property of the model rather than the inference engine. The contribution sits between Allen AI's recent Olmo work and the broader DeepSeek / Qwen / Mistral push on cheap large-MoE inference, and the reception across the AI2, Hugging Face, and arXiv surfaces (three independent posts on the same day) suggests the modularity claim is what's drawing attention rather than the headline benchmark numbers.
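Serving-time subsetting can be sketched as masking the router to an allowed set of experts and renormalizing over what remains. This is a minimal sketch under assumptions: `route_within_subset`, the logits, and the expert indices are hypothetical, and the source does not say how EMO selects or applies its task-specific subsets.

```python
import math
from typing import Dict, Sequence, Set


def route_within_subset(
    logits: Sequence[float], allowed: Set[int], top_k: int
) -> Dict[int, float]:
    """Pick the top_k allowed experts and softmax-normalize their weights.

    Experts outside `allowed` are never selected, so their parameters would
    not need to be loaded for this request.
    """
    if not allowed:
        raise ValueError("allowed expert subset must be non-empty")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    # Rank only the allowed experts by their router logit.
    kept = sorted(allowed, key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in kept)  # subtract the max for numerical stability
    weights = {i: math.exp(logits[i] - peak) for i in kept}
    norm = sum(weights.values())
    return {i: w / norm for i, w in weights.items()}


# Expert 2 has the highest logit but is not in the task subset, so it is skipped.
mix = route_within_subset([2.0, 1.0, 5.0, 0.0], allowed={0, 1}, top_k=2)
```

The design choice worth noting is that the mask is applied before top-k selection, so a strongly preferred but excluded expert cannot crowd out the allowed ones.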

Caveats are worth flagging. The "near full-model performance" claim depends on which benchmarks are being measured and how the task-specific subsets are selected; aggressive subsetting on out-of-distribution prompts may degrade more than the headline suggests. The block-diagonalization curriculum also adds optimizer overhead during pretraining, which limits applicability for groups that cannot afford a full retraining pass. Still, this is the most interesting recent attempt at making MoE expert specialization controllable rather than entangled and uninspectable, and it pairs nicely with this week's UniPool paper on globally shared expert pools for the opposite cut of the architecture-design space.

How it was discussed

Allen AI's blog leads with the practical framing — users can select small task-specific expert subsets while preserving near full-model performance.
The Hugging Face blog post mirrors the AI2 announcement and emphasizes the deployment-side benefits for memory-constrained settings.
The arXiv paper frames it as a structured-sparsity regularizer that produces emergent block-diagonal routing patterns during pretraining.

#4

Running Codex safely at OpenAI

Agents & Tool Use 2026-05-08 OpenAI Research 7.7 7.5/7.9/7.7

OpenAI published Running Codex safely at OpenAI, a substantive look at the operational controls the company runs around its own Codex coding agent: sandboxing, approval gates, network egress policy, and agent-native telemetry. The post is positioned for enterprise readers thinking about deployment of any production coding agent (Codex, Claude Code, Cursor, Devin), and it lays out the threat model and the control plane in more detail than most agent vendors have published to date. The disclosure is timely: Codex usage has grown sharply across OpenAI's own engineering org and at customers, and in the same week The Information reported that Cursor staffers are now visiting xAI offices following SpaceX's option to acquire the company, while TechCrunch covered Cloudflare's announcement that AI productivity gains made roughly eleven hundred customer-support roles obsolete. Coding-agent safety is now a procurement question, not just a research question.

The core architectural claim is that coding agents need a layered defense rather than a single sandbox boundary. Codex sessions at OpenAI run in containers with read-only views of source where possible, with explicit approval surfaces for destructive operations like rm or migrations, with outbound network access restricted to a small set of allowlisted endpoints, and with structured-event telemetry instrumented at every tool call rather than only at the model-output level. The telemetry piece is the part most enterprise teams have missed. Off-the-shelf observability stacks were designed around HTTP request graphs and process-level metrics; agent traces with tool-call provenance, planner-output diff context, and approval-gate decisions need their own schema. OpenAI describes its own internal schema and indicates that a public version is under consideration.
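The layering described above (egress allowlist, approval gate for destructive commands, one telemetry event per tool call) can be sketched in a few lines. Everything named here is hypothetical: the hosts, the command prefixes, the `ToolCall` and `Guard` types, and the event fields are assumptions for illustration, not OpenAI's schema or tooling.

```python
import json
from dataclasses import dataclass, field
from typing import List
from urllib.parse import urlparse

# Hypothetical policy data; a real deployment would load this from config.
ALLOWED_HOSTS = {"pypi.org", "artifacts.internal.example"}
DESTRUCTIVE_PREFIXES = ("rm ", "git push --force", "alembic downgrade")


@dataclass
class ToolCall:
    tool: str      # "http" or "shell" in this sketch
    argument: str  # a URL for "http", a command line for "shell"


@dataclass
class Guard:
    events: List[str] = field(default_factory=list)

    def decide(self, call: ToolCall) -> str:
        """Return "allow", "deny", or "needs_approval" and log the decision."""
        if call.tool == "http":
            host = urlparse(call.argument).hostname or ""
            decision = "allow" if host in ALLOWED_HOSTS else "deny"
        elif call.tool == "shell" and call.argument.startswith(DESTRUCTIVE_PREFIXES):
            decision = "needs_approval"
        else:
            decision = "allow"
        # One structured event per tool call, regardless of the outcome.
        self.events.append(
            json.dumps({"tool": call.tool, "argument": call.argument, "decision": decision})
        )
        return decision


guard = Guard()
results = [
    guard.decide(ToolCall("http", "https://pypi.org/simple/")),
    guard.decide(ToolCall("http", "https://evil.example/payload")),
    guard.decide(ToolCall("shell", "rm -rf build")),
    guard.decide(ToolCall("shell", "ls -la")),
]
```

Prefix matching on command strings is deliberately naive here; it is easy to bypass, which is one reason the post argues for layers rather than a single check.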

The post also addresses the most-asked operational question: when to let the agent commit to main versus open a PR for human review. OpenAI's framing is policy-driven rather than capability-driven: even where Codex is empirically capable of merging cleanly without review, branches with sensitive paths (auth, payments, release-pipeline configs) require an approver outside the model's own session. The implicit recommendation is to over-deploy approval gates initially and pull them back once telemetry shows the agent's commits clear in-house code review with a low rework rate. That recommendation will resonate with the deployment-safety teams now standing up at most large enterprises that have signed agent contracts in the last six months.
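A policy-driven merge rule of this kind might look like the sketch below, where sensitive paths force outside approval no matter how capable the agent is judged to be. The glob patterns, the function name, and the three route labels are hypothetical; the post names the path categories but not a concrete configuration.

```python
from fnmatch import fnmatchcase
from typing import Iterable

# Hypothetical patterns for the categories the post mentions:
# auth, payments, and release-pipeline configs.
SENSITIVE_GLOBS = ("auth/*", "payments/*", ".github/workflows/*")


def merge_route(changed_paths: Iterable[str], agent_may_self_merge: bool) -> str:
    """Decide how a change set from the agent reaches the main branch."""
    paths = list(changed_paths)
    # Policy check first: sensitive paths override any capability setting.
    if any(fnmatchcase(p, g) for p in paths for g in SENSITIVE_GLOBS):
        return "human_approval_required"
    return "direct_merge" if agent_may_self_merge else "open_pr"
```

For example, `merge_route(["auth/login.py", "README.md"], True)` routes to human approval even though self-merge is enabled, while a docs-only change with the same flag merges directly. Starting with `agent_may_self_merge=False` everywhere matches the post's advice to begin with more gates and relax them as telemetry accumulates.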

The post is announcement-level rather than research-level: there are no benchmarks, no failure-mode breakdowns, and no quantified false-positive rates on the approval gate. But for an industry that has been arguing about coding-agent autonomy in vibes-based terms, this is a useful reference point from one of the major shippers: how OpenAI runs its own. It pairs with the Latent Space coverage of Anthropic's enterprise growth, the rumored Codex-vs-Claude-Code comparison work circulating on arXiv, and Cursor's own evolving stance on enterprise controls. Expect this to be cited in security reviews of every coding-agent procurement deal for the rest of the year.

#5

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

Efficiency 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.6 6.6/6.6/6.6

A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, resulting in partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective.

#6

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Agents & Tool Use 2026-05-03 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.6 6.6/6.6/6.6

Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search, it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning.

#7

Continuous Latent Diffusion Language Model

Reinforcement Learning 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.5 6.5/6.5/6.5

Large language models have achieved remarkable success under the autoregressive paradigm, yet high-quality text generation need not be tied to a fixed left-to-right order. Existing alternatives still struggle to jointly achieve generation efficiency, scalable representation learning, and effective global semantic modeling. We propose Cola DLM, a hierarchical latent diffusion language model that frames text generation through hierarchical information decomposition.

#8

Audio-Visual Intelligence in Large Foundation Models

Robotic Autonomy 2026-05-05 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.5 6.5/6.5/6.5

Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, i.e., not only for understanding but also for controllable generation and reasoning across dynamic, temporally grounded signals.

#9

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

Reinforcement Learning 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.4 6.4/6.4/6.4

Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the "zero-advantage problem": when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective training signals for these questions, wasting the training data and computational budget. While simply increasing the sampling budget for these questions is a common remedy, the static sampling policy inherently constrains reasoning exploration, limiting the success rate.
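The zero-advantage problem falls straight out of the group-relative normalization. A minimal sketch, assuming the common mean-and-standard-deviation form of the group-relative advantage; the function name and the `eps` constant are illustrative and not taken from the paper:

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One rollout succeeds: the group carries a usable learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every rollout fails: each reward equals the mean, so every advantage is zero
# and the query contributes no gradient.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The same collapse happens when every rollout succeeds, since identical rewards of any value give a zero numerator.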

#10

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

Efficiency 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.4 6.4/6.4/6.4

Step distillation has become a leading technique for accelerating diffusion models, among which Distribution Matching Distillation (DMD) and Consistency Distillation are two representative paradigms. While consistency methods enforce self-consistency along the full PF-ODE trajectory to steer it toward the clean data manifold, vanilla DMD relies on sparse supervision at a few predefined discrete timesteps.

#11

MiA-Signature: Approximating Global Activation for Long-Context Understanding

Agents & Tool Use 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.3 6.3/6.3/6.3

A growing body of work in cognitive science suggests that reportable conscious access is associated with global ignition over distributed memory systems, while such activation is only partially accessible as individuals cannot directly access or enumerate all activated contents. This tension suggests a plausible mechanism that cognition may rely on a compact representation that approximates the global influence of activation on downstream processing. Inspired by this idea, we introduce the concept of Mindscape Activation Signature (MiA-Signature), a compressed representation of the global activation pattern induced by a query.

#12

RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation

Evaluations & Benchmarks 2026-05-06 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.3 6.3/6.3/6.3

We present our winning system for Task B (generation with reference passages) in SemEval-2026 Task 8: MTRAGEval. Our method is a heterogeneous ensemble of seven LLMs with two prompting variants, where a GPT-4o-mini judge selects the best candidate per instance. We ranked 1st out of 26 teams, achieving a conditioned harmonic mean of 0.7827 and outperforming the strongest baseline (gpt-oss-120b, 0.6390). Ablations show that diversity in model families, scales, and prompting strategies is essential, with the ensemble consistently beating any single model.

#13

MARBLE: Multi-Aspect Reward Balance for Diffusion RL

Reinforcement Learning 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.3 6.3/6.3/6.3

Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deals with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward R(x) = sum_k w_k R_k(x), or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be jointly trained on all rewards or necessitate heavy, manually tuned sequential training.

#14

When to Trust Imagination: Adaptive Action Execution for World Action Models

Robotic Autonomy 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.3 6.3/6.3/6.3

World Action Models (WAMs) have recently emerged as a promising paradigm for robotic manipulation by jointly predicting future visual observations and future actions. However, current WAMs typically execute a fixed number of predicted actions after each model inference, leaving the robot blind to whether the imagined future remains consistent with the actual physical rollout. In this work, we formulate adaptive WAM execution as a future-reality verification problem: the robot should execute longer when the WAM-predicted future remains reliable, and replan earlier when reality deviates from imagination.
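The execute-longer-or-replan-earlier idea can be illustrated with a simple threshold check between predicted and observed states. This is a sketch under loud assumptions: the paper's actual verifier is not described in the abstract, and the function name, the max-absolute-error metric, and the fixed `tolerance` are all hypothetical. Observations are passed in as a recorded list only to keep the example self-contained; on a robot they would arrive one step at a time.

```python
from typing import Sequence


def reliable_horizon(
    predicted: Sequence[Sequence[float]],
    observed: Sequence[Sequence[float]],
    tolerance: float,
) -> int:
    """Count leading steps where the imagined future stayed close to reality.

    Returns the index of the first step whose prediction error exceeds
    `tolerance`, i.e. the point at which the robot should replan.
    """
    steps = 0
    for pred, obs in zip(predicted, observed):
        # Largest per-dimension deviation between prediction and observation.
        error = max(abs(p - o) for p, o in zip(pred, obs))
        if error > tolerance:
            break
        steps += 1
    return steps


imagined = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
actual = [[0.0, 0.05], [1.0, 1.1], [2.5, 2.0]]
horizon = reliable_horizon(imagined, actual, tolerance=0.2)
```

With these toy values the third step deviates by 0.5, so the horizon is two steps; a fixed-length executor would have run all three regardless.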

#15

SkillOS: Learning Skill Curation for Self-Evolving Agents

Agents & Tool Use 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.3 6.3/6.3/6.3

LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-quality skill curation serves as the key bottleneck. Existing approaches either rely on manual skill curation, prescribe heuristic skill operations, or train for short-horizon skill operations. However, they still struggle to learn complex long-term curation policies from indirect and delayed feedback.

#16

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Agents & Tool Use 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.3 6.3/6.3/6.3

Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL).

#17

Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

Reinforcement Learning 2026-04-14 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.3 6.3/6.3/6.3

Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models, and GRPO-style training is widely adopted for its simplicity and effectiveness. However, an important design choice remains underexplored: how token-level policy gradient terms are aggregated within each sampled group. Standard GRPO uses sequence aggregation, while recent work has advocated token aggregation as a better alternative.
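The two aggregation choices differ only in where the averaging happens, which is easy to see on a toy group. A minimal sketch with hypothetical function names and made-up per-token values; it shows the two baselines the abstract contrasts, not the paper's proposed fix:

```python
from typing import List, Sequence


def sequence_aggregate(token_terms: Sequence[Sequence[float]]) -> float:
    """Average within each sequence first, then across sequences.

    Every sequence gets equal weight, so each token in a short sequence
    counts for more than each token in a long one.
    """
    per_sequence: List[float] = [sum(seq) / len(seq) for seq in token_terms]
    return sum(per_sequence) / len(per_sequence)


def token_aggregate(token_terms: Sequence[Sequence[float]]) -> float:
    """Average over all tokens in the group at once.

    Every token gets equal weight, so long sequences dominate the result.
    """
    flat = [t for seq in token_terms for t in seq]
    return sum(flat) / len(flat)


# A short rollout (2 tokens) with large terms and a long one (8 tokens) with none.
group = [[1.0, 1.0], [0.0] * 8]
by_sequence = sequence_aggregate(group)
by_token = token_aggregate(group)
```

On this group the sequence-level average is 0.5 while the token-level average is 0.2, so the choice changes how strongly the short rollout drives the update.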

#18

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Reinforcement Learning 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.2 6.2/6.2/6.2

Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic.

#19

Recovering Hidden Reward in Diffusion-Based Policies

Robotic Autonomy 2026-05-01 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.2 6.2/6.2/6.2

This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds.

#20

Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

Agents & Tool Use 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.1 6.1/6.1/6.1

We study auto research as a closed empirical loop driven by external measurement. Each submitted trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal. The output is not a generated paper or a single model checkpoint, but an auditable trajectory of proposals, code diffs, experiments, scores, and failure labels. We instantiate this loop with specialist agents that partition recipe surfaces and share measured lineage across trials.

#21

A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Agents & Tool Use 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.1 6.1/6.1/6.1

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool calls within multi-turn interactions. Existing approaches to such process credit assignment depend either on separate external process reward models, which add computational overhead, or on tree-based structural rollouts, which merely redistribute the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy's predicted probability of the ground truth, termed Information Gain (IG), as an intrinsic process signal without an external evaluator.
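Read literally, Information Gain is a first difference over the policy's per-turn probability of the ground-truth answer. A minimal sketch, assuming raw probabilities rather than log-probabilities (the abstract does not say which) and using a hypothetical function name and made-up values:

```python
from typing import List, Sequence


def information_gain(prob_of_truth: Sequence[float]) -> List[float]:
    """Per-turn change in the predicted probability of the ground truth.

    prob_of_truth[0] is the probability before any tool call, and
    prob_of_truth[t] is the probability after turn t.
    """
    return [after - before for before, after in zip(prob_of_truth, prob_of_truth[1:])]


# Three turns: a helpful call, a slightly harmful one, then a decisive one.
gains = information_gain([0.1, 0.4, 0.3, 0.9])
```

The per-turn gains telescope, so they sum to the total change from the first probability to the last, which is what lets them act as a decomposition of the trajectory-level signal.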

#22

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Efficiency 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.1 6.1/6.1/6.1

Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models.

#23

TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

Evaluations & Benchmarks 2026-05-06 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.1 6.1/6.1/6.1

Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space.

#24

RemoteZero: Geospatial Reasoning with Zero Human Annotations

Reinforcement Learning 2026-05-06 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.1 6.1/6.1/6.1

Geospatial reasoning requires models to resolve complex spatial semantics and user intent into precise target locations for Earth observation. Recent progress has liberated the reasoning path from manual curation, allowing models to generate their own inference chains. Yet a final dependency remains: they are still supervised by human-annotated ground-truth coordinates. This leaves the reasoning process autonomous, but not its spatial endpoint, and prevents true self-evolution on abundant unlabeled remote sensing data. To break this bottleneck, we introduce RemoteZero, a box-supervision-free framework for geospatial reasoning.

#25

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

Reinforcement Learning 2026-05-07 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.1 6.1/6.1/6.1

Large language models (LLMs) are routinely prompted to take on social roles ranging from individuals to institutions, yet it remains unclear whether their internal representations encode the granularity of such roles, from micro-level individual experience to macro-level organizational, institutional, or national reasoning. We show that they do. We define a contrast-based Granularity Axis as the difference between mean macro- and micro-role hidden states.
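The contrast-based axis is a difference of means, and a role's position along it is a projection. A minimal sketch with tiny made-up two-dimensional "hidden states"; the function names are hypothetical, and details such as which layer or token position the paper reads from are not given in the abstract:

```python
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def granularity_axis(
    macro_states: Sequence[Sequence[float]],
    micro_states: Sequence[Sequence[float]],
) -> List[float]:
    """Mean macro-role hidden state minus mean micro-role hidden state."""
    macro_mean = mean_vector(macro_states)
    micro_mean = mean_vector(micro_states)
    return [a - b for a, b in zip(macro_mean, micro_mean)]


def project(state: Sequence[float], axis: Sequence[float]) -> float:
    """Scalar position of a hidden state along the unit-normalized axis."""
    norm = sum(a * a for a in axis) ** 0.5
    if norm == 0.0:
        raise ValueError("axis has zero length; the two groups have equal means")
    return sum(s * a for s, a in zip(state, axis)) / norm


macro = [[2.0, 0.0], [4.0, 0.0]]  # e.g. institution-like roles (toy values)
micro = [[0.0, 0.0], [0.0, 2.0]]  # e.g. individual-like roles (toy values)
axis = granularity_axis(macro, micro)
```

Projecting a new role's hidden state onto `axis` then gives a single number, with larger values pointing toward the macro end.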

#26

KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

Efficiency 2026-05-06 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.1 6.1/6.1/6.1

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBench-X, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design.

#27

The Scaling Properties of Implicit Deductive Reasoning in Transformers

Safety, Policy & Regulation 2026-05-05 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.1 6.1/6.1/6.1

We investigate the scaling properties of implicit deductive reasoning over Horn clauses in depth-bounded Transformers. By systematically decorrelating provability from spurious features and enforcing algorithmic alignment, we find that in sufficiently deep models with a bidirectional prefix mask, implicit reasoning approaches explicit CoT performance across graph topologies and problem widths, though CoT remains necessary for depth extrapolation.

#28

Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

Reinforcement Learning 2026-05-07 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 6.1/6.1/6.1

Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization obscures a reliable assessment of the field's advancement.

#29

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

Reinforcement Learning 2026-05-07 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 6.1/6.1/6.1

Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization.

#30

When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

Evaluations & Benchmarks 2026-05-07 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 6.1/6.1/6.1

Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget.
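
One way to make the "valid only under a fixed configuration" condition concrete is to fingerprint the six pinned elements and refuse to compare scores whose fingerprints differ. This is a minimal sketch under that assumption; the `AuditContract` class, its field names, and the example values are hypothetical and do not come from the paper.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AuditContract:
    """The six elements the abstract says must stay fixed (names are illustrative)."""
    scenario_pack: str
    rubric: str
    auditor: str
    judge: str
    sampling_config: str
    rerun_budget: int

    def fingerprint(self) -> str:
        # Serialize with sorted keys so the hash does not depend on field order.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def comparable(a: AuditContract, b: AuditContract) -> bool:
    """Scores are only comparable when every contract element matches."""
    return a.fingerprint() == b.fingerprint()


base = AuditContract("pack-v1", "rubric-v1", "auditor-A", "judge-X", "temperature=0.7,n=5", 3)
same = AuditContract("pack-v1", "rubric-v1", "auditor-A", "judge-X", "temperature=0.7,n=5", 3)
other_judge = replace(base, judge="judge-Y")
```

Here `comparable(base, same)` holds, while swapping only the judge breaks comparability, which mirrors the abstract's point that a score carries no meaning outside its contract.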

#31

EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

Reinforcement Learning 2026-04-30 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 6.1/6.1/6.1

Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers' workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs' understanding of complex handwritten logic as a whole.

#32

David Reich – Why the Bronze Age was an inflection point in human evolution

AI for Science 2026-05-08 Dwarkesh Patel Podcast 6.0 6.0/6.0/6.0

David Reich is back. He and collaborator Ali Akbari just published a paper that overturns a long-standing consensus about human evolution — that natural selection has been dormant in our species since the agricultural revolution. By scaling ancient DNA sequencing and developing a new statistical method, they found that selection has actually sped up. Selection went especially bonkers during the Bronze Age (around 3,000 years ago). That's when gene frequencies for everything from immune function to body fat to intelligence were most in flux.

#33

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

Robotic Autonomy 2026-05-06 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 6.0/6.0/6.0

We introduce ReflectDrive-2, a masked discrete diffusion planner with a separate action expert for autonomous driving that represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along longitudinal progress and lateral heading directions and supervise the model to recover the original expert trajectory.

#34

AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

Agents & Tool Use 2026-05-07 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 6.0/6.0/6.0

We introduce the AI co-mathematician, a workbench for mathematicians to interactively leverage AI agents to pursue open-ended research. The AI co-mathematician is optimized to provide holistic support for the exploratory and iterative reality of mathematical workflows, including ideation, literature search, computational exploration, theorem proving and theory building. By providing an asynchronous, stateful workspace that manages uncertainty, refines user intent, tracks failed hypotheses, and outputs native mathematical artifacts, the system mirrors human collaborative workflows.

#35

SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

Reinforcement Learning 2026-05-07 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 6.0/6.0/6.0

High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution stage tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V.

#36

Prescriptive Scaling Laws for Data Constrained Training

Post-Training 2026-05-02 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 6.0/6.0/6.0

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice.

#37

TIDE: Every Layer Knows the Token Beneath the Context

Research 2026-05-07 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 6.0/6.0/6.0

We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type vocabulary distribution leaves rare-token embeddings chronically under-trained, because they receive only a fraction of the cumulative gradient signal that common tokens receive; and (ii) the Contextual Collapse Problem, where limited-parameter models map distributionally similar tokens to indistinguishable hidden states.
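
The scale of the Rare Token Problem is easy to estimate with a back-of-the-envelope calculation. The sketch below assumes an idealized Zipf distribution with exponent 1 and an untied input embedding, so that a token's embedding row is only updated when the token appears in the input; the vocabulary size and token count are illustrative values, not figures from the paper.

```python
def zipf_probabilities(vocab_size, exponent=1.0):
    """Normalized Zipf probabilities: rank r gets weight 1 / r**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_updates(vocab_size, num_tokens, exponent=1.0):
    """Expected number of training positions at which each rank's embedding is touched."""
    return [p * num_tokens for p in zipf_probabilities(vocab_size, exponent)]


# Illustrative setting (assumption): 50k-token vocabulary, 1B training tokens.
updates = expected_updates(vocab_size=50_000, num_tokens=1_000_000_000)
ratio = updates[0] / updates[-1]  # most common token vs. rarest token
```

With exponent 1 the ratio between the most common and the rarest token equals the vocabulary size, so in this setting the rarest embedding sees roughly 50,000 times fewer updates than the most frequent one.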

#38

PianoCoRe: Combined and Refined Piano MIDI Dataset

Reinforcement Learning 2026-05-07 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 6.0/6.0/6.0

Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. This work presents PianoCoRe, a large-scale piano MIDI dataset that unifies and refines major open-source piano corpora. The dataset contains 250,046 performances of 5,625 pieces written by 483 composers, totaling 21,763 h of performed music.

#39

GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs

Multimodal 2026-05-07 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 6.0/6.0/6.0

We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity (O(1)), regardless of the number of integrated experts.
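
The abstract does not spell out how GeoStack's weight folding works, so the following is only a generic illustration of why folding can make inference cost independent of the number of experts. It assumes each expert contributes a plain additive weight delta to a single linear layer, which may differ from the paper's actual construction; all function names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fold(base, deltas):
    """Merge all expert deltas into one weight matrix, once, ahead of inference."""
    folded = [row[:] for row in base]
    for delta in deltas:
        folded = [[f + d for f, d in zip(frow, drow)] for frow, drow in zip(folded, delta)]
    return folded


def unfolded_forward(base, deltas, x):
    """Reference path: one extra matrix-vector product per expert at inference time."""
    out = matvec(base, x)
    for delta in deltas:
        out = [o + d for o, d in zip(out, matvec(delta, x))]
    return out


# Toy 2x2 layer with two hypothetical expert deltas.
base = [[1, 0], [0, 1]]
deltas = [[[0, 1], [0, 0]], [[0, 0], [2, 0]]]
x = [3, 4]

folded = fold(base, deltas)
folded_out = matvec(folded, x)                   # one product, however many experts
reference_out = unfolded_forward(base, deltas, x)
```

Both paths give the same output, but the folded path performs a single matrix-vector product per input regardless of how many deltas were merged, which is the sense in which inference cost stays constant in the number of experts.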

#40

Generative Quantum-inspired Kolmogorov-Arnold Eigensolver

Evaluations & Benchmarks 2026-05-06 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 6.0/6.0/6.0

High-performance computing (HPC) is increasingly important for scalable quantum chemistry workflows that couple classical generative models, quantum circuit simulation, and selected configuration interaction postprocessing. We present the generative quantum-inspired Kolmogorov-Arnold eigensolver (GQKAE), a parameter-efficient extension of the generative quantum eigensolver (GQE) for quantum chemistry. GQKAE replaces the parameter-heavy feed-forward network components in GPT-style generative eigensolvers with hybrid quantum-inspired Kolmogorov-Arnold network modules, forming a compact HQKANsformer backbone.

#41

BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

Agents & Tool Use 2026-05-07 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 6.0/6.0/6.0

Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory. A key limitation is the inability of LLMs to effectively leverage biomedical tools, which clinical experts and biomedical researchers rely on extensively in daily workflows. While recent general-domain tool-calling datasets have substantially improved the capabilities of LLM agents, existing efforts in the biomedical domain largely rely on in-context learning and restrict models to a small set of tools.

#42

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

Reinforcement Learning 2026-05-07 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.0 6.0/6.0/6.0

In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data.

#43

Pentagon OIG partners with Justice Department’s new government fraud-hunting team

Government & Defense 2026-05-08 DefenseScoop 5.9 5.9/5.9/5.9

The Pentagon's inspector general recently met with the Justice Department's first-ever assistant attorney general for the national fraud enforcement division to explore opportunities for their teams to cooperate more closely on efforts to confront increasingly complex scams impacting the Defense Department and military.

#44

Here’s what you need to know about the cruise ship hantavirus outbreak

Industry 2026-05-08 MIT Technology Review — AI 5.9 5.9/5.9/5.9

Eight passengers aboard a Dutch-flagged cruise ship have contracted a type of hantavirus, a rare virus transmitted by rats. Three of them have died. As the ship prepares to dock in the Canary Islands, plans are being finalized to let the remaining passengers and crew disembark safely. The virus in question appears to have a high fatality rate.

#45

DeepSeek To Raise More than $7 Billion as Startup Plots Revenue Efforts

Industry 2026-05-08 The Information — AI 5.9 5.9/5.9/5.9

Liang Wenfeng, billionaire founder and CEO of DeepSeek, is planning to write the biggest check for the startup's first-ever funding round, which the company now hopes will raise up to 50 billion yuan ($7.35 billion), according to two people with direct knowledge of the talks. That would make it the biggest funding round by a Chinese AI company ever. Meanwhile, the funding round has prompted the Chinese AI lab to expedite its plans to generate revenue and become commercially viable.

#46

Marines ‘wrestling’ with tough questions over sensors, robotics for Corps’ revamped reconnaissance training

Government & Defense 2026-05-08 DefenseScoop 5.8 5.8/5.8/5.8

The Marine Corps is overhauling how it trains its reconnaissance troops as ubiquitous surveillance tools redefine modern conflict and service officials wrestle with difficult questions over sensor and robotic employment for new training. Two new courses replaced the Basic Reconnaissance Course, a 12-week program known for its grueling nature, in a move that service officials said was meant to modernize training, reduce wait times for advanced schools and strengthen baseline infantry skills to meet fleet demands.

#47

Pentagon counter-drone task force announces pilot program to get directed energy systems to 5 installations

Government & Defense 2026-05-08 DefenseScoop 5.8 5.8/5.8/5.8

The Pentagon's counter-drone task force announced a new pilot program this week aimed at fielding directed energy systems for UAS defense to five military installations across the country over the next six months. Joint Interagency Task Force 401, an Army-led entity charged with boosting the military's counter-drone efforts, said the initiative is intended to protect infrastructure, military installations and domestic missions against unmanned aerial systems.

#48

CyberSecQwen-4B: Why Defensive Cyber Needs Small, Specialized, Locally-Runnable Models

Safety, Policy & Regulation 2026-05-08 Hugging Face Blog 5.8 5.8/5.8/5.8

#49

Building realistic electric transmission grid dataset at scale: a pipeline from open dataset

AI for Science 2026-05-08 Microsoft Research Blog 5.8 5.8/5.8/5.8

At a glance: We construct geographically grounded, electrically coherent power grid models entirely from publicly available data and release a dataset spanning 48 U.S. states and multi-state interconnections. The models support AC optimal power flow (AC‑OPF) analysis, enabling physics-based study of congestion, capacity, and demand siting without restricted data. We demonstrate applications including transmission expansion potential, targeted line upgrades, and placement of large datacenter loads. Microsoft Research is excited to release an open dataset of approximate transmission topology of the U.S. power grid derived from publicly available data.

#50

Musk v. Altman week 2: OpenAI fires back, and Shivon Zilis reveals that Musk tried to poach Sam Altman

Industry 2026-05-08 MIT Technology Review — AI 5.8 5.8/5.8/5.8

In the second week of the landmark trial between Elon Musk and OpenAI, Musk's motivations for bringing the suit were under scrutiny. Last week, Musk took the stand, alleging that OpenAI CEO Sam Altman and president Greg Brockman had deceived him into donating $38 million to the company. He claimed that they'd promised to maintain it as a nonprofit dedicated to developing AI for the benefit of humanity, only to later accept billions of dollars of investment from Microsoft and restructure the company to operate a for-profit subsidiary.

#51

The Download: AI malaise and babymaking tech

Robotic Autonomy 2026-05-08 MIT Technology Review — AI 5.8 5.8/5.8/5.8

This is today's edition of The Download, our weekday newsletter that provides a daily dose of what's going on in the world of technology. We've entered the era of AI malaise: AI is spreading everywhere, and it is not going away. But what will it do? What effect will it have on our society? Will it make life better, or worse? How will we know?

#52

Here’s how technology transformed babymaking

Robotic Autonomy 2026-05-08 MIT Technology Review — AI 5.8 5.8/5.8/5.8

Technology is changing the way we make babies. The pioneering work of the scientists who invented IVF led to the birth of the first "test tube baby" in 1978. We've come a long, long way since then. This week, I've been working on a piece about the cutting edge of IVF technologies and what's coming next. Think AI and robots and, potentially, gene-edited embryos. My reporting has also made me think about just how much progress has been made in the last five decades.

#53

Exclusive: Scale Investor Dan Levine Steps Back From Accel

Industry 2026-05-08 The Information — AI 5.8 5.8/5.8/5.8

Dan Levine, a partner at Accel known for early startup investments in Scale AI and Vercel, is stepping back from the venture capital firm and will no longer make new investments for it, people familiar with the matter said. The firm disclosed the change to limited partners in recent weeks. It ...

#54

Anthropic Signs $1.8 Billion Cloud Deal with Akamai

Infrastructure 2026-05-08 The Information — AI 5.8 5.8/5.8/5.8

Anthropic is the unnamed customer behind a $1.8 billion cloud deal that Akamai announced on Thursday, according to a person with direct knowledge of the matter. That deal sent the content delivery network provider's stock soaring 27% on Friday. The seven-year cloud contract—which Akamai ...

#55

Cursor Staff Meet With xAI Employees as Layoffs, Exits Mount

Industry 2026-05-08 The Information — AI 5.8 5.8/5.8/5.8

Cursor is already starting to make its presence known at SpaceX's AI unit, just weeks after Elon Musk's firm got an option to buy the coding startup for $60 billion. Cursor staffers have been visiting xAI offices to meet with employees and discuss their work, according to two people with direct knowledge of the companies. Fresh xAI exits have followed, including staff cuts last Friday, the people said.

#56

Exclusive: Salesforce Chief Communications Officer to Depart

Industry 2026-05-08 The Information — AI 5.8 5.8/5.8/5.8

Salesforce's chief communications officer Carolyn Guss is leaving the sales software giant, according to two people familiar with the matter. One of the people said Guss is joining a new company and that Salesforce will be appointing another chief communications officer soon. Among her ...

#57

Apple and Intel Have Reached “Preliminary” Chip Manufacturing Deal

Infrastructure 2026-05-08 The Information — AI 5.8 5.8/5.8/5.8

Apple has reached "a preliminary agreement" with Intel in recent months for manufacturing some of its chips, The Wall Street Journal reported. Earlier this week, Bloomberg reported that Apple was exploring such a partnership with Intel and Samsung. A deal with one of the companies would help ...

#58

Microsoft Shares Sink After TCI Cuts $8 Billion Stake

Industry 2026-05-08 The Information — AI 5.8 5.8/5.8/5.8

The hedge fund TCI cut almost all of its $8 billion stake in Microsoft, the Financial Times reported Friday, due to what CEO Christopher Hohn described in an investor letter as "uncertainty over Microsoft's competitive position in the future" as AI threatens to replace existing software like ...

#59

Corning’s Stock Rally Risks Outpacing Reality

Infrastructure 2026-05-08 The Information — AI 5.8 5.8/5.8/5.8

Shares of iconic glassmaker Corning Inc. have doubled so far this year to their highest price ever, thanks to surging demand for Corning's fiber optic cables that connect servers and AI data centers to one another. A partnership Corning announced with Nvidia this week gave the stock another lift. But Corning's current deluxe valuation, which is now based almost entirely on sky-high expectations for its optics business, might be more fragile than it appears.

#60

OpenAI's GPT 5.5 Instant: The Good, The Bad And The Insane

Frontier LLMs 2026-05-08 Two Minute Papers 5.8 5.8/5.8/5.8

📝 GPT 5.5 Instant: https://deploymentsafety.openai.com/gpt-5-5-instant/introduction https://openai.com/index/gpt-5-5-instant/ Classifiers paper: https://arxiv.org/pdf/2501.18837

#61

Schumer seeks DHS plan on AI cyber coordination with state, local governments

Government & Defense 2026-05-08 FedScoop — AI 5.7 5.7/5.7/5.7

#62

2026.19: Earning & Spending

Industry 2026-05-08 Stratechery 5.7 5.7/5.7/5.7

Welcome back to This Week in Stratechery! As a reminder, each week, every Friday, we're sending out this overview of content in the Stratechery bundle; highlighted links are free for everyone. Here were a few of our favorites this week.

#63

As Adversaries Integrate, U.S. Partners Bypass Washington

Government & Defense 2026-05-08 War on the Rocks 5.7 5.7/5.7/5.7

The drones hitting Gulf Arab states daily since the United States and Israel launched large-scale military operations against Iran in February are not merely Iranian. They are originally Iranian, yes. But these designs and production processes were improved and refined by Russia through years of battlefield testing against Ukrainian defenses. So, they were returned to Tehran from Moscow. Confronted with a threat that Ukraine has spent four years learning to counter, the United States found itself in unfamiliar territory.

#64

Laid-off Oracle workers tried to negotiate better severance. Oracle said no.