Wolf Digest — 2026-05-30

#1

Anthropic raises $65B Series H at $965B post-money valuation

Industry 2026-05-28 Anthropic News 8.6 8.3/8.3/9.2

Anthropic disclosed a $65 billion Series H at a $965 billion post-money valuation alongside the Opus 4.8 launch. The round sits inside a multi-leg compute build-out announced in parallel: a 5 gigawatt commitment with Amazon, a 5 gigawatt TPU commitment with Google and Broadcom, and a SpaceX Colossus data-center participation. Stratechery and several follow-on reads frame the financing as the necessary other side of the Opus 4.8 capability story — the model only ships if the compute backing the next-generation training runs is locked in. The post-money number is now the largest valuation ever attached to an AI company at the financing stage and effectively closes the gap with OpenAI on private-market pricing.

anthropic funding compute

#2

Mistral launches physics AI: foundation models for engineering simulation

AI for Science 2026-05-27 Mistral AI News 7.6 7.8/8.0/7.0

Mistral introduced a new class of models trained to predict the behavior of physical systems, framed as a foundation for accelerating engineers and hardware product teams. The announcement positions Mistral alongside neural-surrogate efforts at the major labs (DeepMind GraphCast/GenCast lineage on weather, NVIDIA Modulus on engineering CFD), aiming for general physics emulators rather than domain-specific surrogates. Specifics on architecture, training data, and benchmark numbers were not published in the launch post; the practical read is that Europe's frontier-LLM lab is opening a second product line in scientific computing as the LLM market consolidates.

mistral physics simulation

#3

SGLang + AMD MI355X with MoRI hits cost-competitive DeepSeek-R1 distributed inference

Infrastructure 2026-05-28 LMSYS Blog (Chatbot Arena) 7.4 8.0/6.6/7.6

SGLang and the AMD team published TCO comparisons for disaggregated DeepSeek-R1 inference on AMD Instinct MI355X with MoRI (Modular RDMA Interface). The post argues parity-to-better cost-per-token versus NVIDIA H100/B200 baselines at the same throughput class, driven by HBM3e capacity headroom for MoE expert weights and a MoRI all-to-all that closes the EP-collective gap MI300X struggled with. Relevant for buyers running open-weight MoE workloads at scale who have been waiting for a credible alternative to NVL72-class systems.

amd mi355x sglang deepseek

#4

Physical Intelligence pi-0.7: a steerable robotic foundation model with emergent capabilities

Robotic Autonomy 2026-04-16 Physical Intelligence (PI) 7.3 7.6/7.6/6.7

Physical Intelligence published the pi-0.7 announcement in mid-April, but the page required a browser to render and was missed by the previous fetcher pass; it is surfaced now as the most recent PI release in the catch-up sweep. pi-0.7 is a steerable VLA exhibiting what the team frames as a step change in cross-task generalization. The post sits in a line that includes the March precise-manipulation online-RL work, the March long-and-short-term memory VLA, the February Physical Intelligence Layer announcement with partner deployments, and the November pi-star-0.6 RL-experience system.

physical-intelligence vla robotics

#5

Transformer Circuits Thread (Anthropic): Natural Language Autoencoders + HeadVis — interpretability shipped this month

Interpretability 2026-05-28 Transformer Circuits Thread (Anthropic) 7.2 6.8/8.0/6.8

Anthropic's Interpretability Thread published two May 2026 entries: Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations (Fraser-Taliente, Kantamneni, Ong, et al.) — training Claude to translate its own internal state into natural language — and HeadVis, an interactive tool for inspecting attention-head behavior. Yesterday's digest scored the NLAE/HeadVis combination at 7.6; today's entry surfaces the page-level state for readers who want the index of recent posts. The thread also has the April 2026 Emotion Concepts paper (Sofroniew et al.) which finds emotion representations in Claude Sonnet 4.5 that causally influence outputs.

interpretability sae attention

#6

Arc Institute: PerturbSpace brings spatial context to in vivo CRISPR screens

AI for Science 2026-05-18 Arc Institute 7.0 7.2/7.4/6.4

Arc Institute described PerturbSpace, a single labeling step added to a standard single-cell sequencing workflow that lets each cell carry its tissue location alongside its transcriptome, CRISPR guide identity, and other modalities. The result is in-vivo CRISPR screen datasets where neighbor and tissue context are first-class features — addressing a long-standing limitation that perturbation screens were effectively run on cells stripped from their tissue environment. The post landed mid-month but represents an Arc release worth pulling forward as Saturday catch-up.

arc-institute crispr spatial

#7

Perplexity ships Personal Computer for Mac and Computer-for-Office (Excel/Word/PowerPoint/Outlook)

Agents & Tool Use 2026-05-28 Perplexity AI 6.9 7.1/6.3/7.3

Perplexity closed out a busy month for the Computer agent stack: Personal Computer (a local-machine variant of the multi-model orchestration product) is now available to all Mac users, and the company extended Computer into the four classic Microsoft Office surfaces. The pitch is that the same agent that drives the Comet browser now reaches the files-and-apps boundary on the desktop. Worth tracking as Anthropic, Google, and Perplexity converge on similar product shapes: an agent that reads and writes across local apps with human-in-the-loop confirmation.

perplexity agents computer-use

#8

CDAO: Five Eyes allies accelerate 'Project Arcadia' at Combined Digital Leadership Summit

Government & Defense 2026-05-11 DoD Chief Digital and AI Office (CDAO) 6.8 6.4/7.6/6.4

The Department of War CDAO surfaced two May items: the May 11 Combined Digital Leadership Summit progress note on Project Arcadia, the Five Eyes shared digital-battlespace initiative; and a May 1 Classified Networks AI Agreements release that formalizes data-sharing arrangements for AI work on classified fabrics. Both are continuation pieces in the broader CDAO push that began with the December GenAI.mil platform launch and the January innovation-ecosystem overhaul.

cdao five-eyes classified

#9

Artificial Analysis changelog: Claude Opus 4.8 #1 article and Gemini 3.5 Flash medium evaluation added

Evaluations & Benchmarks 2026-05-28 Artificial Analysis 6.7 6.1/6.8/7.2

Artificial Analysis published a "Claude Opus 4.8 - The new #1 AI model" article on May 28 and listed both the Opus 4.8 (Adaptive Reasoning, Max Effort) evaluation entry and a Gemini 3.5 Flash (medium) evaluation on May 27. Current Intelligence Index v4.0 leaderboard at 28 of 527 models: Claude Opus 4.8 (max) 61.4, GPT-5.5 (xhigh) 60.2, Claude Opus 4.7 (max) 57.3, Gemini 3.1 Pro Preview 57.2, GPT-5.4 (xhigh) 56.8, Qwen3.7 Max 56.6, Gemini 3.5 Flash 55.3, Kimi K2.6 53.9, MiMo-V2.5-Pro 53.8, Grok 4.3 (high) 53.2. Opus 4.8 also tops the new GDPval-AA economically-valuable-tasks leaderboard at Elo 1890.

benchmarks leaderboard opus-4.8

#10

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Robotic Autonomy 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.7 6.8/6.8/6.6

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where...

#11

CSET releases 'Beyond P(doom) for AI Risk: Quantifying Uncertainty Without Probability' (Lohn)

Safety, Policy & Regulation 2026-05-15 CSET — Center for Security and Emerging Technology (Georgetown) 6.6 6.2/7.4/6.2

Andrew Lohn's May 2026 CSET report argues that probability-of-doom estimates are the wrong tool for catastrophic-AI risk modeling and proposes Belief and Plausibility (Dempster-Shafer) as a more honest representation of the actual epistemic state. The report sits alongside Bresnick and Dohmen's April piece on trusted semiconductor supply chains and a series of new CSET Chinese-policy translations covering the May 12 AI Ethics Management draft, the Data Factor of Production three-year plan, and the amended Cybersecurity Law.

cset risk uncertainty
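
For readers unfamiliar with the Belief and Plausibility measures named in the CSET item above, the following is a minimal sketch of the standard Dempster-Shafer definitions. It is not taken from the report: the frame of discernment, the hypothesis names, and the mass values are all hypothetical and chosen only to show how uncommitted mass produces an interval rather than a single probability.

```python
# Illustrative only: the frame and the mass values below are assumptions,
# not figures from the CSET report.
FRAME = frozenset({"catastrophe", "no_catastrophe"})

# A mass function assigns weight to subsets of the frame; weights sum to 1.
mass = {
    frozenset({"catastrophe"}): 0.1,
    frozenset({"no_catastrophe"}): 0.3,
    FRAME: 0.6,  # mass left uncommitted between the two outcomes
}

def belief(hypothesis, mass):
    """Total mass on non-empty subsets fully contained in the hypothesis."""
    return sum(m for focal, m in mass.items() if focal and focal <= hypothesis)

def plausibility(hypothesis, mass):
    """Total mass on subsets that overlap the hypothesis at all."""
    return sum(m for focal, m in mass.items() if focal & hypothesis)

h = frozenset({"catastrophe"})
bel = belief(h, mass)
pl = plausibility(h, mass)
print(f"Bel = {bel:.2f}, Pl = {pl:.2f}")
```

Belief counts only evidence committed to the hypothesis, while plausibility also counts the uncommitted mass, so the gap between the two makes the amount of ignorance explicit instead of folding it into one number.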

#12

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Agents & Tool Use 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.6 6.4/6.1/7.4

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment,...

#13

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Generative Media 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.6 6.2/6.2/7.5

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13...

#14

DIU launches Navy Mine Countermeasure Modernization Prize Challenge and Army Driverless Cars Prize Challenge

Government & Defense 2026-05-29 Defense Innovation Unit (DIU) 6.5 6.1/7.3/6.1

DIU announced two prize challenges in the last week of May: a Navy partnership on autonomous mine countermeasures (May 29) and an Army partnership on driverless ground vehicles (May 22). The challenges follow the agency's January re-designation as a Defense Field Activity and continue the "Drone Dominance" program structure from February. Companion piece: a May 19 project spotlight on XR devices for training-bottleneck mitigation.

diu autonomy prize-challenges

#15

Stability AI ships Stable Audio 3.0 — open-weight model family on fully licensed training data

Audio & Speech 2026-05-20 Stability AI News 6.5 6.3/6.3/6.9

Stability AI released the Stable Audio 3.0 family on May 20, an open-weight set trained on fully licensed data and positioned as the foundation for downstream audio-community work. The release is the successor to Stable Audio 2.5 (enterprise-grade, September 2025) and Stable Audio Open Small (Arm partnership, on-device). Surfaced from the browser-captured news listing in the Saturday catch-up; it falls outside the in-window slice but is the most current open-weight audio generation alternative to Suno and ElevenLabs.

stability audio open-weights

#16

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

Research 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.5 6.7/6.5/6.4

Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs and property graphs. Existing retrievers, however, operate over one source at a time under a fixed query language, leaving the broader landscape of available knowledge fragmented behind incompatible interfaces. A natural attempt at unification would collapse these sources into a shared space, but this erases the structural affordances (such as schemas, ontologies, compositional operators) that give each source its expressive power. Effective retrieval over diverse knowledge, therefore, requires not homogenization but an overarching layer that meets each source on its own terms. To achieve this, we present OmniRetrieval, a framework that takes any natural-language query, identifies appropriate knowledge sources, and dispatches...

#17

Anthropic Newsroom: Milan office opens; Korea Representative Director KiYoung Choi appointed ahead of Seoul

Industry 2026-05-27 Anthropic News 6.4 6.1/6.1/7.0

Two procedural announcements from Anthropic this week reflect the international rollout that follows the Series H. A Milan office now anchors Italian enterprise sales, research recruiting, and developer relations; KiYoung Choi was named Representative Director for Korea ahead of a Seoul office opening. Neither item is a research or product release on its own, but together they continue the pattern of pre-positioning national-market footprints in tandem with the sovereign-AI wave that Cohere, Mistral, and OpenAI are also leaning into.

anthropic enterprise

#18

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

Post-Training 2026-05-25 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.4 6.4/6.6/6.1

Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via Low-Rank Adaptation (LoRA). As the number of desired effects grows, storing and dynamically loading these numerous effect LoRAs significantly increases deployment overhead. Furthermore, current pipelines typically cascade these effect LoRAs with acceleration modules for fast generation, which triggers severe parameter interference and results in concept bleeding and style degradation. We propose CollectionLoRA, a multi-teacher on-policy distillation framework capable of distilling the concepts of up to 50 different effect LoRAs along with few-step generation capabilities into a single LoRA. This fundamentally resolves the feature interference issue and significantly reduces deployment costs. Specifically, the method introduces (i) a Probabilistic Dual-Stream Routing mechanism that...

#19

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Generative Media 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.4 6.2/5.8/7.1

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation,...

#20

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Multimodal 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.4 6.5/6.3/6.3

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and...

#21

Xetrieval: Mechanistically Explaining Dense Retrieval

Research 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.4 6.5/6.3/6.3

Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high-dimensional embeddings. Existing explanations often focus on surface signals, such as lexical matches, token alignments, or post-hoc textual rationales, and thus provide limited insight into the latent factors that shape dense retrieval behavior at the embedding level. We propose Xetrieval, an embedding-level mechanistic framework for explaining dense retrieval. Xetrieval first introduces a lightweight reasoning internalizer that approximates Chain-of-Thought reasoning directly in the embedding space with a single forward pass, enriching sentence embeddings with reasoning-oriented information while avoiding expensive autoregressive generation. It then decomposes these reasoning-enhanced embeddings into sparse, human-interpretable features, each associated with a coherent natural language description. By aggregating sparse feature overlaps...

#22

GenClaw: Code-Driven Agentic Image Generation

Generative Media 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.3 5.8/5.8/7.2

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting for generation refinement, leaving them with no mechanism to directly manipulate the canvas. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then...

#23

EarlyTom: Early Token Compression Completes Fast Video Understanding

Generative Media 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.3 5.8/5.8/7.2

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion to the time-to-first-token (TTFT). Therefore, instead of compressing visual tokens only after the vision encoder, performing compression inside the encoder still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs...

#24

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Post-Training 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.3 6.5/6.2/6.2

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically quantify exact parametric memory. We introduce the Parametric Memory Law, a robust power law linking loss reduction Delta L to effective parameters and sequence length. At the token level, fine-grained analysis reveals a deterministic phase transition, demonstrating that a prediction probability of p > 0.5 constitutes a sufficient condition for verbatim...

#25

Native Audio-Visual Alignment for Generation

Research 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.3 6.3/6.5/6.0

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint...

#26

Colored Noise Diffusion Sampling

Generative Media 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.3 5.8/5.8/7.2

Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures early and high-frequency fine details later. Conventional stochastic differential equation (SDE) solvers fail to account for this dynamic, naively injecting uniform white noise throughout the entire process and misusing the finite energy budget. In this work, we establish a mathematical framework that reconsiders SDE inference as a targeted, frequency-decoupled energy transfer. Leveraging this framework, we introduce Colored Noise Sampling (CNS), a novel, training-free stochastic solver. Rather than injecting uniform white noise, CNS utilizes a dynamic, timestep- and frequency-dependent schedule that more efficiently allocates injected energy toward structurally unresolved frequency bands. By actively exploiting the model's inherent spectral bias, CNS systematically steers...

#27

RePoT: Recoverable Program-of-Thought via Checkpoint Repair

Research 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.3 6.5/6.2/6.2

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, then one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium; against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini -- a capability-scaling pattern we begin...
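
The replay-then-resume loop described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the 3x3 grid environment, the function names `step`, `verified_replay`, and `repair`, and the stubbed `stub_resume` (standing in for the single LLM call) are all hypothetical and exist only to show the control flow.

```python
# Toy illustration only: environment, names, and the stubbed "LLM call"
# are assumptions, not details from the paper.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(state, action, size=3):
    """Return the next grid cell, or None if the action is invalid."""
    if action not in MOVES:
        return None
    dx, dy = MOVES[action]
    x, y = state[0] + dx, state[1] + dy
    return (x, y) if 0 <= x < size and 0 <= y < size else None

def verified_replay(start, plan):
    """Walk the plan deterministically; stop at the first invalid transition.

    Returns the verified prefix of the plan and the state reached by it.
    """
    state = start
    for i, action in enumerate(plan):
        nxt = step(state, action)
        if nxt is None:
            return plan[:i], state
        state = nxt
    return list(plan), state

def repair(start, plan, resume):
    """Keep the verified prefix; ask `resume` once for a continuation."""
    prefix, state = verified_replay(start, plan)
    if len(prefix) == len(plan):
        return prefix  # whole plan verified, no extra call needed
    return prefix + resume(state, prefix)

def stub_resume(state, prefix):
    # Stand-in for the one extra LLM call; returns a fixed continuation.
    return ["down", "down"]

plan = ["right", "right", "right", "down"]  # third "right" leaves the grid
repaired = repair((0, 0), plan, stub_resume)
print(repaired)
```

In this toy run the replay keeps the first two moves, discards the invalid third move and everything after it, and appends the stubbed continuation; the extra call is made only when the replay finds an invalid transition.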

#28

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

Post-Training 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.2 6.4/6.2/6.1

Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, there has been little exploration of the problem of data contamination in RL post-training, potentially undermining generalization and evaluation reliability of the training process itself. Existing detection methods primarily rely on output-level signals such as likelihood or entropy, which become unreliable for RL-trained models since RL shapes behavior through trajectory-level rewards rather than token likelihoods. We propose LaRA, a layer-wise representation analysis framework for detecting contamination in RL post-trained LLMs. LaRA introduces three complementary metrics, measuring perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations. We find that contamination produces progressive geometric deviations across layers, including amplified perturbation sensitivity, stronger directional collapse, and...

#29

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Multimodal 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.2 6.4/6.2/6.1

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations...

#30

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Research 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.2 6.4/6.2/6.1

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief-tracking prompts provide limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief-state dynamics...

#31

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

Post-Training 2026-05-28 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 6.2/6.0/6.0

Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing methods often rely on fixed steering directions or task-specific intervention modules, making them difficult to adapt to fine-grained concepts and compositional constraints. We propose UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations from natural-language conditions. Instead of fitting a separate intervention for each target behavior, UniSteer learns a universal conditional velocity field in activation space. At inference time, UniSteer performs flow inversion by partially transporting a source activation toward a latent state and regenerating it under a target textual condition before...

#32

Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Agents & Tool Use 2026-05-27 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 6.0/5.2/7.0

Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer and task-specific skills for dynamic execution. However, existing skill-based reinforcement learning (RL) methods typically force a rigid choice between full externalization, which incurs prohibitive context overhead, and full internalization, which risks overfitting and knowledge conflicts. To address this dilemma, we propose Skill0.5, a novel agentic RL framework that explicitly differentiates skill treatments by combining general skill internalization with task-specific skill utilization. Driven by a dynamic, difficulty-aware router, Skill0.5 streams tasks into distinct mastery tiers to apply tailored optimization strategies: it internalizes general skills via privileged distillation...

#33

Is Position Bias in Dense Retrievers Built In or Learned from Data?

Research 2026-05-26 AK (@_akhaliq) Daily Papers / Hugging Face Daily Papers 6.1 6.2/6.0/6.0

Dense retrievers exhibit positional bias, favoring documents whose query-relevant information appears near the beginning and degrading retrieval performance when the information appears later. While prior work on positional bias in dense retrievers has largely focused on architectural explanations, we study how the positional distribution of evidence in training data affects retrieval-level bias direction. To test this, we construct synthetic position-targeted training sets in which query-relevant evidence appears at the beginning, middle, or end of documents, and fine-tune eight architecturally diverse pretrained models under position-skewed and balanced training distributions. At the ranking level, we observe a strong directional pattern across the examined models: skewed training distributions favor evidence at the corresponding positions. Position-balanced training reduces positional sensitivity by 57-87% on position-aware...

#34

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Agents & Tool Use 2026-05-27 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.1 6.0/5.2/7.0

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to be executed concurrently, and overall efficiency depends on whether an agent can use idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy,...
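
To make the idea of using idle time concrete, here is a minimal sketch with Python's `asyncio`. The tool names and latencies are made up, and `call_tool` only simulates a delayed response; nothing here reflects the AsyncTool benchmark's actual interface.

```python
import asyncio
import time


async def call_tool(name: str, latency: float) -> str:
    """Simulate a tool whose response arrives after `latency` seconds."""
    await asyncio.sleep(latency)
    return f"{name}:done"


async def run_sequential(tools):
    """Wait for each tool response before issuing the next call."""
    return [await call_tool(name, latency) for name, latency in tools]


async def run_concurrent(tools):
    """Issue all calls up front and overlap the waiting time."""
    return list(await asyncio.gather(*(call_tool(n, l) for n, l in tools)))


def timed(coro_fn, tools):
    """Run one strategy and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(tools))
    return results, time.perf_counter() - start
```

With three tools of equal latency, the sequential strategy waits roughly three times as long as the concurrent one while returning the same results in the same order.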

#35

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

Agents & Tool Use 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.1 6.0/5.2/7.0

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled; in the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more...

#36

PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

Evaluations & Benchmarks 2026-05-27 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.1 5.3/6.0/7.1

The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in LLM-based automated peer reviewers. However, how good these systems actually are, especially compared to human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment, Flaw Identification & Major Issues Prioritization, and Multi-dimensional Constructiveness. Unlike most existing evaluations based on surface-level metrics like ROUGE and BLEU, or unconstrained LLM-as-a-judge prompting that conflates fluency with rigor, PRISM grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer...

#37

RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

Reinforcement Learning 2026-05-27 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.1 6.2/6.0/6.0

Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
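
The tie problem with hard Boolean aggregation can be shown with a toy comparison. The abstract does not give the paper's scoring rule, so `probability_score` below is a generic soft aggregation chosen for illustration, and the per-criterion probabilities are invented.

```python
def boolean_score(criterion_probs, threshold=0.5):
    """Hard aggregation: count the rubric criteria judged as satisfied."""
    return sum(p >= threshold for p in criterion_probs)


def probability_score(criterion_probs):
    """Soft aggregation: sum the judge's probability that each criterion is met."""
    return sum(criterion_probs)


# Hypothetical judge probabilities for two responses on a three-item rubric.
response_a = [0.9, 0.8, 0.6]
response_b = [0.7, 0.6, 0.55]
```

Both responses pass all three criteria under the Boolean rule and therefore tie, while the soft score still ranks `response_a` above `response_b`.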

#38

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Reinforcement Learning 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.1 6.2/6.0/6.0

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across...

#39

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Evaluations & Benchmarks 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.1 5.3/6.0/7.1

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's...

#40

Thinking Before Constraining: A Unified Decoding Framework for Large Language Models

Research 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.1 6.2/6.0/6.0

Natural generation allows Large Language Models (LLMs) to produce free-form responses with rich reasoning, yet the lack of structure makes outputs difficult to verify. Conversely, constrained decoding ensures standardized formats but can inadvertently restrict reasoning capabilities by imposing constraints too early in the generation process. We propose a hybrid approach, namely In-Writing, that combines free-form reasoning and structured generation in a single call. The model first performs unconstrained reasoning and only applies structured decoding after a trigger token is generated, explicitly decoupling reasoning from formatting. We establish that our trigger-token strategies are able to virtually eradicate premature triggering, a failure mode in which constrained decoding interrupts ongoing reasoning. Evaluations across diverse datasets covering classification and reasoning tasks demonstrate that our...
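
The trigger-gated control flow described above might look roughly like the toy loop below. The `<answer>` trigger, the `{"yes", "no"}` label set, and the scripted candidate lists standing in for a model are all hypothetical; real constrained decoding masks logits under a grammar rather than filtering a ranked list.

```python
TRIGGER = "<answer>"     # hypothetical trigger token
ALLOWED = {"yes", "no"}  # hypothetical structured label set


def decode(step_candidates, trigger=TRIGGER, allowed=ALLOWED):
    """Decode freely until the trigger appears, then constrain the next token.

    step_candidates: one list per step, holding candidate tokens ranked best-first.
    """
    output, constrained = [], False
    for ranked in step_candidates:
        if constrained:
            # Structured phase: take the best-ranked token the format allows.
            token = next((t for t in ranked if t in allowed), None)
            if token is None:
                raise ValueError("no allowed token among candidates")
            output.append(token)
            break  # this toy format is a single label
        token = ranked[0]  # free-form phase: no constraint applied
        output.append(token)
        if token == trigger:
            constrained = True
    return output
```

Because the constraint only switches on after the trigger, label tokens that appear as lower-ranked candidates during reasoning are never forced into the output early.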

#41

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Agents & Tool Use 2026-05-26 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.1 6.0/5.2/7.0

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58%...

#42

Reflective Prompt Tuning through Language Model Function-Calling

Research 2026-05-20 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.1 6.2/6.0/6.0

Large language models (LLMs) have become increasingly capable of following instructions and complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, motivating automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. However, existing methods often search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, limiting their ability to capture systematic error patterns and make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. An LLM optimizer calls a diagnostic function that evaluates the...

#43

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Post-Training 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.1 6.2/6.0/6.0

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a...

#44

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Multimodal 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.1 6.2/6.0/6.0

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency...

#45

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Reinforcement Learning 2026-05-26 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.1 6.1/6.4/5.9

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this...

#46

How Braintrust turns customer requests into code with Codex

AI Coding 2026-05-29 OpenAI Research 6.1 6.3/5.5/6.5

How Braintrust engineers use Codex with GPT-5.5 to run experiments and code faster.

#47

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Research 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.0 6.2/6.0/5.9

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting,...
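
A tiny structural causal model helps show why prediction and mechanism recovery can come apart. The three-variable graph (Z -> X, Z -> Y, X -> Y), the coefficients, and the noise-free equations below are assumptions picked for clarity; CausaLab's sampled SCMs are not specified in the abstract.

```python
def simulate(z, do_x=None):
    """Noise-free toy SCM with graph Z -> X, Z -> Y, X -> Y.

    Passing do_x overrides X's structural equation, i.e. an intervention do(X = do_x).
    """
    x = 2.0 * z if do_x is None else do_x
    y = 1.5 * x + 3.0 * z
    return x, y


def slope(p1, p2):
    """Slope of Y against X between two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)


# Observational data: X and Y both move with the confounder Z.
observational = slope(simulate(1.0), simulate(2.0))
# Interventional data: Z held fixed while X is set by hand.
interventional = slope(simulate(1.0, do_x=0.0), simulate(1.0, do_x=1.0))
```

In this toy, the observational slope mixes the direct effect of X with the confounded path through Z, while the interventional slope equals the structural coefficient on X. A purely observational fit can therefore predict well in-distribution and still misstate the mechanism.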

#48

LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

Agents & Tool Use 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.0 6.0/5.2/6.9

Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain diversity, environment controllability, and the targeting of specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal training environments directly from domain specifications. Using this framework, we construct two large-scale resources: LiteCoder-Terminal-SFT, comprising 11,255 expert trajectories across 10 domains, and LiteCoder-Terminal-RL, featuring 602 verifiable environments for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on our SFT dataset yields agents that significantly outperform their base counterparts. Notably, our 32B variant achieves 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0,...

#49

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Agents & Tool Use 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.0 6.0/5.2/6.9

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency...

#50

UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

Agents & Tool Use 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.0 6.0/5.2/6.9

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are more attractive for practical use, offering lower inference cost and better protection of sensitive on-device information. However, due to limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph, where nodes represent distinct UI states and edges represent...

#51

PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions

Research 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.0 6.2/6.0/5.9

We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and target object represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through actions, such as punching or kicking, in accordance with a given input text. To this end, we introduce PhyGenHOI, a novel framework that couples generative human motion with an explicit physical object simulation. We model the human as a semantic agent driven by a Motion Diffusion Model (MDM) and the object as a physical agent simulated via the Material Point Method (MPM), utilizing 3D Gaussians as a unified, differentiable representation. We supervise their interaction through...

#52

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

Robotic Autonomy 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.0 6.0/6.0/5.9

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space, with a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with...
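
One common way to measure the volume spanned by three embeddings is the square root of their Gram determinant, sketched below in plain Python. This is an assumed stand-in: the abstract does not define DynaFLIP's simplex volume, and the quantity here (a parallelotope volume) would differ from a simplex with one vertex at the origin by a constant factor of 1/6.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_volume(a, b, c):
    """Volume spanned by three unit-normalized embeddings: sqrt(det(Gram matrix))."""
    vecs = [normalize(v) for v in (a, b, c)]
    g = [[dot(u, v) for v in vecs] for u in vecs]
    det = (
        g[0][0] * (g[1][1] * g[2][2] - g[1][2] * g[2][1])
        - g[0][1] * (g[1][0] * g[2][2] - g[1][2] * g[2][0])
        + g[0][2] * (g[1][0] * g[2][1] - g[1][1] * g[2][0])
    )
    return math.sqrt(max(det, 0.0))  # clamp tiny negative rounding error
```

Mutually orthogonal embeddings give the largest volume and identical embeddings give zero. The second case also hints at the collapse risk the abstract mentions, since mapping every input to one point would drive the volume to zero without learning anything.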

#53

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Research 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.0 6.2/6.0/5.9

Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests that a larger model will be able to learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling on a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high-frequency or low-complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even...

#54

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Agents & Tool Use 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.0 6.0/5.2/6.9

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session...

#55

NeuROK: Generative 4D Neural Object Kinematics

Generative Media 2026-05-28 AK (@_akhaliq) Daily Papers Hugging Face Daily Papers 6.0 5.5/5.5/6.9

Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics (realistic temporal deformations of static objects under various physical conditions) remains challenging and often ad hoc, despite its importance in building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space representing all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the...