
Wolf Digest — Thursday, May 28, 2026

Coverage window: 2026-05-27 03:02 ET to 2026-05-28 03:03 ET
#1 · Agents & Tool Use
AXPO: Closing the Thinking-Acting Gap in Multimodal Agentic Reasoning
7.1 · 5 srcs
#2 · Agents & Tool Use
MemTrace: Tracing and Attributing Errors in LLM Memory Systems
6.9 · 5 srcs
#3 · Reinforcement Learning
Bidirectional Evolutionary Search Breaks the Best-of-N Wall
6.7 · 5 srcs
#1
Agents & Tool Use 2026-05-27 arXiv cs.CL · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers · arXiv — Agents / Tool Use · arXiv — Evals & Benchmarks 7.1 7.1/7.1/7.1

Vision-language models with extended reasoning succeed on internal problems but fail when external tools are needed: under standard GRPO, the policy attempts tool calls on only ~30% of rollouts, and ~40% of those tool-using rollouts are all-wrong within their group, gutting the learning signal exactly where it's needed. The authors call this the Thinking-Acting Gap and propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples just the tool call and its continuation, with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO by +1.8pp Pass@1 and +1.8pp Pass@4 at 8B; the SFT+AXPO 8B model matches Qwen3-VL-Thinking-32B Base on Pass@4 with 4× fewer parameters. The asymmetry insight matters because it generalises to any RL recipe where some action branches are high-variance auxiliaries to a default behavior: the standard group-normalized estimator silently down-weights exactly those branches.
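
To make the "gutted learning signal" concrete, here is a minimal sketch of a group-normalized advantage of the kind GRPO uses, assuming binary rewards and the common (r - mean) / (std + eps) normalization. The function name, the eps value and the toy reward groups are illustrative assumptions, not taken from the AXPO paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every rollout in the group fails: all advantages are zero,
# so this group contributes no policy gradient at all.
all_wrong = group_advantages([0, 0, 0, 0])

# Hypothetical mixed group: the first two rollouts used a tool and failed,
# the last two answered without tools and succeeded. The tool-using
# rollouts only ever receive negative advantages.
tool_fails = group_advantages([0, 0, 1, 1])
```

In the second group nothing tool-related is ever reinforced, which is one way to read the digest's point that high-variance auxiliary branches get silently down-weighted.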

How it was discussed
  • arXiv abstract and HF Daily framing emphasize the surfaced diagnostic — 30% tool-attempt rate and 40% all-wrong subgroups — as the actual measurable cause of the gap.
  • _akhaliq's thread on X highlighted the 8B-with-AXPO-matching-32B-base Pass@4 result as the headline number.
agents rl post-training
#2
Agents & Tool Use 2026-05-27 arXiv cs.CL · arXiv cs.LG · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers · arXiv — Evals & Benchmarks 6.9 6.9/7.0/6.7

Long-horizon LLM agents leak, corrupt, and lose information through their memory systems in ways that are essentially undiagnosable today. MemTrace turns the memory pipeline into an executable memory evolution graph — every write, retrieve, summarize, and prune becomes a typed node whose inputs and outputs can be replayed — and pairs it with MemTraceBench, a benchmark spanning Long-Context, RAG, Mem0, and EverMemOS that exposes the systems' common failure modes. An automatic attribution method then iteratively walks operation subgraphs to identify the root cause of any failed case. Two findings stand out. Failures are not random: they cluster around operation-level issues like information loss during summarization and retrieval misalignment after schema drift. And the attribution signal is actionable — feeding it back into prompt optimization closes the loop and boosts end-task performance by up to 7.62%. For practitioners building multi-step agent memory, this is the first principled debugger.
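
As a rough illustration of what a replayable graph of typed memory operations could look like, the sketch below records each operation as a node and walks upstream from a failed retrieval to the first node that dropped a needed fact. The `MemOp` schema, the `first_loss` helper and the substring check are all hypothetical stand-ins; MemTrace's actual data model and attribution method are not described in enough detail here to reproduce.

```python
from dataclasses import dataclass, field

@dataclass
class MemOp:
    """One memory operation recorded as a typed node (hypothetical schema)."""
    op_id: str
    kind: str                                    # "write", "retrieve", "summarize" or "prune"
    inputs: list = field(default_factory=list)   # op_ids this node reads from
    output: str = ""                             # text the operation produced

def first_loss(ops, final_id, fact):
    """Walk upstream from final_id and return the first op found whose
    inputs contained `fact` but whose own output no longer does."""
    by_id = {op.op_id: op for op in ops}
    stack, seen = [final_id], set()
    while stack:
        op = by_id[stack.pop()]
        if op.op_id in seen:
            continue
        seen.add(op.op_id)
        parents = [by_id[i] for i in op.inputs]
        if fact not in op.output and any(fact in p.output for p in parents):
            return op
        stack.extend(op.inputs)
    return None

trace = [
    MemOp("w1", "write", [], "User's flight departs 09:40 from gate B12."),
    MemOp("s1", "summarize", ["w1"], "User has a morning flight."),
    MemOp("r1", "retrieve", ["s1"], "User has a morning flight."),
]
culprit = first_loss(trace, "r1", "B12")
```

Here the gate number survives the write but not the summary, so the walk lands on the summarize node, matching the "information loss during summarization" failure cluster mentioned above.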

How it was discussed
  • arXiv frames this as the first general-purpose framework for memory debugging; HF Daily highlights the +7.62% closed-loop gain.
  • Code at github.com/zjunlp/MemTrace promised on release.
agents evals
#3
Reinforcement Learning 2026-05-27 arXiv cs.CL · arXiv — Agents / Tool Use · arXiv — Evals & Benchmarks · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.7 6.7/6.7/6.7

Best-of-N and standard tree search are limited by sparse verification signals and by autoregressive expansion confining candidates to a narrow probability shell. BES (Bidirectional Evolutionary Search) attacks both: forward search augments expansion with evolution operators that recombine partial trajectories, generating candidates a single rollout cannot produce; backward search recursively decomposes the task into checkable subgoals to provide dense intermediate feedback. The theoretical motivation shows that expansion-only candidates are confined to a narrow entropy shell while evolutionary operators escape it, and that backward decomposition can exponentially reduce the number of samples needed to find a correct answer. Empirically, BES delivers gains on challenging tasks where mainstream post-training algorithms fail to improve, and it outperforms existing open-source inference frameworks on three open problem-solving benchmarks in both average and best-case performance. Code is released at Embodied-Minds-Lab/BES.
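
A back-of-the-envelope model shows where an exponential saving from decomposition can come from. It assumes a task made of k independent subgoals, each solved with probability q per attempt, and a checker for every subgoal. This is an illustrative toy model with made-up function names, not the paper's theorem.

```python
def expected_samples_end_to_end(q, k):
    """Expected full-trajectory samples when all k subgoals must
    succeed within a single sample (success probability q ** k)."""
    return 1.0 / (q ** k)

def expected_samples_stagewise(q, k):
    """Expected subgoal attempts when each subgoal is checked and
    retried on its own (1 / q attempts per subgoal)."""
    return k / q

q, k = 0.5, 10
direct = expected_samples_end_to_end(q, k)   # 1 / 0.5**10 = 1024.0
staged = expected_samples_stagewise(q, k)    # 10 / 0.5 = 20.0
```

The two counts are in different units (whole trajectories versus single subgoal attempts), but the growth rates are the point: exponential in k without intermediate checks, linear in k with them.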

rl post-training agents
#4
Multimodal 2026-05-26 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.7 6.5/6.7/6.9

Token-by-token coordinate generation is the practical bottleneck for VLM grounding: each 2D box is serialized into multiple 1D tokens decoded sequentially, mismatching the coupled geometry. LocateAnything introduces Parallel Box Decoding (PBD), treating bounding boxes and points as atomic units decoded in a single step. The geometric coherence is preserved, throughput rises sharply, and high-IoU localization improves across benchmarks. A scalable data engine curates LocateAnything-Data with more than 138 million training samples, providing the diversity to push the speed-accuracy frontier on grounding and detection simultaneously.

multimodal evals
#5
Agents & Tool Use 2026-05-27 arXiv cs.CL · arXiv cs.LG · arXiv — Agents / Tool Use · arXiv — Evals & Benchmarks · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.7 6.6/6.7/6.8

Most agent memory systems treat memory as a static repository with fixed retrieval pipelines, which is brittle when task variation and heterogeneous feedback continuously reshape what should be remembered. FluxMem models memory as a heterogeneous graph and progressively refines its topology through initial connection formation, feedback-driven refinement, and long-term consolidation: it repairs missing links, prunes interference, aligns abstraction granularity, and distills recurrent successful trajectories into reusable procedural circuits. Across LoCoMo, Mind2Web, and GAIA it achieves consistent state-of-the-art results, showing strong adaptation in complex agentic environments.

agents
#6
Reinforcement Learning 2026-05-27 arXiv cs.CL · arXiv cs.LG · arXiv — Agents / Tool Use · arXiv — Reinforcement Learning · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.6 6.6/6.6/6.5

Two clean findings about training multimodal verifiers: symbolic verifier outputs like bounding boxes outperform textual rationales as meta-verification signals and unlock rule-based RL rewards without an auxiliary judge model; and decoupling the RL objectives for binary judgment and meta-verification substantially beats joint reward optimization because the two have intrinsically different output structures. OmniVerifier-M1 applies both and adds M1-TTS, a verifier-driven agentic generation system that performs dynamic region-level self-correction during sampling.
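
One way a bounding-box output can yield a rule-based reward with no judge model is an intersection-over-union check against a reference box. The sketch below is a generic IoU reward under that assumption; the 0.5 threshold and the function names are illustrative and are not OmniVerifier-M1's actual reward definition.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def box_reward(pred_box, ref_box, threshold=0.5):
    """Binary rule-based reward: 1.0 when the overlap clears the threshold."""
    return 1.0 if iou(pred_box, ref_box) >= threshold else 0.0

exact = box_reward((0, 0, 2, 2), (0, 0, 2, 2))     # identical boxes
shifted = box_reward((0, 0, 2, 2), (1, 1, 3, 3))   # IoU is 1/7, below threshold
```

A textual rationale has no comparably cheap check, which is consistent with the finding that symbolic outputs make rule-based rewards possible.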

rl multimodal evals
#7
Agents & Tool Use 2026-05-27 arXiv cs.CL · arXiv cs.LG · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.6 6.7/6.6/6.4

Specializing small computer-use agents to a software domain with naive synthetic data barely moves the needle. LearnWeak uses a stronger reference agent to identify the student's specific weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically, then trains with an error-aware objective that disentangles planning and execution errors. On OSWorld, LearnWeak gains 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. The framework formalizes student-awareness in data synthesis: targeting weaknesses with behaviorally precise updates outperforms broad uniform supervision and existing autonomous trajectory generation baselines.

agents ai_coding
#8
Post-Training 2026-05-27 arXiv cs.CL · arXiv cs.LG · arXiv — Evals & Benchmarks · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.5 6.5/6.5/6.5

PEFT evaluations emphasize downstream accuracy and overlook retention of pretrained capabilities. PEFT-Arena scores both. Under matched parameter budgets, orthogonal finetuning achieves the most favorable Pareto frontier. Spectral analysis explains why: orthogonal updates interact gently with the pretrained singular-value structure, while LoRA-style updates can produce non-isometric activation-space distortions tied to capability forgetting. The benchmark also finds final SFT checkpoints frequently overshoot a better target-retention operating point — a post-hoc path-wise rewinding case study restores capability without losing target accuracy.
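
The spectral contrast can be seen on a toy 2×2 weight: multiplying by an orthogonal matrix leaves the singular values untouched, while adding a low-rank product generally moves them. The sketch below uses plain Python; the matrices and helper names are made up for illustration, and it shows only this basic linear-algebra fact rather than the benchmark's full analysis.

```python
import math

def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def singular_values2(m):
    """Singular values of a 2x2 matrix via its squared Frobenius norm and determinant."""
    f = sum(x * x for row in m for x in row)
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(max(f * f - 4 * det * det, 0.0))
    return math.sqrt((f + disc) / 2), math.sqrt(max((f - disc) / 2, 0.0))

W = [[3.0, 0.0], [0.0, 1.0]]                 # stand-in pretrained weight
t = 0.7                                      # arbitrary rotation angle
R = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
W_orth = matmul2(R, W)                       # orthogonal-style update R @ W

B, A = [1.0, 0.0], [2.0, 0.0]                # rank-1 factors, LoRA-style
W_lora = [[W[i][j] + B[i] * A[j] for j in range(2)] for i in range(2)]

s_base = singular_values2(W)                 # (3, 1)
s_orth = singular_values2(W_orth)            # still (3, 1) up to rounding
s_lora = singular_values2(W_lora)            # (5, 1): top singular value moved
```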

post_training evals
#9
Evaluations & Benchmarks 2026-05-25 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.5 6.7/6.5/6.3

Video benchmarks check 'whether it is right' (prompt following) and ignore 'whether it is good' (cinematic quality and aesthetics). EvalVerse digitizes subjective cinematic expertise: it organizes domain knowledge along the pre-production, production, and post-production workflow; curates large-scale expert annotations; and fine-tunes VLMs with expert-calibrated Chain-of-Thought reasoning to score generated videos. It is compatible with foundational rightness metrics and extends to multi-shot sequencing and audio-visual integration, providing granular diagnostic signals that double as reward signals for downstream RL.

evals generative_media
#10
Evaluations & Benchmarks 2026-05-26 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.5 6.6/6.5/6.4

SpatialBench evaluates 41 spatial foundation models across 6 paradigms on 19 datasets and 546 scenes spanning 5 spatial domains, with deterministic sampling. Two clear results: full-context attention maximizes accuracy while bounded-memory strategies unlock long-sequence scalability; and in challenging embodied and egocentric tasks, strict domain alignment and high data quality dominate raw dataset scale. The authors also release DA-Next-5M and the DA-Next model, addressing the largest data gap surfaced by the evaluation.

evals robotic_autonomy
#11
Agents & Tool Use 2026-05-25 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.5 6.6/6.4/6.5

MobileGym hosts mobile app environments in the browser with full state-as-JSON capture, ~400MB per instance, and ~3s cold start, enabling deterministic state-based verification and hundreds of parallel rollouts on a single server. Its 416-task MobileGym-Bench across 28 apps uses deterministic judges and a structured AnswerSheet protocol that sidesteps free-text matching failures. A Sim-to-Real case study: GRPO on Qwen3-VL-4B-Instruct gains +12.8pp on the 256-task test set, and real-device execution retains 95.1% of the simulation-side training gain.
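
Deterministic state-based verification can be pictured as comparing a captured JSON snapshot against expected field values instead of matching free text. The sketch below assumes a dotted-path lookup and a made-up alarm-app state; the `judge` function and the schema are hypothetical and do not reflect MobileGym's real judge or AnswerSheet format.

```python
import json

def judge(state_json, expected):
    """Pass only if every expected dotted path holds the expected value."""
    state = json.loads(state_json)
    for path, want in expected.items():
        node = state
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                return False
            node = node[key]
        if node != want:
            return False
    return True

# Hypothetical snapshot of an alarm app after the agent has acted.
snapshot = '{"alarm": {"enabled": true, "time": "07:30"}}'
passed = judge(snapshot, {"alarm.enabled": True, "alarm.time": "07:30"})
failed = judge(snapshot, {"alarm.time": "08:00"})
```

Because the check reads state rather than the agent's transcript, the same snapshot always produces the same verdict.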

agents evals
#12
AI for Science 2026-05-27 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.5 6.7/6.5/6.3

ResearchMath-14k, the largest collection of research-level math problems to date, curates 14,056 problems from academic sources through a multi-agent pipeline and is released alongside ResearchMath-Reasoning, a set of 220K teacher trajectories. The authors document a startling generational trend: newer open-weight models produce 5.6× more references and 5.0× more fake references per trace. After agentic filtering of teacher traces, fine-tuning Qwen3 models from 4B to 30B improves over base by 9.2 points on average, showing that filtered open-problem attempts can supervise even without fully correct reasoning.

ai_science evals post_training
#13
AI for Science 2026-05-27 Latent Space (swyx & Alessio) · Latent Space Podcast 6.5 7.5/6.5/5.5

Alex Rives (now Head of Science at BioHub after CZI's EvoScale acquisition) announced ESMFold2 with state-of-the-art performance on protein interactions — especially antibodies, a critical therapeutics modality — and evidence that inference-time scaling works across five cancer and immunology targets. Built on Cryo-EM data discussed in the BioHub pod, ESMFold2 is being released as an open scientific engine for prediction, design, and discovery across protein biology. The framing on Latent Space is explicit: the Bitter Lesson is coming for proteins, with the same compute-scales-predictably story that took ESM2 and ESM3 from masked-language-modeling on protein sequences to learning biological structure and function without explicit supervision.

ai_science research
#14
Reinforcement Learning 2026-05-27 arXiv cs.LG · arXiv cs.CL · arXiv — Post-training / Alignment · arXiv — Reinforcement Learning · arXiv — Agents / Tool Use 6.4 6.4/6.4/6.4

Training competitive-programming RL checkpoints under nested unit-test coverage (low coverage tolerates only smaller-input passes; high coverage requires the full suite) reveals a correctness-efficiency Pareto frontier: high-coverage rewards reduce optimization failures but increase correctness failures, leaving solve rate roughly unchanged on hard problems. Interpolation between low- and high-coverage checkpoints recovers this frontier; extrapolation extends it. Across pure-reasoning, tool-use, and agentic-coding inference settings (32B and 7B), ensembles with extrapolative weight averaging improve pass@250 on LCB/hard by +3.3% over the best single checkpoint at matched budget. The extrapolated checkpoints solve a different set of problems, which makes them complementary in test-time scaling.
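
Interpolation and extrapolation between two checkpoints are usually written as a single linear blend of their weights. The sketch below assumes that standard form, θ = θ_low + α(θ_high − θ_low), with checkpoints stored as dictionaries of flat parameter lists; the function name and the toy values are illustrative, and the paper's exact parameterization may differ.

```python
def blend_checkpoints(theta_low, theta_high, alpha):
    """Return theta_low + alpha * (theta_high - theta_low) per parameter.

    0 <= alpha <= 1 interpolates between the two checkpoints;
    alpha < 0 or alpha > 1 extrapolates beyond them.
    """
    return {
        name: [lo + alpha * (hi - lo)
               for lo, hi in zip(theta_low[name], theta_high[name])]
        for name in theta_low
    }

low = {"w": [1.0, 2.0], "b": [0.0]}    # toy low-coverage checkpoint
high = {"w": [3.0, 2.0], "b": [1.0]}   # toy high-coverage checkpoint
mid = blend_checkpoints(low, high, 0.5)      # {"w": [2.0, 2.0], "b": [0.5]}
beyond = blend_checkpoints(low, high, 1.5)   # {"w": [4.0, 2.0], "b": [1.5]}
```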

rl ai_coding post_training
#15
Reinforcement Learning 2026-05-27 arXiv cs.CL · arXiv — Agents / Tool Use · arXiv — Efficiency (Quantization, MoE, Inference) · arXiv — Mechanistic Interpretability · arXiv — Reinforcement Learning 6.4 6.4/6.4/6.4

Existing skill-based RL forces a binary: fully externalize skills (prohibitive context overhead) or fully internalize them (overfitting and knowledge conflicts). Skill0.5 routes by difficulty: it internalizes general skills via privileged distillation on hard tasks, and uses diagnostic probing on easy tasks to penalize shortcuts and enforce specific-skill utilization. On ALFWorld and WebShop it beats both memory-based and skill-based RL baselines in both in-distribution and out-of-distribution settings.

rl agents
#16
Interpretability 2026-05-27 arXiv cs.CL · arXiv cs.LG · arXiv — Evals & Benchmarks · arXiv — Mechanistic Interpretability 6.4 6.4/6.4/6.3

A clean negative-then-positive result on Gemma-3-4B-IT: projecting task vectors onto SAE feature subspaces — the intuitive 'surgical edit' approach — acts as an information bottleneck that discards approximately 97% of the modification energy, yielding no statistically significant gains across seven math subjects. The geometric reason is misalignment between activation-space SAE directions and weight-space task vectors. Reframing SAEs as stethoscopes (used for layer-level diagnosis rather than projection-level filtering), the authors inject unfiltered raw task vectors only into SAE-identified layers and improve Number Theory accuracy from 29.6% to 39.4% (z=+3.41, p=0.0007) on Minerva Math, with 5 of 7 math subjects significantly improved and none significantly degraded.
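
The "information bottleneck" claim is about how much of a vector's squared norm survives projection onto a subspace. A minimal sketch, assuming an orthonormal basis for the feature subspace and a made-up 4-dimensional task vector (neither comes from the paper):

```python
def retained_energy(vector, basis):
    """Fraction of squared norm kept after projecting `vector`
    onto the span of an orthonormal `basis`."""
    total = sum(x * x for x in vector)
    kept = sum(sum(b_i * v_i for b_i, v_i in zip(b, vector)) ** 2
               for b in basis)
    return kept / total

task_vector = [1.0, 1.0, 1.0, 1.0]          # toy stand-in for a task vector
feature_basis = [[1.0, 0.0, 0.0, 0.0]]      # one orthonormal direction
frac = retained_energy(task_vector, feature_basis)   # 0.25
```

In these terms, discarding approximately 97% of the modification energy corresponds to a retained fraction of roughly 0.03, which is what misaligned subspaces tend to produce.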

interpretability post_training
#17
Post-Training 2026-05-27 arXiv cs.CL · arXiv — Post-training / Alignment · arXiv — Reinforcement Learning 6.4 6.4/6.4/6.3

Standard DPO has an asymmetric gradient: it suppresses dispreferred responses faster than it promotes preferred ones, so models learn to avoid bad answers more than to generate good ones. AdaDPO introduces per-pair, stop-gradient-based coefficients derived from policy generation probabilities (with reference probabilities optional) that enforce equality of gradient magnitudes between preferred and dispreferred. On Llama-3-8B-Instruct trained on UltraFeedback under a SimPO-like setup, it outperforms DPO on AlpacaEval 2 in 81% of hyperparameter combinations, hits global-best LC win rate 48.3% and raw WR 46.1%, and enlarges the LC-over-WR margin in 88% of combinations — direct evidence of length-bias mitigation. Loss-level fix; drops into existing pipelines without architectural change.
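
The asymmetry can be checked numerically for the standard DPO loss, −log σ(u) with u = β[log(p_w/ref_w) − log(p_l/ref_l)], by differentiating with respect to the two response probabilities. The sketch treats each response probability as a single scalar, which is a simplification, and it does not reproduce AdaDPO's coefficients, which are not given here.

```python
import math

def dpo_grad_magnitudes(p_w, p_l, ref_w, ref_l, beta=0.1):
    """|dL/dp_w| and |dL/dp_l| for the standard DPO loss on one pair."""
    u = beta * (math.log(p_w / ref_w) - math.log(p_l / ref_l))
    scale = beta / (1.0 + math.exp(u))   # beta * sigmoid(-u)
    return scale / p_w, scale / p_l

# Toy probabilities: the dispreferred response is already less likely.
g_w, g_l = dpo_grad_magnitudes(p_w=0.02, p_l=0.005, ref_w=0.02, ref_l=0.005)
ratio = g_l / g_w   # equals p_w / p_l, about 4 here
```

The ratio of the two magnitudes is p_w / p_l, so whenever the dispreferred response is the less likely one, it is pushed down harder than the preferred one is pulled up.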

post_training rl
#18
Multimodal 2026-05-26 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.4 6.4/6.4/6.4

A native multimodal embedding model built on Gemini that produces a single representation space for arbitrary interleaved video, audio, image, and text inputs. Trained with large-scale contrastive learning in a multi-task multi-stage setup, it sets SOTA on key benchmarks (62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual, 84.0 on MTEB Code), with strong zero-shot performance across specialized domains from astronomy to bioscience. Practical impact: RAG, recommendation, and search pipelines no longer need to maintain separate embedding indices per modality.

multimodal frontier_llm
#19
Industry 2026-05-27 The Information — AI · TechCrunch — AI 6.4 7.0/6.5/5.8

Cognition (the company behind Devin) raised more than $1 billion at a $26 billion post-money valuation, more than double its prior $10B valuation from about a year ago. The funding round positions Cognition among the highest-valued coding-agent startups even as the field has compressed pricing aggressively: Cursor, Claude Code, and the Cognition stack now all sell against the same enterprise IT budget. The Information notes the valuation jump came as enterprises shifted from per-seat AI subscriptions to API-priced consumption, the same dynamic Simon Willison documented this week.

industry ai_coding
#20
State Space Models 2026-05-27 arXiv cs.LG · arXiv — Recurrent / Linear Attention · arXiv — State Space Models · arXiv — Evals & Benchmarks 6.3 6.4/6.3/6.2

Most hybrid attention/recurrent designs statically interleave or merge blocks. Oryx flexibly switches between mixers along the token sequence: quadratic attention where rich context utilization is needed, linear recurrences elsewhere for efficient generation. Crucially, at least 90% of parameters are tied across mixers, so attention and recurrent modes operate over shared internal representations. Validated with Mamba-2 and Gated DeltaNet variants up to 1.4B; all 1.4B Oryx instances outperform their single-mixer baselines by ≥0.7pp on averaged language modeling, and on retrieval the Oryx model matches Transformer performance while processing fewer than 10% of tokens in attention mode.

recurrent ssm efficiency
#21
Interpretability 2026-05-27 arXiv cs.CL · arXiv — Mechanistic Interpretability · arXiv — Evals & Benchmarks 6.3 6.4/6.3/6.2

Across eight models (four architectures, base and instruct), 2–3 mid-layer attention heads contribute causally to associating cultural items with identities. Knockout of identity-to-item edges on these heads lowers binding strength by 9–23%. The heads transfer from instruct back to base models, suggesting cultural binding emerges at pre-training. An α-scaling intervention at generation (α=2–3) increases cultural differentiation accuracy by 1–3pp while leaving neutral reasoning largely intact. A probing task shows models know 3–5× more about cultural identity than they act upon — the bottleneck is routing, not knowledge.

interpretability safety_policy
#22
Frontier LLMs 2026-05-26 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.3 6.3/6.3/6.3

A family of Mixture-of-Experts models with 229.9B total parameters but only 9.8B active per token, designed end-to-end for agentic deployment. Three components: (i) agent-driven data pipelines producing verifiable trajectories grounded in executable workspaces with artifact-aligned rewards; (ii) Forge, a scalable agent-native RL system with windowed-FIFO scheduling, prefix-tree merging, and clean training-inference-agent decoupling supporting both white-box and black-box agents; (iii) the M2.7 checkpoint, which takes an early step toward self-evolution by autonomously debugging training runs and modifying its own scaffold. The series targets agentic coding, deep search, office-task, and reasoning benchmarks, aiming for frontier-tier performance from a small active-parameter footprint.

frontier_llm agents efficiency
#23
Generative Media 2026-05-26 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.3 6.3/6.3/6.3

Although x, ε, and v are linearly convertible at a fixed corruption time, they are not interchangeable as prediction targets in latent space. A local Gaussian analysis shows velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean-latent prediction damps them. JLT — a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes — confirms it on ImageNet 256×256: FID-50K 2.50 with classifier-free guidance, with a large matched-target gap over velocity prediction. Reframes prediction targets as representation-dependent geometric choices, not algebraic equivalents.
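
The linear convertibility is easy to verify under one common convention. The sketch assumes a linear-interpolation (rectified-flow-style) corruption x_t = (1 − t)·x + t·ε with velocity v = ε − x; that schedule is an assumption for illustration, since the digest does not state which one JLT uses.

```python
def targets_from_velocity(x_t, v, t):
    """Recover clean latent x and noise eps from a velocity, assuming
    x_t = (1 - t) * x + t * eps and v = eps - x."""
    x = [xt_i - t * v_i for xt_i, v_i in zip(x_t, v)]
    eps = [xt_i + (1.0 - t) * v_i for xt_i, v_i in zip(x_t, v)]
    return x, eps

# Toy two-dimensional latent and noise at corruption time t = 0.25.
x, eps, t = [2.0, -1.0], [0.5, 0.5], 0.25
x_t = [(1 - t) * a + t * b for a, b in zip(x, eps)]
v = [b - a for a, b in zip(x, eps)]
x_rec, eps_rec = targets_from_velocity(x_t, v, t)   # recovers x and eps
```

The round trip is exact algebraically, which is why the paper's point is notable: the targets are interchangeable on paper but behave differently as regression objectives.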

generative_media research
#24
Multimodal 2026-05-25 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.3 6.4/6.3/6.2

LLaVA-OV-2 advances long-video VLM perception via codec-stream tokenization: it treats compressed video as a continuous bit-cost stream where bit-cost dynamics define adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. Shared 3D RoPE places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Training: ~8M re-captioned videos for pretraining, 4M-sample spatial corpus for fine-tuning. On the new JumpScore benchmark targeting fine-grained grounding in densely repeated motion, LLaVA-OV-2-8B reaches 74.9 mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; matched-budget codec-stream input improves temporal grounding by +9.7 points over frame sampling.

multimodal evals
#25
Robotic Autonomy 2026-05-27 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.3 6.3/6.3/6.2

Standard text-guided VLM pretraining ignores the low-level spatial and physical knowledge embodied agents need. GEM integrates a depth-map generation task directly into VLM pre-training, jointly with the main objective. The accompanying GEM-4M dataset pairs grounding, reasoning, and planning data with high-quality depth supervision. GEM achieves state-of-the-art on diverse embodied benchmarks, and the deployed GEM-VLA model exhibits superior task execution in both simulation and real-world evaluations.

robotic_autonomy multimodal
#26
AI for Science 2026-05-25 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.3 6.4/6.3/6.2

Autonomous research agents produce polished papers riddled with verifiability failures invisible to surface evaluation. Three contributions: Chain-of-Evidence (CoE), a framework requiring every claim to trace to evidence; ScientistOne, an end-to-end system that maintains evidence chains by construction; and CoE Audit, a post-hoc audit with four integrity checks: score verification, specification violation, reference verification, and method-code alignment. Across 75 papers from five systems, every baseline failed: hallucinated reference rates up to 21%, score verification passing in as few as 42% of papers, method-code alignment ranging 20–80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), highest method-code alignment (14/15), matching or exceeding human expert performance on all five evaluated tasks.

ai_science agents evals
#27
Agents & Tool Use 2026-05-26 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.3 6.3/6.3/6.2

Most hallucination benchmarks evaluate only final outputs, missing failures that originate in intermediate Thought-Action-Observation steps. Trajel introduces a five-type hallucination taxonomy (factual, referential, logical, procedural, scope-based) over expert-annotated agent traces from AssetOpsBench. Key findings: the most common failure modes are missed by existing benchmarks; nearly half of hallucinated trajectories involve multiple types simultaneously; automated detectors with high binary accuracy still misclassify subtle types. Trajectory-aware detection significantly outperforms post-hoc verification.

agents evals
#28
Government & Defense 2026-05-27 DefenseScoop 6.3 6.5/6.5/5.9

SSC tapped SpaceX under a $2.29B Other Transaction Authority for the Space Data Network (SDN) Backbone — formerly known as MILNET — requiring a 'fully operational prototype' by end of 2027. The architecture is the data-transport substrate for Golden Dome missile defense and the broader Combined Joint All-Domain Command and Control program, with a planned constellation of 480+ Starshield satellites providing tactical communications and broadband SATCOM across the joint force. The contract consolidates SpaceX's role as effective sole provider for Pentagon-wide proliferated-LEO SATCOM.

gov_defense infra
#29
Industry 2026-05-27 Simon Willison's Weblog 6.3 6.5/6.5/5.8

Anthropic is strongly rumored to have its first profitable quarter coming, and Willison reads the rising enterprise LLM bills as a definitive product-market-fit signal: enterprise customers are now paying API prices, not subscription prices, and they're ramping up. The post connects Anthropic's purported profitability with reports from large companies (including the Uber CTO's comments on Claude Code blowing AI budgets) and OpenAI's preparation to file for IPO in coming weeks. April marks a new inflection point in his framing: the AI-failure stories are unusually thin while subscription-revenue importance is dropping in favor of consumption-priced enterprise deployments.

industry ai_coding
#30
Industry 2026-05-27 OpenAI Research 6.3 6.3/6.3/6.3

OpenAI and Cisco announced a partnership using Codex to scale Cisco's AI-native development, accelerate the AI Defense product line, and automate defect remediation across Cisco's engineering pipeline. The framing positions Codex as the production engine for enterprise codebases at scale — directly competitive with what Anthropic announced via the KPMG partnership earlier in May. Cisco is the latest in a string of enterprise design partners (PwC, KPMG, Snowflake) that have committed to material codegen rollouts in Q2.

industry ai_coding
#31
Agents & Tool Use 2026-05-27 arXiv cs.CL · arXiv — Agents / Tool Use · arXiv — Evals & Benchmarks 6.2 6.3/6.2/6.2

MUTATE is an interactive benchmark for evaluating agentic divergent thinking at two levels: path-level (multiple alternative paths to the same goal) and action-level (non-typical, mechanism-shifting object uses). Unlike success-only evaluations, it scores both completed paths and off-path attempts, exposing reasoning that conventional metrics discard. Frontier LLMs show immediate action fixation under convergence pressure. The proposed ReDNA separates unconstrained divergent candidate generation from convergent constraint selection and significantly outperforms prior methods at both divergence levels, generalizing to an external creativity environment.

agents evals
#32
Efficiency 2026-05-26 arXiv · Hugging Face Daily Papers · AK (@_akhaliq) Daily Papers 6.2 6.4/6.2/6.1

An on-device MoE scaling law identifies an on-device sweet spot — moderate sparsity with fine-grained and shared experts — that is simultaneously memory- and compute-optimal under mobile constraints. The MobileMoE family (0.3–0.9B active, 1.3–5.3B total) sets a new on-device Pareto frontier: across 14 benchmarks it matches or beats leading dense on-device LLMs with 2–4× fewer inference FLOPs, and matches or surpasses OLMoE-1B-7B with up to 60% fewer parameters. At comparable INT4 weight memory, MobileMoE-S delivers 1.8–3.8× faster prefill and 2.2–3.4× faster decode than MobileLLM-Pro on commodity smartphones — the first efficient MoE inference demonstrated at that scale.

efficiency frontier_llm
#33
Reinforcement Learning 2026-05-27 arXiv cs.LG · arXiv — Reinforcement Learning · arXiv — Evals & Benchmarks 6.2 6.2/6.2/6.2

Most RLVR data-selection pipelines need training-time signals or reward evaluation over large candidate pools — costly or infeasible in specialized domains. SHIFT runs a single deterministic reasoning rollout per candidate and computes a reasoning-induced representation shift (RIRS) as the start-to-end hidden-state delta, using the magnitude as a utility proxy and enforcing coverage via quality-weighted farthest-first CoreSet in RIRS feature space. Under ultra-low budgets across math reasoning and medical QA benchmarks, SHIFT consistently beats training-free diversity and difficulty/uncertainty baselines, improving both in-domain accuracy and transfer to harder evaluation settings. RIRS is not explained by simple length statistics.
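
The selection recipe described above can be sketched in a few lines: take the end-minus-start hidden-state delta per candidate, then pick candidates greedily so that each new pick is both large in magnitude and far from those already chosen. Multiplying the norm by the nearest-neighbor distance is an assumed weighting for illustration; the paper's exact quality-weighted rule is not given here, and the toy vectors are made up.

```python
import math

def rirs(h_start, h_end):
    """Reasoning-induced representation shift: end-minus-start hidden-state delta."""
    return [e - s for s, e in zip(h_start, h_end)]

def select(shifts, budget):
    """Greedy farthest-first selection, weighting distance by shift magnitude."""
    norms = [math.sqrt(sum(x * x for x in s)) for s in shifts]
    # Seed with the candidate whose shift has the largest magnitude.
    chosen = [max(range(len(shifts)), key=lambda i: norms[i])]
    while len(chosen) < min(budget, len(shifts)):
        rest = [i for i in range(len(shifts)) if i not in chosen]
        # Score = magnitude times distance to the nearest already-chosen shift.
        best = max(rest, key=lambda i: norms[i] * min(
            math.dist(shifts[i], shifts[j]) for j in chosen))
        chosen.append(best)
    return chosen

shifts = [
    rirs([0.0, 0.0], [3.0, 0.0]),
    rirs([0.0, 0.0], [2.9, 0.0]),   # near-duplicate of the first candidate
    rirs([0.0, 0.0], [0.0, 2.0]),
]
picked = select(shifts, budget=2)   # [0, 2]: the near-duplicate is skipped
```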

rl post_training
#34
Evaluations & Benchmarks 2026-05-27 Hugging Face Blog 6.2 6.3/6.3/6.0

Artificial Analysis and IBM Research jointly released ITBench-AA, the first benchmark for agentic enterprise IT tasks measuring average precision at full recall on Kubernetes incident root-cause analysis from offline incident snapshots. All 18 evaluated frontier models score below 50%, with Claude Opus 4.7 (max) leading at 46.7% and GPT-5.5 (xhigh) at 45.8%. The gap to practical SRE deployment is large; the result complements the existing GDPval-AA index in showing that agentic-coding wins haven't yet translated to agentic-operations wins.

evals agents
#35
Infrastructure 2026-05-27 The Information — AI · TechCrunch — AI 6.2 6.5/6.3/5.8

Snowflake committed $6 billion to AWS over the next few years, with explicit usage of Amazon's Graviton ARM CPUs alongside AI accelerator infrastructure. The deal coincides with Snowflake stock jumping 35% on the back of accelerating AI product adoption. Graviton-tied workloads matter because they signal continued CPU relevance for AI-adjacent serving and analytics — not every layer of the modern AI stack is GPU-bound, and inference-side CPU offload is becoming a meaningful cost lever for hyperscale customers.

infra industry
#36
Infrastructure 2026-05-27 The Information — AI 6.2 6.3/6.3/6.0

ByteDance is considering raising 2026 capex to as much as $70 billion — roughly double its prior planning baseline — as data center and AI infrastructure costs surge. The trajectory keeps ByteDance among the top global compute spenders alongside Microsoft, Google, and Meta. China's hyperscale AI capex is becoming explicitly state-supportive: the matching CSET translation released this week on China's policy measures to accelerate the science-and-technology finance system flags exactly this kind of capital deployment as a national priority. China's AI-talent retention concerns covered in TechCrunch the same day fit the same political-economic narrative.

infra industry gov_defense
#37
Infrastructure 2026-05-27 NVIDIA AI Blog 6.1 6.0/6.3/6.0

NVIDIA's marketing/positioning post argues that the deployment unit for production AI is now the 'AI factory' — a full-stack system spanning compute, networking, storage, and orchestration software designed and sized as a unit. The framing follows a string of GB200/GB300 deployment announcements and aligns with the SGLang and DeepSeek long-context inference work that's been mining the GB300 NVL72 architecture for performance. Notable mainly as marketing direction confirmation: NVIDIA is selling rack-scale designs and the software stack now, not just GPUs.

infra industry
#38
Safety, Policy & Regulation 2026-05-27 Microsoft Research Blog 6.1 6.0/6.3/6.0

A position post arguing that modern AI systems are powerful because they presuppose human intelligence, extending structures already present in cognition and language rather than replicating them — and that this lens explains both capabilities and recurring limits like hallucinations. The post frames AI safety as a system-level engineering and governance challenge rather than a 'rogue AI' problem. The framing matters as Microsoft Research positions itself relative to Anthropic and OpenAI in the AI-safety discourse heading into a year of regulatory drafting.

safety_policy research
#39
Research 2026-05-27 Google AI Blog 6.1 6.2/6.2/5.9

Google details a zero-trust aggregation framework for private analytics — combining differential privacy with hardware-enforced trust separation so that aggregation servers cannot read or correlate per-user inputs even if compromised. The piece sits in the production-deployment side of the privacy-preserving ML stack (alongside federated analytics and anonymous-token systems), positioning Google's serving infrastructure to handle queries over user-derived training and analytic signals without single points of trust.
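
To make the differential-privacy half of that description concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a simple count. It is not Google's framework and does not model the hardware-enforced trust separation; the function names, the epsilon value, and the synthetic user flags are all assumptions chosen for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(flags, epsilon: float, rng: random.Random) -> float:
    """Release a count of True flags with epsilon-differential privacy.

    Assumes each user contributes at most one flag, so the count has
    sensitivity 1 and Laplace noise with scale 1/epsilon is sufficient.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for flag in flags if flag)
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Synthetic, hypothetical input: 600 of 1000 users have the flag set.
rng = random.Random(0)
user_flags = [True] * 600 + [False] * 400

# Each release is noisy, but the noise is zero-mean, so repeated
# releases average out close to the true count of 600.
releases = [private_count(user_flags, epsilon=0.5, rng=rng) for _ in range(2000)]
mean_release = sum(releases) / len(releases)
```

A smaller epsilon means a larger noise scale and stronger privacy at the cost of accuracy. In a real deployment each release also spends privacy budget, so the repeated sampling above is only a way to show that the noise is unbiased.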

research safety_policy
#40
Industry 2026-05-27 The Information — AI 6.1 6.2/6.0/6.0

Meta is creating Enterprise Solutions, a new unit that pairs forward-deployed engineers and data engineers with large corporate customers to integrate Meta's AI directly into customer systems. The Information separately reports that Meta is launching paid AI chatbot subscriptions across Facebook, Instagram, and WhatsApp. Both moves respond directly to the enterprise traction of Google, Microsoft, and Anthropic: Meta is shifting from consumer-only AI distribution toward enterprise revenue and recurring subscription revenue at the same time.

industry
#41
Industry 2026-05-27 TechCrunch — AI 6.0 6.0/6.0/6.0

Robinhood now lets AI agents trade stocks on behalf of users, with agents accessing only a pre-loaded balance in a dedicated wallet rather than the full account — a containment design clearly intended to bound failure modes when the agent is wrong. Agents can read and analyze the user's portfolio for strategy and suggestions, but the trading actions are scoped to the wallet balance. First mainstream agentic-trading rollout from a major US broker.
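
The containment pattern described above can be sketched as a capability split: the agent gets a read-only view of the portfolio plus a wallet object that is the only thing able to spend money. This is a hypothetical illustration, not Robinhood's API; the class names, ticker symbols, and amounts are all assumptions.

```python
from dataclasses import dataclass, field


class WalletLimitExceeded(Exception):
    """Raised when an agent order would cost more than the wallet holds."""


@dataclass
class AgentWallet:
    """Hypothetical pre-funded wallet: the only funds an agent can spend."""
    balance_cents: int
    fills: list = field(default_factory=list)

    def buy(self, symbol: str, quantity: int, price_cents: int) -> None:
        if quantity <= 0 or price_cents <= 0:
            raise ValueError("quantity and price must be positive")
        cost = quantity * price_cents
        if cost > self.balance_cents:
            raise WalletLimitExceeded(
                f"order costs {cost}, wallet holds {self.balance_cents}"
            )
        self.balance_cents -= cost
        self.fills.append((symbol, quantity, price_cents))


@dataclass(frozen=True)
class PortfolioView:
    """Read-only snapshot of the full account; exposes no trading methods."""
    positions: tuple  # pairs of (symbol, share count)

    def symbols(self) -> list:
        return [symbol for symbol, _ in self.positions]


wallet = AgentWallet(balance_cents=50_000)
view = PortfolioView(positions=(("XYZ", 10), ("ABC", 4)))

wallet.buy("XYZ", 2, 20_000)      # costs 40_000, leaves 10_000
try:
    wallet.buy("ABC", 1, 30_000)  # exceeds the remaining balance
    rejected = False
except WalletLimitExceeded:
    rejected = True
```

The point of the design is that the worst-case loss is bounded by the wallet's funding, regardless of how wrong the agent's strategy is.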

industry agents
#42
Audio & Speech 2026-05-27 TechCrunch — AI 6.0 6.0/6.0/6.0

ElevenLabs released a music model that can switch genres mid-track and lets users regenerate individual sections without affecting the rest of the song — section-level editability for generated music in the same way image-editing models support inpainting. The release lands during ElevenLabs' broader push past $500M ARR. Suno and Udio remain the closest competitors but neither has shipped section-level edit primitives at this fidelity.

audio generative_media
#43
Government & Defense 2026-05-27 TechCrunch — AI 6.0 6.0/6.0/6.0

TechCrunch tracks the retention dynamic in Chinese AI labor markets: world-class talent emerging from the DeepSeek/Qwen/Moonshot wave is increasingly being kept domestic by a mix of equity opportunities at Chinese labs, state-aligned incentives, and tightening, export-control-style restrictions on technical migration. The piece pairs with this week's CSET-translated Chinese policy measures on accelerating capital flows into S&T enterprises. Talent retention and capital deployment are two sides of the same national strategy.

gov_defense industry
#44
Evaluations & Benchmarks 2026-05-27 Artificial Analysis 5.9 5.9/5.9/5.9

Artificial Analysis added new language model evaluations this week: Gemini 3.5 Flash (medium) on 2026-05-27, Grok Code Fast 1 and MiniCPM5-1B (Non-reasoning) on 2026-05-25, and Grok 4.3 (medium) on 2026-05-22. The 'MiniCPM5-1B: The leading 1B open weights model' article was published 2026-05-26. The current Intelligence Index v4.0 leaderboard has GPT-5.5 (xhigh) at 60.2, Claude Opus 4.7 (max) at 57.3, Gemini 3.1 Pro Preview at 57.2, and Gemini 3.5 Flash at 55.3. Gemini 3.5 Flash's new entry is notable: it is the new leader on the intelligence-vs-speed Pareto frontier.

evals frontier_llm
#45
Industry 2026-05-27 Last Week in AI 5.9 6.0/6.0/5.7

Andrey Kurenkov's weekly digest covers Musk losing his $150B suit against OpenAI/Altman on a statute-of-limitations finding, OpenAI preparing to file for an IPO in the coming weeks, Google's I/O updates, and a milestone in which OpenAI's reasoning models solved an Erdős conjecture problem. The Musk v. OpenAI trial also produced a surprising, detailed disclosure: Musk himself had similar nonprofit-to-for-profit ambitions, a partial vindication of OpenAI's strategic pivot in the public narrative.

industry safety_policy
Items: 45
Multi-source: 31
Long-form (≥7.5): 0
Sources OK / attempted: 117 / 119
Top category: Agents & Tool Use (7 items)