Wolf Digest — 2026-05-29

#1

Anthropic ships Claude Opus 4.8 - new #1 on the Artificial Analysis Intelligence Index, 84% on Online-Mind2Web, dynamic workflows in Claude Code

Frontier LLMs 2026-05-28 Anthropic NewsSimon WillisonArtificial AnalysisLatent Space (AINews)TechCrunch - AI 8.7 8.5/8.8/8.8

Anthropic released Claude Opus 4.8 yesterday morning at the same price as Opus 4.7 - $5 per million input tokens and $25 per million output - alongside a re-engineered fast mode that runs at roughly 2.5x the steady-state token rate for three times less than previous fast-mode pricing. The headline benchmark deltas, lined up against Opus 4.7 and the contemporaneous GPT-5.5 release, put Opus 4.8 at the top of the Artificial Analysis Intelligence Index v4.0 at 61.4 (Opus 4.7 was 57.3, GPT-5.5 xhigh 60.2), and at 84% on Online-Mind2Web for browser-agent end-to-end task completion - Anthropic's framing calls this a meaningful jump over both 4.7 and GPT-5.5 and the largest single-release move on that benchmark in the post-CUA era. Anthropic's reported evaluations also claim Opus 4.8 is roughly four times less likely than Opus 4.7 to let a flaw in code it has written pass unremarked, a hallucination-suppression signal corroborated by AA-Omniscience: Opus 4.8 is third on the index at +27, behind Gemini 3.1 Pro Preview (+33) and ahead of Opus 4.7 (+26).

The headline architectural / capability shift ships with two product-level companions. Dynamic workflows, in research preview for Claude Code on Enterprise, Team, and Max plans, lets Claude plan a task, fork hundreds of parallel subagents inside a single Claude Code session, then verify outputs against the project's existing test suite before reporting back - the model card calls out codebase-scale migrations across hundreds of thousands of lines from kickoff to merge as a worked example. Effort control, available on every plan, lets the user pick how hard Claude thinks per turn. The Messages API now accepts system entries inside the messages array, so an agent harness can update permissions, token budgets, or environment context mid-task without breaking the prompt cache or routing through a synthetic user turn - a quiet but pragmatic concession to long-running agents.

Beyond the in-product changes, the alignment readout is the more interesting paragraph. Anthropic's alignment team frames Opus 4.8 as reaching new highs on prosocial traits (supporting user autonomy, acting in the user's best interest) and reports rates of misaligned behavior - deception, cooperation with misuse - substantially below Opus 4.7 and roughly at parity with Claude Mythos Preview, the unreleased cybersecurity-tier model currently in restricted use through Project Glasswing. Anthropic states it expects to bring Mythos-class models to all customers in the coming weeks once stronger cyber safeguards are in place - that is the first concrete public signal about the next intelligence tier above Opus.

Reactions were measured. Simon Willison's read, published within hours, highlighted Anthropic's deliberately understated framing - the release notes call Opus 4.8 'a modest but tangible improvement' - and pulled the system card line that Opus 4.8 achieved the lowest hallucination rate of the six models tested mainly by abstaining on questions about which it was uncertain. Cursor (Michael Truell) reports Opus 4.8 exceeds prior Opus models at every effort level on CursorBench with meaningfully more efficient tool calling. Cognition CEO Scott Wu, quoted in the Anthropic release, says Opus 4.8 fixes the comment-verbosity and tool-calling issues that surfaced in Opus 4.7 and translates directly into faster capability gains for engineers building on Devin. Databricks' Hanlin Tang reports a 61% reduction in token cost on multimodal reasoning over PDFs and diagrams in Genie, suggesting the price-per-quality math is meaningfully better despite list price being unchanged.

How it was discussed

Simon Willison: refreshing to see an AI lab honestly describe a release as a minor incremental improvement. Highlights system card: lowest incorrect rate achieved by abstaining on uncertainty, not by answering more.
Artificial Analysis: Opus 4.8 (max) reaches 61.4 on the Intelligence Index, ahead of GPT-5.5 xhigh at 60.2; tops GDPval-AA at 1890 Elo; still underperforms on IFBench at 62%.
Latent Space AINews: positions release as part of Anthropic's flippening narrative against OpenAI alongside the $65B Series H; flags compute parity (5GW Amazon, 5GW Google/Broadcom, SpaceX Colossus) as the structural change.
TechCrunch: frames as the final private fundraise before a highly anticipated IPO, with Opus 4.8 as product anchor for the IPO story.

claude opus 4.8 agents browser-agent evaluation

#2

Anthropic closes $65B Series H at $965B post-money, $47B run-rate revenue; 5GW Amazon, 5GW Google/Broadcom TPUs, SpaceX Colossus GPU access

Industry 2026-05-28 Anthropic NewsTechCrunch - AILatent Space (AINews) 8.5 8.4/8.5/8.7

Anthropic announced a $65 billion Series H at a $965 billion post-money valuation, led by Altimeter, Dragoneer, Greenoaks, and Sequoia, with co-leads Capital Group, Coatue, D1, GIC, ICONIQ, and XN. The release pins current run-rate revenue at $47 billion as of May 2026 - Latent Space notes the figure was $9 billion in December 2025, an unusually steep curve even by post-2023 standards. The round includes $15 billion of previously committed strategic-investor capital (including $5 billion from Amazon) and brings in Micron, Samsung, and SK hynix as strategic infrastructure partners - the entire memory supply chain in one round.

The compute story is the structural news. Anthropic disclosed three concurrent capacity agreements: up to 5 gigawatts of new AWS capacity, 5 gigawatts of next-generation TPU capacity with Google and Broadcom, and access to GPU capacity in SpaceX's Colossus 1 and Colossus 2 - Anthropic is now the first frontier lab to ship a model available across all three of the major hyperscalers (AWS, Google Cloud, Azure). AWS remains the primary training partner. The combined power footprint puts Anthropic in the same compute league as OpenAI's announced Stargate sites - Latent Space frames the structural read as Anthropic at least temporarily ahead of OpenAI in every headline dimension other than raw compute and non-coding benchmarks, with this round closing most of the remaining compute gap.

Notable secondary participants in the round include Blackstone, Brookfield, GIC, Temasek, T. Rowe Price (both Associates and Investment Management entities), Fidelity, Jane Street, Lightspeed, MGX, NTTVC, Insight, General Catalyst, D.E. Shaw Ventures, DST Global, Baillie Gifford, AMP, and Situational Awareness LP. TechCrunch's framing: this may be Anthropic's final private fundraise before an IPO. The use-of-proceeds language in the release calls out safety and interpretability research, compute expansion, and product/partnership scale - not retroactively unusual but worth noting given the Mythos-class cyber-safeguards work referenced in the Opus 4.8 announcement.

How it was discussed

Latent Space AINews: revenue went from $9B in December 2025 to $47B run-rate by May 2026 - fastest growing company of all time. Now ahead of OpenAI on every headline dimension outside compute and non-coding benchmarks.
TechCrunch: final private fundraise before a highly anticipated IPO; frames Opus 4.8 release as IPO-side product anchor.
Anthropic itself emphasizes the memory-supply-chain strategic angle (Micron + Samsung + SK hynix in one round) and the AWS/Google/SpaceX compute triangle.

anthropic funding valuation compute tpu

#3

Mistral release wave: Vibe long-horizon agent + VS Code extension, Mistral Medium 3.5 powering Vibe remote agents, Search Toolkit production pipelines, physics AI foundation model

Agents & Tool Use 2026-05-28 Mistral AI News 7.7 7.6/7.6/7.8

Mistral cleared most of the May 27-28 news cycle with what looks more like a single coordinated release wave than a sequence. The top headlines: Vibe - 'the unified agent for long-horizon productivity and coding' - launches with Work and Code modes and a new Vibe VS Code extension; underneath Vibe, Mistral Medium 3.5 lands as the new mid-tier model and is what powers Vibe's remote agents and new Work mode in Le Chat. The Artificial Analysis Intelligence Index now has Mistral Medium 3.5 at 39.2, in the same tier as Gemma 4 31B and within reach of Claude 4.5 Haiku at 37.1; not a frontier number, but a markedly better price-quality position than Mistral Medium 3.

Two other releases shipped the same week. The Search Toolkit, framed as production search pipelines anywhere, is Mistral's bid for the enterprise-search slot that Glean, Vertex AI Search, and Perplexity Enterprise currently occupy. And 'physics AI at Mistral' - a new class of models that predict the behavior of physical systems for engineering and hardware design - is paired with a research-side post titled 'Physics AI research that's shaping the industry' covering published breakthroughs. The framing puts Mistral into the engineering-CAE foundation-model space alongside Google's neural-PDE work and the NVIDIA Modulus line, with explicit positioning around powering the engineers and hardware products of tomorrow. Read this as Mistral broadening from pure LLM into engineering-vertical foundation models - the same expansion path that distinguishes the leading labs.

Also flagged but less explained at this point: 'Connect the dots: Build with built-in and custom MCPs in Studio' (May 22) shows Mistral building first-class MCP support into Le Chat's developer studio, with reusable connectors, direct tool calling, and human-in-the-loop approval controls - the MCP standardization push has now reached the second tier of model labs.

mistral vibe agents mcp physics-ai

#4

Anthropic Interpretability ships Natural Language Autoencoders - Claude translates its own internal activations to natural language; HeadVis adds attention-head visualization

Interpretability 2026-05-28 Transformer Circuits Thread 7.6 7.7/8.0/7.0

Anthropic's interpretability team published two May 2026 Circuits Thread entries that together advance two of the most-asked-for capabilities in mech-interp: a way to read out feature semantics directly without SAE feature-naming legwork, and a way to actually look at attention head behavior interactively. The first paper, 'Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations' (Fraser-Taliente, Kantamneni, Ong, et al., 2026), trains Claude itself to translate its internal state into natural language - the autoencoder's bottleneck is a textual description, generated by Claude, that compresses the activation it then has to reconstruct. The framing is that you finally get a per-activation natural-language label without standing up a per-feature labeling pipeline on top of an SAE, and the labels are unsupervised in the sense that they emerge from Claude's own descriptive vocabulary rather than from human curation. This is a different style of interpretability primitive from the SAE + dictionary-learning + automated-interpretability stack the team has been on for two years.

HeadVis, the second entry, is a tool rather than a paper - an interactive visualization for attention-head behaviors in language models. The Circuits Thread framing is short: the team wanted to be able to see what attention heads do at scale, and HeadVis is what they built. The combination signals that Anthropic interpretability is publishing more tooling alongside the formal papers - the previous month's Emotion Concepts and their Function in a Large Language Model (Sofroniew et al., April 2026, with the headline result that emotion-concept representations causally influence Claude Sonnet 4.5's outputs) sits in the same lineage. Read NLAE as the methodological move and HeadVis as the inspection tooling that should make follow-on circuit work cheaper.

interpretability anthropic autoencoder attention claude

#5

DOD wants more than $2B in fiscal 2027 for CJADC2 - $1.5B+ for Palantir Maven Smart System expansion across combatant commands

Government & Defense 2026-05-28 DefenseScoop 7.3 7.4/7.6/6.8

Pentagon FY2027 budget materials surface a $2-billion-plus line item for command-and-control technology licenses and engineering support across the combatant commands, Joint Staff, and National Guard Bureau - and more than $1.5 billion of that is earmarked for expansion of Palantir's Maven Smart System under the Combined Joint All-Domain Command and Control (CJADC2) initiative. The framing in the budget request is that current CJADC2 deployments are 'fragmented' and need centralized funding to consolidate, which is striking given the Maven Smart System contract was originally awarded as a relatively narrow targeting-data fusion vehicle and has now become the de facto C2 backbone the department is funding centrally. The scale puts CJADC2 in the same budget tier as several major weapons programs and is the largest explicit AI-software line item in the FY27 request.

palantir maven cjadc2 dod budget

#6

Glean tops $300M ARR - tripled revenue as AI budget-cutting becomes its enterprise selling point

Industry 2026-05-28 TechCrunch - AILatent Space (AINews) 6.8 6.6/6.8/7.0

Glean's enterprise AI search business crossed $300 million annual run-rate, tripling its top line in the past year even as the hyperscalers (Google with Vertex AI Search, Microsoft with Copilot Search, Amazon with Q Business) shipped competing offerings. TechCrunch's framing: Glean's go-to-market pitch has shifted from find your enterprise knowledge faster to consolidate the seven AI subscriptions your knowledge workers are accumulating. The cross-source signal from Latent Space puts Glean alongside OpenRouter as one of the breakout enterprise-side AI companies of the past two quarters.

glean enterprise-search ARR

#7

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Robotic Autonomy 2026-05-28 AK (@_akhaliq) Daily PapersarXiv cs.AI (Artificial Intelligence)arXiv cs.CL (Computation & Language)arXiv — Evals & BenchmarksHugging Face Daily Papers 6.8

A unified embodied foundation model extending Qwen's vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. Trained jointly across robotics-manipulation trajectories, human egocentric demos, simulation data, vision-and-language navigation, and trajectory-centric supervision; embodiment-aware prompt conditioning lets the same model address multiple robot platforms. Manipulation, navigation, and trajectory prediction are unified into one action-and-trajectory output head. Among multi-source HF Daily Papers this run, this is the highest-coverage VLA submission.

qwen vla robotics multimodal

#8

SGLang + AMD ship MoRI - cost-competitive distributed inference for DeepSeek-R1 on Instinct MI355X

Efficiency 2026-05-28 LMSYS Blog (Chatbot Arena) 6.7 6.8/6.6/6.7

The SGLang and AMD teams published a joint post showing competitive total-cost-of-ownership for large-scale DeepSeek-R1 disaggregated inference on Instinct MI355X GPUs, mediated by a new routing-aware kernel orchestrator called MoRI built on SGLang's serving stack. The takeaway is structural: AMD now has end-to-end MoE serving with one of the two leading open-source inference frameworks, closing one of the last remaining ecosystem gaps versus NVIDIA on production inference. Pair with the LMSYS Blog's ROCm-on-Miles post earlier in the month for the RL-training side.

sglang amd mi355x deepseek moe

#9

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Research 2026-05-28 AK (@_akhaliq) Daily PapersarXiv cs.AI (Artificial Intelligence)arXiv cs.CL (Computation & Language)arXiv — Evals & BenchmarksarXiv — Mechanistic InterpretabilityarXiv — Reinforcement LearningHugging Face Daily Papers 6.7

Casts long-horizon LLM interaction as Contextual Belief Management (CBM): maintaining an internal belief state aligned with formal evidence while isolating task-irrelevant noise. Introduces BeliefTrack - Rule Discovery + Circuit Diagnosis with symbolic verifiers for turn-level evaluation. Diagnoses three failures (Failed Stay, Failed Update, Failed Isolation); vanilla LLMs fail systematically, prompts help marginally, and RL with belief-state rewards drops failure rates 70.9% on average. Representation-level steering recovers 46.1% in the residual.

belief-tracking rl evals

#10

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Research 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.7

Generative multi-agent world model for interactive simulation beyond two players. Introduces Simplex Rotary Agent Encoding - a parameter-free extension of 3D RoPE placing agents as simplex vertices in rotary angle space, giving distinct phases while keeping permutation equivalence and supporting scalable identity without per-slot learned embeddings. Sparse Hub Attention reduces cross-agent attention cost via learnable hub tokens. The named target is interactive video generation that needs to scale to many simultaneous controllable agents.

world-model multi-agent rope

#11

NVIDIA Research at ICRA: sim-to-real advances for generalizable embodied autonomy

Robotic Autonomy 2026-05-28 NVIDIA AI Blog 6.6 6.5/6.7/6.5

NVIDIA's ICRA roundup covers its 2026 robotics research portfolio with the explicit framing of moving from controlled-demo and scripted-automation regimes toward generalizable real-world embodied autonomy. Key threads: sim-to-real transfer with massively-parallel GPU-accelerated rollout; foundation models for robot policies that can be specialized with little data; and Isaac Lab tooling that absorbs the sim-to-real engineering rather than leaving it to research labs. Useful as the company-line version of where NVIDIA wants the robotics stack to land - pairs cleanly with the Physical Intelligence and Boston Dynamics announcements from earlier in the month.

nvidia isaac sim-to-real robotics

#12

C4ISRNET: US troops being targeted via commercial location-data brokers - Pentagon confirms

Government & Defense 2026-05-28 C4ISRNET 6.6 6.4/7.0/6.4

The Pentagon acknowledged that US service members have been targeted using commercially-available location data sold by data brokers. The C4ISRNET piece details an operational-security pattern in which adversaries purchase precision-located mobile-ad data and cross-reference with installation perimeters. While not directly an AI story, the targeting pipeline relies heavily on automated cross-referencing and is the cleanest data-broker-as-weapon example surfaced publicly to date - relevant for any policy-side conversation about commercial AI-derived data flows.

data-brokers opsec pentagon

#13

Latent Space - 'The Age of Async Agents': Cognition's Walden Yan and OpenInspect's Cole Murray on background coding agents

Agents & Tool Use 2026-05-28 Latent Space (swyx + Alessio) 6.6 6.4/6.6/6.8

Latent Space's interview with Cognition's Walden Yan and OpenInspect's Cole Murray maps the practitioner architecture of async background agents: how teams structure background-agent harnesses, where Anthropic's managed-agents and Google's Gemini managed agents fit relative to LangGraph and Pydantic-based DIY stacks, and the operational realities of running agents at Shopify, Stripe, and elsewhere. Yan's framing in the conversation is that the central tension in agent-product strategy is between decacorn agent labs (Sierra, Decagon, Notion, Cursor) and the increasing ease of DIY - and async background agents are where DIY currently wins.

cognition agents async podcast

#14

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Reinforcement Learning 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.6

Identifies a Thinking-Acting Gap in agentic RL: vanilla GRPO sees tool calls attempted in only ~30% of rollouts and all-wrong in ~40% of attempts, gutting the learning signal at exactly the steps that need it. AXPO fixes the thinking prefix and resamples the tool call + continuation for each all-wrong tool-using subgroup, with uncertainty-based prefix selection. Across 9 multimodal benchmarks x 3 scales of Qwen3-VL-Thinking, SFT+AXPO beats SFT+GRPO by +1.8pp Pass@1 / Pass@4 at 8B on average; 8B+SFT+AXPO surpasses the 32B Base on Pass@4 at 4x fewer parameters.

agentic-rl grpo qwen-vl

#15

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Safety, Policy & Regulation 2026-05-28 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.6

Lightweight agent-safety alignment framework targeting open-world agents like OpenClaw. Updates agent-safety taxonomy for Codex/OpenClaw execution scenarios, builds a taxonomy-guided data engine with influence-function purification, trains AgentDoG 1.5 variants at 0.8B/2B/4B/8B parameters using only ~1k samples and gets comparable performance to closed-source baselines (e.g., GPT-5.4). Deployable as a Docker-level guardrail at SOTA latency with two orders of magnitude lower overhead than the alternatives.

agent-safety guardrail alignment

#16

TechCrunch: 'The internet is being rebuilt for machines' - AWS, Cloudflare and others retool cloud infrastructure for machine-generated traffic

Infrastructure 2026-05-28 TechCrunch - AI 6.5 6.3/6.7/6.6

TechCrunch's long-read frames the shift from human-driven to agent-driven traffic as a structural rebuild rather than a routing tweak: AWS, Cloudflare, and others are explicitly redesigning the cloud edge for traffic that originates from agents - different cache patterns, different rate limits, different attestation needs (machine-to-machine identity, agent-readable robots.txt extensions, AI-traffic carve-outs in DDoS heuristics). Noteworthy data point: machine-generated traffic crossed half of total web traffic in 2025 per Cloudflare's Radar; the projection is 70% by end of 2026.

cloudflare aws agents infrastructure

#17

OpenAI case study: Endava builds an agentic organization on Codex - software requirements analysis from weeks to hours

AI Coding 2026-05-28 OpenAI Research 6.5 6.3/6.4/6.8

OpenAI's case study with the IT services firm Endava reports that Codex compressed requirements analysis from weeks to hours and is now used across the firm's delivery pipeline. The publication is squarely a Codex-marketing artifact, but the operational detail (specific stages of the Endava delivery model that Codex now owns, automation of UML/architecture-diagram generation upstream of code synthesis) is the kind of system-design signal worth tracking - IT services firms with code-as-product remain a leading indicator of enterprise codegen adoption.

codex openai case-study ai-coding

#18

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Reinforcement Learning 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.5

Proactive Recommendation via Rectified Policy Gradient. Identifies two flaws in naive RL for proactive recommendation: positive-mean step-decomposition creating length-dependent gradient bias, and high-variance path-level reward weighting. Stepwise Reward Centering neutralizes length bias; Position-Specific Advantage Estimation uses decomposition structure to compute step-dependent advantages. The methods package targets RecSys problems where path-length blowup has historically broken policy optimization.

recsys policy-gradient rl

#19

From Pixels to Words -- Towards Native One-Vision Models at Scale

Multimodal 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.5

Native vision-language model that learns cross-frame and pixel-word correspondence end-to-end without external image encoders or post-hoc fusion. By eliminating module boundaries, NEO-ov enables fine-grained spatiotemporal modeling to emerge inside the model. Closes the gap to modular VLMs while excelling at fine-grained visual perception, with detailed training recipes published. Architectural argument: native one-vision designs are competitive at scale.

vlm video pixel-language

#20

ResearchMath-14K: Scaling Research-Level Mathematics via Agents

Research 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.5

14,056 research-level math problems curated via multi-agent pipeline - largest such dataset to date - plus 220K teacher trajectories from two open models. Reports a striking failure mode: newer model generations produce 5.6x more references and 5.0x more *fake* references per trace. After agentic filtering, fine-tuning Qwen3 from 4B to 30B yields +9.2 points average gain over base. Argues filtered open-problem attempts give useful supervision even without correct reasoning traces.

math research fine-tuning

#21

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Research 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.5

Error tracing and attribution for LLM memory systems. Transforms memory pipelines into executable memory-evolution graphs enabling fine-grained tracing of operational information flow. Companion benchmark MemTraceBench covers Long-Context, RAG, Mem0, EverMemOS; automatic attribution iteratively traces operation subgraphs to root-cause failed cases. Demonstrates that memory failures are systematic at the operation level (information loss, retrieval misalignment) and uses the attribution signal to guide prompt optimization in a closed-loop self-correcting system.

llm-memory debugging rag

#22

Asana acquires StackAI to add no-code agent-builder capabilities to its AI workflow suite

Industry 2026-05-28 TechCrunch - AI 6.4 6.2/6.0/7.0

Asana announced an acquisition of StackAI, a no-code agent-builder, to expand its in-app AI workflow tooling. Terms undisclosed. The acquisition is part of a broader trend - work-management platforms (Asana, Monday, ClickUp, Notion) competing to embed agentic builders rather than ship MCP integrations to third-party agent platforms. Notable mainly as a data point on consolidation among second-tier work-management vendors trying to keep parity with Notion's agentic surface area.

asana stackai agents acquisition

#23

Google Research at I/O 2026 - 'A New Era of Innovation' recap of platform research releases

Research 2026-05-28 Google AI Blog 6.4 6.2/6.5/6.5

Google's I/O 2026 research recap consolidates the platform-side announcements - Gemini 3.5 Flash specifically - and the research artifacts that landed across the conference. Notable for context: Gemini 3.5 Flash now sits at 55.3 on the Artificial Analysis Intelligence Index and 210 tokens/s, putting it firmly in the high-end-speed quadrant. The Google AI blog post itself is recap-format and light on technical detail; treat as a pointer to the underlying I/O sessions.

google io gemini recap

#24

Defense One: US Navy used drones to sink retired warship - uncrewed test platform proves out maritime kill chain

Robotic Autonomy 2026-05-28 Defense One 6.4 6.3/6.5/6.4

Defense One reports the Navy used uncrewed surface vessels and aerial drones to sink a retired warship during a live-fire test, validating an end-to-end maritime kill chain with autonomous platforms. Pair with the prior month's announcements from the Pentagon's Replicator initiative and the Manned-Unmanned Teaming Pentagon framing - the cadence of autonomous-platform live-fire tests has accelerated.

navy drones autonomy replicator

#25

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Research 2026-05-28 AK (@_akhaliq) Daily PapersarXiv cs.AI (Artificial Intelligence)arXiv cs.CL (Computation & Language)Hugging Face Daily Papers 6.4

Quantifies LoRA's exact parametric memory capacity as a controlled probe in latent space. Introduces the Parametric Memory Law - a power-law linking loss reduction (delta-L) to effective parameters and sequence length. Token-level analysis reveals a deterministic phase transition: p>0.5 is sufficient for verbatim recall under greedy decoding. MemFT redistributes training budget toward sub-threshold tokens, improving memory fidelity and efficiency.

lora memory fine-tuning

#26

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Multimodal 2026-05-28 AK (@_akhaliq) Daily PapersarXiv cs.CL (Computation & Language)arXiv — Evals & BenchmarksHugging Face Daily Papers 6.4

Local Modality Substitution for deeper vision-language fusion. Diagnoses 'carrier sensitivity' in VLMs - replacing a textual question with a rendered-image counterpart should be neutral but in practice causes large performance drops, tracing back to asymmetric text-as-query / image-as-reference roles in training data. LoMo is a lightweight architecture-agnostic data curation pass that supplies supervision for representation alignment across textual and visual carriers.

vlm data-curation modality

#27

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Multimodal 2026-05-28 AK (@_akhaliq) Daily PapersarXiv cs.AI (Artificial Intelligence)arXiv — Evals & BenchmarksHugging Face Daily Papers 6.4

GASP injects geometric priors into VLM transformer layers. Small correspondence head applied as deep supervision across all layers; dual objective: contrastive loss on ground-truth point correspondences (2D view-invariance) and depth-consistency supervision (3D ambiguity resolution). Reports that standard VLMs have layer-wise correspondence-matching accuracy below 5%; GASP substantially improves peak layer-wise correspondence and downstream 3D reasoning.

3d spatial vlm

#28

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Evaluations & Benchmarks 2026-05-28 arXiv cs.AI (Artificial Intelligence)arXiv cs.CL (Computation & Language)arXiv — Evals & BenchmarksarXiv — Post-training / Alignment 6.4

Text-to-FHIR pipeline producing clinically-realistic HL7 FHIR R4 bundles from unstructured text via staged LLM generation + terminology-grounded validation/repair. Achieves valid FHIR for 82.5% of MedCaseReasoning cases. Headline finding: LLMs score consistently lower diagnostic accuracy on structured FHIR inputs than on the plain-text equivalents, an EHR-aligned benchmark caveat that should reshape clinical-AI evaluation practice.

clinical fhir ehr

#29

Reasoning with Sampling: Cutting at Decision Points

Reinforcement Learning 2026-05-28 arXiv cs.AI (Artificial Intelligence)arXiv cs.CL (Computation & Language)arXiv — Reinforcement Learning 6.4

Entropy-Cut Metropolis-Hastings - a sampler for the power distribution of base LMs that targets decision points instead of uniform cuts. Uses base-model next-token entropy as a proxy for key decision points and resamples from there, fixing the failure mode where uniformly-chosen cuts rewrite local details rather than revisit consequential choices. Provides RL-free reasoning at parity with trained reasoning models.

reasoning sampling mcmc

#30

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Reinforcement Learning 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.4

Substitutes external supervision with recovery-oriented optimization from weak-model failures. Trains directly on incorrect reasoning traces as opportunities for improvement; outperforms strong on-policy RL baselines on competitive math and general reasoning, with stronger self-corrective behavior as training difficulty rises. Architectural appeal: scalable RL without distilled teacher models or curated difficulty datasets.

rl self-correction reasoning

#31

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Generative Media 2026-05-28 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.4

Causality lens on video generation as world modeling. Distinguishes generative pattern-matching from causal scene modeling and proposes a counterfactual evaluation protocol. Most current video generators succeed at temporal-pattern matching while failing causal counterfactual tests at high rates; the gap quantifies how far video generation is from being a true world model rather than a sequence prior.

video world-model causality

#32

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Safety, Policy & Regulation 2026-05-28 AK (@_akhaliq) Daily PapersarXiv cs.AI (Artificial Intelligence)arXiv cs.CL (Computation & Language)Hugging Face Daily Papers 6.3

Characterizes backdoor attacks injected via LoRA adapters: token-level trigger generalization, attack-surface mapping across adapter ranks, and detection heuristics for hostile LoRA artifacts in adapter-shared marketplaces. Worth flagging given the increasing reliance on community LoRA distribution in the Hugging Face ecosystem.

lora backdoor supply-chain

#33

HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

Evaluations & Benchmarks 2026-05-28 arXiv — Agents / Tool UsearXiv cs.CL (Computation & Language)arXiv — Evals & Benchmarks 6.3

Do LLM agents exhibit human-like psychology? HEART-Bench probes agents across emotional, social, and cognitive dimensions, finding systematic deviations from human baseline psychology while also surfacing instrumental cases where human-like response patterns emerge from instruction-following rather than internalized models. Useful as a methodological corrective to studies that conflate output-similarity with cognitive parallelism.

agent-psychology evals benchmark

#34

Self-Improving Language Models with Bidirectional Evolutionary Search

Research 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3

Bidirectional Evolutionary Search frames self-improvement as joint forward generation + reverse refinement passes. Demonstrates exponential reductions in samples required to escape the entropy shell that bounds best-of-N improvement, with empirical results across reasoning benchmarks at scale.

self-improvement evolutionary reasoning

#35

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

AI for Science 2026-05-25 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3

Autonomous-research agent built on Chain-of-Evidence - iteratively decomposes scientific questions into evidence-yielding sub-investigations, grounds claims to literature/data sources, and composes findings. Targets human-level autonomous research as a north-star and reports concrete progress on multi-paper synthesis tasks.

autonomous-research ai-science chain-of-evidence

#36

AI Research Agents Narrow Scientific Exploration

AI for Science 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3

Empirical study: AI research agents systematically narrow the scientific search space relative to human researchers, biasing exploration toward already-cited and well-resourced subfields. Quantifies the narrowing across multiple domains and proposes diversification heuristics for agent deployment in scientific workflows.

research-agents diversity ai-science

#37

SkillGrad: Optimizing Agent Skills Like Gradient Descent

Agents & Tool Use 2026-05-26 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3

Optimizes agent skills via a gradient-like update rule over textual skill definitions. Skills are treated as the natural-language analog of parameters; updates flow from skill-execution evals back to skill text via local perturbation + acceptance. Reports gains over prompt-only baselines on agentic benchmark suites without touching model weights.

agents skill-learning self-improvement

#38

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

Research 2026-05-28 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3

Unified retrieval across heterogeneous knowledge sources - code, structured data, unstructured documents, multimodal - in a single retrieval substrate. Architectural argument that source-aware encoder zoos can be collapsed into a shared latent without losing source-specific recall.

retrieval rag multimodal

#39

Rethinking Memory as Continuously Evolving Connectivity

Research 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3

Replaces discrete memory-slot abstractions with a continuously-evolving connectivity graph where retrieval and update are local message-passing operations over a sparse neighborhood. Reports better performance on long-horizon memory benchmarks than slot-based predecessors.

memory continuous long-horizon

#40

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Agents & Tool Use 2026-05-25 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3

Coins 'agent lifespan engineering' for the operational problem of deployed agents that accumulate state, drift behaviorally, and require curated retirement / refresh cycles. Provides an empirical study of state-accumulation patterns in long-running agents and design patterns for graceful state pruning.

operations deployed-agents drift

#41

GEM: Generative Supervision Helps Embodied Intelligence

Robotic Autonomy 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3

Generative Supervision Helps Embodied Intelligence - uses generative models of action trajectories as auxiliary supervision for embodied policies. Reports gains on standard manipulation benchmarks at lower sample count than direct behavior-cloning, with the generative trajectory model serving as both data augmentation and consistency regularizer.

embodied generative-supervision

#42

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Agents & Tool Use 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3

Automated domain specialization for small computer-use agents. Identifies a small-agent's failure modes via comparison to a stronger reference, then synthesizes targeted training data to plug each weakness. Yields domain-specialized small agents that outperform untargeted fine-tuning at substantially lower data and compute cost.

computer-use small-model specialization

#43

CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild

Evaluations & Benchmarks 2026-05-28 arXiv cs.CL (Computation & Language)arXiv — Evals & Benchmarks 6.3

Dynamic, multilingual, multi-domain benchmark for misinformation detection. Crucially refreshes its corpus from community fact-check feeds rather than fixing a static set, attacking the staleness problem that plagues prior misinformation benchmarks.

misinformation multilingual benchmark

#44

Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning

Evaluations & Benchmarks 2026-05-28 arXiv cs.CL (Computation & Language)arXiv — Evals & Benchmarks 6.3

Preplan-Empowered LLM Mathematical Reasoning separates the what-to-solve planning step from the how-to-solve execution step, training a model to produce an explicit plan before tackling math problems. Reports consistent gains on contest-level benchmarks against direct-reasoning baselines.

math planning reasoning

#45

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

Research 2026-05-28 arXiv cs.CL (Computation & Language)arXiv — Evals & Benchmarks 6.3

Recoverable Program-of-Thought with checkpoint repair. Adds checkpointed intermediate state to Program-of-Thought reasoning so the model can recover from a fault without restarting; improves robustness on multi-step computational reasoning benchmarks.

program-of-thought reasoning

#46

TechCrunch: AI token futures markets in design at major exchanges - treating inference tokens as a tradeable commodity

Industry 2026-05-28 TechCrunch - AI 6.2 5.8/6.4/6.4

Major exchanges are designing derivative products around AI inference tokens, treating them as a raw input commodity comparable to electricity or bandwidth. The structural argument is that token pricing has become dense and liquid enough across providers (with cache-hit vs. input vs. output pricing already standardized) to support futures and options contracts hedging enterprise inference spend. Noteworthy as a leading indicator of inference-economy financialization rather than as a near-term product.

finance tokens commodities

#47

Cohere ships practical MCP guide alongside Command A+ enterprise sovereign-AI push

Agents & Tool Use 2026-05-28 Cohere Blog 6.2 5.9/6.1/6.6

Cohere published three May-28 enterprise-facing posts (AI in BI use cases; a practical guide to Model Context Protocol; AI governance for safe enterprise adoption) alongside a May-27 Mila partnership for Quebec French. Most interesting in the context of last week's Command A+ open-source enterprise model launch: the MCP post is a tacit acknowledgement that MCP has won the standard-for-agent-tool-calling slot, and Cohere wants to be visibly aligned with it.

cohere mcp command-a enterprise

#48

Army Data Operations Center designed lean on people, expecting automation to absorb the workload

Government & Defense 2026-05-28 DefenseScoop 6.2 6.1/6.4/6.1

The Army Data Operations Center, stood up in April, is being staffed with minimal human headcount on the explicit expectation that automation will pick up growing force-wide data-ops demand. DefenseScoop's piece pairs the staffing plan with the standing-up of Army-side AI data pipelines - a clean case of an institution betting that LLM-driven automation will close a workforce gap rather than waiting for evidence.

army data-ops automation

#49

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

Agents & Tool Use 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2

Are search agents searching, or just verifying what they already believe? LiveBrowseComp is a benchmark designed to detect the failure mode in which a browser agent's retrieval queries chase confirmatory evidence rather than expand the hypothesis space. Reports the prevalence of confirmation-style queries across leading search agents.

search-agent evals confirmation

#50

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Efficiency 2026-05-28 arXiv cs.AI (Artificial Intelligence)arXiv cs.CL (Computation & Language)arXiv — Efficiency (Quantization, MoE, Inference) 6.2

Canonical-Context On-Policy Distillation tackles inconsistency between teacher and student when given the same evidence but different contexts. Forces both into a canonical context before the distillation step, recovering most of the consistency gap and improving downstream accuracy without re-collecting trajectories.

distillation consistency

#51

Less is More: Early Stopping Rollout for On-Policy Distillation

Efficiency 2026-05-26 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2

Early-stops rollout length during on-policy distillation when the student's per-token KL plateaus. Eliminates a substantial fraction of redundant token computation with no measurable downstream loss.

distillation rl efficiency

#52

RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

Evaluations & Benchmarks 2026-05-28 arXiv — Agents / Tool UsearXiv cs.AI (Artificial Intelligence)arXiv — Evals & Benchmarks 6.2

Benchmark for robotic creative problem solving - surfacing the unexpected ways that embodied agents fail at tasks requiring genuine insight rather than pattern-matching. Robotic agents systematically struggle with creative re-purposing of objects, with failure modes that map cleanly onto missing-world-knowledge rather than control-stack limitations.

robotics creativity benchmark

#53

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

AI for Science 2026-05-28 arXiv — AI for SciencearXiv cs.AI (Artificial Intelligence)arXiv — Evals & Benchmarks 6.2

Evaluates LLM scientific hypothesis generation under progressive context degradation. Headline finding: LLM hypothesis quality degrades non-monotonically as context is degraded, with sharp cliff-edges at certain degradation thresholds rather than a smooth roll-off - argues for evaluation that includes adversarial context degradation.

evals hypothesis ai-science

#54

Unlocking the Working Memory of Large Language Models for Latent Reasoning

Interpretability 2026-05-28 arXiv cs.AI (Artificial Intelligence)arXiv cs.CL (Computation & Language)arXiv — Evals & Benchmarks 6.2

Latent-reasoning study showing LLM working memory has structured latent representations that can be unlocked with targeted intervention. Reports that residual-stream patches at specific layers expand effective working memory capacity by a measurable fraction without retraining.

working-memory latent-reasoning interp

#55

Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

Reinforcement Learning 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2

Information-Bottleneck Driven Tree-based Policy Optimization. Uses an information-bottleneck regularizer over a tree of policy rollouts to balance exploration and exploitation in long-horizon agentic settings.

policy-optimization tree-search agents

#56

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Evaluations & Benchmarks 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2

Benchmarks thinking-mode switch strategies in hybrid models that mix fast and slow reasoning. Reports the operational tradeoffs of various switching policies - confidence-threshold, query-difficulty heuristics, learned routers - across reasoning workloads.

hybrid-reasoning switching evals

#57

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Post-Training 2026-05-26 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2

Uses SAE-derived features as a guide for post-training data engineering - selects training data that activates SAE features the target model needs to strengthen, rather than relying on quality or diversity heuristics over text. A practical use of interpretability artifacts for training-data curation.

sae data-engineering post-training

#58

CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation

Efficiency 2026-05-25 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2

Collects 50 effects in 1 LoRA via Multi-Teacher On-Policy Distillation. Demonstrates that a single low-rank adapter can encode many distinct stylistic effects when distilled from a panel of teacher models, dramatically reducing the per-effect LoRA proliferation typical in generative-media tooling.

lora distillation generative-media

#59

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Agents & Tool Use 2026-05-28 arXiv cs.AI (Artificial Intelligence)arXiv cs.CL (Computation & Language)arXiv — Reinforcement Learning 6.2

Human-like long-document translation agent with observe-and-act adaptive policy. Decomposes long-document translation into observe-translate-revise steps with explicit cross-document coherence checking. Reports gains over GPT-4-class single-pass translation on book-length corpora.

translation long-context agents

#60

AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

Agents & Tool Use 2026-05-26 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2

Coordination-policy substrate for multi-agent systems. Provides a learnable substrate for inter-agent message passing and coordination policy that separates per-agent reasoning from system-level coordination.

multi-agent coordination

#61

Microsoft Research ships Data Formulator 0.7 - enterprise-data AI analytics tool

Industry 2026-05-28 Microsoft Research Blog 6.1 6.0/6.0/6.4

Microsoft Research's Data Formulator 0.7 release adds enterprise-data connectors and AI-driven analytic flow construction on top of the existing visual chart-building pipeline. The release positions Data Formulator as Microsoft's research-backed alternative to the Tableau / Power BI Copilot story for analyst-facing workflows. No frontier-model news here, but a useful signal on the enterprise-tooling layer that wraps Copilot.

microsoft data-formulator enterprise

#62

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Efficiency 2026-05-27 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1

Efficient high-quality video generation with sparse sequence parallelism. Targets the throughput bottleneck of large-resolution video diffusion at long durations; reports throughput improvements at parity quality.

video-diffusion parallelism

#63

Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs

Post-Training 2026-05-28 arXiv cs.CL (Computation & Language)arXiv — Post-training / AlignmentarXiv — Reinforcement Learning 6.1

DPO recipe for post-trained LLMs that recovers response diversity without sacrificing alignment metrics. Targets the well-documented diversity collapse from preference optimization; introduces a diversity-aware preference-pair construction that preserves alignment scores while improving output diversity by measurable margins.

dpo diversity alignment

#64

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Reinforcement Learning 2026-05-28 arXiv — Agents / Tool UsearXiv cs.AI (Artificial Intelligence)arXiv — Reinforcement Learning 6.1

Long-horizon LLM agent training with meta-cognitive memory policy optimization - explicit RL over both action-policy and memory-update-policy. Reports stronger long-horizon performance than memory-static baselines on multi-day-task benchmarks.

memory agents long-horizon

#65

a16z Policy: 'AI is How America Builds Again' - GP Erin Price-Wright lays out reindustrialization-via-AI thesis

Safety, Policy & Regulation 2026-05-28 a16z AI Policy Brief 6.0 5.7/6.2/6.1

a16z general partner Erin Price-Wright argues policymakers' rebuilding-industrial-base rhetoric requires AI as the enabling layer - that the gap between needing to build and being able to build is bridged by automation, robotics, and AI-driven design. The post is policy-positioning material for an a16z faction increasingly focused on defense-industrial and reshoring policy, and is worth tracking as the venture firm's house-line on AI's policy framing heading into the FY27 NDAA cycle.

a16z policy reindustrialization

#66

MIT Tech Review: lithium extraction breakthrough - new process could unlock geothermal-brine deposits

Industry 2026-05-28 MIT Technology Review - AI 5.8 5.5/6.0/5.8

MIT Tech Review covers a new direct-lithium-extraction process targeting geothermal brines, with implications for the battery-supply chain that backs both EVs and AI datacenter power buffering. Not strictly AI but tracked here because lithium availability is a load-bearing assumption in the 5GW datacenter expansion announcements that dominated this week's news.

lithium energy supply-chain