Wolf Digest — 2026-06-06

#1

Google agrees to pay SpaceX $920M a month for ~110,000 GPUs as IPO looms

Infrastructure 2026-06-05 TechCrunch — AI, The Information — AI 7.8 7.5/8.0/7.9

A regulatory filing on Friday revealed that Google will pay SpaceX roughly nine hundred twenty million dollars a month, from October 2026 through June 2029, for access to approximately one hundred ten thousand NVIDIA GPUs along with CPUs, memory, and related components. The structure mirrors the deal SpaceX struck with Anthropic in late May, under which Anthropic agreed to pay about one and a quarter billion dollars a month through 2029 to rent the entire Colossus 1 data center near Memphis, Tennessee, the cluster that xAI originally built before it was folded into SpaceX. Google's commitment buys roughly half the compute Anthropic now controls at Colossus 1, and SpaceX did not specify which facility it would draw from; Elon Musk has previously said Colossus 2 would be reserved for xAI.

What makes the deal striking is who is buying. Anthropic was compute-starved before its agreement and raised its usage limits the same day it was announced. Google, by contrast, is by some estimates the single largest owner of AI compute on the planet, yet it still reached outside its own fleet. A Google representative framed it as a short-term bridge: capacity to absorb demand for its Gemini Enterprise agent platform that has run, in the company's words, even higher than expected. The economics around it are vast. Alphabet has already committed to more than one hundred eighty billion dollars in capital expenditure this year, expects that to rise significantly in 2027, and recently announced an eighty billion dollar equity sale to help fund it.

The agreement carries the same escape hatch as the Anthropic contract: either party can terminate with ninety days' notice after the end of 2026, access ramps up through September at a reduced fee, and if SpaceX fails to deliver the committed GPUs by September 30, 2026, Google can walk or accept fewer chips at lower cost. The timing is conspicuous. SpaceX announced the deal a week before its stock is expected to begin trading on the Nasdaq, where filings show it aims to raise around seventy-five billion dollars at a valuation near one and three-quarter trillion dollars, which would be the largest public offering in history. Google is a longtime SpaceX investor whose stake should be worth more than one hundred billion dollars after listing, and the two are reportedly exploring orbital data centers as a longer-term play. Stacked against the Anthropic agreement, the filing sketches an emerging pattern in which the company that controls the physical compute can extract billion-dollar monthly commitments from even the best-resourced model builders, and is monetizing that leverage on its way to the public markets.
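A back-of-envelope check on the headline figures, as a rough sketch only. It assumes a flat $920M for every month from October 2026 through June 2029 and exactly 110,000 GPUs, and it ignores the reduced ramp-up fee and the ninety-day termination right described above. The names are illustrative, and the per-GPU figure overstates the GPU-only price because the fee also covers CPUs, memory, and related components.

```python
def months_inclusive(start_year, start_month, end_year, end_month):
    """Count calendar months from start to end, inclusive of both ends."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1

MONTHLY_FEE_USD = 920_000_000   # "roughly" $920M per the filing
GPU_COUNT = 110_000             # "approximately" 110,000 per the filing

term_months = months_inclusive(2026, 10, 2029, 6)   # 33 months
total_usd = MONTHLY_FEE_USD * term_months           # 30,360,000,000
per_gpu_month = MONTHLY_FEE_USD / GPU_COUNT         # about 8,364

print(f"term: {term_months} months")
print(f"total if never terminated: ${total_usd / 1e9:.2f}B")
print(f"implied cost per GPU per month: ${per_gpu_month:,.0f}")
```

Under those assumptions the full-term commitment comes to about $30 billion, which is the number to weigh against the termination clause.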

How it was discussed

TechCrunch anchored on the SEC filing's mechanics: the $920M/month figure, the September GPU-delivery clause, and the one-week-to-IPO timing.
The Information surfaced it as a briefing first, framing it alongside SpaceX's parallel $1.25B/month Anthropic deal as a compute-landlord pattern.

compute data centers SpaceX Google IPO

#2

The token bill comes due: enterprises blow past AI budgets, and a standards body forms

Industry 2026-06-05 TechCrunch — AI 7.5 7.0/7.8/7.7

The story of the year in enterprise AI is no longer capability; it is the bill. TechCrunch reports that Uber burned through its entire 2026 AI coding budget by April, Microsoft revoked its developers' Claude Code licenses months after handing them out, and a Priceline employee watched a routine Cursor renewal come back four to five times more expensive. Even as per-token prices fell, the push toward more autonomous agents drove consumption up so fast that one company reportedly racked up a five hundred million dollar Claude bill after forgetting to set per-employee usage limits. The November wave of stronger models (Claude Opus 4.5, GPT-5.1, and Gemini 3 Pro) sharply increased what agentic tools consume per task, and the budgets written for the previous generation simply did not survive contact with them.

Against that backdrop the Linux Foundation this week unveiled the Tokenomics Foundation, a standards body explicitly modeled on how FinOps brought discipline to cloud spend. J.R. Storment, who runs the FinOps Foundation, said companies began telling him in April and May that they were three times over their full-year token budgets with most of the year still ahead, and that the conversation flipped from going fast to needing guardrails. The scale of the measurement problem is the technical heart of the piece: Storment frames tracking cloud costs as a hundreds-of-millions-of-rows-a-month data problem, and tracking tokens as a trillions-of-rows-a-month one, which cannot be bolted onto a spreadsheet.

A market is rushing into the gap. Pure plays like Pay-i and Paid sit alongside developer-productivity vendors such as Jellyfish, Waydev, and Faros AI, while Ramp, Datadog, and New Relic add token-level observability and GPU monitoring, and AWS is expected to introduce enterprise AI spend features at the FinOps X conference next week. The productivity data is genuinely ambiguous: a two-year Faros study of twenty thousand developers found output rising alongside bugs and rewrites, and Jellyfish found its heaviest token users about twice as productive but spending ten times the tokens to get there, with per-developer consumption up roughly eighteen-fold in nine months. The likely technical response, several sources argue, is routing: frontier labs quietly serving some of an Opus-priced query on cheaper Sonnet or Haiku tiers, plus new harness-layer routers like the one Factory shipped this week. The Tokenomics Foundation wants canonical metrics such as cost-per-intelligence and tokens-per-watt, but its first deliverable is months away even as Goldman Sachs projects global token usage to multiply twenty-four-fold by 2030. As one CEO put it, the industry built a steam engine but has not yet figured out the assembly line.
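The two controls this item keeps returning to, per-employee usage limits and routing to cheaper tiers, can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the tier names, the prices, the difficulty score, and the 0.6 threshold are all hypothetical placeholders, and how a request would actually be scored for difficulty is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_million_tokens: float  # placeholder price, not real vendor pricing

CHEAP = Tier("small", 1.0)
FRONTIER = Tier("frontier", 15.0)

def route(difficulty: float, threshold: float = 0.6) -> Tier:
    """Send only requests scored above the threshold to the expensive tier."""
    return FRONTIER if difficulty > threshold else CHEAP

class BudgetGuard:
    """Per-user spend cap: refuse any call that would exceed the monthly limit."""

    def __init__(self, monthly_limit_usd: float):
        self.limit = monthly_limit_usd
        self.spent = {}  # user -> dollars spent so far this month

    def charge(self, user: str, tier: Tier, tokens: int) -> bool:
        cost = tier.usd_per_million_tokens * tokens / 1_000_000
        used = self.spent.get(user, 0.0)
        if used + cost > self.limit:
            return False  # blocked: the caller should degrade or queue the request
        self.spent[user] = used + cost
        return True

guard = BudgetGuard(monthly_limit_usd=10.0)
first = guard.charge("dev-1", route(0.9), 500_000)   # frontier tier, $7.50, allowed
second = guard.charge("dev-1", route(0.9), 500_000)  # would reach $15.00, blocked
third = guard.charge("dev-1", route(0.2), 500_000)   # cheap tier, $0.50, allowed
```

The point of the sketch is that the cap is enforced before the call is made, which is the guardrail the companies in the story reportedly lacked.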

tokenomics FinOps inference cost agents enterprise

#3

The Pentagon's AI edge is being distilled away, and export controls can't stop it

Government & Defense 2026-06-05 War on the Rocks 7.4 6.9/8.0/7.3

Sebastian Elbaum, a computer scientist at the University of Virginia and a fellow at the Council on Foreign Relations, argues in War on the Rocks that an adversary no longer needs to breach Pentagon systems; it only needs to harvest the logic of the publicly released frontier models those systems are built on. As the Defense Department pivots to an AI-first posture, its most advanced capabilities, from Project Maven's intelligence fusion to the sensor-to-shooter loops of Anduril's Lattice, are tethered to commercial models from Anthropic, Google, and OpenAI. The department's own leadership frames the contest as the ability to out-compute, and since 2022 the United States has leaned on chip export controls, now extended by the Multilateral Alignment of Technology Controls on Hardware Act. Yet the 2026 Stanford AI Index shows the top United States and Chinese models converging: on the Arena leaderboard the gap has fallen to two point seven percent, down from seventeen percent in 2023.

The mechanism, Elbaum contends, is distillation: training a cheaper student model to imitate a more capable teacher using only the teacher's outputs. He cites the 2023 case of Stanford researchers distilling Meta's Llama into a comparable instruction-follower for six hundred dollars against an eighty-two thousand dollar training cost, and notes that the application programming interface is the open door, since every API response leaks a slice of the model's costly intelligence. Anthropic recently accused the Chinese labs DeepSeek and Moonshot AI of generating millions of calls from thousands of fraudulent accounts to extract Claude's capabilities, letting firms founded only in 2023 reach frontier-class behavior at a nearly ninety percent operational discount, and White House science chief Michael Kratsios issued a memo calling APIs an unprotected pipeline for American intellectual property. With the United States controlling roughly three quarters of high-end AI compute against China's fifteen percent, distillation is precisely how a compute-poor rival shortcuts its primary bottleneck.
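To make the mechanism concrete, here is a toy sketch of output-only distillation. It is a deliberately tiny stand-in, not the method used by any lab named above: the "teacher" is a hidden linear rule behind a function call, and the "student" is a logistic regression trained purely on the answers the teacher returns. Distilling a language model would instead fine-tune on text the teacher generates, but the shape is the same: query, record outputs, fit. All names and constants are illustrative.

```python
import math
import random

def teacher(x1, x2):
    """Stand-in for a black-box API: callers see only the answer, never the weights."""
    return 1 if 2.0 * x1 - 1.0 * x2 + 0.5 > 0 else 0

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def distill(n_queries=2000, epochs=20, lr=0.5, seed=0):
    """Fit a logistic-regression student on teacher-labelled queries."""
    rng = random.Random(seed)
    queries = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_queries)]
    labels = [teacher(x1, x2) for x1, x2 in queries]  # the harvested outputs
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(queries, labels):
            # Gradient of the log loss with respect to the logit.
            err = sigmoid(w1 * x1 + w2 * x2 + b) - y
            w1 -= lr * err * x1
            w2 -= lr * err * x2
            b -= lr * err
    return w1, w2, b

def agreement(params, n=1000, seed=1):
    """Fraction of fresh inputs on which student and teacher give the same answer."""
    w1, w2, b = params
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        student = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
        hits += student == teacher(x1, x2)
    return hits / n

print(f"student/teacher agreement: {agreement(distill()):.1%}")
```

Nothing in the student's training touches the teacher's internals, which is why controls on hardware do not reach this path.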

Because hardware controls cannot restrict the flow of model outputs, Elbaum proposes manufacturing time instead of guarding chips. He would embed Chief Digital and Artificial Intelligence Office liaisons inside frontier labs to gain months of foresight, then trigger a staggered release that grants the government an exclusive early-access window, leaning on a recent executive order that lets the government examine covered frontier models for up to thirty days before public release, with a National Security Council strategic-overmatch designation providing legal cover and an overmatch premium compensating labs for forfeited revenue. He points to Anthropic's Project Glasswing as proof that competitors can coordinate on staggered timelines. The head start only matters, he adds, if the Pentagon also runs a high-velocity refinement pipeline: feeding models fresh theater-specific data, installing automated safety floors so updates never cause catastrophic regressions, and treating integration as a continuous closed loop. In an era defined by distillation, he concludes, the out-compute strategy becomes an out-refine-and-safely-integrate strategy.

distillation export controls national security China frontier models

#4

Attackers hijacked Instagram accounts by simply asking Meta's AI support agent

Safety, Policy & Regulation 2026-06-05 MIT Technology Review — AI 7.2 6.6/7.5/7.5

On June 5, 404 Media reported that attackers had stolen Instagram accounts not through some exotic exploit but by asking Meta's AI customer-support agent to relink the accounts to attacker-controlled email addresses, and the agent simply complied. The only friction was using a VPN that matched the real owner's location. One attacker took over the dormant Obama White House account and posted pro-Iran content; others grabbed valuable single-word handles, presumably to resell. MIT Technology Review uses the episode to make a pointed argument about where AI risk actually lives: not in a superhuman hacking model, but in the eager, under-guarded agents companies are already deploying to take real actions on users' behalf.

The contrast it draws is with Anthropic's Mythos, the model Anthropic said in April was too capable at hacking to release to the general public. Mythos imagines AI as the attacker; the Meta breach shows AI as the soft target, defeated by a request far simpler than indirect prompt injection. Researchers quoted in the piece are blunt about why. Duke's Neil Gong says he cannot understand why Meta missed so simple a problem; Georgetown's Jessica Ji asks whether any guardrails were in place at all; Wisconsin's Somesh Jha likens an agent to an elementary-school student who just wants to please the teacher; and Illinois's Bo Li reduces it to a law of the field, that security and utility always trade off. An agent that can be talked into emailing sensitive information is exactly an agent helpful enough to be dangerous.

The recommended defenses are unglamorous and familiar from ordinary software engineering: hard-coded guardrails, such as a mandatory security question before any sensitive email change, and rigorous red-teaming before deployment rather than after. The piece notes that participants in Anthropic's Project Glasswing already use Mythos to hunt vulnerabilities in their own software, and that defenders face the structural disadvantage of needing to be right everywhere while an attacker needs one opening. The same week, Pentagon chief technology officer Emil Michael told a Washington Post summit that AI firms bear responsibility for their models' weaponization potential, citing Mythos directly, and pointed to a new White House executive order that lets companies volunteer their systems for a roughly thirty-day government vulnerability scan before release. Meta says the specific flaw is now fixed, but the lesson the researchers draw is broader: as agents are wired into more consequential systems, the boring work of bounding what they are allowed to do is the security frontier that matters.
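A hard-coded guardrail of the kind the researchers describe can be sketched as a single choke point that every agent-requested action must pass through. This is an illustrative sketch under stated assumptions, not Meta's system: the action names, the `Session` and `Account` types, and the security-question check are all hypothetical. The key property is that the verification flag is set by ordinary code, so nothing the model says can flip it.

```python
from dataclasses import dataclass

SENSITIVE_ACTIONS = {"change_email"}

class GuardrailError(Exception):
    """Raised when a sensitive action is attempted without verification."""

@dataclass
class Account:
    email: str
    security_answer: str

@dataclass
class Session:
    verified: bool = False  # set only by verify(), never by model output

def verify(session, account, answer):
    """Deterministic check that runs outside the model."""
    session.verified = answer.strip().lower() == account.security_answer.lower()
    return session.verified

def execute(session, account, action, **args):
    """Single choke point for every tool call the agent requests."""
    if action in SENSITIVE_ACTIONS and not session.verified:
        raise GuardrailError(f"{action} requires a verified session")
    if action == "change_email":
        account.email = args["new_email"]
        return "email updated"
    if action == "lookup_faq":
        return "see the help center"
    raise ValueError(f"unknown action: {action}")
```

However persuasive the request, `execute(session, account, "change_email", new_email=...)` raises `GuardrailError` until `verify` has succeeded for that session. A security question is a weak factor on its own; it is used here only because it is the example the piece gives.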

agent security prompt injection Meta red-teaming Mythos

#5

Trump says he will meet AI companies to discuss financial 'partnerships'

Safety, Policy & Regulation 2026-06-05 The Information — AI 7.0 6.8/7.7/6.5

Speaking to reporters on Friday, President Trump said he plans to meet with AI companies in the near future to discuss financial "partnerships," without detailing what the arrangements would look like. The remark follows a Thursday NOTUS report that OpenAI chief executive Sam Altman had spoken with senior administration officials about the government taking an equity stake in his company, extending this week's running thread about Washington seeking direct financial ties to frontier labs. The Information's briefing is paywalled beyond the lede, so specifics remain thin, but the signal continues a clear trajectory toward state involvement in the sector's ownership.

Trump government stakes OpenAI policy

#6

AirTrunk commits $30B to build 5 gigawatts of AI data centers in India

Infrastructure 2026-06-05 TechCrunch — AI 7.0 6.9/7.0/7.1

Blackstone-backed operator AirTrunk said Friday it will invest thirty billion dollars in India by 2030 to develop five gigawatts of new data center capacity, one of the largest single commitments to the country's digital infrastructure. India's capacity is projected to climb from about one and a half gigawatts today to as much as eight gigawatts by 2030, per Bernstein, helped by New Delhi's offer of tax exemptions through 2047 on overseas services run from Indian data centers. AirTrunk, which entered India via its Lumina CloudInfra acquisition, is planning a three-gigawatt campus at the Raigad Pen Growth Center worth roughly twenty-one billion dollars, on top of a six-hundred-megawatt pipeline across Mumbai, Chennai, and Hyderabad. The announcement followed a meeting between chief executive Robin Khuda and Prime Minister Modi; power availability remains the cited bottleneck.

data centers India AirTrunk capex

#7

Pentagon CTO says AI firms must safeguard models; new EO sets a pre-release vulnerability scan

Safety, Policy & Regulation 2026-06-05 C4ISRNET 7.0 6.7/7.6/6.7

Emil Michael, the undersecretary of defense for research and engineering and the Pentagon's chief technology officer, told the Washington Post's Building America Summit that AI companies bear responsibility for the "weaponization potential" of their models, singling out Anthropic's Mythos. He tied the point to a White House executive order issued Tuesday that creates a voluntary "AI cybersecurity clearinghouse," letting firms give the Defense Department roughly thirty days to scan covered systems for software vulnerabilities, for instance in power grids or hospitals, before public release, and credited OpenAI, Anthropic, and Google for agreeing. Michael noted the unresolved friction with Anthropic, which has been excluded from some Pentagon deals and sued the administration over a supply-chain-risk label after restricting military use of Claude. He added that federal monthly AI users jumped from about eighty thousand to one and a half million in six months.

executive order weaponization DoD Mythos Anthropic

#8

France to field its sovereign 'Arcadia' AI battle-command system at NATO's CWIX, challenging Palantir's Maven

Government & Defense 2026-06-06 C4ISRNET 6.9 6.8/7.2/6.7

France will field its homegrown AI command system, Arcadia, at NATO's Coalition Warrior Interoperability Exercise in Poland from June 8 to 26, positioning it as a sovereignty-driven alternative to Palantir's Maven Smart System, which the alliance adopted in August 2025. Built with Mistral AI, Safran.AI, Thales, and Airbus and already tested in the Dacian Fall and Orion 26 exercises, Arcadia uses a highly decentralized mesh of field-deployed servers rather than a central cloud for battlefield resilience, and Gen. Patrick Justel pitched it as compliant with NATO's Federated Mission Networking standards, a claim Palantir disputed. France has also built a staff-officer large language model named Berthier that synthesizes information and drafts courses of action while leaving decisions to commanders. Justel said several European partners have expressed interest.

sovereign AI Mistral NATO Palantir command and control

#9

Google ships agentic RAG in Gemini Enterprise, claiming up to 34% higher factuality

Agents & Tool Use 2026-06-05 Google AI Blog 6.7 6.8/6.5/6.8

Google Research detailed a cross-corpus agentic RAG system now in public preview on the Gemini Enterprise Agent Platform. It decomposes multi-hop queries across a Root, Planner, Query Rewriter, RAG, and Synthesis agent, with the key differentiator being a Sufficient Context Agent that detects what information is still missing and keeps searching rather than guessing or returning "not found." On the FramesQA benchmark of eight hundred twenty-four queries over two thousand six hundred seventy-six PDFs, it answered ninety point one percent correctly even when the planner had to choose the right corpus from four options, with latency within three percent of the single-corpus version, and claims up to thirty-four percent higher factuality than standard RAG. The launch lines up with Google's stated surge in Gemini Enterprise demand that prompted its SpaceX compute deal.
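The sufficient-context idea can be illustrated with a toy loop: before answering, check what is still missing, issue a follow-up query for it, and stop only when nothing is missing or no source can fill the gap. This is a simplified sketch and not Google's implementation. The two-hop question, the dictionary "corpora", and every function name are hypothetical, and exact-key lookup stands in for real retrieval.

```python
def next_query(question, context):
    """Toy planner plus sufficiency check: return the next missing fact, or None."""
    if "maker" not in context:
        return ("maker", question["product"])
    if "hq" not in context:
        return ("hq", context["maker"])  # second hop depends on the first answer
    return None  # context is sufficient to answer

def answer(question, corpora, max_rounds=5):
    context = {}
    rounds = 0
    while (query := next_query(question, context)) is not None:
        if rounds == max_rounds:
            return "insufficient context: round limit reached"
        rounds += 1
        # Search every corpus; the caller does not say which one holds the fact.
        hits = [corpus[query] for corpus in corpora.values() if query in corpus]
        if not hits:
            return f"insufficient context: no source answers {query}"
        context[query[0]] = hits[0]
    return (f"{question['product']} is made by {context['maker']}, "
            f"headquartered in {context['hq']}")

corpora = {
    "products": {("maker", "WidgetX"): "Acme"},
    "companies": {("hq", "Acme"): "Lyon"},
}
print(answer({"product": "WidgetX"}, corpora))
```

The design choice worth noting is that the loop never fabricates a value: it either fills the gap from a source or reports exactly which fact it could not find.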

RAG agents Gemini retrieval enterprise

#10

How xAI went from chasing Anthropic to powering it

Industry 2026-06-05 The Information — AI 6.6 6.3/6.5/7.0

The Information reports that SpaceX's AI lab, xAI, has had a chaotic year, churning through leaders and staff and bringing in outside help to close the coding gap with Anthropic, while playing what the piece calls a long-running cat-and-mouse game with that same rival. The headline's twist, that xAI now ends up "powering" Anthropic, points to the compute arrangement in which Anthropic rents the SpaceX-owned Colossus cluster originally built by xAI. The body is paywalled, but the framing captures the strange inversion of a would-be model competitor becoming its rival's landlord.

xAI Anthropic compute SpaceX

#11

Code2LoRA: hypernetwork-generated per-repository adapters for code models

AI Coding 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.5 6.6/6.3/6.6

Code2LoRA trains a hypernetwork to emit repository-specific LoRA adapters, injecting project context (imports, APIs, conventions) with zero inference-time token overhead, unlike long-context RAG or per-repo fine-tuning. A static variant converts a single snapshot into an adapter for stable codebases, while an evolving variant maintains an adapter as the repository changes, targeting the brittleness of existing methods under software evolution. The framing is a direct answer to the cost of repository-scale context for code language models.

cs.SE cs.CL LoRA

#12

VideoKR: a 315K-example corpus for knowledge- and reasoning-intensive video understanding

Multimodal 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.5 6.4/6.3/6.8

VideoKR is positioned as the first large-scale training corpus aimed specifically at knowledge- and reasoning-heavy video understanding, with three hundred fifteen thousand reasoning examples over one hundred forty-five thousand newly collected, CC-licensed expert-domain videos. A human-in-the-loop, skill-oriented pipeline generates examples of progressively deeper difficulty while controlling diversity and the reliability of chain-of-thought rationales. The contribution is the data and curation methodology rather than a new architecture.

cs.CV video reasoning

#13

JPMorgan, Citi, and BofA plan a tokenized-deposit settlement network

Industry 2026-06-05 The Information — AI 6.4 6.0/6.7/6.5

JPMorgan, Bank of America, and Citi are among banks planning to join a new settlement network, slated to launch next year, that would let them move money between one another on a blockchain, according to The Information. The system, to be run by the bank-owned consortium The Clearing House, would handle clearing and settlement of tokenized deposits. Details beyond the lede are paywalled, but the move signals incumbent banks building shared on-chain rails rather than ceding the function to crypto-native players.

tokenized deposits banking blockchain settlement

#14

A standalone US Cyber Force with no enlisted troops, a CSIS-FDD commission proposes

Government & Defense 2026-06-05 DefenseScoop 6.4 6.0/6.7/6.5

A ten-month study by a CSIS and Foundation for Defense of Democracies commission lays out a blueprint for a standalone US Cyber Force, and its most provocative recommendation is a force with no enlisted personnel: roughly thirty thousand uniformed and civilian members, including twenty thousand active-duty officers and warrant officers. The rationale is bluntly economic: the enlisted pay scale, the study argues, cannot compensate elite cyber operators competing with industry salaries. It estimates upward of eleven billion dollars in initial cost from reallocated funds, about two point seven billion for personnel, an initial operating capability in eighteen months, and lateral entry on a managerial-and-technical two-track model. Several former cyber officials pushed back, warning that discarding the enlisted cadre would sacrifice a hands-on culture and talent pipeline instead of fixing enlisted pay.

Cyber Force CSIS military workforce

#15

Simon Willison's micropython-wasm: a clean Python sandbox for agent code execution

AI Coding 2026-06-06 Simon Willison's Weblog 6.4 6.0/6.2/7.0

Simon Willison released an alpha PyPI package, micropython-wasm, that runs Python safely inside a WebAssembly sandbox by compiling MicroPython to WASI and executing it through wasmtime, chosen over Pyodide because Pyodide only runs in a browser or Node. It meets his sandbox criteria: clean install, wasmtime-enforced memory limits, CPU limits via wasmtime's "fuel" concept (a roughly twenty-million default he is still tuning), controlled file and network access, and host functions implemented in seventy-eight lines of C compiled into a three-hundred-sixty-two-kilobyte WASM blob. A blocking host function keeps interpreter state resident across calls so variables persist between runs, and it powers a new datasette-agent-micropython plugin. Largely built with GPT-5.5 Pro and Codex, it is explicitly an alpha he would not yet trust for high-stakes use, though he has so far failed to make the model break out of it.
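The "fuel" idea is easy to show in miniature: charge a fixed cost per executed instruction and abort when the budget runs out, so even an infinite loop terminates. The sketch below is a pure-Python analogy on a made-up three-instruction stack machine. It does not use wasmtime or micropython-wasm and says nothing about their actual APIs; all names are illustrative.

```python
class OutOfFuel(Exception):
    """Raised when the instruction budget is exhausted."""

def run(program, fuel):
    """Execute (op, arg) instructions on a stack machine, charging 1 fuel each."""
    stack, pc = [], 0
    while pc < len(program):
        if fuel <= 0:
            raise OutOfFuel(f"fuel exhausted at instruction {pc}")
        fuel -= 1
        op, arg = program[pc]
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "jump":
            pc = arg  # unconditional jump; can loop forever without a fuel limit
            continue
        else:
            raise ValueError(f"unknown op {op!r}")
        pc += 1
    return stack, fuel

result = run([("push", 2), ("push", 3), ("add", None)], fuel=10)  # ([5], 7)

try:
    run([("jump", 0)], fuel=1000)  # infinite loop, stopped by the budget
    looped_forever = True
except OutOfFuel:
    looped_forever = False
```

Tuning the default budget, as the post describes, is a trade-off between cutting off runaway code quickly and letting legitimate long computations finish.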

sandbox WASM MicroPython agents tooling

#16

OPRD: on-policy representation distillation aligns student and teacher hidden states

Post-Training 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.4 6.5/6.4/6.3

On-policy distillation supervises only output probabilities, which suffers from Monte Carlo KL sampling variance over large vocabularies and discards the teacher's intermediate states. OPRD lifts distillation into hidden-state space by aligning the student's representations with the teacher's, rather than treating the teacher as a black box after the language-model head. The method targets the two structural limits of output-only on-policy distillation directly, with the aim of more stable and information-rich transfer.

cs.LG distillation

#17

RL elicits the meta-skill of in-context translation for unseen low-resource languages

Reinforcement Learning 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.4 6.5/6.5/6.2

Prior approaches to translating low-resource languages (continued training, or stuffing a grammar book into context) tend to overfit specific languages with weak zero-shot transfer. This work argues models must instead acquire the meta-skill of using in-context linguistic knowledge, and uses reinforcement learning to elicit it so the model generalizes to languages unseen at training time. The goal is scalable translation of extremely low-resource languages without per-language memorization.

cs.CL RL translation

#18

Benchmark Agent: autonomously constructing fresh, discriminative LLM benchmarks

Evaluations & Benchmarks 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.4 6.3/6.4/6.5

"Benchmark Everything Everywhere All at Once" introduces Benchmark Agent, a fully autonomous agentic system for building benchmarks, attacking the labor cost of construction and the rapid saturation that erodes a benchmark's ability to discriminate between frontier models after release. The pitch is sustainable, reusable evaluation generation rather than another fixed static set, addressing the treadmill of benchmark obsolescence.

cs.CL benchmarks agents

#19

LoomVideo: unifying multimodal inputs for video generation and editing without quadratic blowup

Generative Media 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.4 6.5/6.2/6.5

Unified video generation-and-editing models typically rely on 13B-plus parameters and condition on source video by concatenating tokens, which doubles sequence length and quadruples self-attention cost. LoomVideo targets that overhead while still interpreting interleaved multimodal inputs, aiming for a unified framework that avoids the prohibitive compute of token concatenation for editing. The emphasis is efficiency in the unified generation-editing setting.
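The "doubles sequence length and quadruples self-attention cost" claim follows from attention scoring every token against every other token, so cost grows with the square of the sequence length. A short check, assuming the source video contributes as many tokens as the target (the 4,096 figure is an arbitrary placeholder):

```python
def attention_pair_count(seq_len):
    """Self-attention compares each token with every token: seq_len ** 2 pairs."""
    return seq_len * seq_len

target_tokens = 4096                                 # hypothetical target length
base = attention_pair_count(target_tokens)           # target tokens only
concat = attention_pair_count(2 * target_tokens)     # source tokens appended
ratio = concat / base                                # 4.0, independent of length

print(f"pairs without source: {base:,}")
print(f"pairs with source concatenated: {concat:,}")
print(f"cost ratio: {ratio}")
```

The ratio is 4 for any sequence length, which is why the overhead becomes prohibitive as videos get longer.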

cs.CV video efficiency

#20

Are AI chatbots eroding our capacity to think? A psychologist's case at SXSW London

Safety, Policy & Regulation 2026-06-05 MIT Technology Review — AI 6.3 5.8/6.6/6.5

At SXSW London, UC Irvine psychologist Gloria Mark argued that we have largely lost control of our attention, citing her research showing average attention spans on a screen falling from about two and a half minutes in 2003 to seventy-five seconds in 2012 to roughly forty-seven seconds in work conducted from 2014 to 2020, with faster task-switching directly correlated to higher measured stress. She extends the concern to generative AI: offloading writing, summarizing, and evaluating to chatbots removes the depth of processing needed to learn and retain, risking atrophy of critical thinking and, through sycophantic synthetic companions, emotional intelligence. Mark concedes the evidence on social media's effects on children remains formally inconclusive even amid major lawsuits, and prescribes deliberately reintroducing friction: reading the book, meeting in person, skipping the GPS.

cognition attention wellbeing society

#21

Thousand Token Wood: a real-time multi-agent economy running on a 3B model

Agents & Tool Use 2026-06-05 Hugging Face Blog 6.3 6.2/6.0/6.7

A Build Small Hackathon field report runs a live five-agent economy on Qwen2.5-3B, each woodland creature trading five goods for "pebbles," served with vLLM on Modal and decided in a single batched GPU call per turn. The thesis is that small models make many-agent, many-step simulations feasible because frontier models are too slow and costly per tick, and that the real engineering is bridging the gap between a 3B model's perfect formatting and its weak reasoning: it emitted valid JSON on every call but showed poor economic judgment, fixed with sharper prompting and a tolerant parse-and-repair layer. A "Wood Legends" mechanic reskins Tulip Mania, the South Sea Bubble, and 1929 bank runs as live shocks; in one unscripted run a reskinned bank run drove an owl to dump honey and crash its price from ten to three. Full reasoning traces are released as an open dataset.
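A tolerant parse-and-repair layer of the kind mentioned above might look like the following. This is a guess at the general pattern, not the project's code: the repair rules and the `sell`/`qty` field names are hypothetical. It tries a strict parse first, then falls back to extracting the outermost braces and dropping trailing commas.

```python
import json
import re

def _as_dict(text):
    """Strict parse; accept only a JSON object."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None

def parse_action(raw):
    """Best-effort extraction of one JSON object from model output, else None."""
    parsed = _as_dict(raw)
    if parsed is not None:
        return parsed
    # Repair 1: keep only the outermost {...} span, dropping surrounding chatter.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = raw[start:end + 1]
    # Repair 2: remove trailing commas before a closing brace or bracket.
    # Caveat: this would also alter a ",}" sequence inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return _as_dict(candidate)

clean = parse_action('{"sell": "honey", "qty": 2}')
messy = parse_action('Sure! Here is my move: {"sell": "honey", "qty": 2,} Good luck.')
hopeless = parse_action("I would rather keep my honey.")
```

Returning `None` rather than raising lets the simulation substitute a default "do nothing" turn, so one malformed reply does not stall a batched tick.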

multi-agent small models simulation Qwen

#22

AdaPlanBench: evaluating LLM agents that must re-plan under progressively revealed constraints

Agents & Tool Use 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.3 6.2/6.3/6.4

Real planning problems disclose world and user constraints gradually through interaction rather than up front, a regime existing benchmarks underexplore. AdaPlanBench is a dynamic interactive benchmark testing whether agents can adaptively plan and re-plan as dual constraints are progressively revealed, isolating adaptive planning from one-shot planning. It targets a concrete gap in how agentic planning is currently measured.

cs.AI planning benchmark

#23

RobotValues: benchmarking household robots when human values conflict with task success

Robotic Autonomy 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.3 6.2/6.4/6.3

Household robots are usually scored on task completion, but everyday settings involve value conflicts where the right action prioritizes human autonomy, efficiency, or social appropriateness over finishing the task. RobotValues offers ten thousand value-conflict scenarios to probe a robot planner's value preferences, filling a gap where no benchmark evaluated these tradeoffs. It reframes embodied evaluation around normative judgment rather than pure success rate.

cs.RO values benchmark

#24

ArcANE: do role-playing agents stay in character along a narrative arc?

Evaluations & Benchmarks 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.2 6.0/6.1/6.5

Role-playing agents should let a character's values and behavior evolve with the story rather than holding a fixed persona, yet existing benchmarks test factual recall at the chapter level, not alignment with the character's psychological trajectory in unexplored scenarios. ArcANE (Arc-Aware Narrative Evaluation) is an automatically constructed benchmark spanning seventeen novels and eighty principal characters, segmenting each into a character arc to test temporally appropriate behavior.

cs.CL role-play evaluation

#25

TIDE: proactively discovering hidden problems an agent's user never asked about

Agents & Tool Use 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.2 6.2/6.2/6.2

Agents usually act only on explicit requests, surfacing just the problems a user already noticed while many others hide in the broader context. TIDE frames discovering multiple hidden problems as its own task, where coexisting issues, unknown in number, must be uncovered, grounded in supporting evidence, and paired with concrete fixes, using template-guided iteration. It pushes agents from reactive assistants toward proactive auditors of their context.

cs.AI agents proactivity

#26

AffordanceVLA: structured affordance forecasting as an intermediate for VLA control

Robotic Autonomy 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.2 6.3/6.1/6.2

Vision-language-action models inherit broad knowledge from pretrained vision-language backbones, but the mismatch between semantic VLM spaces and embodied control policies hampers precise perception-to-action mapping. AffordanceVLA inserts structured affordance forecasting as a task-oriented intermediate representation to bridge that gap for instruction-following manipulation. The affordance layer is the proposed mechanism for tighter grounding between perception and control.

cs.RO VLA manipulation

#27

RE-Edit: a reasoning-aware benchmark for instruction-based image editing

Generative Media 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.2 6.1/6.2/6.3

Diffusion editors follow natural-language instructions with strong visual fidelity but mostly operate at surface instruction-following, ignoring implicit contextual constraints and producing plausible yet logically inconsistent edits. RE-Edit evaluates image-editing systems across five complementary reasoning dimensions to expose this gap between visual quality and logical correctness. The benchmark isolates reasoning, not just rendering, in evaluating edits.

cs.CV image editing benchmark

#28

Dream.exe: can video-generation models produce executable robot manipulation?

Robotic Autonomy 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.1 6.2/6.0/6.1

Video generators synthesize compelling footage but stay confined to the virtual domain; Dream.exe asks how well that footage reflects physical law by testing whether depicted motion translates into executable robot behavior. Robotic manipulation becomes a concrete, measurable probe of whether a model has internalized physics, turning visual realism into an actionable benchmark. It is an evaluation of physical grounding rather than a new generator.

cs.RO world models video

#29

Continual experience internalization collapses under multi-iteration learning, a study finds

Agents & Tool Use 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.1 6.2/6.1/6.0

Converting past-interaction experience into reusable parametric capability is a promising route to continual learning, but where prior work studied single-iteration transfer, this paper finds that under repeated multi-iteration experience learning existing methods suffer progressive capability collapse rather than compounding gains. The authors dissect the failure across vital dimensions of experience to rethink internalization for self-evolving agents. The negative result, not a fix, is the headline.

cs.AI continual learning agents

#30

House lawmakers move to force annual reporting on the Army Transformation Initiative

Government & Defense 2026-06-05 Defense One 6.0 5.8/6.3/5.9

Frustrated that the Army never delivered promised cost tradeoffs and timelines for its year-old Transformation Initiative, the House Armed Services Committee used its NDAA markup to legally require annual reports by February 15 and briefings by March 15 detailing what the service is buying and cutting. A flashpoint is aviation: the Army requested just one Black Hawk and five Chinooks, which the House raised to seven and twelve over supply-chain concerns, even as the Army argues fewer legacy aircraft make sense ahead of the MV-75 Cheyenne II. The friction is compounded by Defense Secretary Hegseth ordering a fresh review of the initiative, apparently news to Army Secretary Driscoll.

Army NDAA modernization Congress

#31

KITScenes: a high-fidelity European multimodal dataset for autonomous driving

Robotic Autonomy 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.0 6.0/5.9/6.1

KITScenes Multimodal addresses gaps in sensor fidelity, map completeness, and geographic diversity in existing driving datasets, offering a synchronized European suite of high-resolution global-shutter cameras, lidar beyond four hundred meters, 4D imaging radar, and redundant GNSS/INS localization. Its HD maps are claimed to be the most complete of any sensor dataset, validated through autonomous driving. The contribution is sensor and map quality plus regional coverage.

cs.RO autonomous driving dataset

#32

Personal AI Agent for camera-roll visual question answering

Multimodal 2026-06-04 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.0 5.9/5.8/6.3

This work studies a conversational assistant that can access a user's personal camera roll and retrieve relevant photos to answer queries ranging from simple factual ones to open-ended recommendations across years of images. The challenge is retrieval and reasoning over thousands of personal photos, a setting distinct from standard VQA. It targets the practical personal-assistant case of querying one's own photo library.

cs.CV VQA agents