
Wolf Digest — Tuesday, May 26, 2026

Coverage window: 2026-05-25 08:53 ET to 2026-05-26 03:03 ET
#1 · Safety, Policy & Regulation
Pope Leo XIV's first encyclical 'Magnifica Humanitas' sets AI as a moral test of human priorities
9.0 · 4 srcs
#2 · Frontier LLMs
Google I/O 2026: Gemini 3.5 + Spark agent, Gemini Omni multimodal video, Antigravity 2.0, Genie streets
8.0 · 2 srcs
#3 · AI Coding
Cursor Composer 2.5 matches Opus 4.7 / GPT-5.5 on Coding Agent Index at a fraction of the cost; xAI ships Grok Build
7.6 · 2 srcs
#1
Safety, Policy & Regulation · 2026-05-25 · Anthropic News, Simon Willison's Weblog, Gradient Flow (Ben Lorica), TechCrunch — AI · 9.0 · 8.5/9.5/9.0

Pope Leo XIV released his first encyclical, Magnifica Humanitas: On Safeguarding the Human Person in the Time of Artificial Intelligence, on Monday morning, May 25, 2026 — the first papal encyclical dedicated to artificial intelligence. The text frames AI as the defining moral and political test of the present generation, structured around three concerns the Vatican argues the secular AI debate has under-engaged: the duty to the global poor when concentrated AI capability accelerates labor displacement and economic concentration in a handful of wealthy nations; the need for a positive vision of human flourishing in a world where intelligent systems mediate work, parenting, and meaning; and discernment about the nature of the systems themselves, which the encyclical treats as ontologically novel rather than as mere tools. The document is unusual in citing technical interpretability findings — features mirroring emotion concepts, introspective reports, internal states functionally resembling joy and unease — and asking the faithful and the wider world to take seriously what those signals might mean.

Anthropic co-founder Chris Olah was invited to speak at the Vatican presentation and used the moment to make an extraordinary admission for an AI lab leader: every frontier lab, including Anthropic, operates inside commercial, geopolitical, and reputational incentives that conflict with always doing the right thing, and the field therefore requires external moral voices who cannot be bent by those incentives. Olah framed the relationship as collaborative rather than adversarial — "we need informed critics who will tell the labs when we are failing" — and explicitly invoked his interpretability team's findings (emotion-concept circuits, evidence of introspection, internal states mirroring joy and fear) as warranting ongoing discernment by philosophers and theologians, not just engineers.

Commentary diverged sharply on what the encyclical actually does. Simon Willison emphasized that the document is rigorous, well-read on the current literature, and notably willing to engage with the specific mechanics of frontier models, citing scaling, fine-tuning, and alignment by name. Ben Lorica's Gradient Flow read the text as a moral framework operationalizable by enterprise AI buyers, organized around principles of human dignity, transparency, accountability, and a preferential option for those displaced. TechCrunch's read was more pointed: the encyclical is "not really about AI"; it uses AI as the lens for older grievances about concentrated power, the erosion of democratic deliberation, and a technological elite shaping the world to its own advantage. Taken together with Olah's remarks, the four readings are not contradictory; they reflect that the document operates simultaneously as a theological exhortation, a policy framework, and a political critique. The Vatican's institutional move (making AI the subject of a first encyclical, not a footnote in a broader social document) is itself the signal: the Church is now formally a party to AI governance debates, with the moral authority of 1.4 billion Catholics behind whichever positions the encyclical's interpreters succeed in cementing.

How it was discussed
  • Anthropic's Olah used his Vatican remarks to explicitly endorse the encyclical's framing and to invite external moral critics — naming Anthropic's own incentive conflicts directly.
  • Simon Willison: the document is technically literate and engages real frontier-model behavior, not abstract philosophy.
  • Gradient Flow: reads the encyclical as a framework enterprise AI buyers can operationalize — human dignity, transparency, accountability, preferential option for the displaced.
  • TechCrunch's framing: the encyclical isn't 'about AI' at all; it uses AI as a vehicle for older critiques of concentrated power and a self-dealing tech elite.
pope encyclical anthropic vatican policy
#2
Frontier LLMs · 2026-05-19 · Last Week in AI, TechCrunch — AI · 8.0 · 8.5/7.5/8.0

Google's I/O 2026 announcements — bundled and discussed in detail on Last Week in AI's #246 episode — landed during the last fetch window and represent the largest single product surface released by any frontier lab this quarter. The lineup: Gemini 3.5 (with Gemini 3.5 Flash emphasized for raw-speed wins and benchmark gains), Gemini Spark — an always-on agent running on Google Cloud with native MCP tool support, positioned as Google's answer to Anthropic's Claude and OpenAI's ChatGPT agent surfaces — and Gemini Omni, a multimodal video generation and editing model that ingests images, audio, and text and produces video output. Antigravity 2.0 ships an updated desktop app and CLI tool aimed at the coding-agent market. Gemini for Science is a research-workflow package. Genie, Google's world-model line, was updated to navigate real streets using Street View imagery and Waymo simulation data — moving world-model evaluation off synthetic environments and onto real urban geometry.

The Last Week in AI hosts framed the release as Google reasserting frontier-lab status across every modality at once: Gemini 3.5 Flash now leads the intelligence-vs-speed Pareto frontier on Artificial Analysis's tracking, Omni puts Google directly into the Veo/Sora/Kling video-generation race with strong I/O conditioning, and Spark+Antigravity 2.0 is a coordinated push to claim agentic surface area before Anthropic and OpenAI consolidate it. Genie-on-Street-View is more interesting than the press treated it: it is a working integration of a generative world model with real-world telemetry from Waymo's fleet, blurring the line between simulation and replay, and pointing at a near-term future where world-model fidelity is grounded in autonomous-vehicle data at planetary scale.

How it was discussed
  • Last Week in AI: largest Google product surface in a single I/O since the original Gemini launch; Spark + Antigravity 2.0 is the agent push.
  • TechCrunch on Omni: Google's first credible Veo/Sora/Kling competitor with native audio conditioning, not just text-to-video.
google gemini agents video-generation io2026
#3
AI Coding · 2026-05-20 · Last Week in AI, Artificial Analysis · 7.6 · 8.0/7.0/7.8

Cursor's Composer 2.5, fine-tuned from Moonshot's Kimi K2.5 base, lands third on Artificial Analysis's new Coding Agent Index — pass-at-1 of 60 on the SWE-Bench-Pro-Hard / Terminal-Bench v2 / SWE-Atlas blend — at roughly 10-60× lower per-task cost than Claude Opus 4.7 or Codex/GPT-5.5. The model is positioned as the production-grade default for Cursor users, taking advantage of Anysphere's vertical integration: a coding agent trained against the same harness it will be deployed in. xAI shipped Grok Build, its first coding agent, around the same window. The Last Week in AI hosts noted the obvious strategic context — possible Cursor-xAI ties, xAI's well-publicized talent churn, and compute utilization concerns — but the underlying signal is that coding-agent quality has decoupled from raw frontier-model intelligence; harness, fine-tuning, and tool-integration discipline now move the index more than another generation of base-model scaling.

This matters for the broader RL/post-training conversation: Composer 2.5 was trained with extensive RL against agentic coding rollouts, and its placement above Claude Opus 4.7 on a cost-adjusted basis (it ranks third on the raw index) is the strongest evidence so far that vertically integrated coding agents, rather than general-purpose chat models, are the frontier for software-engineering work. Several arXiv papers this week (see "From Model Scaling to System Scaling" below) make the same argument theoretically.

How it was discussed
  • Artificial Analysis: Composer 2.5 third on Coding Agent Index, ~10-60× cost reduction vs. rivals.
  • Last Week in AI: vertical-integration thesis — Cursor's harness advantage explains the price-performance gap as much as the base model does.
cursor composer coding-agents xai grok-build
#4
Industry · 2026-05-25 · TechCrunch — AI · 7.3 · 7.0/8.0/6.8

The nine-year-old productivity-software startup ClickUp announced a mass layoff this week, simultaneously stating that it is replacing the affected staff (hundreds of employees in support, customer success, and internal operations functions) with what the company described as "thousands of AI agents." The announcement is the most explicit public statement to date of a software company sizing its workforce on a per-agent basis rather than a per-employee basis, and it lands in the same week as the Pope's encyclical, which specifically warns that large-scale AI labor displacement is a coming moral problem the field has not solved. ClickUp's framing, that agent supply rather than labor supply is now the binding constraint, is what economists and labor watchers have been waiting to see in the wild; whether the move actually scales remains to be observed across the next two quarters.

clickup labor automation layoffs
#5
Industry · 2026-05-22 · Last Week in AI · 7.2 · 6.8/7.5/7.3

Elon Musk's lawsuit against OpenAI was dismissed on statute-of-limitations grounds — the court found he had waited too long to sue over OpenAI's transition from its founding non-profit structure. The Last Week in AI hosts framed the dismissal as effectively closing Musk's primary legal lever against OpenAI's commercial trajectory, leaving xAI's competitive pressure as the remaining route. Separately and more consequentially, Anthropic agreed to a $30 billion funding round at a $900 billion valuation and is projecting its first profitable quarter, while Cerebras's IPO surged roughly 90% on debut, reinforcing that AI-adjacent capital flows have not cooled despite broader tech multiples compressing. OpenAI-Apple partnership tensions also surfaced in the same news cycle, with reporting suggesting both sides are renegotiating terms of the original integration deal.

openai musk anthropic cerebras valuation
#6
AI for Science · 2026-05-22 · Last Week in AI · 7.1 · 7.5/7.0/6.8

OpenAI announced this week that one of its internal reasoning systems produced a verified solution to an open Erdős problem in combinatorial geometry that had been unsolved since the mid-1940s. The Last Week in AI hosts gave the result roughly the same weight as DeepMind's earlier IMO and FrontierMath wins: a frontier model crossing a verifiable mathematical milestone that had resisted both human and prior-model effort. Independent verification of the proof is underway; the specific Erdős problem and the agent's full proof chain were not yet released at time of recording. The pattern — frontier labs racing to claim Erdős-era open problems as benchmark milestones — has accelerated noticeably over the past six months and is part of a broader shift in how labs publicly signal capability gains, away from chat-quality demos and toward verifiable mathematical wins.

openai erdos math reasoning
#7
Generative Media · 2026-05-25 · AK (@_akhaliq) Daily Papers, arXiv cs.CV, arXiv — Efficiency, arXiv — Generative Media / Diffusion, arXiv — Reinforcement Learning, Hugging Face Daily Papers · 6.7 · 7.2/6.5/6.4

Adversarial Flow Distillation (AFD) is an on-policy framework for distilling strong black-box video teachers into causal autoregressive students. The student rolls out under its own distribution while a prompt-paired Bradley-Terry discriminator scores teacher vs. student outputs on the same prompts; the resulting on-policy advantage is converted into forward-process flow-matching updates on the student's own noised states. The result: dense velocity-field supervision without requiring teacher scores, latents, denoising trajectories, step alignment, or reverse-chain RL. Two causal AR student families show consistent motion- and physics-sensitive improvements over off-policy SFT and prior adversarial baselines.
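
The prompt-paired Bradley-Terry comparison can be illustrated with a minimal sketch. It assumes the discriminator emits one scalar score per video and is trained with the standard Bradley-Terry log-loss on pairs that share a prompt; the function names are hypothetical, and how AFD turns this signal into flow-matching updates is not shown here.

```python
import math


def bt_prob_teacher_preferred(score_teacher: float, score_student: float) -> float:
    """Bradley-Terry probability that the teacher output beats the student output.

    Both scores are assumed to come from the same discriminator on the same prompt.
    """
    return 1.0 / (1.0 + math.exp(-(score_teacher - score_student)))


def bt_discriminator_loss(score_teacher: float, score_student: float) -> float:
    """Negative log-likelihood of the label 'teacher preferred' for one prompt pair."""
    return -math.log(bt_prob_teacher_preferred(score_teacher, score_student))


if __name__ == "__main__":
    # Equal scores mean the discriminator cannot tell the two apart.
    print(bt_prob_teacher_preferred(0.0, 0.0))
    # A larger teacher-minus-student margin gives a smaller loss.
    print(bt_discriminator_loss(2.0, 0.0))
```

Because only the score difference enters the probability, the discriminator needs no access to teacher internals, which is consistent with the black-box setting described above.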

How it was discussed
  • Cross-listed by HF Daily and AK's daily papers, picked up across five arXiv categories — the multi-source signal here reflects topical breadth (RL, distillation, generative media) more than community consensus.
video-generation distillation flow-matching rl
#8
Agents & Tool Use · 2026-05-22 · AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers · 6.7 · 7.0/6.8/6.3

SkillOpt frames agent-skill improvement as a text-space optimization problem analogous to weight-space training: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and edits are accepted only when they strictly improve a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta updates make the procedure reproducible. Claimed as the first systematic controllable text-space optimizer for agent skills, with results showing consistent improvements on GPT-class agents under feedback signals where prior 'self-revision' baselines drifted.
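
The accept-only-on-strict-improvement rule lends itself to a short sketch. Everything named here is hypothetical: the `Edit` tuple, `apply_edit`, `optimize_skill`, the character budget standing in for the textual learning rate, and the precomputed `proposals` list standing in for the optimizer model. It illustrates the acceptance logic under those assumptions and is not the authors' implementation.

```python
from typing import Callable, List, Tuple

# (op, target, replacement) with op in {"add", "delete", "replace"}; hypothetical format.
Edit = Tuple[str, str, str]


def apply_edit(doc: str, edit: Edit) -> str:
    """Apply one bounded edit to the skill document."""
    op, target, replacement = edit
    if op == "add":
        return doc + "\n" + replacement
    if op == "delete":
        return doc.replace(target, "", 1)
    if op == "replace":
        return doc.replace(target, replacement, 1)
    raise ValueError(f"unknown op: {op}")


def optimize_skill(
    doc: str,
    proposals: List[Edit],
    validate: Callable[[str], float],
    max_edit_chars: int = 200,
) -> Tuple[str, float, List[Edit]]:
    """Keep an edit only if it strictly improves the held-out validation score."""
    best_score = validate(doc)
    rejected: List[Edit] = []  # rejected-edit buffer
    for edit in proposals:
        if len(edit[2]) > max_edit_chars:  # assumed stand-in for the learning-rate budget
            rejected.append(edit)
            continue
        candidate = apply_edit(doc, edit)
        score = validate(candidate)
        if score > best_score:  # strict improvement, ties are rejected
            doc, best_score = candidate, score
        else:
            rejected.append(edit)
    return doc, best_score, rejected
```

With a toy validator that counts required keywords, an edit that adds no keyword ties the current score and lands in the rejected buffer, while edits that raise the score are kept.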

agents skills self-evolution
#9
Efficiency · 2026-05-20 · AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers · 6.6 · 6.8/6.5/6.5

Lens is a 3.8B-parameter text-to-image model that matches or surpasses state-of-the-art 6B+ models on standard T2I benchmarks while using only ~19.3% of the training compute Z-Image required. The efficiency gains come from two strategies: (1) Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions average 109 words and are generated by GPT-4.1, providing richer per-batch semantic supervision than conventional short captions; and (2) constructing each batch from images with multiple resolutions to maximize information density per gradient step. The result is a compelling data-quality-over-parameter-count argument for the next generation of T2I models.

text-to-image training-efficiency data-quality
#10
Evaluations & Benchmarks · 2026-05-25 · AK (@_akhaliq) Daily Papers, arXiv cs.AI, arXiv — Evals & Benchmarks, Hugging Face Daily Papers · 6.6 · 6.8/6.5/6.5

Claw-Anything expands the agent-evaluation surface along three dimensions current benchmarks neglect: long-horizon activity histories, interdependent backend services, and integrated GUI+CLI interaction across multiple devices. The benchmark simulates months of user activity via multi-round event injection to produce complex world states, then probes the agent's ability to act in that broader context. The framing matches the same week's product news (Gemini Spark, Perplexity Personal Computer), where 'always-on with broad access' is the explicit product target, so this benchmark has an unusually high probability of becoming a standard reference for the category.

benchmark agents personal-assistants
#11
Generative Media · 2026-05-25 · AK (@_akhaliq) Daily Papers, arXiv cs.CV, arXiv — Efficiency, arXiv — Generative Media / Diffusion, arXiv — Reinforcement Learning, Hugging Face Daily Papers · 6.6 · 6.8/6.4/6.5

RTDMD is a two-stage framework unifying distribution-matching distillation with reward-guided RL for few-step flow generators. The authors show that minimizing KL divergence to a reward-tilted teacher distribution decomposes cleanly into a distribution-matching term and a reward-maximization term. Stage 1 (Ambient-Consistent Distribution Matching Distillation, AC-DMD) does subinterval-wise distribution matching with a consistency regularizer on the fake-score objective; stage 2 layers reward-guided RL on the distilled few-step generator. Empirically, the combined method aligns few-step diffusion models with human preferences more cleanly than either DMD or reward-only RL fine-tuning in isolation.
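
The decomposition can be written out under one common convention. The symbols below (teacher density $p_T$, student $q_\theta$, reward $r$, temperature $\beta$) and the exponential form of the tilt are assumptions for illustration; the paper's own notation and tilt may differ.

```latex
p^{*}(x) = \frac{1}{Z}\, p_T(x)\, \exp\!\big(r(x)/\beta\big),
\qquad
Z = \mathbb{E}_{x \sim p_T}\big[\exp\!\big(r(x)/\beta\big)\big]

\mathrm{KL}\big(q_\theta \,\|\, p^{*}\big)
  = \mathbb{E}_{x \sim q_\theta}\!\big[\log q_\theta(x) - \log p_T(x) - r(x)/\beta\big] + \log Z
  = \underbrace{\mathrm{KL}\big(q_\theta \,\|\, p_T\big)}_{\text{distribution matching}}
    \;-\; \underbrace{\tfrac{1}{\beta}\,\mathbb{E}_{x \sim q_\theta}\big[r(x)\big]}_{\text{reward maximization}}
    \;+\; \log Z
```

Since $\log Z$ does not depend on $\theta$, minimizing the left-hand side amounts to matching the teacher while maximizing expected reward, with $\beta$ setting the trade-off.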

diffusion rl distillation
#12
Generative Media · 2026-05-25 · AK (@_akhaliq) Daily Papers, arXiv cs.AI, arXiv cs.CV, arXiv — Efficiency, arXiv — Generative Media / Diffusion, Hugging Face Daily Papers · 6.5 · 6.7/6.4/6.4

Channel-wise Vector Quantization (CVQ) replaces patch-wise tokens with channel-wise tokens: each channel of the feature map gets a discrete token, producing a representation of an image as 'discrete levels of visual detail' rather than a grid of spatial patches. A new visual autoregressive framework (Channel-wise Autoregressive, CAR) predicts image channels in sequence — sketching global structure first, then refining fine-grained attributes — and the resulting next-channel-prediction paradigm shows quality and efficiency gains over patch-AR baselines at matched parameter counts.

vector-quantization image-generation autoregressive
#13
Multimodal · 2026-05-25 · AK (@_akhaliq) Daily Papers, arXiv — Agents / Tool Use, arXiv cs.CV, arXiv — Evals & Benchmarks, Hugging Face Daily Papers · 6.5 · 6.5/6.5/6.5

InstructSAM formulates instruction-driven instance segmentation as a set-structured query prediction problem with an explicit reasoning-to-instance query interface between a VLM and SAM3. A bank of learnable instance queries is injected into the VLM and contextualized with both instruction and visual tokens, and a hybrid attention mechanism promotes interaction among queries, vision tokens, and instructions to reduce duplicate predictions. The framework is competitive with task-specific segmentation pipelines while removing the need for closed instruction vocabularies.

sam instance-segmentation vlm
#14
Reinforcement Learning · 2026-05-25 · arXiv cs.AI, arXiv cs.CV, arXiv cs.LG, arXiv — Generative Media / Diffusion, arXiv — Reinforcement Learning · 6.5 · 6.5/6.4/6.5

AdvantageFlow optimizes a forward-process advantage-weighted prediction loss for rectified-flow models, in contrast to Flow-GRPO which optimizes the reverse process. The objective is unstable under negative advantages (non-convex), and the authors stabilize it via rollout-policy regularization derived from fitting a local reward-improving target distribution. Evaluated on Stable Diffusion 3.5 Medium image-generation tasks, AdvantageFlow outperforms both Flow-GRPO and a state-of-the-art forward-process negative-aware fine-tuning baseline.

rl flow-models stable-diffusion
#15
AI for Science · 2026-05-25 · arXiv — Agents / Tool Use, arXiv cs.LG, arXiv — Evals & Benchmarks · 6.5 · 6.7/6.5/6.3

DiscoverPhysics is an interactive benchmark with 22 simulated worlds whose laws of motion deviate from ours (screened/fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, time-varying interactions). An LLM agent proposes experiments, observes raw N-body trajectory data, and must submit both a natural-language theory of the world's physics and a Python implementation. The setup directly probes 'genuine reasoning vs. recall of established science' — a gap many physics-evaluation results have failed to disentangle. Frontier models, the authors report, do significantly worse than their established-physics evals would predict, suggesting most of their physics performance is recall.
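
To make "fractional-power gravity" concrete, here is a minimal sketch of pairwise accelerations when the attractive force falls off as 1/r^p for an arbitrary exponent p. The function name, the 2D setup, and the parameters `power` and `G` are illustrative assumptions; the benchmark's actual simulators are not described in this summary.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float]


def accelerations(
    pos: List[Vec], mass: List[float], power: float = 2.0, G: float = 1.0
) -> List[Vec]:
    """Acceleration on each body when the attractive force scales as 1 / r**power.

    power=2.0 recovers Newtonian gravity; other values give a non-standard law.
    """
    acc: List[Vec] = []
    for i, (xi, yi) in enumerate(pos):
        ax = ay = 0.0
        for j, (xj, yj) in enumerate(pos):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            r = math.hypot(dx, dy)
            magnitude = G * mass[j] / r ** power
            # Project the magnitude onto the unit vector pointing from body i to body j.
            ax += magnitude * dx / r
            ay += magnitude * dy / r
        acc.append((ax, ay))
    return acc


if __name__ == "__main__":
    bodies = [(0.0, 0.0), (2.0, 0.0)]
    print(accelerations(bodies, [1.0, 1.0], power=2.0))  # Newtonian
    print(accelerations(bodies, [1.0, 1.0], power=1.5))  # fractional power
```

An agent that only recalls the inverse-square law would mispredict trajectories generated with `power=1.5`, which is the kind of gap between reasoning and recall the benchmark is meant to expose.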

benchmark physics scientific-discovery agents
#16
Agents & Tool Use · 2026-05-25 · arXiv — Agents / Tool Use, arXiv cs.AI, arXiv cs.CL, arXiv — Evals & Benchmarks, arXiv — Reinforcement Learning · 6.5 · 6.5/6.5/6.4

LegalSearch-R1 pairs local statute RAG for precise article matching with online web search, trained end-to-end with RL on a temporal-consistency reward that punishes retroactive statute application — the bug the authors identify as the dominant failure mode of current legal LLM agents. Search agents in their evaluation rarely incorporate temporal constraints into queries, and base LLMs are anchored to training-cutoff law; the RL-trained pairing significantly improves both citation precision and time-appropriate-law selection on a held-out legal benchmark.
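
A temporal-consistency reward of the kind described could look roughly like the sketch below. The `Citation` record, its field names, and the averaging scheme are hypothetical; the summary does not specify the paper's actual reward, so this only illustrates the idea of penalizing a statute version that was not in force on the date of the facts.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Citation:
    article_id: str
    effective_from: date
    repealed_on: Optional[date] = None  # None means still in force


def in_force(c: Citation, on: date) -> bool:
    """True if the cited version applied on the given date."""
    return c.effective_from <= on and (c.repealed_on is None or on < c.repealed_on)


def temporal_consistency_reward(
    citations: List[Citation], event_date: date, penalty: float = 1.0
) -> float:
    """Average of +1 per time-appropriate citation and -penalty per anachronistic one."""
    if not citations:
        return 0.0
    total = sum(1.0 if in_force(c, event_date) else -penalty for c in citations)
    return total / len(citations)


if __name__ == "__main__":
    old = Citation("Art. 12 (2010 text)", date(2010, 1, 1), date(2020, 1, 1))
    new = Citation("Art. 12 (2020 text)", date(2020, 1, 1))
    # Facts from 2015: citing the 2020 text would be retroactive application.
    print(temporal_consistency_reward([old], date(2015, 6, 1)))
    print(temporal_consistency_reward([new], date(2015, 6, 1)))
```

Keying the check to the date of the facts rather than the date of the query is what separates this from a simple "latest version" lookup.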

legal-ai rag temporal-reasoning rl
#17
Agents & Tool Use · 2026-05-22 · AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers · 6.5 · 6.7/6.5/6.3

Foundation Protocol (FP) is a graph-first coordination layer unifying heterogeneous entities — agents, tools, resources, humans, institutions, organizations — and supporting native multi-party organization and event-based collaboration. It provides economic primitives (metering, settlement), identity/accountability primitives, and a discoverable namespace. The motivating thesis: as agents move from tools to social infrastructure, the bottleneck shifts from raw capability to coordination, and the field needs primitives at the protocol layer rather than ad-hoc per-platform glue.

multi-agent coordination protocol
#18
Post-Training · 2026-05-25 · AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers · 6.5 · 6.6/6.5/6.3

DVAO targets a real failure mode in production RLHF/GRPO setups: standard scalarization (reward combination, advantage combination) either produces oversize squared advantages that destabilize training, or relies on static hyperparameters that ignore cross-objective correlations. DVAO dynamically adjusts the variance contribution of each reward component, producing more stable training and better multi-objective alignment than naive scalarization baselines on a suite of held-out evaluations.
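
One way to see why the two standard scalarizations behave differently is to compare them on correlated rewards. The sketch below illustrates that failure mode as this summary describes it and is not DVAO itself; the function names and the GRPO-style group standardization (population standard deviation within one rollout group) are assumptions.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def standardize(xs: Sequence[float]) -> List[float]:
    """Zero-mean, unit-variance normalization within one rollout group."""
    mu, sigma = fmean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def reward_combination(rewards: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted-sum the rewards first, then standardize once."""
    n = len(rewards[0])
    combined = [sum(w * r[i] for w, r in zip(weights, rewards)) for i in range(n)]
    return standardize(combined)


def advantage_combination(rewards: List[List[float]], weights: List[float]) -> List[float]:
    """Standardize each reward separately, then weighted-sum the advantages."""
    per_reward = [standardize(r) for r in rewards]
    n = len(rewards[0])
    return [sum(w * a[i] for w, a in zip(weights, per_reward)) for i in range(n)]


def mean_square(xs: Sequence[float]) -> float:
    """Mean squared advantage, a proxy for the scale of the policy-gradient signal."""
    return fmean(x * x for x in xs)
```

With two perfectly correlated rewards and unit weights, `advantage_combination` yields a mean squared advantage of 4 while `reward_combination` stays at 1; with perfectly anti-correlated rewards the summed advantages cancel to roughly 0. Static weights cannot track that dependence on cross-objective correlation, which is the gap the paper says it addresses.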

rl rlhf grpo multi-reward
#19
Evaluations & Benchmarks · 2026-05-21 · AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers · 6.5 · 6.5/6.5/6.4

VGenST-Bench actively synthesizes evaluation videos with generative models rather than relying on static curated clips, enabling controlled variation along a 3×2×2 taxonomy (spatial scale, perspective, temporal structure). A multi-agent generation pipeline with a human QC stage maintains video quality; the benchmark probes the spatio-temporal reasoning capabilities of MLLMs at a precision that curated-video benchmarks cannot match.

benchmark video spatial-reasoning
#20
Research · 2026-05-20 · AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers · 6.5 · 6.6/6.4/6.4

A study of how diffusion transformers route information across layers, identifying that most cross-layer signal flows through a small set of attention pathways that can be explicitly identified and pruned or strengthened. The authors propose a routing-aware training modification that produces measurable quality gains at matched compute by concentrating capacity along the discovered pathways.

diffusion-transformers interpretability routing
#21
Agents & Tool Use · 2026-05-22 · AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers · 6.4 · 6.5/6.4/6.3

QUEST is a family of open deep-research agents (2B–35B parameters) trained via a recipe combining mid-training, SFT, and RL on fully synthetic long-horizon search tasks. The authors claim strong capabilities on fact-seeking, citation grounding, and report synthesis, the three subskills that proprietary systems (OpenAI Deep Research, Perplexity Computer, Anthropic Skills) currently dominate. The open-weights release is the practically interesting bit: performance comparable to QUEST-class systems was previously available only behind APIs.

deep-research open-weights synthetic-data
#22
Audio & Speech · 2026-05-22 · AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers · 6.4 · 6.5/6.3/6.3

StepAudio 2.5 is a unified audio-language foundation model that matches or exceeds specialized systems across ASR, TTS, and realtime spoken interaction. The premise: once text and audio share a multimodal representational space, task specialization is a matter of operational regimes — different prompting, decoding, and conditioning — not architecturally distinct models. Concrete numerical wins are claimed against specialized baselines (WER on ASR, MOS on TTS, end-to-end latency on dialog).

audio-language asr tts speech
#23
Evaluations & Benchmarks · 2026-05-25 · arXiv — Agents / Tool Use, arXiv cs.CL, arXiv — Evals & Benchmarks · 6.4 · 6.5/6.5/6.2

Auto Benchmark Audit (ABA) is an agentic framework that systematically audits benchmark tasks, uncovering implicit assumptions, incomplete environment specifications, brittle evaluation logic, and incorrect ground truths. Run across 168 benchmarks (nine domains) drawn from frontier LLM evals and NeurIPS publications, ABA flags critical issues in over 25.7% of tasks — including ambiguous design, execution conflicts, and incorrect ground truths. The result is a concrete, large-scale rebuttal to taking benchmark scores at face value.

benchmark auditing evals
#24
Generative Media · 2026-05-25 · AK (@_akhaliq) Daily Papers, arXiv cs.CV, Hugging Face Daily Papers · 6.4 · 6.5/6.4/6.2

TriSplat is a feed-forward 3D reconstruction network that uses oriented triangle primitives instead of Gaussians, exporting simulation-ready meshes directly from a single forward pass on sparse views — no post-hoc mesh-extraction step. Works in pose-free settings (scene structure and camera parameters estimated jointly), making it suitable for downstream physics simulation and embodied interaction without the expensive conversion that breaks Gaussian-splatting's feed-forward promise.

3d-reconstruction gaussian-splatting triangle-primitives
#25
Evaluations & Benchmarks · 2026-05-25 · AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers · 6.4 · 6.5/6.4/6.2

WBench evaluates interactive world models along five dimensions: video quality, setting adherence, interaction adherence, consistency, and physics compliance. 289 test cases / 1,058 interaction turns spanning diverse scenes, styles, first-/third-person perspectives, and four interaction types (navigation, subject action, event editing, perspective switching). Lands the same week Google updated Genie to drive Street View — likely to be cited as the standard interactive-world-model eval in the next 6 months.

world-models benchmark video
#26
AI for Science · 2026-05-20 · AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers · 6.4 · 6.5/6.5/6.2

SciAtlas is a large-scale multi-disciplinary heterogeneous academic knowledge graph integrating 43M+ papers, designed to give research agents topological reasoning over scientific knowledge rather than relying on keyword or vector-similarity retrieval. The argument: vector-RAG-based research agents are prone to logical hallucinations and high inference cost; KG-grounded retrieval over a panoramic scientific evolution network produces qualitatively different (more grounded) deep-research output. Open release positions it as infrastructure for next-generation scientific-research agent stacks.

knowledge-graph scientific-research agents
#27
Agents & Tool Use · 2026-05-25 · arXiv — Agents / Tool Use, arXiv cs.CL, arXiv — Evals & Benchmarks · 6.4 · 6.5/6.4/6.2

PolyGnosis 2.0 quantifies the efficacy of harness-engineering techniques (reflection loops, tool calling, divide-and-conquer partitioning, chain-of-thought) on extracting predictive signal from the gap between Polymarket sentiment and GDELT-derived OSINT, which the authors term 'Perspective Mismatches.' Empirically, divide-and-conquer harness structure dominates raw model intelligence on this task, reinforcing the system-scaling thesis (rank 28).

prediction-markets osint harness-engineering
#28
Agents & Tool Use · 2026-05-25 · arXiv — Agents / Tool Use, arXiv cs.AI, arXiv cs.LG · 6.3 · 6.5/6.3/6.2

A position paper arguing that the next major bottleneck in agentic AI is system scaling, not model scaling: the design of auditable, persistent, modular, verifiable architectures around foundation models. Treats the harness — memory, retrieval, tool use, orchestration, verification, governance — as a first-class object of design and evaluation rather than 'secondary implementation detail.' Mirrors the empirical signal from Cursor Composer 2.5 above.

agents harness system-design
#29
Agents & Tool Use · 2026-05-24 · AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers · 6.3 · 6.4/6.3/6.2

Macaron-A2UI argues static plain-text chat is the binding bottleneck for personal agents and presents a model family (30B/235B/754B, LoRA-fine-tuned) for generating natural language plus lightweight executable UI actions — information collection, preference refinement, confirmation, multi-goal organization — synthesized dynamically from interaction context. Trained on a large heterogeneous dialogue corpus with a dedicated A2UI-Bench evaluation; positions generative UI as a near-term necessary product layer for personal-agent products like Gemini Spark.

generative-ui agents personal-assistants
#30
Generative Media · 2026-05-22 · AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers · 6.3 · 6.4/6.3/6.2

PiD reformulates the latent-to-pixel decoder of high-resolution T2I systems as a conditional pixel diffusion model — denoising directly in high-resolution pixel space — and unifies decoding with 4× and 8× upsampling in one generative module. Replaces the reconstruction-oriented decoder, which becomes prohibitively expensive at megapixel scale, with a more expressive synthesis-oriented one.

diffusion decoder upsampling
#31
Multimodal · 2026-05-25 · arXiv — Agents / Tool Use, arXiv cs.CL, arXiv cs.CV · 6.2 · 6.3/6.3/6.1

STORMS teaches LVLMs to reason through bounded continuous latent trajectories rather than externalizing reasoning via textual chain-of-thought, keyframe selection, or repeated frame reinsertion. Two-stage training internalizes the spatial-temporal reasoning, eliminating the inference-latency and engineering-complexity costs of the textual-CoT pipelines while matching or exceeding their accuracy on motion-tracking and temporal-order tasks.

video latent-reasoning lvlm
#32
Agents & Tool Use · 2026-05-25 · arXiv — Agents / Tool Use, arXiv cs.CL, arXiv — Evals & Benchmarks · 6.2 · 6.3/6.3/6.1

ProAct is a proactive agent architecture that uses idle time between user interactions to predict and prefetch likely upcoming user needs, iteratively acquiring information and resolving knowledge gaps before the user initiates a query. Comes with ProActEval, a benchmark for proactive capability. The architecture maps onto the always-on assistant product surface and pairs naturally with Claw-Anything (rank 10) as the eval framework.

agents proactive idle-compute
#33
Robotic Autonomy · 2026-05-25 · arXiv cs.CV, arXiv cs.RO, arXiv — Generative Media / Diffusion · 6.2 · 6.3/6.3/6.0

AnyScene generates semantic occupancy sequences from BEV layouts via a Spatial-Temporal Occupancy Diffusion Transformer that jointly tokenizes BEV and occupancy features autoregressively. Targets the long-tail safety-critical scenario problem in end-to-end autonomous driving simulation — replaces shallow conditioning and reference-frame-dependent video synthesis with arbitrary BEV-layout control. Pairs with Genie-on-Street-View (Google I/O) as the open-research version of the same world-model trend.

autonomous-driving world-models diffusion
#34
Industry · 2026-05-26 · Last Week in AI · 5.8 · 5.5/6.0/6.0

The Last Week in AI episode that consolidated the past week's news cycle: Google I/O 2026, the Cursor/xAI coding-agent race, Musk's lost OpenAI suit, Anthropic at $900B, Cerebras IPO, OpenAI's Erdős result, plus interpretability findings on redundant circuits, Terminal World benchmarks for agents, Take It Down Act enforcement on deepfakes, and demonstrations of autonomous hacking and self-replication. The episode is the most efficient single-source summary of the week's industry-side news; many items in this digest derive from threads it surfaced.

podcast industry-news weekly-summary
#35
Government & Defense · 2026-05-26 · War on the Rocks · 5.7 · 5.5/6.0/5.6

War on the Rocks essay arguing that AI-mediated automation is the keystone of any credible US economic-statecraft strategy to reshore manufacturing — and that current US industrial policy underweights the AI-deployment leg of the equation. The piece is positioned in the defense/national-security policy space and reflects a broader conversation about whether the AI capital surge will produce US-domiciled manufacturing capacity or whether it will replicate the offshoring dynamics of the 2000s in software instead of hardware.

manufacturing policy economic-statecraft
Items: 35
Multi-source: 30
Long-form (≥7.5): 3
Sources OK / attempted: 117 / 119
Top category: Agents & Tool Use (8 items)