Wolf Digest — 2026-05-15

#1

Anthropic + Gates Foundation: $200M, 4-year partnership on global health, education, and economic mobility

Industry 2026-05-14 Anthropic News 8.5 8.3/9.0/8.2

Anthropic and the Gates Foundation announced a four-year, $200 million partnership — combining grant funding, Claude usage credits, and engineering support — focused on global health and life sciences, education, and economic mobility. The work is run by Anthropic's Beneficial Deployments team, which also develops AI-related public goods like public health datasets and evaluation benchmarks and provides discounted Claude access to nonprofits and education institutions.

The largest portion of the partnership targets health in low- and middle-income countries (LMICs), where roughly 4.6 billion people lack essential health services. Concrete deliverables include healthcare-specific connectors that give Claude direct access to other platforms, along with benchmarks and evaluation frameworks for measuring AI performance on healthcare-related tasks. Anthropic and the Gates Foundation will jointly engage health ministries on using health-intelligence data for workforce deployment, supply chain management, and outbreak detection. A targeted drug-and-vaccine-screening track will start with polio, HPV (~350,000 annual deaths, 90% in LMICs), and eclampsia/preeclampsia, exploring whether Claude can compress early-stage development timelines by computationally screening candidates before pre-clinical work. A separate partnership with the Gates Foundation's Institute for Disease Modeling (IDM) will integrate Claude with IDM's malaria and tuberculosis forecasts to make them accessible to non-specialist practitioners and to help IDM build more predictive transmission models.

The education pillar funds K-12 deployments in the US, sub-Saharan Africa, and India. Public goods slated for later this year include benchmarks, datasets, and knowledge graphs to evaluate AI tools for math tutoring, college advising, and curriculum design. Foundational literacy and numeracy apps for sub-Saharan Africa and India fall under the Global AI for Learning Alliance (GAILA). On economic mobility, the partnership will build agriculture-specific improvements to Claude — local crop datasets and agricultural benchmarks released as public goods, aimed at the roughly two billion people whose incomes depend on smallholder farming. US-side commitments include portable skills/certifications records, career guidance for new entrants and retrainees, and tooling to link training-program data to employment outcomes for measuring which interventions actually improve job and wage outcomes.

The structural framing matters: as a public benefit corporation, Anthropic is committing to extend AI to areas that commercial markets alone will not reach. The Gates Foundation contributes decades of operational experience and measured impact in the partnership's priority domains, which addresses one of the field's recurring failure modes: well-funded AI initiatives that lack the implementation pathways and local relationships needed to translate model capability into outcomes. Anthropic has signaled it intends to publish thinking and results from beneficial-deployments work, including which interventions deliver the promised outcomes and which do not.

ai-for-health ai-for-education anthropic global-development

#2

PwC + Anthropic expand alliance: Claude rolling out to global workforce, 30K certifications, new Office of the CFO business unit

Industry 2026-05-15 Anthropic News 8.4 8.5/8.0/8.6

Anthropic and PwC announced a substantial expansion of their strategic alliance. Claude Code and Claude Cowork will roll out starting with PwC's U.S. teams and expand toward a global workforce of hundreds of thousands. The two firms are establishing a joint Center of Excellence and a program to train and certify 30,000 PwC professionals on Claude. PwC is also launching a new finance-focused business group — the Office of the CFO — as the first standalone PwC business unit explicitly anchored on Anthropic's technology, starting in regulated industries (banking, insurance, healthcare) where accuracy and auditability dominate.

The collaboration targets three areas where the leverage is highest. First, agentic technology build: PwC engineering teams are using Claude Code to ship production software for major companies in weeks rather than quarters, with a growing portfolio across financial services, pharma and life sciences, healthcare, and consumer markets. Second, AI-native deal-making: PwC is rebuilding deal execution — diligence, value creation, and integration — with agents working alongside deal teams, which compresses the path from thesis to value capture for private-equity sponsors and corporate acquirers. Third, reinvention of the enterprise function: PwC is running production AI-native operating models for finance, supply chain, HR, and the engineering function itself, not pilots.

The production results are the most concrete proof-points. Across active deployments, clients are reporting delivery improvements of up to 70%. Insurance underwriting cycles compressed from ten weeks to ten days, opening lines of business that were not previously economically viable. A COBOL mainframe modernization that turned out four times larger than the original scope is tracking on time and under budget. An HR transformation that had previously stalled produced a working prototype in one week, a full application in under two months, and now runs thousands of daily transactions. Cybersecurity incident response moved from hours to minutes, with agentic vulnerability operations — code review and automated containment — closing exposure windows before adversaries can exploit them. PwC is also using Claude internally for journal entries, variance analysis, RFPs, and annual planning, having taken the "Customer Zero" approach before bringing the technology to clients.

The client showcase is Advocate Health, building toward full-scale Claude deployment across its 167,000-person workforce, framed by the health system as foundational rather than incremental. The expansion sits inside Anthropic's broader Claude Partner Network (the $100 million investment in services firms that help enterprises actually deploy AI rather than pilot it), and Daniela Amodei's framing positions PwC's depth in financial services, healthcare, life sciences, and cybersecurity as essential to making the technology work at enterprise scale. The structural signal is clear: PwC is betting on Claude rather than running multi-vendor pilots, and Anthropic now has a Big Four services partner committed at a scale that complicates competitive positioning against OpenAI's enterprise go-to-market motion.

anthropic pwc enterprise-ai claude-code consulting

#3

Anthropic interpretability: Natural Language Autoencoders translate Claude's internal activations into unsupervised plain-English explanations

Interpretability 2026-05-14 Transformer Circuits Thread (Anthropic) 8.3 8.5/8.5/7.8

Anthropic's interpretability team (Fraser-Taliente, Kantamneni, Ong, et al., 2026) published a method that trains Claude to translate its own internal activations directly into natural language, producing unsupervised explanations of what individual activations represent. The framing is what makes the result notable: where sparse autoencoders decompose model activations into a learned dictionary of features whose semantics still have to be probed and named after the fact, natural language autoencoders skip the labeling bottleneck — the model emits an English description as the explanation itself, learned without supervised feature labels.

The work sits in a lineage of Anthropic interpretability releases that have steadily reduced the cost of moving from raw activations to human-interpretable concepts. Toy Models of Superposition (2022) established the polysemanticity problem; the Towards Monosemanticity and Scaling Monosemanticity papers in 2023–2024 showed sparse-autoencoder features could be extracted at scale on production-sized models; the 2025 Activation Oracles work trained models to answer questions about their own activations. Natural language autoencoders push further in that direction: rather than asking the model targeted questions about an activation, the model is trained to compress the activation into a free-form natural-language description from which the activation can be reconstructed, providing a self-supervised signal that the description preserves the salient information.

A companion release, HeadVis, is an interactive visualization tool aimed specifically at attention-head behaviors — Anthropic has been building this stack publicly since the original Mathematical Framework for Transformer Circuits, and the Transformer Circuits Thread now hosts a growing collection of tools (PySvelte, Garcon, the circuit-tracing harness) that the team uses to actually run interpretability research at scale. HeadVis appears to complement the NLA paper by giving researchers a way to inspect how individual heads route information that the autoencoder is then trying to verbalize.

The practical implication for the field is that interpretability research that previously bottlenecked on human labelers (figuring out what an SAE feature "means," cross-referencing dictionary indices against examples, building human-coded probes) can now route through model-generated explanations as a first pass, with humans checking the model's output rather than producing every label themselves. The release continues Anthropic's pattern of treating interpretability as a load-bearing input to safety work rather than a separate academic track, and the publication on Transformer Circuits Thread (rather than arXiv) signals the same incremental, interactive-article-first publication culture the team has used since the Distill Circuits Thread era.

mechanistic-interpretability anthropic sparse-autoencoders alignment-research

#4

Cerebras IPO: $5.5B raise, stock pops 108% on debut — first major AI-hardware IPO of 2026

Infrastructure 2026-05-14 TechCrunch — AI 8.0 7.8/7.5/8.7

Cerebras Systems went public on May 14 in what TechCrunch is calling the first huge tech IPO of 2026, raising approximately $5.5 billion in primary capital. The stock popped 108% on debut — closing well above the offering price — and the result is being read as a sentiment shift on AI-hardware capital markets after a multi-year stretch in which Cerebras's S-1 filings stalled on regulatory review of overseas investors (notably G42) and the conventional wisdom held that a Cerebras IPO might never close.

The valuation jump materially changes the competitive landscape for non-NVIDIA AI accelerators. Cerebras's wafer-scale engine architecture has been a niche bet — strong on extremely large single-model workloads where its 900,000-core wafer can host an entire training run without inter-chip communication overhead — but has historically struggled to displace NVIDIA in the broader market. A successful IPO at this size gives the company both a war chest for the next generation (the WSE-3 successor and supporting interconnect work) and the capital-markets credibility to compete for the very large training contracts that have been migrating toward custom silicon at the hyperscalers. The immediate read-through is bullish for other custom-silicon plays (Groq, Tenstorrent, SambaNova) and for the assumption that AI-infrastructure capital expenditure remains in expansion mode through 2026 despite the recent batch of warnings from hyperscaler CFOs about CapEx sustainability.

The day was also significant for what it signals about tech IPO appetite generally: after the long drought, a hardware company with a complicated regulatory backstory clearing the offering at a 108% pop pulls forward the IPO calendar for the AI cohort waiting in the wings (Databricks, xAI's commercial unit, Anthropic if it ever chooses that path, the SpaceXAI consumer business). Critics flag that wafer-scale economics still depend on customers who can saturate the chip — primarily frontier labs and large training shops — and the question of whether Cerebras's revenue mix is durable enough to justify a public valuation will be relitigated every quarter.

cerebras ipo ai-hardware infrastructure wafer-scale

#5

SDAR: Self-Distilled Agentic RL gates on-policy distillation as auxiliary objective, beats GRPO by 7-10pts on agent benchmarks

Reinforcement Learning 2026-05-14 AK (@_akhaliq) Daily Papers, arXiv — Agents / Tool Use, arXiv cs.AI, arXiv cs.CL, arXiv cs.LG 7.4 7.5/7.0/7.7

SDAR (Self-Distilled Agentic Reinforcement Learning) addresses the instability of stacking On-Policy Self-Distillation (OPSD) on top of trajectory-level RL for multi-turn LLM agents. The compounding error of multi-turn rollouts destabilizes naive OPSD supervision, and skill-conditioned privileged guidance produces asymmetric teacher rejections whose treatment matters. SDAR keeps RL as the primary backbone and gates OPSD as a sigmoid-controlled auxiliary objective on detached token-level signals — strengthening distillation on teacher-endorsed positive-gap tokens, softly attenuating negative teacher rejections. On Qwen2.5/Qwen3 across ALFWorld, WebShop, and Search-QA, SDAR posts +9.4 ALFWorld, +7.0 Search-QA, +10.2 WebShop-Acc over GRPO, avoids GRPO+OPSD instability, and beats hybrid baselines across model scales.
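
One way to picture the sigmoid gate described above, as a minimal sketch and not the paper's actual loss. It assumes a per-token gap defined as teacher log-probability minus student log-probability; the names `gap`, `k`, and `lam` and the form of the distillation term are illustrative assumptions, not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_weights(teacher_logp, student_logp, k=5.0):
    """Per-token gate in (0, 1).

    Assumption: gap = teacher log-prob minus student log-prob. In a real
    training loop the gap would be detached so the gate acts as a fixed weight.
    """
    return [sigmoid(k * (t - s)) for t, s in zip(teacher_logp, student_logp)]


def total_loss(rl_loss, teacher_logp, student_logp, lam=0.1, k=5.0):
    """RL loss stays primary; the gated distillation term is auxiliary."""
    gates = gate_weights(teacher_logp, student_logp, k)
    # Hypothetical distillation term: gate-weighted negative student log-prob.
    distill = sum(g * (-s) for g, s in zip(gates, student_logp)) / len(gates)
    return rl_loss + lam * distill
```

With a positive gap the gate approaches 1 and the token receives close to full distillation weight; with a strongly negative gap (a teacher rejection) the gate approaches 0, so the term is attenuated smoothly instead of being cut off.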

How it was discussed

arXiv abstract frames the contribution as gating OPSD by token-level confidence rather than treating it as a primary loss.
AK's Daily Papers writeup foregrounded the ALFWorld/WebShop deltas as the practitioner-relevant result, given how stubborn those benchmarks have been to incremental RL methods.

agentic-rl on-policy-distillation grpo alfworld webshop

#6

OpenAI: Codex from anywhere — ChatGPT mobile gains real-time Codex steering across devices and remote environments

AI Coding 2026-05-14 OpenAI Research, TechCrunch — AI 7.2 7.0/6.5/8.2

OpenAI shipped Codex availability inside the ChatGPT mobile app: users can monitor, steer, and approve long-running coding tasks from a phone, with the agent operating across remote environments while the user supervises asynchronously. TechCrunch's coverage frames this as the natural extension of the Codex-on-Windows-sandbox work that landed earlier in the week: the previous step gave Codex its own controlled execution surface, and the mobile launch makes the human supervision layer device-portable. The pattern matches what Latent Space's daily AINews flagged earlier: Codex usage curves are rising, and OpenAI is investing in keeping the human-in-the-loop interface available wherever the developer happens to be.

How it was discussed

OpenAI's own post emphasizes parity with desktop Codex and remote-environment streaming.
TechCrunch frames it as platform-strategy: keeping coding agents tied to ChatGPT rather than letting them drift to standalone surfaces.

openai codex ai-coding agents mobile

#7

DiffusionOPD: unified on-policy distillation across diffusion tasks resolves cross-task interference in multi-task RL

Generative Media 2026-05-14 AK (@_akhaliq) Daily Papers, arXiv cs.LG, arXiv — Efficiency, arXiv — Evals & Benchmarks, arXiv — Generative Media 7.0 7.0/7.0/6.9

DiffusionOPD presents a unified perspective on on-policy distillation for diffusion-based text-to-image models, addressing the cross-task interference and imbalance that joint multi-task RL fine-tuning suffers from. The authors recast existing single-task diffusion-RL methods inside an OPD framework, identifying gradient variance and reward-scale mismatch across tasks as the dominant failure modes. Their unified estimator stabilizes multi-task training while maintaining per-task quality, with reported gains on a multi-task image-quality and aesthetics benchmark suite.

diffusion on-policy-distillation text-to-image multi-task-rl

#8

Perplexity Computer + Snowflake: natural-language data-science App Connector, mirrors internal Slackbot running 3K queries/week

Agents & Tool Use 2026-05-14 Perplexity AI 7.0 6.8/7.0/7.2

Perplexity shipped an App Connector for Snowflake inside Computer (its multi-model agent product) that lets non-data-team users query enterprise warehouses in plain language. Architecture mirrors Perplexity's internal Slackbot — a shared semantic layer encodes table relevance, key-metric definitions, and natural-language-to-SQL translations; the company says employees now run up to 3,000 internal queries per week across product, analytics, growth, infrastructure, comms/marketing, support, and security. Security respects Snowflake's Role-Based Access Control with read-only and per-database/schema/table scoping; User OAuth recommended so each user's view is bounded by their existing permissions. Data-map generation takes up to 90 minutes; available now on Pro, Max, and Enterprise. Data is not used to train models.

perplexity computer-agent snowflake enterprise-data nl2sql

#9

Achieving gold-medal Olympiad reasoning via simple unified scaling — single recipe matches IMO/IPhO gold across families

Frontier LLMs 2026-05-13 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.9 6.9/7.0/6.9

Several recent reasoning systems have reached gold-medal performance on IMO and IPhO problems via heterogeneous scaling recipes (longer CoT, multi-sample selection, custom verifiers). This paper proposes a single unified scaling protocol — same compute envelope, same prompting structure, same verifier handling — and shows it reproduces gold-medal-level performance on both IMO and IPhO benchmarks across model families. The framing argues that much of the recent benchmark progress is attributable to test-time-compute discipline rather than recipe-specific tricks.

math-reasoning test-time-scaling imo ipho verifiers

#10

Cisco cuts ~4,000 jobs to fund AI investment, posts record quarterly revenue

Industry 2026-05-14 TechCrunch — AI 6.9 6.5/7.0/7.2

Cisco announced a layoff of roughly 4,000 employees alongside record quarterly revenue, with CEO Chuck Robbins framing the headcount cut as a reallocation toward AI investment. The pattern matches the broader corporate playbook of 2026 — strong top-line tied to AI orders (Cisco previously reported 18% growth in AI-tied orders), aggressive workforce reshaping toward AI-capable headcount, public messaging that pairs cost discipline with reinvestment. It is the company's latest round in a multi-quarter series.

cisco layoffs ai-infrastructure enterprise-it

#11

ATLAS: agentic vs latent visual reasoning — one-word control token toggles between modes within a single VLM

Multimodal 2026-05-14 AK (@_akhaliq) Daily Papers, arXiv — Agents / Tool Use, arXiv cs.AI, arXiv cs.CL, arXiv — Evals & Benchmarks 6.7 6.7/6.5/6.9

ATLAS unifies the agentic (generate images during reasoning) and latent (keep reasoning in feature space) approaches to multimodal reasoning. The authors show that a single token signal at decode time switches a VLM between the two modes, avoiding the computational cost of always materializing visual states while preserving the gains when explicit visual generation actually helps. The empirical claim is that one mode is not strictly dominant — task structure determines which works better, and a learned controller picks correctly more than 80% of the time.

vlm visual-reasoning agentic-reasoning latent-reasoning

#12

MemEye: visual-centric multimodal agent memory benchmark — most questions answerable from captions alone in prior work

Evaluations & Benchmarks 2026-05-14 AK (@_akhaliq) Daily Papers, arXiv cs.CL, arXiv — Evals & Benchmarks, Hugging Face Daily Papers 6.7 6.6/6.8/6.7

MemEye is a multimodal long-term agent memory benchmark constructed so that text traces alone cannot answer the questions — earlier benchmarks let agents "cheat" through caption shortcuts. The benchmark requires preserving and retrieving visual evidence across long horizons. Headline finding: state-of-the-art VLMs lose significant ground when forced to rely on actual visual memory rather than text summaries, with frontier models dropping 15–30 points relative to text-only memory baselines.

agent-memory multimodal-eval vlm-benchmarks

#13

Darwin Family: training-free evolutionary merging via MRI-trust-weighted weight-space recombination

Post-Training 2026-05-14 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.7 6.5/6.8/6.8

Darwin Family is a gradient-free evolutionary merging framework that recombines model weights with an MRI-trust weighting (Mutually Reinforcing Importance) to scale reasoning performance without additional training. The paper claims frontier-level reasoning improvements purely from training-free recombination of existing checkpoints, with stable scaling laws across the search budget — extending the model-merging literature beyond simple linear interpolation and task-arithmetic methods.
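
For reference, the two baselines named above (linear interpolation and task arithmetic) can be sketched in a few lines. This is a toy illustration on plain Python lists, not the Darwin Family method; the dict-of-lists checkpoint format and the function names are assumptions made for the example.

```python
def linear_interpolate(model_a, model_b, alpha):
    """Elementwise (1 - alpha) * A + alpha * B over matching parameter names."""
    return {
        name: [(1 - alpha) * x + alpha * y for x, y in zip(model_a[name], model_b[name])]
        for name in model_a
    }


def task_arithmetic(base, finetuned_models, scales):
    """base + sum_i scale_i * (finetuned_i - base), applied per parameter."""
    merged = {name: list(values) for name, values in base.items()}
    for model, scale in zip(finetuned_models, scales):
        for name, values in model.items():
            for i, (w, w0) in enumerate(zip(values, base[name])):
                merged[name][i] += scale * (w - w0)
    return merged
```

Both baselines fix the mixing coefficients by hand. An evolutionary search over such coefficients is one way to go beyond them, which appears to be the direction the paper takes.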

model-merging evolutionary-methods training-free-scaling

#14

Concurrency without Model Changes: future-based async function calling for LLM agents removes blocking on every tool call

Agents & Tool Use 2026-05-14 arXiv — Agents / Tool Use, arXiv cs.AI, arXiv cs.CL, arXiv cs.LG, arXiv — Evals & Benchmarks 6.6 6.5/6.5/6.7

Standard tool use blocks LLM decoding until each function call completes, so end-to-end latency stacks linearly with the number of independent tool calls. This work introduces future-based async semantics: the model emits a future handle in place of the awaited result and continues generation, joining when the value is actually consumed. No fine-tuning required: the change is in the runtime, not the model. Reported speedups are most pronounced on agents that issue many independent calls (search, code, retrieval), where blocking would otherwise dominate wall-clock time.
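
The future-handle idea maps closely onto ordinary `asyncio` tasks. The sketch below is a rough analogy and not the paper's runtime: `call_tool`, the tool names, and the fixed delay are hypothetical stand-ins, and "generation" is reduced to a log entry.

```python
import asyncio


async def call_tool(name: str, delay: float, log: list) -> str:
    log.append(f"start:{name}")
    await asyncio.sleep(delay)  # stands in for tool or network latency
    log.append(f"done:{name}")
    return f"{name}-result"


async def run_agent(log: list) -> list:
    names = ("search", "code", "retrieval")
    # Issue every call up front and keep only a handle (a Task) for each.
    handles = {name: asyncio.create_task(call_tool(name, 0.05, log)) for name in names}
    # The "model" keeps going without waiting on any result.
    log.append("model keeps generating")
    # Join each handle only at the point where its value is consumed.
    return [await handles[name] for name in names]


def main():
    log = []
    results = asyncio.run(run_agent(log))
    return results, log
```

Because all three tasks start before any of them finishes, total latency is close to the slowest single call instead of the sum of all three.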

function-calling tool-use concurrency agent-runtime

#15

Orchard: open-source agentic modeling framework — fills the training/infrastructure gap behind top closed systems

Agents & Tool Use 2026-05-14 AK (@_akhaliq) Daily Papers, arXiv — Agents / Tool Use, arXiv cs.AI, arXiv cs.CL, arXiv — Evals & Benchmarks 6.6 6.5/7.0/6.4

Orchard is an open-source agentic modeling framework that packages the training infrastructure, environment harnesses, and evaluation tooling that have historically been the bottleneck for academic groups reproducing the strongest closed-source agentic results. The release includes ALFWorld/WebShop/Search-QA training environments, standard agentic RL recipes (GRPO, OPSD, hybrid variants), and competitive baselines across model families. The contribution is explicitly infrastructure — closing the reproducibility gap rather than introducing a new method.

open-source agentic-rl infrastructure reproducibility

#16

FutureSim: replay real-world events in arrival order to evaluate adaptive agents on out-of-distribution updates

Evaluations & Benchmarks 2026-05-14 AK (@_akhaliq) Daily Papers, arXiv cs.AI, arXiv cs.CL, arXiv cs.LG, arXiv — Evals & Benchmarks 6.6 6.5/6.8/6.5

FutureSim is a grounded-simulation harness that replays real-world events in the order they occurred, scoring agents on how well they adapt to new information as it arrives rather than on static-snapshot accuracy. The benchmark explicitly targets the failure mode where agents trained on retrospective data end up out of distribution, and degrade, when narrative or factual conditions shift mid-trajectory.
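
A replay loop of this kind might look roughly like the sketch below. Everything here is hypothetical: the `Event` fields, the `predict`/`update` agent interface, and the accuracy score are illustrative choices, not FutureSim's actual API or metric.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: int
    question: str
    outcome: str


class LastOutcomeAgent:
    """Toy agent that predicts whatever outcome it saw most recently."""

    def __init__(self):
        self.last = None

    def predict(self, question):
        return self.last

    def update(self, event):
        self.last = event.outcome


def replay(events, agent):
    """Feed events in arrival order; score each prediction before the reveal."""
    correct = 0
    for event in sorted(events, key=lambda e: e.timestamp):
        if agent.predict(event.question) == event.outcome:
            correct += 1
        agent.update(event)  # the agent sees the outcome only after predicting
    return correct / len(events)
```

The ordering is the point of the design: the agent is always scored on an event before it is allowed to learn from it, so a mid-stream shift costs accuracy until the agent adapts.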

agent-evaluation adaptive-agents simulation-benchmarks

#17

EVA-Bench: end-to-end voice-agent evaluation — couples simulated conversations with task-completion quality

Audio & Speech 2026-05-13 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.5 6.5/6.5/6.5

EVA-Bench is a voice-agent end-to-end evaluation framework that jointly addresses two gaps in prior voice benchmarks: generating realistic simulated user conversations (rather than scripted prompts), and scoring task completion quality (rather than ASR accuracy alone). The framework produces reproducible voice-agent comparisons across enterprise deployment scenarios.

voice-agents evaluation speech-llm

#18

Long-context VLMs effectively trained beyond 128K — recipe shows where naive context extension breaks

Multimodal 2026-05-13 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.5 6.5/6.5/6.5

Long-context modeling has become a core capability for large VLMs, yet practical training recipes remain underexplored: many published methods that work for text fail when vision tokens are added because the modality-mix ratio interacts with positional encoding extension. This paper documents what actually works for training VLMs that generalize beyond 128K, identifying the specific failure modes of standard YaRN-style and RoPE-scaling recipes when applied to mixed modalities, and proposes a curriculum that preserves visual grounding at long contexts.

long-context vlm rope-scaling training-recipes

#19

Boosting RLVR via randomly-selected verifiable feedback — sample-efficiency gains via stochastic reward selection

Reinforcement Learning 2026-05-14 arXiv cs.AI, arXiv cs.CL, arXiv — Evals & Benchmarks, arXiv — Reinforcement Learning 6.5 6.4/6.5/6.6

RLVR works well for math and code where verifiers are cheap, but sample efficiency degrades on harder problems where verifier signal is sparse and noisy. This paper randomly subsamples verifier feedback during training — counterintuitively improving downstream reasoning by reducing exposure to noisy or sycophantic verifier failures while preserving signal on well-validated rollouts.

rlvr verifiable-rewards sample-efficiency

#20

Granite Embedding Multilingual R2: IBM open-sources Apache-2.0 multilingual embeddings, MIRACL/MTEB competitive

Frontier LLMs 2026-05-14 Hugging Face Blog 6.4 6.0/6.5/6.7

IBM released Granite Embedding Multilingual R2 under Apache 2.0 — an open embedding model with strong MIRACL and MTEB multilingual scores. The release fills a gap in fully-permissive multilingual embeddings for enterprise RAG stacks that need to avoid the restrictive licensing on competing models. R2 builds on the Granite Embedding family with expanded language coverage and updated negative-sampling.

embeddings multilingual apache-2 ibm-granite rag

#21

AI-Native Healthcare with Abridge: 100M doctor visits, 10-20 hours/week saved per clinician, prior auth in minutes

Industry 2026-05-14 Latent Space Podcast, Latent Space (swyx & Alessio) 6.4 6.2/6.5/6.5

Latent Space's first healthcare-focused episode features Abridge's Janie Lee and Chai Asawa walking through what the production AI-native medical-scribe stack looks like at 100M doctor visits processed. Headline numbers: 10–20 hours per clinician per week saved on documentation, prior authorization compressed from days to minutes, and the architecture decisions that make the system survive HIPAA-grade compliance review. The episode is dense on the specifics of clinician workflow integration that academic literature on medical LLMs typically glosses over.

healthcare ambient-scribe abridge prior-authorization

#22

KL for a KL: control-variate baseline stabilizes on-policy distillation under high gradient variance

Post-Training 2026-05-08 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.4 6.3/6.5/6.4

OPD has emerged as the dominant post-training paradigm for reasoning-tuned LLMs but suffers from high gradient variance from its single-sample Monte Carlo estimator. KL-for-a-KL introduces a control-variate baseline derived from the teacher's KL itself, reducing variance without changing the expected gradient — producing more stable training curves and better final accuracy on math and code reasoning suites.
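
The general control-variate fact behind this claim can be checked exactly on a two-outcome toy problem. The sketch below is not the paper's estimator: it uses a Bernoulli policy, an arbitrary reward table `f`, and a constant baseline `b`, all chosen for illustration, whereas the paper is described as deriving its baseline from the teacher's KL.

```python
def estimator_stats(p, f, b):
    """Exact mean and variance of (f(x) - b) * score(x) for x ~ Bernoulli(p).

    score(x) is d/d(theta) log P(x) with p = sigmoid(theta):
    score(1) = 1 - p and score(0) = -p, so E[score] = 0.
    """
    outcomes = [(1, p, 1.0 - p), (0, 1.0 - p, -p)]  # (x, probability, score)
    mean = sum(prob * (f[x] - b) * score for x, prob, score in outcomes)
    second_moment = sum(prob * ((f[x] - b) * score) ** 2 for x, prob, score in outcomes)
    return mean, second_moment - mean ** 2


f = {0: 10.0, 1: 11.0}
mean_plain, var_plain = estimator_stats(0.5, f, b=0.0)
mean_base, var_base = estimator_stats(0.5, f, b=10.5)
```

Because the score has zero expectation, subtracting any constant `b` leaves the mean gradient unchanged (0.25 in both cases here), while the variance drops from 27.5625 to 0 when `b` is set to the average reward.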

on-policy-distillation variance-reduction reasoning-training

#23

Multi-agent systems survey: collaboration, failure attribution, and self-evolution in LLM-based agent collectives

Agents & Tool Use 2026-05-14 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.4 6.2/6.5/6.5

Comprehensive survey of LLM-based multi-agent systems with a focus on three understudied dimensions: structured collaboration patterns (which architectures actually scale beyond toy debates), failure attribution (when agent collectives go wrong, can we localize the failing agent or step?), and self-evolution (whether systems can update their own coordination protocols). Useful as a literature anchor for current work in the area.

multi-agent survey failure-attribution agent-collaboration

#24

IntentVLA: short-horizon intent modeling resolves aliased robot manipulation demonstrations

Robotic Autonomy 2026-05-14 AK (@_akhaliq) Daily Papers, arXiv cs.CL, arXiv — Evals & Benchmarks, Hugging Face Daily Papers 6.4 6.3/6.5/6.4

Frame-conditioned VLA policies infer each action chunk from the current observation alone, but robot demonstration data is intrinsically multimodal — similar visual-language inputs may map to different action chunks because human demonstrators acted under different short-horizon intents. IntentVLA introduces an explicit intent token that disambiguates the aliasing, learned jointly with the action policy. The result is materially better behavior cloning on tasks where prior VLAs averaged conflicting demonstrations.

vla robot-learning behavior-cloning intent-modeling

#25

SANA-WM: 2.6B open-source minute-scale world model — 720p video at industrial-baseline quality

Generative Media 2026-05-14 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.3 6.2/6.3/6.4

SANA-WM is a 2.6B-parameter open-source world model trained natively for minute-scale video generation, producing high-fidelity 720p one-minute videos with precise camera control. Quality is reportedly comparable to large-scale industrial baselines (LingBot-World class), at a fraction of the parameter count, using a hybrid linear-diffusion transformer to make long-form generation tractable.

world-models video-generation open-source linear-attention

#26

MeMo: Memory as a Model — augment frozen LLMs with a model-based external memory rather than retrieval+context

Frontier LLMs 2026-05-14 arXiv cs.AI, arXiv cs.CL, arXiv cs.LG, arXiv — Evals & Benchmarks 6.3 6.2/6.5/6.2

MeMo casts external memory as a small trainable model rather than a vector store + context-window stuffing. The approach trains the memory model on the same tokens the frozen base LLM sees, producing a queryable companion that updates online. Reported wins are most pronounced on tasks where standard RAG fails (cross-document reasoning, evolving entities) and where the memory model can amortize structure across queries.

external-memory lifelong-learning rag-alternatives

#27

Widening the Gap: outlier-injection attack widens precision gap, exploits LLM quantization for hidden malicious behavior

Safety, Policy & Regulation 2026-05-14 arXiv cs.AI, arXiv cs.LG, arXiv — Efficiency 6.3 6.5/6.5/5.8

Earlier work showed quantization schemes can hide malicious behaviors that appear only after weight precision drops. This paper extends the attack: an adversary can inject targeted outliers during fine-tuning that widen the full-precision/quantized gap, making the model pass safety evals in FP16 while behaving maliciously in INT4. Practical for deployment-time attacks since most enterprise inference serves quantized variants. Mitigation: quantization-aware safety eval, not just FP eval.
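
To see why outliers matter for the precision gap, consider a simplified per-tensor absmax INT4 quantizer. This sketch only shows the well-known outlier sensitivity of such schemes; it is not the paper's attack, and the quantizer, weights, and outlier value are assumptions chosen for illustration.

```python
def quantize_int4(weights):
    """Symmetric absmax quantization to the INT4 range [-8, 7], then dequantize."""
    scale = max(abs(w) for w in weights) / 7.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]


def max_error(original, dequantized):
    return max(abs(w - d) for w, d in zip(original, dequantized))


small = [0.10, -0.25, 0.30, -0.40]
clean_error = max_error(small, quantize_int4(small))

# One large outlier stretches the scale so every small weight rounds to zero.
with_outlier = quantize_int4(small + [8.0])
outlier_error = max_error(small, with_outlier[:4])
```

Here the outlier raises the worst-case error on the original four weights from about 0.02 to 0.40, while the full-precision values are untouched. That is the kind of divergence a quantization-aware safety eval would need to catch by testing the quantized variant that is actually served.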

safety quantization-attacks red-teaming

#28

VGGT-Edit: feed-forward native 3D scene editing via residual field prediction

Generative Media 2026-05-14 AK (@_akhaliq) Daily Papers, arXiv cs.AI, Hugging Face Daily Papers 6.3 6.2/6.3/6.4

VGGT-Edit extends feed-forward 3D scene reconstruction to support edits — adding, removing, or modifying scene elements — by predicting a residual field over the base reconstruction in a single forward pass. The architecture inherits the speed of feed-forward reconstruction while adding editability that previously required iterative optimization, with reported quality competitive with NeRF-based editing pipelines.

3d-scenes scene-editing feed-forward-reconstruction

#29

Holistic AI Agent Evaluation: ties failure type to precise step within long agent trajectories

Evaluations & Benchmarks 2026-05-14 arXiv cs.AI, arXiv cs.CL, arXiv — Evals & Benchmarks 6.3 6.2/6.5/6.2

Standard agent evaluation reports overall success/failure without explaining where or why. This work produces a process-level diagnosis: a taxonomy of failure types (tool misuse, reasoning errors, memory loss, premature termination) tied to specific trajectory steps. The framework helps practitioners localize bugs in long-running agents and quantifies which failure types dominate different task families.
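
A minimal sketch of what step-level diagnosis records could look like, and how they roll up per task family. The schema, field names, and sample records are hypothetical and are not taken from the paper; only the four failure types come from the summary above.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    TOOL_MISUSE = "tool_misuse"
    REASONING_ERROR = "reasoning_error"
    MEMORY_LOSS = "memory_loss"
    PREMATURE_TERMINATION = "premature_termination"


@dataclass(frozen=True)
class StepDiagnosis:
    task_family: str      # hypothetical grouping label
    trajectory_id: str
    step_index: int       # step at which the failure was localized
    failure: FailureType


def dominant_failures(diagnoses):
    """Return the most frequent failure type for each task family."""
    counts = defaultdict(Counter)
    for d in diagnoses:
        counts[d.task_family][d.failure] += 1
    return {family: c.most_common(1)[0][0] for family, c in counts.items()}


# Hypothetical diagnoses, for illustration only.
diagnoses = [
    StepDiagnosis("web_navigation", "t1", 14, FailureType.TOOL_MISUSE),
    StepDiagnosis("web_navigation", "t2", 9, FailureType.TOOL_MISUSE),
    StepDiagnosis("web_navigation", "t3", 31, FailureType.MEMORY_LOSS),
    StepDiagnosis("code_repair", "t4", 5, FailureType.REASONING_ERROR),
]
summary = dominant_failures(diagnoses)
```

Keeping `step_index` on every record is what distinguishes this from a pass/fail score: the same data supports both per-family aggregates and jumping to the exact step that failed.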

agent-evaluation failure-diagnosis trajectory-analysis

#30

Pax Silica (No Priors): Trump administration tech strategy with Under Secretary of State Jacob Helberg

Government & Defense 2026-05-14 No Priors (Sarah Guo & Elad Gil) 6.2 5.8/6.8/6.2

Sarah Guo and Elad Gil interview US Under Secretary of State for Economic Affairs Jacob Helberg on the Trump administration's tech and AI strategy. The framing — "Pax Silica" — argues AI dominance requires not just leading-edge semiconductors but a full-stack reshaping of rare-earth supply chains, actuators, and adjacent industrial inputs. Useful for tracking how the State Department is articulating economic-statecraft levers around AI compute and supply chains, especially in the wake of recent export-control updates.

us-policy tech-statecraft export-controls rare-earths

#31

When Are Two Networks the Same? Tensor similarity for mechanistic interpretability beyond behavioral matching

Interpretability 2026-05-14 arXiv cs.LG, arXiv — Mechanistic Interpretability 6.2 6.3/6.5/5.8

Mech-interp work that decomposes models into components needs a way to verify that two components implement the same computation. Behavioral similarity is blind to mechanisms that diverge only out of distribution (OOD), and basis-dependent parameter measures are unstable. This paper introduces a tensor-similarity measure that is basis-invariant and OOD-sensitive, with applications to component-equivalence verification in circuit-level interpretability.
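
A small sketch of the basis-dependence problem only; it does not implement the paper's measure. It assumes a hypothetical two-layer linear network and shows that rescaling the hidden basis changes the raw weights while leaving the computed function identical.

```python
import math


def matmul(a, b):
    """Multiply two small matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def frobenius_distance(a, b):
    """Entrywise (basis-dependent) distance between two matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


# Hypothetical two-layer linear network: y = W2 @ (W1 @ x)
W1 = [[1.0, 2.0], [3.0, 4.0]]
W2 = [[1.0, 1.0], [0.0, 1.0]]

# Rescale the hidden basis by P and undo it in the next layer.
P = [[2.0, 0.0], [0.0, 0.5]]
P_inv = [[0.5, 0.0], [0.0, 2.0]]
W1_rebased = matmul(P, W1)
W2_rebased = matmul(W2, P_inv)

same_function = matmul(W2, W1) == matmul(W2_rebased, W1_rebased)
param_distance = frobenius_distance(W1, W1_rebased)
print(same_function, round(param_distance, 3))
```

An entrywise comparison of `W1` and `W1_rebased` reports a large difference even though the two networks compute the same map, which is the instability a basis-invariant measure is meant to remove.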

mechanistic-interpretability model-equivalence tensor-similarity

#32

Sea Limited's view on agentic software development with Codex

AI Coding 2026-05-14 OpenAI Research 6.2 5.8/6.0/6.8

OpenAI published a co-marketed write-up with Sea Limited's CPO on how Sea is deploying Codex across its engineering teams to accelerate AI-native software development for its Southeast Asian product lines. Largely a customer testimonial, but useful for tracking how Codex enterprise rollouts are being framed against Cursor/Claude Code competitors at non-US tech companies.

codex openai enterprise-codegen sea-limited

#33

Talk is (Not) Cheap: benchmark coverage audit shows LLM attack benchmarks miss most of the threat surface

Safety, Policy & Regulation 2026-05-14 arXiv cs.AI, arXiv — Evals & Benchmarks 6.2 6.0/6.8/5.8

A reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface — using a 4×6 Target × Technique matrix grounded in STRIDE and a 507-leaf taxonomy (401 data-populated, 106 unaddressed). The audit finds substantial gaps: prompt-injection and jailbreaking are over-represented; supply-chain, prompt-encryption-bypass, and agent-loop attacks are under-represented. Practitioner takeaway: passing the standard red-team suites does not imply broad robustness.
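
A rough sketch of a coverage audit over a Target × Technique grid. The axis labels and benchmark assignments are hypothetical (the paper's actual 4×6 matrix is not reproduced here); only the leaf counts come from the summary above.

```python
def coverage_gaps(targets, techniques, benchmark_cells):
    """Return the (target, technique) cells that no benchmark exercises."""
    covered = set(benchmark_cells)
    return [(t, k) for t in targets for k in techniques if (t, k) not in covered]


# Hypothetical axis labels and benchmark-to-cell assignments.
targets = ["model", "data", "tooling", "deployment"]
techniques = ["injection", "jailbreak", "supply_chain"]
benchmark_cells = [
    ("model", "injection"),
    ("model", "jailbreak"),
    ("data", "injection"),
]

gaps = coverage_gaps(targets, techniques, benchmark_cells)
cell_coverage = 1 - len(gaps) / (len(targets) * len(techniques))

# Leaf-level coverage from the counts reported in the summary.
populated, unaddressed = 401, 106
leaf_coverage = populated / (populated + unaddressed)
print(f"uncovered cells: {len(gaps)}, leaf coverage: {leaf_coverage:.1%}")
```

Listing the empty cells explicitly, rather than reporting one coverage number, makes it easier to see which combinations a red-team suite never touches.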

red-team ai-safety attack-taxonomy benchmark-coverage

#34

AI-liability regime guide (a16z policy): Jai Ramaswamy and Matt Perault on what a workable regime needs

Safety, Policy & Regulation 2026-05-14 a16z AI Policy Brief 6.1 5.5/7.0/5.8

a16z's chief legal and policy officer Jai Ramaswamy and policy head Matt Perault outline what they argue a workable AI-liability regime needs: clear allocation between model developer, deployer, and user; tort-vs-statutory tradeoffs; and how product-liability doctrine adapts to systems whose behavior is partly emergent. Framed as input to ongoing state-level legislative drafts (notably California's iterations on SB 53/SB 1047 successors).

ai-policy liability regulation a16z

#35

Khosla Ventures bets $10M on Ian Crosby's autonomous AI bookkeeping startup Synthetic

Industry 2026-05-14 TechCrunch — AI 6.1 5.5/6.0/6.8

Khosla Ventures is putting $10M into Synthetic, a fully autonomous AI bookkeeping startup from Ian Crosby (whose previous bookkeeping startup Bench imploded last year). The thesis is that the agentic-bookkeeping space has matured enough that a clean rebuild can hit the workflow target that Bench, built on conventional software, could not. Notable as another data point in the back-office automation cohort that includes the Intuit-PwC-Anthropic stack.

startups agentic-finance khosla-ventures bookkeeping

#36

OpenAI reportedly preparing legal action against Apple over ChatGPT integration shortfalls

Industry 2026-05-14 TechCrunch — AI 6.0 5.5/6.0/6.5

TechCrunch reports that OpenAI is exploring legal action against Apple over the ChatGPT integration shipped with iOS, alleging that promised subscriber-conversion volume and product placement never materialized. If filed, it would be the most public split among the major platform-LLM partnerships, and further evidence for the multi-platform-strategy thesis in the past year's Apple-AI coverage.

openai apple platform-disputes

#37

What happens when AI starts building itself? — Richard Socher's $650M startup

Industry 2026-05-14 TechCrunch — AI 6.0 5.8/6.0/6.2

Richard Socher (you.com founder, former Salesforce chief scientist) raised $650 million for a new startup whose stated ambition is AI that researches and improves itself indefinitely. The framing sits in the recursive-self-improvement lineage, but Socher insists the company will ship products rather than chase an open-ended AGI horizon. Skepticism is warranted given how often self-improvement startups have failed, but the size of the raise alone makes it consequential.

startups recursive-self-improvement richard-socher

#38

SpaceXAI losing 50+ staff since Musk's xAI/SpaceX consumer merger

Industry 2026-05-14 TechCrunch — AI 5.9 5.5/5.5/6.5

More than 50 employees have reportedly left Musk's newly merged SpaceXAI since February, raising questions about burnout, leadership changes, talent poaching by competitors, and whether liquidity events around the merger weakened retention. Tracks alongside the ongoing Musk vs Altman case that goes to the jury this week.

xai spacex elon-musk talent