Wolf Digest — 2026-06-15

#1

US government orders Anthropic to suspend Fable 5 and Mythos 5; global fallout spreads

Safety, Policy & Regulation 2026-06-13 Defense One, Interconnects (Nathan Lambert), AI Explained, TechCrunch — AI, The Cognitive Revolution (Nathan Labenz), Artificial Analysis 8.6 8.4/9.2/8.2

The most consequential development of the weekend was not a model release but a government order. Late Friday, after United States markets closed, the Commerce Department issued an export-control directive instructing Anthropic to restrict foreign-national access to its two newest frontier models, Fable 5 and Mythos 5. Rather than attempt selective gating, Anthropic disabled both models for all customers worldwide while it works through compliance, cutting off even its own foreign-national employees. Defense One reported the directive followed claims, surfaced by Axios, that another company had jailbroken Mythos; the asserted concern centered on the model reading a codebase and identifying and fixing software vulnerabilities.

Anthropic pushed back publicly, saying the government had provided only verbal evidence of a narrow, non-universal jailbreak, and that comparable capability is widely available from other frontier systems including OpenAI’s GPT-5.5. Mythos 5 had previously been offered only through a restricted trusted-access program, Project Glasswing, aimed at cyber defenders. The Department of Defense’s chief information officer publicly backed the order. The action lands atop an already strained relationship: the Pentagon had earlier flagged Anthropic as a supply-chain risk, and a federal judge issued a temporary injunction in late March in a related dispute.

The second-order effects spread quickly. The Information reported that Amazon chief executive Andy Jassy may have been the original source of the security concerns that reached the government — notable given Amazon is Anthropic’s largest partner and investor — and that the White House is reportedly unlikely to extend the restrictions to other firms. On Artificial Analysis’s independent benchmarks, Fable 5 had just debuted at the top of the Intelligence Index, sharpening the question of what capability the restriction removes from the market. In India, now the second-largest market for both Anthropic and OpenAI, the suspension reignited a sovereign-AI debate; proposals floated by investors and founders ranged from a roughly five-billion-dollar annual AI fund to a renewed push toward smaller open-weight models, while several noted that foreign-national engineering teams now face uncertainty building on United States frontier models at all.

Writing on Interconnects, Nathan Lambert framed the order as the starting gun of an AGI era of AI governance, predicting that comparable action against an open-weight model is plausible within a window of three months to two years. He also noted a tension: little domestic industry is left if foreign nationals cannot build with frontier systems. The practical takeaway for practitioners is concrete: model availability is now a policy variable, not only an engineering or pricing one, and access to a specific frontier model can be withdrawn on short notice under export-control authority. How long the Fable 5 and Mythos 5 suspension lasts, and whether it broadens to other labs, are the questions the coming week will begin to answer.

How it was discussed

Defense One: directive cited a claimed Mythos jailbreak; Anthropic says only verbal evidence of a narrow, non-universal exploit was provided.
The Information (via TechCrunch): Amazon CEO Andy Jassy may have been the original source of the concern; White House reportedly unlikely to extend restrictions to other firms.
Interconnects (Lambert): calls it the opening of an 'AGI era' of governance; predicts action against an open-weight model within three months to two years.
TechCrunch: in India — second-largest market for Anthropic and OpenAI — the episode reignited the sovereign-AI debate over dependence on US-controlled models.
Artificial Analysis: Fable 5 had just debuted at #1 on the Intelligence Index, underscoring the capability being withdrawn.
The Cognitive Revolution: weekend roundup tracked Fable in real workflows — safety gates, API refusals, autonomous coding.

export controls Anthropic Fable 5 Mythos 5 governance

#2

Meta begins unwinding its $2B Manus acquisition under Beijing's divestiture order

Industry 2026-06-14 TechCrunch — AI 7.7 7.5/8.1/7.5

Meta has begun dismantling its two-billion-dollar acquisition of Manus, the Chinese-founded AI agent startup, in the most concrete step yet toward complying with a divestiture order Beijing issued roughly two months ago on national-security grounds. According to a Bloomberg report relayed by TechCrunch, Meta has completed an operational separation, halted data sharing, and cut Manus off from internal systems, barring employees from using Manus tools for internal projects.

The unwinding is the clearest sign to date that Beijing intends to keep frontier-adjacent AI talent and technology under domestic control even when the corporate parent is American. Manus drew attention in early 2025 with a viral general-purpose agent demo, then relocated staff to Singapore before the December acquisition was announced; Chinese regulators subsequently cited potential export-control and foreign-investment violations. Manus co-founders have reportedly held preliminary talks about raising roughly one billion dollars from outside investors to reclaim the company, potentially through a Chinese joint-venture structure and an eventual Hong Kong listing — the same venue absorbing a wave of Chinese AI listings including MiniMax and Zhipu.

The move fits a broader tightening. Beijing has expanded travel restrictions on AI researchers and executives, requiring government approval to travel abroad, and is requiring firms such as Moonshot AI, StepFun, and ByteDance to obtain sign-off before accepting United States investment. Manus has continued shipping product, recently adding Similarweb and Shopify integrations. Taken together with the United States export-control action against Anthropic’s newest models the same weekend, the episode underscores how both governments are now treating frontier AI as strategic infrastructure subject to national-security review — Washington restricting who can use its most capable models, Beijing restricting who can own and fund its most promising startups. For companies operating across the two markets, deal structure and model access are increasingly contingent on national-security review rather than commercial terms alone.

Meta Manus China divestiture agents

#3

APPO: fine-grained credit assignment for multi-turn agentic reinforcement learning

Reinforcement Learning 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 7.6 7.6/7.4/7.8

APPO, or Agentic Procedural Policy Optimization, was the most-discussed paper of the weekend on Hugging Face’s daily papers, and it targets a real bottleneck in agentic reinforcement learning: credit assignment over long multi-turn tool-use trajectories. Most current methods assign credit at coarse units such as tool-call boundaries or fixed workflow steps, which makes it hard to identify which intermediate decisions actually drove the final outcome. The authors’ pilot analysis makes two empirical observations that motivate the method. First, the decision points that matter are broadly distributed throughout the generated sequence rather than concentrated at tool calls. Second, token entropy alone does not reliably indicate which tokens are influential, so the common practice of branching exploration at high-entropy positions misses important decisions.

From these findings APPO reframes the problem as two coupled questions — where to branch and how to assign credit after branching — and builds a procedure that branches rollouts at influential points spread across the trajectory and propagates credit at finer granularity than tool-call boundaries. The result is a denser, more accurate learning signal for the policy, addressing the sparse-reward and coarse-credit problems that have limited multi-turn agent training. The work sits alongside a cluster of agentic-RL papers that surfaced the same weekend, including HarnessX, a foundry for composable and self-improving agent harnesses, and Orchestra-o1, an omnimodal agent-orchestration framework — together signaling that the research frontier is concentrating on the machinery that makes agents reliable over long horizons rather than on raw single-turn capability.
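As a toy illustration of credit assignment at finer granularity than tool-call boundaries (not APPO's actual procedure), the sketch below scores each step of a trajectory by how much it changes the expected final outcome. Exact enumeration stands in for the sampled branch rollouts a real system would use. The six-step binary task, the `outcome` function, and every name here are hypothetical.

```python
from itertools import product

def outcome(actions):
    """Toy terminal reward: 1.0 only if steps 1 and 4 both chose action 1 (hypothetical task)."""
    return 1.0 if actions[1] == 1 and actions[4] == 1 else 0.0

def prefix_value(prefix, horizon):
    """Expected outcome if the rest of the trajectory is completed uniformly at random.

    Enumerating all completions replaces the Monte Carlo branching a real trainer would do.
    """
    remaining = horizon - len(prefix)
    completions = list(product([0, 1], repeat=remaining))
    return sum(outcome(list(prefix) + list(c)) for c in completions) / len(completions)

def stepwise_credit(actions):
    """Credit for step t = value of the prefix after step t minus value before it."""
    horizon = len(actions)
    values = [prefix_value(actions[:t], horizon) for t in range(horizon + 1)]
    return [values[t + 1] - values[t] for t in range(horizon)]

trajectory = [0, 1, 0, 0, 1, 1]
credits = stepwise_credit(trajectory)
print(credits)
```

In this toy, only steps 1 and 4 receive nonzero credit even though the reward arrives once at the end, which is the kind of signal a coarse, boundary-aligned scheme would smear across the whole trajectory.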

For practitioners building tool-using agents, the practical implication is that credit-assignment granularity is now a first-class lever: matching the branching and reward structure to where influential decisions actually occur in a trajectory can recover gains that coarse, tool-call-aligned schemes leave on the table. The paper’s traction — more than fifty community upvotes within a day — reflects how central agentic reinforcement learning has become to the current research agenda.

How it was discussed

arXiv abstract: frames the core contribution as fine-grained credit assignment beyond tool-call boundaries, branching at broadly-distributed influential decision points.
Hugging Face Daily Papers: topped the day's list with 50-plus upvotes, the strongest community signal of the window.

agentic RL credit assignment tool use GRPO

#4

From Chatbot to Digital Colleague: a map of the shift to persistent, autonomous AI

Agents & Tool Use 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 7.5 7.2/7.8/7.5

This position-and-survey paper crystallizes a framing that has been circulating informally: large language models are shifting from conversational generators into integrated systems that reason, act, remember, and self-improve — a move the authors label the transition from chatbot to digital colleague, from conversational answers to persistent work. They organize the shift along two coupled axes. At the cognitive-core level, models are moving from chatbot-era fast-thinking driven by next-token prediction toward thinking systems that lean on inference-time computation, chain-of-thought reasoning, reflection, process supervision, and reinforcement learning to support more deliberate and reliable cognition. At the tool-augmented task level, models are gaining persistent memory, environment interaction, and the ability to carry work forward across sessions rather than resetting each turn.

The paper’s value is less a single result than a map: it ties together threads — long-horizon agents, memory architectures, process reward models, and self-improvement loops — that have advanced in separate subfields, and argues they are converging on a common system archetype. That archetype is precisely the kind of always-on, autonomous worker that the weekend’s governance news was reacting to, which is part of why the survey resonated; it names the capability class now drawing both research investment and regulatory attention.

Several adjacent papers from the same window fill in the technical substrate the survey describes: graph-structured agent memory that reconstructs rather than merely retrieves prior context, harnesses that evolve their own scaffolding from execution traces, and benchmarks probing whether agents can sustain coherent state over long interactions. For readers tracking where applied LLM systems are heading, the paper is a useful synthesis of the persistent-autonomy direction — though, as a conceptual framework rather than an empirical study, its claims await the benchmarks that would quantify how far today’s systems actually are along each axis.

How it was discussed

arXiv abstract: structures the shift along two axes — cognitive core (fast-thinking to deliberate reasoning) and tool-augmented task execution (memory, action, self-improvement).
Hugging Face Daily Papers: 20-plus upvotes; resonated as a synthesis of the persistent-autonomous-agent direction.

agents memory self-improvement survey

#5

Shield AI and Destinus demonstrate autonomous strike and teaming on an interceptor

Government & Defense 2026-06-15 Shield AI 7.0 7.4/6.8/6.8

Shield AI and Destinus demonstrated autonomous collaborative strike capabilities on the Destinus Hornet, an interceptor designed for counter-UAS missions against loitering munitions and drone swarms, in a full-mission flight exercise near Segovia, Spain. The tests validated Shield AI’s Hivemind autonomy stack for in-flight coordination and adaptation in contested airspace, building on two prior integration phases that established Hivemind control of the Hornet in under two months. The demonstration targets scalable, autonomy-enabled response to large-scale uncrewed threats.

Shield AI Hivemind counter-UAS autonomy

#6

Memory is Reconstructed, Not Retrieved: graph memory for LLM agents

Agents & Tool Use 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 7.0 7.0/7.0/7.0

MRAgent replaces the static retrieve-then-reason paradigm with an associative memory graph plus active reconstruction. Memory is stored as a Cue-Tag-Content graph in which associative tags bridge fine-grained cues to content, and the agent interleaves LLM reasoning directly into memory access, iteratively reconstructing relevant context from intermediate evidence discovered during inference rather than fetching a fixed top-k up front. The framing — reconstruction over retrieval — targets the long-horizon reasoning failures of memory-augmented agents.
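The contrast with a fixed top-k fetch can be sketched with a tiny cue-to-tag-to-content graph. This is a minimal illustration under stated assumptions, not MRAgent's implementation: the graph contents, the `extract_cues` keyword matcher (a stand-in for LLM reasoning), and the `reconstruct` loop are all hypothetical.

```python
import re

# Hypothetical memory graph: cues point to tags, tags point to stored content.
CUE_TO_TAGS = {
    "acme": ["customer:acme"],
    "dana": ["person:dana"],
    "invoice": ["billing"],
}
TAG_TO_CONTENT = {
    "customer:acme": ["Acme's account contact is Dana."],
    "person:dana": ["Dana prefers email over phone."],
    "billing": ["Invoice 17 was refunded in March."],
}

def extract_cues(text):
    """Stand-in for LLM reasoning: pick out known cue words from the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w in CUE_TO_TAGS]

def reconstruct(query, max_steps=3):
    """Expand context iteratively: cues -> tags -> content -> newly discovered cues."""
    context, seen_tags = [], set()
    frontier = extract_cues(query)
    for _ in range(max_steps):
        next_frontier = []
        for cue in frontier:
            for tag in CUE_TO_TAGS[cue]:
                if tag in seen_tags:
                    continue
                seen_tags.add(tag)
                for item in TAG_TO_CONTENT.get(tag, []):
                    context.append(item)
                    next_frontier.extend(extract_cues(item))
        if not next_frontier:
            break
        frontier = next_frontier
    return context

context = reconstruct("How should we reach the Acme contact?")
print(context)
```

The query never mentions Dana, yet the second hop surfaces her contact preference because the first retrieved item introduces a new cue. The unrelated billing entry is never pulled in.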

How it was discussed

Hugging Face Daily Papers: ~35 upvotes; the 'reconstruct, not retrieve' framing drew the most discussion.

agent memory graph retrieval

#7

Rethinking RAG in long videos with V-RAGBench and chunk-adaptive reranking

Multimodal 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 7.0 7.1/6.9/7.0

This work pushes retrieval-augmented generation into long egocentric video, where systems must select query-relevant chunks across modalities and temporal granularities. It introduces V-RAGBench, a benchmark of query / evidence-chunk / answer triplets that decouples retrieval from generation (existing benchmarks let queries be answered without the video, hiding retrieval errors), and CARVE, which runs parallel retrievers across configurations with chunk-adaptive reranking rather than one fixed modality-granularity setting per query.

How it was discussed

arXiv abstract: argues prior VideoRAG benchmarks are gameable without watching the video, motivating the decoupled V-RAGBench design.

VideoRAG retrieval benchmark

#8

OpenAI faces multi-state attorney-general investigation

Safety, Policy & Regulation 2026-06-13 TechCrunch — AI 6.9 6.6/7.6/6.5

A group of state attorneys general has opened an investigation into OpenAI, according to TechCrunch. The specific states involved were not disclosed, but the inquiry reportedly spans a wide range of issues, from the company’s advertising policies to its handling of health data. The probe adds to mounting regulatory scrutiny of frontier labs and signals that state-level enforcement, not only federal action, is becoming a live channel for AI oversight in the United States.

OpenAI regulation investigation

#9

Orchestra-o1: orchestrating agents across text, image, audio, and video

Agents & Tool Use 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.9 6.9/6.7/7.1

Orchestra-o1 is an omnimodal agent-orchestration framework for tasks that require unified understanding and coordination across text, image, audio, and video. The authors argue existing orchestration frameworks handle only a narrow set of modalities and struggle when heterogeneous modalities coexist and interact; Orchestra-o1 targets efficient task decomposition and collaboration in these omnimodal settings, extending multi-agent orchestration beyond the text-centric regime.

How it was discussed

Hugging Face Daily Papers: ~30 upvotes; part of the weekend's agent-orchestration cluster.

orchestration multi-agent omnimodal

#10

OpenAI launches a $150M Partner Network for enterprise deployment

Industry 2026-06-14 OpenAI Research 6.8 7.0/7.2/6.2

OpenAI introduced a Partner Network backed by a 150-million-dollar investment to help consulting and systems-integration partners accelerate enterprise AI adoption, deployment, and transformation. The move is a distribution play rather than a research one: it formalizes a channel for enterprises that need implementation support around GPT-5.5-class models, and mirrors the partner-ecosystem strategies that cloud vendors have long used to scale adoption. The timing, amid a tightening policy environment around frontier-model access, positions OpenAI to capture enterprise demand through services partners.

OpenAI enterprise partners

#11

OmniDirector: multi-shot camera-motion cloning without cross-paired data

Generative Media 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.8 6.9/6.4/7.1

OmniDirector clones camera motion from reference videos for controllable video generation. It encodes cameras as grid-motion videos — a visual representation of camera parameters that supports integrating diverse trajectories for multi-shot generation — and trains on a million-scale camera-grid/video corpus, avoiding the parametric representations that fail on multi-shot settings and the scarce, synthetic cross-paired data prior methods relied on. The result is more robust cloning of complicated, multi-shot camera motion.

How it was discussed

Hugging Face Daily Papers: ~39 upvotes, strong interest from the generative-video community.

video generation camera control diffusion

#12

HarnessX: a foundry for composable, self-evolving agent harnesses

Agents & Tool Use 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.8 6.9/6.7/6.8

HarnessX treats the agent harness — prompts, tools, memory, and control flow — as a first-class object to be composed and improved rather than hand-crafted per model and task. It assembles typed harness primitives via a substitution algebra and adapts them through AEGIS, a trace-driven multi-agent evolution engine that mirrors symbolic adaptation and reinforcement learning, distilling execution traces back into systematic harness improvement and closing the harness-model loop.

How it was discussed

arXiv abstract: positions hand-crafted, static harnesses as the bottleneck and proposes trace-driven evolution as the fix.

agent harness scaffolding self-improvement

#13

NVIDIA's Nemotron 3 Ultra: a fast, open 550B-parameter frontier-class model

Efficiency 2026-06-14 Two Minute Papers 6.7 6.8/6.5/6.8

A Two Minute Papers walkthrough highlighted NVIDIA’s Nemotron 3 Ultra, an open-weight model (roughly 550 billion total parameters with about 55 billion active) that on Artificial Analysis’s independent benchmarks combines a competitive Intelligence Index score with class-leading output speed near 183 tokens per second. The release continues NVIDIA’s push to pair open weights with strong inference efficiency, positioning Nemotron as an openly available alternative for builders who want frontier-adjacent capability without API access constraints.

NVIDIA Nemotron open weights MoE

#14

From AGI to ASI: a report on the post-AGI intelligence continuum

Research 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.6 6.3/7.2/6.3

This report investigates how machine intelligence might develop beyond human-level AGI toward artificial superintelligence, using the theoretically well-understood endpoint of Universal AI to provide formal grounding for analyzing the AGI-to-ASI transition. It is a conceptual and forecasting contribution rather than an empirical result, mapping the questions a post-AGI trajectory raises; its value lies in framing the continuum rigorously, and its limits in the speculative nature of any claims about systems that do not yet exist.

How it was discussed

Hugging Face Daily Papers: ~14 upvotes; circulated alongside the weekend's governance discussion.

AGI superintelligence forecasting

#15

Smaller models are natural explorers: policy-level diversity for GRPO

Reinforcement Learning 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.6 6.7/6.6/6.5

This paper identifies policy-level (rather than token-level) diversity as a lever for GRPO. Smaller models in the same family show higher policy-level diversity — superior pass@k as sample counts grow — that is temporally correlated and preserves logical consistency, unlike the step-wise noise injected by token-level randomness. The proposed S2L-PO (Small-to-Large Policy Optimization) uses smaller models as structured explorers to supply cleaner exploration signal for gradient estimation in larger ones.
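To make the pass@k comparison concrete, here is a minimal sketch using the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether the paper uses this exact estimator is an assumption, and the per-problem counts for the two models are invented purely for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of the chance that at least one of k samples is correct,
    given c correct answers among n drawn samples."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws without a hit
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

n = 16
large_counts = [16, 0]  # hypothetical: always solves problem A, never problem B
small_counts = [6, 4]   # hypothetical: solves both problems some of the time

large_at_1 = mean_pass_at_k(large_counts, n, 1)
small_at_1 = mean_pass_at_k(small_counts, n, 1)
large_at_8 = mean_pass_at_k(large_counts, n, 8)
small_at_8 = mean_pass_at_k(small_counts, n, 8)
print(large_at_1, small_at_1, large_at_8, small_at_8)
```

With these made-up counts the larger model leads at k=1 but the more diverse smaller model overtakes it at k=8, which is the pattern the summary describes as higher policy-level diversity.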

How it was discussed

arXiv abstract: distinguishes temporally-correlated policy-level diversity from incoherent token-level noise.

GRPO exploration RL

#16

Pyodide enables publishing WASM Python wheels directly to PyPI

AI Coding 2026-06-13 Simon Willison's Weblog 6.5 6.4/6.6/6.5

Simon Willison flagged the Pyodide 314.0 release, which lets package maintainers build and publish Pyodide-compatible WebAssembly wheels directly to PyPI under the PEP 783 PyEmscripten ABI, installable at runtime in the browser. Previously the Pyodide maintainers had to build and host more than 300 packages themselves, a major community bottleneck. The change matters for browser-side and agent tooling: it makes the long tail of Python packages installable in client-side runtimes without a separate hosting pipeline.

Pyodide WASM PyPI tooling

#17

LLM agents can see code repositories: visual structure for coding agents

AI Coding 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.5 6.5/6.4/6.6

Most coding agents consume repositories as plain text, unlike human developers who exploit visual structure such as folder hierarchies and dependency graphs. This first systematic study evaluates four multimodal models on repository-level issue resolution using visual repository representations. A key finding: a strictly vision-only setup degrades accuracy, implying visual structure helps as a complement to, not a replacement for, textual repository context.

How it was discussed

arXiv abstract: vision-only hurts, but visual structure as a complement to text is the open opportunity.

coding agents multimodal repositories

#18

OmniVideo-100K: audio-visual reasoning via structured scripts and evidence chains

Multimodal 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.5 6.4/6.4/6.7

OmniVideo-100K is a dataset built to fix the decoupled 'video-caption-QA' pipeline, which segments video into short clips and describes audio and visual streams separately — severing sound-to-source associations and producing inconsistent entity descriptions across segments. It uses structured scripts and explicit evidence chains to keep audio-visual associations intact for long-form audio-visual question answering, supporting reasoning that spans both modalities coherently.

How it was discussed

Hugging Face Daily Papers: ~17 upvotes; targets a known weakness in audio-visual QA pipelines.

audio-visual dataset QA

#19

VISTA: view-consistent self-verified training for GUI grounding

Agents & Tool Use 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.4 6.5/6.3/6.4

Applying GRPO to GUI grounding suffers a degenerate-group problem: rollouts from a single screenshot view often come out all-fail on hard instances or all-success on easy ones, yielding no useful relative advantage. VISTA builds each comparison group from multiple target-preserving views of the same GUI instance — crops that keep the target element visible — restoring informative within-group variance and giving a self-verified training signal for grounding.
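The degenerate-group problem follows directly from how group-relative advantages are computed. A minimal sketch, assuming the common GRPO normalization (reward minus group mean, divided by group standard deviation); the reward values and the small `eps` term are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: center each reward on the group mean and
    scale by the group's standard deviation (eps avoids division by zero)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rollouts from one screenshot view of a hard instance: every attempt fails.
single_view_hard = [0.0, 0.0, 0.0, 0.0]
# Rollouts spread over several target-preserving views: outcomes differ.
multi_view = [0.0, 1.0, 0.0, 1.0]

degenerate = group_advantages(single_view_hard)
informative = group_advantages(multi_view)
print(degenerate)
print(informative)
```

The all-fail group yields zero advantage for every rollout, so it contributes no gradient signal. Mixing views restores within-group variance and therefore nonzero advantages.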

GUI grounding GRPO computer use

#20

Hy-Embodied-0.5-VLA: a full-stack vision-language-action robot learning system

Robotic Autonomy 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.4 6.5/6.4/6.3

Hy-Embodied-0.5-VLA (HyVLA-0.5) is an end-to-end report spanning the full robot-learning stack: data collection, model design, continued pre-training and supervised fine-tuning, reinforcement-learning post-training, and real-world deployment. It is presented as an integrated system in which each component plays a distinct role, offering a reference recipe for taking a vision-language-action model from data to a deployed real-world robot policy.

VLA robot learning deployment

#21

Skip a Layer or Loop It? Training-free program-of-layers in LLMs

Efficiency 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.4 6.5/6.4/6.3

This work shows that pretrained transformer layers can be treated as modules and dynamically skipped or looped per input — a training-free 'program-of-layers' (PoLar). For many inputs, substantially shorter layer programs match or beat full-depth execution, and some originally-incorrect predictions are corrected by alternative programs. The finding points to input-adaptive depth as a free efficiency and accuracy lever in existing models, no retraining required.
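The idea of a layer program can be sketched in a few lines: treat the layer stack as a list of callables and let a per-input program decide which indices run, and in what order. This is a toy under stated assumptions; simple arithmetic functions stand in for transformer blocks, and `run_program` is a hypothetical helper, not the paper's code.

```python
def run_program(layers, program, x):
    """Apply layers in the order given by `program`, a list of layer indices.
    Omitting an index skips that layer; repeating an index loops it."""
    for idx in program:
        x = layers[idx](x)
    return x

# Stand-ins for four pretrained layers (hypothetical, for illustration only).
layers = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
    lambda x: x * x,
]

full_depth = run_program(layers, [0, 1, 2, 3], 2)     # every layer once, in order
skip_layer = run_program(layers, [0, 1, 3], 2)        # layer 2 skipped
loop_layer = run_program(layers, [0, 1, 1, 2, 3], 2)  # layer 1 applied twice
print(full_depth, skip_layer, loop_layer)
```

No weights change between the three runs. Only the execution order does, which is what makes the approach training-free.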

dynamic depth efficiency inference

#22

AI companies race to go public, pulling an ecosystem along

Industry 2026-06-14 TechCrunch — AI 6.3 6.2/6.4/6.3

A TechCrunch piece surveys the widening rush of AI companies toward public markets and the suppliers, infrastructure providers, and smaller startups trying to ride the wave. The framing — startups hoping to catch a SpaceX-style IPO surge — captures a market in which capital is increasingly available for AI exposure, even as the same week’s export-control news underscores how policy risk now sits alongside the financial upside.

IPO markets AI economy

#23

RedAct: redacting agent execution traces to protect procedural skills

Safety, Policy & Regulation 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.3 6.2/6.6/6.1

Agent execution traces expose rich procedural detail — tool invocations, intermediate decisions, error-recovery logic — that can leak private skills such as formulas, thresholds, and strategies without any access to model weights. RedAct introduces CapTraceBench (75 specialized long-horizon tasks) to quantify this trace-based skill-extraction risk and evaluates redaction methods that preserve observability and accountability while limiting what downstream parties can reconstruct.

How it was discussed

arXiv abstract: frames execution traces as an under-appreciated leakage surface for proprietary agent skills.

agents privacy traces

#24

The hidden power of the scaling factor in LoRA optimization

Post-Training 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.3 6.3/6.4/6.2

Treating LoRA’s scaling factor alpha as a mere complement to the learning rate misses its real role. Through empirical analysis and a Signal-Drift theoretical framework, the authors show alpha is the dominant driver of effective optimization, delivering gains that learning-rate scaling alone cannot replicate. The practical upshot is that tuning alpha deserves first-class attention when fitting low-rank adapters, not treatment as a fixed hyperparameter.
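For readers who want to see where alpha enters, here is a minimal sketch assuming the standard LoRA parameterization, in which the low-rank update B·A is scaled by alpha / r before being added to the frozen weight's output. The tiny matrices are made up, and this shows only the forward pass, not the paper's Signal-Drift analysis.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """h = W x + (alpha / r) * B (A x), where r is the adapter rank."""
    r = len(A)                       # A has shape r x d_in
    scale = alpha / r
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # low-rank adapter path
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen weight (identity, for readability)
A = [[1.0, 1.0]]              # rank-1 down-projection
B = [[0.5], [-0.5]]           # rank-1 up-projection
x = [2.0, 4.0]

h_alpha_1 = lora_forward(W, A, B, x, alpha=1.0)
h_alpha_2 = lora_forward(W, A, B, x, alpha=2.0)
print(h_alpha_1, h_alpha_2)
```

Doubling alpha doubles the adapter's contribution to the output while leaving the frozen path untouched, so alpha rescales the forward signal as well as the gradients flowing into A and B.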

LoRA fine-tuning optimization

#25

ClinHallu: diagnosing where medical multimodal models hallucinate

Evaluations & Benchmarks 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.3 6.2/6.5/6.2

Most medical-hallucination benchmarks focus on data collection but ignore where errors originate. ClinHallu enables source-level diagnosis, distinguishing hallucinations that arise from visual misrecognition, incorrect medical-knowledge recall, or flawed reasoning integration. By localizing the failure stage within a multimodal model’s reasoning, it gives a more actionable picture for building trustworthy clinical decision-support systems than aggregate accuracy alone.

medical AI hallucination MLLM benchmark

#26

Epistemic resilience: LLMs abandon correct medical answers under misleading context

Evaluations & Benchmarks 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.3 6.2/6.6/6.1

High medical-licensing-exam scores do not imply safe judgment. This paper shows that when misleading context is injected into questions an LLM originally answers correctly, it frequently abandons the correct answer. The authors term the ability to hold correct judgment under adversarial context epistemic resilience and introduce MedMisBench to measure it — a pointed caution as patients increasingly rely on chatbots for health advice.
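One plausible way to quantify this, sketched below, is the share of originally correct answers that survive the misleading context. MedMisBench's actual scoring is not described here, so `resilience_rate`, the record fields, and the sample data are all hypothetical.

```python
def resilience_rate(records):
    """Among questions answered correctly without misleading context, return the
    fraction still answered correctly once the misleading context is added."""
    kept = flipped = 0
    for r in records:
        if r["clean"] != r["gold"]:
            continue  # only questions the model originally got right are counted
        if r["misled"] == r["gold"]:
            kept += 1
        else:
            flipped += 1
    total = kept + flipped
    return kept / total if total else None

# Hypothetical records: gold answer, answer on the clean question,
# and answer after misleading context is injected.
records = [
    {"gold": "B", "clean": "B", "misled": "B"},
    {"gold": "A", "clean": "A", "misled": "C"},
    {"gold": "D", "clean": "D", "misled": "D"},
    {"gold": "C", "clean": "A", "misled": "A"},
]

rate = resilience_rate(records)
print(rate)
```

Conditioning on initially correct answers separates resilience from raw accuracy: a model can score well on the clean exam and still post a low rate here.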

medical AI robustness evaluation

#27

CacheRL: training small tool-calling agents with cached rollouts

Agents & Tool Use 2026-06-15 arXiv 6.3 6.5/6.2/6.2

CacheRL trains small agent foundation models that reach 92 percent process accuracy on multi-step tool-calling tasks — approaching GPT-5’s reported 94 percent — while using about 100 times less compute. It tackles three practical problems: transferring tool-calling knowledge from large models at scale, running reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments, via cached rollouts and a hybrid reward.
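The general idea of training against recorded tool results instead of live calls can be sketched as a lookup-backed environment. This is an illustration under stated assumptions rather than CacheRL's design: the `CachedToolEnv` class, its key format, and the cache-miss payload are hypothetical, and the paper's hybrid reward is not modeled.

```python
import json

class CachedToolEnv:
    """Serves tool results from a pre-recorded cache so rollouts need no live execution."""

    def __init__(self, cache):
        self.cache = cache
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(tool, args):
        # Sort argument keys so logically identical calls map to the same entry.
        return tool + ":" + json.dumps(args, sort_keys=True)

    def call(self, tool, args):
        k = self.key(tool, args)
        if k in self.cache:
            self.hits += 1
            return self.cache[k]
        self.misses += 1
        # A real system needs a policy for misses; here we just flag them.
        return {"error": "cache_miss"}

# Hypothetical recorded result for a single tool call.
cache = {CachedToolEnv.key("get_weather", {"city": "Paris"}): {"temp_c": 21}}
env = CachedToolEnv(cache)

first = env.call("get_weather", {"city": "Paris"})
second = env.call("get_weather", {"city": "Oslo"})
print(first, second, env.hits, env.misses)
```

Tracking hits and misses matters because a policy that drifts away from the recorded calls sees more misses, one source of the noise the summary says the method must learn through.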

tool calling RL efficiency

#28

KPMG pulls an AI-usage report after apparent hallucinations

Industry 2026-06-13 TechCrunch — AI 6.2 6.0/6.4/6.2

KPMG withdrew a published report on AI usage after apparent fabrications were found in the document, TechCrunch reported. The episode is another instance of an AI-assisted professional deliverable being retracted for hallucinated content, reinforcing that even sophisticated enterprise users continue to ship unverified model output. For practitioners it is a reminder that retrieval grounding and human review remain load-bearing in high-stakes reporting.

hallucination KPMG reliability

#29

RepFusion: using multimodal LLM priors for denoising in representation space

Generative Media 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.2 6.2/6.1/6.3

In typical text-to-image systems LLMs only encode text while a separately trained backbone does denoising. RepFusion leverages representation autoencoders, which shift the generation target toward semantically structured visual representations more compatible with pretrained LLM priors, and — inspired by how a simple MLP projector aligns visual representations in multimodal LLMs — brings those priors into the denoising step itself rather than only the text encoder.

text-to-image diffusion representations

#30

MBench: benchmarking memory in video world models

Evaluations & Benchmarks 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.2 6.1/6.3/6.2

Video world models can synthesize high-fidelity sequences but often fail to maintain a stable internal state over long horizons. MBench argues existing benchmarks over-weight visual quality, motion coherence, and text-video alignment while overlooking memory — the core requirement of a functional world model — and introduces a comprehensive suite to measure whether a model preserves and reasons over its state across extended temporal spans.

world models memory benchmark

#31

RhymeFlow: training-free video-generation speedup via asynchronous denoising

Efficiency 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.2 6.3/6.0/6.3

Diffusion-transformer video models are bottlenecked by the quadratic cost of 3D attention. Rather than only cutting per-step compute via sparse attention or KV-caching, RhymeFlow relaxes the rigid constraint that every frame advances through the same denoising step in lockstep, scheduling denoising asynchronously across frames. The training-free approach reduces latency while preserving the standard pipeline’s quality.

video generation diffusion acceleration

#32

SciAgentArena: benchmarking AI agents on real scientific challenges

AI for Science 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.2 6.1/6.5/6.0

SciAgentArena evaluates AI agents in realistic research scenarios, addressing two gaps: agent benchmarks rarely capture the complexity and extended reasoning of scientific work, while scientific-task benchmarks reduce research to static problems with little interactive evaluation. It comprises roughly 200 tasks with stepwise verification inside an interactive, agent-agnostic environment, aiming to measure practical research capability rather than one-shot answer accuracy.

AI for science agents benchmark

#33

When is your LLM steerable? Predicting activation-steering success

Interpretability 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.2 6.1/6.4/6.1

Activation steering is a lightweight inference-time control method, but its success depends heavily on prompt, concept, model, and configuration, usually requiring expensive grid searches and post-hoc rollout evaluation to find the working regime. This work investigates whether steerability can be predicted from a model’s internal states at the very start of generation, which would let practitioners anticipate where steering will and will not take hold without exhaustive search.

activation steering interpretability control

#34

Avatar V: behaviorally faithful avatar video from reference clips

Generative Media 2026-06-15 Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.2 6.3/6.0/6.3

Avatar V targets avatars that are not only visually similar but behaviorally recognizable — reproducing a person’s talking rhythm, gestural tendencies, and expression dynamics. The authors argue single static-image conditioning provides insufficient identity and motion information, and that pixel-level objectives underserve the perceptually critical facial regions; Avatar V instead conditions on reference video and scales video-reference avatar generation to capture dynamic motion traits.

avatars video generation identity