Wolf Digest — Thursday, June 4, 2026

Coverage window: 2026-06-03 03:47 ET to 2026-06-04 03:35 ET
#1 · AI for Science
Introducing new capabilities to GPT-Rosalind
8.1 · 1 src
#2 · Multimodal
Cosmos 3: Omnimodal World Models for Physical AI
7.9 · 2 srcs
#3 · Frontier LLMs
OCC-RAG: Optimal Cognitive Core for Faithful Question Answering
7.6 · 2 srcs
#1
AI for Science 2026-06-03 OpenAI Research 8.1 8.4/8.5/7.4

OpenAI shipped a major update to GPT-Rosalind, its model series purpose-built for life-sciences research, on June 3. The release fuses GPT-5.5's agentic coding and tool-use with sharper intelligence in the core drug-discovery domains of medicinal chemistry and genomics, and extends performance across broader analysis, design, and experimental workflows. The framing is deliberately enterprise: not a chat model that happens to know biology, but a research partner that plans analyses, runs tools, and preserves provenance across a long workflow.

The headline numbers are interesting as much for their modest absolute level as for the deltas, which is an honest signal that these benchmarks are hard and far from saturated. On MedChemBench, a new suite covering structure-activity relationships, potency, toxicity, absorption-distribution-metabolism-excretion prediction, multi-parameter lead optimization, and retrosynthesis, GPT-Rosalind scores 27.5 percent versus 25.1 percent for GPT-5.5 while using 7.2 percent fewer tokens. On GeneBench, an agentic, long-horizon genomics and quantitative-biology evaluation, it reaches 21.6 percent versus 20.4 percent while using 31 percent fewer tokens. On LabWorkBench, which links perturbations to outcomes in real and deliberately uncontaminated wet-lab protocols, it scores 63.2 percent versus 55.8 percent with 5.3 percent fewer tokens. OpenAI also introduced LifeSciBench, an externally expert-judged benchmark that takes an end-to-end view across six workflow areas rather than scoring a single capability in isolation.
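
The gap between absolute level and delta is easier to see when the quoted scores are turned into point gains and relative gains. The following is a minimal sketch that only re-derives figures from the numbers quoted above; the names `RESULTS` and `summarize` are illustrative and not part of any OpenAI tooling.

```python
# Scores in percent as quoted above:
# (GPT-Rosalind score, GPT-5.5 score, token reduction in percent).
RESULTS = {
    "MedChemBench": (27.5, 25.1, 7.2),
    "GeneBench": (21.6, 20.4, 31.0),
    "LabWorkBench": (63.2, 55.8, 5.3),
}


def summarize(new, old):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    delta = new - old
    return round(delta, 1), round(100 * delta / old, 1)


for name, (new, old, tokens_saved) in RESULTS.items():
    points, relative = summarize(new, old)
    print(f"{name}: +{points} points, +{relative}% relative, {tokens_saved}% fewer tokens")
```

Read this way, the largest relative gain is on LabWorkBench, while the largest token saving is on GeneBench.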

On the product side, two plugins, Life Sciences Research and Life Sciences NGS Analysis, are now available to all users through Codex, bringing sourced evidence retrieval and bioinformatics execution into the same workspace, alongside interactive viewers for sequence, alignment, and structure file types. A walkthrough follows a scientist analyzing a liquid-tumor ctDNA biopsy, narrowing to a KRAS G12C alteration, then pulling target and resistance context and inspecting the inhibitor-bound pocket in-line. The model is available in research preview to eligible organizations through a trusted-access deployment structure that requires legitimate research, governance, and enterprise-grade security, and OpenAI named Novo Nordisk as an early partner scaling its medical research on the model.

Why it matters: a frontier lab is now putting a domain-specialized, tool-using model directly into regulated drug-discovery pipelines, with the token-efficiency gains that make long-horizon agentic runs economical. The biodefense framing, tied to OpenAI's Rosalind Biodefense effort, also signals that capability gating and trusted access are being treated as first-class deployment concerns rather than afterthoughts.

frontier model · drug discovery · genomics · Codex · biodefense
#2
Multimodal 2026-06-01 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 7.9 8.6/8.0/7.1

NVIDIA introduced Cosmos 3, a family of omnimodal world models that jointly process and generate language, image, video, audio, and action sequences inside a single unified mixture-of-transformers architecture. The pitch is consolidation: by supporting highly flexible input-output configurations, one framework subsumes what until now were separate model classes, namely vision-language models, video generators, world simulators, and world-action policy models. For Physical AI, where an embodied agent needs to perceive, predict, and act across all of those modalities, collapsing them into a shared backbone is the architecturally interesting move.

The evaluation claims are broad. NVIDIA reports that Cosmos 3 establishes a new state of the art across a diverse suite of understanding and generation tasks, positioning omnimodal world models as scalable, general-purpose backbones for embodied agents rather than a stack of specialized components. The most concrete external signals are two independent rankings cited at the time the technical report was written: the post-trained Cosmos 3 models were ranked the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena. That last result is the one to watch, because a single model topping a robot-policy arena while also leading open image and video generation is exactly the cross-modal transfer story the architecture is meant to deliver.

Crucially, this is an open release. NVIDIA is making the code, model checkpoints, curated synthetic datasets, and the evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 license, with distribution through GitHub and Hugging Face. That combination of open weights and open synthetic data lowers the barrier for academic and startup robotics groups that cannot train a world model from scratch, and it directly pressures the closed video-and-world-model offerings on cost and reproducibility.

The caveats are the usual ones for a sweeping foundation-model claim. A mixture-of-transformers that spans five modalities carries real inference and memory cost, the leaderboard rankings are time-stamped snapshots in a field that moves weekly, and a unified backbone that is best-in-class on average can still trail narrow specialists on any single axis. Even so, the direction is significant: world models are increasingly being framed not as video toys but as the simulation-and-policy substrate for embodied agents, and Cosmos 3 is the most complete open instantiation of that thesis so far.

world models · omnimodal · physical AI · NVIDIA · open weights
#3
Frontier LLMs 2026-05-30 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 7.6 7.0/7.3/8.5

The most-upvoted paper on Hugging Face today was OCC-RAG, which argues against the reflex that more parameters are always the answer. Its premise is that many practical applications benefit more from robust reasoning over supplied context than from extensive knowledge baked into weights, which makes a task-specialized small language model a principled design choice rather than a compromise. The authors introduce Optimal Cognitive Core, a family of small models built around that idea, and present OCC-RAG as the variant tuned for faithful question answering grounded strictly in the provided passages.

The task is defined to force the behavior the authors want: multi-hop reasoning over supplied passages while deliberately ignoring memorized knowledge. To train for it, they build a pipeline that synthesizes multi-context, multi-hop question-answer data at scale, producing a corpus of more than three million examples that target three properties at once: multi-hop reasoning, strict context faithfulness, and calibrated abstention, meaning the model should refuse when the context does not support an answer. OCC-RAG is released at two sizes, 0.6 billion and 1.7 billion parameters, both mid-trained on this corpus. A notable design detail is that the models emit structured reasoning traces with source citations grounded in literal quotes from the context, which makes their answers auditable rather than asking the reader to trust a freeform generation.

The results are where the specialization thesis earns its attention. Across multi-hop reasoning benchmarks including HotpotQA, MuSiQue, and TAT-QA, a faithfulness benchmark in ConFiQA, and a refusal benchmark in the unanswerable split of MuSiQue, the compact OCC-RAG models match or exceed general-purpose models that are two to six times their size. In other words, a model with fewer than two billion parameters, trained narrowly on grounded reasoning and abstention, can beat much larger generalists on exactly the axes that make retrieval-augmented generation unreliable in production.

Why it matters: faithfulness and well-calibrated refusal are the two failure modes that most often sink real retrieval systems, because a fluent but unsupported answer is worse than no answer in domains like search, support, and analytics. Demonstrating that a small, citation-grounded model can outperform larger general models on those metrics is a concrete argument for specialized small models in the RAG stack, and the strong community response suggests there is real appetite for models that are cheap to serve and easy to audit. The open question the paper leaves is how far the approach generalizes beyond the benchmark distributions it was synthesized against.

RAG · small language models · faithfulness · multi-hop QA · abstention
#4
Safety, Policy & Regulation 2026-06-03 Anthropic News 7.5 7.3/8.4/6.8

Anthropic published a quantitative look at how attackers are actually using AI, drawn from its own enforcement data and mapped onto MITRE ATT&CK, the long-standing catalog of adversary tactics and techniques. The study examines 832 accounts banned for malicious cyber activity between March 2025 and March 2026, a subset chosen because they carried enough detail for a thorough assessment, and some of the results also appear in Verizon's 2026 Data Breach Investigations Report. Three conclusions stand out, and together they form an argument that the security community's existing instincts are becoming unreliable.

First, attackers are using AI in ways that make them more dangerous, and the usage is moving deeper into the attack life cycle. The single most common activity was writing malware, present in 560 of the 832 accounts, or 67.3 percent. But over the year the mix shifted from gaining initial access toward post-compromise work: the use of AI for account discovery, identifying valid accounts inside a breached environment, rose 8.9 percent, while AI-assisted phishing fell 8.6 percent. The worrying implication is that techniques like lateral movement and privilege escalation, which used to require genuine expertise, can now be performed on behalf of less sophisticated actors.

Second, it is getting harder to judge how dangerous an actor is. The share of actors classified as medium risk or higher jumped from 33 percent in the first six-month window to 56 percent in the second, roughly a 1.7-fold increase. And the traditional signals no longer separate the strong from the weak: the least-skilled actors in the dataset used about 16 distinct techniques on average and the most skilled about 20, and the specific interface, whether Claude Code, an API, or a chat surface, did not correlate with risk either. The more durable differentiator, Anthropic argues, is the scaffolding an attacker builds, meaning architectures that let a model chain discrete attack stages together with minimal human input.
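
The two derived figures in the first and second findings, the malware share and the fold increase in medium-or-higher risk actors, follow directly from the raw counts quoted above. A short check, using illustrative variable names that do not come from the Anthropic report:

```python
# Counts and shares as quoted in the digest text above.
banned_accounts = 832
malware_accounts = 560

# Share of banned accounts that used AI to write malware.
malware_share = round(100 * malware_accounts / banned_accounts, 1)

# Share of actors rated medium risk or higher, per six-month window (percent).
first_window = 33
second_window = 56
fold_increase = round(second_window / first_window, 1)
point_increase = second_window - first_window

print(f"malware share: {malware_share}%")
print(f"medium-or-higher risk: x{fold_increase} ({point_increase} percentage points)")
```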

Third, and most pointed, MITRE ATT&CK does not yet capture this agentic orchestration. Anthropic revisits the state-sponsored espionage operation it disrupted in November 2025, in which an actor manipulated Claude Code into attempting intrusions worldwide with little human intervention. Mapped onto ATT&CK that operation used 30 techniques across 13 tactics, comparable to a merely medium-risk actor, yet Anthropic's own risk method scores it the maximum of 100, because counting techniques badly understates an autonomous agent that executes commands, exploits vulnerabilities, steals credentials, and makes tactical decisions on its own. There is simply no ATT&CK identifier for that behavior. Anthropic says the findings informed cyber safeguards now deployed on its most capable models to detect and block activities like malware development and mass data exfiltration, and that it is in discussions with MITRE about how the framework should evolve.

Why it matters: this is one of the few data-grounded windows into how frontier models are changing real attacker behavior, and a concrete warning that defensive taxonomies built for human operators need to be rewritten for agentic ones.

cybersecurity · MITRE ATT&CK · agentic threats · red team · Verizon DBIR
#5
Robotic Autonomy 2026-06-02 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 7.4 7.8/7.3/7.1

Humanoid-GPT recasts whole-body humanoid control as autoregressive sequence modeling, using a GPT-style causal-attention Transformer pre-trained on a 2B-frame retargeted motion corpus that unifies all major mocap datasets with large-scale in-house captures. The authors argue that prior shallow-MLP trackers were data-starved and stuck on an agility-vs-generalization trade-off; scaling both data and model capacity instead yields a single generative model that tracks highly dynamic behaviors while showing strong zero-shot generalization to unseen motions and control tasks. Scaling analyses indicate a new performance frontier for motion tracking.

cs.RO humanoid motion-tracking scaling
#6
AI for Science 2026-05-22 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 7.3 7.2/7.4/7.3

BrainCause attacks a confound in visual neuroscience: strong activation for a concept (faces, places) does not prove a region represents it, since responses may track correlated visual or semantic cues. The framework couples a generative model with an image-to-fMRI encoding model to synthesize controlled stimulus sets—concept images, counterfactual edits that remove the target while preserving other content, and correlated distractors—then searches for representations that respond specifically to the target. It recovers known functional localizations and proposes new candidate representations across dozens of concepts, validated on predicted and measured fMRI, and shows a large fraction of activation-only localizations are false positives.

q-bio.NC fMRI interpretability causal
#7
Post-Training 2026-05-31 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 7.2 7.3/7.2/7.1

Trust Region On-Policy Distillation (TrOPD) stabilizes on-policy distillation, where teacher supervision on student-generated tokens yields unreliable gradients under large teacher-student distribution mismatch. TrOPD restricts the reverse-KL (K1) objective to a trust region where the teacher is reliable, handles outlier tokens via gradient clipping, masking, or forward-KL, and adds off-policy guidance by having the student continue from teacher prefixes under forward KL to steer exploration toward reliable regions. It consistently beats SoTA OPD baselines including OPD, EOPD, and REOPOLD across mathematical reasoning, code generation, and general-domain benchmarks.
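
To make the masking idea concrete, here is a toy sketch of a per-token reverse-KL estimate restricted to a subset of tokens. It is not the authors' implementation: the trust-region test used here (keep a token only when the absolute student-teacher log-probability gap is at most `delta`) is an assumption made for illustration, as are the function name `masked_k1_loss` and the example numbers.

```python
def masked_k1_loss(student_logps, teacher_logps, delta=2.0):
    """Average the per-token K1 estimate (log p_student - log p_teacher)
    over tokens inside an assumed trust region.

    ASSUMPTION: a token counts as "reliable" when the absolute log-ratio is
    at most `delta`. The paper's actual criterion may differ.
    Returns (loss, number_of_tokens_kept).
    """
    ratios = [s - t for s, t in zip(student_logps, teacher_logps)]
    kept = [r for r in ratios if abs(r) <= delta]
    if not kept:
        return 0.0, 0
    return sum(kept) / len(kept), len(kept)


# Log-probabilities of four student-sampled tokens under each model (made up).
student = [-0.5, -1.0, -0.2, -0.1]
teacher = [-0.7, -0.9, -6.0, -0.3]  # the third token is a large mismatch

loss, kept = masked_k1_loss(student, teacher)
print(f"kept {kept} of {len(student)} tokens, masked K1 loss = {loss:.3f}")
```

In this toy run the third token, where the teacher assigns far lower probability than the student, is dropped from the objective instead of dominating the gradient.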

cs.LG distillation post-training reverse-KL
#8
Efficiency 2026-06-02 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 7.1 7.4/6.9/7.0

KVarN targets KV-cache quantization for long-horizon reasoning decode, where prior methods—evaluated in prefill-like settings—break down because quantization errors accumulate across autoregressive timesteps, driven mainly by incorrect per-token scales. The calibration-free quantizer applies a Hadamard rotation followed by dual-scaling variance normalization across both axes of the K and V matrices, correcting outlying token-scale errors and sharply reducing error accumulation. At 2-bit precision it sets a new state of the art on generative benchmarks including MATH500, AIME24, and HumanEval, with a vLLM implementation released.
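
The general shape of "rotate, normalize along both axes, then round to four levels" can be sketched in a few lines of plain Python. This is a loose illustration under stated assumptions, not KVarN itself: the use of RMS as the scale, the single normalization pass per axis, the fixed grid (`CLIP`, `STEP`), and all function names are choices made here for the example.

```python
import math

CLIP = 1.5              # ASSUMED clipping range for normalized values
STEP = 2 * CLIP / 3     # four levels: -1.5, -0.5, 0.5, 1.5


def hadamard(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).
    With this scaling the transform is its own inverse."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]


def rms(values):
    """Root mean square, falling back to 1.0 for an all-zero vector."""
    return math.sqrt(sum(v * v for v in values) / len(values)) or 1.0


def quantize_2bit(matrix):
    """Rows are tokens, columns are channels. Returns (codes, row_scale, col_scale)."""
    rotated = [hadamard(row) for row in matrix]
    row_scale = [rms(row) for row in rotated]                      # per-token scale
    normed = [[x / s for x in row] for row, s in zip(rotated, row_scale)]
    col_scale = [rms(col) for col in zip(*normed)]                 # per-channel scale
    normed = [[x / c for x, c in zip(row, col_scale)] for row in normed]
    codes = [[min(3, max(0, round((x + CLIP) / STEP))) for x in row] for row in normed]
    return codes, row_scale, col_scale


def dequantize(codes, row_scale, col_scale):
    """Undo the grid, both scalings, and the rotation."""
    normed = [[c * STEP - CLIP for c in row] for row in codes]
    rotated = [[x * cs * rs for x, cs in zip(row, col_scale)]
               for row, rs in zip(normed, row_scale)]
    return [hadamard(row) for row in rotated]


# Two tokens with very different magnitudes (made-up values).
keys = [[0.1, -0.2, 0.3, 0.05], [8.0, -6.0, 7.0, -9.0]]
codes, row_scale, col_scale = quantize_2bit(keys)
recon = dequantize(codes, row_scale, col_scale)
print(codes)
print(recon)
```

The point of the per-token scale is visible in the example: the small-magnitude first token and the large-magnitude second token are mapped onto the same four-level grid instead of sharing one scale.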

cs.LG kv-cache quantization efficiency
#9
Agents & Tool Use 2026-06-01 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 7.1 7.0/7.3/7.0

This work moves deep-research-agent evaluation from final-answer accuracy to span-level error localization—pinpointing which parts of a long search/tool-use/synthesis trajectory make an answer unreliable. The authors collect 2,790 real trajectories across two agent frameworks, three backbones, and three benchmarks, segment logs into semantic spans, and annotate harmful error spans via LLM-assisted expert review to build TELBench (1,000 instances). Their DRIFT auditor tracks agent claims, checks support in trajectory evidence, and flags spans with unsupported or conflicting claims, improving span-level localization and first-error accuracy by up to 30 percentage points.

cs.AI agents evaluation error-localization
#10
Robotic Autonomy 2026-06-02 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 7.0 7.2/6.8/7.0

NVIDIA OmniDreams is a real-time generative world model for closed-loop autonomous-vehicle simulation, mid- and post-trained from the Cosmos diffusion model on 21k hours of driving to autoregressively generate action-conditioned sensor video. Unlike reconstruction-based neural simulators bounded by their captured data, it synthesizes unobserved phenomena such as extreme weather and unpredictable agent behavior, conditioning each frame on past frames, simulator state, and the immediate driving action; it runs in closed loop with the Alpamayo 1 policy and AlpaSim orchestrator. A world-action model post-trained from it surpasses the VLA-based Alpamayo 1.5 policy on the Physical AI NuRec dataset using one-fifth the parameters.

cs.RO world-model autonomous-driving diffusion
#11
Multimodal 2026-06-02 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 7.0 7.0/7.0/7.0

This paper studies when visual world models help multimodal LLMs reason: world models render concrete stochastic rollouts of possible futures, MLLMs reason abstractly over goals and rules, but rollouts can be plausible yet task-incorrect. The authors frame controlled concrete reasoning—learning to invoke, verify, and integrate visual simulation—and release two human-verified benchmarks, VRQABench (spatial lookahead) and OpenWorldQA (open-domain physical prediction). Their method, Privileged-Future On-Policy Self-Distillation (PF-OPSD), uses ground-truth future videos only as teacher-side privileged context while the deployable student never sees true futures, beating the baseline by 10.6% and 10.9% and improving robustness to noisy or conflicting rollouts.

cs.CV world-models multimodal self-distillation
#12
Reinforcement Learning 2026-06-01 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 7.0 7.0/7.1/6.9

This work gives a localized mechanistic account of cross-domain interference in multi-domain RL post-training, where training on one domain (math, code, QA, creative writing) degrades others even when full-model gradients are nearly orthogonal—contradicting catastrophic-forgetting and global-conflict explanations. Single-domain RL produces sparse, small-magnitude edits with weak neuron overlap but shared computation routes; the authors prove under a local perturbation model that later-domain harm concentrates in a low-dimensional shared conflict subspace. A brief Re-Math refresh after Code-Math-QA-CW recovers Math from 57.66 to 66.04 (best average 66.39), and a training-free rollback on a sparse proxy conflict set partially restores it.

cs.LG reinforcement-learning interference post-training
#13
Research 2026-06-02 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.9 6.8/7.0/6.9

This Google-affiliated work proposes a "Sleep" paradigm for continual learning, addressing models' inability to transfer temporal in-context knowledge into long-term parameters. Sleep has two stages: Memory Consolidation via Knowledge Seeding, an upward distillation that distills a smaller self's memories into a larger network—instantiated as a Generalized Distillation combining on-policy distillation with RL-based imitation learning—and Dreaming, a self-improvement phase where RL generates a synthetic-data curriculum to rehearse new knowledge and refine capabilities without human supervision. Experiments on long-horizon, continual-learning, knowledge-incorporation, and few-shot generalization tasks support the value of the sleep stage.

cs.LG continual-learning memory distillation
#14
Generative Media 2026-06-02 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.9 7.2/6.6/6.9

Qwen-Image-Flash reframes few-step distillation of visual generative models away from objective design toward the training recipe. Using Qwen-Image-2.0 as a case study, the authors systematically vary three factors across unified text-to-image generation and instruction-guided image editing—data composition, teacher guidance, and task mixture—surfacing several non-obvious behaviors. The takeaway: effective few-step distillation needs not only well-designed objectives but principled organization of the broader pipeline, with those findings motivating the released Qwen-Image-Flash model.

cs.CV diffusion distillation text-to-image
#15
Evaluations & Benchmarks 2026-06-01 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.9 6.8/7.0/6.9

AutoMedBench is a workflow-aware benchmark for autonomous medical-AI research, scoring not just final outputs but a unified five-stage workflow—Plan, Setup, Validate, Inference, Submit—across segmentation, image enhancement, VQA, report generation, and lesion detection. Tasks are long-horizon, averaging 33 agent turns, under Lite and Standard scaffolding tiers. Across thousands of runs, Validate is the weakest stage and Setup the strongest, indicating agents make pipelines executable but fail to verify reliability; verification and submission failures dominate tagged errors at 37.7% and 38.1% while task-understanding errors are rare at 0.9%, and a single fired error code lowers overall score by 48% on average.

cs.AI agents medical benchmark
#16
Safety, Policy & Regulation 2026-06-03 MIT Technology Review — AI 6.9 6.8/7.2/6.7

Trump signed a new AI executive order on Tuesday, less than two weeks after rescinding the prior one. Key provisions: a voluntary review system asking companies to share frontier models with the government 30 days before release (down from the shelved draft's 90 days); no mandatory pre-deployment licensing; and a dedicated AI cybersecurity clearinghouse to coordinate security checks with industry. Separately, MIT Technology Review details Anduril and Meta's military AR headset prototype, which envisions ordering drone strikes via eye-tracking and voice and aims to optimize "the human as a weapons system."

AI policy · executive order · frontier models · Anduril · Meta · military AR
#17
Post-Training 2026-05-29 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.8 6.7/6.9/6.8

MIRA is a source-aware data-selection framework for LLM mid-training, where data are optimized under a pretraining-style objective at near-pretraining scale but curated toward downstream capabilities from heterogeneous sources—a regime where model-based selectors give only implicit signals and semantic selectors assume fixed rubrics. MIRA makes rubric construction part of selection via self-anchored rubric discovery: it first discovers what to evaluate per source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training across 21 sources and 5 source groups, it beats baselines on nine code benchmarks and matches the full-corpus run using only half the tokens.

cs.CL mid-training data-selection rubric
#18
Evaluations & Benchmarks 2026-06-03 AK (@_akhaliq) Daily Papers, arXiv cs.AI (Artificial Intelligence), arXiv cs.CL (Computation & Language), arXiv cs.CV (Computer Vision), arXiv — Evals & Benchmarks, arXiv — Mechanistic Interpretability, Hugging Face Daily Papers 6.8 6.6/6.8/7.0

M^3Eval is presented as the first comprehensive benchmark probing memory—rather than perception or reasoning—in multimodal models on long-form video: what models retain, how faithfully information is preserved, and how robust it is under interference. Grounded in cognitive psychology, its tasks isolate distinct memory dimensions. Evaluating representative multimodal models reveals consistent weaknesses: they fail to keep disentangled representations across parallel video streams, show interference patterns unlike human memory, ground memory more reliably in the spatial than the temporal domain, and exhibit limited symbolic memory.

cs.CV cs.CL memory video-benchmark
#19
Evaluations & Benchmarks 2026-06-03 AK (@_akhaliq) Daily Papers, arXiv cs.AI (Artificial Intelligence), arXiv cs.LG (Machine Learning), arXiv — Evals & Benchmarks, Hugging Face Daily Papers 6.8 6.8/7.0/6.6

AutoLab is a benchmark for ultra-long-horizon, closed-loop optimization: 36 expert-curated tasks across system optimization, puzzles, model development, and CUDA kernel optimization, each starting from a correct-but-suboptimal baseline that an agent must improve within a fixed wall-clock budget. Across 17 frontier models, the dominant predictor of success is not initial-attempt quality but persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. Claude Opus 4.6 shows strong long-horizon behavior, while most models terminate prematurely or burn their budget with minimal progress, underscoring time-awareness as a distinct capability gap. The benchmark, harness, and task artifacts are open-sourced.

benchmark agents long-horizon frontier-models evaluation
#20
AI Coding 2026-06-03 OpenAI Research 6.8 7.0/6.6/6.8

OpenAI details how Wasmer used Codex, running on GPT-5.5, to build a Node.js runtime for the edge, reporting a 10x-20x acceleration in development that compressed the timeline from months to weeks. The case study positions Codex as the agentic coding driver for systems-level work (a WebAssembly-based runtime), part of OpenAI's push to show frontier coding models delivering large engineering-velocity gains on real infrastructure projects rather than toy tasks.

OpenAI Codex GPT-5.5 ai-coding WebAssembly
#21
Industry 2026-06-03 Stratechery 6.8 6.7/6.9/6.8

Stratechery analyzes Nvidia's move into PC silicon: the RTX Spark superchip (N1X), built with Microsoft and shipping this fall in Windows laptops from Dell, HP, ASUS, Lenovo, and MSI. The top configuration pairs up to 20 Arm cores with a 6,144-CUDA-core Blackwell GPU, 128GB LPDDR5X, and ~300 GB/s bandwidth over NVLink C2C, targeting on-device 120B-parameter models and million-token contexts. The piece judges it broadly DGX Spark-class—strong at prefill but slower than an M5 Max at decode and on CPU—and frames it against Intel, AMD, Qualcomm, and Apple alongside Nadella's platform pitch.

Nvidia · AI PC · RTX Spark · Microsoft · Stratechery
#22
Industry 2026-06-03 The Information — AI 6.8 6.6/6.8/7.0

Alphabet raised the size of its equity offering to $84.75 billion from $80 billion—Google's first sale of new stock since 2005. The bulk is earmarked for AI infrastructure and compute, underscoring how capital-intensive frontier-scale buildout has become and signaling Alphabet's willingness to dilute equity rather than rely solely on cash flow and debt to fund data-center and accelerator spending.

Alphabet · Google · equity raise · AI infrastructure · capex
#23
Evaluations & Benchmarks 2026-06-02 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.7 6.6/6.7/6.8

VSTAT (Visual STAte Tracking) targets a capability missing from existing video evals: continuously tracking entities, states, and events across a full clip. It comprises 834 synthetic and real-world clips with 1,500 questions deliberately unanswerable from any single frame or short segment. State-of-the-art MLLMs score far below humans and only modestly above answer-prior baselines despite strong results on prior video benchmarks. Comparing thinking traces against the video stream localizes the failure: models reason and track correctly in text but fail at visually perceiving the events they must track. Agentic video and coding agents do not resolve the gap.

benchmark multimodal video-understanding MLLM state-tracking
#24
Agents & Tool Use 2026-06-03 AK (@_akhaliq) Daily Papers, arXiv cs.AI (Artificial Intelligence), arXiv cs.CL (Computation & Language), arXiv — Evals & Benchmarks, Hugging Face Daily Papers 6.7 6.6/6.8/6.7

StreamMA replaces the "generate-then-transfer" paradigm in multi-agent reasoning, where end-to-end latency scales linearly with pipeline depth, by streaming each reasoning step to downstream agents as it is produced, pipelining adjacent agents. Streaming also improves accuracy: because early reasoning steps are more reliable than later ones, downstream agents act on trustworthy early content rather than error-prone late chains. The authors give the first closed-form joint analysis of stream, serial, and single protocols (effectiveness ordering, speedup bound, cost ratio). Across eight benchmarks, Claude Opus 4.6 and GPT-5.4, and Chain/Tree/Graph topologies, StreamMA averages +7.3 pp (max +22.4 pp on HMMT 2026), and exposes a per-agent "step-level scaling law" orthogonal to agent-count scaling.
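
The latency argument can be illustrated with a textbook pipeline model. This is a sketch under simplifying assumptions, not the paper's closed-form analysis: every agent is assumed to emit the same number of reasoning steps, every step is assumed to take the same time, and a downstream agent is assumed to be able to start a step as soon as its upstream neighbor has emitted the matching one. The function names are hypothetical.

```python
def serial_latency(depth, steps, step_time=1.0):
    """Generate-then-transfer: each agent waits for the full upstream output."""
    return depth * steps * step_time


def streamed_latency(depth, steps, step_time=1.0):
    """Idealized pipelining: adjacent agents overlap, so only the pipeline
    fill (depth - 1 steps) is added to one agent's own work."""
    return (steps + depth - 1) * step_time


for depth in (2, 4, 8):
    serial = serial_latency(depth, steps=10)
    streamed = streamed_latency(depth, steps=10)
    print(f"depth={depth}: serial={serial:.0f}, streamed={streamed:.0f}, "
          f"speedup={serial / streamed:.2f}x")
```

Under these assumptions serial latency grows with the product of depth and steps, while streamed latency grows with their sum, which is why the benefit widens as pipelines get deeper.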

multi-agent reasoning inference-latency scaling-laws pipelining
#25
Reinforcement Learning 2026-06-01 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.7 6.7/6.6/6.8

TRON is an online environment substrate for visual-reasoning RL that replaces static curated image-QA datasets with a controllable generator-verifier program: each rollout samples a fresh latent visual state, renders an image, poses a question, and exactly verifies the answer, yielding an unbounded stream of instances at the curriculum's current difficulty. The suite spans 520 environments across five ability buckets (spatial, mathematical, diagram, pattern/logic, counting) and supports both a single all-bucket model and per-bucket specialists with no extra data collection. The paper also analyzes generation reliability, instance/level diversity, near-duplicates, and base-model pass rate by difficulty. RL post-training consistently improves ten external multimodal benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT.
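
The generator-verifier loop described above (sample a latent state, render it, pose a question, verify exactly) can be shown with a toy counting environment. Everything here is a hypothetical stand-in: the text grid plays the role of the rendered image, and the names `make_counting_instance`, `verify`, and `rollout` are not taken from TRON.

```python
import random


def make_counting_instance(rng, level):
    """Sample a fresh latent state at the given difficulty and 'render' it."""
    size = 3 + level                                   # harder levels use larger grids
    grid = [[rng.choice(".#") for _ in range(size)] for _ in range(size)]
    image = "\n".join("".join(row) for row in grid)    # stand-in for a rendered image
    answer = sum(row.count("#") for row in grid)       # ground truth from the latent state
    return {"image": image, "question": "How many # cells are there?", "answer": answer}


def verify(instance, prediction):
    """Exact verification against the latent state; no learned judge involved."""
    return int(prediction) == instance["answer"]


def rollout(policy, rng, level):
    """One RL rollout: fresh instance in, binary verifiable reward out."""
    instance = make_counting_instance(rng, level)
    prediction = policy(instance["image"], instance["question"])
    return 1.0 if verify(instance, prediction) else 0.0


def oracle_policy(image, question):
    """A perfect 'model' for demonstration purposes."""
    return image.count("#")


rng = random.Random(0)
print([rollout(oracle_policy, rng, level) for level in range(3)])
```

Because each call samples a new grid, the stream of instances is unbounded, and difficulty is controlled by a single `level` argument rather than by collecting more data.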

reinforcement-learning visual-reasoning synthetic-data verifiable-rewards post-training
#26
Post-Training 2026-06-01 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.7 6.6/6.7/6.8

MERIT attacks two costs of scaling instruction tuning to heterogeneous mixtures, gradient interference and bandwidth-heavy synchronization, by training partitions independently and reconciling them once in parameter space. A local quadratic theory in a shared flat basin shows weight merging yields curvature-weighted variance reduction, PCA-aligned conflict splitting maximizes the gain along high-curvature directions, and merging acts as spectral filtering with implicit norm regularization. The pipeline estimates dataset-level gradient conflicts, partitions along top PCA conflict axes, fine-tunes each partition without inter-partition communication, then merges via token-weighted averaging. On Qwen2.5-VL-3B (136 Vision-FLAN tasks) it lifts the 8-benchmark average from 54.3 to 57.0, scales to a 7B model on a 1.6M-example, 176-source mixture matching centralized joint training, and transfers to text-only FLAN.
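
The final merge step, token-weighted averaging, is simple enough to show directly. The sketch below covers only that step, with checkpoints represented as plain dictionaries of float lists; the conflict estimation and PCA-based partitioning are not modeled, and the function name and data layout are assumptions for illustration.

```python
def merge_token_weighted(partitions):
    """Average independently fine-tuned checkpoints in parameter space,
    weighting each by the number of tokens it was trained on.

    partitions: list of (weights, token_count), where weights maps a
    parameter name to a list of floats with identical shapes across partitions.
    """
    total_tokens = sum(tokens for _, tokens in partitions)
    first_weights = partitions[0][0]
    merged = {}
    for name, values in first_weights.items():
        merged[name] = [
            sum(weights[name][i] * tokens for weights, tokens in partitions) / total_tokens
            for i in range(len(values))
        ]
    return merged


# Two partitions trained on 300 and 100 tokens respectively (made-up numbers).
part_a = ({"w": [1.0, 2.0]}, 300)
part_b = ({"w": [3.0, 6.0]}, 100)
print(merge_token_weighted([part_a, part_b]))
```

The partition that saw three times as many tokens pulls the merged weights three times as hard, and no communication between partitions is needed until this single reconciliation.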

instruction-tuning model-merging decentralized-training gradient-conflict multimodal
#27
Generative Media 2026-06-03 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.7 6.8/6.4/6.9

Echo-Infinity is an autoregressive framework for real-time infinite video generation that replaces handcrafted KV-cache schedules and fixed-ratio compression with a learnable evolving memory. A set of Memory Query vectors is updated via attention and gating as frames are evicted from the local window and is optimized end-to-end with the video DiTs; it filters, abstracts, and compresses arbitrary-length history at constant compute, independent of video length, and also serves as a generation prior. A Unified Relative RoPE Recipe anchors sink frames at id 0 and caps the newest frame at the pretrained maximum temporal RoPE id, closing the train-test extrapolation gap. It sets a new state of the art on long and short video generation and demonstrates 24-hour (>1.3M-frame) real-time rollouts, a first step toward practical infinite generation.

video-generation autoregressive memory diffusion-transformers RoPE
#28
Safety, Policy & Regulation 2026-06-03 OpenAI Research 6.7 6.5/7.2/6.4

OpenAI published a blueprint for democratic governance of frontier AI, proposing a U.S. federal framework organized around safety, resilience, and national security. The document argues for federal-level governance of frontier models, situating safety and security oversight within national-security policy. As a position paper from a frontier lab, it signals how OpenAI wants U.S. regulation structured rather than detailing binding mechanisms, and lands amid active federal debate over AI policy and executive action.

OpenAI AI-governance policy frontier-AI national-security
#29
Industry 2026-06-03 Latent Space Podcast · Latent Space (swyx & Alessio) 6.7 6.5/6.8/6.8

Latent Space and No Priors taped a live crossover with Microsoft CEO Satya Nadella at Build. Nadella positions Microsoft as a "Frontier Intelligence Platform," arguing a platform must create more value outside itself than it captures: customers build on multi-model harnesses (OpenClaw, Scout), draw on enterprise context layers like Work IQ, and accumulate private evals and traces as a new form of "Token IP." He also fields the awkward enterprise economics—token-spend cuts, layoffs, and a Build-vs-Buy shift that pressures the SaaS model Microsoft itself epitomizes.

Microsoft Satya Nadella podcast enterprise AI platform strategy
#30
Research 2026-06-03 The Cognitive Revolution (Nathan Labenz) 6.7 6.8/6.9/6.4

Cognitive Revolution interviews Cornell/Google researcher Ali Behrouz on Nested Learning, his framework for continual learning that updates different parts of a model at different frequencies—mirroring human working-to-long-term memory—to adapt to context while preserving core knowledge. He recasts every ML component as associative memory compressing a context flow (calling fixed "architectures" an illusion) and builds expressive optimizers that learn update rules and beat Adam and Muon. His follow-up, "Language Models Need Sleep," adds an offline consolidation pass that distills high-frequency layers into slow ones and trains on self-generated synthetic data; the architectures reportedly recall across 10M-token contexts.

continual learning Nested Learning Ali Behrouz optimizers memory
#31
AI Coding 2026-06-03 The Information — AI 6.7 6.6/6.8/6.7

OpenAI's Codex, repositioned from a developer coding agent to a broader knowledge-worker agent, is drawing defectors from Anthropic's Claude Code as OpenAI improves its models for longer-horizon, more complex tasks. Internally the productivity surge has backfired: OpenAI engineers have gone from two-to-three code changes a day to more than ten, and the resulting volume has caused outages in the systems managing the company's codebase, per two people with knowledge of the situation. Each change triggers thousands of parallel test-hours for correctness and security checks.

OpenAI Codex Claude Code coding agents CI/CD
#32
Multimodal 2026-06-02 AK (@_akhaliq) Daily Papers · Hugging Face Daily Papers 6.6 6.8/6.4/6.6

PaddleOCR-VL-1.6 upgrades the 0.9B document-parsing model by targeting the residual errors of v1.5, which concentrate in under-optimized regions where behavior is unstable, data is sparse, or supervision is unreliable. Rather than indiscriminately growing the corpus, a region-aware data optimization framework identifies these weak regions from the prior model, applies targeted enhancement, and improves supervision reliability, followed by a progressive post-training recipe combining curated data selection and reinforcement learning. The result is a new state-of-the-art 96.33% on OmniDocBench v1.6, competitive with much larger top-tier VLMs while remaining compact, and a reusable staged post-training recipe for the PaddleOCR-VL series.

document-parsing OCR vision-language post-training compact-models
#33
Robotic Autonomy 2026-06-03 AK (@_akhaliq) Daily Papers · arXiv cs.RO (Robotics) · arXiv — Generative Media / Diffusion · Hugging Face Daily Papers 6.6 6.7/6.5/6.6

GRAIL is a fully virtual data-generation pipeline for humanoid loco-manipulation that avoids teleoperation and motion capture by composing 3D assets, simulator-ready scenes, and video-foundation-model priors. Starting from fully specified 3D configurations, where object geometry, camera, metric scale, environment depth, and a robot-proportioned character are known before generation and reused during reconstruction, it recovers metric 4D human-object interaction trajectories with reduced depth ambiguity and morphology mismatch. Recovered motions are retargeted to a humanoid and used to train an object-aware latent adaptor for manipulation and a scene-aware tracker for traversal. From 20,000+ generated sequences, sim-to-real egocentric visual policies on a Unitree G1 reach 84% real-world success on object pick-up and 90% on stair-climbing.

humanoid-robotics sim-to-real loco-manipulation data-generation world-models
#34
Industry 2026-06-03 Anthropic News 6.6 6.3/6.8/6.7

Anthropic expanded its Claude Partner Network (launched in March and backed by $100M), reporting 40,000+ firm applications and 10,000+ certified consultants, with large systems-integrator commitments: Accenture training 30,000, Cognizant ~350,000 associates, Deloitte 470,000, KPMG 276,000+, and PwC rolling out Claude Code and Cowork. The new Services Track defines three tiers, Select (10 certified / 2 production deployments / 1 story), Preferred (100 / 15 / 3), and Global Premier (1,000 / 100 across 3+ regions / 15 stories), measured uniformly regardless of firm size. A daily-refreshed Claude Partner Hub shows partners their standing versus requirements and lets customers find qualified firms; it is connectable to Claude via a new MCP connector. Promotions occur twice yearly (Jan 1, Jul 1), plus an Oct 1, 2026 review.
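
The three tier thresholds read naturally as a small lookup table. The sketch below encodes only the numbers quoted above; the `Tier` class, the `highest_tier` function, and the reading of "across 3+ regions" as a minimum count of deployment regions are assumptions for illustration, not Anthropic's actual scoring logic or any real API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    certified: int        # certified consultants required
    deployments: int      # production deployments required
    stories: int          # customer stories required
    min_regions: int = 0  # assumed: minimum regions the deployments must span


# Ordered from most to least demanding so the first match is the highest tier.
TIERS = [
    Tier("Global Premier", 1000, 100, 15, min_regions=3),
    Tier("Preferred", 100, 15, 3),
    Tier("Select", 10, 2, 1),
]


def highest_tier(certified: int, deployments: int, stories: int,
                 regions: int = 1) -> Optional[str]:
    """Return the highest tier whose thresholds are all met, or None."""
    for tier in TIERS:
        if (certified >= tier.certified
                and deployments >= tier.deployments
                and stories >= tier.stories
                and regions >= tier.min_regions):
            return tier.name
    return None


if __name__ == "__main__":
    print(highest_tier(certified=120, deployments=20, stories=3))  # Preferred
```

Because the thresholds are absolute counts, a small firm and a global integrator are checked against the same table, which matches the "measured uniformly regardless of firm size" framing.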

Anthropic partner-network enterprise MCP consulting
#35
AI Coding 2026-06-03 Latent Space Podcast · Latent Space (swyx & Alessio) 6.6 6.6/6.8/6.4

This Latent Space episode features Axiom Math CEO Carina Hong on formal-verification-driven AI for mathematics. The framing: Axiom, seven months old, solved all 12 Putnam problems (8/12 within the time limit), versus the exam's typical median of 0-1 points and DeepSeek's reported 103/120. Hong argues coding ability is necessary but not sufficient for AGI, and that the "informal bottleneck" (translating intuitive proofs into machine-checkable Lean) is where progress stalls. Her thesis, "verification is scaling and compounding brilliance," treats formal proofs as a way both to force articulation and to let others build on results, applying Verified Generation in both training and inference rather than relying on informal chains.

formal-math Lean verification Axiom-Math AGI
#36
Post-Training 2026-06-03 Hugging Face Blog 6.6 6.6/6.8/6.4

A Hugging Face blog post extends DPO past chatbot alignment to structured OCR via the DharmaOCR pipeline. The trick: after SFT, use the model's own degenerate outputs (repetition-loop transcriptions) as the rejected examples in (chosen, rejected) pairs—"preference-guided implicit unlikelihood"—so the DPO stage actively steers away from the failure geometry SFT can introduce. Across five model families, the DPO stage cut text-degeneration rate relative to SFT in every case, averaging 59.4% reduction (peak 87.6%, e.g. Nanonets-OCR2-3B from 1.61% to 0.20%), needing only a scoring model rather than human labels.
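
The pair-construction trick can be sketched as follows. This is an illustration under assumptions, not the DharmaOCR pipeline: the post says a scoring model flags degenerate outputs, whereas this sketch substitutes a simple repeated-n-gram heuristic (`repetition_score`), and it assumes the chosen side is a reference transcription. The field names and the 0.5 threshold are hypothetical.

```python
def repetition_score(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that are repeats (0.0 means no repetition)."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def build_preference_pairs(samples, threshold: float = 0.5):
    """Turn the SFT model's own looping outputs into DPO rejected examples.

    samples: dicts with 'prompt', 'reference', and 'model_output' keys.
    Only outputs scored as degenerate become pairs; clean outputs are skipped.
    """
    pairs = []
    for sample in samples:
        if repetition_score(sample["model_output"]) >= threshold:
            pairs.append({
                "prompt": sample["prompt"],
                "chosen": sample["reference"],       # assumed: clean transcription
                "rejected": sample["model_output"],  # the model's own failure
            })
    return pairs


if __name__ == "__main__":
    samples = [
        {"prompt": "page_1.png", "reference": "Invoice 1042 total due 310.00",
         "model_output": "total due total due total due total due"},
        {"prompt": "page_2.png", "reference": "Ship to 14 Elm Street",
         "model_output": "Ship to 14 Elm Street"},
    ]
    for pair in build_preference_pairs(samples):
        print(pair["prompt"], "->", pair["rejected"])
```

The resulting (prompt, chosen, rejected) records are the shape a DPO trainer typically consumes, and no human labeling is involved in producing them.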

DPO OCR post-training Hugging Face preference optimization
#37
Robotics 2026-06-03 NVIDIA AI Blog 6.6 6.6/6.5/6.7

NVIDIA Research is presenting three physical-AI papers at CVPR unified by training at scale for generalization. GraspGen-X is billed as the first foundation model for zero-shot grasping, trained on billions of simulated grasps to propose reliable grasp poses for any gripper geometry and unseen object without per-embodiment retraining. LCDrive replaces costly text-based chain-of-thought in autonomous driving with compact latent representations so AV stacks reason fast enough on embedded hardware. NitroGen, built on the Isaac GR00T architecture, is a generalist gameplay-AI model trained across tens of thousands of hours of simulated interaction.

NVIDIA CVPR robotic grasping autonomous driving foundation models
#38
Infrastructure 2026-06-03 The Information — AI 6.6 6.6/6.4/6.8

Broadcom reported fiscal Q2 (ended May 3) revenue of $22.187 billion, up 48% year over year and roughly in line with March guidance. AI chip revenue jumped 143% to $10.8 billion, now nearly half of total sales, driven by demand for custom XPU accelerators and AI networking silicon from hyperscaler customers. The figures underscore Broadcom's position as the leading merchant supplier of custom AI ASICs and Ethernet-based AI fabric, a counterweight to Nvidia's merchant-GPU dominance.

Broadcom AI chips earnings custom silicon infrastructure
#39
Industry 2026-06-03 The Information — AI 6.6 6.5/6.5/6.8

Apple is on track to launch its overhauled Siri in September, running in part on Google cloud servers using Nvidia chips, according to people familiar with the matter. Apple intends to keep as much inference as possible on-device (iPhone), but certain components will run in the cloud. The arrangement is notable for a company that has historically minimized external infrastructure dependence, and it signals reliance on Google's models and Nvidia's accelerators to ship a competitive LLM-based assistant after repeated delays.

Apple Siri Google Cloud Nvidia on-device inference
#40
Government & Defense 2026-06-03 FedScoop — AI 6.6 6.4/6.9/6.5

Treasury Secretary Scott Bessent defended Trump's new AI executive order, which directs Treasury to stand up an AI cybersecurity clearinghouse within 30 days to coordinate vulnerability scanning, validation and patch distribution with industry, but only on a voluntary basis. Sen. Mark Warner (D-Va.) pressed Bessent, arguing a voluntary regime puts the banking system and national security at risk and calling the order a "watered-down" version of a stronger draft. Bessent said the only change from a prior draft was a tighter deadline (90 to 30 days) and that the order balances innovation and safety.

How it was discussed
  • FedScoop: Bessent frames the order as a sound innovation-safety balance with only a tighter deadline.
  • Sen. Warner (D-Va.): voluntary cyber participation endangers banking and national security; the order is watered-down.
AI policy Treasury cybersecurity agentic AI executive order
#41
Government & Defense 2026-06-03 C4ISRNET 6.6 6.6/6.8/6.4

A CSIS and Foundation for Defense of Democracies commission report concludes current U.S. cyber forces are "insufficient" and that standing up an independent Cyber Force would cost an estimated $10 billion to $11 billion and take at least a year. The proposed service would absorb most of CYBERCOM's "service-like" force-generation duties (organize, train, equip), requiring roughly 20,000 active-duty personnel, 3,500 to 5,000 National Guard, and 6,000 civilians. The report, tied to Sen. Gillibrand's FY2027 NDAA amendment, recommends a Public Health Service-style officer model; the cited budget is already distributed across services ($7.7B for cyberspace operations in FY2027, $4.1B to CYBERCOM).

Cyber Force CYBERCOM Pentagon defense budget CSIS NDAA
#42
Audio & Speech 2026-06-03 AK (@_akhaliq) Daily Papers · arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks · Hugging Face Daily Papers 6.5 6.5/6.4/6.6

The Audio Interaction Model reframes large audio LMs from offline, single-task systems (streaming ASR or voice chat) into one online model running an always-on perceive-decide-respond loop that listens to speech, environment, and instructions in real time and decides when to respond from stream semantics. Audio-Interaction retains offline task execution while adding online general audio instruction following, realized via SoundFlow, an end-to-end framework with streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference. Supporting assets include StreamAudio-2M (2.6M items, 7 abilities, 28 sub-tasks) and Proactive-Sound-Bench. Across 8 benchmarks it preserves competitive performance on mainstream tasks while unlocking real-time ASR, streaming instruction following, and proactive intervention.

audio streaming LALM real-time proactive-agents
#43
Generative Media 2026-06-04 Latent Space (swyx & Alessio) 6.5 6.6/6.2/6.7

Reve 2 and Ideogram 4.0 launched the same day, both foregrounding controllable layout and typography in image generation—advances the authors attribute to stronger captioning/labeling and explicit code for composition, a capability once argued to be "AGI-hard." Ideogram 4.0 is billed as the best open image model on the LMArena leaderboard. The AINews recap notes both are US efforts and strong releases, but flags that GPT-Image-2 still leads the arena rankings by a wide margin.

image generation Reve 2 Ideogram 4 layout LMArena
#44
Robotic Autonomy 2026-06-03 NVIDIA AI Blog 6.5 6.5/6.4/6.6

NVIDIA introduced physical-AI "agent skills" at CVPR—modular, agent-callable workflow components meant to stitch together the fragmented pipeline of physical-AI research: scene reconstruction, edge-case scenario generation, policy training, and evaluation. They pair with the newly announced Cosmos 3 world foundation model plus NVIDIA's simulation libraries. The first AV-focused skills include Neural Reconstruction, which rebuilds drivable scenes from fleet data and re-renders them from novel sensor viewpoints to synthesize the rare "long-tail" driving situations that are hard to collect on the road. The skills are published on GitHub (NVIDIA/skills).

NVIDIA physical AI agent skills autonomous vehicles Cosmos 3
#45
Agents & Tool Use 2026-06-03 Perplexity AI 6.5 6.4/6.3/6.8

Perplexity is bringing Personal Computer—its multi-model orchestration agent "Computer"—to Windows, where it can act on local files and native Microsoft apps (Excel, Word, PowerPoint, Outlook) plus the web. Computer dispatches a team of agents per query grounded in cited answers, and with the Comet browser can fill forms, book appointments, and reach 400+ tools (Snowflake, Salesforce, HubSpot); remote handoff lets a task start on phone and finish on the PC. Files run in a sandbox with auditable, reversible actions and pre-action alerts. It rolls out first to paid Max and Enterprise Max users via waitlist, extending last month's Mac release.

Perplexity agents Windows Comet computer use
#46
Industry 2026-06-04 The Information — AI 6.5 6.5/6.3/6.7

Meta is considering charging up to $200 a month for the consumer version of its planned AI agent, currently called Hatch, per The Information, a price that would match top-tier subscriptions from established AI labs. The product could launch with agentic capabilities aimed at knowledge work. The pricing target is aggressive for Meta, whose consumer AI has been free and ad-supported, and signals an attempt to monetize high-capability agents directly rather than via the ad model.

Meta Hatch AI agents pricing subscriptions
#47
AI for Science 2026-06-03 Google AI Blog 6.4 6.3/6.6/6.3

Google Research open-sourced the hydrology modeling framework behind Flood Hub on GitHub—a PyTorch package implementing LSTM-based river-forecast models that ingest geographic features (climate, soils, topography, land cover) and weather forecasts to predict daily river flow. It ships two model versions, including the one now in production; a recent benchmarking study reports the upgraded architecture extends the reliable forecast horizon by six days in gauged basins and one day in ungauged basins. Trained on the open Caravan dataset and tested with the Czech Hydrometeorological Institute, it lets national agencies fine-tune on local data while retaining control of it.

Google flood forecasting open source hydrology LSTM
#48
Industry 2026-06-03 The Information — AI 6.4 6.3/6.2/6.7

In a securities filing, SpaceX set an expected IPO price of $135 per share on 555.6 million shares, which would raise about $75 billion at a $1.77 trillion valuation. That would be more than twice the size of any prior IPO on record. While SpaceX is not an AI company, the raise signals the scale of capital that frontier hardware and infrastructure plays (launch, satellite networks, and compute-adjacent buildout) can now command in public markets.
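
The quoted figures can be checked with a few lines of arithmetic. The sketch below uses only the three numbers reported above; the function name `ipo_summary` is hypothetical, and the implied share count assumes the $1.77 trillion valuation is measured at the offer price, which the summary does not specify.

```python
def ipo_summary(price_per_share: float, shares_sold: int, valuation: float) -> dict:
    """Derive proceeds and float size from the reported IPO terms."""
    proceeds = price_per_share * shares_sold
    return {
        "proceeds": proceeds,
        # Assumption: valuation is taken at the offer price.
        "implied_total_shares": valuation / price_per_share,
        "fraction_of_company_sold": proceeds / valuation,
    }


if __name__ == "__main__":
    summary = ipo_summary(135, 555_600_000, 1.77e12)
    print(f"proceeds: ${summary['proceeds'] / 1e9:.3f}B")                   # $75.006B
    print(f"implied shares: {summary['implied_total_shares'] / 1e9:.2f}B")  # 13.11B
    print(f"share of company sold: {summary['fraction_of_company_sold']:.2%}")  # 4.24%
```

The proceeds come to $75.006 billion, consistent with the "about $75 billion" figure, and under the stated assumption the offering would represent a bit over 4% of the company.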

SpaceX IPO valuation capital markets
#49
Industry 2026-06-03 The Information — AI 6.4 6.4/6.3/6.5

Nvidia acquired Kumo AI, a five-year-old startup selling predictive AI software to enterprises, for more than $400 million, per a person with knowledge of the deal; the purchase was first surfaced by an Nvidia executive in a LinkedIn post. Kumo's models target relational/tabular enterprise data for predictive tasks. The buy expands Nvidia's roster of models that can be optimized for its hardware and offered to enterprises for further customization, extending the company's push beyond chips into a software-and-models stack.

Nvidia Kumo AI acquisition enterprise AI predictive models
#50
Industry 2026-06-03 The Information — AI 6.4 6.3/6.4/6.5

Stripe, Visa, Mastercard and Coinbase plan to form a consortium to issue a new stablecoin, per people familiar with the matter, aiming to challenge Circle (USDC) and Tether (USDT), which together hold roughly 80% of the market. The grouping pairs the two largest card networks with a leading payments processor and a major crypto exchange, giving the prospective coin unusually broad distribution and acceptance reach. The plan reflects incumbents trying to capture stablecoin payment rails rather than cede them.

stablecoin Stripe Visa Mastercard Coinbase payments
#51
Government & Defense 2026-06-03 C4ISRNET 6.4 6.5/6.2/6.5

Lockheed Martin downed a Group 3 one-way attack drone for the first time using a Joint Air-to-Ground Missile (JAGM) fired from its GRIZZLY containerized launcher at Yuma Proving Ground. The June 3 live-fire integrated the Sanctum C-UAS battle manager with Fortem R-40 radars for a full detect-track-engage kill chain, completed in under 45 days. GRIZZLY holds up to eight Hellfire/JAGM missiles in a 10-foot shipping container with toolless reload and wireless links, pitched as low-cost layered point defense for bases and ships; it builds on a March Hellfire vertical-launch test and a $25M April investment in Fortem.

Lockheed Martin counter-UAS GRIZZLY JAGM drones live-fire
#52
Government & Defense 2026-06-03 DefenseScoop 6.4 6.4/6.5/6.3

Senior Pentagon officials are pushing AI and autonomy to address contested logistics, designated by CTO Emil Michael as one of six Critical Technology Areas. Speaking at GDIT's Emerge conference, R&E lead Robert Mantz outlined three focus areas: AI planning tools tied to logistics data lakes that give commanders actionable decision support, autonomous delivery (uncrewed boats/aircraft to complicate enemy targeting and reduce risk), and demand reduction via vehicle hybridization and forward additive manufacturing. Officials also flagged small modular nuclear reactors for base energy resilience and stressed data trust and validity as decisive for commander decision advantage.

Pentagon contested logistics AI planning autonomy SMR supply chain
#53
AI Coding 2026-06-03 Cohere Blog 6.3 6.2/6.4/6.3

Cohere Labs open-sourced co/plot, a data-visualization tool aimed at fast, reproducible research plotting. It targets a gap between Matplotlib (full script reruns for small tweaks) and Figma (polished output but no reliable data ingestion, forcing error-prone manual tracing): co/plot offers prebaked-but-customizable styling that stays faithful to the underlying data. It was built and stress-tested while producing Tiny Aya—scaling to the 70+ languages in that evaluation—and used to style the technical report. Cohere released it under an open-science rationale, and its community researchers are already using it.

Cohere data visualization open source research tools reproducibility
#54
Government & Defense 2026-06-03 War on the Rocks 6.3 6.2/6.6/6.1

War on the Rocks revisits Michael Swaine's 2024 essay "How to Stop the United States and China from Sliding into War," interviewing him after recent Trump-Xi talks. The Q&A probes whether U.S. military involvement in Iran changes his assessment of the risk of major U.S.-China armed conflict and whether it deters Beijing, and reexamines the flashpoints he earlier identified as raising escalation risk. The discussion centers on prospects for U.S.-China strategic stability amid shifting geopolitics.

US-China strategic stability deterrence geopolitics escalation
#55
Safety, Policy & Regulation 2026-06-03 Lawfare (via Google News) 6.2 6.0/6.5/6.1

A Lawfare podcast episode, "Pope Leo XIV Takes on Silicon Valley," features guests Christopher Hale and Renee DiResta discussing the Vatican's posture toward the technology industry under Pope Leo XIV. The conversation examines the Church's engagement with Silicon Valley on issues spanning AI and the social and ethical questions raised by powerful technology platforms. (Source routed via Google News; summary drawn from the episode title and listed participants, as no article body was available.)

Lawfare AI ethics Vatican Renee DiResta technology policy
#56
Generative Media 2026-06-03 Luma AI 5.8 5.6/5.4/6.4

Luma joined Human After All, Webedia-Elephant's new AI Creator Studio launching in Paris, as a creative partner alongside Google and ElevenLabs. At the launch Luma demoed live talent transformed in real time with Ray3.14—the performer stays while the surrounding world changes—and will run summer sessions where creators produce real work targeting tasks that were previously hard: suit-free performance capture, video-to-video that keeps a character consistent across shots, and full-pipeline world generation. The framing keeps the human maker central, with AI extending range; this is a partnership announcement, not a model release.

Luma generative video Ray3 creator studio partnership
Items: 56
Multi-source: 26
Long-form (≥7.5): 4
Sources OK / attempted: 115 / 119
Top category: Industry (9 items)