
Wolf Digest — Wednesday, May 20, 2026

Coverage window: 2026-05-19 03:23 ET to 2026-05-20 03:04 ET
#1 · Frontier LLMs
Google I/O 2026: Gemini 3.5 Flash, Gemini Omni video model, and Gemini Spark agent
8.6 · 8 srcs
#2 · Industry
Andrej Karpathy joins Anthropic's pre-training team
8.2 · 2 srcs
#3 · Industry
Anthropic acquires Stainless to extend agent SDK and MCP server tooling
7.8 · 1 src
#1
Frontier LLMs 2026-05-19 Google DeepMind, Latent Space (swyx & Alessio), Simon Willison's Weblog, Artificial Analysis, TechCrunch — AI, The Information — AI, MIT Technology Review — AI, Gradient Flow (Ben Lorica) 8.6 8.5/8.5/9.0

Google rolled out the bulk of its 2026 AI stack at I/O on May 19, 2026, anchored by three releases. Gemini 3.5 Flash skipped the preview stage and went to general availability with a price increase that brings it above Gemini 2.5 Flash; Google's framing is that 3.5 Flash is now the default workhorse model for everything agentic at Google scale, with sharply better tool-calling and the ability to run as both a non-reasoning and a reasoning model from the same checkpoint. Artificial Analysis's eval shipped same-day: 3.5 Flash sits at 55.3 on the Intelligence Index, the new leader on the intelligence-versus-speed curve, with 278 output tokens per second. Simon Willison flagged that despite the price bump it is still cheap relative to GPT-5.4 and Claude Sonnet 4.6 and the long-context AA-LCR score of 69 percent matches GPT-5.4 mini at a fraction of the cost.

Gemini Omni is the headline multimodal release — an any-to-video model that takes images, text, and audio as input and produces video with synchronized audio. The internal name during the run-up was NanoBanana-for-Video. Google described it as the foundation for a single end-to-end generative-media surface inside the Gemini app, and demos at I/O showed text-prompted music videos, image-conditioned cinematics, and reference-driven scene edits — all returned with on-model audio rather than the dub-on-top pipeline the prior Veo line used. Coverage from TechCrunch and The Information emphasizes that Omni's intended competitor isn't Sora 2 or Runway Gen-4 directly; it's the bundle — a single model behind Search, the consumer app, the editing workspace, and the developer API, all running at the Flash price tier.

Gemini Spark is the third release: a persistent agent that lives across Gmail, Calendar, Drive, Workspace, and Android, surfaces proactive work, and runs background tool calls 24 hours a day with user-set guardrails. TechCrunch's read is that this is Google's most direct answer to ChatGPT's memory plus agent stack and Claude's Computer Use — the differentiator is the Gmail and Calendar context that Google can hand it natively, which OpenAI and Anthropic both have to bolt on. Spark is paired with refreshed Gemini app builds across mobile and web, a Universal Cart shopping integration, and new voice prompting in Docs and Keep — Google is betting visibly on agents over chatbots as the surface for I/O 2026. The downstream Search story is that Google Search as you know it is over: the AI Overviews surface is being expanded into a full agentic mode where the assistant can act inside results pages, place calls, and execute multi-step purchases without bouncing the user back to a search results page. The Information's read is that Google is finally willing to cannibalize search-ad revenue to defend the funnel.

Daniel will care about the eval implication: at Flash-level prices, 3.5 Flash now beats Mistral Medium 3.5 and DeepSeek V3.2 and nearly matches GPT-5.4 mini, which pushes prices at the bottom of the frontier-LLM tier sharply down and squeezes the open-weight cohort that was previously the cheapest path to a 50-plus Intelligence Index score.

How it was discussed
  • Latent Space (swyx & Alessio) framed the day as the most agent-focused I/O ever — emphasis on Spark and the Android CLI, not just Gemini 3.5 Flash.
  • Simon Willison called out that Gemini 3.5 Flash is the new default-everything model at Google, despite a noticeable price bump.
  • Artificial Analysis ran same-day independent evals: 3.5 Flash sits at Intelligence Index 55.3, 278 tokens/sec — new leader on the intelligence-versus-speed curve.
  • TechCrunch's read is that Spark marks Google betting on agents over chatbots for 2026.
  • The Information emphasized the commercial play: Google is willing to cannibalize Search ad revenue with the new AI Search surface and Universal Cart.
  • Gradient Flow (Ben Lorica) called I/O 2026 the moment the agent layer started to take shape across the major labs.
google-io gemini-3-5-flash gemini-omni agents video-generation
#2
Industry 2026-05-19 The Information — AI, TechCrunch — AI 8.2 8.0/8.0/8.5

Andrej Karpathy — OpenAI co-founder, former director of AI at Tesla, and the most-followed ML educator on the open internet — is joining Anthropic's pre-training team. TechCrunch and The Information broke the story independently on May 19, 2026; Anthropic confirmed via an internal note that surfaced same day. Karpathy spent the past two years running Eureka Labs, his AI-native education startup, and continued publishing the from-scratch transformer tutorials that became one of the most-watched ML curricula online. Anthropic's framing is that he is joining a pre-training group that has roughly doubled in headcount since Claude Opus 4 shipped, and that his focus will be on the next-generation training stack — data composition, curriculum, and the early-pre-training synthetic-data pipeline rather than post-training or RLHF.

Karpathy's move matters at three levels. First, it is a public signal in the Anthropic-versus-OpenAI talent race that has been visibly tilting toward Anthropic over the past eighteen months — the Stainless acquisition the same week and the recent departures of three OpenAI alignment leads to Anthropic and Apollo all read as a single trend. The Information's piece notes that Karpathy turned down a competing offer from Google DeepMind and held informal conversations with Mira Murati's Thinking Machines Lab before landing at Anthropic; the deciding factor was reportedly the freedom to publish openly on training methodology, which Anthropic now offers more readily than the major labs that compete in product. Second, Karpathy's prior public writing has been clearly skeptical of the next-token-prediction-only pre-training paradigm — he's argued repeatedly that the data wall and the synthetic-data regime are the real frontier — which suggests Anthropic is making a deliberate bet on someone with strong opinions about the post-data-wall training problem. Third, the educational halo matters: a generation of ML practitioners learned transformers from his nanoGPT walk-throughs, and his presence at Anthropic will plausibly compound recruiting from the same talent pool that previously defaulted to OpenAI.

How it was discussed
  • TechCrunch frames it as the latest in a string of high-profile OpenAI alumni landing at Anthropic.
  • The Information notes Karpathy turned down a competing DeepMind offer; the deciding factor was freedom to publish on training methodology.
  • His prior public writing is skeptical of next-token-only pre-training — suggesting Anthropic is hiring for someone with strong views on the post-data-wall training problem.
anthropic karpathy talent pre-training
#3
Industry 2026-05-18 Anthropic News 7.8 7.8/8.0/7.5

Anthropic acquired Stainless, the SDK-generation company behind every official Anthropic client library since the earliest days of the Claude API. Stainless turns OpenAPI specs into idiomatic SDKs across TypeScript, Python, Go, Java and a dozen other languages, plus the MCP server tooling that has become the connective tissue for agent integrations. Hundreds of API-first companies — OpenAI, Cloudflare, Lithic, Anthropic itself, and many more — already depend on Stainless to ship their SDKs. The acquisition price was not disclosed; Alex Rattray (Stainless founder/CEO) and the full team are joining Anthropic, and Stainless's existing customers continue to be supported as a third-party platform.
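As a rough illustration of what "turning an OpenAPI spec into an SDK" means, the toy sketch below walks a tiny spec-shaped dictionary and emits one Python method per operation. It is not Stainless's implementation: the function name `generate_client_source`, the `transport` callable, and the two example endpoints are all hypothetical, and a real generator also handles types, authentication, pagination, retries, and many languages.

```python
from typing import Any, Dict


def generate_client_source(spec: Dict[str, Any], class_name: str = "Client") -> str:
    """Emit Python source for a client class with one method per operation.

    Hypothetical toy generator: it reads only `paths`, `operationId`, and
    path parameters, which are standard OpenAPI fields.
    """
    lines = [
        f"class {class_name}:",
        "    def __init__(self, transport):",
        "        self._transport = transport",
        "",
    ]
    for path, operations in spec["paths"].items():
        for http_method, operation in operations.items():
            # Path parameters become positional arguments of the method.
            params = [
                p["name"]
                for p in operation.get("parameters", [])
                if p.get("in") == "path"
            ]
            args = "".join(f", {name}" for name in params)
            lines.append(f"    def {operation['operationId']}(self{args}):")
            # The emitted f-string fills path parameters in at call time.
            lines.append(
                f"        return self._transport({http_method.upper()!r}, f{path!r})"
            )
            lines.append("")
    return "\n".join(lines)


# Hypothetical two-endpoint spec, used only to exercise the generator.
spec = {
    "paths": {
        "/users/{user_id}": {
            "get": {
                "operationId": "get_user",
                "parameters": [{"name": "user_id", "in": "path"}],
            }
        },
        "/users": {"get": {"operationId": "list_users"}},
    }
}

namespace: Dict[str, Any] = {}
exec(generate_client_source(spec), namespace)

# A fake transport that echoes the request instead of sending it.
client = namespace["Client"](lambda method, url: (method, url))
print(client.get_user(42))
print(client.list_users())
```

The point of the sketch is the shape of the problem: the spec is the single source of truth, and every client method is derived from it mechanically, which is why the same pipeline can also target MCP servers.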

The strategic logic is that Anthropic created MCP specifically to make Claude an agent-platform target, and the bottleneck to agent reach is the long tail of API surfaces — every SaaS tool, every internal service, every database — that needs a clean SDK or MCP server before Claude can use it. By bringing Stainless in-house, Anthropic gains both the team that already builds those SDKs at scale and the relationships with the SaaS companies whose APIs they wrap. Katelyn Lesse, head of Platform Engineering at Anthropic, framed it as extending what agents can reach. Combined with the same-week KPMG enterprise deal and the earlier Cowork and Claude Code rollouts, the picture is of Anthropic systematically buying the platform layer beneath Claude rather than just selling model access — a clearer platform play than OpenAI's stack has shown over the past year.

anthropic mcp sdk agents acquisition
#4
Industry 2026-05-19 Anthropic News 7.6 7.5/7.7/7.5

Anthropic and KPMG announced a global alliance on May 19, 2026 that puts Claude in front of every one of KPMG's 276,000+ employees worldwide and embeds it inside KPMG's Digital Gateway client-work platform. Digital Gateway is the Azure-based environment where KPMG tax, audit, legal, and advisory professionals run client work — proprietary tools, client data, and KPMG's own AI workflows all sit there. Claude Cowork and Anthropic's Managed Agents now run natively inside Digital Gateway, which means KPMG professionals can spin up client-specific AI agents in minutes instead of the weeks that the prior tool-switching workflow required. Rema Serafi, KPMG US's Vice Chair for Tax, gave the canonical example: building an AI agent to help clients adjust to changing tax regulations used to take weeks of work across multiple tools and chat windows; with the new integration the same agent comes up in minutes.

Two more pieces of the alliance matter. First, KPMG becomes Anthropic's preferred private-equity consultant — meaning that PE firms looking to deploy Claude into their portfolio companies route through KPMG's consulting team and through KPMG Blaze, a new offering that embeds Claude Code to modernize aging IT systems inside portfolio companies. Second, the alliance includes a cybersecurity workstream where joint KPMG-Anthropic teams use Claude to find and fix vulnerabilities in critical client systems, governed by KPMG's existing Trusted AI framework. The deal is one of the largest enterprise rollouts of a frontier model to date by employee count, and it follows directly on the recent PwC deal and the new Anthropic enterprise venture with Blackstone, Hellman and Friedman, and Goldman Sachs — a pattern of Anthropic anchoring Claude in the consulting and PE channels that move enterprise procurement.

anthropic kpmg enterprise claude cowork
#5
Government & Defense 2026-05-19 C4ISRNET, DefenseScoop 7.5 7.5/7.5/7.5

The Department of Defense awarded Perennial Autonomy a five-year, $500 million indefinite-delivery/indefinite-quantity contract for counter-unmanned-aircraft-systems work, according to coverage from C4ISRNET and DefenseScoop on May 19, 2026. The contract is the single largest Pentagon counter-UAS award to date and signals a clear consolidation around AI-driven, autonomous counter-drone systems after eighteen months of smaller pilots and rapid-acquisition pulls. Perennial's system uses a multi-modal sensor fusion stack (radar, electro-optical/infrared, and radio frequency) feeding a learned classifier that distinguishes commercial micro-UAS from one-way attack drones in cluttered environments. The effector layer is reusable and modular: kinetic interceptors, radio-frequency jamming, and high-power microwave can all run on the same command-and-control layer.

The award lands in the same week as the Navy's go-ahead for low-rate production of its drone refueler, Shield AI's selection to provide AI-powered swarming for the LUCAS kamikaze drone program, and the Space-BACN satellite-laser-link program transitioning from DARPA to DIU — all of which fit the broader pattern of the Department of Defense moving rapidly to scale autonomy and counter-autonomy capabilities together. The strategic context is the Ukrainian ground-robot defensive engagement that Defense One covered the same day, where a Ukrainian unmanned ground vehicle held a defensive position against Russian assault for six weeks without crew rotation — a case study that has been circulating in Pentagon procurement circles for several months and is now visibly accelerating counter-UAS spending.

How it was discussed
  • C4ISRNET frames it as the single largest Pentagon counter-UAS award to date and as a signal of consolidation around AI-driven autonomous systems.
  • DefenseScoop's piece emphasizes the IDIQ structure — five years, $500M ceiling — which suggests sustained spend rather than a one-off procurement.
pentagon counter-drone autonomy perennial c-uas
#6
Infrastructure 2026-05-20 The Information — AI 7.2 6.9/5.0/5.4

Alibaba Group’s semiconductor design unit on Wednesday unveiled a new chip that can be used to train and run AI models. The move comes as the Chinese government pushes to accelerate the adoption of homegrown AI chips to reduce the country’s dependence on Nvidia. Alibaba said the new AI chip, ...

#7
AI for Science 2026-05-19 DeepMind 7.2 7.2/5.4/5.4

In an era of information overload, the search for transformative scientific ideas has become a significant bottleneck for progress. Every great scientific breakthrough begins with a single, transformative idea. The spark of discovery relies on a researcher's ability to connect disparate facts and formulate the right hypothesis to test. We believe AI can help dramatically accelerate the pace of breakthroughs by serving as a dedicated partner in the generation and refinement of breakthrough scientific hypotheses. That’s why we’ve developed Co-Scientist, a Gemini-based multi-agent AI system that iteratively generates, debates, and

#9
AI for Science 2026-05-19 DeepMind 7.1 7.2/5.4/5.4

Globally recognized as a silent pandemic, antimicrobial resistance continues to rise as bacteria outpace the development of new antibiotics. When patients stop responding to standard treatments, routine infections can quickly become life-threatening. At the University of Cambridge, Ben Luisi and his team are combining structural biology with advanced AI tools like AlphaFold, Gemini, and Co-Scientist to decode these hidden defense mechanisms. By compressing a process that once took years into just minutes, they are uncovering the critical insights needed to outsmart bacterial evolution. Learn more about science at Google DeepMind:

#10
Infrastructure 2026-05-19 The Information — AI 7.0 6.9/5.0/5.4

Amazon’s yearslong effort to build a serious alternative to Nvidia’s dominant AI chips is starting to gain traction. Anthropic and OpenAI, which have struck multibillion-dollar investment and infrastructure deals with Amazon, have already committed to renting large amounts of current and future Trainium capacity. Now, recent software improvements are prompting smaller developers to consider moving more workloads to Trainium, half a dozen people who use or work with the chips said.

#12
Industry 2026-05-19 OpenAI Research 7.0 7.5/5.5/5.4

OpenAI for Singapore launches a multi-year AI partnership to expand deployment, build local talent, and support businesses and public services with AI.

#13
AI for Science 2026-05-19 Allen Institute for AI (AI2) 7.0 6.9/5.0/5.4

OlmoEarth v1.1 is a more efficient family of remote-sensing models that cuts compute costs by up to 3x while maintaining similar performance, making large-scale satellite mapping faster and cheaper to run.

#15
AI for Science 2026-05-19 DeepMind 7.0 7.2/5.0/5.4

Tropical storms and hurricanes are notoriously volatile, changing structure and intensity in a matter of hours. This unpredictability makes them some of the most challenging weather systems to forecast—putting lives and livelihoods at risk. WeatherNext, our global weather forecasting AI model, successfully predicted the intensity and track of Hurricane Melissa in October 2025. By providing high-confidence signals and advanced notices days before the Category 5 storm made landfall in Jamaica, WeatherNext enabled meteorologists and local authorities to issue life-saving evacuation warnings and protect vulnerable communities.

#17
AI Coding 2026-05-19 TechCrunch — AI 6.9 6.3/5.9/5.4

Google is embracing the rise of AI coding agents with new Android tools designed to work with platforms like Claude Code and OpenAI’s Codex, allowing developers — or their AI assistants — to build Android apps faster from the command line.

#18
Infrastructure 2026-05-19 The Information — AI 6.9 6.9/5.0/5.4

Chipmaker Analog Devices is in advanced talks to buy startup Empower Semiconductor for about $1.5 billion, in a deal that reflects demand for technology that can manage the intense energy needs of AI chips, The Information reported Monday. A deal could be announced as soon as this ...

#20
Government & Defense 2026-05-19 Shield AI 6.9 6.6/5.4/5.4

WASHINGTON (May 19, 2026) — Shield AI, the defense-tech company building state-of-the-art autonomy software and aircraft, today announced that the Office of the Under Secretary of War for Research and Engineering (OUSW R&E) has selected Shield AI to integrate its Hivemind autonomy software onto the Low-Cost Uncrewed Combat Attack System (LUCAS), a new class of low-cost, one-way attack drones often referred to as kamikaze drones designed to operate in large numbers. The LUCAS program, developed by the Office of the Deputy Assistant Secretary of War for Prototyping and

#21
Frontier LLMs 2026-05-19 AK (@_akhaliq) Daily Papers, arXiv — AI for Science, arXiv cs.AI (Artificial Intelligence), arXiv — Evals & Benchmarks, Hugging Face Daily Papers 6.8 7.2/5.3/7.9

Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single-agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi-agent autonomous research pipeline built on five mechanisms: structured multi-agent debate for hypothesis generation and result analysis, a self-healing executor with a Pivot/Refine decision loop that

#22
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.8 7.2/5.3/8.0

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers.

#23
Industry 2026-05-19 The Information — AI 6.8 6.9/5.8/5.4

Some developers have told me that the rising costs of frontier AI models from Anthropic and other firms could prompt them to shift to cheaper open-source AI. After all, when companies as sophisticated as Uber are accidentally blowing through their entire year’s AI budget in a matter of months, it makes sense to cut back by using a less capable open-source model to automate simpler tasks. (In fact, companies like Uber and Airbnb are doing exactly that!) It’s not clear whether open-source AI is good enough to meet the challenge,

#26
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.7 7.2/5.0/8.0

Long-horizon LLM agents leave traces that could become reusable experience, but raw trajectories are noisy and hard to govern. We treat Agent Skills as an experience schema that couples executable scripts with non-executable guidance on procedures. Yet open skill ecosystems contain redundant, uneven, environment-sensitive artifacts, and indiscriminate updates can pollute future context. We present SkillsVote, a lifecycle-governance framework for Agent Skills from collection and recommendation to evolution. SkillsVote profiles a million-scale open-source corpus for environment requirements, quality, and verifiability, then synthesizes tasks for verifiable skills. Before execution, SkillsVote performs agentic

#27
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.6 7.2/5.4/7.3

AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity problem: under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably. Studying developments through April 2026, we present an end-to-end analysis of AI across the complete research lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding &

#28
Frontier LLMs 2026-05-19 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.6 7.2/5.8/6.7

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task

#29
Industry 2026-05-19 The Information — AI 6.6 6.9/5.0/5.4

Google DeepMind agreed to pay between $80 million and $90 million to hire employees from AI agent startup Contextual AI and to license its technology, according to a person with knowledge of the deal. Douwe Kiela, Contextual’s cofounder and CEO, is expected to join DeepMind alongside over ...

#30
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.6 7.2/5.8/6.7

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE underexplored. Enabling such adaptation would directly alleviate inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this

#31
Industry 2026-05-19 Latent Space (swyx & Alessio) 6.6 6.9/6.3/5.4

It is the day before Google I/O, when the next major Gemini releases are expected to be previewed, and it will probably be a quiet week from competitors, though Anthropic and OpenAI both had minor wins today, and Cursor shipped their first SpaceXAI model with some nice detail on synthetic data/reward hacking and continued pretraining with Muon. However the probable lasting title story candidate from today will be Vlad Feinberg’s (understandably Google/TPU centric) notes on job preparation, specifically on Pretraining: Specifically he references last year’s Scaling handbook from

#34
Industry 2026-05-19 The Information — AI 6.5 6.9/5.0/5.4

Microsoft’s GitHub Copilot may have lost much of its early lead in the AI coding race to rivals like Anthropic and Cursor, but Microsoft thinks it has an advantage over those companies: roughly 100,000 software engineers who work for Microsoft. As we reported Monday, Microsoft leaders think they can develop coding models using the wealth of proprietary code those developers have written (and are writing now). It’s the latest example of AI developers using their own employees as a source of AI training data.

#35
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.5 7.2/5.0/7.3

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary

#36
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.5 7.2/5.0/7.3

We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we

#37
Infrastructure 2026-05-19 NVIDIA AI Blog 6.5 6.9/5.0/5.4

At this year’s Google I/O conference, NVIDIA and Google Cloud are accelerating the work of more than 100,000 developers in the companies’ joint developer community, which provides curated learning paths, hands-on labs and events that help them build using the full-stack NVIDIA AI platform on Google Cloud. Launched at Google I/O last year, the community brings together developers, data scientists and machine learning engineers who want to sharpen their AI skills on the latest NVIDIA and Google Cloud technologies. New additions for the community are rolling out this year,

#38
Industry 2026-05-19 Stratechery 6.5 6.6/5.0/5.4

Personal Day — No Update

#39
Industry 2026-05-19 MIT Technology Review — AI 6.5 6.9/5.0/5.4

Elon Musk lost his suit against OpenAI, in which he alleged CEO Sam Altman and President Greg Brockman had deceived him over the company’s non-profit status. Watch as AI reporter and attorney Michelle Kim, who covered the trial for MIT Technology Review, joins in conversation with editor in chief Mat Honan to go behind the scenes of the trial and the implications for the AI race. Speakers: Mat Honan, Editor in Chief, and Michelle Kim, AI Reporter. Recorded on May

#40
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence), arXiv cs.LG (Machine Learning), arXiv — Evals & Benchmarks 6.5 7.2/6.2/6.2

We show that time series foundation models scale: a single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open-weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark. This report describes our experimental results and details the design decisions behind Toto 2.0: its architecture and training recipe, training data, and the u-muP

#41
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.4 7.2/5.3/6.7

Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and

#42
Frontier LLMs 2026-05-11 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.4 7.2/6.1/5.8

Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM
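The two metrics quoted above can move in opposite directions, which is why a model can have a higher MAE and still a slightly higher within-1 accuracy. The sketch below computes both on made-up ordinal scores; the score lists are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average absolute distance between reference and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_k_accuracy(y_true: Sequence[int], y_pred: Sequence[int], k: int = 1) -> float:
    """Fraction of predictions at most k points away from the reference."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) <= k for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical scores on a 0-5 ordinal scale.
reference = [0, 1, 2, 3, 4, 5, 3, 2, 4, 5]
rater_a = [0, 1, 2, 3, 4, 5, 3, 2, 1, 5]  # exact except for one 3-point miss
rater_b = [1, 2, 1, 4, 3, 4, 2, 3, 5, 4]  # always off by exactly one point

for name, scores in (("rater_a", rater_a), ("rater_b", rater_b)):
    mae = mean_absolute_error(reference, scores)
    acc = within_k_accuracy(reference, scores, k=1)
    print(f"{name}: MAE={mae:.2f}, within-1 accuracy={acc:.0%}")
```

Here `rater_a` has the lower MAE but the lower within-1 accuracy, because one large miss counts fully against tolerance-based agreement, while `rater_b` is never exactly right yet never more than one point off.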

#43
Industry 2026-05-19 AI + a16z 6.4 6.0/5.0/5.4

Recorded live at the a16z Fintech Connect conference in Deer Valley, Alex Rampell speaks with Ben Horowitz, cofounder and general partner at a16z, about how AI has rewritten the fundamental rules of software competition, why crypto infrastructure will become essential in an AI-dominated world, and what the future holds for venture capital. Follow Alex Rampell on X: https://twitter.com/arampell Follow Ben Horowitz on X: https://twitter.com/bhorowitz Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts. Please note that the content here is for

#44
Frontier LLMs 2026-05-15 AK (@_akhaliq) Daily Papers, Hugging Face Daily Papers 6.4 7.2/5.3/6.7

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density: decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition: a single task requires the agent to play multiple roles with handoffs; and multilateral interaction: intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce χ-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a

#45
Frontier LLMs 2026-05-19 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.4 7.2/5.3/6.7

Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse or complex conditions, leading to poor performance in professional production workflows such as storyboard sketches and clay render conditions. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) within a diffusion backbone, leaving a capability gap and failing to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation.

#46
Frontier LLMs 2026-05-19 AK (@_akhaliq) Daily PapersarXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)arXiv — Evals & BenchmarksHugging Face Daily Papers 6.4 7.2/5.0/7.0

Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget. To
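The draft-then-verify paradigm mentioned in this excerpt can be illustrated with a toy example. This is a simplified sketch under stated assumptions: it uses a single draft chain with greedy matching rather than the draft trees the paper discusses, the target model is called once per position for clarity (real systems verify all drafted positions in one batched pass), and `draft_next`, `target_next`, and the toy models are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Return the tokens emitted by one draft-then-verify round (greedy variant)."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    context = list(prefix)
    proposal = []
    for _ in range(k):
        token = draft_next(context)
        proposal.append(token)
        context.append(token)

    # Verify phase: keep drafted tokens while the target model agrees.
    context = list(prefix)
    emitted = []
    for token in proposal:
        expected = target_next(context)
        if expected != token:
            emitted.append(expected)  # first mismatch: emit the target's token and stop
            return emitted
        emitted.append(token)
        context.append(token)

    # Every drafted token was accepted, so the target contributes one extra token.
    emitted.append(target_next(context))
    return emitted


# Toy models over digit tokens: the target counts upward, the draft errs after a 3.
def toy_target(context: List[int]) -> int:
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


print(speculative_step([1], toy_draft, toy_target, k=4))
```

Each round emits at least one token that the target model agrees with, so the output matches plain greedy decoding with the target; the speedup depends on how many drafted tokens are accepted per round.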

#47
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.4 7.2/5.3/6.7

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously

#48
Frontier LLMs 2026-05-14 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.4 7.2/5.3/6.7

Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from

#49
Frontier LLMs 2026-05-19 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.4 7.2/5.7/6.2

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions (video, audio, shot, and reference), covering diverse task settings, varying shot counts of

#50
Frontier LLMs 2026-05-19 AK (@_akhaliq) Daily PapersarXiv — Agents / Tool UsearXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)Hugging Face Daily Papers 6.4 7.2/5.0/7.0

Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as

#51
Frontier LLMs 2026-05-15 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.4 7.2/5.3/6.7

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than
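The Beta-Binomial likelihood named in this excerpt has a standard closed form, sketched below. This shows only the textbook likelihood, not BetaPRM's architecture or training loop, which the excerpt does not describe. The function names are hypothetical, and reading the concentration `alpha + beta` as a reliability signal is an assumption made here for illustration.

```python
import math


def log_beta_fn(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_log_likelihood(k: int, n: int, alpha: float, beta: float) -> float:
    """log P(k successes in n continuations | Beta(alpha, beta) belief)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta_fn(k + alpha, n - k + beta) - log_beta_fn(alpha, beta)


def belief_summary(alpha: float, beta: float) -> tuple:
    """Mean success probability and concentration of the Beta belief.

    Treating a larger concentration as a more trustworthy prediction is an
    assumption of this sketch, not a claim about the paper.
    """
    return alpha / (alpha + beta), alpha + beta


# Hypothetical step: 6 of 8 Monte Carlo continuations reached a correct answer.
print(beta_binomial_log_likelihood(k=6, n=8, alpha=3.0, beta=1.0))
print(belief_summary(3.0, 1.0))
```

Minimizing the negative of this log-likelihood over the predicted `alpha` and `beta` would fit both the expected step-success rate and how spread out the belief is, rather than a single point score.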

#52
Industry 2026-05-20 The Information — AI 6.4 6.9/5.0/5.4

OpenAI cofounder and CEO Sam Altman late Tuesday offered to invest $2 million in every startup currently in the Y Combinator startup accelerator program—not in cash, but in OpenAI tokens. “I am excited to see what will happen with tokenmaxxing startups, both for how they work internally and the ...

#53
Industry 2026-05-19 The Information — AI 6.3 6.9/5.5/5.4

SpaceX’s initial public offering prospectus will read like a kaleidoscope when it’s unveiled publicly in the coming days. There will be a dazzling mixture of storylines and figures that show huge financial losses, exciting promises, impressive growth and unmet expectations. Information in the prospectus will help investors knit together parts of the company that feel disconnected—rocket launches, satellite internet, social media, AI models, data centers, defense contracting, Mars voyages—ahead of the largest and most audacious IPO ever.

#54
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3 7.2/5.0/6.7

Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. While recent MLLM-based approaches have shown great potential for 3D room synthesis from textual descriptions or reference images, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite looping when tasked with holistic room generation from top-down views. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework equipped with a structured execution harness, which represents 3D rooms

#55
Frontier LLMs 2026-05-19 AK (@_akhaliq) Daily PapersarXiv — Agents / Tool UsearXiv cs.AI (Artificial Intelligence)Hugging Face Daily Papers 6.3 7.2/5.3/6.3

Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then

#56
Safety, Policy & Regulation 2026-05-19 a16z AI Policy Brief 6.3 6.3/5.0/5.4

For AI startups, the policy landscape is expanding faster than most small teams can reasonably track. This creates a practical challenge for Little Tech: even when a startup wants to engage constructively, it may not have the resources to follow every debate in every jurisdiction. Ben Supple, head of global policy at ElevenLabs, joins Matt Perault to talk about his experience running a public policy function at a company that is scaling rapidly. ElevenLabs is a leader in voice AI, building products for creators, enterprises, and governments, while its public

#57
Industry 2026-05-20 The Information — AI 6.3 6.9/5.0/5.4

SpaceX tapped Goldman Sachs for the top role on its initial public offering next month, according to the Wall Street Journal. The “lead left” position would put Goldman in the driver’s seat for what will likely be the largest IPO ever. Morgan Stanley, Bank of America, Citi and JPMorgan are the ...

#58
Frontier LLMs 2026-05-17 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3 7.2/5.0/6.7

The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction -- reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for

#59
Frontier LLMs 2026-05-15 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3 7.2/5.4/6.2

The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack inherits that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU

#61
Frontier LLMs 2026-05-17 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3 7.2/5.0/6.7

Recent progress in formal theorem proving has benefited from large-scale proof generation and verifier-aware training, but agentic proving is rarely integrated into prover training, appearing only at inference time. We present OProver, a unified framework for agentic formal theorem proving in Lean 4, in which failed proof attempts are iteratively revised using retrieved compiler-verified proofs and Lean compiler feedback. OProver is trained through continued pretraining followed by iterative post-training: each iteration runs agentic proving, indexes newly verified proofs into OProofs and the retrieval memory, uses repair trajectories as SFT

#62
Frontier LLMs 2026-05-19 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3 7.2/5.0/6.7

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers,

#63
Generative Media 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)arXiv — Evals & BenchmarksarXiv — Generative Media / Diffusion 6.3 7.2/5.0/6.6

Diffusion and flow-based generative models dominate visual synthesis, with guidance aligning samples to user input and improving perceptual quality. However, Classifier-Free Guidance (CFG) and extrapolation-based methods are heuristic linear combinations of velocities/scores that ignore the generative manifold geometry, breaking probability conservation and driving samples off the learned manifold under strong guidance. We analyse guidance through the continuity equation and show its effect decomposes into a divergence term and a score-parallel term defined invariantly across parameterisations. We prove the divergence term blows up structurally as sampling approaches the data manifold, motivating

#65
Industry 2026-05-19 The Information — AI 6.3 6.9/5.0/5.4

SpaceX’s initial public offering, expected to be the largest in history, is also set to yield the largest venture returns ever at the time of the IPO. Among those set to win big are Founders Fund, Valor Equity Partners and Sequoia Capital. Valor Equity Partners, whose founder Antonio Gracias is on the SpaceX board and has backed several of Musk’s endeavors, owns about 4% of SpaceX, via the almost $6 billion it spent investing in X, xAI and SpaceX, which are now all part of the same company. If, as

#66
Industry 2026-05-19 The Information — AI 6.3 6.9/5.0/5.4

SpaceX and Cursor expect to proceed with their planned acquisition 30 days after SpaceX begins trading publicly, according to someone familiar with the matter. SpaceX is expected to go public in mid-June in the largest IPO in U.S. history. The Elon Musk-founded rockets-and-AI company announced ...

#67
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3 7.2/5.4/6.2

It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise

#68
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3 7.2/5.8/5.8

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates

#69
Safety, Policy & Regulation 2026-05-19 MIT Technology Review — AI 6.3 6.9/5.0/5.4

Throughout 2025, HPE observed significant changes in how cybercriminals operate. Analyzing real-world threats, our HPE Threat Labs highlighted an industrialization of cybercriminals’ methods in its new In the Wild Report, enabling greater scale, speed and structure in their campaigns. They typically use automation and AI to exploit longstanding vulnerabilities, and many have adopted a professional, corporate hierarchy to optimize their efficiency. Cybersecurity threats today are as menacing as ever for enterprises, as any CISO or CIO can probably confirm. But, digging behind that straightforward statement, there is a

#70
Frontier LLMs 2026-05-13 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.3 7.2/5.4/6.2

Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information, rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear (falsely) audio-grounded, but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically

#72
Frontier LLMs 2026-05-17 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2 7.2/5.6/5.8

Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, limiting their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A2RBench, encompassing generation, expansion, evaluation, and analysis. Specifically, in the generation stage, LLMs create diverse tasks demanding genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand new input spaces to generate task variations, achieving scaling.

#73
Frontier LLMs 2026-05-17 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2 7.2/5.5/5.8

The deployment of Large Language Models (LLMs) as autonomous economic agents introduces systemic risks that extend beyond individual capability failures. As agents transition to directly interacting with marketplaces, their collective behavior can amplify volatility and mask deception at scale. We introduce the Agent Bazaar, a multi-agent simulation framework for evaluating Economic Alignment, the capacity of agentic systems to preserve market stability and integrity. We identify two failure modes: (1) Algorithmic Instability in a B2C market ("The Crash"), where firms amplify price volatility until the market collapses, and (2) Sybil Deception

#74
Frontier LLMs 2026-05-15 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2 7.2/5.3/6.2

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented

#75
Frontier LLMs 2026-05-19 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2 7.2/5.3/6.2

When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look

#76
AI for Science 2026-05-19 MIT Technology Review — AI 6.2 6.9/5.0/5.4

The baby chicks were shifting and starting to pip—or trying to hatch. But not from an egg.  Instead, these chickens were growing inside transparent 3D-printed plastic cups at the Dallas headquarters of Colossal Biosciences. The biotech company today claimed it has developed a “fully artificial egg” as part of its effort to resurrect extinct avian species, including birds like the dodo and the giant moa. But “artificial eggshell” would probably be a better description for the invention. It’s an oval-shaped printed lattice, coated inside with a special silicone-based membrane that

#77
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)arXiv — Evals & Benchmarks 6.2 7.2/5.3/6.2

Optimization-based adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening static and windowed perplexity-based detectors. We cast adversarial suffix detection as an online change-point detection problem over the token-level next-token entropy stream. Using the LLM system prompt to estimate a robust baseline, we standardize user-token entropies and apply a one-sided CUSUM statistic. The resulting detector, CPD Online (CPD), is model-agnostic, training-free, runs online, and localizes the adversarial suffix onset. On a benchmark of 1,012 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign
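The pipeline described in this excerpt (baseline from the system prompt, standardized user-token entropies, one-sided CUSUM) can be sketched as follows. This is an illustrative reconstruction, not the authors' detector: the median/MAD baseline, the `drift` and `threshold` values, the assumption that the suffix shifts entropy upward, and the toy entropy values are all assumptions made here.

```python
from statistics import median
from typing import List, Optional, Tuple


def robust_baseline(system_entropies: List[float]) -> Tuple[float, float]:
    """Center and scale from system-prompt entropies (median and scaled MAD, assumed)."""
    center = median(system_entropies)
    mad = median([abs(h - center) for h in system_entropies])
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation under normality
    return center, (scale if scale > 0 else 1.0)


def cusum_onset(user_entropies: List[float], baseline: Tuple[float, float],
                drift: float = 0.5, threshold: float = 4.0) -> Optional[Tuple[int, int]]:
    """One-sided (upward) CUSUM over standardized entropies.

    Returns (alarm_index, estimated_onset_index), or None if no alarm fires.
    """
    center, scale = baseline
    stat = 0.0
    onset = None
    for i, h in enumerate(user_entropies):
        z = (h - center) / scale
        new_stat = max(0.0, stat + z - drift)
        if new_stat == 0.0:
            onset = None   # statistic reset: no excursion in progress
        elif stat == 0.0:
            onset = i      # statistic just left zero: candidate change point
        stat = new_stat
        if stat > threshold:
            return i, onset
    return None


# Toy next-token entropies (hypothetical values, not measurements from any model).
system_prompt = [2.0, 2.1, 1.9, 2.2, 2.0, 1.8, 2.1]
user_tokens = [2.0, 2.1, 1.9, 2.0, 3.0, 3.2, 3.1]

print(cusum_onset(user_tokens, robust_baseline(system_prompt)))
```

Because the statistic is updated one token at a time and only needs the running sum, the check runs online, and the index at which the sum last left zero gives an estimate of where the suffix begins.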

#78
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2 7.2/5.7/5.8

While agentic AI and its core multimodal large language models (MLLMs) have demonstrated remarkable promise in language and visual reasoning across domains ranging from daily life to advanced scientific research, a profound gap remains between artificial and human intelligence. Despite the integration of powerful tools and advanced MLLMs, state-of-the-art AI agents frequently fail at foundational, seemingly simple tasks that a child can resolve with ease. Inspired by the Wechsler Intelligence Scale for Children (WISC), we introduce ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in MLLM-based

#79
Frontier LLMs 2026-05-14 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2 7.2/5.6/5.8

Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet

#80
Frontier LLMs 2026-05-17 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2 7.2/5.3/6.2

Coding agents can generate web applications from natural-language descriptions, yet a recent benchmark study shows that generated applications fail to meet functional requirements in over 70% of cases. The core difficulty is that web correctness cannot be assessed from source files or terminal output: the application must be deployed, exercised through simulated browser interactions, and failures must be translated into actionable repair signals -- steps that current agents cannot perform without human mediation. We present TDDev, a framework that automates this closed loop through three stages: (1) converting high-level requirements

#82
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)arXiv — Evals & Benchmarks 6.2 7.2/5.3/6.2

Benchmark datasets are critical for reproducible, reliable, and discriminative evaluation of LLMs. However, recent studies reveal that many benchmark datasets are included in pretraining corpora, i.e., contaminated, which diminishes their value as reliable measures of model generalization. In this paper, we argue that benchmark datasets should be contamination-resistant, i.e., unlearnable, but support inference. To accomplish this, we first highlight the wide prevalence of benchmark dataset contamination and outline the properties of contamination-resistant datasets. Second, we highlight how the asymmetry between the inference and training pipelines in the Transformer architecture can

#83
Frontier LLMs 2026-04-03 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2 7.2/5.3/6.2

Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. OmniGUI provides continuous, interleaved multimodal inputs comprising static images, synchronous audio, and video clips at every action step. The dataset encompasses 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated

#84
Government & Defense 2026-05-19 DefenseScoop 6.2 6.9/5.0/5.4

Gurpartap “GP” Sandhoo has officially been named as director of the Space Development Agency and the new portfolio acquisition executive for Space Force’s missile warning and tracking programs, the agency announced Tuesday. Previously SDA’s deputy director, Sandhoo has been leading the agency as acting director since September 2025, when Derek Tournear stepped down from the role. In the last few months, Sandhoo has overseen SDA begin the highly anticipated launch campaign of its foundational program, the Proliferated Warfighter Space Architecture (PWSA). The announcement that Sandhoo will

#85
Frontier LLMs 2026-05-17 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2 7.2/5.3/6.2

Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence: they may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically

#86
Industry 2026-05-19 Gradient Flow (Ben Lorica) 6.2 6.3/5.0/5.4

Integration Is the New Moat: Moving Beyond the LLM
The AI Agent Conference in New York was one of the better events I’ve attended to get a read on what’s actually happening with enterprise AI. The formal sessions were great, but the hallway conversations were where I got the inside scoop. The consistent message: deploying AI agents is much harder than most organizations expect, and the reasons are rarely the ones they anticipate. What follows is my attempt to distill what I heard into a practical

#87
Frontier LLMs 2026-05-16 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2 7.2/5.6/5.8

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service

#88
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.2 7.2/5.5/5.8

Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline

#89
Post-Training 2026-05-19 arXiv cs.LG (Machine Learning)arXiv — Post-training / Alignment 6.2 7.2/5.5/5.8

The proliferation of generative artificial intelligence has given rise to an interactive learning environment, where model parameters are continuously updated using not only data generated by natural processes, but also synthetic outputs produced by other models. This paradigm introduces two major challenges: (1) training data are no longer drawn exclusively from the target population, undermining a core assumption of classical statistical learning, and (2) model training processes become inherently correlated, as models interact with one another through repeated exposure to each other's synthetic outputs in a potentially complex manner. Establishing

#90
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning) 6.1 7.2/5.3/5.8

While empirical scaling laws for LLM reasoning are well-documented, the theoretical mechanisms governing out-of-distribution (OOD) generalization remain elusive. We formalize reasoning via optimal transport, projecting discrete trajectories into a continuous metric space to quantify domain shifts using the Wasserstein-1 distance. Invoking Kantorovich duality, we bound OOD generalization via architectural Lipschitz continuity and functional approximation limits. This exposes two primary constraints. First, position-dependent attention (e.g., Absolute Positional Encoding) fails to preserve shift invariance, yielding an $\Omega(1)$ Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., Rotary Embeddings) preserve equivariance and bound
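For reference, the duality invoked in this excerpt is commonly stated as below, together with the elementary consequence that links a Lipschitz constant to a shift between distributions. This is the textbook Kantorovich-Rubinstein form, not the paper's specific bound; the symbols $\mu$, $\nu$ (in-distribution and shifted measures), $g$, and $L$ are notation assumed here.

```latex
W_1(\mu, \nu)
  \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1}
        \Big( \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)] \Big)
\quad\Longrightarrow\quad
\big| \mathbb{E}_{\mu}[g] - \mathbb{E}_{\nu}[g] \big|
  \;\le\; L \, W_1(\mu, \nu)
\quad \text{for any } L\text{-Lipschitz } g .
```

The implication follows by applying the supremum to $g / L$, which is 1-Lipschitz. It shows why a smaller Lipschitz constant gives a tighter guarantee under a given Wasserstein-1 shift.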

#91
Frontier LLMs 2026-05-16 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.3/5.8

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent

#92
Frontier LLMs 2026-05-12 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.3/5.8

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive

#93
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.0/6.2

Recent video editing models have converged on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, which real requests often omit. We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent maps a

#94
Evaluations & Benchmarks 2026-05-19 arXiv cs.LG (Machine Learning)arXiv — Evals & BenchmarksarXiv — Robotic Autonomy / Embodied AI 6.1 7.2/5.0/6.2

Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution. However, current embodied AI benchmarks collapse these capacities into binary success rates, systematically inflating reported capabilities by up to 70% and masking the architectural bottlenecks that impede real-world deployment. We introduce MetaFine, a diagnostic meta-evaluation framework that disentangles manipulation competency along three axes: understanding, perception, and controlled behavior. Built on a compositional task graph, MetaFine absorbs heterogeneous external benchmarks and

#95
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)arXiv — Evals & Benchmarks 6.1 7.2/5.0/6.2

Artificial vision models are often evaluated against the human visual cortex by measuring how accurately their internal representations predict brain responses. However, prediction accuracy alone does not indicate which dimensions of the target brain's response space are recovered. Here, we introduce a unified framework for evaluating both model-brain and brain-brain alignment by identifying the response dimensions recovered by prediction. Using repeated fMRI measurements, we first identify target-brain response dimensions that can be reproducibly predicted across independent trial splits. We then predict target-brain responses from either another subject's brain responses or

#96
Efficiency 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)arXiv — Efficiency (Quantization, MoE, Inference) 6.1 7.2/5.0/6.2

Vector quantization is a fundamental primitive for scalable machine learning systems, enabling memory-efficient storage, fast retrieval, and compressed inference. Recent rotation-based quantizers such as EDEN, RabitQ, and TurboQuant have introduced strong guarantees and empirical performance, but the surrounding comparisons have been difficult to interpret because they rely on different distortion criteria, probability regimes, and implementation assumptions. As our first contribution, we provide a unified theoretical comparison of these methods and show that their relative advantages are criterion-dependent rather than absolute: EDEN and TurboQuant are favorable for MSE distortion, EDEN is

#97
Frontier LLMs 2026-05-12 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.3/5.8

Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle

#98
Frontier LLMs 2026-05-16 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.0/6.2

Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection,

#99
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.3/5.8

Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision

#100
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.3/5.8

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen: occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson,

#101
Frontier LLMs 2026-05-14 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.0/6.2

Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional

#102
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)arXiv — Evals & Benchmarks 6.1 7.2/5.0/6.2

We introduce Contrastive FUSE, a fast and unified framework for scalable node representation learning in graphs with partially available pairwise node labels and no available node features. Unlike existing methods, we directly optimize a spectral contrastive objective that integrates community-aware structural signals with signed pairwise constraints. To support large-scale training, we replace the expensive modularity gradient with a lightweight approximation, which preserves the structure-seeking behavior of modularity while reducing the computational cost significantly. This yields an efficient optimization scheme with a natural gradient decomposition and adaptive learning-rate scaling, enabling fast

#103
Reinforcement Learning 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv — Evals & BenchmarksarXiv — Reinforcement Learning 6.1 6.9/5.3/6.2

Geospatial reasoning requires solving image-grounded problems over the complex spatial structure of a scene. However, developing this capability is hindered by the cost of annotating a vast and combinatorial question space. We propose GeoX, a self-play framework that acquires spatial logic through executable programs that yield verifiable rewards, without relying on large-scale human-curated data. Given a satellite or aerial image, our framework employs a single multimodal policy that proposes spatial problems as executable programs and solves them under three reasoning modes (abduction, deduction, and induction) over spatial primitives and an image understanding

#104
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)arXiv — Evals & Benchmarks 6.1 7.2/5.0/6.2

Discovering shapelets (i.e., discriminative temporal patterns within time series) has been widely studied to address the inherent complexity of time-series classification (TSC) and to make model decision-making processes more transparent. However, existing methods primarily focus on population-level shapelets optimized across the entire dataset, which leads to two fundamental limitations: (i) population-level patterns often misalign with instance-specific features, resulting in suboptimal performance and potentially misleading interpretations, and (ii) most methods treat shapelets as independent entities, overlooking important temporal dependencies and interactions among multiple patterns. To address these limitations, we

#105
Reinforcement Learning 2026-05-19 arXiv cs.LG (Machine Learning)arXiv — Evals & BenchmarksarXiv — Reinforcement Learning 6.1 7.2/5.0/6.2

The progression of reinforcement learning algorithms has been driven by challenging benchmarks. The rate at which a researcher can iterate on a problem setting directly impacts the speed of algorithm development. Modern machine learning has produced tools that allow for fast and scalable algorithm development, like the JAX library. With the availability of these tools, a serious bottleneck in algorithm development is the availability of large and complex domains for experimentation. Most notably, the JAX reinforcement learning ecosystem does not have any benchmarks that test visual first-person tasks; these domains

#106
Post-Training 2026-05-19 arXiv cs.LG (Machine Learning)arXiv — Post-training / AlignmentarXiv — Reinforcement Learning 6.1 7.2/5.0/6.2

Infinite-dimensional orthonormal basis expansions play a central role in representing and computing with function spaces due to their favorable linear algebraic properties. However, common bases such as Fourier or wavelets are fixed and do not adapt to the structure of a given problem or dataset. In this paper, we aim to represent these bases with neural networks and optimize them. Our key idea is that any target infinite-dimensional orthonormal basis can be viewed either as a point on the Lie manifold of the orthogonal group, or equivalently, as the endpoint

#107
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)arXiv — Evals & Benchmarks 6.1 7.2/5.0/6.2

Neural policies have shown promise in solving vehicle routing problems due to their reduced reliance on handcrafted heuristics. However, current training paradigms suffer from a fundamental limitation: they primarily focus on next-node prediction for solution construction, resulting in myopic decision-making that undermines long-horizon planning capacity. To this end, we introduce Multi-node Lookahead Prediction (MnLP), a novel training strategy that extends the supervised learning paradigm to predict multiple future nodes simultaneously. We incorporate causal and discardable MnLP modules that operate exclusively during training, facilitating models to anticipate multi-step decisions while preserving

#108
Frontier LLMs 2026-05-16 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.3/5.8

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert

#109
Frontier LLMs 2026-05-13 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.0/6.2

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by a human or LLM judge, and mostly covers cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require

#110
Frontier LLMs 2026-05-19 arXiv cs.LG (Machine Learning)arXiv — Evals & Benchmarks 6.1 7.2/5.3/5.8

Advanced image editing software enables easy creation of highly convincing image manipulations, which has been made even more accessible in recent years due to advances in generative AI. Manipulated images, while often harmless, could spread misinformation, create false narratives, and influence people's opinions on important issues. Despite this growing threat, there is limited research on detecting advanced manipulations across different visual domains. Thus, we introduce Analysis Under Domain-shifts, qualIty, Type, and Size (AUDITS), a comprehensive benchmark designed for studying axes of analysis in image manipulation detection. AUDITS comprises over 530K

#111
Frontier LLMs 2026-05-16 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.3/5.8

Recent studies introduce conditional memory modules that decouple knowledge storage from neural computation, enabling more direct knowledge access. Compared to MoE, which relies on dynamic computation paths, explicit lookup provides a more efficient knowledge retrieval mechanism. However, these approaches still depend on learned memory embeddings, requiring additional training and limiting flexibility. To address this, we propose N-gram Memory (NGM), a training-free, plug-and-play module composed of a Causal N-Gram Encoder and a Cosine-Gated Memory Injector. The Causal N-Gram Encoder directly averages the pretrained token embeddings of the backbone model to construct

#112
Government & Defense 2026-05-19 DefenseScoop 6.1 6.9/5.0/5.4

Navy leadership’s recent decision to make the future Trump-class battleship nuclear-powered introduced a new twist in the saga of one of the service’s most controversial programs. President Donald Trump unveiled his vision for the platform in December when officials shared their desires to arm the vessel with a variety of high-tech weapons such as lasers, railguns, hypersonic missiles and nukes. The Pentagon plans to spend more than $17 billion on the lead ship in the class, according to budget documents released last month. Earlier this year, Chief of Naval Operations Adm.

#113
Frontier LLMs 2026-05-17 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.3/5.8

Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. While recent work has started to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark

#114
Research 2026-05-19 arXiv — Agents / Tool UsearXiv cs.AI (Artificial Intelligence)arXiv — Mechanistic Interpretability 6.1 7.2/5.0/6.2

Large Language Models are increasingly proposed as cognitive components for robotic systems, yet their opaque decision processes make it difficult to explain success or failure in closed-loop embodied tasks. Following an empirical AI methodology, we study embodied LLM agents behaviorally by varying the information available to the agent and measuring the resulting changes in behavior. Using the Lockbox, a sequential mechanical puzzle with hidden interdependencies, we evaluate LLMs across RGB, RGB-D, and ground-truth symbolic observations in a physical robotic setup and use controlled simulation to probe the resulting behavior. Counterintuitively,

#115
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.6/5.5

Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics,

#116
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)arXiv — Evals & Benchmarks 6.1 7.2/5.0/6.2

Training very deep neural networks requires controlling the propagation of magnitudes across depth. Without such control, activations and gradients may vanish, explode, or enter unstable regimes that make optimization fail. Modern architectures often mitigate this problem through Batch Normalization, residual connections, or other normalization layers, which repeatedly re-scale or bypass intermediate representations. However, these mechanisms are not always appropriate. In Physics-Informed Neural Networks (PINNs), the network represents a continuous physical field and its input derivatives define the training objective, making batch-dependent normalization problematic because it can introduce non-local dependencies into

#117
Generative Media 2026-05-19 arXiv cs.LG (Machine Learning)arXiv — Evals & BenchmarksarXiv — Generative Media / Diffusion 6.1 7.2/5.0/6.2

Standard generative models struggle with heavy-tailed data: Lipschitz architectures cannot produce power-law tails from Gaussian noise, and interpolating between heavy-tailed data and Gaussians is ill-posed. We propose a simple fix: apply the soft-log transform $φ(x) = \mathrm{sign}(x) \cdot \log(1 + |x|)$ coordinate-wise to data before training, then exponentiate samples after generation. A Hill diagnostic decides per-coordinate whether to transform, leaving light-tailed margins untouched at no added complexity. This compresses heavy tails into a range where standard flow matching succeeds, without heavy-tailed base distributions or architectural modifications. We provide theoretical intuition
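The transform and its inverse can be sketched in a few lines. The forward map is the formula quoted above. The inverse, sign(y) · (exp(|y|) − 1), is derived here from that formula rather than quoted from the paper, and the sample values are hypothetical. The Hill diagnostic that decides which coordinates to transform is not shown.

```python
import math


def soft_log(x):
    """phi(x) = sign(x) * log(1 + |x|), applied to one coordinate."""
    return math.copysign(math.log1p(abs(x)), x)


def soft_log_inverse(y):
    """Inverse of phi: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


# Hypothetical heavy-tailed coordinate values spanning many orders of magnitude.
heavy_tailed = [-1e6, -3.0, 0.0, 0.5, 2e4]

# Before training: compress the tails into a narrow range.
compressed = [soft_log(v) for v in heavy_tailed]

# After generation: map samples back to the original scale.
recovered = [soft_log_inverse(v) for v in compressed]

print(compressed)
print(recovered)
```

Because the map is odd, monotone, and close to the identity near zero, small values are barely changed while extreme values are pulled in logarithmically.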

#118
Efficiency 2026-05-19 arXiv cs.LG (Machine Learning)arXiv — Efficiency (Quantization, MoE, Inference) 6.1 7.2/5.3/5.8

Distillation transfers knowledge from a large model trained on broad data to a smaller, more efficient model suitable for deployment. In structured prediction settings, prior knowledge about the task can guide the choice of a target architecture that is algorithmically aligned with the underlying problem. Building on recent learning-theoretic analyses of decision-tree (DT) distillation (Boix-Adsera, 2024), we study when distillation succeeds for combinatorial optimization tasks. We focus on the case where the target model is a graph neural network whose architecture is aligned with a dynamic programming (DP) algorithm for

#119
Post-Training 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv — Post-training / AlignmentarXiv — Reinforcement Learning 6.1 6.9/5.3/6.2

The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts,
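For readers unfamiliar with preference-pair training, the sketch below shows the standard Direct Preference Optimization (DPO) loss for one pair of rollouts. It assumes VL-DPO uses this standard objective, which the excerpt does not confirm. The function name, the `beta` value, and the log-probabilities are hypothetical.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    loss = -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))
    where logp_* come from the model being tuned and ref_* from the
    frozen pretrained reference model.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical trajectory log-probabilities for a preferred and a rejected rollout.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)   # policy identical to the reference
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy already favors the chosen rollout
print(neutral, aligned)
```

The loss falls as the tuned model raises the likelihood of the preferred rollout relative to the reference, so the preference labels (here assumed to come from the VLM) steer the forecasting model without an explicit reward model.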

#120
Frontier LLMs 2026-05-14 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.3/5.8

Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone,

#121
Frontier LLMs 2026-05-15 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.3/5.8

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel

#122
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)arXiv — Evals & Benchmarks 6.1 7.2/5.0/6.2

Recent work pairs LLMs with evolutionary search to iteratively generate, modify, and select code using task-specific feedback. These systems have produced strong results in mathematical discovery and algorithm design, yet a fundamental question remains: what do they actually evolve? Progress is typically summarized by the best score a run reaches under a task-specific evaluator, but that score can reflect several different mechanisms: new algorithmic structure, re-tuning an existing strategy, recombining ideas already in the model's internal knowledge, or overfitting to the evaluator. Distinguishing these mechanisms requires inspecting the search process

#123
Reinforcement Learning 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)arXiv — Reinforcement Learning 6.1 7.2/5.0/6.2

Reward-poisoning attacks present a significant risk to learning-based wireless control systems. Given this, we propose a Disagreement-Guided Reward Poisoning (DGRP) adaptive attack on a Soft Actor-Critic (SAC) agent. In a Cognitive Radio Network (CRN) environment assisted by Reconfigurable Intelligent Surfaces (RIS), the SAC agent is tasked with maximizing the long-term secondary users' (SUs) rate by simultaneously optimizing the transmission power of the SU transmitter and the RIS phase shifts. DGRP corrupts rewards, particularly when the SAC dual critics exhibit substantial disagreement, especially in high-leverage, high-uncertainty states, resulting in distorted value estimations and

#124
Frontier LLMs 2026-05-14 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.1 7.2/5.0/6.2

Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the

#125
Frontier LLMs 2026-05-19 Anthropic News 6.1 7.5/5.4/5.4

At Anthropic, we want to build AI systems that advance humanity and act for the global good. Over the past several months, we've been organizing dialogues with groups whose work and traditions bear on the questions raised by AI. Our first round of discussions has been with wisdom traditions—including scholars, clergy, philosophers, and ethicists from more than 15 religious and cross-cultural groups. We are thinking carefully about what a flourishing future could look like in a world of powerful AI, what it means for an AI system that interacts with

#126
Research 2026-05-19 arXiv — Agents / Tool UsearXiv cs.AI (Artificial Intelligence)arXiv — Evals & Benchmarks 6.0 6.9/5.0/6.2

Documentation has long guided computer system tuning by distilling expert knowledge into per-parameter recommendations. Yet such guides capture only what experts conclude, discarding how they reason. This fundamental gap manifests in three concrete deficiencies: documentation grows stale as software evolves, fails under heterogeneous workloads, and ignores inter-parameter dependencies. We propose shifting from static documentation to dynamic action for system tuning. We introduce PerfEvolve, which translates expert tuning methodologies into executable skills that equip LLM-based agents to perform version-consistency verification, workload-specific profiling, and multi-parameter joint optimization. Evaluated on PostgreSQL under TPC-C

#127
Post-Training 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv — Generative Media / DiffusionarXiv — Post-training / Alignment 6.0 6.9/5.0/6.2

Concept-based Explainable Artificial Intelligence (XAI) interprets deep learning models using human-understandable visual features (e.g., textures or object parts) by linking internal representations to class predictions, thereby bridging the gap between low-level image data and high-level semantics. A major challenge, however, is the reliance on large sets of labeled images to represent each concept, which limits scalability. In this work, we investigate the use of zero-shot Text-to-Image (T2I) generative models as a source of synthetic concept datasets for concept-based XAI methods. Specifically, we generate concepts using predefined prompts and evaluate their

#128
Frontier LLMs 2026-05-11 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.3/5.5

We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action

#129
Government & Defense 2026-05-19 FedScoop — AI 6.0 6.6/5.0/5.4

The Department of Labor is asking Congress for additional funding to strengthen its identity verification systems, part of what the agency’s head said is a multipronged effort to crack down on improper payments. Appearing Tuesday before a Senate Appropriations subcommittee, acting Secretary Keith Sonderling told lawmakers that the Labor Department is focused on working with states to ensure they have “proper identity verification systems to make sure that not a single dollar goes out” that shouldn’t. The DOL’s fiscal 2027 budget request seeks $2.8 billion in state grants to

#130
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with an emphasis on modeling the physical world. Within the scope of physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic

#132
Government & Defense 2026-05-19 War on the Rocks 6.0 6.6/5.0/5.4

From May 14 to 15, U.S. President Donald Trump held a summit in Beijing with Chinese leader Xi Jinping. In addition to pageantry, the summit featured discussions about Iran and the Strait of Hormuz, Taiwan, and bilateral trade. Both Washington and Beijing emphasized a relationship based on “constructive strategic stability.” Many countries, particularly those in Asia, were watching closely to see how the two leaders got along, what they agreed on, and what divided them. We asked four experts to tell us about the reactions in Japan, South Korea, Taiwan, and

#133
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose AtlasVA, a teacher-free visual skill

#134
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning) 6.0 7.2/5.0/5.8

Learning universal representations from electroencephalogram (EEG) signals is a cutting-edge approach in the field of neuroinformatics and brain-computer interfaces (BCIs). Conventionally, EEG is treated as a multivariate temporal signal, where time- or frequency-domain features are extracted for representation learning. This paper investigates a simple yet effective EEG representation, i.e., microstates. Microstates represent the building blocks of brain activity patterns at a microscopic time scale. We build a universal microstate tokenizer from a large medical EEG dataset by clustering continuous EEG signals into sequences of discrete microstates. The microstate tokenizer is

#135
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning) 6.0 7.2/5.0/5.8

JEPAs often regularize one-view embeddings toward an isotropic Gaussian, implicitly baking Euclidean symmetry into the representation. We show that this is not merely a benign default. For a known structured downstream geometry $H\succ0$, the minimax and maximum-entropy covariance under a Hamiltonian energy budget is $(c/d)H^{-1}$, and Euclidean isotropy incurs a closed-form price of isotropy. More importantly, when the downstream geometry is unknown, no geometry-independent fixed marginal target is canonical: every fixed covariance shape can be maximally misaligned for some structured geometry. We further show that even oracle one-view marginals do

#136
Generative Media 2026-05-19 arXiv cs.LG (Machine Learning)arXiv — Generative Media / Diffusion 6.0 7.2/5.0/5.8

Ride-hailing platforms like DiDi Chuxing operate in highly dynamic environments where balancing driver supply and passenger demand is critical. Although driver-side subsidies serve as a primary lever to align these forces and improve key KPIs like completed rides (Rides) and gross merchandise value (GMV), optimizing them in production requires simultaneously meeting three constraints: (i) responsiveness to stochastic shocks, (ii) strict subsidy-rate caps, and (iii) low-latency execution at city scale. These requirements rule out expensive per-order optimization, calling for a forward-looking, constraint-aware city-level controller for online sequential decision making. To meet

#137
Frontier LLMs 2026-05-13 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over cumulative hidden states in previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and closer to uniform (max weight ≈ 0.2), limiting the model's ability to select informative states in previous layers. This raises a key but underexplored design question: what layer-wise representations should be routed in Attention Residuals? To answer this

#138
Frontier LLMs 2026-05-12 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that

#139
Frontier LLMs 2026-05-16 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Low-resource deployment constraints have made model quantization essential for deploying neural networks while preserving performance. Meanwhile, model merging has become an increasingly practical low-resource strategy for integrating multiple task- or domain-specialized experts into a single model without joint training or multi-model serving. Together, quantization and model merging enable an efficient low-resource deployment pipeline by integrating multiple experts into one low-bit model. We formulate this setting as Post-Merge Quantization (PMQ). We show that directly applying post-training quantization (PTQ) to a merged model is unreliable because two distinct deviations are coupled: the

#140
Frontier LLMs 2026-05-15 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, making them difficult to handle interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to outdated background contamination, delayed response to new prompts, and loss of long-range memory. To address this issue, we

#141
Frontier LLMs 2026-05-19 arXiv cs.LG (Machine Learning)arXiv — Evals & Benchmarks 6.0 7.2/5.0/5.8

Fine-tuning large language models on new data improves task performance but degrades capabilities learned during pretraining, a phenomenon known as catastrophic forgetting. Existing methods mitigate this by modifying the fine-tuning objective to suppress high-loss tokens or sequences, but these tokens are essential for learning new tasks, especially those with poor pretraining coverage. In such settings, hard tokens should still contribute to learning, so forgetting must be controlled without suppressing them. We identify a simple mechanism for doing so: per-step forgetting is bounded by the product of the learning rate and

#142
Government & Defense 2026-05-19 FedScoop — AI 6.0 6.6/5.0/5.4

As the amount of data generated by space exploration increases exponentially, NASA is looking to artificial intelligence tools to more rapidly synthesize information and provide mission support. During a keynote address last week, Troy LeBlanc, chief information officer of the Johnson Space Center in Houston, illustrated how technology advancements have multiplied the agency’s data flow by focusing on some of NASA’s most recognizable outputs: photos. From pictures of the first moon landing to training photos to the latest captures of the four Artemis II astronauts landing safely back on Earth,

#143
Efficiency 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv — Efficiency (Quantization, MoE, Inference) 6.0 6.9/5.3/5.8

Mixture-of-Expert (MoE) models enable efficient inference by employing smaller experts and activating only a subset of them per token. MoE serving engines distribute experts across multiple GPUs and route tokens to appropriate GPUs at inference time based on experts activated. They process tokens in lock-step fashion, where tokens within a batch must finish processing before proceeding to the next layer. This synchronization barrier acts as a critical bottleneck because the performance of MoE models is limited by the straggler GPU that finishes last. Stragglers emerge when too many heavily used
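
The lock-step bottleneck described in this abstract can be illustrated with a toy calculation. This is a minimal sketch, not the paper's method: it assumes a layer's time is proportional to the number of token-expert assignments on the busiest GPU, and the expert placement, routing table, and function names (`gpu_loads`, `layer_time`) are hypothetical.

```python
from collections import Counter

def gpu_loads(token_experts, expert_to_gpu, num_gpus):
    """Count how many token-expert assignments land on each GPU."""
    loads = Counter()
    for experts in token_experts:
        for expert in experts:
            loads[expert_to_gpu[expert]] += 1
    return [loads[gpu] for gpu in range(num_gpus)]

def layer_time(loads, time_per_assignment=1.0):
    """Lock-step layer: every GPU waits for the slowest one (the straggler)."""
    return max(loads) * time_per_assignment

# Hypothetical setup: 4 experts placed on 2 GPUs, top-2 routing for 4 tokens.
expert_to_gpu = {0: 0, 1: 0, 2: 1, 3: 1}
token_experts = [(0, 1), (0, 1), (0, 2), (1, 3)]

loads = gpu_loads(token_experts, expert_to_gpu, num_gpus=2)
straggler_time = layer_time(loads)
balanced_time = sum(loads) / len(loads)  # best case if load were spread evenly
print(loads, straggler_time, balanced_time)
```

In this toy case the heavily used experts 0 and 1 share a GPU, so the layer takes as long as that GPU's queue even though the other GPU sits mostly idle.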

#144
Frontier LLMs 2026-05-15 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.3/5.5

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds

#145
Frontier LLMs 2026-05-16 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Memory systems can store vastly different amounts of information despite similar hardware constraints. Here, we show that superior spatial memory emerges from a discrete stiffening of hippocampal population geometry: a transition from disorganized to crystalline collective coding. Comparing food-caching chickadees to non-caching zebra finches, we found that the caching hippocampus maintains a topologically rigid, "crystalline" geometry with significantly higher geometric stability (Shesha 0.245 vs. 0.166) and nearly two-fold greater temporal coherence (Shesha 0.393 vs. 0.209), while the non-caching hippocampus resembles a disorganized "mist." This stability is actively constructed by synergistic circuit

#146
Frontier LLMs 2026-05-19 arXiv cs.LG (Machine Learning)arXiv — Evals & Benchmarks 6.0 7.2/5.0/5.8

Bayesian optimization (BO) selects evaluation points for expensive black-box objectives using Gaussian process (GP) predictive distributions. Kernel choice and hyperparameter selection can lead to miscalibrated predictive distributions and an inappropriate exploration-exploitation trade-off. For minimization, sampling criteria such as expected improvement (EI) depend on the predictive distribution below the current best value, so lower-tail miscalibration directly affects the sampling decision. This article studies goal-oriented calibration of GP predictive distributions below a low threshold $t$ in the noiseless setting, for standard GP models with hyperparameters selected by maximum likelihood. A framework for

#147
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning) 6.0 7.2/5.0/5.8

Flash floods in Bangladesh's haor wetlands show up with almost no warning. They wreck the annual boro rice harvest. Current setups, built for riverine floods, miss backwater dynamics entirely. These basins are flat. Water does not behave like it does on the Brahmaputra. We built HaorFloodAlert, a deseasonalized machine learning ensemble that forecasts 72-hour flood probability for the Sunamganj Haor (approximately 8,000 km²). Temperature was acting as a seasonal cheat code: it inflated accuracy by 6.9 pp just because floods happen in warm months. We caught that. We also

#148
Government & Defense 2026-05-19 FedScoop — AI 6.0 6.6/5.0/5.4

Immigration and Customs Enforcement has led the adoption of artificial intelligence for the Department of Homeland Security, but the unit is taking a slower pace with AI agents, according to the component’s top IT official.  “We are not using a lot of those features,” ICE CIO Dustin Goetz said during AFCEA Bethesda’s LEAPS Summit last week in Washington, D.C. “And we’re not planning on it.” AI agents are the subject of immense hype — and confusion. Even as technology vendors trumpet the technology’s potential, there is no universally agreed-upon definition

#149
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Modern interactive video world models have achieved impressive visual fidelity, yet lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond

#150
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction

#151
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.3/5.5

Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more

#152
Reinforcement Learning 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv — Reinforcement Learning 6.0 6.9/5.3/5.8

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with
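
The gap between a criterion's importance and its usefulness as a signal can be shown with a small example. This is a minimal sketch under stated assumptions, not the paper's method: static aggregation is taken to be a weighted average of per-criterion scores in [0, 1], and the criterion names, weights, and helper functions are hypothetical.

```python
from statistics import pvariance

def static_reward(scores, weights):
    """Static aggregation: weighted average of per-criterion scores."""
    total = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total

def criterion_spread(rollouts, criterion):
    """Variance of one criterion across rollouts; 0 means it cannot rank them."""
    return pvariance([r[criterion] for r in rollouts])

# Hypothetical human-assigned importance: "safety" matters most.
weights = {"safety": 5.0, "style": 1.0}

# Two rollouts for the same prompt; "safety" is already saturated.
rollouts = [
    {"safety": 1.0, "style": 0.0},
    {"safety": 1.0, "style": 1.0},
]

rewards = [static_reward(r, weights) for r in rollouts]
spreads = {c: criterion_spread(rollouts, c) for c in weights}
print(rewards, spreads)
```

Here the heavily weighted criterion contributes nothing to the difference between rollouts, while the lightly weighted one carries the whole signal, scaled down by its small weight.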

#153
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an Ultra-low-bit KV Cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. More importantly, we not only provide theoretical justification

#154
Frontier LLMs 2026-05-19 arXiv cs.LG (Machine Learning)arXiv — Mechanistic Interpretability 6.0 7.2/5.0/5.8

Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalized as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional

#155
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv — Evals & Benchmarks 6.0 6.9/5.3/5.8

Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of the parameters of modern large language models (LLMs) by iteratively refining a latent state and final answer. While powerful, their deterministic recursion can lead to convergence at suboptimal solutions, without an escape mechanism. A common workaround relies on task-specific input perturbations at test time combined with answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task-agnostic framework for test-time compute scaling that addresses this limitation through stochastic exploration. PTRM injects Gaussian noise at each deep recursion step, enabling

#156
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning) 6.0 7.2/5.0/5.8

Counterfactual Regret Minimization (CFR) is the dominant algorithmic family for solving large imperfect-information games, underpinning breakthroughs such as Libratus and Pluribus in No-Limit Texas Hold'em poker. In real-time game-playing systems, the solver must compute a near-equilibrium strategy within a strict time budget of only a few seconds per decision, and the number of CFR iterations completed in this window directly determines play strength. We present Parallel CFR, the first parallelization framework for real-time depth-limited CFR solving that seamlessly integrates pruning, abstraction, and advanced CFR variants. We decompose each CFR iteration

#157
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv — Evals & Benchmarks 6.0 6.9/5.3/5.8

Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via

#158
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Autoregressive language models execute Transformer layers sequentially, creating a latency bottleneck that is not removed by conventional tensor or pipeline parallelism. We study whether this layerwise dependency can be relaxed by treating the hidden-state trace across layers as the solution of a nonlinear residual equation and solving it with parallel Newton-style updates. While this view is principled, exact Newton corrections require expensive Jacobian-vector products and naive fixed-point iterations are unstable on trained Transformers. We introduce Structured Newton Layer Parallelism (SNLP), a training and inference framework that replaces exact layer Jacobians

#159
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe-text paired with safe-image groundtruth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, degrading generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for

#160
Frontier LLMs 2026-05-18 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal

#161
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv — Evals & Benchmarks 6.0 6.9/5.3/5.8

Constraint programming practitioners accelerate hard problems through a layered set of techniques applied in order of risk. Standard hardening (symmetry-breaking and implied constraints) is applied first and preserves satisfiability. Streamliner constraints, which restrict search to a structural sub-family of solutions, do not preserve satisfiability and are reserved as a final lever. Existing automated streamliner-synthesis approaches either search a constraint grammar or prompt a Large Language Model directly on the problem model. We propose a different approach: enumerate feasible solutions, train a Convolutional Neural Network contrastively against perturbed non-solutions to detect

#162
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning) 6.0 7.2/5.0/5.8

Estimating forest aboveground biomass (AGB) from Earth observation combines two structurally incompatible label sources: spaceborne lidar provides canopy structure at millions of locations but no biomass estimate, and ground-based plots provide biomass at thousands of biased locations but no metrics of structure. No single training sample carries labels for all target variables, plot labels are missing not at random (MNAR), and biomass is linked to the structural variables by known but biome-specific allometric laws. We formalise this as multi-task dense regression under heterogeneous disjoint partial supervision with MNAR labels and

#163
Frontier LLMs 2026-05-12 AK (@_akhaliq) Daily PapersHugging Face Daily Papers 6.0 7.2/5.0/5.8

Language models are instruction-tuned to refuse harmful requests, but the mechanisms underlying this behavior remain poorly understood. Popular steering methods operate on the residual stream and degrade output coherence at high intervention strengths, limiting their practical use. We introduce contrastive neuron attribution (CNA), which identifies the 0.1% of MLP neurons whose activations most distinguish harmful from benign prompts, requiring only forward passes with no gradients or auxiliary training. In instruct models, ablating the discovered circuit reduces refusal rates by over 50% on a standard jailbreak benchmark while preserving fluency and

#165
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning) 6.0 7.2/5.0/5.8

Backpropagation with gradient descent is a common optimization strategy employed by most neural network architectures in machine learning. However, finding optimal hyperparameters to guide training has proven challenging. While it is widely acknowledged that selecting appropriate parameters is crucial for avoiding overfitting and achieving unbiased outcomes, this choice remains largely based on empirical experiments and experience. This paper presents a new probabilistic framework for the learning rate, a key parameter in stochastic gradient descent. The framework develops classic Bayesian statistics into a double-Bayesian decision mechanism involving two antagonistic Bayesian processes.

#166
Government & Defense 2026-05-19 C4ISRNET 6.0 6.6/5.0/5.4

Air Force Special Operations Command is testing whether it can take its new Skyraider II apart, pack it inside a cargo jet and put it back together in the field, officials said this week at Special Operations Forces Week. The single-engine, prop-driven OA-1K, a militarized version of the Air Tractor AT-802 crop duster, is built to give isolated special operations teams eyes overhead and firepower on call from rough dirt strips with little support. “It is essentially a Swiss Army Knife of airborne capability,” Lt. Col. Robert Wilson, AFSOC’s

#167
Government & Defense 2026-05-19 War on the Rocks 6.0 6.6/5.0/5.4

When the U.S. military launched its war against Iran in Feb. 2026, it did not just dismantle Iranian military capabilities. It shattered the illusion that the United States would consult with its closest allies and that an ally’s refusal to grant base access can stop an American war in motion. Rather than the much discussed “outward flows” of military assets from the Korean Peninsula to the Middle East, it is the anticipated “inward flows” of U.S. military assets that could be more consequential in times of crisis. Focusing on the early

#168
Research 2026-05-19 arXiv — Agents / Tool UsearXiv cs.AI (Artificial Intelligence)arXiv — Evals & Benchmarks 6.0 6.9/5.0/6.2

Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2 percentage points across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for when Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions

#170
Frontier LLMs 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning) 6.0 7.2/5.0/5.8

While conventional (k=1) discrete-time barrier certificate conditions impose strict safety constraints by requiring the function to be non-increasing at every step, k-inductive barrier certificates relax this by allowing a temporary increase (up to k-1 times, each within a threshold $ε$) while maintaining overall safety and improving flexibility. This paper leverages neural networks and constructs k-inductive neural barrier certificates (k-NBCs) for (partially) unknown nonlinear systems. While neural networks offer scalability in the design process, they lack formal guarantees, requiring additional approaches such as counterexample-guided inductive synthesis (CEGIS) with satisfiability

#171
Research 2026-05-19 arXiv — Agents / Tool UsearXiv cs.AI (Artificial Intelligence) 5.9 6.9/5.0/5.8

Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six
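
The four-part contract named in this abstract (proposer, verifier, commit step, reject signal) can be sketched as a small loop. This is an illustrative sketch only, assuming the reject signal is fed back to the proposer for a bounded number of retries; the `run_boundary` function, the `Rejection` type, and the toy proposer are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Rejection:
    """Reject signal: tells the proposer why the last output was refused."""
    reason: str

def run_boundary(propose, verify, commit, max_attempts=3):
    """Turn a stochastic proposal into a system action, or return a rejection."""
    feedback = None
    for _ in range(max_attempts):
        candidate = propose(feedback)   # stochastic side (e.g. an LLM call)
        error = verify(candidate)       # deterministic check, None means valid
        if error is None:
            return commit(candidate)    # only verified output reaches the system
        feedback = Rejection(error)
    return feedback

# Toy stand-ins: the "model" fixes its output after seeing a rejection.
ledger = []

def propose(feedback):
    return "forty-two" if feedback is None else "42"

def verify(candidate):
    return None if candidate.isdigit() else "expected digits only"

def commit(candidate):
    ledger.append(int(candidate))
    return ledger[-1]

result = run_boundary(propose, verify, commit)
print(result, ledger)
```

Keeping `commit` as the only function that mutates `ledger` is the point of the sketch: nothing the proposer emits changes system state until the verifier accepts it.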

#172
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

We study the contextual multi-armed bandit problem with a finite context space (a.k.a. subpopulations), where the learner recommends a best action for each context and is evaluated by context-weighted simple regret. Our guarantees are worst-case over the reward distributions, while remaining instance-dependent with respect to the context distribution vector $p$. Akin to experimental design problems where the population of interest is fixed but the sampled subpopulation can be controlled, we allow the learner to actively choose which context to sample from. For a known $p$, we characterize tight regret rates:

#173
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Multi-tenant retrieval-augmented generation (RAG) services advertise per-account differential privacy as the operative leakage boundary: each account's queries are guaranteed to satisfy $(\varepsilon_{\text{acc}}, \delta_{\text{acc}})$-DP with respect to the index. We identify same-index multi-account collusion as a privacy-boundary failure: for $k$ same-tenant accounts coordinating against the tenant's index (the operative regime), known DP composition theory implies joint leakage degrades unconditionally at rate $\Theta(\sqrt{k} \cdot \varepsilon_{\text{acc}})$ for Gaussian-noised retrieval. Cross-tenant and external collusion match the rate only under explicit access-control failure (M4); without M4 these regimes have zero leakage by design

#174
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv — Evals & Benchmarks 5.9 6.9/5.0/5.8

Large language models (LLMs) can enhance factuality via retrieval-augmented generation (RAG), but applying RAG to every query is unnecessary when the model-only answer is reliable. This motivates cascaded RAG: each query is first handled by an LLM-only branch, escalated to a RAG fallback only if the primary branch is uncertain, and abstained from when neither branch is sufficiently trustworthy. However, calibrating such cascades stage by stage may be conservative, since the final utility depends on joint uncertainty thresholding of LLM-only and RAG. In this work, we develop BalanceRAG to certify
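
The three-way routing described in this abstract (answer with the LLM alone, escalate to RAG, or abstain) can be sketched as follows. This is a minimal sketch, not BalanceRAG itself: it assumes each branch returns an answer with a confidence score and that fixed thresholds `tau_llm` and `tau_rag` are given, whereas the abstract is about how to calibrate such thresholds jointly. All names and the stub branches are hypothetical.

```python
def cascade(query, llm_answer, rag_answer, tau_llm, tau_rag):
    """Route a query: LLM-only first, RAG fallback second, abstain last."""
    answer, confidence = llm_answer(query)
    if confidence >= tau_llm:
        return ("llm", answer)
    answer, confidence = rag_answer(query)  # only paid for when escalating
    if confidence >= tau_rag:
        return ("rag", answer)
    return ("abstain", None)

# Stub branches with made-up confidence scores, for illustration only.
def llm_answer(query):
    return ("Paris", 0.9) if "France" in query else ("unsure", 0.2)

def rag_answer(query):
    return ("Canberra", 0.8) if "Australia" in query else ("unsure", 0.1)

for q in ["capital of France?", "capital of Australia?", "capital of Atlantis?"]:
    print(q, cascade(q, llm_answer, rag_answer, tau_llm=0.7, tau_rag=0.6))
```

Because the final outcome depends on both thresholds together, tuning `tau_llm` in isolation changes which queries ever reach the RAG branch, which is the coupling the abstract points to.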

#175
Efficiency 2026-05-19 arXiv cs.AI (Artificial Intelligence)arXiv — Efficiency (Quantization, MoE, Inference) 5.9 6.9/5.0/5.8

Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling

#176
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Text-attributed graph fraud detection (TAGFD) plays a critical role in preventing fraudulent activities on online social and e-commerce platforms. However, to evade detection, fraudsters continuously evolve their camouflaging strategies by deliberately mimicking textual responses of benign users, thereby concealing their malicious purposes. This phenomenon, referred to as semantic camouflage, fundamentally undermines commonly relied assumptions on how structural and attribute cues can be exploited to identify fraudsters, and makes it difficult to spot fraudsters with unsupervised TAGFD. To bridge the gaps, we propose a Case-Adaptive Multi-cue Expert fRAmework (CAMERA) for unsupervised

#177
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

This work addresses the problem of learning directed acyclic graphs (DAGs) from nodal observations generated by a linear structural equation model. DAG learning is a central task in signal processing, machine learning, and causal inference, but it remains challenging because acyclicity is a global combinatorial property. Continuous acyclicity constraints have led to important algorithmic advances by replacing the discrete DAG constraint with smooth equality constraints. However, existing formulations still involve difficult non-convex optimization landscapes and may suffer from degenerate first-order optimality conditions. Here, we restrict attention to DAGs with non-negative

#178
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Distributed acoustic sensing (DAS) systems generate continuous, ultra-high-channel-count data streams at rates that exceed the capabilities of conventional batch-oriented analysis frameworks. As a result, essential tasks such as interactive exploration of long-duration recordings, scalable event annotation, and real-time algorithm-in-the-loop monitoring remain inadequately supported by workflows built around manually selected data segments and offline processing. This paper presents FiLark (Fiber Lark), a Python framework that applies a streaming-first principle uniformly across data access, signal processing, visualization and monitoring for DAS. Instead of operating on manually selected data segments, FiLark presents any

#179
Industry 2026-05-20 The Information — AI 5.9 6.9/5.4/5.4

If you pay attention to one thing in AI this week, it should be Google’s I/O presentation on Tuesday. The tech giant unveiled a bunch of AI tools, most significantly the addition of new AI features such as agents to Google Search. Any distinction that existed between Google’s Gemini AI chatbot and its search bar is disappearing. No longer should we compare OpenAI’s ChatGPT to Gemini—it’s ChatGPT versus Google Search. That’s a big deal, given that Search is much more widely used than Gemini—Google CEO Sundar Pichai intimated today that

#180
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Predicting protein-ligand binding affinity remains intractable for multi-domain proteins, where inter-domain dynamics govern molecular recognition. Existing geometric deep learning methods typically treat proteins as monolithic static graphs, suffering from rigid-body assumptions and aleatoric noise in flexible regions. To address this, we introduce HCLBind, a self-supervised framework that decouples geometric representation learning from affinity regression. HCLBind leverages a general-to-specific pre-training paradigm on the Q-BioLiP database to learn a robust physical grammar of binding. We propose a novel hierarchical decoy strategy: the model learns local physicochemical constraints through protein coordinate perturbation in

#181
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Non-destructive testing of aerospace SiC/SiC composites via X-ray computed tomography (XCT) relies on expert visual assessment, with current workflows offering limited traceability for accept/reject decisions. Deep convolutional networks can automate defect detection, yet their black-box nature conflicts with the transparency that industrial inspection practice demands. To close this gap, we introduce p-ResNet-50, a convolutional framework extended with a prototype layer that couples high detection accuracy with case-based explanations. Six learned prototypes are explicitly aligned with expert-defined semantic categories (healthy matrix, matrix-air interfaces, pores, line-like defects, and mixed morphologies) so that every classification

#182
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence) 5.9 6.9/5.4/5.4

Large language models (LLMs) are widely used for open-ended tasks, but underspecified prompts can lead to low-quality answers and additional interaction. This paper studies whether structured prompt design improves response quality while reducing user effort. We compare three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt. We evaluate these conditions across four task types (summarization, planning, explanation, and coding) using three LLM systems: ChatGPT, Claude, and Grok. Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity. Checklist-improved prompts achieved the highest mean

#183
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Visual-Inertial Odometry (VIO), which is critical to mobile robot navigation, uses cameras with a large number of pixels. Capturing and processing camera images requires significant resources. This work presents a minimalist approach to planar odometry, demonstrating that just four visual measurements and an IMU can provide robust motion estimation for differential-drive robots. Our key insight is that four downward-facing photodiodes that sense the world through optical Gabor masks produce signals that encode speed. Based on this, we jointly optimize the mask parameters alongside a Temporal Convolutional Network (TCN) using a physically-grounded

#184
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Blind source separation (BSS) is a natural framework for studying how latent causes may be recovered from sensory mixtures, but deriving online and biologically plausible algorithms for structured (i.e., constrained to known domains) and potentially correlated sources remains challenging. Recent work has derived neural networks for BSS from maximization of an entropy measure, yet its online implementations involve complex and nonlocal recurrent dynamics. Motivated by this perspective, we propose Predictive Entropy Maximization, which achieves competitive performance in BSS, using only local weight updates. The method employs a close approximation of

#185
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Squared Wasserstein distance is a frequently used tool to measure discrepancy between probability distributions. This distance is typically computed between empirical measures of size $n$ from two underlying random samples. Unfortunately, even in lower dimensional Euclidean space problems $\left( d \in \{2,3\} \right)$, algorithms for Wasserstein distance computation with approximate or exact precision guarantees scale poorly in the runtime as a function of $n$ and the desired precision. In response, we consider the computational-statistical runtime, where the goal is to estimate from samples the Wasserstein distance between potentially smooth measures

#186
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Music streaming fraud, where bad actors artificially inflate stream counts to manipulate chart rankings and royalty payments, poses a significant threat to streaming services and legitimate content creators. Traditional fraud detection approaches struggle with a critical challenge: many legitimate edge cases, including super-fans and sleep-music sessions, exhibit activity patterns that closely mimic those of coordinated fraud. We present SAGE, a novel counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble for confident negative identification from unlabeled data. Our ensemble architecture employs pluggable statistical gates (currently
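
The abstract names SimHash as the basis of its stratified sampling but, as excerpted here, does not show how SAGE applies it. As general background only, the following is a minimal sketch of the standard SimHash fingerprint over a bag of tokens; the function names, the choice of MD5 as the per-token hash, and the example tokens are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable


def simhash(tokens: Iterable[str], bits: int = 64) -> int:
    """Standard SimHash: each token votes +1 or -1 on every bit position.

    Assumes `bits` is a multiple of 8 and at most 128 (the MD5 digest size).
    """
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:                 # majority of tokens had this bit set
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical session features; similar sessions tend to get nearby fingerprints.
session_a = ["track:1", "track:2", "track:3", "device:ios"]
session_b = ["track:1", "track:2", "track:3", "device:android"]
distance = hamming(simhash(session_a), simhash(session_b))
```

Grouping items by fingerprint prefix or by small Hamming distance is one common way to form strata of near-duplicate behavior; whether SAGE does this is not stated in the excerpt.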

#187
Frontier LLMs 2026-05-19 AK (@_akhaliq) Daily Papers · Hugging Face Daily Papers 5.9 7.2/5.0/5.5

Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation-based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object-level controllability and prevents new interactable assets from being produced on demand. We address this gap by formulating physically interactable indoor scene synthesis as programmatic world generation, and present SceneCode, a framework that compiles a natural language prompt

#188
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Competitive selection processes, from scientific funding to admissions and hiring, use evaluations to score candidates, and eventually choose a subset of them based on those scores. Recently, many organizations have adopted partial lotteries, which randomize selection based on evaluation scores. However, existing lottery designs are inherently unstable, as a small change to a single candidate's score can cause large shifts in their selection probabilities. This instability undermines a key goal of lotteries: reducing the influence of fine-grained score distinctions near the decision boundary. We propose smoothness as a design principle

#189
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

While optimal transport (OT) enforces a rigid constraint by requiring two measures to be matched exactly, partial optimal transport relaxes this requirement by allowing mass to remain unmatched through a global budget, scalar rebate, or uniform rejection rule. However, many applications call for more structured, pointwise rejection mechanisms, where the decision to leave mass unmatched depends on side-specific reliability, support geometry, or external information about which components should participate in the comparison. We introduce intent-controlled partial optimal transport (IC-POT), a targeted generalization of partial transport that replaces the global rejection

#190
Frontier LLMs 2026-05-14 AK (@_akhaliq) Daily Papers · Hugging Face Daily Papers 5.9 7.2/5.0/5.5

We introduce TopoPrimer, a framework that makes the global topological structure of the series population an explicit input to any forecasting model. TopoPrimer improves accuracy across diverse domains, stabilizes forecasts under seasonal demand spikes, and closes the cold-start gap. Precomputed once per domain via persistent homology and spectral sheaf coordinates, TopoPrimer deploys per token for fully-trained models and as a lightweight adapter for pre-trained backbones. Of these two components, sheaf coordinates are the primary accuracy driver. Across four public benchmarks on Chronos and TimesFM, TopoPrimer consistently improves forecasting accuracy, with

#191
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Bayesian filtering is a well-known problem that aims to estimate plausible states of a dynamical system from observations. Among existing approaches to solve this problem, particle filters are theoretically exact for non-linear dynamics and observations, but suffer from poor scalability in high dimensions. In this work, we show that diffusion-based emulators of dynamical systems can be used to implement, without additional training, an optimal variant of particle filters that has remained largely unexplored due to implementation challenges with classical numerical solvers. Experiments on nonlinear chaotic systems, including atmospheric dynamics, demonstrate

#192
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Learning generalizable trajectory representations from raw GPS traces remains difficult because the data is continuous, noisy, and irregularly sampled. Spatial tokenization is also challenging: fine grids yield sparse cells with weak embeddings, while coarse grids merge heterogeneous movement patterns into the same token. We present TrajTok, a trajectory encoder with a simple pretraining recipe for transferable trajectory embeddings. TrajTok first learns a multi-resolution hexagonal cell partition from the spatial distribution of GPS points, converting noisy GPS sequences into discrete cell tokens. To capture both geometry and kinematics, it uses a

#193
AI for Science 2026-05-19 DeepMind 5.9 7.2/5.0/5.4

In Uganda, the incidence of early-onset breast cancer is growing at an alarming rate. Dr. Daudi Jjingo and his team at Makerere University are working to identify genetic targets for potential vaccine development. By utilizing tools like AlphaFold, AlphaGenome, and Antigravity, they can conduct this research using only a laptop and a server, enabling seamless collaboration with local hospitals and institutions. By analyzing a protein highly expressed among breast cancer patients, the team successfully evaluated 15,000 potential binding sites, narrowing the scope to just 15 viable targets for laboratory validation.

#194
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Uniform sampling on implicitly defined manifolds is a core primitive in motion planning, constrained simulation, and probabilistic machine learning. MASEM addresses this problem by entropy-maximizing resampling, but its resampling weights depend on a local k-nearest-neighbour density estimate whose errors can be amplified by aggressive resampling temperatures. We ask whether a polynomial-maximization moment estimator can replace the plug-in density rule without changing the surrounding MASEM architecture. The proposed PMM-MASEM module computes shell spacings from nested k-nearest-neighbour radii, estimates their standardized cumulants, and uses a gated PMM2/PMM3 estimator only when the spacing

#195
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks 5.9 6.9/5.0/5.8

World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-,

#196
Research 2026-05-19 arXiv cs.LG (Machine Learning) 5.9 7.2/5.0/5.4

Decentralized learning (DL) is an emerging machine learning paradigm where nodes collaboratively train models without a central server. However, the collaborative nature of DL makes it vulnerable to backdoor attacks, where a model is taught to behave normally on standard inputs while executing hidden, malicious actions when encountering data with specific triggers. Backdoor attacks in DL remain understudied and existing defenses often overlook DL constraints. We introduce Argus, a novel backdoor detection framework native to DL that requires neither a central coordinator nor prior knowledge of the trigger. In Argus,

#197
Government & Defense 2026-05-19 DARPA — News 5.8 6.9/5.0/5.4

DARPA and the State of Utah have entered into an agreement to establish the Strategic Materials Accelerator and Research Test Bed (SMART).

#198
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence) 5.8 6.9/5.0/5.4

Dramatic cost reductions driven by private sector innovations have led to a rapid increase in the number of satellites in orbit and a corresponding surge in space-generated data. As this trend continues, transmitting large volumes of data to Earth for processing may become increasingly costly and challenging due to potential space-to-Earth link congestion and increased latency. Moreover, traditional ground station networks may face difficulties accommodating growing data flows and workloads because of capacity constraints, complex scheduling logistics, and restricted visibility windows, which can limit scalability. Space Data Centers (SDCs) --

#199
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence) 5.8 6.9/5.0/5.4

As autonomous coding agents see rapid adoption, their evaluation has primarily focused on task completion rates while holding the target codebase fixed. This leaves a critical question unanswered: does the structural and stylistic quality, or "cleanliness", of the underlying code affect an agent's ability to navigate and modify it? To isolate the effect of code cleanliness from agent capability, we introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity. The pairs are constructed in

#200
Industry 2026-05-20 The Information — AI 5.8 6.9/5.0/5.4

For the past two days, I’ve been at JP Morgan’s annual tech conference in Boston. As usual, AI has dominated most of my conversations with investors. But in a change from the past few months, it’s not Anthropic and OpenAI driving those discussions—but hardware and utility companies a few steps removed from what the AI model makers do. On Monday, for example, NextEra Energy and Dominion Energy agreed to a $400 billion tie-up driven in part by AI’s voracious demand for energy, the two utility giants said. That’s also happening

#201
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence) 5.8 6.9/5.0/5.4

The power grid is a critical infrastructure underpinning all aspects of modern society and its services. Maintaining its effectiveness requires continuous adaptations. In particular, addressing sustainability targets, demand patterns, and urbanisation trends requires implementing changes to the network. Actual developments can potentially span over a decade, with supply continuity and service quality that must be preserved throughout by ensuring conformance to several topological and combinatorial invariants. Long-term power grid planning deals with the above process, and although planning languages could be a natural choice, the kind of properties and invariants

#202
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence) 5.8 6.9/5.0/5.4

Claim verification is an important problem in high-stakes settings, including health and finance. When information underpinning claims is incomplete or conflicting, uncertain answers may be more appropriate than binary true or false classifications. In all cases, faithful explanations of the considerations determining the final verdict are crucial. We introduce inference-time argumentation (ITA), a trainable neurosymbolic framework for ternary claim verification in which a formal argumentation semantics giving the strength of claims is used both (i) to guide LLM training as models learn to generate arguments and assign them base scores

#204
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence) 5.8 6.9/5.0/5.4

Construction remains the deadliest industry sector in the United States, with 1,055 fatal worker injuries recorded in 2023, and the majority preventable. Existing monitoring approaches are expensive, require real-time human operators, or address only a narrow subset of violations. This paper presents a passive, end-of-shift construction safety monitoring pipeline processing video from POV body-worn and fixed wall-mounted cameras through a three-stage architecture: (1) fine-tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3-VL-8B-Instruct with a method-prompted, persona-scaffolded three-pass adversarial chain-of-thought

#205
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence) 5.8 6.9/5.0/5.4

Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label prompt-tuning approach for radiology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category

#206
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence) 5.8 6.9/5.0/5.4

Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches, ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation, primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajectories. We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems.

#208
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence) 5.8 6.9/5.0/5.4

Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human-AI conversations with users' self-reported thoughts: their reasons for sending prompts and reactions to assistant responses. ThoughtTrace comprises 1,058 users, 2,155 conversations, 17,058 turns, and 10,174 thought annotations collected across 20 language models. Our analysis shows that ThoughtTrace captures long-horizon, topically diverse interactions, and that thoughts are semantically distinct from messages, difficult for frontier LLMs to infer from context,

#209
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence) 5.8 6.9/5.0/5.4

Explicit software architecture models are essential artifacts for communicating, analyzing, and evolving complex software-intensive systems. In ROS 2-based robotic systems, however, structural (de-)composition and integration semantics are often only implicitly encoded across distributed artifacts such as source code and launch files, making recovery of hierarchical architecture particularly difficult. Existing approaches mainly focus on node-level entities and communication wiring, while providing limited support for recovering hierarchical structural (de-)composition across multiple abstraction levels. In this paper, we extend our previously proposed blueprint-guided LLM-assisted architecture recovery pipeline for ROS 2 systems through two major enhancements:

#210
Research 2026-05-19 arXiv cs.AI (Artificial Intelligence) 5.8 6.9/5.0/5.4

AI-assisted theorem proving can now generate substantial Lean developments for olympiad-level mathematics, but the evidential status of such developments depends on which declarations are actually verified. This paper reports a Lean 4 formalization case study of an Aristotle API proof attempt for the Grasshopper problem, originally posed as IMO 2009 Problem 6. The generated artifact states a generalized Lean version of the theorem, contains four verified helper lemmas for local components of a maximality and adjacent-swap exchange strategy, and leaves the main theorem grasshopper closed directly by one unresolved sorry.
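
To make the distinction between verified declarations and a theorem closed by `sorry` concrete, here is a small Lean 4 sketch. It is a toy illustration with hypothetical names (`helper`, `main_claim`) and is not taken from the paper's artifact or from the Grasshopper formalization.

```lean
-- A fully verified helper: the kernel checks this proof.
theorem helper (n : Nat) : n + 0 = n := rfl

-- A statement "closed" by sorry: Lean accepts the file with a warning,
-- but the theorem has not actually been proved.
theorem main_claim (n : Nat) : 0 + n = n := by
  sorry

-- Lists the axioms a declaration depends on; for main_claim this includes sorryAx.
#print axioms main_claim
```

Checking a development with `#print axioms` is one simple way to see which declarations rest on `sorryAx` rather than on a checked proof, which is the kind of evidential distinction the abstract describes.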

Items: 214
Multi-source: 122
Long-form (≥7.5): 5
Sources OK / attempted: 90 / 119
Top category: Frontier LLMs (96 items)