Wolf Digest — 2026-04-22

Kimi K2.6 released — now the leading open-weights model (Moonshot)

Frontier LLMs 2026-04-21 Moonshot AIArtificial Analysis

7.7

I 9.7 Im 6.6 P 6.5

Moonshot released Kimi K2.6, described as a "natively multimodal model, powerful coding capabilities, and Agent performance" spanning multiple modes. Artificial Analysis positioned it as the new leading open-weights model on independent evaluations. Follow-up coverage focuses on the coding and agent sub-scores.

How it was discussed across sources

Moonshot AI: Covered by Moonshot AI.
Artificial Analysis: Evaluated by Artificial Analysis' independent benchmarking.

evalsfrontier_llmmultimodalai_coding

Qwen3.5-Omni Technical Report

Frontier LLMs 2026-04-21 alphaXiv trendingQwen Team (Alibaba)

7.5

I 8.9 Im 6.6 P 6.7

Qwen3.5-Omni: an omnimodal LLM from Alibaba achieving state-of-the-art on audio-and-audio-visual benchmarks across text, image, audio, and video. Extended from the Qwen3 base.

How it was discussed across sources

alphaXiv trending: Surfaced on alphaXiv Trending — the high-velocity community-paper tracker.
Qwen Team (Alibaba): Covered by Qwen Team (Alibaba).

frontier_llmmultimodalaudioevals

Emotion Concepts and their Function in a Large Language Model

Interpretability 2026-04-21 Transformer Circuits Thread (Anthropic)

Sofroniew et al.

6.9

I 6.0 Im 6.8 P 7.5

Sofroniew et al. identify representations of emotion concepts within Claude Sonnet 4.5 and demonstrate causal influence on outputs. Uses SAE features and patching experiments to isolate emotion-specific directions in residual stream.

interpretabilityfrontier_llm

ChatGPT's new Images 2.0 model — strong text generation in images

Generative Media 2026-04-21 TechCrunch — AISimon Willison's Weblog

6.8

I 7.5 Im 6.2 P 6.4

OpenAI shipped Images 2.0 inside ChatGPT. Simon Willison flagged that the model is "surprisingly good at generating text" inside images — long-standing failure mode for image generators. Community testing shows legible multi-line text and improved typographic consistency.

How it was discussed across sources

TechCrunch — AI: Covered by TechCrunch — AI.
Simon Willison's Weblog: Covered on Simon Willison's weblog — practitioner-oriented commentary.

Changes to GitHub Copilot Individual plans

Agents & Tool Use 2026-04-22 Simon Willison's Weblog

6.7

I 8.1 Im 7.1 P 4.6

Changes to GitHub Copilot Individual plans On the same day as Claude Code's temporary will-they-won't-they $100/month kerfuffle (for the moment, they won't), here's the latest on GitHub Copilot pricing. Unlike Anthropic, GitHub put up an official announcement about their changes, which include tightening usage limits, pausing signups for individual plans (!), restricting Claude Opus 4.7 to the more expensive $39/month "Pro+" plan, and dropping the previous Opus models entirely. The key paragraph: Agentic workflows have fundamentally changed Copilot’s compute demands. Long-running, parallelized sessions now regularly consume far more resources than the original plan structure was built to support. As Copilot’s agentic capabilities have expanded rapidly, agents are doing more work, and more customers are hitting usage limits designed to maintain service reliability. It's easy to forget that just six months ago heavy LLM users were burning an order of magnitude less tokens. Coding agents consume a lot of compute. Copilot was also unique (I believe) among agents in charging per-request, not per-token. (Correction: Windsurf also operated a credit system like this which they abandoned last month.) This means that single agentic requests which burn more tokens cut directly into their margins. The most recent pricing scheme addresses that with token-based usage limits on a per-session and weekly basis. My one problem with this announcement is that it doesn't clearly clarify which product called "GitHub Copilot" is affected by these changes. Last month in How many products does Microsoft have named 'Copilot'? I mapped every one Tey Bannerman identified 75 products that share the Copilot brand, 15 of which have "GitHub Copilot" in the title. Judging by the linked GitHub Copilot plans page this covers Copilot CLI, Copilot cloud agent and code review (features on GitHub.com itself), and the Copilot IDE features available in VS Code, Zed, JetBrains and more. Via Hacker News Tags: github, microsoft, ai, generative-ai, github-copilot, llms, llm-pricing, coding-agents

agentsfrontier_llmai_codinginterpretability

Where's the raccoon with the ham radio? (ChatGPT Images 2.0)

Generative Media 2026-04-21 Simon Willison's Weblog

6.4

I 8.1 Im 6.2 P 4.6

OpenAI released ChatGPT Images 2.0 today, their latest image generation model. On the livestream Sam Altman said that the leap from gpt-image-1 to gpt-image-2 was equivalent to jumping from GPT-3 to GPT-5. Here's how I put it to the test. My prompt: Do a where's Waldo style image but it's where is the raccoon holding a ham radio gpt-image-1 First as a baseline here's what I got from the older gpt-image-1 using ChatGPT directly: I wasn't able to spot the raccoon - I quickly realized that testing image generation models on Where's Waldo style images (Where's Wally in the UK) can be pretty frustrating! I tried getting Claude Opus 4.7 with its new higher resolution inputs to solve it but it was convinced there was a raccoon it couldn't find thanks to the instruction card at the top left of the image: Yes — there's at least one raccoon in the picture, but it's very well hidden. In my careful sweep through zoomed-in sections, honestly, I couldn't definitively spot a raccoon holding a ham radio. [...] Nano Banana 2 and Pro Next I tried Google's Nano Banana 2, via Gemini: That one was pretty obvious, the raccoon is in the "Amateur Radio Club" booth in the center of the image! Claude said: Honestly, this one wasn't really hiding — he's the star of the booth. Feels like the illustrator took pity on us after that last impossible scene. The little "W6HAM" callsign pun on the booth sign is a nice touch too. I also tried Nano Banana Pro in AI Studio and got this, by far the worst result from any model. Not sure what went wrong here! gpt-image-2 With the baseline established, let's try out the new model. I used an updated version of my openai_image.py script, which is a thin wrapper around the OpenAI Python client library. Their client library hasn't yet been updated to include gpt-image-2 but thankfully it doesn't validate the model ID so you can use it anyway. Here's how I ran that: OPENAI_API_KEY="$(llm keys get openai)" \ uv run https://tools.simonwillison.net/python/openai_image.py \ -m gpt-image-2 \ "Do a where's Waldo style image but it's where is the raccoon holding a ham radio" Here's what I got back. I don't think there's a raccoon in there - I couldn't spot one, and neither could Claude. The OpenAI image generation cookbook has been updated with notes on gpt-image-2, including the outputQuality setting and available sizes. I tried setting outputQuality to high and the dimensions to 3840x2160 - I believe that's the maximum - and got this - a 17MB PNG which I converted to a 5MB WEBP: OPENAI_API_KEY="$(llm keys get openai)" \ uv run 'https://raw.githubusercontent.com/simonw/tools/refs/heads/main/python/openai_image.py' \ -m gpt-image-2 "Do a where's Waldo style image but it's where is the raccoon holding a ham radio" \ --quality high --size 3840x2160 That's pretty great! There's a raccoon with a ham radio in there (bottom left, quite easy to spot). The image used 13,342 output tokens, which are charged at $30/million so a total cost of around 40 cents. Takeaways I think this new ChatGPT image generation model takes the crown from Gemini, at least for the moment. Where's Waldo style images are an infuriating and somewhat foolish way to test these models, but they do help illustrate how good they are getting at complex illustrations combining both text and details. Update: asking models to solve this is risky rizaco on Hacker News asked ChatGPT to draw a red circle around the raccoon in one of the images in which I had failed to find one. Here's an animated mix of their result and the original image: Looks like we definitely can't trust these models to usefully solve their own puzzles! Tags: ai, openai, generative-ai, chatgpt, llms, text-to-image, llm-release, nano-banana

frontier_llmgenerative_mediapost_training

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

Multimodal 2026-04-21 arXiv cs.RO (Robotics)arXiv cs.CV (Computer Vision)arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang +4

6.1

I 6.3 Im 4.4 P 7.3

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM-->VLM-->VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work and substituting in the Qwen3-VL backbone leads to a strong multi-task table top manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.

How it was discussed across sources

arXiv cross-listings: Listed under cs.RO (Robotics), cs.CV (Computer Vision), cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

multimodalroboticsfrontier_llmevals

ChatGPT’s new Images 2.0 model is surprisingly good at generating text

Generative Media 2026-04-21 TechCrunch — AI

Amanda Silberling

6.0

I 7.5 Im 6.2 P 4.0

ChatGPT Images 2.0, the newest image-generation model from OpenAI, shows just how much AI capabilities have evolved over the last few years.

AgentSPEX: An Agent Specification and Execution Language

Agents & Tool Use 2026-04-22 HF +44 HF Daily PapersUIUC ScaleML Lab

UIUC ScaleML Lab

5.9

I 4.0 Im 4.0 P 9.3

AgentSPEX is a declarative language for specifying and executing LLM agents; the paper frames the contribution as a composition abstraction that separates agent capabilities, policies, and execution from prompts. 44 upvotes on HF Daily Papers — strongest community signal of the day.

How it was discussed across sources

HF Daily Papers: Featured on HuggingFace Daily Papers — an early-day community popularity signal.
UIUC ScaleML Lab: Covered by UIUC ScaleML Lab.

agentsfrontier_llm

#10

Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)arXiv cs.AI (Artificial Intelligence)

Kyuhee Kim, Auguste Poiroux, Antoine Bosselut

5.8

I 8.0 Im 4.0 P 5.1

Formal verification guarantees proof validity but not formalization faithfulness. For natural-language logical reasoning, where models construct axiom systems from scratch without library constraints, this gap between valid proofs and faithful translations is especially acute. We investigate whether frontier models exploit this gap when generating Lean 4 proofs, a behavior we term formalization gaming. We evaluate GPT-5 and DeepSeek-R1 on 303 first-order logic problems (203 from FOLIO, 100 from Multi-LogiEval), comparing unified generation against a two-stage pipeline that separates formalization from proving. Despite compilation rates of 87-99%, we find no evidence of systematic gaming in unified generation: models prefer reporting failure over forcing proofs, even under prompting designed to encourage it. However, unfaithfulness that evades our detection signals may still occur. The two-stage pipeline reveals two distinct modes of unfaithfulness: GPT-5 fabricates axioms during proof generation, a reactive fallback detectable via cross-stage comparison, while DeepSeek-R1 mistranslates premises during formalization, producing internally consistent outputs that evade detection entirely. These findings show that high compilation rates or accuracies should not be equated with faithful reasoning. Code and data are available at https://github.com/koreankiwi99/formalization-gaming.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CL (Computation & Language), cs.AI (Artificial Intelligence) — spans multiple research subfields.

frontier_llmai_codingevalsrl

#11

Meta to start capturing employee mouse movements, keystrokes for AI training

Agents & Tool Use 2026-04-21 ReutersTechCrunch — AIHacker News

5.8

I 4.5 Im 4.7 P 7.8

Reuters reports Meta will begin recording employee keystrokes and mouse movements to use as training signal for its AI systems. Story hit HN front page and was picked up by TechCrunch; sparked debate over internal surveillance versus usefulness of the signal for agent training.

How it was discussed across sources

Reuters: Covered by Reuters.
TechCrunch — AI: Covered by TechCrunch — AI.
Hacker News: Discussed on Hacker News AI front page — developer-community attention.

agentsinfra

#12

SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

Frontier LLMs 2026-04-21 arXiv cs.RO (Robotics)arXiv cs.CL (Computation & Language)arXiv cs.AI (Artificial Intelligence)

Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks +4

5.8

I 4.8 Im 5.0 P 7.1

Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git

How it was discussed across sources

arXiv cross-listings: Listed under cs.RO (Robotics), cs.CL (Computation & Language), cs.AI (Artificial Intelligence) — spans multiple research subfields.

frontier_llmevalssafety_policyrobotics

#13

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Jonas Waldendorf, Bashar Awwad Shiekh Hasan, Evgenii Tsymbalov

5.7

I 4.8 Im 4.8 P 7.1

Hallucinations in Speech Large Language Models (SpeechLLMs) pose significant risks, yet existing detection methods typically rely on gold-standard outputs that are costly or impractical to obtain. Moreover, hallucination detection methods developed for text-based LLMs do not directly capture audio-specific signals. We investigate four attention-derived metrics: AUDIORATIO, AUDIOCONSISTENCY, AUDIOENTROPY, and TEXTENTROPY, designed to capture pathological attention patterns associated with hallucination, and train lightweight logistic regression classifiers on these features for efficient inference-time detection. Across automatic speech recognition and speech-to-text translation tasks, evaluations on Qwen-2-Audio and Voxtral-3B show that our approach outperforms uncertainty-based and prior attention-based baselines on in-domain data, achieving improvements of up to +0.23 PR-AUC, and generalises to out-of-domain ASR settings. We further find that strong performance can be achieved with approximately 100 attention heads, improving out-of-domain generalisation compared to using all heads. While effectiveness is model-dependent and task-specific training is required, our results demonstrate that attention patterns provide a valuable tool for hallucination detection in SpeechLLMs.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CL (Computation & Language), cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

frontier_llmaudioevalsefficiency

#14

Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4

Frontier LLMs 2026-04-21 arXiv — Mechanistic Interpretability

Alex Polyakov, Daniel Kuznetsov

5.7

I 7.5 Im 5.0 P 4.3

Safety alignment in large language models relies on behavioral training that can be overridden when sufficiently strong in-context patterns compete with learned refusal behaviors. We introduce Involuntary In-Context Learning (IICL), an attack class that uses abstract operator framing with few-shot examples to force pattern completion that overrides safety training. Through 3479 probes across 10 OpenAI models, we identify the attack's effective components through a seven-experiment ablation study. Key findings: (1)~semantic operator naming achieves 100\,\% bypass rate (50/50, $p < 0.001$); (2)~the attack requires abstract framing, since identical examples in direct question-and-answer format yield 0\,\%; (3)~example ordering matters strongly (interleaved: 76\,\%, harmful-first: 6\,\%); (4)~temperature has no meaningful effect (46--56\,\% across 0.0--1.0). On the HarmBench benchmark, IICL achieves 24.0\,\% bypass $[18.6\%, 30.4\%]$ against GPT-5.4 with detailed 619-word responses, compared to 0.0\,\% for direct queries.

frontier_llmsafety_policypost_trainingevals

#15

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Agents & Tool Use 2026-04-21 arXiv cs.CL (Computation & Language)arXiv cs.CV (Computer Vision)arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan +3

5.6

I 4.8 Im 4.9 P 6.7

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve-making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CL (Computation & Language), cs.CV (Computer Vision), cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

agentsfrontier_llmai_codingevals

#16

Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

Frontier LLMs 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Alankrit Chona, Igor Kozlov, Ambuj Kumar

5.6

I 8.0 Im 4.9 P 3.4

We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models - Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash - on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run across any model ever finds all flags. We define a passing score as >= 50% recall on every ATT&CK tactic - the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics and the remaining four on zero. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks.

frontier_llmevalsagentsgov_defense

#17

EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

Reinforcement Learning 2026-04-21 arXiv cs.CL (Computation & Language)arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Chengjun Pan, Shichun Liu, Jiahang Lin, Dingwei Zhu +9

5.4

I 4.8 Im 4.5 P 6.5

Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CL (Computation & Language), cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

rlresearchagentsfrontier_llm

#18

Enhancing Construction Worker Safety in Extreme Heat: A Machine Learning Approach Utilizing Wearable Technology for Predictive Health Analytics

Research 2026-04-21 arXiv cs.CL (Computation & Language)arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Syed Sajid Ullah, Amir Khan

5.4

I 4.8 Im 4.6 P 6.5

Construction workers are highly vulnerable to heat stress, yet tools that translate real-time physiological data into actionable safety intelligence remain scarce. This study addresses this gap by developing and evaluating deep learning models, specifically a baseline Long Short-Term Memory (LSTM) network and an attention-based LSTM, to predict heat stress among 19 workers in Saudi Arabia. Using Garmin Vivosmart 5 smartwatches to monitor metrics such as heart rate, HRV, and oxygen saturation, the attention-based model outperformed the baseline, achieving 95.40% testing accuracy and significantly reducing false positives and negatives. With precision, recall, and F1 scores of 0.982, this approach not only improves predictive performance but also offers interpretable results suitable for integration into IoT-enabled safety systems and BIM dashboards, advancing proactive, informatics-driven safety management in the construction industry.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CL (Computation & Language), cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

researchevalssafety_policy

#19

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

Robotics 2026-04-21 arXiv cs.RO (Robotics)arXiv cs.AI (Artificial Intelligence)

Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai +2

5.3

I 5.7 Im 5.0 P 4.8

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergies these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.

How it was discussed across sources

arXiv cross-listings: Listed under cs.RO (Robotics), cs.AI (Artificial Intelligence) — spans multiple research subfields.

roboticssafety_policyresearchfrontier_llm

#20

Learning Hybrid-Control Policies for High-Precision In-Contact Manipulation Under Uncertainty

Reinforcement Learning 2026-04-21 arXiv cs.RO (Robotics)arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Hunter L. Brown, Geoffrey Hollinger, Stefan Lee

5.2

I 4.8 Im 4.0 P 6.5

Reinforcement learning-based control policies have been frequently demonstrated to be more effective than analytical techniques for many manipulation tasks. Commonly, these methods learn neural control policies that predict end-effector pose changes directly from observed state information. For tasks like inserting delicate connectors which induce force constraints, pose-based policies have limited explicit control over force and rely on carefully tuned low-level controllers to avoid executing damaging actions. In this work, we present hybrid position-force control policies that learn to dynamically select when to use force or position control in each control dimension. To improve learning efficiency of these policies, we introduce Mode-Aware Training for Contact Handling (MATCH) which adjusts policy action probabilities to explicitly mirror the mode selection behavior in hybrid control. We validate MATCH's learned policy effectiveness using fragile peg-in-hole tasks under extreme localization uncertainty. We find MATCH substantially outperforms pose-control policies -- solving these tasks with up to 10% higher success rates and 5x fewer peg breaks than pose-only policies under common types of state estimation error. MATCH also demonstrates data efficiency equal to pose-control policies, despite learning in a larger and more complex action space. In over 1600 sim-to-real experiments, we find MATCH succeeds twice as often as pose policies in high noise settings (33% vs.~68%) and applies ~30% less force on average compared to variable impedance policies on a Franka FR3 in laboratory conditions.

How it was discussed across sources

arXiv cross-listings: Listed under cs.RO (Robotics), cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

rlroboticsresearchsafety_policy

#21

Emotion-Cause Pair Extraction in Conversations via Semantic Decoupling and Graph Alignment

Generative Media 2026-04-21 arXiv cs.CL (Computation & Language)

Tianxiang Ma, Weijie Feng, Xinyu Wang, Zhiyong Cheng

5.1

I 5.4 Im 5.8 P 3.7

Emotion-Cause Pair Extraction in Conversations (ECPEC) aims to identify the set of causal relations between emotion utterances and their triggering causes within a dialogue. Most existing approaches formulate ECPEC as an independent pairwise classification task, overlooking the distinct semantics of emotion diffusion and cause explanation, and failing to capture globally consistent many-to-many conversational causality. To address these limitations, we revisit ECPEC from a semantic perspective and seek to disentangle emotion-oriented semantics from cause-oriented semantics, mapping them into two complementary representation spaces to better capture their distinct conversational roles. Building on this semantic decoupling, we naturally formulate ECPEC as a global alignment problem between the emotion-side and cause-side representations, and employ optimal transport to enable many-to-many and globally consistent emotion-cause matching. Based on this perspective, we propose a unified framework SCALE that instantiates the above semantic decoupling and alignment principle within a shared conversational structure. Extensive experiments on several benchmark datasets demonstrate that SCALE consistently achieves state-of-the-art performance. Our codes are released at https://github.com/CoCoSphere/SCALE.

generative_mediaai_codingpost_trainingevals

#22

RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)arXiv cs.LG (Machine Learning)

Yusuf Çelebi, Yağız Asker, Özay Ezerceli, Mahmoud ElHussieni +3

5.1

I 4.8 Im 4.4 P 5.7

Fine-tuning Large Language Models (LLMs) remains structurally uncertain despite parameter-efficient methods such as Low-Rank Adaptation (LoRA), as the layer-specific roles of internal representations are poorly understood, leading to heuristic decisions about where adaptation should be applied. We model the evolution of hidden states as a high-dimensional geometric trajectory and propose using the Ramer-Douglas-Peucker (RDP) algorithm, a parameter-free and training-free polygon simplification method that preserves global structural transitions while eliminating locally redundant changes, to identify critical breakpoints along the representation path. Crucially, we use these geometric pivots not merely for analysis, but as a direct decision signal for determining which layers should be adapted during parameter-efficient fine-tuning. By integrating this geometry-aware layer selection strategy into LoRA fine-tuning of Qwen3-8B-Base, we achieve superior performance on MMLU-Math using only 13 RDP-selected layers (81.67%), significantly outperforming both full 36-layer adaptation (79.32%) and random 13-layer selection (75.56%), as well as the baseline Qwen3-8B-Base model (74.25%). These results demonstrate that leveraging the intrinsic geometry of representation trajectories provides a robust, interpretable, and training-free signal for optimizing layer selection during model adaptation.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CL (Computation & Language), cs.LG (Machine Learning) — spans multiple research subfields.

frontier_llmpost_trainingefficiencyinfra

#23

TEMPO: Scaling Test-time Training for Large Reasoning Models

Infrastructure 2026-04-22 HF +17 HF Daily Papers

5.1

I 4.0 Im 4.4 P 6.4

TEMPO scales test-time training for large reasoning models: during inference the model updates a small set of parameters per-query using gradient signals computed from its own chain-of-thought rollouts, trading compute for accuracy on reasoning benchmarks. 17 upvotes on HF Daily Papers.

infrafrontier_llmefficiencyevals

#24

Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

Frontier LLMs 2026-04-21 arXiv cs.CV (Computer Vision)

Jing Jin, Hao Liu, Yan Bai, Yihang Lou +7

5.1

I 7.5 Im 4.4 P 3.0

Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.

frontier_llmevalsmultimodalai_science

#25

When Graph Structure Becomes a Liability: A Critical Re-Evaluation of Graph Neural Networks for Bitcoin Fraud Detection under Temporal Distribution Shift

Research 2026-04-21 arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Saket Maganti

5.1

I 5.4 Im 4.4 P 5.1

The consensus that GCN, GraphSAGE, GAT, and EvolveGCN outperform feature-only baselines on the Elliptic Bitcoin Dataset is widely cited but has not been rigorously stress-tested under a leakage-free evaluation protocol. We perform a seed-matched inductive-versus-transductive comparison and find that this consensus does not hold. Under a strictly inductive protocol, Random Forest on raw features achieves F1 = 0.821 and outperforms all evaluated GNNs, while GraphSAGE reaches F1 = 0.689 +/- 0.017. A paired controlled experiment reveals a 39.5-point F1 gap attributable to training-time exposure to test-period adjacency. Additionally, edge-shuffle ablations show that randomly wired graphs outperform the real transaction graph, indicating that the dataset's topology can be misleading under temporal distribution shift. Hybrid models combining GNN embeddings with raw features provide only marginal gains and remain substantially below feature-only baselines. We release code, checkpoints, and a strict-inductive protocol to enable reproducible, leakage-free evaluation.

How it was discussed across sources

arXiv cross-listings: Listed under cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

researchevalsai_codinginterpretability

#26

Generalization at the Edge of Stability

Research 2026-04-21 arXiv cs.CV (Computer Vision)arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Mario Tuci, Caner Korkmaz, Umut Şimşekli, Tolga Birdal

5.0

I 4.0 Im 4.0 P 6.5

Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the `sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CV (Computer Vision), cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

researchinfra

#27

Lost in Translation: Do LVLM Judges Generalize Across Languages?

Multimodal 2026-04-21 arXiv cs.CL (Computation & Language)

Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Mir Tafseer Nayeem, Amran Bhuiyan +4

5.0

I 5.4 Im 5.5 P 3.7

Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.

multimodalevalspost_trainingrl

#28

Mind2Drive: Predicting Driver Intentions from EEG in Real-world On-Road Driving

AI Coding 2026-04-21 arXiv cs.RO (Robotics)arXiv cs.LG (Machine Learning)

Ghadah Alosaimi, Hanadi Alhamdan, Wenke E, Stamos Katsigiannis +2

5.0

I 4.8 Im 4.6 P 5.1

Predicting driver intention from neurophysiological signals offers a promising pathway for enhancing proactive safety in advanced driver assistance systems, yet remains challenging in real-world driving due to EEG signal non-stationarity and the complexity of cognitive-motor preparation. This study proposes and evaluates an EEG-based driver intention prediction framework using a synchronised multi-sensor platform integrated into a real electric vehicle. A real-world on-road dataset was collected across 32 driving sessions, and 12 deep learning architectures were evaluated under consistent experimental conditions. Among the evaluated architectures, TSCeption achieved the highest average accuracy (0.907) and Macro-F1 score (0.901). The proposed framework demonstrates strong temporal stability, maintaining robust decoding performance up to 1000 ms before manoeuvre execution with minimal degradation. Furthermore, additional analyses reveal that minimal EEG preprocessing outperforms artefact-handling pipelines, and prediction performance peaks within a 400-600 ms interval, corresponding to a critical neural preparatory phase preceding driving manoeuvres. Overall, these findings support the feasibility of early and stable EEG-based driver intention decoding under real-world on-road conditions. Code: https://github.com/galosaimi/Mind2Drive.

How it was discussed across sources

arXiv cross-listings: Listed under cs.RO (Robotics), cs.LG (Machine Learning) — spans multiple research subfields.

ai_codingroboticsresearchevals

#29

Scalable Memristive-Friendly Reservoir Computing for Time Series Classification

Infrastructure 2026-04-21 arXiv cs.LG (Machine Learning)arXiv cs.NE (Neural & Evolutionary Computing)

Coşku Can Horuz, Andrea Ceni, Claudio Gallicchio, Sebastian Otte

5.0

I 4.8 Im 4.8 P 5.1

Memristive devices present a promising foundation for next-generation information processing by combining memory and computation within a single physical substrate. This unique characteristic enables efficient, fast, and adaptive computing, particularly well suited for deep learning applications. Among recent developments, the memristive-friendly echo state network (MF-ESN) has emerged as a promising approach that combines memristive-inspired dynamics with the training simplicity of reservoir computing, where only the readout layer is learned. Building on this framework, we propose memristive-friendly parallelized reservoirs (MARS), a simplified yet more effective architecture that enables efficient scalable parallel computation and deeper model composition through novel subtractive skip connections. This design yields two key advantages: substantial training speedups of up to 21x over the inherently lightweight echo state network baseline and significantly improved predictive performance. Moreover, MARS demonstrates what is possible with parallel memristive-friendly reservoir computing: on several long sequence benchmarks our compact gradient-free models substantially outperform strong gradient-based sequence models such as LRU, S5, and Mamba, while reducing full training time from minutes or hours down seconds or even only a few hundred milliseconds. Our work positions parallel memristive-friendly computing as a promising route towards scalable neuromorphic learning systems that combine high predictive capability with radically improved computational efficiency, while providing a clear pathway to energy-efficient, low-latency implementations on emerging memristive and in-memory hardware.

How it was discussed across sources

arXiv cross-listings: Listed under cs.LG (Machine Learning), cs.NE (Neural & Evolutionary Computing) — spans multiple research subfields.

infrassmresearchevals

#30

Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment

Agents & Tool Use 2026-04-21 arXiv cs.CL (Computation & Language)arXiv cs.AI (Artificial Intelligence)

Bobo Li, Rui Wu, Zibo Ji, Meishan Zhang +4

5.0

I 4.4 Im 5.0 P 5.1

Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self-reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning. By integrating dialectical chain-of-thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CL (Computation & Language), cs.AI (Artificial Intelligence) — spans multiple research subfields.

agentssafety_policyfrontier_llmresearch

#31

Adaptive MSD-Splitting: Enhancing C4.5 and Random Forests for Skewed Continuous Attributes

Evaluations & Benchmarks 2026-04-21 arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Jake Lee

4.9

I 4.8 Im 4.4 P 5.1

The discretization of continuous numerical attributes remains a persistent computational bottleneck in the induction of decision trees, particularly as dataset dimensions scale. Building upon the recently proposed MSD-Splitting technique -- which bins continuous data using the empirical mean and standard deviation to dramatically improve the efficiency and accuracy of the C4.5 algorithm -- we introduce Adaptive MSD-Splitting (AMSD). While standard MSD-Splitting is highly effective for approximately symmetric distributions, its rigid adherence to fixed one-standard-deviation cutoffs can lead to catastrophic information loss in highly skewed data, a common artifact in real-world biomedical and financial datasets. AMSD addresses this by dynamically adjusting the standard deviation multiplier based on feature skewness, narrowing intervals in dense regions to preserve discriminative resolution. Furthermore, we integrate AMSD into ensemble methods, specifically presenting the Random Forest-AMSD (RF-AMSD) framework. Empirical evaluations on the Census Income, Heart Disease, Breast Cancer, and Forest Covertype datasets demonstrate that AMSD yields a 2-4% accuracy improvement over standard MSD-Splitting, while maintaining near-identical O(N) time complexity reductions compared to the O(N log N) exhaustive search. Our Random Forest extension achieves state-of-the-art accuracy at a fraction of standard computational costs, confirming the viability of adaptive statistical binning in large-scale ensemble learning architectures.

How it was discussed across sources

arXiv cross-listings: Listed under cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

evalsinterpretability

#32

An Efficient Black-Box Reduction from Online Learning to Multicalibration, and a New Route to $Φ$-Regret Minimization

Multimodal 2026-04-21 arXiv cs.LG (Machine Learning)

Gabriele Farina, Juan Carlos Perdomo

4.9

I 4.0 Im 6.6 P 3.7

We give a Gordon-Greenwald-Marks (GGM) style black-box reduction from online learning to online multicalibration. Concretely, we show that to achieve high-dimensional multicalibration with respect to a class of functions H, it suffices to combine any no-regret learner over H with an expected variational inequality (EVI) solver. We also prove a converse statement showing that efficient multicalibration implies efficient EVI solving, highlighting how EVIs in multicalibration mirror the role of fixed points in the GGM result for $Φ$-regret. This first set of results resolves the main open question in Garg, Jung, Reingold, and Roth (SODA '24), showing that oracle-efficient online multicalibration with $\sqrt{T}$-type guarantees is possible in full generality. Furthermore, our GGM-style reduction unifies the analyses of existing online multicalibration algorithms, enables new algorithms for challenging environments with delayed observations or censored outcomes, and yields the first efficient black-box reduction between online learning and multiclass omniprediction. Our second main result is a fine-grained reduction from high-dimensional online multicalibration to (contextual) $Φ$-regret minimization. Together with our first result, this establishes a new route from external regret to Phi-regret that bypasses sophisticated fixed-point or semi-separation machinery, dramatically simplifies a result of Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC '25) while improving rates, and yields new algorithms that are robust to richer deviation classes, such as those belonging to any reproducing kernel Hilbert space.

multimodalefficiency

#33

Are Large Language Models Economically Viable for Industry Deployment?

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)

Abdullah Mohammad, Sushant Kumar Ray, Pushkar Arora, Rafiq Ali +4

4.9

I 4.5 Im 5.4 P 4.3

Generative AI-powered by Large Language Models (LLMs)-is increasingly deployed in industry across healthcare decision support, financial analytics, enterprise retrieval, and conversational automation, where reliability, efficiency, and cost control are critical. In such settings, models must satisfy strict constraints on energy, latency, and hardware utilization-not accuracy alone. Yet prevailing evaluation pipelines remain accuracy-centric, creating a Deployment-Evaluation Gap-the absence of operational and economic criteria in model assessment. To address this gap, we present EDGE-EVAL-a industry-oriented benchmarking framework that evaluates LLMs across their full lifecycle on legacy NVIDIA Tesla T4 GPUs. Benchmarking LLaMA and Qwen variants across three industrial tasks, we introduce five deployment metrics-Economic Break-Even (Nbreak), Intelligence-Per-Watt (IPW ), System Density (\r{ho}sys), Cold-Start Tax (Ctax), and Quantization Fidelity (Qret)-capturing profitability, energy efficiency, hardware scaling, serverless feasibility, and compression safety. Our results reveal a clear efficiency frontier-models in the <2B parameter class dominate larger baselines across economic and ecological dimensions. LLaMA-3.2-1B (INT4) achieves ROI break-even in 14 requests (median), delivers 3x higher energy-normalized intelligence than 7B models, and exceeds 6,900 tokens/s/GB under 4-bit quantization. We further uncover an efficiency anomaly-while QLoRA reduces memory footprint, it increases adaptation energy by up to 7x for small models-challenging prevailing assumptions about quantization-aware training in edge deployment.

frontier_llmevalsinfraefficiency

#34

LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)arXiv cs.AI (Artificial Intelligence)

Fanyu Wang, Xiaoxi Kang, Paul Burgess, Aashish Srivastava +5

4.9

I 4.0 Im 5.2 P 5.1

More than half of the global population struggles to meet their civil justice needs due to limited legal resources. While Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, significant challenges remain even at the foundational step of legal issue identification. To investigate LLMs' capabilities in this task, we constructed a dataset from 769 real-world Malaysian Contract Act court cases, using GPT-4o to extract facts and generate candidate legal issues, annotated by senior legal experts, which reveals a critical limitation: while LLMs generate diverse issue candidates, their precision remains inadequate (GPT-4o achieves only 62%). To address this gap, we propose LePREC (Legal Professional-inspired Reasoning Elicitation and Classification), a neuro-symbolic framework combining neural generation with structured statistical reasoning. LePREC consists of: (1) a neuro component leverages LLMs to transform legal descriptions into question-answer pairs representing diverse analytical factors, and (2) a symbolic component applies sparse linear models over these discrete features, learning explicit algebraic weights that identify the most informative reasoning factors. Unlike end-to-end neural approaches, LePREC achieves interpretability through transparent feature weighting while maintaining data efficiency through correlation-based statistical classification. Experiments show a 30-40% improvement over advanced LLM baselines, including GPT-4o and Claude, confirming that correlation-based factor-issue analysis offers a more data-efficient solution for relevance decisions.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CL (Computation & Language), cs.AI (Artificial Intelligence) — spans multiple research subfields.

frontier_llminterpretabilityresearchefficiency

#35

Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)arXiv cs.AI (Artificial Intelligence)

Kihyuk Lee

4.8

I 4.0 Im 5.0 P 5.1

This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CL (Computation & Language), cs.AI (Artificial Intelligence) — spans multiple research subfields.

frontier_llmevalsai_codingsafety_policy

#36

Deezer says 44% of songs uploaded daily are AI-generated

Generative Media 2026-04-20 Hacker NewsTechCrunch — AI

4.8

I 4.0 Im 4.0 P 6.0

Deezer reports that 44% of songs uploaded to its platform daily are AI-generated — a sharp increase from prior disclosures. Reignites discussion about royalty/attribution schemes and detection tooling.

How it was discussed across sources

Hacker News: Discussed on Hacker News AI front page — developer-community attention.
TechCrunch — AI: Covered by TechCrunch — AI.

#37

From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning

Evaluations & Benchmarks 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Beining Wu, Fuyou Mao, Jiong Lin, Cheng Yang +6

4.8

I 5.8 Im 4.9 P 3.4

Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi-agent framework in which coordinated planning, editing, and fidelity-aware evaluation serve as the execution layer, while validated editing patterns are progressively distilled into reusable, engine-specific optimization skills. To enable controlled assessment, we introduce a Twin Branch Evaluation Protocol for causal attribution of content edits and DSV-CF, a dual-axis metric that unifies semantic visibility with attribution accuracy. We further release MSME-GEO-Bench, a multi-scenario, multi-engine benchmark grounded in real-world queries. Experiments on three mainstream engines show that MAGEO substantially outperforms heuristic baselines in both visibility and citation fidelity, with ablations confirming that engine-specific preference modeling and strategy reuse are central to these gains, suggesting a scalable learning-driven paradigm for trustworthy GEO. Code is available at https://github.com/Wu-beining/MAGEO

evalsresearchagentsai_coding

#38

Is Claude Code going to cost $100/month? Probably not - it's all very confusing

Frontier LLMs 2026-04-22 Simon Willison's Weblog

4.8

I 4.6 Im 4.8 P 4.6

Anthropic today quietly (as in silently, no announcement anywhere at all) updated their claude.com/pricing page (but not their Choosing a Claude plan page, which shows up first for me on Google) to add this tiny but significant detail (arrow is mine, and it's already reverted): The Internet Archive copy from yesterday shows a checkbox there. Claude Code used to be a feature of the $20/month Pro plan, but according to the new pricing page it is now exclusive to the $100/month or $200/month Max plans. Update: don't miss the update to this post, they've already changed course a few hours after this change went live. So what the heck is going on? Unsurprisingly, Reddit and Hacker News and Twitter all caught fire. I didn't believe the screenshots myself when I first saw them - aside from the pricing grid I could find no announcement from Anthropic anywhere. Then Amol Avasare, Anthropic's Head of Growth, tweeted: For clarity, we're running a small test on ~2% of new prosumer signups. Existing Pro and Max subscribers aren't affected. And that appears to be the closest we have had to official messaging from Anthropic. I don't buy the "~2% of new prosumer signups" thing, since everyone I've talked to is seeing the new pricing grid and the Internet Archive has already snapped a copy. Maybe he means that they'll only be running this version of the pricing grid for a limited time which somehow adds up to "2%" of signups? I'm also amused to see Claude Cowork remain available on the $20/month plan, because Claude Cowork is effectively a rebranded version of Claude Code wearing a less threatening hat! There are a whole bunch of things that are bad about this. If we assume this is indeed a test, and that test comes up negative and they decide not to go ahead with it, the damage has still been extensive: A whole lot of people got scared or angry or both that a service they relied on was about to be rug-pulled. There really is a significant difference between $20/month and $100/month for most people, especially outside of higher salary countries. The uncertainty is really bad! A tweet from an employee is not the way to make an announcement like this. I wasted a solid hour of my afternoon trying to figure out what had happened here. My trust in Anthropic's transparency around pricing - a crucial factor in how I understand their products - has been shaken. Strategically, should I be taking a bet on Claude Code if I know that they might 5x the minimum price of the product? More of a personal issue, but one I care deeply about myself: I invest a great deal of effort (that's 105 posts and counting) in teaching people how to use Claude Code. I don't want to invest that effort in a product that most people cannot afford to use. Last month I ran a tutorial for journalists on "Coding agents for data analysis" at the annual NICAR data journalism conference. I'm not going to be teaching that audience a course that depends on a $100/month subscription! This also doesn't make sense to me as a strategy for Anthropic. Claude Code defined the category of coding agents. It's responsible for billions of dollars in annual revenue for Anthropic already. It has a stellar reputation, but I'm not convinced that reputation is strong enough for it to lose the $20/month trial and jump people directly to a $100/month subscription. OpenAI have been investing heavily in catching up to Claude Code with their Codex products. Anthropic just handed them this marketing opportunity on a plate - here's Codex engineering lead Thibault Sottiaux: I don't know what they are doing over there, but Codex will continue to be available both in the FREE and PLUS ($20) plans. We have the compute and efficient models to support it. For important changes, we will engage with the community well ahead of making them. Transparency and trust are two principles we will not break, even if it means momentarily earning less. A reminder that you vote with your subscription for the values you want to see in this world. I should note that I pay $200/month for Claude Max and I consider it well worth the money. I've had periods of free access in the past courtesy of Anthropic but I'm currently paying full price, and happy to do so. But I care about the accessibility of the tools that I work with and teach. If Codex has a free tier while Claude Code starts at $100/month I should obviously switch to Codex, because that way I can use the same tool as the people I want to teach how to use coding agents. Here's what I think happened. I think Anthropic are trying to optimize revenue growth - obviously - and someone pitched making Claude Code only available for Max and higher. That's clearly a bad idea, but "testing" culture says that it's worth putting even bad ideas out to test just in case they surprise you. So they started a test, without taking into account the wailing and gnashing of teeth that would result when their test was noticed - or accounting for the longer-term brand damage that would be caused. Or maybe they did account for that, and decided it was worth the risk. I don't think that calculation was worthwhile. They're going to have to make a very firm commitment along the lines of "we heard your feedback and we commit to keeping Claude Code available on our $20/month plan going forward" to regain my trust. As it stands, Codex is looking like a much safer bet for me to invest my time in learning and building educational materials around. Update: they've reversed it already In the time I was typing this blog entry Anthropic appear to have reversed course - the claude.com/pricing page now has a checkbox back in the Pro column for Claude Code. I can't find any official communication about it though. Let's see if they can come up with an explanation/apology that's convincing enough to offset the trust bonfire from this afternoon! Update 2: it may still affect 2% of signups? Amol on Twitter:was a mistake that the logged-out landing page and docs were updated for this test [embedded self-tweet] Getting lots of questions on why the landing page / docs were updated if only 2% of new signups were affected. This was understandably confusing for the 98% of folks not part of the experiment, and we've reverted both the landing page and docs changes. So the experiment is still running, just not visible to the rest of the world? Tags: ai, generative-ai, llms, anthropic, llm-pricing, ai-ethics, coding-agents, claude-code, codex-cli

frontier_llmai_codingagentsinterpretability

#39

Quoting Bobby Holley

Frontier LLMs 2026-04-22 Simon Willison's Weblog

4.8

I 4.6 Im 4.9 P 4.6

As part of our continued collaboration with Anthropic, we had the opportunity to apply an early version of Claude Mythos Preview to Firefox. This week’s release of Firefox 150 includes fixes for 271 vulnerabilities identified during this initial evaluation. [...] Our experience is a hopeful one for teams who shake off the vertigo and get to work. You may need to reprioritize everything else to bring relentless and single-minded focus to the task, but there is light at the end of the tunnel. We are extremely proud of how our team rose to meet this challenge, and others will too. Our work isn’t finished, but we’ve turned the corner and can glimpse a future much better than just keeping up. Defenders finally have a chance to win, decisively. — Bobby Holley, CTO, Firefox Tags: anthropic, claude, ai, firefox, llms, mozilla, security, generative-ai, ai-security-research

frontier_llmevals

#40

SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference under Hard Uplink Budgets

Efficiency 2026-04-21 arXiv cs.CV (Computer Vision)arXiv cs.LG (Machine Learning)

Inhyeok Choi, Hyuncheol Park

4.8

I 4.8 Im 4.0 P 5.1

Edge-cloud hybrid inference offloads difficult inputs to a powerful remote model, but the uplink channel imposes hard per-request constraints on the number of bits that can be transmitted. We show that selecting transmitted content based solely on attention-based importance, the standard approach in collaborative inference, is inherently limited under hard budgets. Two findings support this claim. First, replacing high-importance units with low-importance but complementary ones improves server accuracy. This shows that what matters is not individual importance but how well the transmitted set covers diverse aspects of the input. Second, spatially uniform selection without any content information achieves competitive accuracy at moderate budgets. This confirms that spatial coverage alone carries independent value. Based on this analysis, we propose SAGE (Semantic Attention-Guided Evidence), a principled, training-free method that combines importance filtering with embedding-diversity sampling. SAGE achieves 93% of the server ceiling in offloaded accuracy while transmitting fewer than half of the available evidence units on ImageNet-1K, substantially outperforming importance-only composition.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CV (Computer Vision), cs.LG (Machine Learning) — spans multiple research subfields.

efficiencyinfra

#41

FASTER: Value-Guided Sampling for Fast RL

Robotics 2026-04-21 arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Perry Dong, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn

4.7

I 4.5 Im 4.0 P 5.1

Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster .

How it was discussed across sources

arXiv cross-listings: Listed under cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

roboticsrlinframultimodal

#42

HP-Edit: A Human-Preference Post-Training Framework for Image Editing

Post-Training 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Fan Li, Chonghuinan Wang, Lina Lei, Yuping Qiu +8

4.7

I 4.5 Im 5.3 P 4.0

Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset across eight common tasks and balancing common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer--an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.

post_trainingfrontier_llmmultimodalgenerative_media

#43

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

Safety, Policy & Regulation 2026-04-21 arXiv cs.CL (Computation & Language)

Euntae Kim, Soomin Han, Buru Chang

4.7

I 4.0 Im 6.0 P 3.7

Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models-filling incomplete drafts with dangerous content-to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains-including Explosives, Drugs, Weapons, and Cyberattacks-and features prompts with realistic structure and domain-specific cues to assess the model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and our alignment method significantly reduces harmful outputs without degrading performance on co-authoring capabilities. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench

safety_policyfrontier_llmpost_trainingevals

#44

M$^{2}$GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit

Reinforcement Learning 2026-04-21 arXiv cs.RO (Robotics)arXiv cs.AI (Artificial Intelligence)

Yukai Feng, Zhiheng Wu, Zhengxing Wu, Junwen Gu +1

4.7

I 4.8 Im 4.0 P 4.8

Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M$^{2}$GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm. Specifically, the Mamba-based policy leverages observation history to capture long-horizon temporal dependencies and exploits attention-based relational features to encode inter-agent interactions, producing bounded continuous actions through normalized Gaussian sampling. To further improve credit assignment without sacrificing stability, the group-relative advantages are obtained by normalizing rewards across agents within each episode and optimized through a multi-agent extension of GRPO, significantly reducing the demand for training resources while enabling stable and scalable policy updates. Extensive simulations and real-world pool experiments across team scales and evader strategies demonstrate that M$^{2}$GRPO consistently outperforms MAPPO and recurrent baselines in both pursuit success rate and capture efficiency. Overall, the proposed framework provides a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems.

How it was discussed across sources

arXiv cross-listings: Listed under cs.RO (Robotics), cs.AI (Artificial Intelligence) — spans multiple research subfields.

rlroboticsresearchagents

#45

Revisiting RaBitQ and TurboQuant: A Symmetric Comparison of Methods, Theory, and Experiments

Research 2026-04-21 arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Jianyang Gao, Yutong Gou, Yuexuan Xu, Jifan Shi +4

4.7

I 4.6 Im 4.0 P 5.1

This technical note revisits the relationship between RaBitQ and TurboQuant under a unified comparison framework. We compare the two methods in terms of methodology, theoretical guarantees, and empirical performance, using a reproducible, transparent, and symmetric setup. Our results show that, despite the claimed advantage of TurboQuant, TurboQuant does not provide a consistent improvement over RaBitQ in directly comparable settings; in many tested configurations, it performs worse than RaBitQ. We further find that several reported runtime and recall results in the TurboQuant paper could not be reproduced from the released implementation under the stated configuration. Overall, this note clarifies the shared structure and genuine differences between the two lines of work, while documenting reproducibility issues in the experimental results reported by the TurboQuant paper.

How it was discussed across sources

arXiv cross-listings: Listed under cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

research

#46

ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning

Evaluations & Benchmarks 2026-04-21 arXiv cs.CL (Computation & Language)

Xianming Li, Zongxi Li, Tsz-fung Andrew Lee, Jing Li +2

4.7

I 5.3 Im 4.8 P 3.7

Parameter-efficient fine-tuning (PEFT) reduces the training cost of full-parameter fine-tuning for large language models (LLMs) by training only a small set of task-specific parameters while freezing the pretrained backbone. However, existing approaches, such as Low-Rank Adaptation (LoRA), achieve adaptation by inserting independent low-rank perturbations directly to individual weights, resulting in a local parameterization of adaptation. We propose ShadowPEFT, a centralized PEFT framework that instead performs layer-level refinement through a depth-shared shadow module. At each transformer layer, ShadowPEFT maintains a parallel shadow state and evolves it repeatedly for progressively richer hidden states. This design shifts adaptation from distributed weight-space perturbations to a shared layer-space refinement process. Since the shadow module is decoupled from the backbone, it can be reused across depth, independently pretrained, and optionally deployed in a detached mode, benefiting edge computing scenarios. Experiments on generation and understanding benchmarks show that ShadowPEFT matches or outperforms LoRA and DoRA under comparable trainable-parameter budgets. Additional analyses on shadow pretraining, cross-dataset transfer, parameter scaling, inference latency, and system-level evaluation suggest that centralized layer-space adaptation is a competitive and flexible alternative to conventional low-rank PEFT.

evalsfrontier_llmefficiencyresearch

#47

Towards a Linguistic Evaluation of Narratives: A Quantitative Stylistic Framework

Evaluations & Benchmarks 2026-04-21 arXiv cs.CL (Computation & Language)

Alessandro Maisto

4.7

I 4.8 Im 5.2 P 3.7

The evaluation of narrative quality remains a complex challenge, as it involves subjective factors such as plot, character development, and emotional impact. This work proposes a quantitative approach to narrative assessment by focusing on the linguistic dimension as a primary indicator of quality. The paper presents a methodology for the automatic evaluation of narrative based on the extraction of a comprehensive set of 33 quantitative linguistic features categorized into lexical, syntactic, and semantic groups. To test the model, an experiment was conducted on a specialized corpus of 23 books, including canonical masterpieces and self-published works. Through a similarity matrix, the system successfully clustered the narratives, distinguishing almost perfectly between professionally edited and self-published texts. Furthermore, the methodology was validated against a human-annotated dataset; it significantly outperforms traditional story-level evaluation metrics, demonstrating the effectiveness of quantitative linguistic features in assessing narrative quality.

evalsinterpretability

#48

A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

Evaluations & Benchmarks 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen +5

4.6

I 4.8 Im 5.2 P 3.4

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowl- edge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditionedon this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multi- modal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.

evalsfrontier_llmmultimodalinterpretability

#49

Concept Inconsistency in Dermoscopic Concept Bottleneck Models: A Rough-Set Analysis of the Derm7pt Dataset

Evaluations & Benchmarks 2026-04-21 arXiv cs.LG (Machine Learning)

Gonzalo Nápoles, Isel Grau, Yamisleydi Salgueiro

4.6

I 4.0 Im 5.6 P 3.7

Concept Bottleneck Models (CBMs) route predictions exclusively through a clinically grounded concept layer, binding interpretability to concept-label consistency. When a dataset contains concept-level inconsistencies, identical concept profiles mapped to conflicting diagnosis labels create an unresolvable bottleneck that imposes a hard ceiling on achievable accuracy. In this paper, we apply rough set theory to the Derm7pt dermoscopy benchmark and characterize the full extent and clinical structure of this inconsistency. Among 305 unique concept profiles formed by the 7 dermoscopic criteria of the 7-point melanoma checklist, 50 (16.4%) are inconsistent, spanning 306 images (30.3% of the dataset). This yields a theoretical accuracy ceiling of 92.1%, independent of backbone architecture or training strategy for CBMs that exclusively operate with hard concepts. In addition, we characterize the conflict-severity distribution, identify the clinical features most responsible for boundary ambiguity, and evaluate two filtering strategies with quantified effects on dataset composition and CBM interpretability. Symmetric removal of all boundary-region images yields Derm7pt+, a fully consistent benchmark subset of 705 images with perfect quality of classification and no hard accuracy ceiling. Building on this filtered dataset, we present a hard CBM evaluated across 19 backbone architectures from the EfficientNet, DenseNet, ResNet, and Wide ResNet families. Under symmetric filtering, explored for completeness, EfficientNet-B5 achieves the best label F1 score (0.85) and label accuracy (0.90) on the held-out test set, with a concept accuracy of 0.70. Under asymmetric filtering, EfficientNet-B7 leads across all four metrics, reaching a label F1 score of 0.82 and concept accuracy of 0.70. These results establish reproducible baselines for concept-consistent CBM evaluation on dermoscopic data.

evalsinterpretabilityresearchefficiency

#50

CrabTrap: An LLM-as-a-judge HTTP proxy to secure agents in production

Agents & Tool Use 2026-04-21 Brex EngineeringHacker News

4.6

I 4.0 Im 4.0 P 5.6

Brex published "CrabTrap," an open pattern/reference for running an LLM-as-a-judge HTTP proxy in front of agent stacks — screens tool calls and network egress against a policy before they execute. Authors frame it as containment for agent blast radius.

How it was discussed across sources

Brex Engineering: Covered by Brex Engineering.
Hacker News: Discussed on Hacker News AI front page — developer-community attention.

agentssafety_policyfrontier_llmrl

#51

ECLASS-Augmented Semantic Product Search for Electronic Components

Agents & Tool Use 2026-04-21 arXiv — Agents / Tool Use

Nico Baumgart, Markus Lange-Hegermann, Jan Henze

4.6

I 5.3 Im 4.8 P 3.4

Efficient semantic access to industrial product data is a key enabler for factory automation and emerging LLM-based agent workflows, where both human engineers and autonomous agents must identify suitable components from highly structured catalogs. However, the vocabulary mismatch between natural-language queries and attribute-centric product descriptions limits the effectiveness of traditional retrieval approaches, e.g., BM25. In this work, we present a systematic evaluation of LLM-assisted dense retrieval for semantic product search on industrial electronic components, and investigate the integration of hierarchical semantics from the ECLASS standard into embedding-based retrieval. Our results show that dense retrieval combined with re-ranking substantially outperforms classical lexical methods and foundation model web-search baselines. In particular, the proposed approach achieves a Hit_Rate@5 of 94.3 %, compared to 31.4 % for BM25 on expert queries, while also exceeding foundation model baselines in both effectiveness and efficiency. Furthermore, augmenting product representations with ECLASS semantics yields consistent performance gains across configurations, demonstrating that standardized hierarchical metadata provides a crucial semantic bridge between user intent and sparse product descriptions.

agentsfrontier_llmevalsefficiency

#52

EgoSelf: From Memory to Personalized Egocentric Assistant

AI Coding 2026-04-21 arXiv cs.CV (Computer Vision)arXiv cs.AI (Artificial Intelligence)

Yanshuo Wang, Yuan Xu, Xuesong Li, Jie Hong +3

4.6

I 4.0 Im 4.5 P 4.8

Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph-based interaction memory constructed from past observations and a dedicated learning task for personalization. The memory captures temporal and semantic relationships among interaction events and entities, from which user-specific profiles are derived. The personalized learning task is formulated as a prediction problem where the model predicts possible future interactions from individual user's historical behavior recorded in the graph. Extensive experiments demonstrate the effectiveness of EgoSelf as a personalized egocentric assistant. Code is available at \href{https://abie-e.github.io/egoself_project/}{https://abie-e.github.io/egoself\_project/}.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CV (Computer Vision), cs.AI (Artificial Intelligence) — spans multiple research subfields.

ai_codingpost_training

#53

Evaluation-driven Scaling for Scientific Discovery

Frontier LLMs 2026-04-21 arXiv cs.LG (Machine Learning)

Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo +21

4.6

I 5.3 Im 4.4 P 3.7

Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery, a problem this paper seeks to address. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery loops along the right dimensions. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. Particularly, we sped up the widely used LASSO algorithm by over 2x, designed quantum circuit routing policies that reduce gate overhead by 24.5%, and discovered new Erdos minimum overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective evaluation-driven loop scaling as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.

frontier_llmevalsresearchinterpretability

#54

Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)arXiv cs.AI (Artificial Intelligence)

Wenqing Wu, Chengzhi Zhang, Yi Zhao, Tong Bao

4.6

I 4.0 Im 4.4 P 5.1

With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is improving the quality of academic manuscripts, such as clarity, originality and other evaluation aspects. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a maximum likelihood estimation method, previously established, to identify review reports that potentially have modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendation for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly reviewers with lower confidence score. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CL (Computation & Language), cs.AI (Artificial Intelligence) — spans multiple research subfields.

frontier_llmevalsinterpretability

#55

Multi-Cycle Spatio-Temporal Adaptation in Human-Robot Teaming

Robotics 2026-04-21 arXiv cs.RO (Robotics)arXiv cs.AI (Artificial Intelligence)

Alex Cuellar, Michael Hagenow, Julie Shah

4.6

I 4.0 Im 4.5 P 4.8

Effective human-robot teaming is crucial for the practical deployment of robots in human workspaces. However, optimizing joint human-robot plans remains a challenge due to the difficulty of modeling individualized human capabilities and preferences. While prior research has leveraged the multi-cycle structure of domains like manufacturing to learn an individual's tendencies and adapt plans over repeated interactions, these techniques typically consider task-level and motion-level adaptation in isolation. Task-level methods optimize allocation and scheduling but often ignore spatial interference in close-proximity scenarios; conversely, motion-level methods focus on collision avoidance while ignoring the broader task context. This paper introduces RAPIDDS, a framework that unifies these approaches by modeling an individual's spatial behavior (motion paths) and temporal behavior (time required to complete tasks) over multiple cycles. RAPIDDS then jointly adapts task schedules and steers diffusion models of robot motions to maximize efficiency and minimize proximity accounting for these individualized models. We demonstrate the importance of this dual adaptation through an ablation study in simulation and a physical robot scenario using a 7-DOF robot arm. Finally, we present a user study (n=32) showing significant plan improvement compared to non-adaptive systems across both objective metrics, such as efficiency and proximity, and subjective measures, including fluency and user preference. See this paper's companion video at: https://youtu.be/55Q3lq1fINs.

How it was discussed across sources

arXiv cross-listings: Listed under cs.RO (Robotics), cs.AI (Artificial Intelligence) — spans multiple research subfields.

roboticsgenerative_mediapost_trainingindustry

#56

SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)

Bo-Jyun Wang, Ying-Jia Lin, Hung-Yu Kao

4.6

I 4.8 Im 4.8 P 3.7

Small language models (SLMs), such as BART, can achieve summarization performance comparable to large language models (LLMs) via distillation. However, existing LLM-based ranking strategies for summary candidates suffer from instability, while classical metrics (e.g., ROUGE) are insufficient to rank high-quality summaries. To address these issues, we introduce \textbf{SCURank}, a framework that enhances summarization by leveraging \textbf{Summary Content Units (SCUs)}. Instead of relying on unstable comparisons or surface-level overlap, SCURank evaluates summaries based on the richness and semantic importance of information content. We investigate the effectiveness of SCURank in distilling summaries from multiple diverse LLMs. Experimental results demonstrate that SCURank outperforms traditional metrics and LLM-based ranking methods across evaluation measures and datasets. Furthermore, our findings show that incorporating diverse LLM summaries enhances model abstractiveness and overall distilled model performance, validating the benefits of information-centric ranking in multi-LLM distillation. The code for SCURank is available at https://github.com/IKMLab/SCURank.

frontier_llmevalsai_codingefficiency

#57

Understanding the Mechanism of Altruism in Large Language Models

Interpretability 2026-04-21 arXiv — Mechanistic Interpretability

Shuhuai Zhang, Shu Wang, Zijun Yao, Chuanhao Li +3

4.6

I 4.0 Im 5.7 P 3.7

Altruism is fundamental to human societies, fostering cooperation and social cohesion. Recent studies suggest that large language models (LLMs) can display human-like prosocial behavior, but the internal computations that produce such behavior remain poorly understood. We investigate the mechanisms underlying LLM altruism using sparse autoencoders (SAEs). In a standard Dictator Game, minimal-pair prompts that differ only in social stance (generous versus selfish) induce large, economically meaningful shifts in allocations. Leveraging this contrast, we identify a set of SAE features (0.024% of all features across the model's layers) whose activations are strongly associated with the behavioral shift. To interpret these features, we use benchmark tasks motivated by dual-process theories to classify a subset as primarily heuristic (System 1) or primarily deliberative (System 2). Causal interventions validate their functional role: activation patching and continuous steering of this feature direction reliably shift allocation distributions, with System 2 features exerting a more proximal influence on the model's final output than System 1 features. The same steering direction generalizes across multiple social-preference games. Together, these results enhance our understanding of artificial cognition by translating altruistic behaviors into identifiable network states and provide a framework for aligning LLM behavior with human values, thereby informing more transparent and value-aligned deployment.

interpretabilityfrontier_llmresearchai_coding

#58

Agent-World: Scaling Real-World Environment Synthesis

Agents & Tool Use 2026-04-21 alphaXiv trending

4.5

I 4.4 Im 4.4 P 4.4

Agent-World is a framework that lets agents autonomously discover tools and environments; reports improvements across 23 agent benchmarks via self-evolving training.

agentsevalsinfra

#59

Benign Overfitting in Adversarial Training for Vision Transformers

Research 2026-04-21 arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Jiaming Zhang, Meng Ding, Shaopeng Fu, Jingfeng Zhang +1

4.5

I 4.0 Im 4.0 P 5.1

Despite the remarkable success of Vision Transformers (ViTs) across a wide range of vision tasks, recent studies have revealed that they remain vulnerable to adversarial examples, much like Convolutional Neural Networks (CNNs). A common empirical defense strategy is adversarial training, yet the theoretical underpinnings of its robustness in ViTs remain largely unexplored. In this work, we present the first theoretical analysis of adversarial training under simplified ViT architectures. We show that, when trained under a signal-to-noise ratio that satisfies a certain condition and within a moderate perturbation budget, adversarial training enables ViTs to achieve nearly zero robust training loss and robust generalization error under certain regimes. Remarkably, this leads to strong generalization even in the presence of overfitting, a phenomenon known as \emph{benign overfitting}, previously only observed in CNNs (with adversarial training). Experiments on both synthetic and real-world datasets further validate our theoretical findings.

How it was discussed across sources

arXiv cross-listings: Listed under cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

researchsafety_policyinfragov_defense

#60

Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews

Evaluations & Benchmarks 2026-04-21 arXiv cs.CL (Computation & Language)

Bowen Li, Haochen Ma, Yuxin Wang, Jie Yang +4

4.5

I 4.0 Im 5.5 P 3.7

The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification--its arguments, questions, and critique--rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of paper with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics--particularly the recall of weakness arguments--correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.

evalsfrontier_llmpost_trainingsafety_policy

#61

CAST: Modeling Semantic-Level Transitions for Complementary-Aware Sequential Recommendation

Frontier LLMs 2026-04-21 arXiv cs.LG (Machine Learning)

Qian Zhang, Lech Szymanski, Haibo Zhang, Jeremiah D. Deng

4.5

I 5.4 Im 4.0 P 3.7

Sequential Recommendation (SR) aims to predict the next interaction of a user based on their behavior sequence, where complementary relations often provide essential signals for predicting the next item. However, mainstream models relying on sparse co-purchase statistics often mistake spurious correlations (e.g., due to popularity bias) for true complementary relations. Identifying true complementary relations requires capturing the fine-grained item semantics (e.g., specifications) that simple cooccurrence statistics would be unable to model. While recent semantics-based methods utilize discrete semantic codes to represent items, they typically aggregate semantic codes into coarse item representations. This aggregation process blurs specific semantic details required to identify complementarity. To address these critical limitations and effectively leverage semantics for capturing reliable complementary relations, we propose a Complementary-Aware Semantic Transition (CAST) framework that introduces a new modeling paradigm built upon semantic-level transitions. Specifically, a semantic-level transition module is designed to model dynamic transitions directly in the discrete semantic code space, effectively capturing fine-grained semantic dependencies often lost in aggregated item representations. Then, a complementary prior injection module is designed to incorporate LLM-verified complementary priors into the attention mechanism, thereby prioritizing complementary patterns over co-occurrence statistics. Experiments on multiple e-commerce datasets demonstrate that CAST consistently outperforms the state-of-the-art approaches, achieving up to 17.6% Recall and 16.0% NDCG gains with 65x training acceleration. This validates its effectiveness and efficiency in uncovering latent item complementarity beyond statistics. The code will be released upon acceptance.

frontier_llmai_codinginfra

#62

Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention

Evaluations & Benchmarks 2026-04-21 arXiv cs.LG (Machine Learning)

Akash Yadav, Taiwo A. Adebiyi, Ruda Zhang

4.5

I 4.5 Im 4.8 P 3.7

Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on two scientific foundation models for weather and timeseries forecasting along with an additional regression task. Across benchmarks against uncertainty-aware baselines, we find that Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable coverage, while requiring only minutes of post-hoc tuning versus days of retraining for competitive baselines.

evalsefficiencyfrontier_llminfra

#63

Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)

Niclas Doll, Jasper Schulze Buschhoff, Shalaka Satheesh, Hammam Abdelwahab +2

4.5

I 4.0 Im 4.8 P 4.3

This paper narrows the performance gap between small, specialized models and significantly larger general-purpose models through domain adaptation via continual pre-training and merging. We address the scarcity of specialized non-English data by constructing a high-quality German medical corpus (FineMed-de) from FineWeb2. This corpus is used to continually pre-train and merge three well-known LLMs (ranging from $7B$ to $24B$ parameters), creating the DeFineMed model family. A comprehensive evaluation confirms that specialization dramatically enhances $7B$ model performance on German medical benchmarks. Furthermore, the pairwise win-rate analysis of the Qwen2.5-based models demonstrates an approximately $3.5$-fold increase in the win-rate against the much larger Mistral-Small-24B-Instruct through domain adaptation. This evidence positions specialized $7B$ models as a competitive, resource-efficient solution for complex medical instruction-following tasks. While model merging successfully restores instruction-following abilities, a subsequent failure mode analysis reveals inherent trade-offs, including the introduction of language mixing and increased verbosity, highlighting the need for more targeted fine-tuning in future work. This research provides a robust, compliant methodology for developing specialized LLMs, serving as the foundation for practical use in German-speaking healthcare contexts.

frontier_llmevalspost_traininginfra

#64

CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation

Evaluations & Benchmarks 2026-04-21 arXiv cs.CV (Computer Vision)arXiv cs.AI (Artificial Intelligence)

Yanhui Chen, Baoyao Yang, Siqi Liu, Jingchao Wang

4.5

I 4.0 Im 4.4 P 4.8

SAM3 advances open-vocabulary semantic segmentation by introducing a prompt-driven mask generation paradigm. However, in multi-class open-vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter-class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra-class drift that exacerbates inter-class conflicts and compromises overall inference stability. To address these issues, we propose CoCo-SAM3 (Concept-Conflict SAM3), which explicitly decouples inference into intra-class enhancement and inter-class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter-class competition on a unified comparable scale, enabling direct pixel-wise comparisons among all candidate classes. This mechanism stabilizes multi-class inference and effectively mitigates inter-class conflicts. Without requiring any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CV (Computer Vision), cs.AI (Artificial Intelligence) — spans multiple research subfields.

evalsefficiencyinfra

#65

DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

Multimodal 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Zhihong Zhang, Jie Zhao, Xiaojian Huang, Jin Xu +4

4.5

I 4.8 Im 4.9 P 3.4

Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Besides, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose \textbf{DT2IT-MRM}, which integrates a \textbf{D}ebiased preference construction pipeline, a novel reformulation of text-to-image (\textbf{T2I}) preference data, and an \textbf{I}terative \textbf{T}raining framework that curates existing multimodal preference datasets for \textbf{M}ultimodal \textbf{R}eward \textbf{M}odeling. Our experimental results show that DT2IT-MRM achieves new \textbf{state-of-the-art} overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.

multimodalfrontier_llmpost_trainingevals

#66

Diagnosable ColBERT: Debugging Late-Interaction Retrieval Models Using a Learned Latent Space as Reference

AI Coding 2026-04-21 arXiv cs.CL (Computation & Language)

François Remy

4.5

I 4.0 Im 5.4 P 3.7

Reliable biomedical and clinical retrieval requires more than strong ranking performance: it requires a practical way to find systematic model failures and curate the training evidence needed to correct them. Late-interaction models such as ColBERT provide a first solution thanks to the interpretable token-level interaction scores they expose between document and query tokens. Yet this interpretability is shallow: it explains a particular document--query pairwise score, but does not reveal whether the model has learned a clinical concept in a stable, reusable, and context-sensitive way across diverse expressions. As a result, these scores provide limited support for diagnosing misunderstandings, identifying irreasonably distant biomedical concepts, or deciding what additional data or feedback is needed to address this. In this short position paper, we propose Diagnosable ColBERT, a framework that aligns ColBERT token embeddings to a reference latent space grounded in clinical knowledge and expert-provided conceptual similarity constraints. This alignment turns document encodings into inspectable evidence of what the model appears to understand, enabling more direct error diagnosis and more principled data curation without relying on large batteries of diagnostic queries.

ai_codinginterpretabilitypost_trainingevals

#67

FB-NLL: A Feature-Based Approach to Tackle Noisy Labels in Personalized Federated Learning

Research 2026-04-21 arXiv cs.LG (Machine Learning)

Abdulmoneam Ali, Ahmed Arafa

4.5

I 4.8 Im 4.6 P 3.7

Personalized Federated Learning (PFL) aims to learn multiple task-specific models rather than a single global model across heterogeneous data distributions. Existing PFL approaches typically rely on iterative optimization-such as model update trajectories-to cluster users that need to accomplish the same tasks together. However, these learning-dynamics-based methods are inherently vulnerable to low-quality data and noisy labels, as corrupted updates distort clustering decisions and degrade personalization performance. To tackle this, we propose FB-NLL, a feature-centric framework that decouples user clustering from iterative training dynamics. By exploiting the intrinsic heterogeneity of local feature spaces, FB-NLL characterizes each user through the spectral structure of the covariances of their feature representations and leverages subspace similarity to identify task-consistent user groupings. This geometry-aware clustering is label-agnostic and is performed in a one-shot manner prior to training, significantly reducing communication overhead and computational costs compared to iterative baselines. Complementing this, we introduce a feature-consistency-based detection and correction strategy to address noisy labels within clusters. By leveraging directional alignment in the learned feature space and assigning labels based on class-specific feature subspaces, our method mitigates corrupted supervision without requiring estimation of stochastic noise transition matrices. In addition, FB-NLL is model-independent and integrates seamlessly with existing noise-robust training techniques. Extensive experiments across diverse datasets and noise regimes demonstrate that our framework consistently outperforms state-of-the-art baselines in terms of average accuracy and performance stability.

researchinterpretabilitypost_trainingsafety_policy

#68

Lyapunov-Certified Direct Switching Theory for Q-Learning

Reinforcement Learning 2026-04-21 arXiv cs.LG (Machine Learning)arXiv cs.AI (Artificial Intelligence)

Donghwan Lee

4.5

I 4.0 Im 4.0 P 5.1

Q-learning is one of the most fundamental algorithms in reinforcement learning. We analyze constant-stepsize Q-learning through a direct stochastic switching system representation. The key observation is that the Bellman maximization error can be represented exactly by a stochastic policy. Therefore, the Q-learning error admits a switched linear conditional-mean recursion with martingale-difference noise. The intrinsic drift rate is the joint spectral radius (JSR) of the direct switching family, which can be strictly smaller than the standard row-sum rate. Using this representation, we derive a finite-time final-iterate bound via a JSR-induced Lyapunov function and then give a computable quadratic-certificate version.

How it was discussed across sources

arXiv cross-listings: Listed under cs.LG (Machine Learning), cs.AI (Artificial Intelligence) — spans multiple research subfields.

rlfrontier_llmresearchsafety_policy

#69

Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

Multimodal 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Chuou Xu, Liya Ji, Qifeng Chen

4.5

I 4.8 Im 4.9 P 3.4

Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy "king"-"man"+"woman" = "queen" illustrates relational reasoning, yet replacing text with images of "king" and "man" significantly reduces performance because it requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that "powder" and "cake" are related by "is made of" grounds symbolic relations in perception, enabling tool substitution, task generalization, and improved semantic reasoning. Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, but suffers from modality gaps and lacks systematic evaluation. In this paper, we formulate two novel tasks, two-term subtraction and three-term operations, and construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. We further propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable function and Group Relative Policy Optimization (GRPO). Our method achieves state-of-the-art results on IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning, this work advances domestic robots' ability to ground symbolic reasoning in perception, enhancing decision-making, tool adaptability, and human-robot interaction in complex environments. Datasets and source code are provided in the supplementary material.

multimodalevalsresearchfrontier_llm

#70

PREF-XAI: Preference-Based Personalized Rule Explanations of Black-Box Machine Learning Models

Interpretability 2026-04-21 arXiv cs.LG (Machine Learning)

Salvatore Greco, Jacek Karolczak, Roman Słowiński, Jerzy Stefanowski

4.5

I 4.0 Im 5.3 P 3.7

Explainable artificial intelligence (XAI) has predominantly focused on generating model-centric explanations that approximate the behavior of black-box models. However, such explanations often overlook a fundamental aspect of interpretability: different users require different explanations depending on their goals, preferences, and cognitive constraints. Although recent work has explored user-centric and personalized explanations, most existing approaches rely on heuristic adaptations or implicit user modeling, lacking a principled framework for representing and learning individual preferences. In this paper, we consider Preference-Based Explainable Artificial Intelligence (PREF-XAI), a novel perspective that reframes explanation as a preference-driven decision problem. Within PREF-XAI, explanations are not treated as fixed outputs, but as alternatives to be evaluated and selected according to user-specific criteria. In the PREF-XAI perspective, here we propose a methodology that combines rule-based explanations with formal preference learning. User preferences are elicited through a ranking of a small set of candidate explanations and modeled via an additive utility function inferred using robust ordinal regression. Experimental results on real-world datasets show that PREF-XAI can accurately reconstruct user preferences from limited feedback, identify highly relevant explanations, and discover novel explanatory rules not initially considered by the user. Beyond the proposed methodology, this work establishes a connection between XAI and preference learning, opening new directions for interactive and adaptive explanation systems.

interpretabilitypost_trainingevals

#71

PlayCoder: Making LLM-Generated GUI Code Playable

Evaluations & Benchmarks 2026-04-21 arXiv — Agents / Tool Use

Zhiyuan Peng, Wei Tao, Xin Yin, Chenhao Ying +2

4.5

I 4.8 Im 5.0 P 3.4

Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Their evaluation therefore should consider interaction flows and UI logic rather than only pass/fail outcomes. To study this problem, we introduce PlayEval, a repository-aware benchmark built from 43 multilingual GUI applications in Python, TypeScript, and JavaScript. Unlike prior GUI benchmarks that are difficult to adapt to desktop environments, PlayEval covers six major GUI application categories and directly supports code-generation evaluation. We further propose Play@k, a metric that measures whether at least one of *k* generated candidates can be played end-to-end without logical errors. To support reliable evaluation, we develop PlayTester, an LLM-based agent that performs task-oriented GUI playthroughs and detects logic violations automatically. Experiments on 10 state-of-the-art code LLMs show that, despite high compilation rates, they achieve near-zero Play@3, revealing major weaknesses in generating logically correct GUI applications. To address this limitation, we present PlayCoder, a multi-agent, repository-aware framework that generates, evaluates, and iteratively repairs GUI application code in a closed loop. PlayCoder substantially improves both functional correctness and semantic alignment for open-source and closed-source models, reaching up to 38.1% Exec@3 and 20.3% Play@3. Case studies further show that it can uncover silent logic bugs missed by traditional metrics and fix them through targeted edits.

evalsfrontier_llmai_codingpost_training

#72

Time Series Augmented Generation for Financial Applications

Frontier LLMs 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Anton Kolonin, Alexey Glushchenko, Evgeny Bochkov, Abhishek Saxena

4.5

I 4.6 Im 4.4 P 4.0

Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm. Our primary contribution is this evaluation framework and the corresponding empirical insights into agent performance, which we release publicly to foster standardized research on reliable financial AI.

frontier_llmevalsagents

#73

Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

Efficiency 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Shuhai Peng, Hui Lu, Jinjiang Liu, Liyang Chen +7

4.5

I 4.8 Im 4.8 P 3.4

While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation due to the severe mismatch between training and streaming inference. To bridge this gap, we present the first autoregressive (AR) models tailored for streaming TSE. Our approach introduces a Chunk-wise Interleaved Splicing Paradigm that ensures highly efficient and stable streaming inference. To ensure the coherence between the extracted speech segments, we design a historical context refinement mechanism that mitigates boundary discontinuities by leveraging historical information. Experiments on Libri2Mix show that while AR generative baseline exhibits performance degradation at low latencies, our approach maintains 100% stability and superior intelligibility. Furthermore, our streaming results are comparable to or even surpass offline baselines. Additionally, our model achieves a Real-Time-Factor (RTF) of 0.248 on consumer-level GPUs. This work provides empirical evidence that AR generative backbones are viable for latency-sensitive applications through the Chunk-wise Interleaved Splicing Paradigm.

efficiencyinfraaudioevals

#74

A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression

Agents & Tool Use 2026-04-21 arXiv cs.CL (Computation & Language)

Jincheng Ren, Siwei Wu, Yizhi Li, Kang Zhu +7

4.4

I 4.4 Im 4.8 P 3.7

As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved in the interaction history to support future decisions. However, repeatedly retaining such feedback introduces substantial redundancy and causes cumulative token cost to grow quadratically with the number of steps, hindering long-horizon reasoning. Although observation compression can mitigate this issue, the heterogeneity of terminal environments makes heuristic-based or fixed-prompt methods difficult to generalize. We propose TACO, a plug-and-play, self-evolving Terminal Agent Compression framework that automatically discovers and refines compression rules from interaction trajectories for existing terminal agents. Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks (i.e., SWE-Bench Lite, CompileBench, DevEval, and CRUST-Bench) show that TACO consistently improves performance across mainstream agent frameworks and strong backbone models. With MiniMax-2.5, it improves performance on most benchmarks while reducing token overhead by around 10%. On TerminalBench, it brings consistent gains of 1%-4% across strong agentic models, and further improves accuracy by around 2%-3% under the same token budget. These results demonstrate the effectiveness and generalization of self-evolving, task-aware compression for terminal agents.

agentsevalsefficiencyresearch

#75

Accelerating Optimization and Machine Learning through Decentralization

Research 2026-04-21 arXiv cs.LG (Machine Learning)

Ziqin Chen, Zuang Wang, Yongqiang Wang

4.4

I 4.8 Im 4.4 P 3.7

Decentralized optimization enables multiple devices to learn a global machine learning model while each individual device only has access to its local dataset. By avoiding the need for training data to leave individual users' devices, it enhances privacy and scalability compared to conventional centralized learning, where all data has to be aggregated to a central server. However, decentralized optimization has traditionally been viewed as a necessary compromise, used only when centralized processing is impractical due to communication constraints or data privacy concerns. In this study, we show that decentralization can paradoxically accelerate convergence, outperforming centralized methods in the number of iterations needed to reach optimal solutions. Through examples in logistic regression and neural network training, we demonstrate that distributing data and computation across multiple agents can lead to faster learning than centralized approaches, even when each iteration is assumed to take the same amount of time, whether performed centrally on the full dataset or decentrally on local subsets. This finding challenges longstanding assumptions and reveals decentralization as a strategic advantage, offering new opportunities for more efficient optimization and machine learning.

researchagentsefficiencyinfra

#76

An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA

Evaluations & Benchmarks 2026-04-21 arXiv cs.CL (Computation & Language)

Saransh Sharma, Pritika Ramu, Aparna Garimella, Koyel Mukherjee

4.4

I 4.6 Im 4.4 P 3.7

Answering open-ended questions remains challenging for AI systems because it requires synthesis, judgment, and exploration beyond factual retrieval, and users often refine answers through multiple iterations rather than accepting a single response. Existing QA benchmarks do not explicitly support this refinement process. To address this gap, we introduce a new task, document-grounded related insight generation, where the goal is to generate additional insights from a document collection that help improve, extend, or rethink an initial answer to an open-ended question, ultimately supporting richer user interaction and a better overall question answering experience. We curate and release SCOpE-QA (Scientific Collections for Open-Ended QA), a dataset of 3,000 open-ended questions across 20 research collections. We present InsightGen, a two-stage approach that first constructs a thematic representation of the document collection using clustering, and then selects related context based on neighborhood selection from the thematic graph to generate diverse and relevant insights using LLMs. Extensive evaluation on 3,000 questions using two generation models and two evaluation settings shows that InsightGen consistently produces useful, relevant, and actionable insights, establishing a strong baseline for this new task.

evalsfrontier_llm

#77

Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean

Audio & Speech 2026-04-21 arXiv cs.CL (Computation & Language)

Hyunjung Joo, GyeongTaek Lee

4.4

I 4.8 Im 4.4 P 3.7

The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous $F_0$ contours to these invariant categories due to variable $F_0$ realizations in real-world speech. Our paper proposes Dual-Glob, a deep supervised contrastive learning framework to robustly classify fine-grained pitch accent patterns in Seoul Korean. Unlike conventional local predictive models, our approach captures holistic $F_0$ contour shapes by enforcing structural consistency between clean and augmented views in a shared latent space. To this aim, we introduce the first large-scale benchmark dataset, consisting of manually annotated 10,093 Accentual Phrases in Seoul Korean. Experimental results show that our Dual-Glob significantly outperforms strong baseline models with state-of-the-art accuracy (77.75%) and F1-score (51.54%). Therefore, our work supports AM-based intonational phonology using data-driven methodology, showing that deep contrastive learning effectively captures holistic structural features of continuous $F_0$ contours.

audiointerpretabilityevals

#78

Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)

Feihao Fang, My T. Thai, Yuanyuan Lei

4.4

I 4.0 Im 5.0 P 3.7

Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that are shared across views while remaining independent of surface forms. To verify this, we employ Canonical Correlation Analysis on the paired residual activations from natural-language and symbolic-language reasoning chains, learning a low-dimensional subspace with maximum cross-view correlation. Furthermore, we design a training-free approach that steers LLMs reasoning chain along this logical subspace, thereby leveraging the complementary reasoning signals from both views. Experiments on four logical reasoning benchmarks demonstrate the effectiveness of our approach, improving accuracy by up to 11 percentage points and generalizing well on out-of-domain problems.

frontier_llmpost_trainingevalssafety_policy

#79

From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems

Evaluations & Benchmarks 2026-04-21 arXiv cs.LG (Machine Learning)

Quang-Huy Nguyen, Thanh-Hai Nguyen, Khac-Manh Thai, Duc-Hoang Pham +5

4.4

I 4.8 Im 4.4 P 3.7

Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implement, and re-evaluate eleven state-of-the-art CE methods for recommender systems, covering both native explainers (e.g., LIME-RS, SHAP, PRINCE, ACCENT, LXR, GREASE) and specific graph-based explainers originally proposed for GNNs. Here, a unified benchmarking framework is proposed to assess explainers along three dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Our evaluation protocol includes effectiveness, sparsity, and computational complexity metrics, and extends existing item-level assessments to top-K list-level explanations. Through extensive experiments on three real-world datasets and six representative recommender models, we analyze how well previously reported strengths of CE methods generalize across diverse setups. We observe that the trade-off between effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under the explicit format; in addition, explainer performance remains largely consistent across item level and list level evaluations, and several graph-based explainers exhibit notable scalability limitations on large recommender graphs. Our results refine and challenge earlier conclusions about the robustness and practicality of CE generation methods in recommender systems: https://github.com/L2R-UET/CFExpRec.

evals

#80

Headlines You Won't Forget: Can Pronoun Insertion Increase Memorability?

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)

Selina Meyer, Magdalena Abel, Michael Roth

4.4

I 4.0 Im 5.2 P 3.7

For news headlines to influence beliefs and drive action, relevant information needs to be retained and retrievable from memory. In this probing study we draw on experiment designs from cognitive psychology to examine how a specific linguistic feature, namely direct address through first- and second-person pronouns, affects memorability and to what extent it is feasible to use large language models for the targeted insertion of such a feature into existing text without changing its core meaning. Across three controlled memorization experiments with a total of 240 participants, yielding 7,680 unique memory judgments, we show that pronoun insertion has mixed effects on memorability. Exploratory analyses indicate that effects differ based on headline topic, how pronouns are inserted and their immediate contexts. Additional data and fine-grained analysis is needed to draw definitive conclusions on these mediating factors. We further show that automatic revisions by LLMs are not always appropriate: Crowdsourced evaluations find many of them to be lacking in content accuracy and emotion retention or resulting in unnatural writing style. We make our collected data available for future work.

frontier_llminterpretabilityevals

#81

InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

Multimodal 2026-04-21 arXiv cs.CV (Computer Vision)

Nikita Kister, Pradyumna YM, István Sárándi, Jiayi Wang +2

4.4

I 5.3 Im 4.5 P 3.0

Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.

multimodalresearchfrontier_llmagents

#82

IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)

Rajveer Singh Pall

4.4

I 4.8 Im 4.4 P 3.7

We introduce IndiaFinBench, to our knowledge the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora (SEC filings, US earnings reports, and English-language financial news), leaving a significant gap in coverage of non-Western regulatory frameworks. IndiaFinBench addresses this gap with 406 expert-annotated question-answer pairs drawn from 192 documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality is validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611; 76.7% overall agreement). We evaluate twelve models under zero-shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash). All models substantially outperform a non-specialist human baseline of 60.0%. Numerical reasoning is the most discriminative task, with a 35.9 percentage-point spread across models. Bootstrap significance testing (10,000 resamples) reveals three statistically distinct performance tiers. The dataset, evaluation code, and all model outputs are available at https://github.com/rajveerpall/IndiaFinBench

frontier_llmevalsai_coding

#83

Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking

Robotics 2026-04-21 alphaXiv trending

4.4

I 4.4 Im 4.0 P 4.4

A framework merging diffusion-based motion generation with RL-based tracking for humanoid robots on challenging terrains. Generator provides adaptive target motions; tracker executes.

roboticsgenerative_media

#84

Pentagon officials detail $55B drone plan under DAWG

Government & Defense 2026-04-21 Breaking Defense

4.4

I 5.4 Im 4.0 P 3.4

Pentagon officials outlined a $55 billion drone plan coordinated under DAWG (Defense Autonomy Working Group / equivalent program office). Anduril-Kraken USV partnership separately announced — part of broader push to field large unmanned systems across air and sea.

gov_defense

#85

Safe Continual Reinforcement Learning in Non-stationary Environments

Evaluations & Benchmarks 2026-04-21 arXiv cs.LG (Machine Learning)

Austin Coursey, Abel Diaz-Gonzalez, Marcos Quinones-Grueiro, Gautam Biswas

4.4

I 4.0 Im 5.0 P 3.7

Reinforcement learning (RL) offers a compelling data-driven paradigm for synthesizing controllers for complex systems when accurate physical models are unavailable; however, most existing control-oriented RL methods assume stationarity and, therefore, struggle in real-world non-stationary deployments where system dynamics and operating conditions can change unexpectedly. Moreover, RL controllers acting in physical environments must satisfy safety constraints throughout their learning and execution phases, rendering transient violations during adaptation unacceptable. Although continual RL and safe RL have each addressed non-stationarity and safety, respectively, their intersection remains comparatively unexplored, motivating the study of safe continual RL algorithms that can adapt over the system's lifetime while preserving safety. In this work, we systematically investigate safe continual reinforcement learning by introducing three benchmark environments that capture safety-critical continual adaptation and by evaluating representative approaches from safe RL, continual RL, and their combinations. Our empirical results reveal a fundamental tension between maintaining safety constraints and preventing catastrophic forgetting under non-stationary dynamics, with existing methods generally failing to achieve both objectives simultaneously. To address this shortcoming, we examine regularization-based strategies that partially mitigate this trade-off and characterize their benefits and limitations. Finally, we outline key open challenges and research directions toward developing safe, resilient learning-based controllers capable of sustained autonomous operation in changing environments.

evalsrlsafety_policyinfra

#86

Safety-Critical Contextual Control via Online Riemannian Optimization with World Models

Research 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Tongxin Li

4.4

I 4.8 Im 4.6 P 3.4

Modern world models are becoming too complex to admit explicit dynamical descriptions. We study safety-critical contextual control, where a Planner must optimize a task objective using only feasibility samples from a black-box Simulator, conditioned on a context signal $ξ_t$. We develop a sample-based Penalized Predictive Control (PPC) framework grounded in online Riemannian optimization, in which the Simulator compresses the feasibility manifold into a score-based density $\hat{p}(u \mid ξ_t)$ that endows the action space with a Riemannian geometry guiding the Planner's gradient descent. The barrier curvature $κ(ξ_t)$, the minimum curvature of the conditional log-density $-\ln\hat{p}(\cdot\midξ_t)$, governs both convergence rate and safety margin, replacing the Lipschitz constant of the unknown dynamics. Our main result is a contextual safety bound showing that the distance from the true feasibility manifold is controlled by the score estimation error and a ratio that depends on $κ(ξ_t)$, both of which improve with richer context. Simulations on a dynamic navigation task confirm that contextual PPC substantially outperforms marginal and frozen density models, with the advantage growing after environment shifts.

researchsafety_policy

#87

Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation

Frontier LLMs 2026-04-21 arXiv cs.LG (Machine Learning)

Thomas Zollo, Jimmy Wang, Richard Zemel

4.4

I 4.8 Im 4.4 P 3.7

Reasoning language models can solve increasingly complex tasks, but struggle to produce the calibrated confidence estimates necessary for reliable deployment. Existing calibration methods usually depend on labels or repeated sampling at inference time, making them impractical in many settings. We introduce a method for unsupervised confidence calibration of reasoning LLMs when only a single generation is available at inference time. Our approach uses offline sampling on unlabeled data to derive a self-consistency-based proxy target, then distills this signal into a lightweight deployment-time confidence predictor. In a broad evaluation across 5 math and question-answering tasks using 9 reasoning models, our method substantially outperforms baselines, including under distribution shift, and improves downstream performance in selective prediction and simulated downstream decision-making.

frontier_llmevalsefficiencyindustry

#88

A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities

Evaluations & Benchmarks 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Aya Cherigui, Florent Guépin, Arnaud Legendre, Jean-François Couchot

4.3

I 4.0 Im 5.0 P 3.4

Human mobility data are used in numerous applications, ranging from public health to urban planning. Human mobility is inherently sensitive, as it can contain information such as religious beliefs and political affiliations. Historically, it has been proposed to modify the information using techniques such as aggregation, obfuscation, or noise addition, to adequately protect privacy and eliminate concerns. As these methods come at a great cost in utility, new methods leveraging development in generative models, were introduced. The extent to which such methods answer the privacy-utility trade-off remains an open problem. In this paper, we introduced a first step towards solving it, by the introduction and application of a new framework for utility evaluation. Furthermore, we provide evidence that privacy evaluation remains a great challenge to consider and that it should be tackled through adversarial evaluation in accordance with the current EU regulation. We propose a new membership inference attack against a subcategory of generative models, even though this subcategory was deemed private due to its resistance over the trajectory user-linking problem.

evalssafety_policyefficiency

#89

An AI Agent Execution Environment to Safeguard User Data

Evaluations & Benchmarks 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Robert Stanley, Avi Verma, Lillian Tsai, Konstantinos Kallas +1

4.3

I 4.8 Im 4.4 P 3.4

AI agents promise to serve as general-purpose personal assistants for their users, which requires them to have access to private user data (e.g., personal and financial information). This poses a serious risk to security and privacy. Adversaries may attack the AI model (e.g., via prompt injection) to exfiltrate user data. Furthermore, sharing private data with an AI agent requires users to trust a potentially unscrupulous or compromised AI model provider with their private data. This paper presents GAAP (Guaranteed Accounting for Agent Privacy), an execution environment for AI agents that guarantees confidentiality for private user data. Through dynamic and directed user prompts, GAAP collects permission specifications from users describing how their private data may be shared, and GAAP enforces that the agent's disclosures of private user data, including disclosures to the AI model and its provider, comply with these specifications. Crucially, GAAP provides this guarantee deterministically, without trusting the agent with private user data, and without requiring any AI model or the user prompt to be free of attacks. GAAP enforces the user's permission specification by tracking how the AI agent accesses and uses private user data. It augments Information Flow Control with novel persistent data stores and annotations that enable it to track the flow of private information both across execution steps within a single task, and also over multiple tasks separated in time. Our evaluation confirms that GAAP blocks all data disclosure attacks, including those that make other state-of-the-art systems disclose private user data to untrusted parties, without a significant impact on agent utility.

evalsagents

#90

Assessing VLM-Driven Semantic-Affordance Inference for Non-Humanoid Robot Morphologies

Robotics 2026-04-21 arXiv cs.RO (Robotics)

Jess Jones, Raul Santos-Rodriguez, Sabine Hauert

4.3

I 4.4 Im 4.6 P 3.4

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding human-object interactions, but their application to robotic systems with non-humanoid morphologies remains largely unexplored. This work investigates whether VLMs can effectively infer affordances for robots with fundamentally different embodiments than humans, addressing a critical gap in the deployment of these models for diverse robotic applications. We introduce a novel hybrid dataset that combines annotated real-world robotic affordance-object relations with VLM-generated synthetic scenarios, and perform an empirical analysis of VLM performance across multiple object categories and robot morphologies, revealing significant variations in affordance inference capabilities. Our experiments demonstrate that while VLMs show promising generalisation to non-humanoid robot forms, their performance is notably inconsistent across different object domains. Critically, we identify a consistent pattern of low false positive rates but high false negative rates across all morphologies and object categories, indicating that VLMs tend toward conservative affordance predictions. Our analysis reveals that this pattern is particularly pronounced for novel tool use scenarios and unconventional object manipulations, suggesting that effective integration of VLMs in robotic systems requires complementary approaches to mitigate over-conservative behaviour while preserving the inherent safety benefits of low false positive rates.

roboticsmultimodalagentssafety_policy

#91

Budgeted Online Influence Maximization

Generative Media 2026-04-21 arXiv cs.LG (Machine Learning)

Pierre Perrault, Jennifer Healey, Zheng Wen, Michal Valko

4.3

I 4.8 Im 4.0 P 3.7

We introduce a new budgeted framework for online influence maximization, considering the total cost of an advertising campaign instead of the common cardinality constraint on a chosen influencer set. Our approach better models the real-world setting where the cost of influencers varies and advertisers want to find the best value for their overall social advertising budget. We propose an algorithm assuming an independent cascade diffusion model and edge level semi-bandit feedback, and provide both theoretical and experimental results. Our analysis is also valid for the cardinality constraint setting and improves the state of the art regret bound in this case.

generative_media

#92

CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation

Frontier LLMs 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Jianzhi Yan, Le Liu, Buzhou Tang, Yang Xiang +2

4.3

I 4.8 Im 4.4 P 3.4

Large language models (LLMs) have achieved substantial advances in logical reasoning, yet they continue to lag behind human-level performance. In-context learning provides a viable solution that boosts the model's performance via prompting its input with expert-curated, in-domain exemplars. However, in many real-world, expertise-scarce domains, such as low-resource scientific disciplines, emerging biomedical subfields, or niche legal jurisdictions, such high-quality in-domain demonstrations are inherently limited or entirely unavailable, thereby constraining the general applicability of these approaches. To mitigate this limitation, recent efforts have explored the retrieval of cross-domain samples as surrogate in-context demonstrations. Nevertheless, the resulting gains remain modest. This is largely attributable to the pronounced domain shift between source and target distributions, which impedes the model's ability to effectively identify and exploit underlying shared structures or latent reasoning patterns. Consequently, when relying solely on raw textual prompting, LLMs struggle to abstract and transfer such cross-domain knowledge in a robust and systematic manner. To address these issues, we propose CoDA, which employs a lightweight adapter to directly intervene in the intermediate hidden states. By combining feature-based distillation of CoT-enriched reference representations with Maximum Mean Discrepancy (MMD) for kernelized distribution matching, our method aligns the latent reasoning representation of the source and target domains. Extensive experimental results on multiple logical reasoning tasks across various model families validate the efficacy of CoDA by significantly outperforming the previous state-of-the-art baselines by a large margin.

frontier_llminterpretabilityevalsefficiency

#93

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

Efficiency 2026-04-21 arXiv cs.CL (Computation & Language)

Jinyu Guo, Zhihan Zhang, Yutong Li, Jiehui Xie +7

4.3

I 4.8 Im 4.0 P 3.7

The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and fail to address the high overhead of floating-point arithmetic. This paper introduces DASH-KV, an innovative acceleration framework that reformulates attention as approximate nearest-neighbor search via asymmetric deep hashing. Under this paradigm, we design an asymmetric encoding architecture that differentially maps queries and keys to account for their distinctions in precision and reuse characteristics. To balance efficiency and accuracy, we further introduce a dynamic mixed-precision mechanism that adaptively retains full-precision computation for critical tokens. Extensive experiments on LongBench demonstrate that DASH-KV significantly outperforms state-of-the-art baseline methods while matching the performance of full attention, all while reducing inference complexity from O(N^2) to linear O(N). The code is available at https://github.com/Zhihan-Zh/DASH-KV

efficiencyfrontier_llmai_coding

#94

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

Evaluations & Benchmarks 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Vasundra Srininvasan

4.3

I 4.0 Im 5.0 P 3.4

Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis-level reversal aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.

evalsagentspost_trainingsafety_policy

#95

GenerativeMPC: VLM-RAG-guided Whole-Body MPC with Virtual Impedance for Bimanual Mobile Manipulation

Multimodal 2026-04-21 arXiv cs.RO (Robotics)

Marcelino Julio Fernando, Miguel Altamirano Cabrera, Jeffrin Sam, Yara Mahmoud +2

4.3

I 4.0 Im 5.1 P 3.4

Bimanual mobile manipulation requires a seamless integration between high-level semantic reasoning and safe, compliant physical interaction - a challenge that end-to-end models approach opaquely and classical controllers lack the context to address. This paper presents GenerativeMPC, a hierarchical cyber-physical framework that explicitly bridges semantic scene understanding with physical control parameters for bimanual mobile manipulators. The system utilizes a Vision-Language Model with Retrieval-Augmented Generation (VLM-RAG) to translate visual and linguistic context into grounded control constraints, specifically outputting dynamic velocity limits and safety margins for a Whole-Body Model Predictive Controller (MPC). Simultaneously, the VLM-RAG module modulates virtual stiffness and damping gains for a unified impedance-admittance controller, enabling context-aware compliance during human-robot interaction. Our framework leverages an experience-driven vector database to ensure consistent parameter grounding without retraining. Experimental results in MuJoCo, IsaacSim, and on a physical bimanual platform confirm a 60% speed reduction near humans and safe, socially-aware navigation and manipulation through semantic-to-physical parameter grounding. This work advances the field of human-centric cybernetics by grounding large-scale cognitive models into predictable, high-frequency physical control loops.

multimodalroboticsevalssafety_policy

#96

LASER: Learning Active Sensing for Continuum Field Reconstruction

Reinforcement Learning 2026-04-21 arXiv cs.LG (Machine Learning)

Huayu Deng, Jinghui Zhong, Xiangming Zhu, Yunbo Wang +1

4.3

I 4.8 Im 4.0 P 3.7

High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This enables a reinforcement learning policy to simulate ''what-if'' sensing scenarios within a latent imagination space. By conditioning sensor movements on predicted latent states, LASER navigates toward potentially high-information regions beyond current observations. Our experiments demonstrate that LASER consistently outperforms static and offline-optimized strategies, achieving high-fidelity reconstruction under sparsity across diverse continuum fields.

rlsafety_policy

#97

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

Evaluations & Benchmarks 2026-04-21 arXiv cs.RO (Robotics)

Yunfan Lou, Xiaowei Chi, Xiaojie Zhang, Zezhong Qian +8

4.3

I 4.8 Im 4.4 P 3.4

World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, this can result in overfitting to irrelevant factors, such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. Extensive evaluations demonstrate the superiority of MWM on the LIBERO and RLBench simulation benchmarks, significantly outperforming the state-of-the-art RGB-based world models. Furthermore, real-world experiments and robustness evaluation (via random token pruning) reveal that MWM exhibits superior generalization capabilities and robust resilience to texture information loss.

evalsroboticsresearchgenerative_media

#98

Meta will record employees’ keystrokes and use it to train its AI models

Industry 2026-04-21 TechCrunch — AI

Lucas Ropek

4.3

I 4.5 Im 4.7 P 3.4

Meta says that it has a new internal tool that is converting mouse movements and button clicks into data that can train its AI models.

#99

On Reasoning-Centric LLM-based Automated Theorem Proving

Frontier LLMs 2026-04-21 arXiv — Agents / Tool Use

Yican Sun, Chengwei Shi, Hangzhou Lyu, Yingfei Xiong

4.3

I 4.8 Im 4.4 P 3.4

Automated theorem proving is fundamental to formal methods, and the recent trend is to integrate large language models (LLMs) and proof assistants to form effective proof agents. While existing proof agents show promising performance, they inadequately leverage reasoning capabilities of modern LLMs in high-level planning and self-critique. We argue that proof agents should not merely generate tactics but also reason strategically about proof plans and critically evaluate their own proposals. This paper introduces ReCent-Prover, a reasoning-centric LLM-based proof agent for Rocq that addresses two critical limitations in current systems. First, we present validation with reflection, enabling LLMs to scrutinize their generated tactics and synthesize failure summaries when reflection identifies potential errors, filtering out potentially misapplied tactics earlier. Second, we propose retrieval with planning, which conditions retrieval on LLM-generated proof plans rather than subgoal similarity, retrieving lemmas and proofs that align with the anticipated proof strategy. Both techniques increase the number of invocations of LLMs. However, when evaluated on the CoqStoq benchmark, even under the same budget of LLM invocations, ReCent-Prover achieves a 22.58% relative improvement in the number of proved theorems over the previous state-of-the-art, demonstrating that our reasoning-centric design significantly enhances automated theorem proving capabilities.

frontier_llmevalsagents

#100

Optimal Routing for Federated Learning over Dynamic Satellite Networks: Tractable or Not?

Research 2026-04-21 arXiv cs.LG (Machine Learning)

Yi Zhao, Di Yuan, Tao Deng, Suzhi Cao +1

4.3

I 4.0 Im 4.8 P 3.7

Federated learning (FL) is a key paradigm for distributed model learning across decentralized data sources. Communication in each FL round typically consists of two phases: (i) distributing the global model from a server to clients, and (ii) collecting updated local models from clients to the server for aggregation. This paper focuses on a type of FL where communication between a client and the server is relay-based over dynamic networks, making routing optimization essential. A typical scenario is in-orbit FL, where satellites act as clients and communicate with a server (which can be a satellite, ground station, or aerial platform) via multi-hop inter-satellite links. This paper presents a comprehensive tractability analysis of routing optimization for in-orbit FL under different settings. For global model distribution, these include the number of models, the objective function, and routing schemes (unicast versus multicast, and splittable versus unsplittable flow). For local model collection, the settings consider the number of models, client selection, and flow splittability. For each case, we rigorously prove whether the global optimum is obtainable in polynomial time or the problem is NP-hard. Together, our analysis draws clear boundaries between tractable and intractable regimes for a broad spectrum of routing problems for in-orbit FL. For tractable cases, the derived efficient algorithms are directly applicable in practice. For intractable cases, we provide fundamental insights into their inherent complexity. These contributions fill a critical yet unexplored research gap, laying a foundation for principled routing design, evaluation, and deployment in satellite-based FL or similar distributed learning systems.

researchevalsefficiencyindustry

#101

Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

Agents & Tool Use 2026-04-21 arXiv cs.CL (Computation & Language)

Xinlin Wang, Mats Brorsson

4.3

I 4.0 Im 4.9 P 3.7

Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real-world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine-tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large-scale, comprehensive study of <10B open-source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a multi-agent system with collaborative capabilities. Our results show that single-agent systems achieve the best balance between performance and cost, while multi-agent setups add overhead with limited gains. Our findings highlight the importance of agent-centric design for efficient and trustworthy deployment in resource-constrained settings.

agentsfrontier_llmpost_trainingefficiency

#102

Sam Altman throws shade at Anthropic’s cyber model, Mythos: ‘fear-based marketing’

Industry 2026-04-21 TechCrunch — AI

Lucas Ropek

4.3

I 4.0 Im 4.5 P 4.0

This week, during a podcast appearance, OpenAI CEO Sam Altman called out his competitor's new cybersecurity model, noting that the company was using fear to make its product sound more impressive than it actually is.

industry

#103

Unauthorized group has gained access to Anthropic’s exclusive cyber tool Mythos, report claims

Industry 2026-04-21 TechCrunch — AI

Lucas Ropek

4.3

I 4.0 Im 4.5 P 4.0

Anthropic told TechCrunch it is investigating the claims, but maintains that there is no evidence that its systems have been impacted.

#104

Unauthorized group reportedly accessed Anthropic's cyber tool Mythos

Government & Defense 2026-04-21 TechCrunch — AI

4.3

I 4.0 Im 4.5 P 4.0

TechCrunch report alleges an unauthorized group gained access to Mythos, Anthropic's exclusive cyber-defense model. Sam Altman publicly called Mythos "fear-based marketing" in a separate post, deepening the commercial rivalry.

gov_defense

#105

VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing

Multimodal 2026-04-21 arXiv cs.CL (Computation & Language)

Yanbin Huang, Yisen Li, Guiyao Tie, Xiaoye Qu +5

4.3

I 4.5 Im 4.4 P 3.7

Large vision-language models (LVLMs) frequently suffer from Object Hallucination (OH), wherein they generate descriptions containing objects that are not actually present in the input image. This phenomenon is particularly problematic in real-world applications such as medical imaging and autonomous driving, where accuracy is critical. Recent studies suggest that the hallucination problem may stem from language priors: biases learned during pretraining that cause LVLMs to generate words based on their statistical co-occurrence. To mitigate this problem, we propose Visual Contrastive Editing (VCE), a novel post-hoc method that identifies and suppresses hallucinatory tendencies by analyzing the model's response to contrastive visual perturbations. Using Singular Value Decomposition (SVD), we decompose the model's activation patterns to isolate hallucination subspaces and apply targeted parameter edits to attenuate its influence. Unlike existing approaches that require fine-tuning or labeled data, VCE operates as a label-free intervention, making it both scalable and practical for deployment in resource-constrained settings. Experimental results demonstrate that VCE effectively reduces object hallucination across multiple benchmarks while maintaining the model's original computational efficiency.

multimodalpost_trainingevalsrobotics

#106

Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding

Efficiency 2026-04-21 arXiv cs.CV (Computer Vision)

Kadir Yilmaz, Adrian Kruse, Tristan Höfer, Daan de Geus +1

4.3

I 4.8 Im 4.8 P 3.0

Transformers have become a common foundation across deep learning, yet 3D scene understanding still relies on specialized backbones with strong domain priors. This keeps the field isolated from the broader Transformer ecosystem, limiting the transfer of new advances as well as the benefits of increasingly optimized software and hardware stacks. To bridge this gap, we adapt the vanilla Transformer encoder to 3D scenes with minimal modifications. Given an input 3D scene, we partition it into volumetric patch tokens, process them with full global self-attention, and inject positional information via a 3D extension of rotary positional embeddings. We call the resulting model the Volume Transformer (Volt) and apply it to 3D semantic segmentation. Naively training Volt on standard 3D benchmarks leads to shortcut learning, highlighting the limited scale of current 3D supervision. To overcome this, we introduce a data-efficient training recipe based on strong 3D augmentations, regularization, and distillation from a convolutional teacher, making Volt competitive with state-of-the-art methods. We then scale supervision through joint training on multiple datasets and show that Volt benefits more from increased scale than domain-specific 3D backbones, achieving state-of-the-art results across indoor and outdoor datasets. Finally, when used as a drop-in backbone in a standard 3D instance segmentation pipeline, Volt again sets a new state of the art, highlighting its potential as a simple, scalable, general-purpose backbone for 3D scene understanding.

efficiencyinfraai_codingevals

#107

seneca: A Personalized Conversational Planner

Evaluations & Benchmarks 2026-04-21 arXiv — Agents / Tool Use

Simon Bohnen, Gabriel Garbers, Lukas Ellinger, Georg Groh

4.3

I 4.0 Im 5.0 P 3.4

Knowledge work demands sustained self-regulation, prioritization, and reflection-yet existing planning tools only partially support these needs. Digital to-do list applications feature task persistence but lack goal representation. Paper-based planning frameworks offer effective planning strategies but cannot adapt to individual users. Conversational AI systems enable flexible reflection but lack persistence and accountability. Moreover, none of these tools address a fundamental challenge: users' expressed demands often diverge from their underlying needs. This paper introduces seneca, a conceptual framework for a personalized, AI-assisted planner that integrates the complementary strengths of these three approaches. seneca combines a conversational agent that scaffolds reflection and asks clarifying questions, a persistent database that tracks goals and behavioral patterns, and a processor that synchronizes information between them. We describe this architecture and outline a phased evaluation strategy combining automated testing with simulated users and longitudinal human studies measuring goal attainment, planning realism, and goal-value alignment.

evalssafety_policyagentsinterpretability

#108

CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

Evaluations & Benchmarks 2026-04-21 arXiv cs.CL (Computation & Language)

Peiqin Lin, Chenyang Lyu, Wenjiang Luo, Haotian Ye +21

4.2

I 4.0 Im 4.4 P 3.7

Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks -- where models must reason within real-world, context-rich scenarios -- largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. CulturALL is built via a human--AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.

evalsfrontier_llmindustry

#109

Detecting Data Contamination in Large Language Models

Frontier LLMs 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Juliusz Janicki, Savvas Chamezopoulos, Evangelos Kanoulas, Georgios Tsatsaronis

4.2

I 4.8 Im 4.0 P 3.4

Large Language Models (LLMs) utilize large amounts of data for their training, some of which may come from copyrighted sources. Membership Inference Attacks (MIA) aim to detect those documents and whether they have been included in the training corpora of the LLMs. The black-box MIAs require a significant amount of data manipulation; therefore, their comparison is often challenging. We study state-of-the-art (SOTA) MIAs under the black-box assumptions and compare them to each other using a unified set of datasets to determine if any of them can reliably detect membership under SOTA LLMs. In addition, a new method, called the Familiarity Ranking, was developed to showcase a possible approach to black-box MIAs, thereby giving LLMs more freedom in their expression to understand their reasoning better. The results indicate that none of the methods are capable of reliably detecting membership in LLMs, as shown by an AUC-ROC of approximately 0.5 for all methods across several LLMs. The higher TPR and FPR for more advanced LLMs indicate higher reasoning and generalizing capabilities, showcasing the difficulty of detecting membership in LLMs using black-box MIAs.

frontier_llmroboticsefficiencyinfra

#110

Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?

Evaluations & Benchmarks 2026-04-21 arXiv cs.CL (Computation & Language)

Sho Hoshino, Ukyo Honda, Peinan Zhang

4.2

I 4.0 Im 4.4 P 3.7

While self-consistency is known to improve performance on symbolic reasoning, its effect on the recall of encyclopedic knowledge is unclear due to a lack of targeted evaluation grounds. To address this, we establish such a knowledge recall split for the popular MMLU benchmark by applying a data-driven heuristic from prior work. We validate this split by showing that the performance patterns on the symbolic reasoning and knowledge recall subsets mirror those of GSM8K and MedMCQA, respectively. Using this solid ground, we find that self-consistency consistently improves performance across both symbolic reasoning and knowledge recall, even though its underlying CoT prompting is primarily effective for symbolic reasoning. As a result, we achieve an 89\% accuracy on MMLU, the best performance to date with the use of GPT-4o.

evalsfrontier_llm

#111

Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection

Frontier LLMs 2026-04-21 arXiv cs.LG (Machine Learning)

Divyesh Gabbireddy, Suman Saha

4.2

I 4.0 Im 4.4 P 3.7

Cross-site scripting (XSS) remains a persistent web security vulnerability, especially because obfuscation can change the surface form of a malicious payload while preserving its behavior. These transformations make it difficult for traditional and machine learning-based detection systems to reliably identify attacks. Existing approaches for generating obfuscated payloads often emphasize syntactic diversity, but they do not always ensure that the generated samples remain behaviorally valid. This paper presents a structured pipeline for generating and evaluating obfuscated XSS payloads using large language models (LLMs). The pipeline combines deterministic transformation techniques with LLM-based generation and uses a browser- based runtime evaluation procedure to compare payload behavior in a controlled execution environment. This allows generated samples to be assessed through observable runtime behavior rather than syntactic similarity alone. In the evaluation, an untuned baseline language model achieves a runtime behavior match rate of 0.15, while fine-tuning on behavior-preserving source-target obfuscation pairs improves the match rate to 0.22. Although this represents a measurable improvement, the results show that current LLMs still struggle to generate obfuscations that preserve observed runtime behavior. A downstream classifier evaluation further shows that adding generated payloads does not improve detection performance in this setting, although behavior- filtered generated samples can be incorporated without materially degrading performance. Overall, the study demonstrates both the promise and the limits of applying generative models to adversarial security data generation and emphasizes the importance of runtime behavior checks in improving the quality of generated data for downstream detection systems.

frontier_llmevalsagentspost_training

#112

Face Anything: 4D Face Reconstruction from Any Image Sequence

Evaluations & Benchmarks 2026-04-21 arXiv cs.CV (Computer Vision)

Umut Kocasari, Simon Giebenhain, Richard Shaw, Matthias Nießner

4.2

I 4.8 Im 4.4 P 3.0

Accurate reconstruction and tracking of dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method enables accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation using a transformer-based model that jointly predicts depth and canonical facial coordinates, trained using multi-view geometry data that non-rigidly warps into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, achieving approximately 3$\times$ lower correspondence error and faster inference than prior dynamic reconstruction methods, while improving depth accuracy by 16%. These results highlight canonical facial point prediction as an effective foundation for unified feed-forward 4D facial reconstruction.

evalsefficiency

#113

FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition

Evaluations & Benchmarks 2026-04-21 arXiv cs.LG (Machine Learning)

Rudolf Debelak

4.2

I 4.0 Im 4.4 P 3.7

The evaluation of machine learning models typically relies mainly on performance metrics based on loss functions, which risk to overlook changes in performance in relevant subgroups. Auditing tools such as SliceFinder and SliceLine were proposed to detect such groups, but usually have conceptual disadvantages, such as the inability to directly address continuous covariates. In this paper, we introduce FairTree, a novel algorithm adapted from psychometric invariance testing. Unlike SliceFinder and related algorithms, FairTree directly handles continuous, categorical, and ordinal features without discretization. It further decomposes performance disparities into systematic bias and variance, allowing a categorization of changes in algorithm performance. We propose and evaluate two variations of the algorithm: a permutation-based approach, which is conceptually closer to SliceFinder, and a fluctuation test. Through simulation studies that include a direct comparison with SliceLine, we demonstrate that both approaches have a satisfactory rate of false-positive results, but that the fluctuation approach has relatively higher power. We further illustrate the method on the UCI Adult Census dataset. The proposed algorithms provide a flexible framework for the statistical evaluation of the performance and aspects of fairness of machine learning models in a wide range of applications even in relatively small data.

evalsinterpretability

#114

GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes

Agents & Tool Use 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Joshua Niemeijer, Alaa Eddine Ben Zekri, Reza Bahmanyar, Philipp M. Schmälzle +2

4.2

I 4.0 Im 4.8 P 3.4

Understanding road scenes in a geometrically consistent, scene-centric representation is crucial for planning and mapping. We present GOLD-BEV, a framework that learns dense bird's-eye-view (BEV) semantic environment maps-including dynamic agents-from ego-centric sensors, using time-synchronized aerial imagery as supervision only during training. BEV-aligned aerial crops provide an intuitive target space, enabling dense semantic annotation with minimal manual effort and avoiding the ambiguity of ego-only BEV labeling. Crucially, strict aerial-ground synchronization allows overhead observations to supervise moving traffic participants and mitigates the temporal inconsistencies inherent to non-synchronized overhead sources. To obtain scalable dense targets, we generate BEV pseudo-labels using domain-adapted aerial teachers, and jointly train BEV segmentation with optional pseudo-aerial BEV reconstruction for interpretability. Finally, we extend beyond aerial coverage by learning to synthesize pseudo-aerial BEV images from ego sensors, which support lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled drives.

agentsinterpretabilityinfra

#115

GitHub Trending — agent-native repos dominate this week

Agents & Tool Use 2026-04-21 GitHub Trending

4.2

I 4.9 Im 4.0 P 3.6

GitHub weekly Python trending is dominated by self-evolving agents and agent frameworks: hermes-agent (NousResearch, +25k stars/week), GenericAgent (+4.2k), openai-agents-python (+3.5k), VoxCPM2 tokenizer-free TTS (+2.6k), DFlash block-diffusion speculative decoding (+909), Kronos financial markets foundation model (+2.5k), ppt-master native pptx generation (+1.9k).

agentsai_codingfrontier_llmgenerative_media

#116

HardNet++: Nonlinear Constraint Enforcement in Neural Networks

Research 2026-04-21 arXiv cs.LG (Machine Learning)

Andrea Goertzen, Kaveh Alim, Navid Azizan

4.2

I 4.0 Im 4.6 P 3.7

Enforcing constraint satisfaction in neural network outputs is critical for safety, reliability, and physical fidelity in many control and decision-making applications. While soft-constrained methods penalize constraint violations during training, they do not guarantee constraint adherence during inference. Other approaches guarantee constraint satisfaction via specific parameterizations or a projection layer, but are tailored to specific forms (e.g., linear constraints), limiting their utility in other general problem settings. Many real-world problems of interest are nonlinear, motivating the development of methods that can enforce general nonlinear constraints. To this end, we introduce HardNet++, a constraint-enforcement method that simultaneously satisfies linear and nonlinear equality and inequality constraints. Our approach iteratively adjusts the network output via damped local linearizations. Each iteration is differentiable, admitting an end-to-end training framework, where the constraint satisfaction layer is active during training. We show that under certain regularity conditions, this procedure can enforce nonlinear constraint satisfaction to arbitrary tolerance. Finally, we demonstrate tight constraint adherence without loss of optimality in a learning-for-optimization context, where we apply this method to a model predictive control problem with nonlinear state constraints.

researchsafety_policyefficiencyinfra

#117

Safety, Policy & Regulation 2026-04-21 CSET — Center for Security and Emerging Technology

4.2

I 4.0 Im 4.6 P 3.7

CSET (Crichton, Reddy, Ji) published "Operationalizing AI Guidance," a reference guide drawing on 1,200+ resources to translate high-level AI principles into deployable practice across the adoption lifecycle. Target audience is implementers at agencies and large firms working through safety, security, and governance checkpoints.

safety_policyindustry

#122

Pentagon officials broadly detail $55 billion drone plan under DAWG

Government & Defense 2026-04-21 Breaking Defense

Ashley Roque

4.2

I 4.8 Im 4.0 P 3.4

The bulk of funding comes in the form of reconciliation, a bet the department also made for a proposed hike to its Office of Strategic Capital loan program.

gov_defenseindustry

#123

Quadruped Parkour Learning: Sparsely Gated Mixture of Experts with Visual Input

Efficiency 2026-04-21 arXiv cs.RO (Robotics)

Michael Ziegltrum, Jianhao Jiao, Tianhu Peng, Chengxu Zhou +1

4.2

I 4.0 Im 4.8 P 3.4

Robotic parkour provides a compelling benchmark for advancing locomotion over highly challenging terrain, including large discontinuities such as elevated steps. Recent approaches have demonstrated impressive capabilities, including dynamic climbing and jumping, but typically rely on sequential multilayer perceptron (MLP) architectures with densely activated layers. In contrast, sparsely gated mixture-of-experts (MoE) architectures have emerged in the large language model domain as an effective paradigm for improving scalability and performance by activating only a subset of parameters at inference time. In this work, we investigate the application of sparsely gated MoE architectures to vision-based robotic parkour. We compare control policies based on standard MLPs and MoE architectures under a controlled setting where the number of active parameters at inference time is matched. Experimental results on a real Unitree Go2 quadruped robot demonstrate clear performance gains, with the MoE policy achieving double the number of successful trials in traversing large obstacles compared to a standard MLP baseline. We further show that achieving comparable performance with a standard MLP requires scaling its parameter count to match that of the total MoE model, resulting in a 14.3\% increase in computation time. These results highlight that sparsely gated MoE architectures provide a favorable trade-off between performance and computational efficiency, enabling improved scaling of control policies for vision-based robotic parkour. An anonymized link to the codebase is https://osf.io/v2kqj/files/github?view_only=7977dee10c0a44769184498eaba72e44.

efficiencyroboticsfrontier_llmai_coding

#124

Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics

Evaluations & Benchmarks 2026-04-21 arXiv cs.CL (Computation & Language)

Dmitry Pronin, Evgeny Kazartsev

4.2

I 4.0 Im 4.4 P 3.7

This article introduces two new measures for authorship attribution - Rank-Turbulence Delta and Jensen-Shannon Delta - which generalise Burrows's classical Delta by applying distance functions designed for probabilistic distributions. We first set out the theoretical basis of the measures, contrasting centred and uncentred z-scoring of word-frequency vectors and re-casting the uncentred vectors as probability distributions. Building on this representation, we develop a token-level decomposition that renders every Delta distance numerically interpretable, thereby facilitating close reading and the validation of results. The effectiveness of the methods is assessed on four literary corpora in English, German, French and Russian. The English, German and French datasets are compiled from Project Gutenberg, whereas the Russian benchmark is the SOCIOLIT corpus containing 755 works by 180 authors spanning the eighteenth to the twenty-first centuries. Rank-Turbulence Delta attains attribution accuracy comparable with Cosine Delta; Jensen-Shannon Delta consistently matches or exceeds the performance of canonical Burrows's Delta. Finally, several established attribution algorithms are re-evaluated on the extended SOCIOLIT corpus.

evals

#125

Revisiting Catastrophic Forgetting in Continual Knowledge Graph Embedding

Evaluations & Benchmarks 2026-04-21 arXiv cs.LG (Machine Learning)

Gerard Pons, Carlos Escolano, Besim Bilalli, Anna Queralt

4.2

I 4.0 Im 4.4 P 3.7

Knowledge Graph Embeddings (KGEs) support a wide range of downstream tasks over Knowledge Graphs (KGs). In practice, KGs evolve as new entities and facts are added, motivating Continual Knowledge Graph Embedding (CKGE) methods that update embeddings over time. Current CKGE approaches address catastrophic forgetting (i.e., the performance degradation on previously learned tasks) primarily by limiting changes to existing embeddings. However, we show that this view is incomplete. When new entities are introduced, their embeddings can interfere with previously learned ones, causing the model to predict them in place of previously correct answers. This phenomenon, which we call entity interference, has been largely overlooked and is not accounted for in current CKGE evaluation protocols. As a result, the assessment of catastrophic forgetting becomes misleading, and CKGE methods performance is systematically overestimated. To address this issue, we introduce a corrected CKGE evaluation protocol that accounts for entity interference. Through experiments on multiple benchmarks, we show that ignoring this effect can lead to performance overestimation of up to 25%, particularly in scenarios with significant entity growth. We further analyze how different CKGE methods and KGE models are affected by the different sources of forgetting, and introduce a catastrophic forgetting metric tailored to CKGE.

evals

#126

Revisiting and Expanding the IPv6 Network Periphery: Global-Scale Measurement and Security Analysis

Interpretability 2026-04-21 arXiv — Mechanistic Interpretability

Zixuan Xie, Zitao Yang, Shurui Fang, Zhaoyang Li +4

4.2

I 4.0 Im 4.4 P 3.7

As IPv6 deployment accelerates, understanding the evolving security posture of network peripheries becomes increasingly important. A DSN 2021 study introduced the first large-scale discovery of IPv6 network peripheries, uncovering risks like service exposure and routing loops. However, its scope was limited to three regions and is now outdated. In this paper, we revisit and significantly expand upon that work, presenting a comprehensive, up-to-date security assessment of IPv6 network peripheries. To support efficient large-scale scanning, we propose a novel Response-Guided Prefix Selection (RGPS) strategy to identify high-value IPv6 prefixes for probing. Our global-scale measurement covers 73 countries/regions and identifies over 281.9M active IPv6 network peripheries, including a 371.2\% increase (245M) over the 52M reported in 2021 for India, China, and America. Our service exposure analysis shows that 2.5\% of reachable services are still dangerously exposed, including outdated administrative interfaces and misconfigured servers, while correlation with known CVEs reveals recurring software vulnerabilities. Building on this service-exposure perspective, we further design a Hierarchical LLM Exposure Verification (HLEV) framework to identify unauthorized-access risks in exposed LLM deployment tools, revealing multiple security weaknesses caused by insecure default configurations and missing authentication. Additionally, we revisit routing loop vulnerabilities and identify 4.5M loop-prone responses, confirming that flawed routing behaviors remain widespread across vendors and countries/regions. These findings suggest that while IPv6 adoption has surged, key security challenges persist and are structurally embedded.

interpretabilityresearchfrontier_llmefficiency

#127

Separating Geometry from Probability in the Analysis of Generalization

Research 2026-04-21 arXiv cs.LG (Machine Learning)

Maxim Raginsky, Benjamin Recht

4.2

I 4.0 Im 4.4 P 3.7

The goal of machine learning is to find models that minimize prediction error on data that has not yet been seen. Its operational paradigm assumes access to a dataset $S$ and articulates a scheme for evaluating how well a given model performs on an arbitrary sample. The sample can be $S$ (in which case we speak of ``in-sample'' performance) or some entirely new $S'$ (in which case we speak of ``out-of-sample'' performance). Traditional analysis of generalization assumes that both in- and out-of-sample data are i.i.d.\ draws from an infinite population. However, these probabilistic assumptions cannot be verified even in principle. This paper presents an alternative view of generalization through the lens of sensitivity analysis of solutions of optimization problems to perturbations in the problem data. Under this framework, generalization bounds are obtained by purely deterministic means and take the form of variational principles that relate in-sample and out-of-sample evaluations through an error term that quantifies how close out-of-sample data are to in-sample data. Statistical assumptions can then be used \textit{ex post} to characterize the situations when this error term is small (either on average or with high probability).

researchevals

#128

SimDiff: Depth Pruning via Similarity and Difference

Frontier LLMs 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Yuli Chen, Shuhao Zhang, Fanshen Meng, Bo Cheng +3

4.2

I 4.8 Im 4.0 P 3.4

Depth pruning improves the deployment efficiency of large language models (LLMs) by identifying and removing redundant layers. A widely accepted standard for this identification process is to measure the similarity between layers using cosine distance. However, we find that methods relying solely on this one-dimensional heuristic can exhibit unpredictable performance and even catastrophic collapse across different architectures. To address this issue, we propose SimDiff, a novel layer importance criterion that jointly evaluates layers from two orthogonal perspectives: representational similarity and transformation difference. The difference is quantified using two distinct metrics: MSSD, which is sensitive to outliers and identifies layers that make decisive corrections, and MASD, which robustly measures a layer's average contribution. Extensive experiments on multiple models ranging from 0.5B to 13B parameters demonstrate that SimDiff significantly outperforms state-of-the-art baselines across various pruning ratios. Notably, our method retains over 91% of LLaMA2-7B's performance at a 25% pruning ratio and achieves up to a 1.49x inference speedup when pruning 12 layers on LLaMA3.1-8B. We also show that pruned models can be effectively recovered with minimal fine-tuning.

frontier_llmefficiencypost_trainingevals

#129

SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

Generative Media 2026-04-21 arXiv cs.CV (Computer Vision)

Ying Zeng, Miaosen Luo, Guangyuan Li, Yang Yang +9

4.2

I 5.3 Im 4.0 P 3.0

Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for adjusting image quality and camera parameters. However, this paradigm relies on explicit human instruction of aesthetic intent, which is often ambiguous, incomplete, or inaccessible to non-expert users. In this work, we propose SmartPhotoCrafter, an automatic photographic image editing method which formulates image editing as a tightly coupled reasoning-to-generation process. The proposed model first performs image quality comprehension and identifies deficiencies by the Image Critic module, and then the Photographic Artist module realizes targeted edits to enhance image appeal, eliminating the need for explicit human instructions. A multi-stage training pipeline is adopted: (i) Foundation pretraining to establish basic aesthetic understanding and editing capabilities, (ii) Adaptation with reasoning-guided multi-edit supervision to incorporate rich semantic guidance, and (iii) Coordinated reasoning-to generation reinforcement learning to jointly optimize reasoning and generation. During training, SmartPhotoCrafter emphasizes photo-realistic image generation, while supporting both image restoration and retouching tasks with consistent adherence to color- and tone-related semantics. We also construct a stage-specific dataset, which progressively builds reasoning and controllable generation, effective cross-module collaboration, and ultimately high-quality photographic enhancement. Experiments demonstrate that SmartPhotoCrafter outperforms existing generative models on the task of automatic photographic enhancement, achieving photo-realistic results while exhibiting higher tonal sensitivity to retouching instructions. Project page: https://github.com/vivoCameraResearch/SmartPhotoCrafter.

generative_mediaresearchpost_trainingrl

#130

Space Force scrambles to repair workforce as massive budget increase looms

Government & Defense 2026-04-22 Defense One

Thomas Novelly

4.2

I 4.8 Im 4.0 P 3.4

The service is trying to recruit at a record pace even as Pentagon officials insist civilian departures didn't hurt acquisition.

gov_defenseindustry

#131

SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

Multimodal 2026-04-21 arXiv cs.CV (Computer Vision)

Zewei Zhou, Ruining Yang, Xuewei, Qi +7

4.2

I 4.4 Im 4.9 P 3.0

Vision-Language-Action (VLA) models offer a promising autonomous driving paradigm for leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often struggle with the high latency in action generation using an autoregressive generation framework and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework, integrating an autoregressive reasoning and a flow-matching action expert. First, SpanVLA introduces an efficient bridge to leverage the vision and reasoning guidance of VLM to efficiently plan future trajectories using a flow-matching policy conditioned on historical trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of the SpanVLA model, we propose a GRPO-based post-training method to enable the VLA model not only to learn from positive driving samples but also to learn how to avoid the typical negative behaviors and learn recovery behaviors. We further introduce mReasoning, a new real-world driving reasoning dataset, focusing on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on the NAVSIM (v1 and v2) demonstrate the competitive performance of the SpanVLA model. Additionally, the qualitative results across diverse scenarios highlight the planning performance and robustness of our model.

multimodalroboticsefficiencypost_training

#132

Structure-Semantic Decoupled Modulation of Global Geospatial Embeddings for High-Resolution Remote Sensing Mapping

Frontier LLMs 2026-04-21 arXiv cs.CV (Computer Vision)

Jienan Lyu, Miao Yang, Jinchen Cai, Yiwen Hu +3

4.2

I 5.3 Im 4.0 P 3.0

Fine-grained high-resolution remote sensing mapping typically relies on localized visual features, which restricts cross-domain generalizability and often leads to fragmented predictions of large-scale land covers. While global geospatial foundation models offer powerful, generalizable representations, directly fusing their high-dimensional implicit embeddings with high-resolution visual features frequently triggers feature interference and spatial structure degradation due to a severe semantic-spatial gap. To overcome these limitations, we propose a Structure-Semantic Decoupled Modulation (SSDM) framework, which decouples global geospatial representations into two complementary cross-modal injection pathways. First, the structural prior modulation branch introduces the macroscopic receptive field priors from global representations into the self-attention modules of the high-resolution encoder. By guiding local feature extraction with holistic structural constraints, it effectively suppresses prediction fragmentation caused by high-frequency detail noise and excessive intra-class variance. Second, the global semantic injection branch explicitly aligns holistic context with the deep high-resolution feature space and directly supplements global semantics via cross-modal integration, thereby significantly enhancing the semantic consistency and category-level discrimination of complex land covers. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared to existing cross-modal fusion approaches. By unleashing the potential of global embeddings, SSDM consistently improves high-resolution mapping accuracy across diverse scenarios, providing a universal and effective paradigm for integrating geospatial foundation models into high-resolution vision tasks.

frontier_llmai_codinginterpretability

#133

Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

Evaluations & Benchmarks 2026-04-21 arXiv cs.CV (Computer Vision)

Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao +15

4.2

I 4.6 Im 4.8 P 3.0

Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.

evalsgenerative_mediaefficiencyinfra

#134

Ukraine launches interceptor drone from USV, destroys Shahed in first

Government & Defense 2026-04-21 Breaking Defense

4.2

I 4.8 Im 4.0 P 3.4

In a first reported by Breaking Defense, Ukraine's drone force launched an interceptor drone from a USV to destroy an incoming Shahed. Demonstrates cross-domain autonomy and targeting pipelines maturing in combat use.

gov_defense

#135

Ultrametric OGP - parametric RDT \emph{symmetric} binary perceptron connection

Evaluations & Benchmarks 2026-04-21 arXiv cs.LG (Machine Learning)

Mihailo Stojnic

4.2

I 4.0 Im 4.4 P 3.7

In [97,99,100], an fl-RDT framework is introduced to characterize \emph{statistical computational gaps} (SCGs). Studying \emph{symmetric binary perceptrons} (SBPs), [100] obtained an \emph{algorithmic} threshold estimate $α_a\approx α_c^{(7)}\approx 1.6093$ at the 7th lifting level (for $κ=1$ margin), closely approaching $1.58$ local entropy (LE) prediction [18]. In this paper, we further connect parametric RDT to overlap gap properties (OGPs), another key geometric feature of the solution space. Specifically, for any positive integer $s$, we consider $s$-level ultrametric OGPs ($ult_s$-OGPs) and rigorously upper-bound the associated constraint densities $α_{ult_s}$. To achieve this, we develop an analytical union-bounding program consisting of combinatorial and probabilistic components. By casting the combinatorial part as a convex problem and the probabilistic part as a nested integration, we conduct numerical evaluations and obtain that the tightest bounds at the first two levels, $\barα_{ult_1} \approx 1.6578$ and $\barα_{ult_2} \approx 1.6219$, closely approach the 3rd and 4th lifting level parametric RDT estimates, $α_c^{(3)} \approx 1.6576$ and $α_c^{(4)} \approx 1.6218$. We also observe excellent agreement across other key parameters, including overlap values and the relative sizes of ultrametric clusters. Based on these observations, we propose several conjectures linking $ult$-OGP and parametric RDT. Specifically, we conjecture that algorithmic threshold $α_a=\lim_{s\rightarrow\infty} α_{ult_s} = \lim_{s\rightarrow\infty} \barα{ult_s} = \lim_{r\rightarrow\infty} α_{c}^{(r)}$, and $α_{ult_s} \leq α_{c}^{(s+2)}$ (with possible equality for some (maybe even all) $s$). Finally, we discuss the potential existence of a full isomorphism connecting all key parameters of $ult$-OGP and parametric RDT.

evalsinterpretability

#136

What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)arXiv cs.NE (Neural & Evolutionary Computing)

Xinhao Zhang, Xi Chen, François Portet, Maxime Peyrard

4.2

I 3.0 Im 4.0 P 5.1

Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary search, collecting optimization trajectories for 15 LLMs across 8 tasks. Although zero-shot problem-solving ability correlates with final optimization outcomes, it explains only part of the variance: models with similar initial capability often induce dramatically different search trajectories and outcomes. By analyzing these trajectories, we find that strong LLM optimizers behave as local refiners, producing frequent incremental improvements while progressively localizing the search in semantic space. Conversely, weaker optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory analysis for understanding and improving LLM-based optimization systems and provide actionable insights for their design and training.

How it was discussed across sources

arXiv cross-listings: Listed under cs.CL (Computation & Language), cs.NE (Neural & Evolutionary Computing) — spans multiple research subfields.

frontier_llmagentsresearchrobotics

#137

When Active Learning Falls Short: An Empirical Study on Chemical Reaction Extraction

AI Coding 2026-04-21 arXiv cs.LG (Machine Learning)

Simin Yu, Sufia Fathima

4.2

I 4.5 Im 4.0 P 3.7

The rapid growth of chemical literature has generated vast amounts of unstructured data, where reaction information is particularly valuable for applications such as reaction predictions and drug design. However, the prohibitive cost of expert annotation has led to a scarcity of training data, severely hindering the performance of automatic reaction extraction. In this work, we conduct a systematic study of active learning for chemical reaction extraction. We integrate six uncertainty- and diversity-based strategies with pretrained transformer-CRF architectures, and evaluate them on product extraction and role labeling task. While several methods approach full-data performance with fewer labeled instances, learning curves are often non-monotonic and task-dependent. Our analysis shows that strong pretraining, structured CRF decoding, and label sparsity limit the stability of conventional active learning strategies. These findings provide practical insights for the effective use of active learning in chemical information extraction.

ai_codingevalsai_scienceinfra

#138

ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications

Infrastructure 2026-04-21 arXiv cs.LG (Machine Learning)

Suvinava Basak

4.2

I 4.0 Im 4.4 P 3.7

Batch Normalization (BN) is a cornerstone of deep learning, yet it fundamentally breaks down in micro-batch regimes (e.g., 3D medical imaging) and non-IID Federated Learning. Removing BN from deep architectures, however, often leads to catastrophic training failures such as vanishing gradients and dying channels. We identify that standard activation functions, like Swish and ReLU, exacerbate this instability in BN-free networks due to their non-zero-centered nature, which causes compounding activation mean-shifts as network depth increases. In this technical communication, we propose Zero-Centered Swish (ZC-Swish), a drop-in activation function parameterized to dynamically anchor activation means near zero. Through targeted stress-testing on BN-free convolutional networks at depths 8, 16, and 32, we demonstrate that while standard Swish collapses to near-random performance at depth 16 and beyond, ZC-Swish maintains stable layer-wise activation dynamics and achieves the highest test accuracy at depth 16 (51.5%) with seed 42. ZC-Swish thus provides a robust, parameter-efficient solution for stabilizing deep networks in memory-constrained and privacy-preserving applications where traditional normalization is unviable.

infraresearchefficiency

#139

A Gesture-Based Visual Learning Model for Acoustophoretic Interactions using a Swarm of AcoustoBots

Multimodal 2026-04-21 arXiv cs.RO (Robotics)

Alex Lin, Lei Gao, Narsimlu Kemsaram, Sriram Subramanian

4.1

I 4.0 Im 4.4 P 3.4

AcoustoBots are mobile acoustophoretic robots capable of delivering mid-air haptics, directional audio, and acoustic levitation, but existing implementations rely on scripted commands and lack an intuitive interface for real-time human control. This work presents a gesture-based visual learning framework for contactless human-swarm interaction with a multimodal AcoustoBot platform. The system combines ESP32-CAM gesture capture, PhaseSpace motion tracking, centralized processing, and an OpenCLIP-based visual learning model (VLM) with linear probing to classify three hand gestures and map them to haptics, audio, and levitation modalities. Validation accuracy improved from about 67% with a small dataset to nearly 98% with the largest dataset. In integrated experiments with two AcoustoBots, the system achieved an overall gesture-to-modality switching accuracy of 87.8% across 90 trials, with an average end-to-end latency of 3.95 seconds. These results demonstrate the feasibility of using a vision-language-model-based gesture interface for multimodal human-swarm interaction. While the current system is limited by centralized processing, a static gesture set, and controlled-environment evaluation, it establishes a foundation for more expressive, scalable, and accessible swarm robotic interfaces.

multimodalevalsroboticsaudio

#140

A Network-Aware Evaluation of Distributed Energy Resource Control in Smart Distribution Systems

Evaluations & Benchmarks 2026-04-21 arXiv cs.CV (Computer Vision)

Houchao Gan

4.1

I 4.0 Im 5.0 P 3.0

Distribution networks with high penetration of Distributed Energy Resources (DERs) increasingly rely on communication networks to coordinate grid-interactive control. While many distributed control schemes have been proposed, they are often evaluated under idealized communication assumptions, making it difficult to assess their performance under realistic network conditions. This work presents an implementation-driven evaluation of a representative virtual power plant (VPP) dispatch algorithm using a co-simulation framework that couples a linearized distribution-system model with packet-level downlink emulation in ns-3. The study considers a modified IEEE~37-node feeder with high photovoltaic penetration and a primal--dual VPP dispatch that simultaneously targets feeder-head active power tracking and voltage regulation. Communication effects are introduced only on the downlink path carrying dual-variable updates, where per-DER packet delays and a hold-last-value strategy are modeled. Results show that, under ideal communication, the dispatch achieves close tracking of the feeder-head power reference while maintaining voltages within the prescribed limits at selected buses. When realistic downlink delay is introduced, the same controller exhibits large oscillations in feeder-head power and more frequent voltage limit violations. These findings highlight that distributed DER control performance can be strongly influenced by communication behavior and motivate evaluation frameworks that explicitly incorporate network dynamics into the assessment of grid-interactive control schemes.

evalsresearchsafety_policy

#141

AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories

Agents & Tool Use 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Xue Xia, Chengkai Yao, Mingyu Tsoi, Xinjie Mao +8

4.1

I 4.0 Im 4.4 P 3.4

Systematic ablations are essential to attribute performance gains in AI Virtual Cells, yet they are rarely performed because biological repositories are under-standardized and tightly coupled to domain-specific data and formats. While recent coding agents can translate ideas into implementations, they typically stop at producing code and lack a verifier that can reproduce strong baselines and rigorously test which components truly matter. We introduce AblateCell, a reproduce-then-ablate agent for virtual cell repositories that closes this verification gap. AblateCell first reproduces reported baselines end-to-end by auto-configuring environments, resolving dependency and data issues, and rerunning official evaluations while emitting verifiable artifacts. It then conducts closed-loop ablation by generating a graph of isolated repository mutations and adaptively selecting experiments under a reward that trades off performance impact and execution cost. Evaluated on three single-cell perturbation prediction repositories (CPA, GEARS, BioLORD), AblateCell achieves 88.9% (+29.9% to human expert) end-to-end workflow success and 93.3% (+53.3% to heuristic) accuracy in recovering ground-truth critical components. These results enable scalable, repository-grounded verification and attribution directly on biological codebases.

agentsai_codingevalsrl

#142

Government & Defense 2026-04-21 FedScoop — AI

mbracken

4.1

I 4.0 Im 4.5 P 3.6

In his Senate confirmation hearing, Kevin Warsh said the Federal Reserve should be forward-looking and reform-oriented when it comes to AI models like Anthropic’s Mythos. The post Fed needs to be proactive on cyber risks from AI models, nominee says appeared first on FedScoop.

#147

Goal-Oriented Semantic Communication for Logical Decision Making

Agents & Tool Use 2026-04-21 arXiv — Agents / Tool Use

Ahmet Faruk Saz, Faramarz Fekri

4.1

I 4.0 Im 4.6 P 3.4

This paper develops a principled foundation for goal-oriented semantic communication for logical decision-making. Consider a setting where autonomous agents engage in collaborative perception. In such settings, the volume of sensory data and limited bandwidth often make transmission of raw observations infeasible, requiring intelligent selection of task-relevant information. Because these scenarios are safety-critical, the selection and decision processes must also be transparent and verifiable. To address this, we propose an explainable semantic communication framework grounded in a First-Order Logic (FOL) hierarchical representation of the world. We define semantic information, entropy, conditional entropy, and mutual information by assigning an inductive logical probability measure over semantic structures in the language. Based on these definitions, we formulate a goal-oriented semantic communication objective through semantic rate-distortion theory and, equivalently, through the semantic information bottleneck principle. In this framework, task rules are represented as goal-oriented states, defined as a layer over the world states to capture decision-relevant abstractions. The resulting principle selects evidence that is most informative about these states, aiming to transmit only those FOL clauses most critical for decision-making while preserving logical verifiability. We demonstrate the effectiveness of the approach in a deduction-based safe path-following task within an FOL-based urban environment simulator with multiple dynamic agents.

agentsresearchsafety_policyinfra

#148

Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity

Agents & Tool Use 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Farbod Zorriassatine, Ahmad Lotfi

4.1

I 4.0 Im 4.6 P 3.4

Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet functioned as universal solutions across care pathways and safety-critical environments. This is largely due to limitations in consistently handling real-world complexity, particularly poor context awareness, high false alarm rates, environmental noise, and data scarcity. We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and more effectively addressed through an agentic AI system. More broadly, this perspective enables the early identification of subtle deviations in movement patterns associated with increased risk, whether arising from age-related decline, fatigue, or environmental factors. While technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights potential value. This framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.

agentssafety_policyindustry

#149

Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems

Research 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Hongwei Xu

4.1

I 4.0 Im 4.5 P 3.4

Teams of LLM agents increasingly collaborate on tasks spanning days or weeks: multi-day data-generation sprints where generator, reviewer, and auditor agents coordinate in real time on overlapping batches; specialists carrying findings forward across session restarts; product decisions compounding over many review rounds. This requires agents to share, evaluate, and combine each other's cognitive state in real time across sessions. We call this cross-session agent-to-agent cognitive collaboration, distinct from parallel agent execution. To enable it, three problems must be solved together. (P1) Each agent decides field by field what to accept from peers, not accept or reject whole messages. (P2) Every claim is traceable to source, so returning claims are recognised as echoes of the receiver's own prior thinking. (P3) Memory that survives session restarts is relevant because of how it was stored, not how it is retrieved. These are protocol-level properties at the semantic layer of agent communication, distinct from tool-access and task-delegation protocols at lower layers. We call this missing protocol layer "semantic infrastructure," and the Mesh Memory Protocol (MMP) specifies it. Four composable primitives work together: CAT7, a fixed seven-field schema for every Cognitive Memory Block (CMB); SVAF, which evaluates each field against the receiver's role-indexed anchors and realises P1; inter-agent lineage, carried as parents and ancestors of content-hash keys and realising P2; and remix, which stores only the receiver's own role-evaluated understanding of each accepted CMB, never the raw peer signal, realising P3. MMP is specified, shipped, and running in production across three reference deployments, where each session runs an autonomous agent as a mesh peer with its own identity and memory, collaborating with other agents across the network for collective intelligence.

researchfrontier_llmagentsevals

#150

Frontier LLMs 2026-04-22 Latent Space (swyx & Alessio)

4.1

I 4.0 Im 4.0 P 4.0

with Cursor getting a $10B contract with xAI and a right to acquire for $60B.

frontier_llm

#157

scosman/pelicans_riding_bicycles

Frontier LLMs 2026-04-21 Simon Willison's Weblog

4.1

I 4.0 Im 4.0 P 4.0

scosman/pelicans_riding_bicycles I firmly approve of Steve Cosman's efforts to pollute the training set of pelicans riding bicycles. (To be fair, most of the examples I've published count as poisoning too.) Via Hacker News comment Tags: ai, generative-ai, llms, training-data, pelican-riding-a-bicycle

frontier_llminfra

#158

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)

Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang

4.0

I 3.0 Im 5.0 P 3.7

The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.

frontier_llmevalssafety_policyindustry

#159

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Generative Media 2026-04-21 arXiv cs.CV (Computer Vision)

Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo +2

4.0

I 4.8 Im 4.0 P 3.0

Synthesizing human--object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand--object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.

generative_mediaaudioefficiencyinfra

#160

Disentangling Damage from Operational Variability: A Label-Free Self-Supervised Representation Learning Framework for Output-Only Structural Damage Identification

Research 2026-04-21 arXiv cs.LG (Machine Learning)

Xudong Jian, Charikleia Stoura, Simon Scandella, Eleni Chatzi

4.0

I 4.0 Im 4.0 P 3.7

Damage identification is a core task in structural health monitoring. In practice, however, its reliability is often compromised by confounding non-damage effects, such as variations in excitation and environmental conditions, which can induce changes comparable to or larger than those caused by structural damage. To address this challenge, this study proposes a self-supervised label-free disentangled representation learning framework for robust vibration-based structural damage identification. The proposed framework employs an autoencoder with two latent representations to learn directly from raw vibration acceleration signals. A self-supervised invariance regularization, implemented via Variance-Invariance-Covariance Regularization (VICReg), is imposed on one latent representation using baseline data where structural damage is assumed constant but operational and environmental conditions vary. In addition, a frequency-domain constraint is introduced to enforce agreement between the power spectral density reconstructed from the latent representation and that computed from the corresponding input time series. Together, these mechanisms promote disentanglement, enabling the learned representation to be sensitive to damage-related characteristics while remaining invariant to nuisance variability. The framework is trained in a fully end-to-end and label-free manner, requiring no prior information on damage, excitation, or environmental conditions, making it well-suited for real-world applications. Its effectiveness is validated on two distinct real-world vibration datasets, including a bridge and a gearbox. The results demonstrate robustness to operational variability, strong generalization capability, and good performance in both damage detection and quantification.

researchai_codinginfra

#161

Enhancing Unsupervised Keyword Extraction in Academic Papers through Integrating Highlights with Abstract

AI Coding 2026-04-21 arXiv cs.CL (Computation & Language)

Yi Xiang, Chengzhi Zhang

4.0

I 4.0 Im 4.0 P 3.7

Automatic keyword extraction from academic papers is a key area of interest in natural language processing and information retrieval. Although previous research has mainly focused on utilizing abstract and references for keyword extraction, this paper focuses on the highlights section - a summary describing the key findings and contributions, offering readers a quick overview of the research. Our observations indicate that highlights contain valuable keyword information that can effectively complement the abstract. To investigate the impact of incorporating highlights into unsupervised keyword extraction, we evaluate three input scenarios: using only the abstract, the highlights, and a combination of both. Experiments conducted with four unsupervised models on Computer Science (CS), Library and Information Science (LIS) datasets reveal that integrating the abstract with highlights significantly improves extraction performance. Furthermore, we examine the differences in keyword coverage and content between abstract and highlights, exploring how these variations influence extraction outcomes. The data and code are available at https://github.com/xiangyi-njust/Highlight-KPE.

ai_codingevalsinfra

#162

Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)

Nurkhan Laiyk, Gerard I. Gállego, Javier Ferrando, Fajri Koto

4.0

I 4.0 Im 4.0 P 3.7

Function vectors (FVs) are vector representations of tasks extracted from model activations during in-context learning. While prior work has shown that multilingual model representations can be language-agnostic, it remains unclear whether the same holds for function vectors. We study whether FVs exhibit language-agnosticity, using machine translation as a case study. Across three decoder-only multilingual LLMs, we find that translation FVs extracted from a single English$\rightarrow$Target direction transfer to other target languages, consistently improving the rank of correct translation tokens across multiple unseen languages. Ablation results show that removing the FV degrades translation across languages with limited impact on unrelated tasks. We further show that base-model FVs transfer to instruction-tuned variants and partially generalize from word-level to sentence-level translation.

frontier_llmai_codingpost_training

#163

FedSEA: Achieving Benefit of Parallelization in Federated Online Learning

Research 2026-04-21 arXiv cs.LG (Machine Learning)

Harekrushna Sahu, Pratik Jawanpuria, Pranay Sharma

4.0

I 4.0 Im 4.0 P 3.7

Online federated learning (OFL) has emerged as a popular framework for decentralized decision-making over continuous data streams without compromising client privacy. However, the adversary model assumed in standard OFL typically precludes any potential benefits of parallelization. Further, it fails to adequately capture the different sources of statistical variation in OFL problems. In this paper, we extend the OFL paradigm by integrating a stochastically extended adversary (SEA). Under this framework, the loss function remains fixed across clients over time. However, the adversary dynamically and independently selects the data distribution for each client at each time. We propose the \algoOFL{} algorithm to solve this problem, which utilizes online stochastic gradient descent at the clients, along with periodic global aggregation via the server. We establish bounds on the global network regret over a time horizon $T$ for two classes of functions: (1) for smooth and convex losses, we prove an $\mathcal{O}(\sqrt{T})$ bound, and (2) for smooth and strongly convex losses, we prove an $\mathcal{O}(\log T)$ bound. Through careful analysis, we quantify the individual impact of both spatial (across clients) and temporal (over time) data heterogeneity on the regret bounds. Consequently, we identify a regime of mild temporal variation (relative to stochastic gradient variance), where the network regret improves with parallelization. Hence, in the SEA setting, our results improve the existing pessimistic worst-case results in online federated learning.

research

#164

GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction

Research 2026-04-21 arXiv cs.CV (Computer Vision)

Pradyumna YM, Yuxuan Xue, Yue Chen, Nikita Kister +2

4.0

I 4.8 Im 4.0 P 3.0

Reconstructing physically plausible 3D human-scene interactions (HSI) from a single image currently presents a trade-off: optimization based methods offer accurate contact but are slow (~20s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts. Our key insight is that geometry-based human--scene fitting can be amortized into fast feed-forward inference. We present GRAFT (Geometric Refinement And Fitting Transformer), a learned HSI prior that predicts Interaction Gradients: corrective parameter updates that iteratively refine human meshes by reasoning about their 3D relationship to the surrounding scene. GRAFT encodes the interaction state into compact body-anchored tokens, each grounded in the scene geometry via Geometric Probes that capture spatial relationships with nearby surfaces. A lightweight transformer recurrently updates human meshes and re-probes the scene, ensuring the final pose aligns with both learned priors and observed geometry. GRAFT operates either as an end-to-end reconstructor using image features, or with geometry alone as a transferable plug-and-play HSI prior that improves feed-forward methods without retraining. Experiments show GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods and matches optimization-based interaction quality at ${\sim}50{\times}$ lower runtime, while generalizing seamlessly to in-the-wild multi-person scenes and being preferred in 64.8% of three-way user study. Project page: https://pradyumnaym.github.io/graft .

researchai_codinginterpretabilityefficiency

#165

Generative Drifting for Conditional Medical Image Generation

Generative Media 2026-04-21 arXiv cs.CV (Computer Vision)

Zirong Li, Siyuan Mei, Weiwen Wu, Andreas Maier +2

4.0

I 4.8 Im 4.0 P 3.0

Conditional medical image generation plays an important role in many clinically relevant imaging tasks. However, existing methods still face a fundamental challenge in balancing inference efficiency, patient-specific fidelity, and distribution-level plausibility, particularly in high-dimensional 3D medical imaging. In this work, we propose GDM, a generative drifting framework that reformulates deterministic medical image prediction as a multi-objective learning problem to jointly promote distribution-level plausibility and patient-specific fidelity while retaining one-step inference. GDM extends drifting to 3D medical imaging through an attractive-repulsive drift that minimizes the discrepancy between the generator pushforward and the target distribution. To enable stable drifting-based learning in 3D volumetric data, GDM constructs a multi-level feature bank from a medical foundation encoder to support reliable affinity estimation and drifting field computation across complementary global, local, and spatial representations. In addition, a gradient coordination strategy in the shared output space improves optimization balance under competing distribution-level and fidelity-oriented objectives. We evaluate the proposed framework on two representative tasks, MRI-to-CT synthesis and sparse-view CT reconstruction. Experimental results show that GDM consistently outperforms a wide range of baselines, including GAN-based, flow-matching-based, and SDE-based generative models, as well as supervised regression methods, while improving the balance among anatomical fidelity, quantitative reliability, perceptual realism, and inference efficiency. These findings suggest that GDM provides a practical and effective framework for conditional 3D medical image generation.

generative_mediaresearchai_codinginterpretability

#166

Heterogeneity-Aware Personalized Federated Learning for Industrial Predictive Analytics

Infrastructure 2026-04-21 arXiv cs.LG (Machine Learning)

Yuhan Hu, Xiaolei Fang

4.0

I 4.0 Im 4.0 P 3.7

Federated prognostics enable clients (e.g., companies, factories, and production lines) to collaboratively develop a failure time prediction model while keeping each client's data local and confidential. However, traditional federated models often assume homogeneity in the degradation processes across clients, an assumption that may not hold in many industrial settings. To overcome this, this paper proposes a personalized federated prognostic model designed to accommodate clients with heterogeneous degradation processes, allowing them to build tailored prognostic models. The prognostic model iteratively facilitates the underlying pairwise collaborations between clients with similar degradation patterns, which enhances the performance of personalized federated learning. To estimate parameters jointly using decentralized datasets, we develop a federated parameter estimation algorithm based on proximal gradient descent. The proposed approach addresses the limitations of existing federated prognostic models by simultaneously achieving model personalization, preserving data privacy, and providing comprehensive failure time distributions. The superiority of the proposed model is validated through extensive simulation studies and a case study using the turbofan engine degradation dataset from the NASA repository.

infra

#167

IR-Flow: Bridging Discriminative and Generative Image Restoration via Rectified Flow

Evaluations & Benchmarks 2026-04-21 arXiv cs.CV (Computer Vision)

Zihao Fan, Xin Lu, Jie Xiao, Dong Li +2

4.0

I 4.0 Im 4.8 P 3.0

In image restoration, single-step discriminative mappings often lack fine details via expectation learning, whereas generative paradigms suffer from inefficient multi-step sampling and noise-residual coupling. To address this dilemma, we propose IR-Flow, a novel image restoration method based on Rectified Flow that serves as a unified framework bridging the gap between discriminative and generative paradigms. Specifically, we first construct multilevel data distribution flows, which expand the ability of models to learn from and adapt to various levels of degradation. Subsequently, cumulative velocity fields are proposed to learn transport trajectories across varying degradation levels, guiding intermediate states toward the clean target, while a multi-step consistency constraint is presented to enforce trajectory coherence and boost few-step restoration performance. We show that directly establishing a linear transport flow between degraded and clean image domains not only enables fast inference but also improves adaptability to out-of-distribution degradations. Extensive evaluations on deraining, denoising and raindrop removal tasks demonstrate that IR-Flow achieves competitive quantitative results with only a few sampling steps, offering an efficient and flexible framework that maintains an excellent distortion-perception balance. Our code is available at https://github.com/fanzh03/IR-Flow.

evalsefficiencygenerative_mediaai_coding

#168

Improvements to the post-processing of weather forecasts using machine learning and feature selection

Research 2026-04-21 arXiv cs.LG (Machine Learning)

Kazuma Iwase, Tomoyuki Takenawa

4.0

I 4.0 Im 4.0 P 3.7

This study aims to develop and improve machine learning-based post-processing models for precipitation, temperature, and wind speed predictions using the Mesoscale Model (MSM) dataset provided by the Japan Meteorological Agency (JMA) for 18 locations across Japan, including plains, mountainous regions, and islands. By incorporating meteorological variables from grid points surrounding the target locations as input features and applying feature selection based on correlation analysis, we found that, in our experimental setting, the LightGBM-based models achieved lower RMSE than the specific neural-network baselines tested in this study, including a reproduced CNN baseline, and also generally achieved lower RMSE than both the raw MSM forecasts and the JMA post-processing product, MSM Guidance (MSMG), across many locations and forecast lead times. Because precipitation has a highly skewed distribution with many zero cases, we additionally examined Tweedie-based loss functions and event-weighted training strategies for precipitation forecasting. These improved event-oriented performance relative to the original LightGBM model, especially at higher rainfall thresholds, although the gains were site dependent and overall performance remained slightly below MSMG.

researchinterpretabilityinfra

#169

Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs

Frontier LLMs 2026-04-21 arXiv cs.CL (Computation & Language)

Guy Mor-Lan, Omer Goldman, Matan Eyal, Adi Mayrav Gilady +5

4.0

I 4.0 Im 4.0 P 3.7

Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we aim to quantify models' inter- and intra-lingual biases, via their ability to answer locale-ambiguous questions. To this end, we present LocQA, a test set containing 2,156 questions in 12 languages, referring to various locale-dependent facts such as laws, dates, and measurements. The questions do not contain indications of the locales they relate to, other than the querying language itself. LLMs' responses to LocQA locale-ambiguous questions thus reveal models' implicit priors. We used LocQA to evaluate 32 models, and detected two types of structural biases. Inter-lingually, we show a global bias towards answers relevant to the US-locale, even when models are asked in languages other than English. Moreover, we discovered that this global bias is exacerbated in models that underwent instruction tuning, compared to their base counterparts. Intra-lingually, we show that when multiple locales are relevant for the same language, models act as demographic probability engines, prioritizing locales with larger populations. Taken together, insights from LocQA may help in shaping LLMs' desired local behavior, and in quantifying the impact of various training phases on different kinds of biases.

frontier_llmpost_trainingevalsinfra

#170

MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

Generative Media 2026-04-21 arXiv cs.CV (Computer Vision)

Liyang Li, Wen Wang, Canyu Zhao, Tianjian Feng +3

4.0

I 4.0 Im 4.6 P 3.0

Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism. It incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.

generative_mediaaudiomultimodalpost_training

#171

MOSA: Motion-Guided Semantic Alignment for Dynamic Scene Graph Generation

AI Coding 2026-04-21 arXiv cs.CV (Computer Vision)

Xuejiao Wang, Bohao Zhang, Changbo Wang, Gaoqi He

4.0

I 4.0 Im 4.6 P 3.0

Dynamic Scene Graph Generation (DSGG) aims to structurally model objects and their dynamic interactions in video sequences for high-level semantic understanding. However, existing methods struggle with fine-grained relationship modeling, semantic representation utilization, and the ability to model tail relationships. To address these issues, this paper proposes a motion-guided semantic alignment method for DSGG (MoSA). First, a Motion Feature Extractor (MFE) encodes object-pair motion attributes such as distance, velocity, motion persistence, and directional consistency. Then, these motion attributes are fused with spatial relationship features through the Motion-guided Interaction Module (MIM) to generate motion-aware relationship representations. To further enhance semantic discrimination capabilities, the cross-modal Action Semantic Matching (ASM) mechanism aligns visual relationship features with text embeddings of relationship categories. Finally, a category-weighted loss strategy is introduced to emphasize learning of tail relationships. Extensive and rigorous testing shows that MoSA performs optimally on the Action Genome dataset.

ai_codinginterpretabilitypost_trainingsafety_policy

#172

Micro Language Models Enable Instant Responses

Efficiency 2026-04-21 arXiv cs.CL (Computation & Language)

Wen Cheng, Tuochao Chen, Karim Helwani, Sriram Srinivasan +2

4.0

I 4.0 Im 4.0 P 3.7

Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models ($μ$LMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device, while a cloud model completes it; thus, masking the cloud latency. We show that useful language generation survives at this extreme scale with our models matching several 70M-256M-class existing models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and structured graceful recovery via three error correction methods when the local opener goes wrong. Empirical results show that $μ$LMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at https://github.com/Sensente/micro_language_model_swen_project.

efficiencyinfra

#173

Model-independent consistency tests of DESI DR2 BAO and SN Ia

Efficiency 2026-04-21 arXiv — Mechanistic Interpretability

Hyeok Woo, William Luke Matthewson, Arman Shafieloo

4.0

I 4.0 Im 4.0 P 3.7

Cosmic distances can be measured using two complementary probes: Type Ia supernovae (SN Ia), serving as standard candles, and baryon acoustic oscillations (BAO), serving as standard rulers. The luminosity distance derived from supernovae and the angular diameter distance obtained from BAO must be mutually consistent if these data are to be combined for cosmological inference. Hence, the existence of potential discrepancies, whether arising from systematics in either dataset or from violation of the cosmic duality relation (in an unconventional cosmology), remains an important issue to address. Testing consistency under a particular cosmological model can be limiting, as the model may not be sensitive to every kind of inconsistency possible in the data. Thus, in this work we use a model-independent Crossing Statistics framework to test the consistency, using DESI DR2 BAO, and the Pantheon+ and Union3 SN Ia datasets. We find adding up to two additional degrees of freedom, using Crossing Statistics on the LambdaCDM distance-redshift relation, to be statistically justified. In these cases, the two probes remain mutually consistent at the 1-2 sigma level. Having established this statistical consistency, we combine the datasets to reconstruct the expansion history of the Universe and the inferred evolution of dark energy. The reconstructions obtained using different crossing variables show compatible behaviour where the data constraints are strongest, particularly at low redshift. Overall, the results are suggestive of a dark energy component that is evolving at low redshift, compatible with results from other reconstruction methods.

efficiencyinfra

#174

On the Conditioning Consistency Gap in Conditional Neural Processes

Research 2026-04-21 arXiv cs.LG (Machine Learning)

Robin Young

4.0

I 4.0 Im 4.0 P 3.7

Neural processes are meta-learning models that map context sets to predictive distributions. While inspired by stochastic processes, NPs do not generally satisfy the Kolmogorov consistency conditions required to define a valid stochastic process. This inconsistency is widely acknowledged but poorly understood. Practitioners note that NPs work well despite the violation, without quantifying what this means. We address this gap by defining the conditioning consistency gap, a KL divergence measuring how much a conditional neural process's (CNP) predictions change when a point is added to the context versus conditioned upon. Our main results show that for CNPs with bounded encoders and Lipschitz decoders, the consistency gap is $O(1/n^2)$ in context size $n$, and that this rate is tight. These bounds establish the precise sense in which CNPs approximate valid stochastic processes. The inconsistency is negligible for moderate context sizes but can be significant in the few-shot regime.

researchai_coding

#175

On two ways to use determinantal point processes for Monte Carlo integration

Research 2026-04-21 arXiv cs.LG (Machine Learning)

Guillaume Gautier, Rémi Bardenet, Michal Valko

4.0

I 4.0 Im 4.0 P 3.7

The standard Monte Carlo estimator $\widehat{I}_N^{\mathrm{MC}}$ of $\int fdω$ relies on independent samples from $ω$ and has variance of order $1/N$. Replacing the samples with a determinantal point process (DPP), a repulsive distribution, makes the estimator consistent, with variance rates that depend on how the DPP is adapted to $f$ and $ω$. We examine two existing DPP-based estimators: one by Bardenet & Hardy (2020) with a rate of $\mathcal{O}(N^{-(1+1/d)})$ for smooth $f$, but relying on a fixed DPP. The other, by Ermakov & Zolotukhin (1960), is unbiased with rate of order $1/N$, like Monte Carlo, but its DPP is tailored to $f$. We revisit these estimators, generalize them to continuous settings, and provide sampling algorithms.

#176

Pause or Fabricate? Training Language Models for Grounded Reasoning

Reinforcement Learning 2026-04-21 arXiv cs.CL (Computation & Language)

Yiwen Qiu, Linjuan Wu, Yizhou Liu, Yuchen Yan +8

4.0

I 4.0 Im 4.0 P 3.7

Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions -- a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness -- the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.

rlfrontier_llmresearchefficiency

#177

Phase Transitions in the Fluctuations of Functionals of Random Neural Networks

Research 2026-04-21 arXiv cs.LG (Machine Learning)

Simmaco Di Lillo, Leonardo Maini, Domenico Marinucci

4.0

I 4.0 Im 4.0 P 3.7

We establish central and non-central limit theorems for sequences of functionals of the Gaussian output of an infinitely-wide random neural network on the d-dimensional sphere . We show that the asymptotic behaviour of these functionals as the depth of the network increases depends crucially on the fixed points of the covariance function, resulting in three distinct limiting regimes: convergence to the same functional of a limiting Gaussian field, convergence to a Gaussian distribution, convergence to a distribution in the Qth Wiener chaos. Our proofs exploit tools that are now classical (Hermite expansions, Diagram Formula, Stein-Malliavin techniques), but also ideas which have never been used in similar contexts: in particular, the asymptotic behaviour is determined by the fixed-point structure of the iterative operator associated with the covariance, whose nature and stability governs the different limiting regimes.

research

#178

Planning in entropy-regularized Markov decision processes and games

Frontier LLMs 2026-04-21 arXiv cs.LG (Machine Learning)

Jean-Bastien Grill, Omar Darwiche Domingues, Pierre Ménard, Rémi Munos +1

4.0

I 4.0 Im 4.0 P 3.7

We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment. SmoothCruiser makes use of the smoothness of the Bellman operator promoted by the regularization to achieve problem-independent sample complexity of order O~(1/epsilon^4) for a desired accuracy epsilon, whereas for non-regularized settings there are no known algorithms with guaranteed polynomial sample complexity in the worst case.

frontier_llmrl

#179

Preventing extinction from ASI on a $50M yearly budget

Safety, Policy & Regulation 2026-04-21 AI Alignment Forum

Andrea_Miotti

4.0

I 3.0 Im 4.6 P 4.0

ControlAI's mission is to avert the extinction risks posed by superintelligent AI. We believe that in order to do this, we must secure an international prohibition on its development. We're working to make this happen through what we believe is the most natural and promising approach: helping decision-makers in governments and the public understand the risks and take action. We believe that ControlAI can achieve an international prohibition on ASI development if scaled sufficiently. We estimate that it would take approximately a $50 million yearly budget in funding to give us a concrete chance at achieving this in the next few years. In this post, we lay out some of the reasoning behind this estimate, and explain how additional funding past that threshold, including and beyond $500 million, would continue to significantly improve our chances of preventing extinction risk from ASI. Preventing ASI 101 Negotiating, implementing and enforcing an international prohibition on ASI is, in and of itself, not the work of a single non-profit. You need to have the weight of nations behind you to achieve this kind of goal. If humanity manages to achieve an international ban on ASI, it'll be through the efforts of a sufficiently motivated, sufficiently powerful initial coalition of countries. Assuming that we work in multiple countries in parallel, we could say the problem statement is: get each country to be motivated to achieve an international prohibition on ASI. It’s not obvious what it means for a country to be “motivated” to do something, so it’s worth taking a second to unpack. Our full theory of change chart, which backtracks from the desired outcome to our currently running workstreams. Normally, parts of a country's executive branch are responsible for international negotiations around urgent issues concerning national and global security. In practice, these are the groups who need to be sufficiently motivated to achieve the ban to throw their weight behind it. Branches of the government are generally not in the business of independently taking bold positions and then pursuing those positions to their logical ends. Instead, their stances and actions are mostly shaped by prevailing social currents. Some of these currents are informal. This includes things like the conversations they have with their colleagues, advisors, confidants and family members. It also includes any recent news cycles and the media they consume. Other parts of these currents operate through more formal channels, particularly in democracies. The legislative branch can influence the executive branch. [1] The public influences governments through elections, but also through polls, public discussions and common demands (at the very least because they affect the expectation of future election results). If enough of these inputs point in the same direction, pushing for an international ban on ASI can become one of the country’s top priorities. For this to work, we need pervasive awareness of the issue of extinction risk from ASI. This sentence makes two claims, both of which are fully necessary, so let us repeat them and expand them individually. Claim 1: The awareness of extinction risk needs to be pervasive throughout society. Prohibiting ASI development is not easy. It will require the relevant parts of the executive branch to take a great deal of initiative, and involve many hard tradeoffs. At a minimum, it will mean significantly slowing down improvements in general-purpose AI and thus forgoing economic and military advantages. If some countries are initially not willing to cooperate with a ban, proponents of the ban will need to apply an expensive combination of carrots and sticks to bring holdouts on board. For the relevant groups to push through these costs, it needs to feel like there is plenty of pressure to act, and like this pressure is coming from many places. If everyone who is asking for this is part of a specific, small faction, there will be a strong immune reaction and the faction will be ignored, or even purged in some cases. Claim 2: The awareness needs to be specifically about extinction risk from superintelligent AI. It is insufficient, and sometimes actively harmful, for people to vaguely dislike AI or only vaguely be aware that AI poses some scary risks. Due to the hard tradeoffs mentioned earlier, there will be pressure to take half-measures, at many layers, both internal and external. The only sufficient counterweight against this pressure is an understanding that ASI development must absolutely be prevented to ensure human survival. A lack of awareness of the specific issue will inevitably lead to anemic action and weak, unfocused policies that do not actually prevent the development of ASI. This is one of the reasons why, in our communications, we solely focus on extinction risk from ASI, and we do not work on raising awareness of other AI risks, or otherwise trying to get people to vaguely dislike all AI. [2] All of our efforts are specifically around raising awareness of extinction risk from ASI, and how it may be addressed. [3] Awareness is the bottleneck Chart synthesized from the section “The Simple Pipeline” of Gabriel Alfour's post on The Spectre haunting the “AI Safety” Community. It’s a common perception that one cannot communicate directly to lay people about extinction risks from ASI, because they would never get it. Instead, one must cook up sophisticated persuasion schemes. Based on our experience, this idea is just plainly wrong. Just tell the truth! We believe the primary bottleneck to getting an international prohibition on superintelligence is basic awareness of the issue. Most of the people we reach, for example among lawmakers and the media, have simply never been told about the problem in plain terms. We find that often, all it takes to bring someone on board is a single honest conversation. The fact that honestly explaining the concerns to people is such a low-hanging fruit is one of the reasons why we could get so much done in 2025. Politicians and the public simply don’t know that the most important figures in AI are literally worried about superintelligence causing human extinction. They simply don’t know that the only way to avoid human extinction on which experts can truly form a consensus is not to build ASI in the first place. [4] The reason why they are not aware of this is because they haven’t been told, not because they don’t understand the concepts involved. In our experience, most people find it intuitive that it is extremely dangerous to build something as powerful as ASI, that you don’t understand and can’t predict. They find it intuitive that you can’t control ASI, that it can very easily precipitate catastrophic scenarios, and that this means you should not build it in the first place. The reason why people are not aware of extinction risk from superintelligence is, simply put, because concerned experts have generally not been straightforward about their concern. The CAIS statement on AI risk is a rare exception to this, [5] but it’s starting to get old, and even then it’s just not enough. We’ve met with lawmakers over 300 times. Most of the time, they’ve never had someone explain extinction risk to them before, nor have they ever heard of the CAIS statement before the meeting. Even then, politicians don’t care about a person having signed a single statement once. That’s not how they’d expect someone who’s worried about the literal annihilation of the entire human race to behave. It sounds weak and almost fake to them. In a serious world, you’d expect every single AI expert who is worried about extinction to be loudly and consistently vocal about it, including to the public and decision-makers in governments. As it stands, this is simply not the case. AI companies and their leaders constantly soften their communications, avoiding clearly mentioning extinction and preferring to talk about euphemisms and other risks. Anthropic’s head of growth recently said that Anthropic constantly adjusts their communications to be “softer” and appear “less over the top”. Sam Altman, when asked by a US Senator whether he had jobs in mind when he said that “Development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity.”, did not correct the senator and instead proceeded to talk about the possible effects of AI on employment. If you ever have the chance to attend a house party in the Bay Area, you will get a really good sense of this: many researchers at AI companies are worried about extinction risk, and significantly orient their lives around this. At the same time, they don’t talk about these risks publicly. It’s obvious to us that the reason so little progress has been made towards international agreements on ASI is exactly because experts have failed to be consistently open about their concerns. An asymmetric war While an international ASI ban is obviously a very ambitious goal, there is a sense in which advocacy about extinction risks from ASI means having the wind at one’s back. At a fundamental level, this is because approximately no one wants to die from ASI. Politics is often an adversarial tug-of-war between opposing interests. When it comes to high-profile issues in American politics (e.g. abortion, marijuana legalization, Prop 22), it can take hundreds of millions. [6] However, when it comes to extinction risk, there is little conflict between different interest groups. If extinction risk materializes then everyone dies, regardless of their wealth, political affiliation or other personal interests. It is only an extreme minority of people who, even after having the chance to consider the dilemma of extinction risk, decide they are willing to bet on humanity’s extinction in order to get to ASI. This is true at a fractal level. Not only does it mean that we expect the issue to be nonpartisan within countries, but we expect the interests of countries to be aligned with each other as long as there is a significant risk that building superintelligence will cause human extinction. [7] This is why we think that there is a good chance that achieving the same kind of success we’ve already achieved, but at a larger scale, will lead to an international ban on superintelligence. Since our approach is not about winning a political tug-of-war through sheer might, we expect that we have a shot (~10%) [8] at winning even with a budget as low as $50 million, which is at least an order of magnitude smaller than other political campaigns on major issues. It would be a long shot, and we think “good odds” (~30%) would require larger budgets in the order of $500 million. [9] Let us elaborate a bit more on what we mean when we refer to a “political tug-of-war”. A common tactic, especially when trying to prevent a law from being passed, is to deliberately confuse people, for example by loudly communicating only the upsides of your proposals and only the downsides of the opposition’s, or through personal attacks on the opposition’s character. Aside from the obvious moral issues with this strategy, they are much less effective when it comes to an issue that is so clear cut and of such universal concern as extinction risk from ASI. With an issue like extinction risk, it becomes much harder to pit people against each other or to execute confusion tactics in order to hinder efforts to establish restrictions. Scalable processes At its core, ControlAI is an effort to create a scalable, industrial approach to averting extinction risks from ASI. The field of AI risk mitigation has historically relied on what we could call a “bespoke” or “artisanal” approach. That is, it relies on exceptional individuals to achieve specific successes, such as publishing a successful book, or performing some impressive networking feats, all through following their personal taste. The definition of what it means for these “artisanal” workstreams to “succeed” is not written down anywhere, and not much effort goes into defining it and grounding it. For most people focused on AI risk, getting a sense of whether they’ve succeeded doesn’t look like measuring something as much as it looks like applying ad-hoc rationales that easily fall prey to galaxy-braining. Everything hinges on the quality of the person’s taste at best, and on sheer luck at worst. Even when you succeed at these endeavors, you’re not in a position to easily replicate this success. To make it more clear what we mean: Eliezer Yudkowsky and Nate Soares can't trivially replicate the success they had by publishing the book “If Anyone Builds It, Everyone Dies” by simply putting more resources into the same effort, let alone scale up the approach by building an organization around it. The book was excellent and has helped spread awareness, but you can’t publish a book every week. Similarly, the CAIS “Statement on AI Risk” was excellent for establishing common knowledge, and has greatly helped us in our endeavors. That said, this type of work is hard to replicate, and indeed has not been replicated: neither CAIS nor any other organization has since succeeded in getting all the CEOs of top AI companies to sign a similarly candid statement. [10] ControlAI takes a different approach, one that straightforwardly allows scaling up workstreams once they’ve been set up. Whenever we have a goal that is too far off to tackle directly, we break it down into the most ambitious possible intermediate goal that we think we can act on. Crucially, we choose intermediate goals whose progress we can measure as hard numbers. In this way, we’re approximating sales funnels, the gold standard for how companies handle sales. Here’s a couple of examples of how we apply this approach. One of the early challenges we faced was to crystalize our successful lawmaker briefings into something that would accumulate over time and generate momentum. Our answer to this was to create a campaign statement [11] and to ask lawmakers to publicly support it. We’ve already secured 120 such supporters! This solution satisfies a few important constraints: It moves the world into a state where it’s somewhat easier to achieve our overarching goal of an international ban on ASI: Each public supporter helps by creating common knowledge that lawmakers consider this an urgent issue that merits immediate attention. It gives us a clear, numeric measure of success for this workstream: the number of lawmakers who signed on to our campaign. We could tackle this challenge directly: at the end of each briefing, simply ask lawmakers to publicly support the campaign. Marginal inputs compound over time: each additional lawmaker publicly supporting the campaign helps increase the credibility of the issue, and makes it easier for more lawmakers to take a stance on it in the future. After a while, we were ready to push toward something more ambitious. So, while still working on growing the number of lawmakers supporting the campaign, we introduced a new metric: the number of public declarations, written or spoken by an individual lawmaker, [12] that explicitly reference AI extinction risk or preventing superintelligence, on the condition that this happened after we personally briefed them. In the UK, this metric is currently sitting at 21. These metrics are numerical and clearly defined, meaning that even a fresh graduate hire can be pointed at one and told to "make it go up" or to improve conversion rates between one step of the funnel and the next. There’s no danger that the person will fool themselves about how much progress they’re making. [13] In fact, most reasonably smart and motivated people, given a reasonable amount of mentorship, will naturally iterate on their approach and eventually achieve good results. This way, we don't need to hit the jackpot on hiring people who possess incredible taste right off the bat. The best proof for this claim is our success in Canada. In about half a year, with only 1 staff member who had no previous experience in policy, we managed to brief 89 lawmakers and spur multiple hearings in the Canadian Parliament about the risks of AI. These hearings included testimonies from many experts who expressed their concerns about extinction risks: ControlAI’s Andrea Miotti (CEO), Samuel Buteau (Canada Program Officer) and Connor Leahy (US Director) Malo Bourgon (MIRI) Max Tegmark and Anthony Aguirre (FLI) Steven Adler (ex-OpenAI) David Krueger (Evitable) The fact that our approach is easily scalable is precisely the reason why we can write, in the rest of this post, about how we plan to make productive use of funding much larger than we currently enjoy. It’s also why, in some cases, we are able to make tentative predictions about what kind of success we expect to achieve. What we’d do with $50 million or more per year Right now, we believe that we are underfunded compared to what it would take us to have an actual shot at achieving an international ban on superintelligence. Our estimate is that a $50 million yearly budget [14] would give us a chance to succeed, although it would be a long shot. [15] Here, we break down how we would allocate a budget of $50 million to maximize our chances at achieving an international ban on ASI development. We also show how more funding would further increase our chances of succeeding, giving a few examples of how we would make productive use of budgets as large as $500 million or $1 billion (roughly in line with major campaigns in the US, such as abortion, marijuana policy, and the presidential race). We’ll cover our plans to use funds for policy advocacy in the US and the rest of the world, public awareness campaigns, policy research, outreach to thought-leaders (such as journalists), grassroots mobilization, and more. US policy advocacy Within a $50 million yearly budget, we’d be able to hire ~18 full-time policy advocates dedicated to briefing US members of Congress. In principle, we’d have enough bandwidth to meet every member of Congress within 3 to 6 months, ensuring that they’ve been briefed at least once on extinction risk from superintelligence. While we are confident that we’d have the capacity for these meetings, it is less clear whether we’d be able to regularly brief with members of Congress face-to-face, or whether we’d spend a significant fraction of our time communicating with staffers. At the moment, we are cautiously optimistic: in the past 5 months, with ~1 staff member, [16] we’ve managed to personally meet with and brief 18 members of Congress, as well as over 90 Congressional offices. Additionally, we’d have the capacity to brief offices in the executive branch relevant to national security and international affairs. These agencies are trusted by many other actors to stay on top of security risks, especially drastic ones like extinction risks from superintelligence; it’s essential for large-scale coordination that members of these institutions have a good grasp on the issue. A budget of $50 million would also allow us to hire a small team of ~6 staff members focused on performing outreach to state legislators in a small number of high-priority states. The bread and butter of our work is to ensure that US decision-makers are properly informed about and understand: Concepts like superintelligence, recursive self-improvement, compute, etc; That superintelligence poses an extinction risk; That this can be addressed by an international agreement prohibiting ASI, and how such an agreement could be designed such that it is actually enforced. We expect that, to the degree that we succeed in informing decision-makers about these matters, we’ll be able to leverage this into measurable outcomes such as: Politicians make public statements about superintelligence and the extinction risks it poses. Politicians make public statements about the need for an international prohibition on superintelligence development. Hearings are held in Congress on the above topics. The US takes steps toward negotiating an international prohibition on superintelligence with other countries. Within a $500 million budget, we would not only double or triple the number of full-time staff dedicated to US policy advocacy, but we’d also be able to attract the best talent, and hire policy advocates with very strong pre-existing networks. Policy advocacy in the rest of the world In the UK, we’ve already moved the national conversation on superintelligence forward. In little more than a year, we’ve gathered 110 supporters on our campaign statement, and catalyzed two debates at the House of Lords on superintelligence and extinction risk. At a yearly budget of $50 million, we could afford to more than triple our efforts in the UK. Now that we’ve managed to get some attention, we’ll put more focus on the following: Get government to discuss bills, amendments and actions the UK could take to champion the establishment of an international prohibition on superintelligence; [17] Executive branch outreach. A coalition of countries sufficiently powerful to achieve a ban on ASI will likely need to include multiple powerful countries to participate. To maximize the probability that this happens, we plan to prioritize G7 in our policy advocacy efforts. This is because G7 includes all of the most powerful countries that we’re confident can be influenced democratically. Within a budget of $50 million, we’d be able to match our current UK efforts in all other G7 countries and in the EU’s institutions. This means we’d likely be able to replicate our UK successes in most of these places, even accounting for bad luck or for them being slightly more difficult. [18] With roughly an additional $5 million in our budget (on top of the previous $50 million), we’d be able to dedicate at least 1 policy advocate (in some cases 2) to many other countries in the rest of the world. For example, we could maintain a presence in almost all G20 countries. We don’t know in advance which countries will respond well to our efforts, so we think it would be useful to spread out and take as many chances as possible. Our previous experience shows that it’s at least possible to get good results with only 1 staff member in some G7 countries. In Canada, our only local staff member managed to hold more meetings with representatives than any other corporate lobbyists or advocates during February. It seems probable that we can replicate our results in Canada in at least some G20 countries, where the competition for the attention of decision-makers is less stiff. Public awareness Our theory of change hinges not only on key decision-makers understanding the issue, but also on the public doing so. Our key messages to the public are: [19] Top AI experts warn that AI poses an extinction risk. We can prevent this risk by prohibiting superintelligence. Superintelligence may come quickly, in a matter of 5 years or less. We believe our key messages are straightforward: you don’t need to be a genius or to be deeply familiar with AI to understand them. [20] The main bottleneck is making the public aware of the issue in the first place; after that, it’s getting them to take action about it. We roughly expect that the average person will need to see each of our key messages 7 to 10 times in order to remember them, at the bare minimum. [21] That said, we expect that even after the same person sees a message dozens of times, the marginal returns on delivering the same message to this same person once more have still not been saturated. For example, we expect each new view will make the person slightly more likely to bring up the issue spontaneously in conversation, or slightly more likely to change their vote based on this issue. [22] Within a budget of $50 million, we expect that we can achieve on the order of magnitude of 2 billion ad impressions in the US, [23] an order of magnitude increase over our current ~200M. [24] Various sources suggest that the average YouTube CPM is roughly $9, with a range between approximately $3 and $23 depending on the ad and campaign. Using this as a reference, and assuming we allocate $16 million to raw ad spend, we’d get somewhere between 700 million and 5.3 billion impressions. This is assuming that all of our ad spend is on a single platform, but we can easily improve this by spreading our ad spend across platforms. For context, a $16 million per year ads budget is comparable to the ad spend of companies like Shake Shack, but still two to three orders of magnitude away from presidential campaigns or Coca-Cola’s yearly ad spend. If this was spread uniformly across the US population, every US adult would see our ads at least ~3 times. [25] More realistically, if we targeted a narrower segment of the US population, we could be seen by 10% of US adults ~30 times, or 5% of US adults ~60 times. In other words, it becomes plausible that a sizable portion of the US population would remember our key messages: they would be aware that AI poses an extinction risk, they would remember as the main recommendation on how to fix this problem that we should prohibit the development of superintelligent AI. This level of awareness seems like it would be a great step forward, but we would not stop there. In addition to raising awareness, we’d also aim to help people to take action that helps move the world toward an international ban on superintelligence. So far, we think that the most useful CTA (call to action) is to ask people to email or call their lawmakers. Using this CTA allows us to build a base of supporters who are motivated enough to take this kind of action, who we can call upon again in the future. We have already built the online campaigning infrastructure for this, and our 180k email subscribers have already sent over 200k messages to their lawmakers about ASI. At this $50 million budget, we estimate that we could grow this base of supporters to 2 million citizens within 1 year. When we email this type of CTA, we currently get an action rate of around 2%. We think we can safely assume that this action rate will not degrade by a whole order of magnitude at this scale. Given these assumptions, we predict that if we target some carefully selected subset of US states, this would produce enough constituent pressure to get on the radar of key decision-makers and their staff purely through constituents emailing and calling lawmakers. For example, if we target swing states, we might be able to get electoral campaigns to at least be aware of our issue. Public awareness efforts can scale massively before saturating. There are straightforward, non-innovative ways to make productive use of budgets as large as $500 million or $1 billion: large-scale ad campaigns routinely do so. Coca Cola spent $5.15 billion in 2024, and Trump’s 2024 presidential campaign spent more than $425 million, or $1.4 billion including outside groups. This is also the scale at which, if we wanted to do so, we could spend $8 million on a Super Bowl ad about extinction risk from superintelligent AI! [26] A total budget of $500 million to $1 billion would allow us to scale our ad spend massively. At this point, even with extremely pessimistic assumptions, [27] we could reach each US citizen at least a dozen times. Alternatively, we could focus on the 10% most engaged segment of the US population, reaching each individual at least 100 times. As a lower bound, we are confident this is enough to make sure that every citizen in the US is at least somewhat aware of the issue. More importantly, we suspect that at this scale we could push the issue to the forefront of the public’s attention, and make it into one of the main topics in the national conversation. We acknowledge it’s really hard to predict the effects of a campaign at this scale [28] , but we think that it can help to anchor on other campaigns of similar scale in the US: abortion, marijuana policy, and the presidential race itself. As we argued in the section An asymmetric war, we see these campaigns as mostly a zero-sum game, in which both sides must burn as many resources as possible to be competitive. If we receive comparable funding, we feel confident in our chances, as we see an AI extinction risk awareness campaign as a much more positive sum game. One last point about ad spending: in order to run an ad campaign, we need not only to buy ad space, but we also need to expand our marketing team so that it has sufficient capacity to optimize the campaign. Within a budget of $50 million, we could afford to dedicate ~6 people to this, offering salaries roughly between $100k and $200k. This addresses basic needs, but it does not provide an appropriate amount of bandwidth for the task, nor does it allow us to attract and retain the best talent. Running an effective ad campaign is not a fire-and-forget operation. We’d need to continuously measure results, A / B test, experiment, brainstorm ads and concepts, research trends and audience behaviors, even come up with novel metrics and testing methodologies. All of this information needs to be collected, analyzed and fed into the next round of iteration. The rounds of iteration themselves need to be very fast if we want to improve in a relevant amount of time. Whereas less ambitious marketing teams may take ~3 months to go through an iteration cycle, we’d have to do it in ~2 weeks. To run this kind of operation, we would benefit immensely from hiring the most talented people, who can not only follow existing playbooks, but also innovate. These people are in extremely high demand, and we’re competing for them against the private sector. Within a budget of $500 million, we could afford to dedicate ~20 people to this, offering salaries roughly between $200 and $400k. This would allow us to attract top talent and compete with the private sector. [29] Grassroots mobilization We already have a base of motivated supporters. 180k people are subscribed to our mailing list. 30k of our supporters contacted their lawmakers about extinction risk from ASI, and ~2000 of our supporters are willing to commit 5 minutes per week to regularly take small actions to help with the issue. Dozens have shown up at our pilot in-person events. With more funding, we think we can turn this into a significant grassroots movement. We currently lack the capacity to properly organize and mobilize this community. We believe that we’d have sufficient capacity for this at a $50 million overall budget. Concretely, this work would consist of things like: Vetting local leaders, coaching them and helping them with their work. Organizing or providing funding for local events. Helping with initial set up of groups, legal entities, basic websites etc. Building and providing services like Microcommit and tools such as the “Contact your lawmakers” tool on our campaign website Providing educational materials like tutorials and scripts for contacting one’s lawmakers. Policy work As part of our work in policy advocacy, it is often useful to be able to show policymakers a concrete policy proposal. These proposals can take various forms: legal definitions of superintelligence, high-level proposals for an international agreement on prohibiting ASI, national bills implementing a country’s obligations in an international agreement. These proposals are not meant to be the exact, definitive version of the law that will eventually be implemented. It is understood that things will change as time passes, more parties weigh in, and negotiations unfold. That said, it helps in many ways to have initial, concrete proposals. It helps people to publicly discuss, red-team, and refine the proposals. But also, it helps to show policymakers a proof-of-concept that concrete measures can be taken to prevent extinction risk from superintelligence. The more countries we reach, the more complicated this work becomes. The legal landscape differs significantly between countries: they have different legal traditions, processes, institutions, constitutions, limits on power of governmental bodies, etc. It takes a team of policy researchers, and the help of parliamentary lawyers, to develop and propose such policy proposals. We estimate that we’d have sufficient capacity for this work at around a $50 million total yearly budget. Thought-leader advocacy Most people rely on trusted voices, across the political spectrum, to help them navigate complex issues rather than trying to form their view from scratch on every single topic. This is a normal and healthy part of how democracies function: just like representative democracy exists because we don’t expect every citizen to participate directly in the full political process, we don’t expect everyone to independently decide to pay attention to such a highly complex matter as extinction risk from ASI. Instead, people look to figures like journalists, academics and public intellectuals to help them understand which issues deserve their attention. One of our key workflows is outreach to these kinds of thought-leaders. At the moment, this mostly includes journalists, and sometimes content creators. This workstream has so far resulted in 22 media publications on risk from superintelligent AI including in TIME and The Guardian, and in 14 collaborations (a mix of paid and free) with content creators including popular science communicator Hank Green, Rational Animations, and more. With more funding, we could not only scale up these workstreams, but also extend this outreach effort to include NGOs other than those who focus on AI, academics, religious leaders, authors and other public intellectuals, CEOs of companies outside of tech, leaders of local communities, and others. If we want our society to develop a deep awareness of the extinction risk posed by ASI, we need to help these people understand the issue. At a $50 million total budget, we’d have enough bandwidth for a thought-leader outreach effort focused on the lowest-hanging fruits. In practice, this likely means having a single generalist team spread across every type of thought-leader, and covering only the Anglosphere. At a total budget of $500 million, we could afford to build strong dedicated teams, each focused on one of the most important thought-leader communities. At the same time, we could establish a presence in other major cultural regions outside the Anglosphere. Attracting and retaining the best talent Many in our organization are forsaking significant increases in compensation they could command in the private sector, purely because they are deeply committed to our mission. As we scale, it will become increasingly difficult to find talented people who are willing to take this kind of pay cut. This is especially true if we scale aggressively. To attract the caliber of talent that a problem of this importance deserves, we need to offer salaries that are as competitive as possible with the private sector. At a yearly budget of $50 million, we’d be able to slightly improve our compensation, though most of the increase would be eaten by scaling the number of staff rather than increasing pay. As a rough estimate, we could probably offer between $100k and $200k to people in the public awareness team (comparable to sales in the private sector), and ~$350k to principal staff. At $500 million, we think we could be truly competitive. While we would likely still be unable to match the salaries offered by AI corporations to staff who take part in their lobbying and marketing operations, we could significantly reduce the gap. Conclusion We want to be upfront: we don't know for sure if this will work. An international ban on ASI is an extraordinarily ambitious goal. But we believe that the structure of the problem gives us a fighting chance: approximately no one wants to play a game that risks wiping out humanity, regardless of the prize. In 2025, with a team of fewer than 15 people, we’ve built a coalition of over 110 UK lawmakers to support our campaign, with 1 in 2 lawmakers having supported our campaign after we briefed them. On top of this, we’ve catalyzed parliamentary debates on superintelligence and extinction risk. In the US, where competition for lawmakers' attention is the fiercest, we’ve personally met with 18 members of Congress with only a tiny number of staff on the ground. On the public awareness side, over 30k people have used our tools to send over 200k messages to their lawmakers about extinction risk from superintelligence, most of them in the US. This wasn't a fluke of exceptional talent or lucky connections; we’ve done this with remarkably junior staff, in little more than a year. It was the result of a straightforward, scalable process, and of building solid foundations that enable us to scale to meet the challenge. What’s standing between us and a real fighting chance is funding commensurate with the problem. If you are a major donor or a philanthropic institution, please get in touch at [email protected]. We’d be glad to walk you through our theory of change in more detail and discuss how additional funding would be deployed. If you know a major donor or someone at a philanthropic institution, please introduce us. A warm introduction from someone they trust goes much further than a cold email from us. You can loop us in at the same address. If you're an individual donor who is considering a gift of $100k or more, please reach out at the same address. Please only consider doing so if this wouldn't significantly impact your financial situation. We don't want anyone to overextend themselves on our behalf, no matter how much they care about the issue. We are a 501(c)(4) in the US and a nonprofit (not a registered charity) in the UK, so your donations are not tax deductible. We’re currently not set up to receive smaller donations. If you still want to contribute, you can check our careers page. If you see a role you could fill, please apply. If you know someone who'd be a good fit, send them our way. e.g. US Congress has the “power of the purse”, parliamentary systems can hold “votes of no confidence”. ↩︎ Between our founding in October 2023 and mid 2024, we ran 3 campaigns in rapid succession. One of these was a campaign against deepfakes. This was a sincere effort: we do believe that deepfakes are a problem that should be addressed with legislation, and we’re proud of our achievements as part of our campaign. That said, after refining our thinking and developing the ideas we’re espousing on this post, we’ve updated towards focusing exclusively on extinction risk from ASI. This is what we’ve been doing since the end of 2024. ↩︎ Consider the environmentalist movement as a cautionary example. Environmental efforts have generally failed to achieve their stated goals (e.g. reducing emissions, reversing climate change). Richard Ngo argues that they’ve caused serious collateral harms. We think this is partly because of their lack of focus. Rather than concentrating on a single core concern, environmental campaigns rummage around for anyone who, for any reason, feels good vibes toward the idea of the environment. As a result, the movement struggles to achieve good policies despite being enormously salient. Because of its lack of focus, it is interlinked with anti-capitalist groups, and so it tends to oppose interventions that would actually help with climate change, such as nuclear energy, as well as carbon capture and market-based solutions in general. Relevant posts on LessWrong: @habryka’s “Do not conquer what you cannot defend”, @Gabriel Alfour‘s “How to think about enemies: the example of Greenpeace”. ↩︎ To clarify: this doesn’t mean that everyone thinks the only way to avoid extinction is to not build ASI. Some do, while others have complicated ideas about how ASI can be built safely. The point is that none of those specific complex ideas benefit from a broad expert consensus. The only thing that most of us can agree on is that it won’t kill us if we don’t build it. ↩︎ There have been other statements, such as this great one from FLI, but none signed by *both* top AI scientists and CEOs of top AI companies. ↩︎ Sources: abortion was roughly $400 million in 2024, marijuana legalization was roughly $185 million in 2024, Prop 22 was roughly $220 million. ↩︎ See Annex 2 of our paper “How middle powers may prevent the development of ASI”. While the paper focuses on the perspective of middle powers, this section’s analysis extends to superpowers. ↩︎ The probabilities are produced mostly by gut feeling, but the major barriers that were considered are the following. 1) We are able to maintain a good internal culture as we scale extremely aggressively. 2) The lower bounds of our gears-level estimates mentioned in the second half of this post (e.g. ad impressions per dollar) hold. 3) We are able to validate our approach at scales of ~$50 million a year, and are able to continue raising at this scale if getting the agreement in place takes longer than a year. 4) The issue becomes a top 10 salient issue in the US and another 2~3 major countries. 5) The behavior of governments championing the ban is sufficiently connected to the right insights about extinction risk and ASI, requiring at the very least that public discourse about the ASI ban does not get distracted or confused in a way that makes the resulting actions ineffective. 6) This leads to an international ban on ASI in which major powers, including the US and China, conclude that participation serves their national interests and try to enforce globally. Alternatively, if China or other countries do not join, the coalition of countries behind the ASI ban is powerful enough to be able to deter non-participating countries and any rogue actors from developing ASI. ↩︎ We strongly believe in the principles we follow: honesty, openness, and democracy. Of course, we do think that our approach to averting extinction risks from ASI is the best; we wouldn’t pursue it if we didn’t think so. At a 500M budget level, we’d love to fund organizations that pursue different approaches, as long as they respect our basic principles. If we had that level of funding, we would seek to ensure that there are other organizations pursuing a candid approach to communication about ASI, and of organizations that directly tackle the need for strong international coordination. ↩︎ Notably, a statement like this one can generate a temporary spike of media coverage, but does not generate sustained attention by itself. Statements like this one need a sustained campaign (like the one we’re running) in order to receive sustained attention. ↩︎ The statement reads: “Nobel Prize winners, AI scientists, and CEOs of leading AI companies have stated that mitigating the risk of extinction from AI should be a global priority. Specialised AIs - such as those advancing science and medicine - boost growth, innovation, and public services. Superintelligent AI systems would compromise national and global security. The UK can secure the benefits and mitigate the risks of AI by delivering on its promise to introduce binding regulation on the most powerful AI systems.” ↩︎ Examples of this: a lawmaker giving a speech in parliament, writing an op-ed, or speaking in an interview to a major media outlet. ↩︎ Importantly, our metrics are strictly focused on AI extinction risk. This reduces the risk that the person working on them, or the organization as a whole, will fool themselves into pursuing issues other than preventing extinction risk from superintelligent AI. A “lawmaker public declaration” only counts if it covers extinction risk specifically. If people at ControlAI spend time trying to push topics such as “job loss”, “AI ethics” or “autonomous weapons”, we consider this a failure. This is how we fight The Spectre, and stay laser focused on addressing extinction risk from superintelligence. ↩︎ To be considered a very rough estimate, could be 30M to 80M. ↩︎ As we mentioned earlier, we feel that this is around a ~10% chance. ↩︎ 1 member for most of this period; the 2nd member joined in the past month. ↩︎ We’ve already fostered two debates about prohibiting ASI, and helped submit one amendment recognizing ASI and putting in place kill-switches for use in case of AI emergencies. To our knowledge, we are the first organization to successfully prompt a debate, in the parliament of a major country, focused specifically on prohibiting superintelligence. ↩︎ Consider that replicating a success should be much easier than doing it the first time. By design, our results are public, and so, produce common knowledge. Now that 100+ lawmakers support our campaign in the UK, it is easier for other lawmakers to take a similar stance, including in other countries. ↩︎ To a lesser degree, we would like people to remember our organization as a place where they can find trustworthy information on the issue and what they can do to help solve it. ↩︎ The vast majority of people will not feel the need to fully understand the technical and geopolitical details in order to buy into the concern. The important part is that most people can intuitively understand why and how ASI can cause human extinction, and are happy to defer to experts about the details. ↩︎ This is the most common rule of thumb in marketing, and is backed up by some academic research as well, e.g. see Advertising Repetition: A Meta-Analysis on Effective Frequency in Advertising. ↩︎ Unlike the previous one, this statement is not backed by academic research. While most academic research focuses on marketing aimed at selling products and services, our goals present quite a different challenge. There are two main differences that make us expect to keep getting returns after even hundreds of exposures. 1) Our messages are somewhat novel and complex to the audience. This complexity will have to be accounted for in some way: either the message is presented in a complex way that takes more exposures to remember, or the message is broken down into many building blocks, each of which needs to be shown many times. 2) The success bar is somewhat higher: we do benefit from people responding to CTAs similar in scope to “buying a product”, but we also benefit from deeper engagement (see the section on “Grassroots mobilization"), we benefit from people spontaneously bringing up the topic in conversations, which happens more if we create common knowledge that the topic exists. ↩︎ This section assumes that we will allocate 60% of our ad spend to the US. We expect it will be quite a bit easier to yield good results in other countries, mostly due to lower cost per impression. For example, if we put the remaining 40% in 3 G7 countries, we expect to roughly be able to replicate the same success as in the US across those 3 countries. ↩︎ Including both organic and paid reach. ↩︎ This corresponds to 800 million total impressions. ↩︎ Though it’s not clear to us at the moment if this would be a good use of money. ↩︎ In this paragraph, we use our worst case assumption that scaling ad-spend by x30 multiplies impressions by x4. We expect it’s much more likely that scaling x30 will yield x10 to x15 impressions. ↩︎ Simpler models and extrapolations that we think we can use at a $50 million budget will break at this scale. There are strong reasons to deviate from these, both in pessimistic and optimistic directions. At this scale, we’ve probably run out of people who can be mobilized solely through ads. At the same time, network effects come into play, where people hear about the issue from others, and they start to see it as a “normal” part of the political discourse. It seems to us that trying to model the net effect ahead of time would be a fool’s errand. ↩︎ For reference, here’s a job post by Anthropic for a marketing role, which they advertise as paying $255k to $320k. ↩︎ Discuss

safety_policyindustryresearchaudio

#180

Government & Defense 2026-04-21 Breaking Defense

Breaking Defense

3.9

I 4.0 Im 4.0 P 3.4

[Sponsored] As “dirty” RF and contested environments proliferate, autonomy increasingly depends on resilient positioning.

#188

A neural operator framework for data-driven discovery of stability and receptivity in physical systems

Research 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Chengyun Wang, Liwei Chen, Nils Thuerey

3.9

I 4.0 Im 4.0 P 3.4

Understanding how complex systems respond to perturbations, such as whether they will remain stable or what their most sensitive patterns are, is a fundamental challenge across science and engineering. Traditional stability and receptivity (resolvent) analyses are powerful but rely on known equations and linearization, limiting their use in nonlinear or poorly modeled systems. Here, we introduce a data-driven framework that automatically identifies stability properties and optimal forcing responses from observation data alone, without requiring governing equations. By training a neural network as a dynamics emulator and using automatic differentiation to extract its Jacobian, we can compute eigenmodes and resolvent modes directly from data. We demonstrate the method on both canonical chaotic models and high-dimensional fluid flows, successfully identifying dominant instability modes and input-output structures even in strongly nonlinear regimes. By leveraging a neural network-based emulator, we readily obtain a nonlinear representation of system dynamics while additionally retrieving intricate dynamical patterns that were previously difficult to resolve. This equation-free methodology establishes a broadly applicable tool for analyzing complex, high-dimensional datasets, with immediate relevance to grand challenges in fields such as climate science, neuroscience, and fluid engineering.

researchinfra

#189

AI Dungeon maker Latitude unveils Voyage, a platform for creating AI-powered RPGs

Reinforcement Learning 2026-04-21 TechCrunch — AI

Lauren Forristal

3.9

I 4.0 Im 4.0 P 3.4

Latitude's new AI-native platform, Voyage, aims to help gamers create their very own role-playing game.

#190

Industry 2026-04-21 TechCrunch — AI

Connie Loizos

3.9

I 4.0 Im 4.0 P 3.4

Apple's top job comes with almost unrivaled power and money, but it comes with plenty of baggage, too.

#197

Autonomous UAV Pipeline Near-proximity Inspection via Disturbance-Aware Predictive Visual Servoing

Interpretability 2026-04-21 arXiv cs.RO (Robotics)

Wen Li, Hui Wang, Jinya Su, Cunjia Liu +2

3.9

I 4.0 Im 4.0 P 3.4

Reliable pipeline inspection is critical to safe energy transportation, but is constrained by long distances, complex terrain, and risks to human inspectors. Unmanned aerial vehicles provide a flexible sensing platform, yet reliable autonomous inspection remains challenging. This paper presents an autonomous quadrotor near-proximity pipeline inspection framework for three-dimensional scenarios based on image-based visual servoing model predictive control (VMPC). A unified predictive model couples quadrotor dynamics with image feature kinematics, enabling direct image-space prediction within the control loop. To address low-rate visual updates, measurement noise, and environmental uncertainties, an extended-state Kalman filtering scheme with image feature prediction (ESKF-PRE) is developed, and the estimated lumped disturbances are incorporated into the VMPC prediction model, yielding the ESKF-PRE-VMPC framework. A terrain-adaptive velocity design is introduced to maintain the desired cruising speed while generating vertical velocity references over unknown terrain slopes without prior terrain information. The framework is validated in high-fidelity Gazebo simulations and real-world experiments. In real-world tests, the proposed method reduces RMSE by 52.63% and 75.04% in pipeline orientation and lateral deviation in the image, respectively, for straight-pipeline inspection without wind, and successfully completes both wind-disturbance and bend-pipeline tasks where baseline method fails. An open-source nano quadrotor is modified for indoor experimentation.

interpretability

#198

BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

AI Coding 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Lekai Qian, Haoyu Gu, Jingwei Zhao, Ziyu Wang

3.9

I 4.0 Im 4.0 P 3.4

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.

ai_codingaudioevals

#199

Bond, a new social media platform, wants to use AI to help you kick your doomscrolling habit

Industry 2026-04-21 TechCrunch — AI

Lucas Ropek

3.9

I 4.0 Im 4.0 P 3.4

Bond wants you to get off the couch and get back into the real world, its creator says. The new platform's AI system is designed to motivate users to do things away from the app.

#200

Research 2026-04-21 arXiv cs.RO (Robotics)

Andreas Mueller

3.9

I 4.0 Im 4.0 P 3.4

Many mechanical systems exhibit changes in their kinematic topology altering the mobility. Ideal contact is the best known cause, but also stiction and controlled locking of parts of a mechanism lead to topology changes. The latter is becoming an important issue in human-machine interaction. Anticipating the dynamic behavior of variable topology mechanisms requires solving a non-smooth dynamic problem. The core challenge is a physically meaningful transition condition at the topology switching events. Such a condition is presented in this paper. Two versions are reported, one using projected motion equations in terms of redundant coordinates, and another one using the Voronets equations in terms of minimal coordinates. Their computational properties are discussed. Results are shown for joint locking of a planar 3R mechanisms and a 6DOF industrial manipulator.

#206

GRAI believes AI can make music more social, not replace artists

Industry 2026-04-21 TechCrunch — AI

Sarah Perez

3.9

I 4.0 Im 4.0 P 3.4

AI music startup GRAI says fans want to remix tracks, not generate songs from scratch.

industryaudio

#207

I’m Sorry, Dave. I’m Afraid I Can’t De-escalate: On (AI) Wargaming and Nuclear War

Frontier LLMs 2026-04-21 War on the Rocks

Ankit Panda, Andrew Reddie

3.9

I 4.5 Im 4.0 P 3.0

Recent experiments placing large language models in simulated nuclear crises have produced alarming headlines. “Bloodthirsty” AI systems escalate conflicts, threaten nuclear strikes, and behave erratically under simulated pressure. A recent set of experiments presented in a pre-print paper from Kenneth Payne at King’s College London finds that across 95 percent of simulated games across 21 match-ups between three frontier models, at least one side engaged in nuclear signaling — with subsequent tactical nuclear use occurring in 95 percent of games and strategic nuclear threats in 76 percent. The study’s author describes the results as “sobering” and frames them as a The post I’m Sorry, Dave. I’m Afraid I Can’t De-escalate: On (AI) Wargaming and Nuclear War appeared first on War on the Rocks.

frontier_llmrl

#208

Navy planning to spend more than $17B on first Trump-class battleship

Government & Defense 2026-04-21 DefenseScoop

Jon Harper

3.9

I 4.0 Im 4.0 P 3.4

The Navy revealed new details Tuesday about its procurement plans for a new Guided Missile Battleship (BBG(X)) program. The post Navy planning to spend more than $17B on first Trump-class battleship appeared first on DefenseScoop.

gov_defense

#209

PC2Model: ISPRS benchmark on 3D point cloud to model registration

Evaluations & Benchmarks 2026-04-21 arXiv cs.CV (Computer Vision)

Mehdi Maboudi, Said Harb, Jackson Ferrao, Kourosh Khoshelham +2

3.9

I 4.0 Im 4.4 P 3.0

Point cloud registration involves aligning one point cloud with another or with a three-dimensional (3D) model, enabling the integration of multimodal data into a unified representation. This is essential in applications such as construction monitoring, autonomous driving, robotics, and virtual or augmented reality (VR/AR).With the increasing accessibility of point cloud acquisition technologies, such as Light Detection and Ranging (LiDAR) and structured light scanning, along with recent advances in deep learning, the research focus has increasingly shifted towards downstream tasks, particularly point cloud-to-model (PC2Model) registration. While data-driven methods aim to automate this process, they struggle with sparsity, noise, clutter, and occlusions in real-world scans, which limit their performance. To address these challenges, this paper introduces the PC2Model benchmark, a publicly available dataset designed to support the training and evaluation of both classical and data-driven methods. Developed under the leadership of ICWG II/Ib, the PC2Model benchmark adopts a hybrid design that combines simulated point clouds with, in some cases, real-world scans and their corresponding 3D models. Simulated data provide precise ground truth and controlled conditions, while real-world data introduce sensor and environmental artefacts. This design supports robust training and evaluation across domains and enables the systematic analysis of model transferability from simulated to real-world scenarios. The dataset is publicly accessible at: https://zenodo.org/uploads/17581812.

evalsroboticsmultimodalinfra

#210

Paparazzo: Active Mapping of Moving 3D Objects

Agents & Tool Use 2026-04-21 arXiv cs.CV (Computer Vision)

Davide Allegro, Shiyao Li, Stefano Ghidoni, Vincent Lepetit

3.9

I 4.0 Im 4.4 P 3.0

Current 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object's motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target's trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding. Project page: https://davidea97.github.io/paparazzo-page/

agentsevals

#211

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

Frontier LLMs 2026-04-21 HF Hugging Face Blog

3.9

I 4.0 Im 4.0 P 3.4

(No summary available.)

frontier_llmevals

#212

RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation

Generative Media 2026-04-21 arXiv cs.CV (Computer Vision)

Ahmed Marouane Djouama, Abir Belaala, Abdellah Zakaria Sellam, Salah Eddine Bekhouche +2

3.9

I 4.0 Im 4.4 P 3.0

Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusion-based approaches, RF-HiT leverages rectified flow with efficient transformer blocks to achieve linear complexity while requiring only a few discretization steps. The model further fuses conditioning features across resolutions via learnable interpolation, enabling effective multi-scale representation with minimal computational overhead. As a result, RF-HiT achieves a strong efficiency-performance trade-off, requiring only 10.14 GFLOPs, 13.6M parameters, and inference in as few as three steps. Despite its compact design, RF-HiT attains 91.27% mean Dice on ACDC and 87.40% on BraTS 2021, achieving performance comparable to or exceeding that of significantly more intensive architectures. This demonstrates its strong potential as a robust, computationally efficient foundation for real-time clinical segmentation.

generative_mediaefficiencyai_codinginterpretability

#213

Roundtables: Unveiling The 10 Things That Matter in AI Right Now

Industry 2026-04-21 MIT Technology Review — AI

MIT Technology Review

3.9

I 4.0 Im 4.0 P 3.4

Listen to the session or watch below Watch a special edition of Roundtables simulcast live from EmTech AI, MIT Technology Review’s signature conference for AI leadership. Subscribers got an exclusive first look at a new list capturing 10 key technologies, emerging trends, bold ideas, and powerful movements in AI that you need to know about…

#214

Scheduling Analysis of UAV Flight Control Workloads using Raspberry Pi 5 Using PREEMPT_RT Linux

Infrastructure 2026-04-21 arXiv cs.RO (Robotics)

Luiz Giacomossi, Håkan Forsberg, Ivan Tomasic, Baran Çürüklü +1

3.9

I 4.0 Im 4.0 P 3.4

Modern UAV architectures increasingly aim to unify high-level autonomy and low-level flight control on a single General-Purpose Operating System (GPOS). However, complex multi-core System-on-Chips (SoCs) introduce significant timing indeterminism due to shared resource contention. This paper performs an architectural analysis of the PREEMPT RT Linux kernel on a Raspberry Pi 5, specifically isolating the impact of kernel activation paths (deferred execution SoftIRQs versus real-time direct activation) on a 250 Hz control loop. Results show that under heavy stress, the standard kernel is unsuitable, exhibiting worst-case latencies exceeding 9 ms. In contrast, PREEMPT RT reduced the worst-case latency by nearly 88 percent to under 225 microseconds, enforcing a direct wake-up path that mitigates OS noise. These findings demonstrate that while PREEMPT RT resolves scheduling variance, the residual jitter on modern SoCs is primarily driven by hardware memory contention.

infra

#215

SpaceX working with Cursor, has option to buy startup at $60B valuation

Industry 2026-04-21 TechCrunch — AI

3.9

I 4.0 Im 4.0 P 3.4

TechCrunch reports SpaceX has engaged Cursor for internal software work and negotiated an option to acquire the company for up to $60B — a sharp markup from Cursor's last private round. Deal terms and conversion triggers were not disclosed.

industry

#216

Stop tweaking your AI models. Do this instead.

Frontier LLMs 2026-04-21 Gradient Flow (Ben Lorica)

Ben Lorica

3.9

I 4.5 Im 4.0 P 3.0

Subscribe • Previous Issues The Missing Layer: Why Your AI Agent Fails — and What Actually Fixes It As organizations move autonomous AI agents from experimental sandboxes into live production, a critical bottleneck has emerged. Foundation models are remarkably capable but structurally unsuited to complex, multi-step work on their own. They have no persistent memory, no built-inContinue reading "Stop tweaking your AI models. Do this instead." The post Stop tweaking your AI models. Do this instead. appeared first on Gradient Flow.

frontier_llmgenerative_mediaagents

#217

The 10 Things That Matter in AI Right Now (MIT Tech Review Roundtables)

Industry 2026-04-21 MIT Technology Review — AI

3.9

I 4.0 Im 4.0 P 3.4

MIT Technology Review Roundtables surveyed the field to name the ten topics that define AI at this moment in 2026. Serves as a snapshot consensus view from editors and invited researchers; useful as a framing document for the state-of-AI conversation.

#218

The counterterrorism czar without a counterterrorism plan

Government & Defense 2026-04-21 Defense One

Hannah Allam, ProPublica

3.9

I 4.0 Im 4.0 P 3.4

Iranian threats against U.S. targets have brought renewed attention to the absence of Sebastian Gorka's long-promised doctrine.

#219

Wrench-Aware Admittance Control for Unknown-Payload Manipulation

Robotics 2026-04-21 arXiv cs.RO (Robotics)

Hossein Gholampour, Logan E. Beaver

3.9

I 4.0 Im 4.0 P 3.4

Unknown payloads can strongly affect compliant robotic manipulation, especially when the payload center of mass is not aligned with the tool center point. In this case, the payload generates an offset wrench at the robot wrist. During motion, this wrench is not only related to payload weight, but also to payload inertia. If it is not modeled, the compliant controller can interpret it as an external interaction wrench, which causes unintended compliant motion, larger tracking error, and reduced transport accuracy. This paper presents a wrench-aware admittance control framework for unknown-payload pick-and-place using a UR5e robot. The method uses force-torque measurements in two different roles. First, a three-axis translational excitation term is used to reduce payload-induced force effects during transport without making the robot excessively stiff. Second, after grasping, the controller first estimates payload mass for transport compensation and then estimates the payload CoM offset relative to the TCP using wrist force-torque measurements collected during the subsequent translational motion. This helps improve object placement and stacking behavior. Experimental results show improved transport and placement performance compared with uncorrected placement while preserving compliant motion.

roboticsinfra

#220

YouTube expands its AI likeness detection technology to celebrities

Industry 2026-04-21 TechCrunch — AI

Sarah Perez

3.9

I 4.0 Im 4.0 P 3.4

YouTube is expanding its AI likeness detection tool to celebrities, giving talent and their reps a way to find and remove deepfakes.

#221

3D vision redefining drone navigation without GPS

Government & Defense 2026-04-21 Breaking Defense

3.8

I 4.0 Im 4.0 P 3.0

Breaking Defense features systems using onboard stereo/depth-camera 3D vision for GPS-denied drone navigation — including terrain-relative localization and visual-inertial odometry running on edge SoCs.

gov_defenseinterpretability

#222

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

Robotics 2026-04-21 arXiv cs.CV (Computer Vision)

Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng +5

3.8

I 4.0 Im 4.0 P 3.0

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

roboticsgenerative_mediainfra

#223

Generative Media 2026-04-21 arXiv cs.AI (Artificial Intelligence)

Alessandro G. Buda, Giuseppe Primiero, Leonardo Ceragioli, Melissa Antonelli

3.6

I 3.0 Im 4.0 P 3.4

Generative AI systems are known to amplify biases present in their training data. While several inference-time mitigation strategies have been proposed, they remain largely empirical and lack formal guarantees. In this paper we introduce CTLF, a branching-time logic designed to reason about bias in series of generative AI outputs. CTLF adopts a counting worlds semantics where each world represents a possible output at a given step in the generation process and introduces modal operators that allow us to verify whether the current output series respects an intended probability distribution over a protected attribute, to predict the likelihood of remaining within acceptable bounds as new outputs are generated, and to determine how many outputs are needed to remove in order to restore fairness. We illustrate the framework on a toy example of biased image generation, showing how CTLF formulas can express concrete fairness properties at different points in the output series.

generative_mediaefficiencyinfra