
AI Digest — April 20, 2026

Coverage window: 2026-04-13 to 2026-04-20
Items
596
Multi-source
96
Long-form (≥7.5)
6
Top category
Research
Run duration
15m 23s
#1
Claude Opus 4.7 released with SOTA SWE-Bench Verified, expanded vision, new 'xhigh' effort level
Robotics · ★ 9.6 · 2026-04-16
#2
Physical Intelligence releases π0.7: steerable VLA foundation model with emergent cross-embodiment generalization
Robotics · ★ 8.6 · 2026-04-16
#3
Dwarkesh Patel × Jensen Huang: TPU competition, Nvidia's supply chain, and the case against export controls
Industry · ★ 8.2 · 2026-04-15
#1
Anthropic News, Artificial Analysis, YT: AI Explained, Hacker News, Cursor Blog, Latent Space … 2026-04-16

Anthropic released Claude Opus 4.7 on April 16, 2026, positioning it as the frontier of its lineup while holding pricing unchanged at five dollars per million input tokens and twenty-five dollars per million output tokens. The headline technical gains sit in software engineering, long-horizon autonomy, and vision. On SWE-bench Verified the model posts state-of-the-art resolution rates, with early testers reporting roughly thirteen percentage points of gain over Opus 4.6 on coding tasks and a threefold increase in production tasks completed end-to-end. On specialized evaluations the model reaches 90.9 percent on BigLaw Bench and takes the lead on GDPval-AA, a third-party evaluation of economically valuable knowledge work. Instruction following is reported to be substantially stronger, which Anthropic flags explicitly as a migration hazard: prompts tuned against older Claude versions may need to be rewritten rather than ported.

Vision capacity expanded roughly threefold, with support for images up to 2,576 pixels on the long edge. In practice this unlocks dense-screenshot reading, reasoning over complex diagrams, and better extraction from technical and scientific figures. A new reasoning effort level called xhigh exposes finer control over the reasoning-versus-latency tradeoff, extending the existing low-to-high ladder. Early testers emphasize sustained autonomous reasoning — the model is described as coherent across multi-hour runs rather than prematurely concluding on difficult problems. The tokenizer has been updated, and as a consequence input token counts rise by a factor of roughly 1.0 to 1.35 depending on content, which has meaningful implications for both latency and cost modeling.
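The cost implication of the tokenizer change can be sketched in a few lines, using the announced $5/$25 per-million pricing; the 1.2x inflation factor below is an illustrative midpoint of the reported 1.0–1.35 range, not a measured value.

```python
# Sketch: how a tokenizer-driven rise in input token counts propagates
# into per-request cost. Pricing is the announced $5 / $25 per million
# input/output tokens; the 1.2x factor is an assumed midpoint of the
# reported 1.0-1.35x inflation range.

INPUT_PRICE = 5.00 / 1_000_000    # USD per input token
OUTPUT_PRICE = 25.00 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int,
                 input_inflation: float = 1.0) -> float:
    """Cost of one request, scaling input tokens by a tokenizer factor."""
    return (input_tokens * input_inflation * INPUT_PRICE
            + output_tokens * OUTPUT_PRICE)

# Same workload measured under the old and new tokenizer assumptions:
old = request_cost(50_000, 2_000, input_inflation=1.0)
new = request_cost(50_000, 2_000, input_inflation=1.2)
print(f"old ${old:.3f} -> new ${new:.3f} (+{100 * (new / old - 1):.1f}%)")
```

For input-heavy workloads the inflation dominates the bill even though per-token prices are unchanged, which is why the cost-modeling caveat matters.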

Safety posture is framed as comparable to Opus 4.6, with improvements in honesty and prompt-injection resistance. The release is paired with Project Glasswing, a set of automatic safeguards that detect and block requests indicating prohibited or high-risk cybersecurity uses; a Cyber Verification Program carves out legitimate vulnerability research and penetration testing. Distribution covers all Claude products, the API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

Third-party evaluations and community reception were genuinely split. Artificial Analysis placed Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4 tied at an intelligence score of fifty-seven, which frames the release as frontier parity rather than an uncontested lead. On Hacker News, the strongest positive signal came from heavy Claude Code users noting gains on well-specified structured tasks at higher effort levels. Less favorably, multiple threads flagged regressions in day-to-day Claude Code use — more hallucinations on shallow checks, additional confirmation loops, and higher token burn per outcome. A separate and pointed complaint thread, 'Claude Code Opus 4.7 keeps checking on malware,' documented the model producing false-positive malware refusals on legitimate debugging work; the apparent root cause is an interaction between Claude Code's injected system prompt and Opus 4.7's updated reasoning behavior, with at least one developer reporting an account termination triggered by these signals. The practical takeaway is that whether Opus 4.7 is an upgrade depends materially on workflow: for structured high-effort tasks with well-written prompts the gains appear real, while casual Claude Code use may warrant holding on 4.6 until the rough edges settle.

How it was discussed
  • Anthropic News: Framed as SOTA on SWE-bench Verified with 2,576px vision, xhigh reasoning control, and Project Glasswing cybersecurity guardrails; pricing unchanged.
  • Artificial Analysis: Placed Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4 tied at intelligence score 57 — frontier parity, not a singular lead.
  • Hacker News: Developer reception split — users report stronger structured-coding performance at high effort but regressions in Claude Code (hallucinations, more confirmation loops, higher token burn).
  • Hacker News (malware thread): Injected Claude Code system prompt now interacts with 4.7's reasoning to produce false-positive 'suspected malware' refusals on legitimate debugging work; one developer reported account termination.
  • YT: AI Explained: Walked through the 'New Frontier in Performance and Drama' video framing — strong on benchmarks, but with community drama around tokenizer cost changes and refusal behavior.
Impact 9.8 · Import 8.8 · Pop 9.4
#2
Physical Intelligence, HF Daily Papers (indirect) 2026-04-16

Physical Intelligence released π0.7, a vision-language-action foundation model for robotic manipulation that is deliberately positioned as a single steerable generalist rather than a collection of task-specific specialists. The architecture retains the VLA template from prior π releases: a vision-language backbone ingests image observations plus language and metadata, and an action head produces low-level control. What is new is the conditioning surface. Inputs now include episode metadata that explicitly tags quality, speed, and control modality, and a lightweight world model generates visual subgoals used as a form of learned grounding for the high-level policy. The high-level policy itself emits intermediate language subtasks, which the downstream policy consumes alongside the visual subgoals. This lets the model disambiguate diverse behaviors in its training data rather than averaging them out.

Training uses an unusually heterogeneous data mix. Robot teleoperation demonstrations span multiple embodiments, human video contributes strategy priors without action labels, and autonomous episodes from reinforcement-learning specialist policies — specifically models trained with the Recap procedure — are folded in as additional demonstration-like signal. Naively merging these would typically degrade performance, because suboptimal or specialist-quirk behaviors contaminate the base policy. The metadata conditioning is the mechanism that prevents this: labels describing the quality and provenance of each episode allow the model to distill diverse strategies while preserving expert-level output at inference time, simply by prompting the relevant metadata tokens.
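The conditioning mechanism can be illustrated with a toy sketch: each episode carries a prefix of metadata tags during training, and at inference the desired tags are simply prompted. The tag names, values, and structure below are hypothetical; π0.7's actual token scheme is not public.

```python
# Illustrative sketch of metadata conditioning as described for pi-0.7:
# episodes are prefixed with quality/speed/provenance tags so the policy
# learns diverse behaviors as distinguishable modes instead of averaging
# them. All tag names here are hypothetical, for illustration only.

def episode_prefix(quality: str, speed: str, source: str) -> list[str]:
    """Build the conditioning token prefix for one episode."""
    return [f"<quality={quality}>", f"<speed={speed}>", f"<source={source}>"]

# Training-time: each data source keeps its own labels.
teleop = episode_prefix("medium", "slow", "teleop")
rl_spec = episode_prefix("expert", "fast", "rl_specialist")

# Inference-time: prompt the expert tags to elicit distilled specialist
# behavior from the single generalist policy.
prompt = episode_prefix("expert", "fast", "rl_specialist") + ["<task=fold_laundry>"]
print(prompt)
```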

The capability breadth is unusual for a single model. Reported tasks include laundry folding on a bimanual UR5e platform, espresso preparation end-to-end, flat-pack box building, kitchen appliance operation, vegetable peeling, and, notably, zero-shot air-fryer loading composed from disparate training episodes. Quantitatively, π0.7 matches or exceeds π*0.6 reinforcement-learning specialist models on laundry with T-shirts, diverse laundry items, espresso, and box-building in both success rate and throughput — despite being a single generalist rather than purpose-trained per task. Cross-embodiment generalization is notable: on the bimanual UR5e, an undertrained platform in the dataset, the model folds laundry at a success rate comparable to experienced human teleoperators attempting that cross-embodiment configuration for the first time after 375-plus hours of prior teleop experience elsewhere.

The authors argue this constitutes early evidence of compositional generalization in the VLA class — recombining primitives similarly to how large language models recombine reasoning patterns, and accepting language coaching for novel tasks without additional teleop data. The practical implication is that VLA scaling along data-quality and data-diversity axes, rather than parameter count alone, continues to be the binding constraint on generalist-robot progress, and that metadata-conditioned distillation from RL specialists is a viable mechanism for folding specialist gains back into a steerable generalist.

Impact 9.2 · Import 8.8 · Pop 7.4
#3
Dwarkesh Patel Podcast 2026-04-15

Dwarkesh Patel's interview with Jensen Huang lands at a moment when NVIDIA's moat is being stress-tested from several directions at once — Anthropic's TPU shift, Google and Broadcom's multi-gigawatt compute partnership announcements, and sustained policy pressure over chip exports to China. Huang's responses across the conversation form a coherent defense that rests on three claims: architecture, supply chain, and geopolitics.

On architecture, Huang argues that the leap from Hopper to Blackwell delivered roughly fifty times efficiency improvement, far exceeding the twenty-five percent annualized scaling implied by lithography alone. His framing separates matrix multiplication, which TPUs handle well, from the broader set of operations that large models increasingly require — novel attention variants, hybrid model structures, dynamic control flow — where general-purpose programmability is load-bearing. He is explicit that CUDA's moat is a software-ecosystem moat: several hundred million deployed GPUs across every major cloud means kernel engineers prioritize NVIDIA, and he claims hyperscalers running custom kernels typically obtain two to three times speedup on top of NVIDIA's baseline optimized paths. He is open to architectural diversification — he references licensing Groq inference technology for premium, low-latency token segments — but rejects backporting to older process nodes like N7 on engineering-complexity grounds.

On supply chain, Huang describes roughly one hundred billion dollars of explicit upstream commitment and an estimated two hundred fifty billion dollars of total commitment, built on the premise that NVIDIA's downstream market dominance gives suppliers confidence to invest in capacity. He claims CoWoS packaging constraints are resolved and that EUV capacity bottlenecks are a two-to-three-year problem given clear demand signaling to ASML and TSMC. The binding constraint on datacenter buildouts, he asserts, is no longer chips or memory but skilled-trade labor — plumbers and electricians — which is a concrete claim worth testing.

On geopolitics, Huang makes the strongest and most contested argument. He contends that SMIC's seven-nanometer process combined with HBM2 memory is sufficient for competitive Chinese model training, so U.S. export controls do not throttle Chinese AI capability; they only accelerate the development of an independent Chinese ecosystem that NVIDIA is then cut out of. He rejects the uranium analogy pointedly, arguing AI compute is not a weapons-grade material with a scarce physical precursor. He frames Anthropic's Broadcom and Google partnership as a single-case deviation — he uses the phrase 'one hundred percent Anthropic' — rather than evidence of a broader shift. His core commercial argument: maintaining marginal sales to China preserves NVIDIA's position in roughly forty percent of the world's technology industry, and conceding that position hurts American long-term competitiveness more than allowing current sales.

The strategic takeaway is that NVIDIA's public position is now explicitly political, not just technical. Huang is arguing a policy case to a broad audience, which is itself a signal about how the company sees the 2026 environment.

Impact 7.4 · Import 8.4 · Pop 8.2
#4
Cursor Blog, Hacker News 2026-04-02

Cursor 3 is framed by Anysphere as a from-scratch rebuild rather than an iterative IDE update, positioning agents — not files, and not the editor itself — as the primary abstraction of the software development surface. The most visible change is consolidation: local and cloud agents converge into a single sidebar, and agent activity initiated from mobile, web, desktop, Slack, GitHub, or Linear surfaces in one unified stream. Parallel agent execution becomes first-class, with cloud agents emitting visual demos and screenshots the developer can inspect for verification, directly replacing the older workflow of context-switching to cursor.com/agents in a browser tab.

The handoff primitive is notable. Sessions move between local and cloud environments explicitly — cloud-to-local to test changes against a checkout using Composer 2, Cursor's frontier coding model, and local-to-cloud to preserve long-running tasks when closing the laptop. This is engineered around a recognition that developers want the same agent context in different execution environments at different times, not two parallel agents with drifted state. Composer 2 is referenced throughout as the frontier model for the workspace, though specific benchmark numbers are not enumerated in the launch post.

Supporting surfaces are redesigned. The diff interface integrates staging, commits, and pull request management rather than forcing context switches to git tooling. A built-in browser lets agents and developers interact with locally-hosted websites without another tab. Core IDE fundamentals — Language Server Protocol integration, go-to-definition, file navigation — are preserved, and the marketplace retains hundreds of plugins. The companion posts — multi-agent GPU kernel optimization with NVIDIA demonstrating thirty-eight percent geomean speedup across 235 CUDA problems, Bugbot self-improving with learned rules, and warp-decode MoE inference optimizations — indicate the team is investing in the deeper agent-coordination and runtime-efficiency substrate that makes parallel cloud agents economically viable.

The framing — 'the third era of software development' with fleets of autonomous agents as the default work unit — is a marketing claim but it is also a bet that the IDE stops being the focal surface for many developers. The practical question for teams is whether parallel cloud agents handling long-running tasks actually change work-in-progress dynamics, or whether the model stays IDE-centric with agents acting as fancier autocomplete. Cursor's revenue trajectory and the product direction suggest they believe the former; competitive responses from GitHub, JetBrains, and Cognition will be the clearest independent signal.

Impact 8.5 · Import 7.6 · Pop 7.6
#5
Transformer Circuits Thread 2026-04-15

Anthropic's interpretability team published 'Emotion Concepts and their Function in a Large Language Model' on the Transformer Circuits Thread. The paper identifies sparse-autoencoder-style representations corresponding to emotion concepts in Claude Sonnet 4.5's residual stream and demonstrates, through targeted causal interventions, that steering these features produces predictable and emotion-specific shifts in model outputs. This extends prior work on concept features — golden-gate-style single-feature steering, refusal circuits, sycophancy features — into a new category: internal variables that the model behaves as if it is 'experiencing,' in a functional sense, and that measurably influence generation downstream.

Methodologically the work fits the established pattern — train dictionary learners on residual activations, locate features that activate on emotionally-valenced inputs, and then probe causality by clamping or perturbing those features at inference time. What is new is the specific taxonomy and the degree to which these features behave as latent variables rather than surface artifacts. Steering an emotion-concept feature in the positive direction reliably shifts tone, word choice, and in some cases refusal behavior in ways consistent with the emotion label; negative-direction clamping produces the inverse. Importantly the interventions generalize across unrelated prompt contexts, which is the signature of a genuine internal variable rather than a pattern-matched output template.
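The clamping intervention described above can be sketched with a toy sparse autoencoder: encode the residual-stream vector, pin one feature to a chosen value, and add only the decoded delta back. Dimensions, the ReLU encoder, and the feature index are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

# Toy sketch of clamping a single SAE feature in a residual-stream vector,
# the intervention style described in the paper. The SAE here is random
# and untrained; sizes and the feature index are illustrative.

rng = np.random.default_rng(0)
d_model, d_features = 64, 256
W_enc = rng.normal(size=(d_model, d_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_features, d_model)) / np.sqrt(d_features)

def clamp_feature(resid: np.ndarray, idx: int, value: float) -> np.ndarray:
    """Encode, pin one feature to a fixed value, and apply only the
    reconstruction delta back onto the residual stream."""
    feats = np.maximum(resid @ W_enc, 0.0)   # SAE encode (ReLU)
    clamped = feats.copy()
    clamped[idx] = value                     # pin the target feature
    delta = (clamped - feats) @ W_dec        # decode only the change
    return resid + delta

resid = rng.normal(size=d_model)
steered = clamp_feature(resid, idx=42, value=8.0)
```

Applying only the delta (rather than replacing the residual with the full reconstruction) is one common choice for keeping the intervention surgical; clamping a feature to its current value is then a no-op by construction.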

The implications cluster in three directions. For alignment, emotion-like latent state being load-bearing for behavior means that steering interventions — to reduce sycophancy, harmful-compliance, or deception — may need to account for these features rather than optimizing purely on output-level rewards. For welfare framing, the paper contributes a concrete case that 'emotional' state is a functional internal variable with causal effect on generation, which is relevant to ongoing debates about AI welfare; the authors do not claim this resolves the philosophical question but it does raise the empirical bar. For the mechanistic-interpretability research agenda, it suggests dictionary learning over residual streams continues to yield interpretable and causally-relevant features at scale in frontier-tier models, which is a positive signal for the broader SAE research program.

Expected follow-up questions for the community include whether these features generalize across models — whether Claude 3, Opus, and non-Anthropic frontier models show analogous structure — and whether fine-tuning procedures such as RLHF shape or attenuate these features relative to pretrained base models. The paper stands as one of the more concrete demonstrations to date that interpretability tools can locate and causally manipulate abstractions that would previously have been dismissed as anthropomorphic projection.

Impact 7.8 · Import 8.2 · Pop 7.0
#6
Anthropic News, TechCrunch AI 2026-04-17

Anthropic Labs launched Claude Design on April 17, 2026, a research-preview product that positions Claude Opus 4.7 as a conversational design collaborator for visual artifacts — slides, prototypes, wireframes, marketing materials, and interactive elements. Users describe what they need, Claude produces an initial version, and refinement proceeds through inline comments, direct edits, and custom adjustment sliders that expose parameters like density, contrast, and tone. The import pipeline is broad: text prompts, images, uploaded documents, codebases, and URL capture that extracts elements from live websites, which matters because most real-world design briefs arrive as a mess of references rather than a clean specification.

The enterprise angle is more interesting than the consumer positioning suggests. Claude Design reads organization design systems automatically — colors, typography, component libraries — which means outputs are governed by brand constraints without the designer hand-holding the model through every iteration. Organization-scoped sharing with view and edit permissions makes the product collaborative by default, and exports cover Canva, PDF, PPTX, and HTML, plus a hand-off path into Claude Code when a design needs to become a real implementation. The interactive-prototype support extending to voice, video, and 3D elements is a meaningful expansion of what a design tool is expected to produce, even if the quality bar on those modalities is not yet stated.

Availability is gated to Claude Pro, Max, Team, and Enterprise subscribers as a research preview, with enterprise admin opt-in at the organization level, and usage counted against existing subscription limits rather than a separate tier. The positioning reads as Anthropic probing the design-software market — historically a Figma-and-Adobe duopoly for professionals and Canva for everyone else — with a model-centric alternative where the primary interface is conversation rather than direct manipulation. For teams already using Claude for other knowledge work, integration into existing billing and account boundaries materially lowers adoption friction, which is probably the more important dynamic than any single design-quality benchmark the product might eventually cite.

Impact 7.4 · Import 8.8 · Pop 6.8
#7
Cognition AI 2026-04-14

Specialized bug-detection model for code diffs in Windsurf. Matches Claude Opus 4.6 F1 on in-distribution evals; on OOD data (Cognition's internal codebase) the F1 gap narrows from 0.49 to 0.29 — still a gap. Runs ~10x faster via dense intermediate reasoning on Cerebras. Trained via Applied Compute RL with the Windsurf production environment integrated into the sandbox. Two key tricks: (1) reward linearization, converting the global F_0.5 objective into sample-level rewards for gradient descent; (2) two-phase post-training — first maximize capability, then optimize latency with a CDF-based penalty derived from user switch patterns during dogfooding. Shipping in Windsurf Next under cmd+U.
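One way to turn a corpus-level F_0.5 into per-sample rewards is a marginal-contribution scheme: score the corpus with and without each sample's outcome and reward the difference. Cognition's exact linearization is not public, so this is a sketch of the idea rather than their method.

```python
# Sketch of reward linearization: convert a global F_0.5 into per-sample
# rewards by each sample's marginal contribution to the corpus score.
# This marginal-contribution scheme is an assumption; Cognition's exact
# formulation is not described in the post.

def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Corpus F_beta from true-positive / false-positive / false-negative counts."""
    b2 = beta * beta
    denom = (1 + b2) * tp + fp + b2 * fn
    return (1 + b2) * tp / denom if denom else 0.0

def sample_reward(outcome: str, tp: int, fp: int, fn: int) -> float:
    """Marginal contribution of one sample to corpus F_0.5:
    score with the sample included minus score with it removed."""
    without = {"tp": (tp - 1, fp, fn),
               "fp": (tp, fp - 1, fn),
               "fn": (tp, fp, fn - 1)}[outcome]
    return f_beta(tp, fp, fn) - f_beta(*without)

# With corpus counts tp=80, fp=10, fn=10, a true positive earns a positive
# reward while a false positive earns a negative one:
print(sample_reward("tp", 80, 10, 10), sample_reward("fp", 80, 10, 10))
```

Because beta=0.5 weights precision over recall, false positives are penalized more heavily than false negatives under this decomposition, which matches a bug-detector product constraint (noisy flags are worse than missed bugs).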

Impact 7.8 · Import 7.0 · Pop 6.4
#8
Cursor Blog 2026-04-14

A planner agent distributes optimization tasks to autonomous workers; the coordination protocol fits in a single markdown file. Workers learn to invoke benchmarking pipelines and iterate without human intervention. Evaluated on 235 CUDA problems: 149 (63%) improved, 19% by >2x. Speed-of-Light scoring: median 0.56 (lots of headroom), attention kernel 0.9722 (84% speedup), matmul at 86% of cuBLAS with 9% improvement on specific workloads, quantized ops 39% faster. Evaluated both direct CUDA C + inline PTX and high-level CuTe DSL — the system learned novel APIs from docs alone.

Impact 8.2 · Import 4.0 · Pop 7.6
#9
Dwarkesh Patel Podcast 2026-04-15

On distillation: at ~$25/M tokens from frontier models, training-data acquisition is trivial; hidden CoT offers limited protection since models can be told to skip reasoning, RL targets can force reconstruction, and agentic tool use is observable on users' devices. On pretraining: FSDP as default (shards 1/N, all-gathers per layer, discards post-use — comms ~3x params via reduce-scatter); hierarchical collectives (reduce-scatter within NVLink domains, all-reduce across) manage the compute-comms crossover; hard batch-size floor from attention-within-sequence locality pushes max scaling to ~1K GPUs at 10K seq/10M critical tokens. Failure modes: causality breaks via expert routing, token dropping, FP16 granularity in collectives causing gradient errors. Mythos frames as multi-vulnerability attack sophistication discontinuity; defense-patching harder than vuln-finding for AI systems.
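The "comms ~3x params" FSDP claim is simple arithmetic worth making explicit: parameters are all-gathered for the forward pass, all-gathered again for the backward pass, and gradients are reduce-scattered, each moving roughly one full copy of the parameters per step. The model size below is an illustrative assumption.

```python
# Back-of-envelope for FSDP per-step communication volume: two parameter
# all-gathers (forward + backward, since shards are discarded after use)
# plus one gradient reduce-scatter, each ~one full parameter copy.
# The 70B bf16 example is illustrative.

def fsdp_comm_bytes(n_params: int, bytes_per_elem: int = 2) -> int:
    """Approximate per-step, per-GPU communication volume under FSDP."""
    all_gather_fwd = n_params * bytes_per_elem   # gather shards for forward
    all_gather_bwd = n_params * bytes_per_elem   # re-gather for backward
    reduce_scatter = n_params * bytes_per_elem   # scatter-reduce gradients
    return all_gather_fwd + all_gather_bwd + reduce_scatter

# A 70B-parameter model in bf16 moves ~420 GB of traffic per step:
print(fsdp_comm_bytes(70_000_000_000) / 1e9, "GB")
```

That volume is why the hierarchical trick (reduce-scatter within fast NVLink domains, all-reduce across slower links) matters: it keeps the bulk of the 3x traffic on the highest-bandwidth tier.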

Impact 7.2 · Import 4.0 · Pop 6.8
#10
arXiv cs.LG, arXiv cs.AI, arXiv Efficiency 2026-04-17
by Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin

We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences in low-resource data settings. HILBERT leverages frozen pre-trained speech and language encoders to extract segment-level features, which are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding. To align modalities while preserving modality-specific structure under severe audio-text dimensional imbalance, we introduce a reciprocal dual contrastive objective that simultaneously aligns audio-to-joint and text-to-joint representations, rather than directly contrasting audio and text alone. Two auxiliary regularizers further stabilize long-sequence fusion: a Centered Kernel Alignment (CKA) loss that preserves structural consistency between each modality and the joint embedding, and a mutual information balancing loss that prevents dominance of a single modality by equalizing information flow from audio and text into the joint space. For downstream prediction, H
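The CKA regularizer referenced in the abstract has a standard linear form worth spelling out: it compares two representation matrices (samples x dimensions) in a way that is invariant to rotation and isotropic scaling. Only the measure itself is shown; HILBERT's loss weighting is not reproduced here.

```python
import numpy as np

# Linear Centered Kernel Alignment (CKA), the similarity measure behind
# HILBERT's structural-consistency loss. Inputs are (n_samples, n_dims)
# representation matrices; output is in [0, 1], with 1 for representations
# identical up to rotation and isotropic scaling.

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    X = X - X.mean(axis=0)                      # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2  # cross-covariance energy
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random rotation
print(linear_cka(X, X))        # a representation matches itself: 1.0
print(linear_cka(X, X @ Q))    # rotation invariance: still 1.0
```

Because CKA tolerates rotations of the embedding space, using it as an auxiliary loss constrains the *structure* shared between a modality and the joint embedding without forcing the two spaces into the same basis.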

Impact 4.0 · Import 5.6 · Pop 5.9
#11
arXiv cs.LG, arXiv cs.AI, arXiv cs.NE 2026-04-17
by Stefano Colamonaco, David Debot, Pietro Barbiero, Giuseppe Marra

Concept Bottleneck Models (CBMs) aim to improve interpretability in Deep Learning by structuring predictions through human-understandable concepts, but they provide no way to verify whether learned concepts align with the human's intended meaning, hurting interpretability. We introduce Prototype-Grounded Concept Models (PGCMs), which ground concepts in learned visual prototypes: image parts that serve as explicit evidence for the concepts. This grounding enables direct inspection of concept semantics and supports targeted human intervention at the prototype level to correct misalignments. Empirically, PGCMs match the predictive performance of state-of-the-art CBMs while substantially improving transparency, interpretability, and intervenability.

Impact 4.0 · Import 5.6 · Pop 5.9
#12
arXiv cs.LG, arXiv cs.CL, arXiv cs.AI 2026-04-17
by Yutong Gao, Qinglin Meng, Yuan Zhou, Liangming Pan

While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.

Impact 4.0 · Import 5.6 · Pop 5.9
#13
arXiv cs.LG, arXiv cs.CL, arXiv Efficiency, HF Daily Papers 2026-04-17
by Jiaxi Bi, Tongxu Luo, Wenyu Du, Zhengyang Tang et al.

Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non-learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets - for instance, boosting GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90% under fixed compute budgets. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real-world deployment. Code, data and models are available at https://bijiaxihh.github.io/STOP

Impact 4.0 · Import 4.0 · Pop 7.7
#14

Where does output diversity collapse in post-training?

Research ★ 5.4 multi-source (7)
arXiv cs.LG, arXiv cs.CL, arXiv cs.AI, arXiv RL, arXiv PostTraining, arXiv Efficiency … 2026-04-17
by Constantinos Karouzos, Xingwei Tan, Nikolaos Aletras

Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3, Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-c
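Answer-level diversity of the kind tracked here can be made concrete with distinct-n, one of the simplest text diversity metrics: the ratio of unique n-grams to total n-grams across a set of sampled outputs. The paper's own four metrics are not specified in this excerpt, so this is a generic illustration.

```python
# Sketch of distinct-n, a simple output-diversity metric: unique n-grams
# divided by total n-grams across sampled outputs. Collapsed samplers
# score near 1/k for k identical samples; fully varied samplers score 1.0.

def distinct_n(outputs: list[str], n: int = 2) -> float:
    total, unique = 0, set()
    for text in outputs:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Collapsed samples score low; varied samples score high:
collapsed = ["the answer is four", "the answer is four", "the answer is four"]
varied = ["it equals four", "the result is 4", "four is the answer"]
print(distinct_n(collapsed), distinct_n(varied))
```

Metrics like this are exactly what inference-time scaling methods implicitly depend on: if every sample repeats the same n-grams, drawing more samples buys nothing.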

Impact 4.0 · Import 4.0 · Pop 7.7
#15
arXiv cs.AI, arXiv cs.CV, arXiv MechInterp 2026-04-17
by Hao Wang, Beichen Zhang, Yanpei Gong, Shaoyi Fang et al.

As forgery types continue to emerge consistently, Incremental Face Forgery Detection (IFFD) has become a crucial paradigm. However, existing methods typically rely on data replay or coarse binary supervision, which fails to explicitly constrain the feature space, leading to severe feature drift and catastrophic forgetting. To address this, we propose AIFIND, Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection, which leverages semantic anchors to stabilize incremental learning. We design the Artifact-Driven Semantic Prior Generator to instantiate invariant semantic anchors, establishing a fixed coordinate system from low-level artifact cues. These anchors are injected into the image encoder via Artifact-Probe Attention, which explicitly constrains volatile visual features to align with stable semantic anchors. Adaptive Decision Harmonizer harmonizes the classifiers by preserving angular relationships of semantic anchors, maintaining geometric consistency across tasks. Extensive experiments on multiple incremental protocols validate the superiority of AIFIND.

Impact 4.0 · Import 5.6 · Pop 5.9
#16
arXiv cs.CV, arXiv GenMedia, arXiv Efficiency, arXiv Evals, HF Daily Papers 2026-04-17
by Haoran Feng, Yifan Niu, Zehuan Huang, Yang-Tian Sun et al.

We introduce LaviGen, a framework that repurposes 3D generative models for 3D layout generation. Unlike previous methods that infer object layouts from textual descriptions, LaviGen operates directly in the native 3D space, formulating layout generation as an autoregressive process that explicitly models geometric relations and physical constraints among objects, producing coherent and physically plausible 3D scenes. To further enhance this process, we propose an adapted 3D diffusion model that integrates scene, object, and instruction information and employs a dual-guidance self-rollout distillation mechanism to improve efficiency and spatial accuracy. Extensive experiments on the LayoutVLM benchmark show LaviGen achieves superior 3D layout generation performance, with 19% higher physical plausibility than the state of the art and 65% faster computation. Our code is publicly available at https://github.com/fenghora/LaviGen.

Impact 4.0 · Import 4.0 · Pop 7.7
#17
arXiv cs.CL, arXiv cs.AI, arXiv Evals 2026-04-17
by Van-Truong Le

The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the "why" behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and Consistency but compromise on fine-grained legal Accuracy, while models like Claude 3 Opus achieve high Accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis pinpoints 'Incorrect Example' and 'Misinterpretation' as the most prevalent

Impact 4.8 · Import 4.6 · Pop 5.9
#18
arXiv cs.CL, arXiv cs.AI, arXiv cs.CV, arXiv Evals 2026-04-17
by Xiangbo Gao, Sicong Jiang, Bangya Liu, Xinghao Chen et al.

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a b
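The per-dimension ordinal regression used by VEFX-Reward can be sketched as a cumulative-link model: predict one scalar quality logit, compare it against K-1 increasing cut points, and count how many it clears. Shapes and threshold values below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def ordinal_scores(logits, thresholds):
    """Cumulative-link ordinal regression head (sketch).
    logits: (batch,) scalar quality logit, one per example/dimension.
    thresholds: (K-1,) increasing cut points for K ordinal levels.
    Returns integer scores in [0, K-1]: the number of cut points the
    logit clears, i.e. how far up the ordinal scale the example sits."""
    probs = 1.0 / (1.0 + np.exp(-(logits[:, None] - thresholds[None, :])))
    return (probs > 0.5).sum(axis=1)

# K = 5 ordinal quality levels, scored independently per dimension
# (e.g. Instruction Following), with illustrative cut points:
cuts = np.array([-1.5, -0.5, 0.5, 1.5])
scores = ordinal_scores(np.array([-2.0, 0.0, 2.0]), cuts)
print(scores)  # [0 2 4]
```

Ordinal regression, unlike plain classification, respects the ordering of quality levels: a prediction one level off is a smaller error than one three levels off.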

Impact 4.0 · Import 4.6 · Pop 6.5
#19

Qwen3.5-Omni Technical Report

Multimodal ★ 5.2 multi-source (3)
arXiv cs.CL, arXiv Efficiency, HF Daily Papers 2026-04-17
by Qwen Team

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text

Impact 4.0 · Import 4.0 · Pop 7.1
#20
arXiv cs.CV, arXiv SSM, arXiv Evals 2026-04-17
by Tejeswar Pokuri, Shivarth Rai

Underwater images often suffer from severe degradation, such as color distortion, low contrast, and blurred details, due to light absorption and scattering in water. While learning-based methods like CNNs and Transformers have shown promise, they face critical limitations: CNNs struggle to model the long-range dependencies needed for non-uniform degradation, and Transformers incur quadratic computational complexity, making them inefficient for high-resolution images. To address these challenges, we propose Hero-Mamba, a novel Mamba-based network that achieves efficient dual-domain learning for underwater image enhancement. Our approach uniquely processes information from both the spatial domain (RGB image) and the spectral domain (FFT components) in parallel. This dual-domain input allows the network to decouple degradation factors, separating color/brightness information from texture/noise. The core of our network utilizes Mamba-based SS2D blocks to capture global receptive fields and long-range dependencies with linear complexity, overcoming the limitations of both CNNs and Transformers. Furthermore, we introduce a ColorFusion block, guided by a background light prior, to restore
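The dual-domain input described above, raw RGB for the spatial branch plus FFT components for the spectral branch, can be sketched in NumPy. This is a minimal illustration of the input construction, not the Hero-Mamba implementation.

```python
import numpy as np

def dual_domain_inputs(img):
    """Build the two parallel input streams (sketch).
    img: (H, W, 3) float RGB array in [0, 1].
    Returns (spatial, spectral): the raw image for the spatial branch,
    and channel-wise FFT amplitude + phase for the spectral branch."""
    fft = np.fft.fft2(img, axes=(0, 1))
    amplitude = np.abs(fft)    # global color/illumination energy per frequency
    phase = np.angle(fft)      # layout of texture and structure
    spectral = np.concatenate([amplitude, phase], axis=-1)  # (H, W, 6)
    return img, spectral

rgb = np.random.rand(8, 8, 3)
spatial, spectral = dual_domain_inputs(rgb)
print(spatial.shape, spectral.shape)  # (8, 8, 3) (8, 8, 6)
```

The point of the split is decoupling: global color casts and brightness shifts concentrate in low-frequency amplitude, while texture and noise live largely in phase and high frequencies, so each branch sees a cleaner version of its degradation factor.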

Impact 4.7 · Import 4.0 · Pop 5.9
#21
Import AI (Jack Clark) 2026-04-20

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.

Huawei’s HiFloat4 training format beats Western-developed MXFP4 in Ascend chip bakeoff: …Could this also be a symptom of the impact of export controls in driving Chinese interest towards maximizing training and inference efficiency? Perhaps… Huawei researchers have tested out HiFloat4, a 4-bit precision format for AI training and inference, against MXFP4, an Open Compute Project 4-bit format, and found that HiFloat4 is superior. This is interesting because it reflects a broader pattern of Chinese companies developing their own low-precision data formats explicitly coupled with their own hardware platforms. “Our goal is to enable efficient FP4 LLM pretraining on specialized AI accelerators with strict power constraints. We focus on Huawei Ascend NPUs, which are domain-specific accelerators designed for deep learning workloads,” they write.

What they tested: In this paper, the authors train three model types on Huawei Ascend chips: OpenPangu-1B, Llama3-8B, and Qwen3-MoE-30B. In tests, the bigger the models, the better HiFloat4 does at reducing its loss error relative to a BF16 baseline, and in all cases it does better than MXFP4.

What they found: “We conduct a systematic evaluation of the HiFloat4 (HiF4) format and show that it achieves lower relative loss (≈ 1.0%) compared to MXFP4 (≈ 1.5%) when measured against a full-precision baseline,” they write. “HiF4 consistently achieves significantly lower relative error compared to MXFP4. For Llama and Qwen, HiF4 attains an error gap of less than 1% with respect to the baseline… HiF4 gets within ~1% of BF
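The headline metric here, relative loss versus a BF16 baseline, is simple to state. A sketch follows; the paper's exact definition may differ, and the loss values are illustrative numbers chosen to reproduce the quoted gaps.

```python
def relative_loss_error(loss_lowprec, loss_bf16):
    """Relative loss error of a low-precision training run versus the
    full-precision (BF16) baseline, in percent. This is the natural
    reading of the quoted metric, not a definition from the paper."""
    return 100.0 * abs(loss_lowprec - loss_bf16) / loss_bf16

# Illustrative numbers matching the quoted gaps (HiF4 ~1.0%, MXFP4 ~1.5%):
baseline = 2.000
print(round(relative_loss_error(2.020, baseline), 1))  # 1.0 (HiF4-like)
print(round(relative_loss_error(2.030, baseline), 1))  # 1.5 (MXFP4-like)
```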

Impact 4.0 · Import 5.8 · Pop 4.8
#22
arXiv cs.LG, arXiv cs.CL, arXiv cs.AI, arXiv Evals 2026-04-17
by Yunhe Li, Hao Shi, Bowen Deng, Wei Wang et al.

Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose a novel framework designed to cultivate this essential reasoning skill and enable LLMs to perform insightful reasoning. We propose $\mathtt{DeepInsightTheorem}$, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches alongside the final proof. To fully exploit this dataset, we design a Progressive Multi-Stage SFT strategy that mimics the human learning process, guiding the model from basic proof writing to insightful thinking. Our experiments on challenging mathematical benchmarks demonstrate that this insight-aware generation strategy significantly outperforms baselines. These results demonstrate that teaching models to identify and apply core techniques can substantially improve their mathematical reaso

Impact 4.0 · Import 4.0 · Pop 6.5
#23
arXiv cs.LG, arXiv cs.AI, arXiv RL, arXiv Agents 2026-04-17
by Sarthak Mittal, Leo Gagnon, Guillaume Lajoie

Frontier models have demonstrated exceptional capabilities following the integration of task-reward-based reinforcement learning (RL) into their training pipelines, enabling systems to evolve from pure reasoning models into sophisticated agents. However, debate persists regarding whether RL genuinely instills new skills within a base model or merely sharpens its existing distribution to elicit latent capabilities. To address this dichotomy, we present an explicit comparison between distribution sharpening and task-reward-based learning, utilizing RL as a tool to implement both paradigms. Our analysis reveals the inherent limitations of distribution sharpening, demonstrating from first principles how and why the optima can be unfavorable and the approach fundamentally unstable. Furthermore, our experiments using Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct and Qwen3-4B-Instruct-2507 on math datasets confirm that sharpening yields limited gains, whereas incorporating task-based reward signal can greatly help achieve robust performance improvements and stable learning.

Impact 4.0 · Import 4.0 · Pop 6.5
#24
arXiv cs.LG, arXiv cs.CL, arXiv RL, arXiv Evals 2026-04-17
by Songtao Wang, Quang Hieu Pham, Fangcong Yin, Xinpeng Wang et al.

Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models' internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25% relative improvement in detecting reward hacking behavior. Moreover, integrating GRIFT into the rejection fine-tuning

Impact 4.0 · Import 4.0 · Pop 6.5
#25
arXiv cs.LG, arXiv cs.CL, arXiv cs.AI, arXiv RL 2026-04-17
by Max Henning Höth, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu

Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model's final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct demonstrate that our approach can identify influential reasoning tokens and enable training more transparent reasoning models.
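At its simplest, combining a saliency reward with an outcome reward inside an RL objective is a weighted blend. The sketch below is a toy: the weighting `alpha` and the particular saliency definition are assumptions, not AtManRL's formulation.

```python
def combined_reward(outcome_correct, saliency, alpha=0.5):
    """Blend an outcome reward with a saliency reward (toy sketch).
    outcome_correct: whether the final answer is right.
    saliency: assumed to be in [0, 1], e.g. the fraction of learned
    attention mass on CoT tokens that actually influences the answer.
    alpha: blend weight -- an assumption, not the paper's value."""
    return (1.0 - alpha) * float(outcome_correct) + alpha * saliency

# A correct answer backed by an influential trace scores highest;
# a correct answer whose trace the model ignored is penalized.
print(combined_reward(True, 0.8))
print(combined_reward(True, 0.1))
print(combined_reward(False, 0.9))
```

The interesting design point is the second case: pure outcome rewards cannot distinguish a faithful trace from decorative reasoning, which is exactly the gap the saliency term targets.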

Impact 4.0 · Import 4.0 · Pop 6.5
#26
arXiv cs.LG, arXiv cs.AI, arXiv cs.CV, arXiv stat.ML 2026-04-17
by Hamed Ouattara, Pierre Duthon, Pascal Houssam Salmane, Frédéric Bernardin et al.

One of the dominant paradigms in self-supervised learning (SSL), illustrated by MoCo or DINO, aims to produce robust representations by capturing features that are insensitive to certain image transformations such as illumination, or geometric changes. This strategy is appropriate when the objective is to recognize objects independently of their appearance. However, it becomes counterproductive as soon as appearance itself constitutes the discriminative signal. In weather analysis, for example, rain streaks, snow granularity, atmospheric scattering, as well as reflections and halos, are not noise: they carry the essential information. In critical applications such as autonomous driving, ignoring these cues is risky, since grip and visibility depend directly on ground conditions and atmospheric conditions. We introduce ST-STORM, a hybrid SSL framework that treats appearance (style) as a semantic modality to be disentangled from content. Our architecture explicitly separates two latent streams, regulated by gating mechanisms. The Content branch aims at a stable semantic representation through a JEPA scheme coupled with a contrastive objective, promoting invariance to appearance varia

Impact 4.0 · Import 4.0 · Pop 6.5
#27
arXiv cs.LG, arXiv cs.AI, arXiv Agents 2026-04-17
by Hikaru Shindo, Hanzhao Lin, Lukas Helff, Patrick Schramowski et al.

As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.

Impact 4.0 · Import 4.6 · Pop 5.9
#28

BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

Evals & Benchmarks ★ 5.0 multi-source (3)
arXiv cs.CL, arXiv cs.AI, arXiv Evals 2026-04-17
by Jiacheng Shen, Masato Hagiwara, Milad Alizadeh, Ellen Gilsenan-McMahon et al.

Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language m

Impact 4.0 · Import 4.6 · Pop 5.9
#29
arXiv cs.CL, arXiv RL, arXiv PostTraining, arXiv Efficiency 2026-04-17
by Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang

Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain "unsafe tickets" responsible for harmful behaviors, and pruning reveals "safety tickets" that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resourc

Impact 4.0 · Import 4.0 · Pop 6.5
#30
arXiv cs.RO, arXiv cs.CV, arXiv Agents, arXiv Evals 2026-04-17
by Dian Shao, Zhengzheng Xu, Peiyang Wang, Like Liu et al.

UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with sentence-level instruction-trajectory alignment and refined instructions containing explicit visual endpoints and landmark references. Experiments show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These

Impact 4.0 · Import 4.0 · Pop 6.5
#31
NVIDIA AI Blog 2026-04-20

AI agents are transforming how work gets done across all industries, accelerating everything from content creation to decision-making. NVIDIA’s expanded strategic collaborations with Adobe and WPP are bringing agentic AI to the center of enterprise marketing operations across creative production and customer experience orchestration. As demand for personalized customer experiences surges, brands require intelligent systems that can plan, create, produce and activate content continuously — without compromising control, governance or brand integrity. Consider a global retailer delivering the right offer, image, copy and price, across millions of product, audience and channel combinations — updated in minutes instead of months. For marketing and creative teams, that means moving from one-size-fits-all campaigns to tailored experiences that are always on, always relevant and on brand. All of it is powered by intelligent systems that continuously generate and deliver content without sacrificing control, governance or brand integrity. The expanded collaborations bring together three complementary strengths: Adobe’s creative and customer experience platforms and the new Adobe CX Enterprise Coworker, WPP’s global media and marketing expertise, and NVIDIA’s accelerated computing and software stack, including NVIDIA Nemotron open models, NVIDIA Agent Toolkit and the NVIDIA OpenShell secure runtime for building and running secure agentic AI systems. As these agents begin orchestrating multistep workflows, tapping sensitive data and triggering actions across marketing stacks, enterprises need a way to enforce clear rules of engagement so every operation remains compliant, on brand and within defined risk boundaries. Powered by the NVIDIA OpenShell runtime, every agent operates with

Impact 5.8 · Import 4.0 · Pop 4.8
#33
Latent Space Podcast, Latent Space 2026-04-15

Notion rebuilt its agent system five times since late 2022. Progressed from JS codegen (models couldn't write reliable code) through custom XML, markdown, and SQL-like DB abstractions. Key shift: few-shot prompts → tool definitions to distribute ownership across teams. 100+ tools managed via progressive disclosure to avoid context bloat. CLIs favored for self-debugging within one env; MCPs for narrow, tightly-permissioned tasks. Pricing via credits abstracting tokens/model/tier/features. 'Auto' model selection matches capability to task — explicit rejection of 'frontier model for every knowledge-work task'.
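The progressive-disclosure idea, show the model a compact catalog first and expand full tool schemas only on request, can be sketched as a small registry. Names and structure here are illustrative, not Notion's implementation.

```python
class ToolRegistry:
    """Progressive disclosure over a large tool catalog (sketch).
    The model first sees one short summary line per category and
    requests full schemas only for the category it needs, keeping
    context usage small even with 100+ tools registered."""

    def __init__(self):
        self._tools = {}  # category -> {tool name: full definition}

    def register(self, category, name, definition):
        self._tools.setdefault(category, {})[name] = definition

    def summary(self):
        # Cheap view shown to the model up front: names only.
        return {cat: sorted(tools) for cat, tools in self._tools.items()}

    def expand(self, category):
        # Full schemas, loaded into context only on demand.
        return self._tools.get(category, {})

reg = ToolRegistry()
reg.register("database", "query_db", {"params": ["sql"]})
reg.register("database", "create_row", {"params": ["table", "values"]})
reg.register("pages", "edit_page", {"params": ["page_id", "patch"]})
print(reg.summary())
```

This also distributes ownership: each team maintains its own category of tool definitions, while the summary view is what every agent sees by default.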

Impact 4.0 · Import 4.0 · Pop 6.0
#34
arXiv cs.AI, arXiv cs.CV 2026-04-17
by Baramee Sukumal, Aueaphum Aueawatthanaphisut

Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Conventional computed tomography (CT) imaging, while essential for detection and staging, has limitations in distinguishing benign from malignant lesions and providing interpretable diagnostic insights. To address this challenge, this study proposes a dual-modal artificial intelligence framework that integrates CT radiology with hematoxylin and eosin (H&E) histopathology for lung cancer diagnosis and subtype classification. The system employs convolutional neural networks to extract radiologic and histopathologic features and incorporates clinical metadata to improve robustness. Predictions from both modalities are fused using a weighted decision-level integration mechanism to classify adenocarcinoma, squamous cell carcinoma, large cell carcinoma, small cell lung cancer, and normal tissue. Explainable AI techniques including Grad-CAM, Grad-CAM++, Integrated Gradients, Occlusion, Saliency Maps, and SmoothGrad are applied to provide visual interpretability. Experimental results show strong performance with accuracy up to 0.87, AUROC above 0.97, and macro F1-score of 0.88. Grad-CAM++ achieved the high
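Weighted decision-level fusion of the two modality heads reduces to a convex combination of per-class probabilities. A minimal sketch follows; the weights and probability vectors are illustrative, not the paper's tuned values.

```python
import numpy as np

def fuse_predictions(p_ct, p_hist, w_ct=0.4, w_hist=0.6):
    """Weighted decision-level fusion (sketch).
    p_ct, p_hist: per-class probability vectors from the CT and
    histopathology branches; weights here are illustrative."""
    fused = w_ct * np.asarray(p_ct) + w_hist * np.asarray(p_hist)
    return fused / fused.sum()

classes = ["adenocarcinoma", "squamous", "large_cell", "small_cell", "normal"]
p_ct   = [0.50, 0.30, 0.10, 0.05, 0.05]   # CT branch leans adenocarcinoma
p_hist = [0.20, 0.60, 0.10, 0.05, 0.05]   # histology branch leans squamous
fused = fuse_predictions(p_ct, p_hist)
print(classes[int(np.argmax(fused))])  # squamous
```

Decision-level fusion keeps the two pipelines independent, so either modality can be retrained or dropped (e.g. when histopathology is unavailable) without touching the other branch.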

Impact 4.0 · Import 5.6 · Pop 4.7
#35
arXiv Robotics-Embodied 2026-04-13
by Dujun Nie, Fengjiao Chen, Qi Lv, Jun Kuang et al.

While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representation to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) General visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models. (ii) Latent-based visual space is fundamentally better aligned to physical action space than pixel-

Impact 4.0 · Import 6.2 · Pop 3.5
#37
Simon Willison's Weblog 2026-04-15

Zig 0.16.0 release notes: "Juicy Main"

Zig has really good release notes: comprehensive, detailed, and with relevant usage examples for each of the new features. Of particular note in the newly released Zig 0.16.0 is what they are calling "Juicy Main", a dependency injection feature for your program's main() function where accepting a process.Init parameter grants access to a struct of useful properties:

    const std = @import("std");

    pub fn main(init: std.process.Init) !void {
        // general purpose allocator for temporary heap allocations:
        const gpa = init.gpa;
        // default Io implementation:
        const io = init.io;
        // access to environment variables:
        std.log.info("{d} env vars", .{init.environ_map.count()});
        // access to CLI arguments:
        const args = try init.minimal.args.toSlice(init.arena.allocator());
    }

Via Lobste.rs. Tags: zig

Impact 5.2 · Import 4.0 · Pop 4.8
#39
HN AI (100+) 2026-04-15

Article URL: https://heidenstedt.org/posts/2026/ai-assisted-cognition-endangers-human-development/ Comments URL: https://news.ycombinator.com/item?id=47783024 Points: 230 # Comments: 189

Impact 4.0 · Import 4.0 · Pop 6.0
#40
HN AI (100+) 2026-04-15

Libretto ( https://libretto.sh ) is a Skill+CLI that makes it easy for your coding agent to generate deterministic browser automations and debug existing ones. Key shift is going from “give an agent a prompt at runtime and hope it figures things out” to: “Use coding agents to generate real scripts you can inspect, run, and debug”. Here’s a demo: https://www.youtube.com/watch?v=0cDpIntmHAM . Docs start at https://libretto.sh/docs/get-started/introduction . We spent a year building and maintaining browser automations for EHR and payer portal integrations at our healthcare startup. Building these automations and debugging failed ones was incredibly time-consuming. There’s lots of tools that use runtime AI like Browseruse and Stagehand which we tried, but (1) they’re reliant on custom DOM parsing that's unreliable on older and complicated websites (including all of healthcare). Using a website’s internal network calls is faster and more reliable when possible. (2) They can be expensive since they rely on lots of AI calls and for workflows with complicated logic you can’t always rely on caching actions to make sure it will work. (3) They’re at runtime so it’s not interpretable what the agent is going to do. You kind of hope you prompted it correctly to do the right thing, but legacy workflows are often unintuitive and inconsistent across sites so you can’t trust an agent to just figure it out at runtime. (4) They don’t really help you generate new automations or help you debug automation failures. We wanted a way to reliably generate and maintain browser automations in messy, high-stakes environments, without relying on fragile runtime agents. Libretto is different because instead of runtime agents it uses “development-time AI”: scripts are generated ahead of time as actual

Impact 4.0 · Import 4.0 · Pop 6.0
#41
HN AI (100+) 2026-04-15

Article URL: https://claudestatus.com/ Comments URL: https://news.ycombinator.com/item?id=47779730 Points: 243 # Comments: 224

Impact 4.0 · Import 4.0 · Pop 6.0
#42
HN AI (100+) 2026-04-15

Article URL: https://www.tobyord.com/writing/hourly-costs-for-ai-agents Comments URL: https://news.ycombinator.com/item?id=47778922 Points: 304 # Comments: 129

Impact 4.0 · Import 4.0 · Pop 6.0
#46
HN AI (100+) 2026-04-15

Article URL: https://super-memory.com/articles/sleep.htm Comments URL: https://news.ycombinator.com/item?id=47776557 Points: 440 # Comments: 220

Impact 4.0 · Import 4.0 · Pop 6.0
#48

(1D) Ordered Tokens Enable Efficient Test-Time Search

Research ★ 4.8 multi-source (2)
arXiv GenMedia, HF Daily Papers 2026-04-16
by Zhitong Gao, Parham Rezaei, Ali Cy, Mingqiao Ye et al.

Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can per
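The kind of verifier-guided test-time search studied here can be sketched as beam search over prefixes, where the verifier scores partial sequences and prunes weak candidates early rather than only ranking finished generations. The generator and verifier below are toy stand-ins, not the paper's setup.

```python
import random

def prefix_search(sample_token, verifier, seq_len, beam=4, branch=3, seed=0):
    """Verifier-guided beam search over ordered token prefixes (toy).
    The premise from the text: coarse-to-fine prefixes carry semantic
    meaning, so a verifier can usefully score them mid-generation."""
    rng = random.Random(seed)
    beams = [[]]
    for _ in range(seq_len):
        # Branch each surviving prefix, keep the top `beam` by score.
        candidates = [b + [sample_token(rng)] for b in beams for _ in range(branch)]
        candidates.sort(key=verifier, reverse=True)
        beams = candidates[:beam]
    return beams[0]

# Toy stand-ins: tokens are digits, the "verifier" prefers high sums.
best = prefix_search(lambda r: r.randint(0, 9), sum, seq_len=5)
print(len(best))  # 5
```

With 2D grid tokens the analogous intermediate states (partial raster scans) are much harder for a verifier to judge, which is the paper's argument for why ordered 1D tokenizers scale better under search.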

Impact 4.0 · Import 4.0 · Pop 5.9
#49
NVIDIA AI Blog 2026-04-16

Head straight for orbit with GeForce NOW — no space helmet required. PRAGMATA, Capcom’s long-awaited sci-fi action adventure, touches down on GeForce NOW the same day it launches worldwide. The futuristic journey through a cold lunar station in the near future can be streamed instantly from the cloud to almost any device, no console or heavy hardware needed. That’s only the beginning. Five new titles join the cloud this week, expanding April’s gaming galaxy with fresh adventures and endless possibilities. Plus, the GeForce NOW Ultimate membership comes to gamers in India for the first time, with the service now available in beta and operated by NVIDIA. Time to see what’s landing on GeForce NOW. A Mission Gone Wrong: PRAGMATA is Capcom’s newest sci-fi action adventure that blends heart, high-tech and a hauntingly quiet world set in the near future. Step into the boots of Hugh Williams, an investigator navigating a lunar research station gone silent, and Diana, a young android. Armed with an arsenal of weapons and the ability to hack, every corridor and console becomes part of a cinematic experience filled with tense exploration and fast-paced action. The story unfolds amid the cold vacuum of the moon after a massive quake hits the station researching Lunafilament — a material said to be able to create anything given enough data. Awake, injured and disoriented, Hugh crosses paths with Diana, the mysterious android girl known as a Pragmata. Now, they must work together as they face the rogue station on their way back to Earth. PRAGMATA shines in stunning clarity with ray-traced lighting and NVIDIA DLSS 4 technology boosting frame rates and image quality. Stream it on launch day at full fidelity, even without the latest hardware — no need to wait on a large install or worry

Impact 5.2 · Import 4.0 · Pop 4.8
#50
AI Snake Oil 2026-04-16

This post is 8,000 words long—it is our new collaborative paper on an emerging type of AI evaluation. The paper is also published in a PDF format here. Summary: AI models have started to saturate most major benchmarks. But does that mean they can build and ship a real product, or conduct a scientific experiment end-to-end, or navigate a government bureaucracy? Researchers have started testing AI in such real-world settings. We call these evals “open-world evaluations”. This essay defines open-world evaluations, surveys the lessons learned so far, and lays out best practices for conducting them. We also introduce CRUX, a collaboration of 17 researchers from academia, government, civil society, and industry that will regularly evaluate frontier AI capabilities through open-world evaluations. In our first experiment, an AI agent built and published an iOS app to the App Store, making just two errors, one of which required manual intervention. This gives us an early indication of potentially useful capabilities and, more importantly, an early warning about the potential for AI-driven app store spam (we disclosed this result to Apple a month before publication). We hope to conduct similar experiments to surface early warnings across other real-world domains; this will be one of our main empirical projects over the coming year. The authors are: Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J.J. Allaire, Rishi Bommasani, Magda Dubois, Gillian Hadfield, Andy Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, Arvind Narayanan. How should we track and predict AI capabilities? The AI community’s dominant answer today is benchmarking. For example, METR’s time horizon

Impact 5.8 · Import 4.9 · Pop 3.5
#52
HN AI (100+) 2026-04-16

Article URL: https://tomtunguz.com/ai-compute-crisis-2026/ Comments URL: https://news.ycombinator.com/item?id=47799322 Points: 195 # Comments: 226

Impact 4.0 · Import 4.0 · Pop 6.0
#55
HN AI (100+) 2026-04-16

Recent and related: Cybersecurity looks like proof of work now - https://news.ycombinator.com/item?id=47769089 (198 comments)
Comments URL: https://news.ycombinator.com/item?id=47791236
Points: 238 · Comments: 88

Impact 4.0 · Import 4.0 · Pop 6.0
#56

SDL bans AI-written commits

Industry ★ 4.8
HN AI (100+) 2026-04-16

Article URL: https://github.com/libsdl-org/SDL/issues/15350
Comments URL: https://news.ycombinator.com/item?id=47790791
Points: 132 · Comments: 138

Impact 4.0 · Import 4.0 · Pop 6.0
#57
HN AI (100+) 2026-04-16

Article URL: https://sleepingrobots.com/dreams/stop-using-ollama/
Comments URL: https://news.ycombinator.com/item?id=47788385
Points: 641 · Comments: 208

Impact 4.0 · Import 4.0 · Pop 6.0
#58
arXiv cs.LG, arXiv RL, arXiv Evals 2026-04-17
by Shriram Chennakesavalu, Kirill Shmilovich, Hayley Weir, Colin Grambow et al.

Large Language Models (LLMs) have the potential to accelerate small-molecule drug design due to their ability to reason about information from diverse sources and formats. However, their practical utility remains unclear due to the lack of benchmarks that reflect real-world scenarios. In this work, we introduce a suite of chemically-grounded tasks spanning molecular property prediction, molecular representation transformations, and molecular design. Importantly, we formulate these tasks as reinforcement learning (RL) environments, enabling a unified approach for evaluation and post-training. Across three model families, we find that frontier models are increasingly proficient at chemical tasks, but that there is significant room for improvement, especially in experimental settings with low data. Critically, we show that RL-based post-training can substantially improve performance. A smaller model post-trained on our environments becomes competitive with state-of-the-art frontier models, despite a significantly weaker base model. This suggests a practical route toward employing LLMs in drug discovery: by combining carefully-designed evaluation tasks with targeted post-training …
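The "tasks as RL environments" framing can be sketched as a toy single-turn environment with a gym-style reset/step interface. Everything below (the class name, the inverse-distance reward, the example prompt) is a hypothetical illustration, not the paper's actual interface:

```python
class PropertyPredictionEnv:
    """Hypothetical sketch: wrap a chemistry QA task as a one-step RL
    environment. The observation is a prompt, the action is the model's
    textual answer, and the reward scores it against ground truth, so
    evaluation and RL post-training share one interface."""

    def __init__(self, examples):
        self.examples = list(examples)  # (prompt, numeric target) pairs
        self.i = -1

    def reset(self):
        # advance to the next task and return its prompt as the observation
        self.i = (self.i + 1) % len(self.examples)
        prompt, _ = self.examples[self.i]
        return prompt

    def step(self, answer):
        _, target = self.examples[self.i]
        try:
            # reward in (0, 1]: 1.0 for an exact numeric match
            reward = 1.0 / (1.0 + abs(float(answer) - target))
        except ValueError:
            reward = 0.0  # unparseable answer earns nothing
        return reward, True  # (reward, episode_done): single-turn episodes
```

A single-turn episode like this is the simplest case; multi-step molecular design would return intermediate observations from `step` instead of terminating immediately.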

Impact 4.0 · Import 4.0 · Pop 5.9
#59
arXiv cs.LG, arXiv cs.CV, arXiv Evals 2026-04-17
by Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib

Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses, and cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and their signal-to-noise ratios. In such cases, simply adjusting the model's attention does not resolve the underlying lack of information. In this paper, we propose MoIR: Multi-modal Information Router, an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR identifies less informative tokens and routes complementary information from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, MoIR enables reliable shifts in modality dominance …
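The routing idea can be sketched in miniature. This is a toy stand-in, not the paper's method: we use token L2 norm as a crude informativeness proxy and a 50/50 blend as the routing step, both of which are our assumptions:

```python
def route_information(weak_tokens, strong_tokens, threshold):
    """Toy sketch of information-level routing: wherever a token from the
    weaker modality carries little signal (small L2 norm, a stand-in for
    the paper's informativeness measure), blend in the aligned token from
    the stronger modality; informative tokens pass through unchanged."""
    def norm(v):
        return sum(x * x for x in v) ** 0.5

    fused = []
    for w, s in zip(weak_tokens, strong_tokens):
        if norm(w) < threshold:
            # route: enrich the uninformative token (50/50 blend is arbitrary)
            fused.append([(wi + si) / 2.0 for wi, si in zip(w, s)])
        else:
            fused.append(list(w))
    return fused
```

The key contrast with attention steering is visible even here: the token representations themselves change before fusion, rather than only the weights placed on them.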

Impact 4.0 · Import 4.0 · Pop 5.9
#60
arXiv cs.LG, arXiv cs.AI, arXiv PostTraining 2026-04-17
by André Saimon S. Sousa, Otto Pires, Frank Acasiete, Oscar M. Granados et al.

Data plays a fundamental role in consolidating markets, services, and products in the digital financial ecosystem. However, the use of real data, especially in the financial context, can lead to privacy risks and access restrictions, affecting institutions, research, and modeling processes. Although not all financial datasets present such limitations, this work proposes the use of deep learning techniques for generating synthetic data applied to cryptocurrency price time series. The approach is based on Conditional Generative Adversarial Networks (CGANs), combining an LSTM-type recurrent generator and an MLP discriminator to produce statistically consistent synthetic data. The experiments consider different crypto-assets and demonstrate that the model is capable of reproducing relevant temporal patterns, preserving market trends and dynamics. The generation of synthetic series through GANs is an efficient alternative for simulating financial data, showing potential for applications such as market behavior analysis and anomaly detection, with lower computational cost compared to more complex generative approaches.

Impact 4.0 · Import 4.0 · Pop 5.9
#61
arXiv cs.LG, arXiv cs.CL, arXiv cs.AI 2026-04-17
by Alexandra Dragomir, Ioana Pintilie, Antonio Barbalau, Marius Dragoi et al.

Adapter-based methods have become a cost-effective approach to continual learning (CL) for Large Language Models (LLMs), by sequentially learning a low-rank update matrix for each task. To mitigate catastrophic forgetting, state-of-the-art approaches impose constraints on new adapters with respect to the previous ones, by targeting either subspace or coordinate-wise interference. In this paper, we propose JumpLoRA, a novel framework to adaptively induce sparsity in the Low-Rank Adaptation (LoRA) blocks through the use of JumpReLU gating. The method achieves dynamic parameter isolation, which helps prevent task interference. We demonstrate that our method is highly modular and compatible with LoRA-based CL approaches. Specifically, it significantly boosts the performance of IncLoRA and outperforms the leading state-of-the-art CL method, ELLA.
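A JumpReLU gate on a LoRA path can be sketched in a few lines. Where exactly JumpLoRA places the gate is not specified in the summary above, so gating the rank-r activations (and the threshold/scale values) is our assumption:

```python
def jump_relu(x, theta):
    """JumpReLU: pass each activation through only where it exceeds the
    threshold theta; everything else is hard-zeroed, inducing sparsity."""
    return [xi if xi > theta else 0.0 for xi in x]

def lora_forward(x, W, A, B, theta, scale=1.0):
    """Sketch: y = x W^T + scale * B @ JumpReLU(A @ x), i.e. a frozen base
    layer plus a low-rank update whose rank components only fire when
    strongly activated. Shapes: x: d_in, W: d_out x d_in, A: r x d_in,
    B: d_out x r. Plain lists stand in for framework tensors."""
    def matvec(M, v):
        return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

    base = matvec(W, x)                 # frozen pre-trained path
    h = jump_relu(matvec(A, x), theta)  # gated rank-r activations
    delta = matvec(B, h)                # sparse low-rank update
    return [b + scale * d for b, d in zip(base, delta)]
```

With the gate closed (large theta) the adapter contributes nothing, which is how dynamic parameter isolation can suppress interference between tasks.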

Impact 4.0 · Import 4.0 · Pop 5.9
#62
arXiv cs.LG, arXiv cs.AI, arXiv cs.CV 2026-04-17
by Enas E. Ahmed, Salah A. Aly, Mayar Moner

Acute Myeloid Leukemia (AML) is one of the most life-threatening types of blood cancer, and its accurate classification remains a challenging task due to the visual similarity between various cell types. This study addresses multiclass classification of AML cells using the YOLOv12 deep learning model. We applied two segmentation approaches based on cell and nucleus features, using Hue-channel and Otsu-thresholding techniques to preprocess the images prior to classification. Our experiments demonstrate that YOLOv12 with Otsu thresholding on cell-based segmentation achieved the highest validation and test accuracy, both reaching 99.3%.
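The Otsu step in that preprocessing pipeline is a standard technique and easy to sketch. This is the generic textbook algorithm for 8-bit grayscale values, not the authors' code: it picks the gray level that maximizes between-class variance.

```python
def otsu_threshold(pixels):
    """Otsu's method: return the 8-bit gray level t that maximizes the
    between-class variance w0*w1*(mu0-mu1)^2 between pixels <= t and > t."""
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))

    w0, sum0 = 0, 0            # pixel count and intensity mass of class 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue           # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break              # class 1 empty: no split beyond here
        sum0 += t * hist[t]
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

On a bimodal cell image (dark nuclei, bright background) the returned threshold lands between the two intensity modes, separating foreground from background with no tuned parameters.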

Impact 4.0 · Import 4.0 · Pop 5.9
#63
arXiv cs.LG, arXiv Robotics-Embodied, arXiv GenMedia 2026-04-17
by Guransh Singh

Adapting pre-trained vision-language models (VLMs) for robotic control requires injecting high-magnitude continuous gradients from a flow-matching action expert into a backbone trained exclusively with cross-entropy. This cross-modal gradient asymmetry - the spectral dimensionality mismatch between low-rank MSE regression gradients and the high-dimensional semantic manifold sculpted by CE pre-training - causes rapid, severe erosion of the VLM's visual-question-answering (VQA) capability. Industry-standard defences either sever the gradient pathway entirely via stop-gradient, discarding the rich continuous supervision, or restrict parameter capacity through low-rank adapters (LoRA) that constrain the rank of updates but not their direction, and thus still overwrite the pre-trained manifold. We introduce AEGIS (Anchor-Enforced Gradient Isolation System): a buffer-free, layer-wise orthogonal gradient projection framework that enables direct continuous MSE learning while preserving the pre-trained VQA manifold - without any co-training data or replay buffer. AEGIS pre-computes a static Gaussian reference anchor from masked VQA forward passes across all transformer layers, then …
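The core operation named here, orthogonal gradient projection, is generic and can be sketched directly. How AEGIS builds its Gaussian anchor is not reproduced; the basis below is a hypothetical stand-in for the protected subspace:

```python
def project_out(grad, basis):
    """Sketch of orthogonal gradient projection: subtract from the gradient
    its components along an orthonormal anchor basis, g' = g - U^T (U g),
    so training cannot move the weights within the protected subspace.
    Plain Python lists stand in for the framework's tensors."""
    out = list(grad)
    for u in basis:  # assumes the rows of `basis` are orthonormal
        dot = sum(gi * ui for gi, ui in zip(out, u))
        out = [gi - dot * ui for gi, ui in zip(out, u)]
    return out
```

After projection the gradient is exactly orthogonal to every anchor direction, which is the mechanism that lets continuous MSE supervision flow while the pre-trained directions stay untouched.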

Impact 4.0 · Import 4.0 · Pop 5.9
#64
arXiv cs.LG, arXiv cs.CL, arXiv MechInterp 2026-04-17
by Fabian Ridder, Laurin Lessel, Malte Schilling

Retrieval-Augmented Generation (RAG) is widely used to augment the input to Large Language Models (LLMs) with external information, such as recent or domain-specific knowledge. Nonetheless, current models still produce closed-domain hallucinations and generate content that is unsupported by the retrieved context. Current detection approaches typically treat hallucination as a post-hoc problem, relying on black-box consistency checks or probes over frozen internal representations. In this work, we demonstrate that hallucination detection based on internal state representation can also serve as a direct training signal. We introduce RAGognize, a dataset of naturally occurring closed-domain hallucinations with token-level annotations, and RAGognizer, a hallucination-aware fine-tuning approach that integrates a lightweight detection head into an LLM, allowing for the joint optimization of language modeling and hallucination detection. This joint objective forces the model to improve the separability of its internal states regarding hallucinations while simultaneously learning to generate well-formed and meaningful responses. Across multiple benchmarks, RAGognizer achieves state-of-the-art …

Impact 4.0 · Import 4.0 · Pop 5.9
#65
arXiv cs.LG, arXiv cs.CL, arXiv cs.AI 2026-04-17
by Siun Kim, Hyung-Jin Yoon

Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inconsistencies observed in early-stage human annotation processes that resolve disagreements through pilot annotation. Motivated by this analogy, we introduce DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework that simulates the pilot annotation process, employing LLMs to act as both annotators and supervisors. Multiple heterogeneous LLMs annotate shared texts, and a supervisor model analyzes inter-model disagreements to refine task instructions. Across 18 benchmarks, DiZiNER achieves zero-shot SOTA results on 14 datasets, improving prior bests by +8.0 F1 and reducing the zero-shot to supervised gap by over +11 points. It also consistently outperforms its supervisor, GPT-5 mini, indicating that improvements stem from disagreement-guided instruction refinement.

Impact 4.0 · Import 4.0 · Pop 5.9
#66
arXiv cs.CL, arXiv cs.CV, arXiv Evals 2026-04-17
by Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song et al.

Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats, guaranteeing identical task-relevant information, verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text-only inputs, whereas incorporating visual data (image+text) frequently degrades performance compared to the text-only baseline. These findings indicate that current VLMs conduct reasoning primarily in the text modality …

Impact 4.0 · Import 4.0 · Pop 5.9
#67

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

Agents & Tools ★ 4.8 multi-source (3)
arXiv cs.CL, arXiv cs.AI, arXiv Agents 2026-04-17
by Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing et al.

Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool-use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning …

Impact 4.0 · Import 4.0 · Pop 5.9
#68
arXiv cs.CL, arXiv cs.AI, arXiv Agents 2026-04-17
by Haoyu Bian, Chaoning Zhang, Jiaquan Zhang, Xingyao Li et al.

LLM-driven multi-agent frameworks address complex reasoning tasks through multi-role collaboration. However, existing approaches often suffer from reasoning instability, where individual agent errors are amplified through collaboration, undermining overall performance. Current research mainly focuses on enhancing high-capability agents or suppressing unreliable outputs to improve framework effectiveness, while systematic identification and reinforcement of performance-limiting agents receive less attention. To address this gap, we propose WORC, a weak-link optimization framework for multi-agent reasoning and collaboration, grounded in the weak-link principle. WORC follows a two-stage workflow. In the weak agent localization stage, task features are constructed, and a meta-learning-based weight predictor trained on optimal configurations identified by swarm intelligence algorithms (SIAs) enables zero-shot mapping from these features to agent performance weights, where the agent with the lowest predicted weight is identified as the weak agent. In the weak-link optimization stage, an uncertainty-driven allocation strategy assigns additional …

Impact 4.0 · Import 4.0 · Pop 5.9
#69
arXiv cs.CL, arXiv cs.AI, arXiv Agents 2026-04-17
by Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu et al.

As LLM agents scale to long-horizon, multi-session deployments, efficiently managing accumulated experience becomes a critical bottleneck. Agent memory systems and agent skill discovery both address this challenge - extracting reusable knowledge from interaction traces - yet a citation analysis of 1,136 references across 22 primary papers reveals a cross-community citation rate below 1%. We propose the Experience Compression Spectrum, a unifying framework that positions memory, skills, and rules as points along a single axis of increasing compression (5-20× for episodic memory, 50-500× for procedural skills, 1,000×+ for declarative rules), directly reducing context consumption, retrieval latency, and compute overhead. Mapping 20+ systems onto this spectrum reveals that every system operates at a fixed, predetermined compression level - none supports adaptive cross-level compression, a gap we term the "missing diagonal". We further show that specialization alone is insufficient - both communities independently solve shared sub-problems without exchanging solutions - that evaluation methods are tightly coupled to compression levels, that …
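The spectrum's bands can be turned into a trivial classifier using the ratios quoted above. The boundaries between the quoted ranges are our interpolation, not the paper's:

```python
def compression_level(raw_tokens, stored_tokens):
    """Place a system on the Experience Compression Spectrum by its ratio of
    raw interaction tokens to stored tokens, using the bands quoted in the
    abstract (5-20x episodic memory, 50-500x procedural skills, 1,000x+
    declarative rules). Gaps between bands are bridged by our own cutoffs."""
    ratio = raw_tokens / stored_tokens
    if ratio >= 1000:
        return "declarative rules"
    if ratio >= 50:
        return "procedural skills"
    if ratio >= 5:
        return "episodic memory"
    return "raw trace"  # essentially uncompressed experience
```

The "missing diagonal" claim is then that no surveyed system moves between these return values adaptively; each is pinned to one band by design.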

Impact 4.0 · Import 4.0 · Pop 5.9
#70
arXiv cs.AI, arXiv Agents, arXiv MechInterp 2026-04-17
by Vitor F. Grizzi, Thang Duc Pham, Luke N. Pretzie, Jiayi Xu et al.

Computational X-ray absorption near-edge structure (XANES) is widely used to probe local coordination environments, oxidation states, and electronic structure in chemically complex systems. However, the use of computational XANES at scale is constrained more by workflow complexity than by the underlying simulation method itself. To address this challenge, we present ChemGraph-XANES, an agentic framework for automated XANES simulation and analysis that unifies natural-language task specification, structure acquisition, FDMNES input generation, task-parallel execution, spectral normalization, and provenance-aware data curation. Built on ASE, FDMNES, Parsl, and a LangGraph/LangChain-based tool interface, the framework exposes XANES workflow operations as typed Python tools that can be orchestrated by large language model (LLM) agents. In multi-agent mode, a retrieval-augmented expert agent consults the FDMNES manual to ground parameter selection, while executor agents translate user requests into structured tool calls. We demonstrate documentation-grounded parameter retrieval and show that the same workflow supports both explicit structure-file inputs and chemistry-level natural-language …

Impact 4.0 · Import 4.0 · Pop 5.9
#71
arXiv cs.AI, arXiv cs.CV, arXiv Agents 2026-04-17
by Yi Lin, Yihao Ding, Yonghui Wu, Yifan Peng

Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.

Impact 4.0 · Import 4.0 · Pop 5.9
#72
arXiv MechInterp 2026-04-17
by Chaoshuo Zhang, Yibo Liang, Mengke Tian, Chenhao Lin et al.

Despite the remarkable synthesis capabilities of text-to-image (T2I) models, safeguarding them against content violations remains a persistent challenge. Existing safety alignments primarily focus on explicit malicious concepts, often overlooking the subtle yet critical risks of compositional semantics. To address this oversight, we identify and formalize a novel vulnerability: Multi-Concept Compositional Unsafety (MCCU), where unsafe semantics stem from the implicit associations of individually benign concepts. Based on this formulation, we introduce TwoHamsters, a comprehensive benchmark comprising 17.5k prompts curated to probe MCCU vulnerabilities. Through a rigorous evaluation of 10 state-of-the-art models and 16 defense mechanisms, our analysis yields 8 pivotal insights. In particular, we demonstrate that current T2I models and defense mechanisms face severe MCCU risks: on TwoHamsters, FLUX achieves an MCCU generation success rate of 99.52%, while LLaVA-Guard only attains a recall of 41.06%, highlighting a critical limitation of the current paradigm for managing hazardous compositional generation.

Impact 4.0 · Import 6.2 · Pop 3.5
#73

Elucidating the SNR-t Bias of Diffusion Probabilistic Models

Generative Media ★ 4.8 multi-source (2)
arXiv GenMedia, HF Daily Papers 2026-04-17
by Meng Yu, Lei Sun, Jianhao Zeng, Xiangxiang Chu et al.

Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we have observed that these models often suffer from a Signal-to-Noise Ratio-timestep (SNR-t) bias. This bias refers to the misalignment between the SNR of the denoising sample and its corresponding timestep during the inference phase. Specifically, during training, the SNR of a sample is strictly coupled with its timestep. However, this correspondence is disrupted during inference, leading to error accumulation and impairing the generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon and propose a simple yet effective differential correction method to mitigate the SNR-t bias. Recognizing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during the reverse denoising process, we decompose samples into various frequency components and apply differential correction to each component individually. Extensive experiments show that our approach significantly improves the generation quality of various diffusion models (IDDPM, ADM, DDIM, A-DPM, EA…)
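The training-time SNR-timestep coupling the abstract refers to is easy to make concrete. The sketch below computes SNR(t) = ᾱ_t / (1 − ᾱ_t) under a generic DDPM-style linear beta schedule; the schedule choice is ours for illustration, and the paper's correction method is not reproduced:

```python
def snr_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for a linear beta schedule,
    where alpha_bar_t is the cumulative product of (1 - beta). At training
    time each timestep t is tied to exactly this SNR; the SNR-t bias
    described above is inference-time drift away from this curve."""
    snrs = []
    alpha_bar = 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        alpha_bar *= 1.0 - beta      # cumulative signal retention
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs
```

The curve is strictly decreasing in t, so any sample whose actual SNR no longer matches `snrs[t]` during sampling is, in this paper's terms, biased.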

Impact 4.0 · Import 4.0 · Pop 5.9
#75
HN AI (100+) 2026-04-18

Article URL: https://sentinelcolorado.com/uncategorized/a-college-instructor-turns-to-typewriters-to-curb-ai-written-work-and-teach-life-lessons/
Comments URL: https://news.ycombinator.com/item?id=47818485
Points: 477 · Comments: 422

Impact 4.0 · Import 4.0 · Pop 6.0
#76
HN AI (100+) 2026-04-18

Article URL: https://spectrum.ieee.org/state-of-ai-index-2026
Comments URL: https://news.ycombinator.com/item?id=47817581
Points: 111 · Comments: 61

Impact 4.0 · Import 4.0 · Pop 6.0
#77

Qwen3.5-Omni technical report (HF Daily Papers top paper)

Multimodal ★ 4.8 multi-source (2)
HF Daily Papers, arXiv cs.CL 2026-04-18

Streaming audio+video input and low-latency voice output on top of Qwen3 base. Technical report highlights a talker-thinker decoupling in the decoder. Featured on HF Daily Papers top board (19 upvotes); raised in community discussions about VRAM sizing and gains outside Alibaba's eval harness.

How it was discussed
  • HF Daily Papers: Trending in Daily Papers top board; primary interest is in voice latency and multimodal streaming.
  • Community: Discussion centers on whether OmniBench gains replicate outside Alibaba's evaluation harness, and consumer-VRAM sizing trade-offs.
Impact 4.0 · Import 4.0 · Pop 5.9
#79
arXiv SSM, arXiv GenMedia 2026-04-17
by Duy-Phuong Dao, Muhammad Taqiyuddin, Jahae Kim, Sang-Heon Lee et al.

Latent diffusion models have emerged as powerful generative models in medical imaging, enabling the synthesis of high-quality brain magnetic resonance imaging (MRI) scans. In particular, predicting the evolution of a patient's brain can aid in early intervention, prognosis, and treatment planning. In this study, we introduce CLIMB (Controllable Longitudinal brain Image generation via state-space-based latent diffusion model), an advanced framework for modeling temporal changes in brain structure. CLIMB is designed to model the structural evolution of the brain over time, utilizing a baseline MRI scan and its acquisition age as foundational inputs. Additionally, multiple conditional variables, including projected age, gender, disease status, genetic information, and brain structure volumes, are incorporated to enhance the temporal modeling of anatomical changes. Unlike existing LDM methods that rely on self-attention modules, which effectively capture contextual information from input images but are computationally expensive, our approach leverages a state-space model architecture that substantially reduces computational overhead while preserving high-quality image synthesis …

Impact 4.7 · Import 4.0 · Pop 4.7
#80
arXiv Robotics-Embodied 2026-04-14
by Zixing Chen, Yifeng Gao, Li Wang, Yunhan Zhao et al.

Vision-Language-Action (VLA) models inherit rich world knowledge from vision-language backbones and acquire executable skills via action demonstrations. However, existing evaluations largely focus on action execution success, leaving action policies loosely coupled with visual-linguistic semantics. This decoupling exposes a systematic vulnerability whereby correct action execution may induce unsafe outcomes under semantic risk. To expose this vulnerability, we introduce HazardArena, a benchmark designed to evaluate semantic safety in VLAs under controlled yet risk-bearing contexts. HazardArena is constructed from safe/unsafe twin scenarios that share matched objects, layouts, and action requirements, differing only in the semantic context that determines whether an action is unsafe. We find that VLA models trained exclusively on safe scenarios often fail to behave safely when evaluated in their corresponding unsafe counterparts. HazardArena includes over 2,000 assets and 40 risk-sensitive tasks spanning 7 real-world risk categories grounded in established robotic safety standards. To mitigate this vulnerability, we propose a training-free Safety Option Layer that constrains action …

Impact 4.0 · Import 5.6 · Pop 3.5
#81
arXiv cs.LG 2026-04-17
by Paulin de Schoulepnikoff, Hendrik Poulsen Nautrup, Hans J. Briegel, Gorka Muñoz-Gil

Interpretable machine learning techniques are becoming essential tools for extracting physical insights from complex quantum data. We build on recent advances in variational autoencoders to demonstrate that such models can learn physically meaningful and interpretable representations from a broad class of unlabeled quantum datasets. From raw measurement data alone, the learned representation reveals rich information about the underlying structure of quantum phase spaces. We further augment the learning pipeline with symbolic methods, enabling the discovery of compact analytical descriptors that serve as order parameters for the distinct regimes emerging in the learned representations. We demonstrate the framework on experimental Rydberg-atom snapshots, classical shadows of the cluster Ising model, and hybrid discrete-continuous fermionic data, revealing previously unreported phenomena such as a corner-ordering pattern in the Rydberg arrays. These results establish a general framework for the automated and interpretable discovery of physical laws from diverse quantum datasets. All methods are available through qdisc, an open-source Python library designed to make these tools accessible.

Impact 4.0 · Import 5.6 · Pop 3.5
#82
arXiv cs.CL 2026-04-17
by Ayoub Hammal, Pierre Zweigenbaum, Caio Corro

Recent works proposed test-time alignment methods that rely on a small aligned model as a proxy that guides the generation of a larger base (unaligned) model. The implicit reward approach skews the large model distribution, whereas the nudging approach defers the generation of the next token to the small aligned model when the large base one is unconfident about its outcome. In this work, we first show that both approaches can be reduced to sampling from similar graphical models, where they differ only in the definition of a rejection criterion (or distribution). Moreover, we argue that the confidence criterion is ill-motivated due to linguistic phenomena like ambiguous phrasing. We propose a novel rejection criterion based on a conservative confidence bet. Experimentally, our novel approach outperforms previous work on several datasets.
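The nudging scheme the abstract describes can be sketched as a per-token deferral rule. The fixed threshold `tau` below is the baseline confidence criterion the paper argues against; its replacement, the conservative confidence bet, is not reproduced here:

```python
def nudged_choice(p_large, p_small, tau=0.5):
    """Nudging sketch: take the next token from the large base model unless
    its top probability falls below tau, in which case defer to the small
    aligned model. Inputs are token -> probability dicts; tau=0.5 is an
    arbitrary illustrative threshold."""
    token, conf = max(p_large.items(), key=lambda kv: kv[1])
    if conf >= tau:
        return token, "large"          # base model is confident enough
    token, _ = max(p_small.items(), key=lambda kv: kv[1])
    return token, "small"              # defer to the aligned proxy
```

The paper's critique is visible in this sketch: a legitimately ambiguous phrasing also produces a flat `p_large`, so low confidence alone is a noisy rejection signal.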

Impact 4.0 · Import 5.6 · Pop 3.5
#83

MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

Agents & Tools ★ 4.6 multi-source (2)
arXiv cs.CL, arXiv Agents 2026-04-17
by Weiwei Xie, Shaoxiong Guo, Fan Zhang, Tian Xia et al.

Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt …

Impact 4.0 · Import 4.6 · Pop 4.7
#84
arXiv cs.CL 2026-04-17
by Sidney Wong

This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on user-informed lexical, morphosyntactic, and semantic variables. The findings show that users generally associate language with place, and place-related communities form a contiguous speech community, though alignment between geographic dialect communities and place-related communities remains complex. Advanced language modelling, including static and diachronic Word2Vec language embeddings, revealed semantic variation across place-based communities and meaningful semantic shifts within New Zealand English. The research involved the creation of a corpus containing 4.26 billion unprocessed words, which offers a valuable resource for future study. Overall, the results highlight the potential of social media as a natural laboratory for sociolinguistic inquiry.

Impact 4.0 · Import 5.6 · Pop 3.5
#85
arXiv cs.AI, arXiv MechInterp 2026-04-17
by Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti, Vineeth N Balasubramanian et al.

Multimodal large language models (MLLMs) have achieved impressive progress on vision language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top performing MLLMs remain below 50%. Error analysis reveals failures in: (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities, when compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.

Impact 4.0 · Import 4.6 · Pop 4.7
#86
arXiv cs.AI, arXiv MechInterp 2026-04-17
by Qiang Xu, Shengyuan Bai, Yu Wang, He Cao et al.

Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams. These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical task dimensions. Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception. These findings expose a fundamental deficit in …

Impact 4.0 · Import 4.6 · Pop 4.7
#87
arXiv cs.AI, arXiv Efficiency 2026-04-17
by Lifan Jiang, Tianrun Wu, Yuhang Pei, Chenyang Wang et al.

The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying multimodal large language models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments sh

Impact 4.0 · Import 4.6 · Pop 4.7
#88
arXiv cs.RO, arXiv Robotics-Embodied 2026-04-17
by Jasper Lu, Zhenhao Shen, Yuanfei Wang, Shugao Liu et al.

Learning robust robot policies in real-world environments requires diverse data augmentation, yet scaling real-world data collection is costly due to the need for acquiring physical assets and reconfiguring environments. Therefore, reconstructing real-world scenes in simulation has become a practical approach for efficient learning and evaluation. We present a generative framework that establishes a real-to-sim mapping from real-world panoramas to high-fidelity simulation scenes, and further synthesizes diverse cousin scenes via semantic and geometric editing. Combined with high-quality physics engines and realistic assets, the generated scenes support interactive manipulation tasks. Additionally, we incorporate multi-room stitching to construct consistent large-scale environments for long-horizon navigation across complex layouts. Experiments demonstrate a strong sim-to-real correlation validating our platform's fidelity, and show that extensively scaling up data generation leads to significantly better generalization to unseen scene and object variations, demonstrating the effectiveness of Digital Cousins for generalizable robot learning and evaluation.

Impact 4.0 · Import 4.6 · Pop 4.7
#90
OpenAI Research 2026-04-14

OpenAI expands its Trusted Access for Cyber program, introducing GPT-5.4-Cyber to vetted defenders and strengthening safeguards as AI cybersecurity capabilities advance.

Impact 4.0 · Import 4.0 · Pop 4.8
#91
Interconnects (Nathan Lambert) 2026-04-14

This post is a roundup of my recent efforts that did not warrant a standalone Interconnects post, why I’m spending time on them, and what they accomplished: (1) The ATOM Report: Measuring the Open Language Model Ecosystem; (2) the RLHF Book is done & ready for pre-order; (3) a post-training course I’m making; (4) recent technical research. 1. The ATOM Report: Measuring the Open Language Model Ecosystem https://arxiv.org/abs/2604.07190 To accompany The ATOM Project memo, arguably a manifesto, making the case for investment in open models in the U.S. – originally launched in August 2025 – we’ve released an updated technical report with our latest data, analysis, and storytelling within the open language model ecosystem. The ATOM Report is dense with the methods Florian and I use to keep track of the open ecosystem. It covers GPT-OSS’s rise, inference market share, the influence of China’s mid-tier players like Moonshot, Z.ai, & MiniMax, signs of the U.S.’s progress on open models, and much more. In particular, the paper details our updates to the Relative Adoption Metric (RAM), which we use to evaluate the adoption of recent models in a time-varying and size-normalized manner. Here’s a sampling of recent, primarily Chinese, models on the RAM score. The RAM score is designed so that a score >1 indicates a model is, at that point in time, on track to be a top 10 most downloaded model of its size category, ever. It reduces a messy landscape to one easily interpretable number! We used the data to also analyze the recent Gemma 4 release, which is showing incredible early adoption numbers. We’ll stay tuned on it! Subscribe to the (infrequent) ATOM Project Substack for more updates like this! 2. RLHF Book is done & rea

Impact 4.0 · Import 4.0 · Pop 4.8
#92
arXiv RL 2026-04-15
by Qi Zhang, Dawei Wang, Shaofeng Zou

Reinforcement learning (RL) has emerged as a powerful tool for aligning diffusion models with human preferences, typically by optimizing a single reward function under a KL regularization constraint. In practice, however, human preferences are inherently pluralistic, and aligned models must balance multiple downstream objectives, such as aesthetic quality and text-image consistency. Existing multi-objective approaches either rely on costly multi-objective RL fine-tuning or on fusing separately aligned models at denoising time, but they generally require access to reward values (or their gradients) and/or introduce approximation error in the resulting denoising objectives. In this paper, we revisit the problem of RL fine-tuning for diffusion models and address the intractability of identifying the optimal policy by introducing a step-level RL formulation. Building on this, we further propose Multi-objective Step-level Denoising-time Diffusion Alignment (MSDDA), a retraining-free framework for aligning diffusion models with multiple objectives, obtaining the optimal reverse denoising distribution in closed form, with mean and variance expressed directly in terms of single-objective b

Impact 4.0 · Import 5.6 · Pop 3.5
#93

The next evolution of the Agents SDK

Agents & Tools ★ 4.5
OpenAI Research 2026-04-15

OpenAI updates the Agents SDK with native sandbox execution and a model-native harness, helping developers build secure, long-running agents across files and tools.

Impact 4.0 · Import 4.0 · Pop 4.8
#95

My bets on open models, mid-2026

Frontier LLMs ★ 4.5
Interconnects (Nathan Lambert) 2026-04-15

We’re living through the period of time when we’ll learn if open models can keep up with closed labs. The obvious answer is that no, they won’t. This answer is a form of saying they won’t keep up in every area. This framing closes off a popular prediction where the open models completely catch up, as in all models saturate and open and closed models only become increasingly similar. In living through this, it’s evidently very unclear when the longer-term stable balance of capabilities will solidify. This is a very complex dynamic, where the core point we monitor is a capability gap between models. At the same time, this gap is intertwined with evolving dynamics in the funding of open models, who builds open models, how techniques like distillation that enable fast-following translate through new application domains, potential regulation hampering the open-source AI ecosystem, and of course who actually uses open models. The capabilities gap is one signal in a complex sea of forces, pushing supply and demand into different shapes. In many cases the demand — where obviously tons of individuals, organizations, and sovereigns want, or need, open models — is largely separated from supply. Supply is fully dictated by economics. The question of “which business strategies support releasing open models” is still at stake. With this complexity, I wanted to distill my key beliefs down into a clear list. These are downstream of 10+ pieces I’ve written or recorded on open models this spring (which are linked throughout). It’s surprising that the top closed models did not show a growing capab

Impact 4.0 · Import 4.0 · Pop 4.8
#96
War on the Rocks 2026-04-15

On April 13, the U.S. military began a “blockade of all maritime traffic entering and exiting Iranian ports.” The move came after U.S. President Donald Trump announced on April 12 that the United States would begin a blockade after American and Iranian negotiators were unable to reach an agreement during a meeting in Islamabad. The announcement of a blockade immediately raised questions about how a blockade would work, how it fits into a broader U.S. strategy, its impacts on the global economy, and more. We asked five experts to assess the practicalities of enforcing a blockade, its legality, Iran’s likely response, The post Bonus In Brief: Choke Point: The Risks and Realities of America’s Iran Blockade appeared first on War on the Rocks.

Impact 4.0 · Import 5.8 · Pop 3.5
#97
arXiv stat.ML 2026-04-16
by Víctor Soto-Larrosa, Nuria Torrado, Edmundo J. Huertas

We study post-training interpretability for Support Vector Machines (SVMs) built from truncated orthogonal polynomial kernels. Since the associated reproducing kernel Hilbert space is finite-dimensional and admits an explicit tensor-product orthonormal basis, the fitted decision function can be expanded exactly in intrinsic RKHS coordinates. This leads to Orthogonal Representation Contribution Analysis (ORCA), a diagnostic framework based on normalized Orthogonal Kernel Contribution (OKC) indices. These indices quantify how the squared RKHS norm of the classifier is distributed across interaction orders, total polynomial degrees, marginal coordinate effects, and pairwise contributions. The methodology is fully post-training and requires neither surrogate models nor retraining. We illustrate its diagnostic value on a synthetic double-spiral problem and on a real five-dimensional echocardiogram dataset. The results show that the proposed indices reveal structural aspects of model complexity that are not captured by predictive accuracy alone.

Impact 4.0 · Import 5.6 · Pop 3.5
#98
arXiv stat.ML 2026-04-16
by Shahar Cohen, David M. Steinberg, Yael Radzyner, Yochai Ben Horin

We study a classification problem with three key challenges: pervasive informative missingness, the integration of partial prior expert knowledge into the learning process, and the need for interpretable decision rules. We propose a framework that encodes prior knowledge through an expert-guided class-conditional model for one or more classes, and use this model to construct a small set of interpretable goodness-of-fit features. The features quantify how well the observed data agree with the expert model, isolating the contributions of different aspects of the data, including both observed and missing components. These features are combined with a few transparent auxiliary summaries in a simple discriminative classifier, resulting in a decision rule that is easy to inspect and justify. We develop and apply the framework in the context of seismic monitoring used to assess compliance with the Comprehensive Nuclear-Test-Ban Treaty. We show that the method has strong potential as a transparent screening tool, reducing workload for expert analysts. A simulation designed to isolate the contribution of the proposed framework shows that this interpretable expert-guided method can even outp

Impact 4.0 · Import 5.6 · Pop 3.5
#99
arXiv RL 2026-04-16
by Steven A. Senczyszyn, Timothy C. Havens, Nathaniel Rice, Jason E. Summers et al.

As reinforcement learning (RL) deployments expand into safety-critical domains, existing evaluation methods fail to systematically identify hazards arising from the black-box nature of neural network enabled policies and distributional shift between training and deployment. This paper introduces Reinforcement Learning System-Theoretic Process Analysis (RL-STPA), a framework that adapts conventional STPA's systematic hazard analysis to address RL's unique challenges through three key contributions: hierarchical subtask decomposition using both temporal phase analysis and domain expertise to capture emergent behaviors, coverage-guided perturbation testing that explores the sensitivity of state-action spaces, and iterative checkpoints that feed identified hazards back into training through reward shaping and curriculum design. We demonstrate RL-STPA in the safety-critical test case of autonomous drone navigation and landing, revealing potential loss scenarios that can be missed by standard RL evaluations. The proposed framework provides practitioners with a toolkit for systematic hazard analysis, quantitative metrics for safety coverage assessment, and actionable guidelines for establ

Impact 4.0 · Import 5.6 · Pop 3.5
#100

Codex for (almost) everything

Frontier LLMs ★ 4.5
OpenAI Research 2026-04-16

The updated Codex app for macOS and Windows adds computer use, in-app browsing, image generation, memory, and plugins to accelerate developer workflows.

Impact 4.0 · Import 4.0 · Pop 4.8
#101
OpenAI Research 2026-04-16

OpenAI introduces GPT-Rosalind, a frontier reasoning model built to accelerate drug discovery, genomics analysis, protein reasoning, and scientific research workflows.

Impact 4.0 · Import 4.0 · Pop 4.8
#102
OpenAI Research 2026-04-16

Leading security firms and enterprises join OpenAI’s Trusted Access for Cyber, using GPT-5.4-Cyber and $10M in API grants to strengthen global cyber defense.

Impact 4.0 · Import 4.0 · Pop 4.8
#103
arXiv cs.AI 2026-04-17
by Thomas Bayer, Alexander Lohr, Sarah Weiß, Bernd Michelberger et al.

Explaining Machine Learning (ML) results in a transparent and user-friendly manner remains a challenging task of Explainable Artificial Intelligence (XAI). In this paper, we present a method to enhance the interpretability of ML models by using a Knowledge Graph (KG). We store domain-specific data along with ML results and their corresponding explanations, establishing a structured connection between domain knowledge and ML insights. To make these insights accessible to users, we designed a selective retrieval method in which relevant triplets are extracted from the KG and processed by a Large Language Model (LLM) to generate user-friendly explanations of ML results. We evaluated our method in a manufacturing environment using the XAI Question Bank. Beyond standard questions, we introduce more complex, tailored questions that highlight the strengths of our approach. We evaluated 33 questions, analyzing responses using quantitative metrics such as accuracy and consistency, as well as qualitative ones such as clarity and usefulness. Our contribution is both theoretical and practical: from a theoretical perspective, we present a novel approach for effectively enabling LLMs to dynamica

Impact 4.0 · Import 5.6 · Pop 3.5
#104
arXiv PostTraining 2026-04-17
by Pufan Zeng, Yilun Liu, Mingchen Dai, Mengyao Piao et al.

Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating cultural specificity as an abstract concept rather than a measurable signal. In this paper, we address this "quantification gap" by proposing C-Mining, an unsupervised framework that transforms the discovery of cultural seeds from a subjective selection process into a computable data mining formulation. Our approach exploits a novel geometric insight, leveraging the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces as a quantifiable discovery signal. By systematically identifying these regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without reliance on human or LLM supervision, reducing preparation costs by more than 150-fold. We further

Impact 4.0 · Import 5.6 · Pop 3.5
#106
FedScoop 2026-04-17

The Energy Department might be all in on AI, but the increasing apprehension among the American public poses a challenge, according to Secretary Chris Wright. “The country as a whole is going very negative on AI, and this is a risk,” Wright told lawmakers during a budget hearing Thursday. “It will be a loss to America if we stop this development and this investment and this improvement.” The DOE head characterized opposition to AI as a risk that’s “very real” and “growing, particularly across rural America.” During the hearing, members of Congress offered a few explanations. Large plots of land are going to massive data-center buildouts. The spike in data centers is then leading to increasing energy costs as resources are diverted to powering and cooling the facilities. “There’s more political signs against AI in our region than for candidates in the upcoming races,” Rep. Marcy Kaptur, D-Ohio, told Wright. “I don’t know what backlash there’s going to be, but I’m telling you it’s coming.” Lawmakers and other coalitions have pushed back on the rapid expansion of AI-ready infrastructure. Senate Democrats sent a letter in November to Commerce Secretary Howard Lutnick and Office of Science and Technology Policy Director Michael Kratsios, calling attention to “soaring electricity bills” and “increasing burdens on water supplies” due to the fast-tracking of data-center buildouts. In December, more than 230 environmental groups urged Congress to put a moratorium on data-center construction. Rising resistance is having an impact. Around $18 billion in data-center projects were blocked over the past two years, according to Data Center Watch, a nonpartisan boutique research firm. The organization found another $46 billion in projects were delayed due

Impact 4.0 · Import 5.8 · Pop 3.5
#107
OpenAI Research 2026-04-20

Hyatt deploys ChatGPT Enterprise across its global workforce, using GPT-5.4 and Codex to improve productivity, operations, and guest experiences.

Impact 4.0 · Import 4.0 · Pop 4.8
#108
FedScoop 2026-04-20

As federal agencies accelerate cloud, artificial intelligence, and IT modernization under fiscal 2026 priorities, a growing mismatch is emerging between rapid cloud adoption and the ability to secure it effectively. That gap is increasingly extending into operational technology (OT) environments, where the consequences of security failure are significantly higher. Federal IT enterprises are now a highly distributed ecosystem spanning on-premises systems, multiple cloud platforms, remote users, and increasingly OT. Systems that were never originally designed to connect with each other — such as industrial controls, critical infrastructure, and mission-support environments — are now interacting with enterprise IT networks and cloud services for monitoring, analytics, and real-time decision-making. The origins of OT security Historically, OT security models were designed for isolated environments, with limited external connectivity and clearly defined boundaries. These security strategies prioritized perimeter defenses and limited connectivity, an unsustainable approach in today’s distributed, always-connected environments. Now, OT systems are increasingly integrated with IT and cloud environments to support mission needs — from predictive maintenance to centralized operations. As OT systems integrate with IT and cloud environments, they begin to inherit the complexity and risk of those environments, often without the same level of security maturity or oversight. At the same time, hybrid and multi-cloud environments have become the default operating model, providing IT staff with increased flexibility and scalability. They also introduce a broader and more dynamic set of potential entry points for adversaries. The cloud complexity gap is becoming a security gap Recent res

Impact 4.0 · Import 5.8 · Pop 3.5
#109

Steve Yegge

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-13

Steve Yegge: I was chatting with my buddy at Google, who's been a tech director there for about 20 years, about their AI adoption. Craziest convo I've had all year. The TL;DR is that Google engineering appears to have the same AI adoption footprint as John Deere, the tractor company. Most of the industry has the same internal adoption curve: 20% agentic power users, 20% outright refusers, 60% still using Cursor or equivalent chat tool. It turns out Google has this curve too. [...] There has been an industry-wide hiring freeze for 18+ months, during which time nobody has been moving jobs. So there are no clued-in people coming in from the outside to tell Google how far behind they are, how utterly mediocre they have become as an eng org. Addy Osmani: On behalf of @Google, this post doesn't match the state of agentic coding at our company. Over 40K SWEs use agentic coding weekly here. Googlers have access to our own versions of @antigravity, @geminicli, custom models, skills, CLIs and MCPs for our daily work. Orchestrators, agent loops, virtual SWE teams and many other systems are actively available to folks. [...] Demis Hassabis: Maybe tell your buddy to do some actual work and to stop spreading absolute nonsense. This post is completely false and just pure clickbait. Tags: addy-osmani, steve-yegge, google, generative-ai, agentic-engineering, ai, llms

Impact 4.0 · Import 4.0 · Pop 4.8
#110

Exploring the new `servo` crate

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-13

Research: Exploring the new `servo` crate In Servo is now available on crates.io the Servo team announced the initial release of the servo crate, which packages their browser engine as an embeddable library. I set Claude Code for web the task of figuring out what it can do, building a CLI tool for taking screenshots using it and working out if it could be compiled to WebAssembly. The servo-shot Rust tool it built works pretty well:

git clone https://github.com/simonw/research
cd research/servo-crate-exploration/servo-shot
cargo build
./target/debug/servo-shot https://news.ycombinator.com/

Here's the result: Compiling Servo itself to WebAssembly is not feasible due to its heavy use of threads and dependencies like SpiderMonkey, but Claude did build me this playground page for trying out a WebAssembly build of the html5ever and markup5ever_rcdom crates, providing a tool for turning fragments of HTML into a parse tree. Tags: research, browsers, rust, webassembly, claude-code, servo

Impact 4.0 · Import 4.0 · Pop 4.8
#111
Latent Space 2026-04-14

As you know we read through /r/localLlama (which has its own monthly top models thread), /r/localLLM, and other local model subreddits on an almost daily basis, and every now and then it is good to step back and survey what the community consensus is landing on, with a sampling of models across different sizes. We started this work to power our local Claw. The top names you should know as a baseline, adjusted for “what people are actually recommending” rather than just benchmark supremacy:

- Qwen 3.5 — most broadly recommended family right now across usecases.
- Gemma 4 — strong recent buzz for local usability, especially smaller and mid-sized deployments.
- GLM-5 / GLM-4.7 — near the top of broad open-model rankings, increasingly part of the “best overall” conversation.
- MiniMax M2.5 / M2.7 — repeatedly cited for agentic/tool-heavy workloads.
- DeepSeek V3.2 — still firmly in the top cluster when people talk about strongest open-weight general models.
- GPT-oss 20B — not the mainstream “winner,” but increasingly recommended as a practical local option and for uncensored variants.

For local coding, the overwhelming consensus is Qwen3-Coder-Next. So that’s easy. Naturally the fuller list is going to have a strong lean on roleplay/creative writing, the #2 usecase of LLMs, and we are NSFW-friendly so here goes… Read more

Impact 4.0 · Import 4.0 · Pop 4.8
#112
Simon Willison's Weblog 2026-04-14

datasette PR #2689: Replace token-based CSRF with Sec-Fetch-Site header protection Datasette has long protected against CSRF attacks using CSRF tokens, implemented using my asgi-csrf Python library. These are something of a pain to work with - you need to scatter forms in templates with <input type="hidden" name="csrftoken" value="{{ csrftoken() }}"> lines and then selectively disable CSRF protection for APIs that are intended to be called from outside the browser. I've been following Filippo Valsorda's research here with interest, described in this detailed essay from August 2025 and shipped as part of Go 1.25 that same month. I've now landed the same change in Datasette. Here's the PR description - Claude Code did much of the work (across 10 commits, closely guided by me and cross-reviewed by GPT-5.4) but I've decided to start writing these PR descriptions by hand, partly to make them more concise and also as an exercise in keeping myself honest. New CSRF protection middleware inspired by Go 1.25 and this research by Filippo Valsorda. This replaces the old CSRF token based protection. Removes all instances of <input type="hidden" name="csrftoken" value="{{ csrftoken() }}"> in the templates - they are no longer needed. Removes the def skip_csrf(datasette, scope): plugin hook defined in datasette/hookspecs.py and its documentation and tests. Updated CSRF protection documentation to describe the new approach. Upgrade guide now describes the CSRF change. Tags: csrf, security, datasette, ai-assisted-programming
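The header-based approach described here can be sketched in a few lines. This is a hedged illustration of the Sec-Fetch-Site check, modeled on the Go 1.25 behavior the post cites; the function name and structure are hypothetical, not Datasette's actual code:

```python
# Sketch of Sec-Fetch-Site based CSRF protection (assumption: modeled on
# the Go 1.25 approach referenced in the post, not Datasette's real code).

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

def is_request_allowed(method: str, headers: dict) -> bool:
    """Decide whether a request passes the CSRF check."""
    if method.upper() in SAFE_METHODS:
        return True  # safe methods are not supposed to mutate state
    fetch_site = headers.get("sec-fetch-site")
    if fetch_site is None:
        # No header: a non-browser client (curl, scripts) or a very old
        # browser. CSRF is a browser-mediated attack, so let it through.
        return True
    if fetch_site in ("same-origin", "none"):
        # "none" covers user-initiated navigation (address bar, bookmark).
        return True
    # "cross-site" (and "same-site", i.e. sibling subdomains) is rejected.
    return False
```

Compared with token-based protection, nothing needs to be rendered into templates and no API endpoints need exemptions; the tradeoff is relying on browsers to send Fetch Metadata headers, which all evergreen browsers now do.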

Impact 4.0 · Import 4.0 · Pop 4.8
#113
Simon Willison's Weblog 2026-04-14

Trusted access for the next era of cyber defense OpenAI's answer to Claude Mythos appears to be a new model called GPT-5.4-Cyber: In preparation for increasingly more capable models from OpenAI over the next few months, we are fine-tuning our models specifically to enable defensive cybersecurity use cases, starting today with a variant of GPT‑5.4 trained to be cyber-permissive: GPT‑5.4‑Cyber. They're also extending a program they launched in February (which I had missed) called Trusted Access for Cyber, where users can verify their identity (via a photo of a government-issued ID processed by Persona) to gain "reduced friction" access to OpenAI's models for cybersecurity work. Honestly, this OpenAI announcement is difficult to follow. Unsurprisingly they don't mention Anthropic at all, but much of the piece emphasizes their many years of existing cybersecurity work and their goal to "democratize access" to these tools, hence the emphasis on that self-service verification flow from February. If you want access to their best security tools you still need to go through an extra Google Form application process though, which doesn't feel particularly different to me from Anthropic's Project Glasswing. Via Hacker News. Tags: security, ai, openai, generative-ai, llms, anthropic, ai-security-research

Impact 4.0 · Import 4.0 · Pop 4.8
#114
Simon Willison's Weblog 2026-04-14

Cybersecurity Looks Like Proof of Work Now The UK's AI Safety Institute recently published Our evaluation of Claude Mythos Preview’s cyber capabilities, their own independent analysis of Claude Mythos which backs up Anthropic's claims that it is exceptionally effective at identifying security vulnerabilities. Drew Breunig notes that AISI's report shows that the more tokens (and hence money) they spent the better the result they got, which leads to a strong economic incentive to spend as much as possible on security reviews: If Mythos continues to find exploits so long as you keep throwing money at it, security is reduced to a brutally simple equation: to harden a system you need to spend more tokens discovering exploits than attackers will spend exploiting them. An interesting result of this is that open source libraries become more valuable, since the tokens spent securing them can be shared across all of their users. This directly counters the idea that the low cost of vibe-coding up a replacement for an open source library makes those open source projects less attractive. Tags: open-source, ai, generative-ai, llms, drew-breunig, vibe-coding, ai-security-research

Impact 4.0 · Import 4.0 · Pop 4.8
#115
War on the Rocks 2026-04-14

The U.S.-Israeli war with Iran, now in an unstable ceasefire, has exposed a structural failure in the global semiconductor memory supply chain, and it is not the one analysts seem to be tracking. The story receiving attention is helium: Qatar’s Ras Laffan facility went offline, a 45-day inventory clock started running, and spot prices doubled within days. The story receiving almost no attention is bromine, and it is potentially the more dangerous one. Bromine is the raw material from which specialized chemical suppliers produce semiconductor-grade hydrogen bromide gas, the etch chemical that South Korean fabs use to carve the transistor The post The Bromine Chokepoint: How Strife in the Middle East Could Halt Production of the World’s Memory Chips appeared first on War on the Rocks.

Impact 4.0 · Import 5.5 · Pop 3.5
#116
NVIDIA AI Blog 2026-04-15

Traditional data centers only stored, retrieved and processed data. In the generative and agentic AI era, these facilities have evolved into AI token factories. With AI inference becoming their primary workload, their primary output is intelligence manufactured in the form of tokens. This transformation demands a corresponding shift in how the economics of AI infrastructure, including total cost of ownership (TCO), is assessed. Enterprises evaluating AI infrastructure still too often focus on peak chip specifications, compute cost or floating point operations per second for every dollar spent, aka FLOPS per dollar. The distinction that matters is this: Compute cost is what enterprises pay for AI infrastructure, whether rented from cloud providers or owned on premises. FLOPS per dollar is how much raw computing power an enterprise gets for every dollar spent, but raw compute and real-world token output are not the same thing. Cost per token is an enterprise’s all-in cost to produce each delivered token, usually represented as cost per million tokens. The first two are merely input metrics. Optimizing for inputs while the business runs on output is a fundamental mismatch. Cost per token determines whether enterprises can profitably scale AI. It’s the one TCO metric that directly accounts for hardware performance, software optimization, ecosystem support and real-world utilization — and NVIDIA delivers the lowest cost per token in the industry. What Are the Factors That Lower Token Cost? Understanding how to optimize token cost requires looking at the equation for calculating cost per million tokens. In this equation, many enterprises evaluating AI infrastructure focus on the numerator: the cost per GPU per hour. For cloud deployments, this is the hourly rate paid to a clo
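The cost-per-token equation the post describes reduces to simple arithmetic: the GPU-hour price sits in the numerator and delivered token throughput in the denominator. A minimal sketch, with every price and throughput figure an illustrative assumption rather than an NVIDIA number:

```python
# Back-of-envelope cost per million delivered tokens. All values below
# (hourly rate, throughput, utilization) are illustrative assumptions.

def cost_per_million_tokens(gpu_hourly_cost: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """All-in cost ($) to produce one million delivered tokens."""
    delivered_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_cost / delivered_per_hour * 1_000_000

# Same hardware price; a better software stack and higher utilization cut
# the output cost by ~4x even though the input metric (the GPU rate) is fixed.
baseline = cost_per_million_tokens(gpu_hourly_cost=4.0,
                                   tokens_per_second=1000,
                                   utilization=0.5)
optimized = cost_per_million_tokens(gpu_hourly_cost=4.0,
                                    tokens_per_second=2500,
                                    utilization=0.8)
print(f"baseline:  ${baseline:.2f} per 1M tokens")
print(f"optimized: ${optimized:.2f} per 1M tokens")
```

This is the post's point in miniature: the denominator moves with software optimization and real-world utilization, not just hardware, so optimizing input metrics like FLOPS per dollar while the business runs on token output is a mismatch.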

Impact 4.0 · Import 4.0 · Pop 4.8
#117
NVIDIA AI Blog 2026-04-15

The NAB Show 2026 trade show , running April 18-22 in Las Vegas, is set to showcase a wave of new features and optimizations for top video editing applications. Bringing together over 60,000 content professionals from across the broadcast and media and entertainment industries, the event highlights how video editors, livestreamers and professional creators are exploring new tools, accelerated by NVIDIA RTX technology, to enhance and streamline their creative workflows. At the show, Adobe is announcing a new Adobe Premiere Color Mode in beta. Designed to function as a dedicated grading environment nested directly within Premiere, it offers a clean, responsive interface that lets editors stay in their creative flow rather than relying on external tools for color correction. Tapping into GPU acceleration on NVIDIA GeForce RTX- and NVIDIA RTX PRO-equipped systems, this streamlined workflow, operating in 32-bit color depth for the first time, delivers significantly faster performance and quality. NVIDIA also launched a new update to NVIDIA Project G-Assist — an experimental AI assistant that helps tune, control and optimize GeForce RTX systems. Color Meets Compute Premiere’s Color Mode is a new clean, responsive interface within Adobe Premiere that enables editors to do color grading on native videos. Every element is designed to guide editors through the grading process without distractions. A large program monitor anchors the experience, providing immediate visual feedback as adjustments are made to enable faster decision-making and more precise control. A clip grid view allows editors to visualize progression across shots in a sequence. This makes it easier to maintain consistency across scenes and ensure a cohesive look throughout a project. Controls are organized into f

Impact 4.0 · Import 4.0 · Pop 4.8
#118

[AINews] Humanity's Last Gasp

Agents & Tools ★ 4.4
Latent Space 2026-04-15

One topic that has come up again and again across Latent Space and AI Engineer is how much harder everyone seems to be working: (friend of the show) Aaron Levie reports that “AI is not causing anyone to do less work right now,” and similarly Silicon Valley people feel their teams are the busiest they’ve ever been. Tyler Cowen argues from an economics standpoint that you should work much harder RIGHT NOW whether you believe AI will lower your value OR increase your value. Simon Last of Notion commented on today’s pod that he’s back to sleepless nights and 24/7 work for the first time since giving up on ML model training, but this time because of agent layer token anxiety. How can it both be true that agents are doing more work and yet everyone is working harder? How can it be true that Claude Mythos has been used internally for 2 months, and yet Claude keeps going down? How can it be true that Model and Agent Labs are more productive than ever and yet are acquihiring and acquiring more than ever? A simple thought exercise we’ve used before is the “Turkey problem”: based on real evidence and an abundance of historical data, turkeys should conclude that life is fantastic and all of humanity is set up to keep turkeys well fed, as far as they’ve ever experienced. Turkey doomsayers would be dismissed as alarmists and crackpots, and then ignored. Until Thanksgiving. Are engineers, or knowledge workers in general, turkeys in this scenario? Should our “elasticity” and value of work be increasingly positive, right up to some crossover point where we become horses? Now that SWE-Bench is saturated (with SWE-Bench Pro soon to be; Mythos is at 78%) and GDPval rates GPT 5.4 as better than/equal to human expert

Impact 4.0 · Import 4.0 · Pop 4.8
#119

datasette-export-database 0.3a1

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-15

Release: datasette-export-database 0.3a1 This plugin was using the ds_csrftoken cookie as part of a custom signed URL, which needed upgrading now that Datasette 1.0a27 no longer sets that cookie. Tags: datasette

Impact 4.0 · Import 4.0 · Pop 4.8
#120

datasette 1.0a27

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-15

Release: datasette 1.0a27 Two major changes in this new Datasette alpha. I covered the first of those in detail yesterday - Datasette no longer uses Django-style CSRF form tokens, instead using modern browser headers as described by Filippo Valsorda. The second big change is that Datasette now fires a new RenameTableEvent any time a table is renamed during a SQLite transaction. This is useful because some plugins (like datasette-comments) attach additional data to table records by name, so a renamed table requires them to react in appropriate ways. Here are the rest of the changes in the alpha:

- New actor= parameter for datasette.client methods, allowing internal requests to be made as a specific actor. This is particularly useful for writing automated tests. (#2688)
- New Database(is_temp_disk=True) option, used internally for the internal database. This helps resolve intermittent database locked errors caused by the internal database being in-memory as opposed to on-disk. (#2683) (#2684)
- The /<database>/<table>/-/upsert API (docs) now rejects rows with null primary key values. (#1936)
- Improved example in the API explorer for the /-/upsert endpoint (docs). (#1936)
- The /<database>.json endpoint now includes an "ok": true key, for consistency with other JSON API responses.
- call_with_supported_arguments() is now documented as a supported public API. (#2678)

Tags: annotated-release-notes, datasette, python

Impact 4.0 · Import 4.0 · Pop 4.8
#121

Quoting John Gruber

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-15

The real goldmine isn’t that Apple gets a cut of every App Store transaction. It’s that Apple’s platforms have the best apps, and users who are drawn to the best apps are thus drawn to the iPhone, Mac, and iPad. That edge is waning. Not because software on other platforms is getting better, but because third-party software on iPhone, Mac, and iPad is regressing to the mean, to some extent, because fewer developers feel motivated — artistically, financially, or both — to create well-crafted idiomatic native apps exclusively for Apple’s platforms. — John Gruber Tags: apple, john-gruber

Impact 4.0 · Import 4.0 · Pop 4.8
#122

Gemini 3.1 Flash TTS

Generative Media ★ 4.4
Simon Willison's Weblog 2026-04-15

Gemini 3.1 Flash TTS Google released Gemini 3.1 Flash TTS today, a new text-to-speech model that can be directed using prompts. It's presented via the standard Gemini API using gemini-3.1-flash-tts-preview as the model ID, but can only output audio files. The prompting guide is surprising, to say the least. Here's their example prompt to generate just a few short sentences of audio:

# AUDIO PROFILE: Jaz R.
## "The Morning Hype"
## THE SCENE: The London Studio

It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.

### DIRECTOR'S NOTES

Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated vowels on excitement words (e.g., "Beauuutiful morning").

Pace: Speaks at an energetic pace, keeping up with the fast music. Speaks with a "bouncing" cadence. High-speed delivery with fluid transitions — no dead air, no gaps.

Accent: Jaz is from Brixton, London

### SAMPLE CONTEXT

Jaz is the industry standard for Top 40 radio, high-octane event promos, or any script that requires a charismatic Estuary accent and 11/10 infectious energy.

#### TRANSCRIPT

[excitedly] Yes, massive vibes in the studio! You are locked in and it is absolutely popping off in London right now. If you're stuck on the tube, or just sat there pretending to work... stop it. Seriously, I see you

Impact 4.0 · Import 4.0 · Pop 4.8
#123

Gemini 3.1 Flash TTS

Generative Media ★ 4.4
Simon Willison's Weblog 2026-04-15

Tool: Gemini 3.1 Flash TTS See my notes on Google's new Gemini 3.1 Flash TTS text-to-speech model. Tags: gemini, google

Impact 4.0 · Import 4.0 · Pop 4.8
#124

Quoting Kyle Kingsbury

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-15

I think we will see some people employed (though perhaps not explicitly) as meat shields: people who are accountable for ML systems under their supervision. The accountability may be purely internal, as when Meta hires human beings to review the decisions of automated moderation systems. It may be external, as when lawyers are penalized for submitting LLM lies to the court. It may involve formalized responsibility, like a Data Protection Officer. It may be convenient for a company to have third-party subcontractors, like Buscaglia, who can be thrown under the bus when the system as a whole misbehaves. — Kyle Kingsbury, The Future of Everything is Lies, I Guess: New Jobs Tags: ai-ethics, careers, ai, kyle-kingsbury

Impact 4.0 · Import 4.0 · Pop 4.8
#125

datasette-ports 0.3

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-15

Release: datasette-ports 0.3 A small update for my tool for helping me figure out what all of the Datasette instances on my laptop are up to.

- Show working directory derived from each PID
- Show the full path to each database file

Output now looks like this:

http://127.0.0.1:8007/ - v1.0a26
  Directory: /Users/simon/dev/blog
  Databases:
    simonwillisonblog: /Users/simon/dev/blog/simonwillisonblog.db
  Plugins:
    datasette-llm
    datasette-secrets

http://127.0.0.1:8001/ - v1.0a26
  Directory: /Users/simon/dev/creatures
  Databases:
    creatures: /tmp/creatures.db

Tags: datasette

Impact 4.0 · Import 4.0 · Pop 4.8
#126
DefenseScoop 2026-04-15

AeroVironment on Wednesday unveiled Mayhem 10, the first in a new product line of unmanned aerial systems that the company plans to market to the Army and other potential customers. Best known for its Switchblade family of kamikaze drones or loitering munitions, AeroVironment describes its new platform as an “autonomous, multi-role launched effects system” with payload flexibility that allows it to perform precision strike, intelligence surveillance and reconnaissance (ISR), electronic warfare, or communications relay missions. The system has a payload capacity of 10 pounds, a cruise speed of 80 miles per hour, a dash speed of more than 120 miles per hour, 50 minutes of endurance and a range of 100 kilometers, according to a product fact sheet. The Group 2 UAS can be launched from air, ground and maritime platforms and was designed to be resistant to jamming, spoofing and anti-navigation methods, according to the company. Brian Young, AeroVironment’s senior vice president for loitering munitions, said AV partnered with Applied Intuition on efforts to give the system “collaborative attack” capabilities, also known as swarming. During a call with reporters ahead of the official product announcement, Young told DefenseScoop that the AI-enabled swarming tech has been tested in lab settings and will eventually be put through its paces with hardware for the military. “Certainly in the lab, all of those algorithms … have been actively tested with Applied Intuition. In terms of, have we gotten, you know, mass effects in the air all at the same time doing that? No, we’re at the beginning of this program right now. So certainly, have simulated all of that, and it represents what would happen in the real world. When you start getting hardware and you want to get 10 in the air, it does g

Impact 5.2 · Import 4.0 · Pop 3.5
#128
arXiv cs.RO, arXiv Robotics-Embodied 2026-04-16
by Shivendra Agrawal, Bradley Hayes

Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection. We demonstrate the versatility of this structured spatial knowledge through critical downstream Human-AI interaction tasks: (1) an intent-driven Semantic Search engine that actively infers categorical alternatives and zones when exact matches fail; (2) a one-shot Semantic Localizer achieving a 1.04 m

Impact 4.0 · Import 4.0 · Pop 4.7
#129
arXiv cs.NE, arXiv RL 2026-04-16
by Lute Lillo, Nick Cheney

Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on single-model preservation, committing to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, reflecting a form of loss of plasticity that single-policy preservation cannot address. Inspired by quality-diversity methods, we introduce TeLAPA (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining skill-aligned neighborhoods with competent and behaviorally related policies that support future relearning. In our MiniGrid CL setting, TeLAPA learns more tasks successfully, recovers competence faster on revisited tasks after interference, and retains higher performance across a sequence of ta

Impact 4.0 · Import 4.0 · Pop 4.7
#130

Amortized Optimal Transport from Sliced Potentials

Research ★ 4.4 multi-source (2)
arXiv stat.ML, arXiv GenMedia 2026-04-16
by Minh-Phuc Truong, Khai Nguyen

We propose a novel amortized optimization method for predicting optimal transport (OT) plans across multiple pairs of measures by leveraging Kantorovich potentials derived from sliced OT. We introduce two amortization strategies: regression-based amortization (RA-OT) and objective-based amortization (OA-OT). In RA-OT, we formulate a functional regression model that treats Kantorovich potentials from the original OT problem as responses and those obtained from sliced OT as predictors, and estimate these models via least-squares methods. In OA-OT, we estimate the parameters of the functional model by optimizing the Kantorovich dual objective. In both approaches, the predicted OT plan is subsequently recovered from the estimated potentials. As amortized OT methods, both RA-OT and OA-OT enable efficient solutions to repeated OT problems across different measure pairs by reusing information learned from prior instances to rapidly approximate new solutions. Moreover, by exploiting the structure provided by sliced OT, the proposed models are more parsimonious, independent of specific structures of the measures, such as the number of atoms in the discrete case, while achieving high accurac
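The sliced OT machinery that RA-OT and OA-OT build on is cheap precisely because one-dimensional OT between equal-weight empirical measures reduces to sorting. A minimal NumPy sketch of that underlying idea (illustrative only, not the authors' amortization models):

```python
import numpy as np

def sliced_wasserstein2(X, Y, n_projections=100, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance between two
    equal-size point clouds X, Y in R^d. In 1D, OT between equal-weight
    empirical measures is solved exactly by matching sorted samples, which
    is what makes sliced OT cheap to compute and to reuse across problems.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)     # random unit direction (a "slice")
        x_proj = np.sort(X @ theta)        # 1D projections, sorted
        y_proj = np.sort(Y @ theta)        # sorting gives the monotone coupling
        total += np.mean((x_proj - y_proj) ** 2)
    return total / n_projections

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
Y = rng.normal(size=(256, 3)) + 2.0        # second cloud, shifted
print(sliced_wasserstein2(X, Y))
```

Because each slice only requires a sort, potentials from many slices are inexpensive to produce, which is what makes them attractive predictors when amortizing solutions to the full OT problem.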

Impact 4.0 · Import 4.0 · Pop 4.7
#131
arXiv SSM, arXiv GenMedia 2026-04-16
by Aihua Li

Flow matching retains the generation quality of diffusion models while enabling substantially faster inference, making it a compelling paradigm for generative modeling. However, when applied to language modeling, it exhibits fundamental limitations in representing complex latent distributions with irregular geometries, such as anisotropy and multimodality. To address these challenges, we propose a mixture-of-experts flow matching (MoE-FM) framework, which captures complex global transport geometries in latent space by decomposing them into locally specialized vector fields. Building on MoE-FM, we develop a non-autoregressive (NAR) language modeling approach, named YAN, instantiated with both Transformer and Mamba architectures. Across multiple downstream tasks, YAN achieves generation quality on par with both autoregressive (AR) and diffusion-based NAR language models, while requiring as few as three sampling steps. This yields a 40× speedup over AR baselines and up to a 1,000× speedup over diffusion language models, demonstrating substantial efficiency advantages for language modeling.

Impact 4.0 · Import 4.0 · Pop 4.7
#132
arXiv SSM, arXiv RL 2026-04-16
by Can Karacelebi, Yusuf Talha Sahin, Elif Surer, Ertan Onur

Ad hoc wireless networks exhibit complex, innate and coupled dynamics: node mobility, energy depletion and topology change that are difficult to model analytically. Model-free deep reinforcement learning requires sustained online interaction, whereas existing model-based approaches use flat state representations that lose per-node structure. We therefore propose G-RSSM, a graph-structured recurrent state space model that maintains per-node latent states with cross-node multi-head attention to learn the dynamics jointly from offline trajectories. We apply the proposed method to the downstream task of clustering, where a cluster-head selection policy trains entirely through imagined rollouts in the learned world model. Across 27 evaluation scenarios spanning MANET, VANET, FANET, WSN and tactical networks with N=30 to 1000 nodes, the learned policy maintains high connectivity despite being trained only at N=50. We propose the first multi-physics graph-structured world model applied to combinatorial per-node decision making in size-agnostic wireless ad hoc networks.

Impact 4.0 · Import 4.0 · Pop 4.7
#133
arXiv RL, arXiv PostTraining 2026-04-16
by Ziwu Sun, Zhen Gao, Jiyong Zhang, Jiaheng Li

We propose a new perspective on policy optimization: rather than reweighting all samples by their importance ratios, an optimizer should select which samples are trustworthy enough to drive a policy update. Building on this view, we introduce Rejection-Gated Policy Optimization (RGPO), which replaces the importance sampling ratio r_θ = π_θ / π_old with a smooth, differentiable acceptance gate α_θ(s, a) = g(r_θ(s, a)) in the range [0, 1]. Unlike prior work that applies rejection sampling as a data-level heuristic before training, RGPO elevates rejection to an optimization principle: the gate participates directly in gradient computation and is implicitly updated alongside the policy. RGPO provides a unified framework: the policy gradients of TRPO, PPO, and REINFORCE all correspond to specific choices of the effective gradient weight w(r) = g'(r) · r. We prove that RGPO guarantees finite, bounded gradient variance even when importance sampling ratios are heavy-tailed (where IS variance diverges). We further show that RGPO incurs only a bounded, controllable bias and provides an approximate monotonic policy improvement guarantee analogous to TRPO. RGPO matches PP
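The gate idea can be made concrete with a toy example. Assuming a sigmoid for the gate g (the framework permits any smooth choice; this particular parameterization is our own illustration), the effective gradient weight w(r) = g'(r) · r stays bounded even for extreme importance ratios, while the raw importance-sampling weight w(r) = r does not:

```python
import math

def sigmoid_gate(r: float, center: float = 1.0, sharpness: float = 4.0) -> float:
    """A hypothetical smooth acceptance gate g(r) in (0, 1). The paper allows
    any smooth differentiable gate; this sigmoid choice is illustrative."""
    return 1.0 / (1.0 + math.exp(-sharpness * (r - center)))

def effective_weight(r: float, center: float = 1.0, sharpness: float = 4.0) -> float:
    """w(r) = g'(r) * r, the effective gradient weight under the gated update."""
    g = sigmoid_gate(r, center, sharpness)
    dg = sharpness * g * (1.0 - g)   # analytic derivative of the sigmoid
    return dg * r

# The raw IS weight w(r) = r is unbounded; the gated weight decays once r
# drifts away from the trust region around r = 1:
for r in [0.5, 1.0, 2.0, 10.0]:
    print(f"r={r:5.1f}  IS weight={r:6.1f}  gated weight={effective_weight(r):.4f}")
```

Per the abstract, the gradients of TRPO, PPO, and REINFORCE each correspond to other specific choices of w(r) within the same framework.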

Impact 4.0 · Import 4.0 · Pop 4.7
#134
arXiv GenMedia, arXiv PostTraining 2026-04-16
by Hassan Ali, Doreen Jirak, Luca Müller, Stefan Wermter

Gesture recognition research, unlike NLP, continues to face acute data scarcity, with progress constrained by the need for costly human recordings or image processing approaches that cannot generate authentic variability in the gestures themselves. Recent advancements in image-to-video foundation models have enabled the generation of photorealistic, semantically rich videos guided by natural language. These capabilities open up new possibilities for creating effort-free synthetic data, raising the critical question of whether video Generative AI models can augment and complement traditional human-generated gesture data. In this paper, we introduce and analyze prompt-based video generation to construct a realistic deictic gestures dataset and rigorously evaluate its effectiveness for downstream tasks. We propose a data generation pipeline that produces deictic gestures from a small number of reference samples collected from human participants, providing an accessible approach that can be leveraged both within and beyond the machine learning community. Our results demonstrate that the synthetic gestures not only align closely with real ones in terms of visual fidelity but also introd

Impact 4.0 · Import 4.0 · Pop 4.7
#135

Generative Modeling of Complex-Valued Brain MRI Data

Research ★ 4.4 multi-source (2)
arXiv GenMedia, arXiv PostTraining 2026-04-16
by Marco Schlimbach, Moritz Rempe, Jessica Mnischek, Lukas T. Rotkopf et al.

Objective. Standard Magnetic Resonance Imaging (MRI) reconstruction pipelines discard phase information captured during acquisition, despite evidence that it encodes tissue properties relevant to tumor diagnosis. Current machine learning approaches inherit this limitation by operating exclusively on reconstructed magnitude images. The aim of this study is to build a generative framework which is capable of jointly modeling magnitude and phase information of complex-valued MRI scans. Approach. The proposed generative framework combines a conditional variational autoencoder, which compresses complex-valued MRI scans into compact latent representations while preserving phase coherence, with a flow-matching-based generative model. Synthetic sample quality is assessed via a real-versus-synthetic classifier and by training downstream classifiers on synthetic data for abnormal tissue detection. Main results. The autoencoder preserves phase coherence above 0.997. Real-versus-synthetic classification yields low AUROC values between 0.50 and 0.66 across all acquisition sequences, indicating generated samples are nearly indistinguishable from real data. In downstream normal-versus-abnormal cl

Impact 4.0 · Import 4.0 · Pop 4.7
#136
Latent Space 2026-04-16

Hot on the heels of the Death of the Code Review, the Pull Request may be next. For anyone that learned to code in the last 15 years it is hard to imagine a life without Git, GitHub, and Pull Requests, but there was a time before them, and it may well come to pass that there is life after. Pull Requests were arguably invented in 2005, successfully popularized by GitHub, and only 21 years later, GitHub is for the first time in history allowing people to disable pull requests on their open source repos (previously you could only disable issues). The rise of generative AI in code has spelled the pending death of the Pull Request for a while now — Pete Steinberger is by now well known (along with Theo) for wanting only Prompt Requests rather than Pull Requests (for multiple reasons, e.g. 1) no merge conflicts, 2) it's easier for the maintainer to fix or add to the prompt than to look at code, 3) it's less likely that malicious or insecure code gets slipped into an innocent-looking PR), and other folks like Mitchell Hashimoto and Amp Code have created “reputation”-based systems for handling untrusted code contributions. In Building for Trillions of Agents, Aaron Levie noted that “the path forward is to make software that agents want.” Humans invented git for human collaboration reasons. It's increasingly clear that Git-based workflows may not be suitable once we remove the human bottleneck from the flow of code. And if Code Reviews are dead, and Pull Requests are dead… how long until Git itself is dead? AI News for 4/14/2026-4/15/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email freque

Impact 4.0 · Import 4.0 · Pop 4.8
#137

llm-anthropic 0.25

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-16

Release: llm-anthropic 0.25

- New model: claude-opus-4.7, which supports thinking_effort: xhigh. #66
- New thinking_display and thinking_adaptive boolean options. thinking_display summarized output is currently only available in JSON output or JSON logs.
- Increased default max_tokens to the maximum allowed for each model.
- No longer uses obsolete structured-outputs-2025-11-13 beta header for older models.

Tags: llm, anthropic, claude

Impact 4.0 · Import 4.0 · Pop 4.8
#138

datasette.io news preview

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-16

Tool: datasette.io news preview The datasette.io website has a news section built from this news.yaml file in the underlying GitHub repository. The YAML format looks like this:

- date: 2026-04-15
  body: |-
    [Datasette 1.0a27](https://docs.datasette.io/en/latest/changelog.html#a27-2026-04-15) changes how CSRF protection works in a way that simplifies form and API integration, and introduces a new `RenameTableEvent` for when a table is renamed by a SQL query.
- date: 2026-03-18
  body: |-
    ...

This format is a little hard to edit, so I finally had Claude build a custom preview UI so that checking for errors has slightly less friction. I built it using standard claude.ai and Claude Artifacts, taking advantage of Claude's ability to clone GitHub repos and look at their content as part of a regular chat:

Clone https://github.com/simonw/datasette.io and look at the news.yaml file and how it is rendered on the homepage. Build an artifact I can paste that YAML into which previews what it will look like, and highlights any markdown errors or YAML errors

Tags: vibe-coding, claude, tools, datasette

Impact 4.0 · Import 4.0 · Pop 4.8
#139
DefenseScoop 2026-04-16

The United States’ naval blockade of maritime traffic entering and leaving Iranian ports is disrupting Iran’s sea-based economic trade, according to senior defense officials who said Thursday that at least 13 ships have retreated and turned back since the military operation was initiated on Monday.  At a Pentagon press briefing, Defense Secretary Pete Hegseth, Chairman of the Joint Chiefs of Staff Gen. Dan Caine and commander of U.S. Central Command Adm. Brad Cooper discussed Operation Epic Fury and the associated, unfolding blockade, which they indicated is being enforced with a combination of integrated intelligence, surveillance and tactical assets. “You, Iran, can choose a prosperous future, a golden bridge, and we hope that you do for the people of Iran. In the meantime, and for as long as it takes, we will maintain this blockade, this successful blockade. But if Iran chooses poorly, then they will have a blockade and bombs dropping on infrastructure, power and energy,” Hegseth said. “And at the same time, Treasury Secretary Scott Bessent and our friends over at [the Treasury Department] are launching Operation Economic Fury as well, maximizing economic pressure across the entirety of the government to Iran. Choose wisely.” Treasury’s Office of Foreign Assets Control revealed on Wednesday that it has been intensifying pressure on Iran’s allegedly criminally-run oil transportation infrastructure.  OFAC sanctioned more than two dozen individuals, companies, and vessels operating within the network of Iranian oil shipping magnate Mohammad Hossein Shamkhani. He’s the son of senior Iranian security official Ali Shamkhani, who was killed by U.S. strikes on the first day of the war.  “Treasury is moving aggressively with Economic Fury by targeting regime el

Impact 5.2 · Import 4.0 · Pop 3.5
#140
arXiv cs.LG, arXiv Efficiency 2026-04-17
by Aswathi Mundayatt, Jaya Sreevalsan-Nair

Existing multi-hazard susceptibility mapping (MHSM) studies often rely on spatially uniform models, treat hazards independently, and provide limited representation of cross-hazard dependence and uncertainty. To address these limitations, this study proposes a deep learning (DL) workflow for joint flood-landslide multi-hazard susceptibility mapping (FL-MHSM) that combines two-level spatial partitioning, probabilistic Early Fusion (EF), a tree-based Late Fusion (LF) baseline, and a soft-gating Mixture of Experts (MoE) model, with MoE serving as the final predictive model. The proposed design preserves spatial heterogeneity through zonal partitions and enables data-parallel large-area prediction using overlapping lattice grids. In Kerala, EF remained competitive with LF, improving flood recall from 0.816 to 0.840 and reducing the Brier score from 0.092 to 0.086, while MoE provided the strongest performance for flood susceptibility, achieving an AUC-ROC of 0.905, a recall of 0.930, and an F1-score of 0.722. In Nepal, EF similarly improved flood recall from 0.820 to 0.858 and reduced the Brier score from 0.057 to 0.049 relative to LF, while MoE outperformed both EF and LF for landslide susceptibility, achie
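A soft-gating mixture of experts like the one named above can be sketched in a few lines. This is a generic illustration of the mechanism (a per-input softmax gate weighting expert outputs), not the paper's FL-MHSM architecture; the experts and gate here are toy stand-ins:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_gated_moe(x, experts, gate_weights):
    """Soft-gating MoE: a gate scores every expert for each input, and the
    prediction is the gate-weighted mixture of expert outputs.

    x            -- (n, d) input features
    experts      -- list of callables, each mapping (n, d) -> (n,)
    gate_weights -- (d, k) linear gate parameters, k = number of experts
    """
    scores = x @ gate_weights                           # (n, k) gate logits
    weights = softmax(scores, axis=1)                   # rows sum to 1
    preds = np.stack([e(x) for e in experts], axis=1)   # (n, k) expert outputs
    return (weights * preds).sum(axis=1)                # convex mix per input

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
experts = [lambda x: x[:, 0], lambda x: x[:, 1] ** 2]   # two toy experts
out = soft_gated_moe(x, experts, rng.normal(size=(3, 2)))
print(out.shape)
```

Because the gate outputs a convex combination, each prediction always lies between the most pessimistic and most optimistic expert for that input, which is what "soft" gating buys over a hard expert switch.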

Impact 4.0 · Import 4.0 · Pop 4.7
#141
arXiv cs.LG, arXiv stat.ML 2026-04-17
by Come Fiegel, Victor Gabillon, Michal Valko

In multi-fidelity optimization, biased approximations of varying costs of the target function are available. This paper studies the problem of optimizing a locally smooth function with a limited budget, where the learner has to make a tradeoff between the cost and the bias of these approximations. We first prove lower bounds for the simple regret under different assumptions on the fidelities, based on a cost-to-bias function. We then present the Kometo algorithm which achieves, with additional logarithmic factors, the same rates without any knowledge of the function smoothness and fidelity assumptions, and improves previously proven guarantees. We finally empirically show that our algorithm outperforms previous multi-fidelity optimization methods without the knowledge of problem-dependent parameters.

Impact 4.0 · Import 4.0 · Pop 4.7
#142
arXiv cs.LG, arXiv stat.ML 2026-04-17
by Hannah Guan, Soukayna Mouatadid, Paulo Orenstein, Judah Cohen et al.

Decision-makers rely on weather forecasts to plant crops, manage wildfires, allocate water and energy, and prepare for weather extremes. Today, such forecasts enjoy unprecedented accuracy out to two weeks thanks to steady advances in physics-based dynamical models and data-driven artificial intelligence (AI) models. However, model skill drops precipitously at subseasonal timescales (2 - 6 weeks ahead), due to compounding errors and persistent biases. To counter this degradation, we introduce probabilistic bias correction (PBC), a machine learning framework that substantially reduces systematic error by learning to correct historical probabilistic forecasts. When applied to the leading dynamical and AI models from the European Centre for Medium-Range Weather Forecasts (ECMWF), PBC doubles the subseasonal skill of the AI Forecasting System and improves the skill of the operationally-debiased dynamical model for 91% of pressure, 92% of temperature, and 98% of precipitation targets. We designed PBC for operational deployment, and, in ECMWF's 2025 real-time forecasting competition, its global forecasts placed first for all weather variables and lead times, outperforming the dynamical mo

Impact 4.0 · Import 4.0 · Pop 4.7
#143

Neuro-Symbolic ODE Discovery with Latent Grammar Flow

Research ★ 4.4 multi-source (2)
arXiv cs.LG, arXiv cs.AI 2026-04-17
by Karin Yu, Eleni Chatzi, Georgios Kissas

Understanding natural and engineered systems often relies on symbolic formulations, such as differential equations, which provide interpretability and transferability beyond black-box models. We introduce Latent Grammar Flow (LGF), a neuro-symbolic generative framework for discovering ordinary differential equations from data. LGF embeds equations as grammar-based representations into a discrete latent space and forces semantically similar equations to be positioned closer together with a behavioural loss. Then, a discrete flow model guides the sampling process to recursively generate candidate equations that best fit the observed data. Domain knowledge and constraints, such as stability, can be either embedded into the rules or used as conditional predictors.

Impact 4.0 · Import 4.0 · Pop 4.7
#144
arXiv cs.LG, arXiv cs.AI 2026-04-17
by Minchul Kang, Changyong Shin, Jinwoo Jeong, Hyunho Lee et al.

Accurate prediction of training time in distributed deep learning is crucial for resource allocation, cost estimation, and job scheduling. We observe that the floating-point precision setting is a key determinant of training time, leading to training time variations of ~2.4x over its minimum. However, existing studies on distributed training time prediction rely on static model computation graphs that do not capture precision variations, including mixed precision. According to our experiments, training time prediction without considering precision results in significant prediction errors - reaching up to 147.85% in mean absolute percentage error (MAPE). To address this issue, we propose a precision-aware distributed training time predictor that achieves robust accuracy across diverse precision settings, including mixed precision, with 9.8% MAPE.
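For reference, the MAPE figures quoted above follow the usual definition: the mean of absolute errors expressed as a percentage of the actual values. A minimal sketch with hypothetical numbers:

```python
def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error, in percent (standard definition)."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical numbers: a precision-unaware predictor assumes one timing for
# a job whose training time actually varies with the precision setting.
actual    = [100.0, 240.0]   # true training times (e.g. minutes)
predicted = [100.0, 100.0]   # predictor ignores the precision difference
print(f"MAPE = {mape(actual, predicted):.2f}%")
```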

Impact 4.0 · Import 4.0 · Pop 4.7
#145

Sentiment Analysis of German Sign Language Fairy Tales

Research ★ 4.4 multi-source (2)
arXiv cs.LG, arXiv cs.CL 2026-04-17
by Fabrizio Nunnari, Siddhant Jain, Patrick Gebhard

We present a dataset and a model for sentiment analysis of German sign language (DGS) fairy tales. First, we perform sentiment analysis for three levels of valence (negative, neutral, positive) on German fairy tales text segments using four large language models (LLMs) and majority voting, reaching an inter-annotator agreement of 0.781 Krippendorff's alpha. Second, we extract face and body motion features from each corresponding DGS video segment using MediaPipe. Finally, we train an explainable model (based on XGBoost) to predict negative, neutral or positive sentiment from video features. Results show an average balanced accuracy of 0.631. A thorough analysis of the most important features reveals that, in addition to eyebrow and mouth motion on the face, the motion of hips, elbows, and shoulders also contributes considerably to discriminating the conveyed sentiment, indicating an equal importance of face and body for sentiment communication in sign language.

Impact 4.0 · Import 4.0 · Pop 4.7
#146
arXiv cs.LG, arXiv cs.AI 2026-04-17
by Alina Deriyeva, Jesper Dannath, Benjamin Paassen

Practice and extensive exercises are essential in programming education. Intelligent tutoring systems (ITSs) are a viable option for providing individualized hints and advice to programming students even when human tutors are not available. However, prior ITSs for programming rarely support the Python programming language, mostly focus on introductory programming, and rarely take recent developments in generative models into account. We aim to establish a novel ITS for Python programming that is highly adaptable, serves both as a teaching and research platform, provides interfaces to plug in hint mechanisms (e.g., via large language models), and works inside the particularly challenging regulatory environment of Germany, that is, conforming to the European data protection regulation, the European AI Act, and the ethical framework of the German Research Foundation. In this paper, we describe the current state of the ITS along with future development directions, and discuss the challenges and opportunities for improving the system.

Impact 4.0 · Import 4.0 · Pop 4.7
#147
arXiv cs.LG, arXiv stat.ML 2026-04-17
by Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

We study the sample complexity of learning an $ε$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any algorithm requires at least $Ω(SAB_{\star}^3/(c_{\min}ε^2))$ samples to return an $ε$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{\min} = 0$ an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this lower bound with an algorithm that matches it, up to logarithmic factors, in the general case, and an algorithm that matches it up to logarithmic factors even when $c_{\min} = 0$, but only under the condition that the optimal policy has a bounded hitting time to the goal state.

Impact 4.0 · Import 4.0 · Pop 4.7
#148
arXiv cs.LG, arXiv stat.ML 2026-04-17
by Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Michal Valko et al.

We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between players, the convergence of the last-iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a bound of $\mathcal{O}(T^{-1/8})$ on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to the performance, with the best attainable rate being $Ω(T^{-1/4})$ in contrast to the usual $Ω(T^{-1/2})$ rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate up to constant and logarithmic factors. The first algorithm leverages a straightforward trade-off between exploration and exploitation, while the second employs a regularization technique based on a two-step mirror descent approach.

Impact 4.0 · Import 4.0 · Pop 4.7
#149
arXiv cs.LG, arXiv cs.AI 2026-04-17
by Weijiang Xiong, Robert Fonod, Nikolas Geroliminis

Traffic forecasting is a challenging spatio-temporal modeling task and a critical component of urban transportation management. Current studies mainly focus on deterministic predictions, with limited consideration of the uncertainty and stochasticity in traffic dynamics. Therefore, this paper proposes an elegant yet universal approach that transforms existing models into probabilistic predictors by replacing only the final output layer with a novel Gaussian Mixture Model (GMM) layer. The modified model requires no changes to the training pipeline and can be trained using only the Negative Log-Likelihood (NLL) loss, without any auxiliary or regularization terms. Experiments on multiple traffic datasets show that our approach generalizes from classic to modern model architectures while preserving deterministic performance. Furthermore, we propose a systematic evaluation procedure based on cumulative distributions and confidence intervals, and demonstrate that our approach is considerably more accurate and informative than unimodal or deterministic baselines. Finally, a more detailed study on a real-world dense urban traffic network examines the impact of data quality.
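The only training signal the proposed GMM layer needs is mixture negative log-likelihood. A minimal 1-D sketch of that loss (the actual layer predicts mixture weights, means, and scales per forecast output; names here are illustrative):

```python
import math

def gmm_nll(y, weights, means, stds):
    """Negative log-likelihood of an observed traffic value y under a
    1-D Gaussian mixture with the given component parameters."""
    density = sum(
        w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )
    return -math.log(density)
```

With a single standard-normal component this reduces to 0.5·log(2π) at y = 0, which is a quick sanity check on the formula.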

Impact 4.0 · Import 4.0 · Pop 4.7
#150
arXiv cs.LG, arXiv PostTraining 2026-04-17
by Jean-Baptiste Escudié, Benjamin Barnes, Stefan Meisegeier, Klaus Kraywinkel et al.

There is no consensus in the field of synthetic data on concise metrics for quality evaluations or benchmarks on large health datasets, such as historical epidemiological data. This study presents an evaluation of seven recent models from major machine learning families. The models were evaluated using four different datasets, each with a distinct scale. To ensure a fair comparison, we systematically tuned the hyperparameters of each model for each dataset. We propose a methodology for evaluating the fidelity of synthesized joint distributions, aligning metrics with visualization on a single plot. This method is applicable to any dataset and is complemented by a domain-specific analysis of the German Cancer Registries' epidemiological dataset. The analysis reveals the challenges models face in strictly adhering to the medical domain. We hope this approach will serve as a foundational framework for guiding the selection of synthesizers and remain accessible to all stakeholders involved in releasing synthetic datasets.

Impact 4.0 · Import 4.0 · Pop 4.7
#151
arXiv cs.CL, arXiv Evals 2026-04-17
by Deshan Sumanathilaka, Nicholas Micallef, Julian Hough, Saman Jayasinghe

Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can effectively disambiguate word senses, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that predicts the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework for plausibility scoring of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine the impact of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter models, on accurate sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human-like plausibility judgments. Furthermore, model ensembling slightly improves performance, better simulating the agreement patterns of five human annotators compared to single-model predictions.

Impact 4.0 · Import 4.0 · Pop 4.7
#152

Optimizing Korean-Centric LLMs via Token Pruning

Research ★ 4.4 multi-source (2)
arXiv cs.CL, arXiv Evals 2026-04-17
by Hoyeol Kim, Hyeonwoo Kim

This paper presents a systematic benchmark of state-of-the-art multilingual large language models (LLMs) adapted via token pruning - a compression technique that eliminates tokens and embedding parameters corresponding to languages irrelevant to the target application. Focusing on Korean-centric natural language processing (NLP) tasks, we evaluate architectures including Qwen3, Gemma-3, Llama-3, and Aya across three vocabulary configurations: Original, English-Korean (EnKo), and English-Korean-Chinese (EnKoZh). Performance is assessed using established benchmarks for general aptitude, cultural literacy, instruction following, and machine translation. Our findings indicate that token pruning significantly improves generation stability by eliminating language confusion, and in the case of machine translation, frequently enhances performance on Korean-specific tasks. While instruction-following capabilities display architecture-dependent variance linked to latent cross-lingual representations, the significant reduction in vocabulary size validates token pruning as a highly effective optimization strategy for memory-constrained, domain-specific deployments, despite modest gains in infe
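Token pruning as described amounts to dropping embedding rows for tokens of out-of-scope languages and remapping the remaining indices. A toy sketch under that reading (function and variable names are hypothetical, not from the paper):

```python
def prune_vocabulary(embeddings, vocab, keep_tokens):
    """Drop embedding rows whose tokens are not in keep_tokens
    (e.g. the EnKo subset), and rebuild a compact token->row index."""
    row_of = {tok: i for i, tok in enumerate(vocab)}
    kept = [tok for tok in vocab if tok in keep_tokens]
    pruned = [embeddings[row_of[tok]] for tok in kept]
    new_index = {tok: i for i, tok in enumerate(kept)}
    return pruned, new_index
```

The memory saving scales with the fraction of rows removed, since input and output embedding tables are often among the largest parameter blocks in multilingual models.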

Impact 4.0 · Import 4.0 · Pop 4.7
#153
arXiv cs.CL, arXiv cs.AI 2026-04-17
by Yanli Wang, Peng Kuang, Xiaoyu Han, Kaidi Xu et al.

Large language models are increasingly deployed in settings where reliability matters, yet output-level uncertainty signals such as token probabilities, entropy, and self-consistency can become brittle under calibration-deployment mismatch. Conformal prediction provides finite-sample validity under exchangeability, but its practical usefulness depends on the quality of the nonconformity score. We propose a conformal framework for LLM question answering that uses internal representations rather than output-facing statistics: specifically, we introduce Layer-Wise Information (LI) scores, which measure how conditioning on the input reshapes predictive entropy across model depth, and use them as nonconformity scores within a standard split conformal pipeline. Across closed-ended and open-domain QA benchmarks, with the clearest gains under cross-domain shift, our method achieves a better validity-efficiency trade-off than strong text-level baselines while maintaining competitive in-domain reliability at the same nominal risk level. These results suggest that internal representations can provide more informative conformal scores when surface-level uncertainty is unstable under distribu
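The paper's contribution is the LI nonconformity score itself; it plugs into the standard split conformal calibration step, which a stdlib sketch makes concrete (illustrative, not the authors' pipeline):

```python
import math

def conformal_threshold(scores, alpha):
    """Split-conformal calibration: the ceil((n+1)(1-alpha))-th smallest
    calibration nonconformity score. At test time, answers scoring at or
    below this threshold enter the prediction set, which covers the true
    answer with probability >= 1-alpha under exchangeability."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:  # too few calibration points for this alpha
        return float("inf")
    return sorted(scores)[k - 1]
```

The coverage guarantee holds for any nonconformity score; the score's quality (here, the LI score vs. text-level baselines) only affects how small the prediction sets are.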

Impact 4.0 · Import 4.0 · Pop 4.7
#154
arXiv cs.CL, arXiv cs.AI 2026-04-17
by Jessica H. Zhu, Shayla Stringfield, Vahe Zaprosyan, Michael Wagner et al.

Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advancements in large language models (LLMs) have opened the door to automating this process, though concerns remain about whether these models can accurately and ethically capture the experiences of vulnerable populations. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results demonstrate that while some configurations of LLMs can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and limitations of LLM-assisted qualitative coding and underscore the e

Impact 4.0 · Import 4.0 · Pop 4.7
#155
arXiv cs.CL, arXiv cs.AI 2026-04-17
by Nicolò Pagan, Christopher Barrie, Chris Andrew Bail, Petter Törnberg

Large Language Models (LLMs) are increasingly deployed to curate and rank human-created content, yet the nature and structure of their biases in these tasks remain poorly understood: which biases are robust across providers and platforms, and which can be mitigated through prompt design. We present a controlled simulation study mapping content selection biases across three major LLM providers (OpenAI, Anthropic, Google) on real social media datasets from Twitter/X, Bluesky, and Reddit, using six prompting strategies (general, popular, engaging, informative, controversial, neutral). Through 540,000 simulated top-10 selections from pools of 100 posts across 54 experimental conditions, we find that biases differ substantially in how structural and how prompt-sensitive they are. Polarization is amplified across all configurations, toxicity handling shows a strong inversion between engagement- and information-focused prompts, and sentiment biases are predominantly negative. Provider comparisons reveal distinct trade-offs: GPT-4o Mini shows the most consistent behavior across prompts; Claude and Gemini exhibit high adaptivity in toxi

Impact 4.0 · Import 4.0 · Pop 4.7
#156
arXiv cs.CL, arXiv PostTraining 2026-04-17
by Junyi Li, Yongqiang Chen, Ningning Ding

Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to address complex questions, presents a dilemma to unlearning: existing methods either struggle to completely eliminate undesired knowledge from the CoT traces or degrade the reasoning performances due to the interference with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as the targeted intervention of the CoT reasoning in LRMs. More specifically, given a desired unlearning target answer, CiPO instructs LRMs to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both desirable unlearning and smooth optimizatio

Impact 4.0 · Import 4.0 · Pop 4.7
#157

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

Agents & Tools ★ 4.4 multi-source (2)
arXiv cs.CL, arXiv Agents 2026-04-17
by Shidong Yang, Ziyu Ma, Tongwen Huang, Yiming Hu et al.

Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent's evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-loop, interaction-driven training. Specifically, CoEvolve extracts feedback signals such as forgetting and uncertainty from rollout trajectories to identify failure-prone interaction patterns, and utilizes them to guide LLM-based task synthesis. The synthesized tasks are validated through environment interaction and utilized to update the data distribution, enabling joint adaptation of the agent and its data. Extensive experiments on AppWorld and BFCL across Qwen2.5-7B, Qwen3-4B, and Qwen3-30B-A3B demonstrate consistent and significant improvements over strong base models, yielding absolute gains of 19.43%, 15.58%, and 18.14%, respectively.

Impact 4.0 · Import 4.0 · Pop 4.7
#158
arXiv cs.CL, arXiv Agents 2026-04-17
by Chengwu Liu, Yichun Yin, Ye Yuan, Jiaxuan Xie et al.

Most ATP benchmarks embed the final answer within the formal statement -- a convention we call "Easy Mode" -- a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode -- while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are u

Impact 4.0 · Import 4.0 · Pop 4.7
#159
arXiv cs.CL, arXiv Agents 2026-04-17
by Yueling Fan, Richard Lee Davis, Olga Viberg

This study presents WriteFlow, an AI voice-based writing assistant designed to support reflective academic writing through goal-oriented interaction. Academic writing involves iterative reflection and evolving goal regulation, yet prior research and a formative study with 17 participants show that writers often struggle to articulate and manage changing goals. While commonly used AI writing tools emphasize efficiency, they offer limited support for metacognition and writer agency. WriteFlow frames AI interaction as a dialogic space for ongoing goal articulation, monitoring, and negotiation grounded in writers' intentions. Findings from a Wizard-of-Oz study with 12 expert users show that WriteFlow scaffolds metacognitive regulation and reflection-in-action by supporting iterative goal refinement, maintaining goal-text alignment during drafting, and prompting evaluation of goal fulfillment. We discuss design implications for AI writing systems that prioritize reflective dialogue, flexible goal structures, and multi-perspective feedback to support intentional and agentic writing.

Impact 4.0 · Import 4.0 · Pop 4.7
#160
arXiv cs.CL, arXiv Efficiency 2026-04-17
by Chi Liu, Xin Chen, Xu Zhou, Fangbo Tu et al.

Large Language Models (LLMs) have achieved remarkable success, underpinning diverse AI applications. However, they often suffer from performance degradation due to factors such as catastrophic forgetting during Supervised Fine-Tuning (SFT), quantization, and pruning. In this work, we introduce a performance recovery framework based on Self-Distillation Fine-Tuning (SDFT) that effectively restores model capabilities. Complementing this practical contribution, we provide a rigorous theoretical explanation for the underlying recovery mechanism. We posit that an LLM's generative capability fundamentally relies on the high-dimensional manifold constructed by its hidden layers. To investigate this, we employ Centered Kernel Alignment (CKA) to quantify the alignment between student and teacher activation trajectories, leveraging its invariance to orthogonal transformations and scaling. Our experiments demonstrate a strong correlation between performance recovery and manifold alignment, substantiating the claim that self-distillation effectively aligns the student's high-dimensional manifold with the optimal structure represented by the teacher. This study bridges the gap between practical
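CKA here is presumably the standard linear variant (the paper may use a kernel form): it compares centered activation matrices and is invariant to orthogonal transformations and isotropic scaling, as the abstract notes. A stdlib sketch:

```python
def _center(m):
    """Subtract the per-feature (column) mean; rows are samples."""
    means = [sum(col) / len(m) for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def _gram(a, b):
    """a^T b for row-major matrices a (n x p) and b (n x q)."""
    return [[sum(ra[i] * rb[j] for ra, rb in zip(a, b))
             for j in range(len(b[0]))] for i in range(len(a[0]))]

def _fro2(m):
    """Squared Frobenius norm."""
    return sum(v * v for row in m for v in row)

def linear_cka(x, y):
    """Linear CKA between two activation matrices (rows = samples).
    Returns 1.0 when the representations match up to rotation/scale;
    assumes non-constant activations (nonzero denominator)."""
    x, y = _center(x), _center(y)
    num = _fro2(_gram(x, y))
    den = (_fro2(_gram(x, x)) * _fro2(_gram(y, y))) ** 0.5
    return num / den
```

Scaling one matrix leaves the score unchanged, which is exactly the invariance the authors rely on when comparing student and teacher trajectories.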

Impact 4.0 · Import 4.0 · Pop 4.7
#161

ASMR-Bench: Auditing for Sabotage in ML Research

Research ★ 4.4 multi-source (2)
arXiv cs.AI, arXiv Evals 2026-04-17
by Eric Gan, Aryan Bhatt, Buck Shlegeris, Julian Stastny et al.

As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce ASMR-Bench (Auditing for Sabotage in ML Research), a benchmark for evaluating the ability of auditors to detect sabotage in ML research codebases. ASMR-Bench consists of 9 ML research codebases with sabotaged variants that produce qualitatively different experimental results. Each sabotage modifies implementation details, such as hyperparameters, training data, or evaluation code, while preserving the high-level methodology described in the paper. We evaluated frontier LLMs and LLM-assisted human auditors on ASMR-Bench and found that both struggled to reliably detect sabotage: the best performance was an AUROC of 0.77 and a top-1 fix rate of 42%, achieved by Gemini 3.1 Pro. We also tested LLMs as red teamers and found that LLM-generated sabotages were weaker than human-generated ones but still sometimes evaded same-capability LLM auditors. We release ASMR-Bench to support research on monitoring and auditing techniques for AI-conducted research.

Impact 4.0 · Import 4.0 · Pop 4.7
#162
arXiv cs.AI, arXiv cs.CV 2026-04-17
by Van-Truong Le, Le-Khanh Nguyen, Trong-Doanh Nguyen

Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is inefficient, costly, and prone to errors at scale. Although some existing AI-powered monitoring systems have been deployed and trusted, many lack transparency or require multi-layered architectures to achieve the desired performance. To overcome these challenges, we propose an improvement over a simple two-stage framework for exam cheating detection that integrates object detection and behavioral analysis using well-known technologies. First, the state-of-the-art YOLOv8n model is used to localize students in exam-room images. Each detected region is cropped and preprocessed, then classified by a fine-tuned RexNet-150 model as either normal or cheating behavior. The system is trained on a dataset compiled from 10 independent sources with a total of 273,897 samples, achieving 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score - a 13% increase over a baseline accuracy of 0.82 in video-based cheating detection. In addition, with an average inference time of 13.9 ms per sample, the proposed approach demonstrates robustness and

Impact 4.0 · Import 4.0 · Pop 4.7
#163
arXiv cs.AI, arXiv cs.CV 2026-04-17
by Henry O. Velesaca, Luigi Miranda, Angel D. Sappa

This paper presents SWNet, a bimodal end-to-end cross-spectral network specifically engineered for the detection of camouflaged weeds in dense agricultural environments. Plant camouflage, characterized by homochromatic blending where invasive species mimic the phenotypic traits of primary crops, poses a significant challenge for traditional computer vision systems. To overcome these limitations, SWNet utilizes a Pyramid Vision Transformer v2 backbone to capture long-range dependencies and a Bimodal Gated Fusion Module to dynamically integrate Visible and Near-Infrared information. By leveraging the physiological differences in chlorophyll reflectance captured in the NIR spectrum, the proposed architecture effectively discriminates targets that are otherwise indistinguishable in the visible range. Furthermore, an Edge-Aware Refinement module is employed to produce sharper object boundaries and reduce structural ambiguity. Experimental results on the Weeds-Banana dataset indicate that SWNet outperforms ten state-of-the-art methods. The study demonstrates that the integration of cross-spectral data and boundary-guided refinement is essential for high segmentation accuracy in complex c

Impact 4.0 · Import 4.0 · Pop 4.7
#164
arXiv cs.AI, arXiv Efficiency 2026-04-17
by Lin Deng, Chang-bo Liu

We extracted the scholarly reasoning systems of two internationally prominent humanities and social science scholars from their published corpora alone, converted those systems into structured inference-time constraints for a large language model, and tested whether the resulting scholar-bots could perform core academic functions at expert-assessed quality. The distillation pipeline used an eight-layer extraction method and a nine-module skill architecture grounded in local, closed-corpus analysis. The scholar-bots were then deployed across doctoral supervision, peer review, lecturing and panel-style academic exchange. Expert assessment involved three senior academics producing reports and appointment-level syntheses. Across the preserved expert record, all review and supervision reports judged the outputs benchmark-attaining, appointment-level recommendations placed both bots at or above Senior Lecturer level in the Australian university system, and recovered panel scores placed Scholar A between 7.9 and 8.9/10 and Scholar B between 8.5 and 8.9/10 under multi-turn debate conditions. A research-degree-student survey showed high performance ratings across information reliability, th

Impact 4.0 · Import 4.0 · Pop 4.7
#165
arXiv cs.AI, arXiv PostTraining 2026-04-17
by Janet Vertesi, danah boyd, Alex Taylor, Benjamin Shestakofsky

The Project of AI is a world-building endeavor, wherein those who fund and develop AI systems both operate through and seek to sustain networks of power and wealth. As they expand their access to resources and configure our sociotechnical conditions, they benefit from the ways in which a suite of decoys animate scholars, critics, policymakers, journalists, and the public into co-constructing industry-empowering AI futures. Regardless of who constructs or nurtures them, these decoys often create the illusion of accountability while both masking the emerging political economies that the Project of AI has set into motion, and also contributing to the network-making power that is at the heart of the Project's extraction and exploitation. Drawing on literature at the intersection of communication, science and technology studies, and economic sociology, we examine how the Project of AI is constructed. We then explore five decoys that seemingly critique - but in actuality co-constitute - AI's emergent power relations and material political economy. We argue that advancing meaningful fairness or accountability in AI requires: 1) recognizing when and how decoys serve as a distraction, and 2

Impact 4.0 · Import 4.0 · Pop 4.7
#166

Neurosymbolic Repo-level Code Localization

Research ★ 4.4 multi-source (2)
arXiv cs.AI, arXiv Agents 2026-04-17
by Xiufeng Xu, Xiufeng Wu, Zejun Zhang, Yi Li

Code localization is a cornerstone of autonomous software engineering. Recent advancements have achieved impressive performance on real-world issue benchmarks. However, we identify a critical yet overlooked bias: these benchmarks are saturated with keyword references (e.g. file paths, function names), encouraging models to rely on superficial lexical matching rather than genuine structural reasoning. We term this phenomenon the Keyword Shortcut. To address this, we formalize the challenge of Keyword-Agnostic Logical Code Localization (KA-LCL) and introduce KA-LogicQuery, a diagnostic benchmark requiring structural reasoning without any naming hints. Our evaluation reveals a catastrophic performance drop of state-of-the-art approaches on KA-LogicQuery, exposing their lack of deterministic reasoning capabilities. We propose LogicLoc, a novel agentic framework that combines large language models with the rigorous logical reasoning of Datalog for precise localization. LogicLoc extracts program facts from the codebase and leverages an LLM to synthesize Datalog programs, with parser-gated validation and mutation-based intermediate-rule diagnostic feedback to ensure correctness and effici

Impact 4.0 · Import 4.0 · Pop 4.7
#167
arXiv cs.AI, arXiv Agents 2026-04-17
by Hamed Jelodar, Samita Bai, Mohammad Meymani, Parisa Hamedi et al.

Generative AI, particularly Large Language Models, increasingly integrates graph-based representations to enhance reasoning, retrieval, and structured decision-making. Despite rapid advances, there remains limited clarity regarding when, why, where, and what types of graph-LLM integrations are most appropriate across applications. This survey provides a concise, structured overview of the design choices underlying the integration of graphs with LLMs. We categorize existing methods based on their purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, interaction graphs, causal graphs, dependency graphs), and integration strategies (prompting, augmentation, training, or agent-based use). By mapping representative works across domains such as cybersecurity, healthcare, materials science, finance, robotics, and multimodal environments, we highlight the strengths, limitations, and best-fit scenarios for each technique. This survey aims to offer researchers a practical guide for selecting the most suitable graph-LLM approach depending on task requirements, data characteristics, and reasoning complexity.

Impact 4.0 · Import 4.0 · Pop 4.7
#168
arXiv cs.RO, arXiv cs.CV 2026-04-17
by Nikhil Behari, Diego Rivero, Luke Apostolides, Suman Ghosh et al.

Consumer LiDARs in mobile devices and robots typically output a single depth value per pixel. Yet internally, they record full time-resolved histograms containing direct and multi-bounce light returns; these multi-bounce returns encode rich non-line-of-sight (NLOS) cues that can enable perception of hidden objects in a scene. However, severe hardware limitations of consumer LiDARs make NLOS reconstruction with conventional methods difficult. In this work, we motivate a complementary direction: enabling NLOS perception with low-cost LiDARs through data-driven inference. We present DENALI, the first large-scale real-world dataset of space-time histograms from low-cost LiDARs capturing hidden objects. We capture time-resolved LiDAR histograms for 72,000 hidden-object scenes across diverse object shapes, positions, lighting conditions, and spatial resolutions. Using our dataset, we show that consumer LiDARs can enable accurate, data-driven NLOS perception. We further identify key scene and modeling factors that limit performance, as well as simulation-fidelity gaps that hinder current sim-to-real transfer, motivating future work toward scalable NLOS vision with consumer LiDARs.

Impact 4.0 · Import 4.0 · Pop 4.7
#169
arXiv cs.RO, arXiv Efficiency 2026-04-17
by Fazeng Li, Gan Sun, Chenxi Liu, Yao He et al.

Hand-eye calibration through visual localization is a critical capability for robotic manipulation in open-world environments. However, most deep learning-based calibration models suffer from catastrophic forgetting when adapting to unseen data amid open-world scene changes, while simple rehearsal-based continual learning strategies cannot fully mitigate this issue. To overcome this challenge, we propose a continual hand-eye calibration framework, enabling robots to adapt to sequentially encountered open-world manipulation scenes through a spatial-aware replay strategy and structure-preserving distillation. Specifically, a Spatial-Aware Replay Strategy (SARS) constructs a geometrically uniform replay buffer that ensures comprehensive coverage of each scene's pose space, replacing redundant adjacent frames with maximally informative viewpoints. Meanwhile, a Structure-Preserving Dual Distillation (SPDD) is proposed to decompose localization knowledge into coarse scene layout and fine pose precision, and distills them separately to alleviate both types of forgetting during continual adaptation. As a new manipulation scene arrives, SARS provides geometrically representative replay samples fr

Impact 4.0 · Import 4.0 · Pop 4.7
#170
arXiv cs.RO, arXiv Agents 2026-04-17
by Hürkan Şahin, Van Huyen Dang, Erdi Sayar, Alper Yegenoglu et al.

Reinforcement learning (RL) often struggles in real-world tasks with high-dimensional state spaces and long horizons, where sparse or fixed rewards severely slow down exploration and cause agents to get trapped in local optima. This paper presents a fuzzy-logic-based reward shaping method that integrates human intuition into RL reward design. By encoding expert knowledge into adaptive and interpretable terms, fuzzy rules promote stable learning and reduce sensitivity to hyperparameters. The proposed method leverages these properties to adapt reward contributions based on the agent state, enabling smoother transitions between fast motion and precise control in challenging navigation tasks. Extensive simulation results on autonomous drone racing benchmarks show stable learning behavior and consistent task performance across scenarios of increasing difficulty. The proposed method achieves faster convergence and reduced performance variability across training seeds in more challenging environments, with success rates improving by up to approximately 5 percent compared to non-fuzzy reward formulations.
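One way to read "fuzzy rules adapt reward contributions based on the agent state" is a membership-weighted blend of reward terms. A toy sketch with hypothetical terms and scales (not the paper's rule base):

```python
def tri(x, a, b, c):
    """Triangular fuzzy membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def shaped_reward(dist_to_gate, speed):
    """Blend a 'be precise' term and a 'go fast' term by the fuzzy
    degree of nearness to the next racing gate (toy units)."""
    near = tri(dist_to_gate, -1.0, 0.0, 5.0)  # fully 'near' at 0 m
    far = 1.0 - near
    precision_term = -dist_to_gate  # reward closing the gap
    speed_term = speed              # reward forward progress
    return near * precision_term + far * speed_term
```

Because memberships vary smoothly with the state, the reward transitions gradually from speed-seeking to precision-seeking rather than switching abruptly, which is the stability property the abstract emphasizes.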

Impact 4.0 · Import 4.0 · Pop 4.7
#171
arXiv cs.RO, arXiv Robotics-Embodied 2026-04-17
by Xu Huang, Weixin Mao, Yinhao Li, Hua Chen et al.

Vision-Language-Action (VLA) models have demonstrated significant potential for embodied decision-making; however, their application in complex chemical laboratory automation remains restricted by limited long-horizon reasoning and the absence of persistent experience accumulation. Existing frameworks typically treat planning and execution as decoupled processes, often failing to consolidate successful strategies, which results in inefficient trial-and-error in multi-stage protocols. In this paper, we propose ChemBot, a dual-layer, closed-loop framework that integrates an autonomous AI agent with a progress-aware VLA model (Skill-VLA) for hierarchical task decomposition and execution. ChemBot utilizes a dual-layer memory architecture to consolidate successful trajectories into retrievable assets, while a Model Context Protocol (MCP) server facilitates efficient sub-agent and tool orchestration. To address the inherent limitations of VLA models, we further implement a future-state-based asynchronous inference mechanism to mitigate trajectory discontinuities. Extensive experiments on collaborative robots demonstrate that ChemBot achieves superior operational safety, precision, and ta

Impact 4.0 · Import 4.0 · Pop 4.7
#172

Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan

Generative Media ★ 4.4 multi-source (2)
arXiv cs.CV, arXiv GenMedia 2026-04-17
by Shivarth Rai, Tejeswar Pokuri

Atmospheric haze significantly degrades wildlife imagery, impeding computer vision applications critical for conservation, such as animal detection, tracking, and behavior analysis. To address this challenge, we introduce AnimalHaze3k, a synthetic dataset comprising 3,477 hazy images generated from 1,159 clear wildlife photographs through a physics-based pipeline. Our novel IncepDehazeGan architecture combines inception blocks with residual skip connections in a GAN framework, achieving state-of-the-art performance (SSIM: 0.8914, PSNR: 20.54, LPIPS: 0.1104) and delivering 6.27% higher SSIM and 10.2% better PSNR than competing approaches. When applied to downstream detection tasks, dehazed images improved YOLOv11 detection mAP by 112% and IoU by 67%. These advances can provide ecologists with reliable tools for population monitoring and surveillance in challenging environmental conditions, demonstrating significant potential for enhancing wildlife conservation efforts through robust visual analytics.
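For readers unfamiliar with the reported metrics: PSNR is a standard log-scale measure of pixel-level fidelity between a restored and a clean image. A minimal pure-Python reference computation (the paper presumably uses a library implementation over 2-D images; this flattens pixels into a sequence):

```python
import math

def psnr(clean, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel
    sequences; higher means the restored image is closer to the clean one."""
    mse = sum((c - r) ** 2 for c, r in zip(clean, restored)) / len(clean)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

A 10.2% PSNR gain, as reported for IncepDehazeGan, corresponds to a sizeable drop in mean squared error because of the logarithm.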

Impact 4.0 · Import 4.0 · Pop 4.7
#173
arXiv cs.CV, arXiv GenMedia 2026-04-17
by Yue Jiang, Mingyu Yang, Liuyuxin Yang, Yang Xu et al.

Recent advances in generative motion synthesis have enabled the production of realistic human motions from diverse input modalities. However, synthesizing compound actions from texts, which integrate multiple concurrent actions into coherent full-body sequences, remains a major challenge. We identify two key limitations in current text-to-motion diffusion models: (i) catastrophic neglect, where earlier actions are overwritten by later ones due to improper handling of temporal information, and (ii) attention collapse, which arises from excessive feature fusion in cross-attention mechanisms. As a result, existing approaches often depend on overly detailed textual descriptions (e.g., raising right hand), explicit body-part specifications (e.g., editing the upper body), or the use of large language models (LLMs) for body-part interpretation. These strategies lead to deficient semantic representations of physical structures and kinematic mechanisms, limiting the ability to incorporate natural behaviors such as greeting while walking. To address these issues, we propose the Motion-Adapter, a plug-and-play module that guides text-to-motion diffusion models in generating compound actions b

Impact 4.0 · Import 4.0 · Pop 4.7
#174

The Amazing Stability of Flow Matching

Generative Media ★ 4.4 multi-source (2)
arXiv cs.CV, arXiv GenMedia 2026-04-17
by Rania Briq, Michael Kamp, Ohad Fried, Sarel Cohen et al.

The success of deep generative models in generating high-quality and diverse samples is often attributed to particular architectures and large training datasets. In this paper, we investigate the impact of these factors on the quality and diversity of samples generated by flow-matching models. Surprisingly, in our experiments on the CelebA-HQ dataset, flow matching remains stable even when pruning 50% of the dataset: the quality and diversity of generated samples are preserved. Moreover, pruning impacts the latent representation only slightly; that is, samples generated by models trained on the full and pruned datasets map to visually similar outputs for a given seed. We observe similar stability when changing the architecture or training configuration, such that the latent representation is maintained under these changes as well. Our results quantify just how strong this stability can be in practice, and help explain the reliability of flow-matching models under various perturbations.
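For context on the objective being stress-tested: flow matching trains a velocity field to match the constant velocity (x1 - x0) along straight interpolation paths between a noise sample x0 and a data sample x1. A minimal scalar sketch of the path and the regression target (no network, just the loss structure):

```python
def interpolate(x0, x1, t):
    """Linear probability path x_t = (1 - t) * x0 + t * x1 for t in [0, 1]."""
    return (1.0 - t) * x0 + t * x1

def fm_loss(v_pred, x0, x1):
    """Squared error of a predicted velocity against the conditional
    flow-matching target, which is simply x1 - x0 on linear paths."""
    return (v_pred - (x1 - x0)) ** 2
```

Because the target depends only on the endpoints, not on t, the regression problem is unusually well-conditioned, which is consistent with the stability the paper observes under dataset pruning.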

Impact 4.0 · Import 4.0 · Pop 4.7
#175

Efficient Video Diffusion Models: Advancements and Challenges

Generative Media ★ 4.4 multi-source (2)
arXiv GenMedia, arXiv Efficiency 2026-04-17
by Shitong Shao, Lichen Bai, Pengfei Wan, James Kwok et al.

Video diffusion models have rapidly become the dominant paradigm for high-fidelity generative video synthesis, but their practical deployment remains constrained by severe inference costs. Compared with image generation, video synthesis compounds computation across spatial-temporal token growth and iterative denoising, making attention and memory traffic major bottlenecks in real-world settings. This survey provides a systematic and deployment-oriented review of efficient video diffusion models. We propose a unified categorization that organizes existing methods into four main paradigms: step distillation, efficient attention, model compression, and cache/trajectory optimization. Building on this categorization, we analyze the algorithmic trends of each paradigm and examine how different design choices target two core objectives: reducing the number of function evaluations and minimizing per-step overhead. Finally, we discuss open challenges and future directions, including quality preservation under composite acceleration, hardware-software co-design, robust real-time long-horizon generation, and open infrastructure for standardized evaluation.

Impact 4.0 · Import 4.0 · Pop 4.7
#176
arXiv Efficiency, arXiv AIScience 2026-04-17
by Yuanchang Zhou, Hongyu Wang, Yiming Du, Yan Wang et al.

Universal Machine Learning Interatomic Potentials (uMLIPs), pre-trained on massively diverse datasets encompassing inorganic materials and organic molecules across the entire periodic table, serve as foundational models for quantum-accurate physical simulations. However, uMLIP training requires second-order derivatives, which lack corresponding parallel training frameworks; moreover, scaling to the billion-parameter regime causes explosive growth in computation and communication overhead, making its training a tremendous challenge. We introduce MatRIS-MoE, a billion-parameter Mixture-of-Experts model built upon an invariant architecture, and Janus, a pioneering high-dimensional distributed training framework for uMLIPs with hardware-aware optimizations. Deployed across two Exascale supercomputers, our code attains a peak performance of 1.2/1.0 EFLOPS (24%/35.5% of theoretical peak) in single precision at over 90% parallel efficiency, compressing the training of billion-parameter uMLIPs from weeks to hours. This work establishes a new high-water mark for AI-for-Science (AI4S) foundation models at Exascale and provides essential infrastructure for rapid scientific discovery.

Impact 4.0 · Import 4.0 · Pop 4.7
#177
Simon Willison's Weblog 2026-04-17

This year's PyCon US is coming up next month from May 13th to May 19th, with the core conference talks from Friday 15th to Sunday 17th and tutorial and sprint days either side. It's in Long Beach, California this year, the first time PyCon US has come to the West Coast since Portland, Oregon in 2017 and the first time in California since Santa Clara in 2013. If you're based in California this is a great opportunity to catch up with the Python community, meet a whole lot of interesting people and learn a ton of interesting things. In addition to regular PyCon programming we have two new dedicated tracks at the conference this year: an AI track on Friday and a Security track on Saturday. The AI program was put together by track chairs Silona Bonewald (CitableAI) and Zac Hatfield-Dodds (Anthropic). I'll be an in-the-room chair this year, introducing speakers and helping everything run as smoothly as possible. Here's the AI track schedule in full:

11:00: AI-Assisted Contributions and Maintainer Load - Paolo Melchiorre
11:45: AI-Powered Python Education: Towards Adaptive and Inclusive Learning - Sonny Mupfuni
12:30: Making African Languages Visible: A Python-Based Guide to Low-Resource Language ID - Gift Ojeabulu
2:00: Running Large Language Models on Laptops: Practical Quantization Techniques in Python - Aayush Kumar JVS
2:45: Distributing AI with Python in the Browser: Edge Inference and Flexibility Without Infrastructure - Fabio Pliger
3:30: Don't Block the Loop: Python Async Patterns for AI Agents - Aditya Mehra
4:30: What Python Developers Need to Know About Hardware: A Practical Guide to GPU Memory, Kernel Scheduling, and Execution Models - Santosh Appachu Devanira Poovaiah
5:15: How to Build Your First Real-Time Voice Agent in Python (Without Losing Your Mind) - Cami

Impact 4.0 · Import 4.0 · Pop 4.8
#178

datasette 1.0a28

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-17

Release: datasette 1.0a28

I was upgrading Datasette Cloud to 1.0a27 and discovered a nasty collection of accidental breakages caused by changes in that alpha. This new alpha addresses those directly:

Fixed a compatibility bug introduced in 1.0a27 where execute_write_fn() callbacks with a parameter name other than conn were seeing errors. (#2691)
The database.close() method now also shuts down the write connection for that database.
New datasette.close() method for closing down all databases and resources associated with a Datasette instance. This is called automatically when the server shuts down. (#2693)
Datasette now includes a pytest plugin which automatically calls datasette.close() on temporary instances created in function-scoped fixtures and during tests. See Automatic cleanup of Datasette instances for details. This helps avoid running out of file descriptors in plugin test suites that were written before the Database(is_temp_disk=True) feature introduced in Datasette 1.0a27. (#2692)

Most of the changes in this release were implemented using Claude Code and the newly released Claude Opus 4.7.

Tags: datasette

Impact 4.0 · Import 4.0 · Pop 4.8
#179
YT: Computerphile 2026-04-17

B is the forerunner to C - but it seemed lost, until Angelo Papenhoff decided to change that and brought it back from the brink! Here he tries to recreate Brian Kernighan's original 'Hello World' with a few wrinkles! Angelo's B compiler is here: https://github.com/aap/b Angelo's emulators: https://github.com/aap/blincolnlights There's a typo at the end of the video; it should of course read "Obsolescence". Computerphile is supported by Jane Street. Learn more about them (and exciting career opportunities) at: https://jane-st.co/computerphile This video was filmed and edited by Sean Riley. Computerphile is a sister project to Brady Haran's Numberphile. More at https://www.bradyharanblog.com

Impact 4.0 · Import 5.5 · Pop 3.5
#180

[AINews] The Two Sides of OpenClaw

Agents & Tools ★ 4.4
Latent Space 2026-04-18

In an opportune coinciding of big three-letter conferences, the TED talk and the AIE talks of Peter Steinberger dropped today. To the general public, the inspiring story of OpenClaw was delightfully told onstage, which recaps all the highs. To the engineering audience, it was more sober, talking about the unprecedented levels of security incidents (60x more reports than curl, at least 20% of skill contributions malicious) and scaling issues involved in maintaining the fastest growing open source project in history. An AMA moderated by me is included at the end. Contrast them, thoughts welcome. AI News for 4/16/2026-4/17/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies! AI Twitter Recap: Anthropic's Claude Opus 4.7 and Claude Design rollout. Claude Design launched as Anthropic's first design/prototyping surface: @claudeai announced Claude Design, a research-preview tool for generating prototypes, slides, and one-pagers from natural-language instructions, powered by Claude Opus 4.7. The launch immediately framed Anthropic as moving beyond chat/coding into design tooling; multiple observers called it a direct shot at Figma/Lovable/Bolt/v0, including @Yuchenj_UW, @kimmonismus, and @skirano. The market reaction itself became part of the story, with @Yuchenj_UW and others noting Figma's sharp drawdown after the announcement. Product details surfaced via @TheRundownAI: inline refinement, sliders, exports to Canva/PPTX/PDF/HTML, and handoff to Claude Code for implementation. Opus 4.7 looks stronger overall, but the rollout was noisy: third-party benchmark posts were broadly favorab

Impact 4.0 · Import 4.0 · Pop 4.8
#181
Simon Willison's Weblog 2026-04-18

Anthropic are the only major AI lab to publish the system prompts for their user-facing chat systems. Their system prompt archive now dates all the way back to Claude 3 in July 2024 and it's always interesting to see how the system prompt evolves as they publish new models. Opus 4.7 shipped the other day (April 16, 2026) with a Claude.ai system prompt update since Opus 4.6 (February 5, 2026). I had Claude Code take the Markdown version of their system prompts, break that up into separate documents for each of the models and then construct a Git history of those files over time with fake commit dates representing the publication dates of each updated prompt - here's the prompt I used with Claude Code for the web. Here is the git diff between Opus 4.6 and 4.7. These are my own highlights extracted from that diff - in all cases text in bold is my emphasis: The "developer platform" is now called the "Claude Platform". The list of Claude tools mentioned in the system prompt now includes "Claude in Chrome - a browsing agent that can interact with websites autonomously, Claude in Excel - a spreadsheet agent, and Claude in Powerpoint - a slides agent. Claude Cowork can use all of these as tools." - Claude in Powerpoint was not mentioned in the 4.6 prompt. The child safety section has been greatly expanded, and is now wrapped in a new <critical_child_safety_instructions> tag. Of particular note: "Once Claude refuses a request for reasons of child safety, all subsequent requests in the same conversation must be approached with extreme caution." It looks like they're trying to make Claude less pushy: "If a user indicates they are ready to end the conversation, Claude does not request that the user stay in the interaction or try to elicit another turn and instead respects t

Impact 4.0 · Import 4.0 · Pop 4.8
#182
Simon Willison's Weblog 2026-04-18

Research: Claude system prompts as a git timeline. Anthropic publish the system prompts for Claude chat and make that page available as Markdown. I had Claude Code turn that page into separate files for each model and model family with fake git commit dates to enable browsing the changes via the GitHub commit view. I used this to write my own detailed notes on the changes between Opus 4.6 and 4.7. Tags: system-prompts, anthropic, claude, generative-ai, ai, llms
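The fake-commit-dates trick rests on git's standard date-override environment variables, GIT_AUTHOR_DATE and GIT_COMMITTER_DATE. A sketch that builds the invocation (rather than running git, so the repository path and message are whatever you supply; pass the env overrides merged into os.environ to subprocess.run inside the target repo):

```python
def backdated_commit_cmd(message, date_iso):
    """Return (argv, env_overrides) for a git commit stamped with a past date.

    GIT_AUTHOR_DATE and GIT_COMMITTER_DATE are git's documented variables
    for overriding commit timestamps; running `argv` with these set makes
    the commit appear at `date_iso` in history views like GitHub's.
    """
    argv = ["git", "commit", "-m", message]
    env = {
        "GIT_AUTHOR_DATE": date_iso,
        "GIT_COMMITTER_DATE": date_iso,
    }
    return argv, env
```

One such commit per published prompt, dated to its publication date, yields a browsable timeline of prompt changes.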

Impact 4.0 · Import 4.0 · Pop 4.8
#183
Simon Willison's Weblog 2026-04-18

Agentic Engineering Patterns > Here's an example of a deceptively short prompt that got quite a lot of work done in a single shot. First, some background. I send out a free Substack newsletter around once a week containing content copied-and-pasted from my blog. I'm effectively using Substack as a lightweight way to allow people to subscribe to my blog via email. I generate the newsletter with my blog-to-newsletter tool - an HTML and JavaScript app that fetches my latest content from this Datasette instance and formats it as rich text HTML, which I can then copy to my clipboard and paste into the Substack editor. Here's a detailed explanation of how that works. I recently added a new type of content to my blog to capture content that I post elsewhere, which I called "beats". These include things like releases of my open source projects, new tools that I've built, museums that I've visited (from niche-museums.com) and other external content. I wanted to include these in the generated newsletter. Here's the prompt I ran against the simonw/tools repository that hosts my blog-to-newsletter tool, using Claude Code on the web:

Clone simonw/simonwillisonblog from github to /tmp for reference
Update blog-to-newsletter.html to include beats that have descriptions - similar to how the Atom everything feed on the blog works
Run it with python -m http.server and use `uvx rodney --help` to test it - compare what shows up in the newsletter with what's on the homepage of https://simonwillison.net

This got me the exact solution I needed. Let's break down the prompt. "Clone simonw/simonwillisonblog from github to /tmp for reference" - I use this pattern a lot. Coding agents can clone code from GitHub, and the best way to explain a problem is often to have them look at relevant

Impact 4.0 · Import 4.0 · Pop 4.8
#185

Headless everything for personal AI

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-19

Headless everything for personal AI. Matt Webb thinks headless services are about to become much more common. Why? Because using personal AIs is a better experience for users than using services directly (honestly); and headless services are quicker and more dependable for the personal AIs than having them click round a GUI with a bot-controlled mouse. Evidently Marc Benioff thinks so too: "Welcome Salesforce Headless 360: No Browser Required! Our API is the UI. Entire Salesforce & Agentforce & Slack platforms are now exposed as APIs, MCP, & CLI. All AI agents can access data, workflows, and tasks directly in Slack, Voice, or anywhere else with Salesforce Headless." If this model does take off it's going to play havoc with existing per-head SaaS pricing schemes. I'm reminded of the early 2010s era when every online service was launching APIs. Brandur Leach reminisces about that time in The Second Wave of the API-first Economy, and predicts that APIs are ready to make a comeback: "Suddenly, an API is no longer liability, but a major saleable vector to give users what they want: a way into the services they use and pay for so that an agent can carry out work on their behalf. Especially given a field of relatively undifferentiated products, in the near future the availability of an API might just be the crucial deciding factor that leads to one choice winning the field." Tags: apis, definitions, matt-webb, salesforce, saas, ai, brandur-leach

Impact 4.0 · Import 4.0 · Pop 4.8
#186
NVIDIA AI Blog 2026-04-20

Manufacturing is at an inflection point. Across every major industrial economy, the pressure to do more with less - due to faster design cycles, leaner operations and strain on skilled labor pools - is accelerating the shift to AI-driven production. The question is no longer whether to adopt AI, but how fast and at what scale. At Hannover Messe 2026, running April 20-24 in Hannover, Germany, NVIDIA and its partners are demonstrating AI-driven manufacturing in action. Attendees will experience how advancements in accelerated computing, AI physics, agents and robotics are powering industrial innovation - from agentic design and engineering to real-time simulation, vision AI agents and humanoid robots operating in factories. The factory of the future isn't just a concept. It's being built now. AI Infrastructure: Powering Europe's Next Industrial Era. Running AI at scale across the factories and supply chains that manufacturing output relies on requires the right underlying infrastructure. As AI becomes foundational to how products, processes and facilities are designed, built and optimized, manufacturers need a unified, sovereign foundation that's secure, scalable and built for industrial scale. The Industrial AI Cloud, one of Europe's largest AI factories built in Germany by Deutsche Telekom on NVIDIA AI infrastructure, is a blueprint for the future. It provides a secure, sovereign foundation for accelerating AI and robotics across Europe's industries. At the show, industry leaders, including Agile Robots, SAP, Siemens, PhysicsX and Wandelbots, will share how they are using this sovereign AI platform to run AI-accelerated workloads ranging from AI physics-driven, real-time simulation to factory-scale digital twins and software-defined robotics. EDAG, a leading indep

Impact 4.0 · Import 4.0 · Pop 4.8
#187
Simon Willison's Weblog 2026-04-20

TIL: SQL functions in Google Sheets to fetch data from Datasette. I put together some notes on patterns for fetching data from a Datasette instance directly into Google Sheets - using the importdata() function, a "named function" that wraps it, or a Google Apps Script if you need to send an API token in an HTTP header (not supported by importdata()). Here's an example sheet demonstrating all three methods. Tags: spreadsheets, datasette, google

Impact 4.0 · Import 4.0 · Pop 4.8
#188
Simon Willison's Weblog 2026-04-20

Claude Token Counter, now with model comparisons. I upgraded my Claude Token Counter tool to add the ability to run the same count against different models in order to compare them. As far as I can tell Claude Opus 4.7 is the first model to change the tokenizer, so it's only worth running comparisons between 4.7 and 4.6. The Claude token counting API accepts any Claude model ID though, so I've included options for all four of the notable current models (Opus 4.7 and 4.6, Sonnet 4.6, and Haiku 4.5). In the Opus 4.7 announcement Anthropic said: "Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type." I pasted the Opus 4.7 system prompt into the token counting tool and found that the Opus 4.7 tokenizer used 1.46× the number of tokens of Opus 4.6. Opus 4.7 uses the same pricing as Opus 4.6 - $5 per million input tokens and $25 per million output tokens - but this token inflation means we can expect it to be around 40% more expensive. The token counter tool also accepts images. Opus 4.7 has improved image support, described like this: "Opus 4.7 has better vision for high-resolution images: it can accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times as many as prior Claude models." I tried counting tokens for a 3456x2234 pixel 3.7MB PNG and got an even bigger increase in token counts - 3.01× the number of tokens for 4.7 compared to 4.6. Update: That 3x increase for images is entirely due to Opus 4.7 being able to handle higher resolutions. I tried that again with a 682x318 pixel image and it took 314 tokens with Opus 4.7 and 310 with Opus 4.6, so effectively the same cost. Update 2: I tried a 15MB, 30 pag

Impact 4.0 · Import 4.0 · Pop 4.8
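The cost implication in the post above is simple arithmetic: with per-token prices unchanged, input cost scales directly with the tokenizer's inflation factor. A sketch using the post's published Opus pricing and its measured 1.46× system-prompt example (the factor varies by content, so this is one sample, not a general rate):

```python
def input_cost_usd(tokens, price_per_million=5.0):
    """Input cost at Opus pricing: $5 per million input tokens."""
    return tokens / 1_000_000 * price_per_million

def inflated_cost(tokens_under_old_tokenizer, inflation=1.46):
    """Cost of the same text under the new tokenizer, given a measured
    inflation factor relative to the old one."""
    return input_cost_usd(tokens_under_old_tokenizer * inflation)
```

A text that cost $5 of input under the 4.6 tokenizer would cost about $7.30 under 4.7 at that measured factor, i.e. 46% more for that particular sample.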
#189
arXiv SSM 2026-04-13
by Wanli Ma, Sivasakthy Selvakumaran, Dain G. Farrimond, Adam A. Dennis et al.

Accurate and rapid structural damage assessment (SDA) is crucial for post-disaster management, helping responders prioritise resources, plan rescues, and support recovery. Traditional field inspections, though precise, are limited by accessibility, safety risks, and time constraints, especially after large explosions. Machine learning with remote sensing has emerged as a scalable solution for rapid SDA, with Mamba-based networks achieving state-of-the-art performance. However, these methods often require extensive training and large datasets, limiting real-world applicability. Moreover, they fail to incorporate key physical characteristics of blast loading for SDA. To overcome these challenges, we propose a Mamba-based multimodal network for rapid SDA that integrates multi-scale blast-loading information with optical remote sensing images. Evaluated on the 2020 Beirut explosion, our method significantly improves performance over state-of-the-art approaches. Code is available at: https://github.com/IMPACTSquad/Blast-Mamba

Impact 4.7 · Import 4.0 · Pop 3.5
#190
arXiv SSM 2026-04-14
by NVIDIA: Aakshita Chandiramani, Aaron Blakeman et al.

We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.

Impact 4.7 · Import 4.0 · Pop 3.5
#191
arXiv SSM 2026-04-14
by Hsin-Tien Chiang, John H. L. Hansen

Speech enhancement (SE) is critical for improving speech intelligibility and quality in real-world environments, particularly for cochlear implant (CI) users who experience severe degradations in speech understanding under noisy and reverberant conditions. In this study, we propose TokenSE, a discrete token-based SE framework operating in the neural audio codec space, which predicts clean codec token indices from degraded speech using a Mamba-based model. Unlike the earlier Transformer architecture, whose self-attention mechanism has a computational complexity that grows quadratically with sequence length, the input-dependent selection mechanism of Mamba achieves linear complexity, making it a compelling alternative to Transformers, especially for CI and hearing-aid (HA) applications. Objective evaluations show that TokenSE consistently outperforms baseline methods on both in-domain and out-of-domain datasets. Moreover, subjective listening experiments with CI users indicate clear benefit in speech intelligibility under adverse noisy and reverberant environments.

Impact 4.7 · Import 4.0 · Pop 3.5
#192
arXiv SSM 2026-04-14
by Mohammed Ezzaldin Babiker Abdullah

The stable operation of off-grid photovoltaic systems requires accurate, computationally efficient solar forecasting. Contemporary deep learning models often suffer from massive computational overhead and physical blindness, generating impossible predictions. This paper introduces the Physics-Informed State Space Model (PISSM) to bridge the gap between efficiency and physical accuracy for edge-deployed microcontrollers. PISSM utilizes a dynamic Hankel matrix embedding to filter stochastic sensor noise by transforming raw meteorological sequences into a robust state space. A Linear State Space Model replaces heavy attention mechanisms, efficiently modeling temporal dependencies for parallel processing. Crucially, a novel Physics-Informed Gating mechanism leverages the Solar Zenith Angle and Clearness Index to structurally bound outputs, ensuring predictions strictly obey diurnal cycles and preventing nocturnal errors. Validated on a multi-year dataset for Omdurman, Sudan, PISSM achieves superior accuracy with fewer than 40,000 parameters, establishing an ultra-lightweight benchmark for real-time off-grid control.
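The Hankel embedding PISSM builds stacks lagged windows of the input series into a matrix whose repeated anti-diagonal structure lets low-rank methods separate signal from stochastic sensor noise. A minimal construction sketch (the window length is a free parameter here, not the paper's value):

```python
def hankel_embed(series, window):
    """Build the Hankel (trajectory) matrix of a 1-D series: row i is
    series[i : i + window], so each anti-diagonal holds one repeated
    sample of the original sequence."""
    n = len(series) - window + 1
    if n <= 0:
        raise ValueError("window longer than series")
    return [series[i : i + window] for i in range(n)]
```

Filtering then typically truncates the matrix's dominant singular components before averaging the anti-diagonals back into a denoised series; the paper's exact filtering step is not described in the abstract.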

Impact 4.7 · Import 4.0 · Pop 3.5
#193
arXiv SSM 2026-04-15
by Michiel Cevaal, Thomas de Jong, Mircea Lazar

In this paper, we consider the design of Model Predictive Control (MPC) algorithms based on Mamba neural networks. Mamba is a neural network architecture capable of sub-quadratic computational scaling in sequence length with state-of-the-art modeling capabilities. A consistent and complete mathematical description of the Mamba neural network is provided. Then, adjustments and optimizations are made to construct a decoder-only Mamba multi-step predictor for MPC, and an input-output formulation is given for sequence-to-sequence modeling of dynamical systems. The performance of Mamba-MPC is evaluated on several numerical examples and compared to a Long Short-Term Memory-based MPC (LSTM-MPC) equivalent. First, a Single-Input-Single-Output (SISO) Van der Pol oscillator is considered, where stability, reference tracking, and noise robustness are evaluated. Then, a Four Tank setup is introduced where Multiple-Input-Multiple-Output (MIMO) reference tracking is evaluated. Lastly, Mamba-MPC is implemented on a physical Quanser Aero2 setup for closed-loop reference tracking. The results demonstrate that Mamba-MPC is able to stabilize and track a reference for SISO and MIMO syst

Impact 4.7 · Import 4.0 · Pop 3.5
#194
arXiv SSM 2026-04-15
by Jason Kong, Nilesh Prasad Pandey, Flavio Ponzina, Tajana Rosing

Deploying Large Language Models (LLMs) on edge devices faces severe computational and memory constraints, limiting real-time processing and on-device intelligence. Hybrid architectures combining Structured State Space Models (SSMs) with transformer-based LLMs offer a balance of efficiency and performance. Aggressive quantization can drastically cut model size and speed up inference, but its uneven effects on different components require careful management. In this work, we propose a lightweight, backpropagation-free, surrogate-based sensitivity analysis framework to identify hybrid SSM-Transformer components most susceptible to quantization-induced degradation. Relying solely on forward-pass metrics, our method avoids expensive gradient computations and retraining, making it suitable for situations where access to in-domain data is limited due to proprietary restrictions or privacy constraints. We also provide a formal analysis showing that the Kullback-Leibler (KL) divergence metric better captures quantization sensitivity for Language modeling tasks than widely adopted alternatives such as mean squared error (MSE) and signal-to-quantization-noise ratio (SQNR). Through extensive e

Impact 4.7 · Import 4.0 · Pop 3.5
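The paper above argues that KL divergence between the output distributions of a full-precision model and its quantized counterpart captures quantization sensitivity better than MSE or SQNR. A minimal sketch of such a forward-pass-only score over next-token logits (pure Python; how the paper aggregates scores across components is not given in the abstract):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_sensitivity(fp_logits, quantized_logits):
    """KL(P_fp || P_quant) between next-token distributions produced with
    a component in full precision vs quantized; a larger value means the
    component is more sensitive to quantization."""
    p = softmax(fp_logits)
    q = softmax(quantized_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Because it needs only two forward passes and no gradients, this kind of score fits the backpropagation-free setting the paper targets.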
#195
arXiv SSM 2026-04-15
by Guoan Xu, Yang Xiao, Guangwei Gao, Dongchen Zhu et al.

Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability-aware Self-Gated State Space Model (RSGMamba). At the core of our method is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability-aware feature selection and enhances informative feature aggregation. In addition, a lightweight Local Cross-Gated Modulation (LCGM) is incorporated to refine fine-grained spatial details, complementing the global modeling capability of RSGMB. Extens

Impact 4.7 · Import 4.0 · Pop 3.5
#196
arXiv Robotics-Embodied 2026-04-15
by You Rim Choi, Subeom Park, Hyung-Sin Kim

As AI moves from data centers to robots and wearables, scaling ever-larger models becomes insufficient. Physical AI operates under tight latency, energy, privacy, and reliability constraints, and its performance depends not only on model capacity but also on how signals are acquired through controllable sensors in dynamic environments. We present Artificial Tripartite Intelligence (ATI), a bio-inspired, sensor-first architectural contract for physical AI. ATI is tripartite at the systems level: a Brainstem (L1) provides reflexive safety and signal-integrity control, a Cerebellum (L2) performs continuous sensor calibration, and a Cerebral Inference Subsystem spanning L3/L4 supports routine skill selection and execution, coordination, and deep reasoning. This modular organization allows sensor control, adaptive sensing, edge-cloud execution, and foundation model reasoning to co-evolve within one closed-loop architecture, while keeping time-critical sensing and control on device and invoking higher-level inference only when needed. We instantiate ATI in a mobile camera prototype under dynamic lighting and motion. In our routed evaluation (L3-L4 split inference), compared to the defaul

Impact 4.8 · Import 4.0 · Pop 3.5
#197
C4ISRNET 2026-04-15

AeroVironment is debuting a new drone with the capacity to carry out reconnaissance, electronic warfare and strike missions, building on a lethal loitering system that is already being fielded by the Army, according to a Wednesday announcement. The defense technology firm introduced the system, known as MAYHEM 10, which expands upon its Switchblade family. The Army in February announced a $186 million purchase that includes two variants of Switchblade one-way attack, or “kamikaze,” drones: the Switchblade 600 Block 2 variant and the Switchblade 300 Block 20 variant. The difference is that MAYHEM 10 is multifunctional, meaning it can perform tasks in addition to striking. The new system can carry a 10-pound payload and has a range of over 62 miles, per the release. The system is capable of 50 minutes of endurance, with a launch assembly that can be done in under five minutes, the statement says. It can also be launched from air, ground or maritime platforms. “By integrating advanced autonomy, multi-domain payloads, and rapid adaptability, we empower our forces to sense, disrupt, and strike with precision — even in the most contested environments,” Wahid Nawabi, AeroVironment’s chairman, president and chief executive officer, said in the statement. Last year, U.S. soldiers tested the Switchblade 600 system, which has a range of 27 miles and is designed to engage a target using onboard cameras. The Switchblade 300 Block 20, unlike the heavier 600 variant, is small enough to be carried in a backpack. For the first time, according to a February AeroVironment announcement, it will come equipped with an Explosively Formed Penetrator, a deadly warhead that is made to penetrate armored vehicles.

Impact 5.2 · Import 4.0 · Pop 3.5
#198
FedScoop 2026-04-15

The Department of Energy’s Office of Cybersecurity, Energy Security and Emergency Response has partnered with Lawrence Livermore National Laboratory to develop an AI testbed capable of identifying model weaknesses, the agency said in a blog post Tuesday. Energy-sector stakeholders, including utilities, grid operators, vendors, national labs and research organizations, can use the platform to better understand model risk and how to integrate AI into critical systems. Users will upload AI models to the platform and perform adversarial tests to assess security posture. “The testbed enables users to observe the effects of attacks and quantify how vulnerable the model is to manipulation and leaked information,” DOE said in the blog post. “This facilitates apples-to-apples comparisons between models, showing users which model options are most robust and by what margin.” Named after the Norse god Thor’s hammer, the Mjölnir AI Testbed will give energy-sector players a look at whether an AI model behaves unsafely or exposes sensitive data at a time when AI models are becoming more integrated into critical workflows. The technology is a high-value target for cyberattacks, underlining the need for resilient models. Anthropic, for example, says that its models have been targeted by Chinese competitors in attempts to steal information about how the technology works. “As AI systems handle increasingly sensitive data and perform critical societal functions, failures in AI security could result in severe consequences, including privacy violations, operational disruptions, economic damages, and threats to public safety,” researchers from the Japan AI Safety Institute said in a July 2025 report. Even when not targeted directly, AI systems are subject

Impact 5.2 · Import 4.0 · Pop 3.5
#199
Waymo Blog 2026-04-15

Waymo opens its robotaxi service to all residents and visitors in Miami and Orlando with no waitlist, and launches a rolling-invitation service in Nashville. This continues the 2026 footprint expansion beyond the existing Bay Area, Phoenix, LA, and Austin operations.

Impact 5.2 · Import 4.0 · Pop 3.5
#200
arXiv SSM 2026-04-16
by Yoo-Min Jung, Leekyung Kim

Despite recent advances in state space models (SSMs) such as Mamba across various sequence domains, research on their standalone capacity for time series classification (TSC) has remained limited. We propose MambaSL, a framework that minimally redesigns the selective SSM and projection layers of a single-layer Mamba, guided by four TSC-specific hypotheses. To address benchmarking limitations -- restricted configurations, partial University of East Anglia (UEA) dataset coverage, and insufficiently reproducible setups -- we re-evaluate 20 strong baselines across all 30 UEA datasets under a unified protocol. As a result, MambaSL achieves state-of-the-art performance with statistically significant average improvements, while ensuring reproducibility via public checkpoints for all evaluated models. Together with visualizations, these results demonstrate the potential of Mamba-based architectures as a TSC backbone.

Impact 4.7 · Import 4.0 · Pop 3.5
#201
arXiv SSM 2026-04-16
by Badri N. Patro, Vijay S. Agneeswaran

Vision State Space Models (SSMs) like Vim, VMamba, and SiMBA rely on complex scanning strategies to adapt sequential SSMs to process 2D images, introducing computational overhead and architectural complexity. We propose HAMSA, a scanning-free SSM operating directly in the spectral domain. HAMSA introduces three key innovations: (1) simplified kernel parameterization: a single Gaussian-initialized complex kernel replacing traditional (A, B, C) matrices, eliminating discretization instabilities; (2) SpectralPulseNet (SPN): an input-dependent frequency gating mechanism enabling adaptive spectral modulation; and (3) Spectral Adaptive Gating Unit (SAGU): magnitude-based gating for stable gradient flow in the frequency domain. By leveraging FFT-based convolution, HAMSA eliminates sequential scanning while achieving O(L log L) complexity with superior simplicity and efficiency. On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state-of-the-art among SSMs), with 2.2× faster inference than transformers (4.2ms vs 9.2ms for DeiT-S) and 1.4–1.9× speedup over scanning-based SSMs, while using less memory (2.1GB vs 3.2–4.5GB) and energy (12.5J vs 18–25J). HAMSA demonstrates strong generalization
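The O(L log L) claim rests on the textbook identity that circular convolution becomes pointwise multiplication in the frequency domain. A minimal stdlib-only sketch of that identity follows (a radix-2 FFT on power-of-two lengths; HAMSA's actual kernel parameterization and gating are not reproduced here):

```python
import cmath

def fft(a, invert=False):
    # Radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return a[:]
    sign = 1 if invert else -1
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def fft_circular_conv(x, h):
    # O(L log L) circular convolution: transform, multiply pointwise
    # in the spectral domain, transform back, normalize by n.
    n = len(x)
    X = fft([complex(v) for v in x])
    H = fft([complex(v) for v in h])
    y = fft([a * b for a, b in zip(X, H)], invert=True)
    return [v.real / n for v in y]

y = fft_circular_conv([1.0, 2.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0])
# → [1.0, 3.0, 2.0, 0.0] up to float error (matches the direct O(L^2) sum)
```

This avoids any sequential scan over the input, which is the structural point behind HAMSA's efficiency numbers.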

Impact 4.7 · Import 4.0 · Pop 3.5
#202
arXiv SSM 2026-04-16
by Nikola Zubić, Qian Li, Yuyi Wang, Davide Scaramuzza

We study the expressive power and limitations of multi-layer state-space models (SSMs). First, we show that multi-layer SSMs face fundamental limitations in compositional tasks, revealing an inherent gap between SSMs and streaming models. Then, we examine the role of chain-of-thought (CoT), showing that offline CoT does not fundamentally increase the expressiveness, while online CoT can substantially increase its power. Indeed, with online CoT, multi-layer SSMs become equivalent in power to streaming algorithms. Finally, we investigate the tradeoff between width and precision, showing that these resources are not interchangeable in the base model, but admit a clean equivalence once online CoT is allowed. Overall, our results offer a unified perspective on how depth, finite precision, and CoT shape the power and limits of SSMs.

Impact 4.7 · Import 4.0 · Pop 3.5
#204

Find, Fix, Reason: Context Repair for Video Reasoning

Research ★ 4.3 multi-source (2)
arXiv cs.CV, arXiv Evals 2026-04-17
by Haojian Huang, Chuanyu Qin, Yinchuan Li, Yingcong Chen

Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model's knowledge boundary, or hybrid replay that mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model's capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps, regions etc.) from the original video while the question remains unchanged. The student answers again with the added context, and training updates with a chosen-rollout scheme integrated into Group Relative Policy Optimization (GRPO). We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity t

Impact 4.0 · Import 4.0 · Pop 4.7
#205
arXiv SSM 2026-04-17
by Sicheng Chen, Chad Wong, Tianyi Zhang, Enhui Chai et al.

Whole Slide Image (WSI) analysis is pivotal in computational pathology, enabling cancer diagnosis by integrating morphological and architectural cues across magnifications. Multiple Instance Learning (MIL) serves as the standard framework for WSI analysis. Recently, Mamba has become a promising backbone for MIL, overtaking Transformers due to its efficiency and global context modeling capabilities originating from Natural Language Processing (NLP). However, existing Mamba-based MIL approaches face three critical challenges: (1) disruption of 2D spatial locality during 1D sequence flattening; (2) sub-optimal modeling of fine-grained local cellular structures; and (3) high memory peaks during inference on resource-constrained edge devices. Studies like MambaOut reveal that Mamba's SSM component is redundant for local feature extraction, where Gated CNNs suffice. Recognizing that WSI analysis demands both fine-grained local feature extraction akin to natural images, and global context modeling akin to NLP, we propose MambaBack, a novel hybrid architecture that harmonizes the strengths of Mamba and MambaOut. First, we propose the Hilbert sampling strategy to preserve the 2D spatial loc

Impact 4.7 · Import 4.0 · Pop 3.5
#206
arXiv SSM 2026-04-17
by Enhui Chai, Sicheng Chen, Tianyi Zhang, Xingyu Li et al.

Pathological diagnosis is highly reliant on image analysis, where Regions of Interest (ROIs) serve as the primary basis for diagnostic evidence, while whole-slide image (WSI)-level tasks primarily capture aggregated patterns. To extract these critical morphological features, ROI-level Foundation Models (FMs) based on Vision Transformers (ViTs) and large-scale self-supervised learning (SSL) have been widely adopted. However, three core limitations remain in their application to ROI analysis: (1) cross-magnification domain shift, as fixed-scale pretraining hinders adaptation to diverse clinical settings; (2) inadequate local-global relationship modeling, wherein the ViT backbone of FMs suffers from high computational overhead and imprecise local characterization; (3) insufficient fine-grained sensitivity, as traditional self-attention mechanisms tend to overlook subtle diagnostic cues. To address these challenges, we propose SSMamba, a hybrid SSL framework that enables effective fine-grained feature learning without relying on large external datasets. This framework incorporates three domain-adaptive components: Mamba Masked Image Modeling (MAMIM) for mitigating domain shift, a Direc

Impact 4.7 · Import 4.0 · Pop 3.5
#207
arXiv SSM 2026-04-17
by Pushpa Kumar Balan, Aijing Feng

Gradient saliency from deep sequence models surfaces candidate biomarkers efficiently, but the resulting gene lists can be contaminated by tissue-composition confounders that degrade downstream classifiers. We study whether LLM chain-of-thought (CoT) reasoning can filter these confounders, and whether reasoning quality is associated with downstream performance. We train a Mamba SSM on TCGA-BRCA RNA-seq and extract the top-50 genes by gradient saliency; DeepSeek-R1 evaluates every candidate with structured CoT to produce a final 17-gene set. On the held-out test split, the raw 50-gene saliency set (no LLM) performs worse than a 5,000-gene variance baseline (AUC 0.832 vs. 0.903), while the LLM-filtered set surpasses it (AUC 0.927), using 294x fewer features. A faithfulness audit (COSMIC CGC, OncoKB, PAM50) shows that 6 of 17 selected genes (35.3%) are validated BRCA biomarkers, while 10 of 16 known BRCA genes present in the input were missed - including FOXA1. This divergence between downstream performance and reasoning faithfulness suggests selective faithfulness in this setting: targeted confounder removal can improve predictive performance without comprehensive recall.
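The first stage of the pipeline above, gradient saliency followed by top-k gene selection, can be sketched with a toy linear scorer: for a score w·x the gradient with respect to the input is the weight vector itself, so input-times-gradient saliency reduces to |w_i · x_i| per gene. Gene labels and values here are made up for illustration, not TCGA-BRCA data.

```python
# Toy stand-in for gradient saliency over gene expression.
# Illustrative weights of a linear "model" and one sample's expression.
weights = {"G1": 0.9, "G2": 0.7, "G3": 0.05, "G4": 0.4}
expression = {"G1": 1.2, "G2": 0.5, "G3": 3.0, "G4": 0.8}

# Input-times-gradient saliency: |w_i * x_i| per gene.
saliency = {g: abs(weights[g] * expression[g]) for g in weights}

# Top-k candidate list (the paper uses top-50 from a Mamba SSM).
top_genes = sorted(saliency, key=saliency.get, reverse=True)[:2]
# The paper's second stage would pass such a list to an LLM judge
# for chain-of-thought confounder filtering.
```

The paper's point is precisely that this raw top-k list can be contaminated by tissue-composition confounders, which the LLM filtering step then removes.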

Impact 4.7 · Import 4.0 · Pop 3.5
#208
arXiv GenMedia 2026-04-16
by Olga Loginova, Frank Keller

Reliable procedural monitoring in video requires exposure to naturally occurring human errors and the recoveries that follow. In egocentric recordings, mistakes are often partially occluded by hands and revealed through subtle object state changes, while existing procedural datasets provide limited and inconsistent mistake and correction traces. We present PIE-V (Psychologically Inspired Error injection for Videos), a framework for constructing and benchmarking mistake-aware egocentric procedural videos by augmenting clean keystep procedures with controlled, human-plausible deviations. PIE-V combines a psychology-informed error planner conditioned on procedure phase and semantic step load, a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence and repairs failures. For video segment edits, PIE-V synthesizes replacement clips with text-guided video generation and stitches them into the episode to preserve visual plausibility. Applied to 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections. For benchmarking, we introduce a unified

Impact 4.0 · Import 4.6 · Pop 3.5
#209
arXiv cs.LG 2026-04-17
by Khang Le, Joaquín Torres-Sospedra, Philipp Müller

Fixed Radius Near Neighbor (FRNN) search is an alternative to the widely used k Nearest Neighbors (kNN) search. Unlike kNN, FRNN determines a label or an estimate for a test sample based on all training samples within a predefined distance. While this approach is beneficial in certain scenarios, assuming a fixed maximum distance for all training samples can decrease the accuracy of the FRNN. Therefore, in this paper we propose the Adaptive Radius Near Neighbor (ARNN) and the Weighted ARNN (WARNN), which employ adaptive distances and, in the latter case, weights. All three methods are compared to kNN and twelve of its variants for a regression problem, namely WiFi fingerprinting indoor positioning, using 22 different datasets to provide a comprehensive analysis. While the performances of the tested FRNN and ARNN versions were amongst the worst, three of the four best methods in the test were WARNN versions, indicating that using weights together with adaptive distances achieves performance comparable to or even better than kNN variants.
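The kNN/FRNN distinction, and the weighted adaptive-radius idea, can be sketched for 1-D regression. The WARNN-style predictor below uses per-sample radii and simple inverse-distance weights as illustrative assumptions; the paper's exact adaptive rule is not reproduced here.

```python
def knn_predict(train, query, k=3):
    # train: list of (feature, target) pairs; 1-D features for clarity.
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return sum(t for _, t in nearest) / k

def frnn_predict(train, query, radius):
    # Fixed Radius Near Neighbor: average every target within `radius`.
    inside = [t for x, t in train if abs(x - query) <= radius]
    return sum(inside) / len(inside) if inside else None

def warnn_predict(train, query, radii, weights):
    # WARNN-style sketch: per-sample radii plus distance-based weights
    # (illustrative rule, not the paper's).
    pairs = [(w / (1 + abs(x - query)), t)
             for (x, t), r, w in zip(train, radii, weights)
             if abs(x - query) <= r]
    total = sum(w for w, _ in pairs)
    return sum(w * t for w, t in pairs) / total if pairs else None

train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
print(knn_predict(train, 1.2, k=2))       # averages the 2 nearest targets
print(frnn_predict(train, 1.2, 1.0))      # averages everything within radius 1
```

Note that FRNN can return nothing at all when no training sample falls inside the fixed radius, which is one reason adaptive radii help.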

Impact 4.0 · Import 4.6 · Pop 3.5
#210
arXiv cs.LG 2026-04-17
by Erchi Wang, Pengrun Huang, Eli Chien, Om Thakkar et al.

Differential privacy (DP) has a wide range of applications for protecting data privacy, but designing and verifying DP algorithms requires expert-level reasoning, creating a high barrier for non-expert practitioners. Prior works either rely on specialized verification languages that demand substantial domain expertise or remain semi-automated and require human-in-the-loop guidance. In this work, we investigate whether large language models (LLMs) can automate DP reasoning. We introduce DPrivBench, a benchmark in which each instance asks whether a function or algorithm satisfies a stated DP guarantee under specified assumptions. The benchmark is carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning through trivial pattern matching. Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities. Through further analytic study and failure-mode analysis, we identify several promising directions for improving automated DP reasoning. Our benchmark provides a solid foundation for developing and eval
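For orientation, the kind of "textbook mechanism" the benchmark reports models handling well is the Laplace mechanism: a counting query has sensitivity 1, so adding Laplace(1/ε) noise makes the release ε-differentially private. A stdlib-only sketch (parameter names are illustrative):

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of Laplace(0, scale) using only the stdlib.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_count(true_count, epsilon, rng):
    # A counting query changes by at most 1 when one record is added
    # or removed (sensitivity 1), so Laplace(1/epsilon) noise suffices
    # for epsilon-DP.
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
noisy = laplace_count(100, epsilon=0.5, rng=rng)
```

A DPrivBench-style instance would then ask whether a variant of this function, say one that reuses noise across queries or mis-scales by the sensitivity, still satisfies the stated guarantee.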

Impact 4.0 · Import 4.6 · Pop 3.5
#211
arXiv cs.CL 2026-04-17
by Supriti Sinhamahapatra, Thai-Binh Nguyen, Yiğit Oğuz, Enes Ugan et al.

The goal of multilingual speech technology is to facilitate seamless communication between individuals speaking different languages, creating the experience as though everyone were a multilingual speaker. To create this experience, speech technology needs to address several challenges: handling mixed multilingual input, specific vocabulary, and code-switching. However, there is currently no dataset benchmarking this situation. We propose a new benchmark to evaluate whether current Automatic Speech Recognition (ASR) systems are able to handle these challenges. The benchmark consists of bilingual discussions on scientific papers between multiple speakers, each conversing in a different language. We provide a standard evaluation framework beyond Word Error Rate (WER), enabling consistent comparison of ASR performance across languages. Experimental results demonstrate that the proposed dataset is still an open challenge for state-of-the-art ASR systems. The dataset is available at https://huggingface.co/datasets/goodpiku/muscat-eval. Keywords: multilingual, speech recognition, audio segmentation, speaker diarization
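WER itself, the baseline metric the framework extends, is just word-level edit distance normalized by reference length; a minimal sketch:

```python
def wer(reference, hypothesis):
    # Word Error Rate: word-level Levenshtein distance divided by the
    # number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution

    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the bat sat"))  # → 0.3333... (1 substitution / 3 words)
```

Going "beyond WER" here means layering language-aware scoring, audio segmentation, and speaker diarization metrics on top of this base measure.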

Impact 4.0 · Import 4.6 · Pop 3.5
#212
arXiv cs.CL 2026-04-17
by Masahiro Suzuki, Hiroki Sakaji

We introduce JFinTEB, the first comprehensive benchmark specifically designed for evaluating Japanese financial text embeddings. Existing embedding benchmarks provide limited coverage of language-specific and domain-specific aspects found in Japanese financial texts. Our benchmark encompasses diverse task categories including retrieval and classification tasks that reflect realistic and well-defined financial text processing scenarios. The retrieval tasks leverage instruction-following datasets and financial text generation queries, while classification tasks cover sentiment analysis, document categorization, and domain-specific classification challenges derived from economic survey data. We conduct extensive evaluations across a wide range of embedding models, including Japanese-specific models of various sizes, multilingual models, and commercial embedding services. We publicly release JFinTEB datasets and evaluation framework at https://github.com/retarfi/JFinTEB to facilitate future research and provide a standardized evaluation protocol for the Japanese financial text mining community. This work addresses a critical gap in Japanese financial text processing resources and estab

Impact 4.0 · Import 4.6 · Pop 3.5
#213
arXiv cs.CL 2026-04-17
by Pritesh Jha

We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial domain annotated text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution. To establish baseline difficulty, we evaluate eight published systems spanning rule-based engines (Microsoft Presidio), general purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII-specific model (Piiranha DeBERTa), and a financial NER specialist (Xtreme

Impact 4.0 · Import 4.6 · Pop 3.5
#214
arXiv cs.AI 2026-04-17
by Farhad Abtahi, Abdolamir Karbalaie, Eduardo Illueca-Fernandez, Fernando Seoane

Metacognition, the ability to monitor and regulate one's own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. The benchmark evaluates 35 models from 12 families on 130 ambiguous instances across five domains and reports two complementary scores: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. In a follow-up progressive adversarial analysis of 11 models, we observed two behavioural profiles, i.e., models that revise primarily in response to argument quality and models that track consensus statistics. Under within-model relative profiling (ipsative scoring), evaluation was the weakest relative ability in all 35 models, indicating a systematic knowing/doing gap. Smaller and ch

Impact 4.0 · Import 4.6 · Pop 3.5
#215
arXiv cs.CV 2026-04-17
by Deepak Kumar, Abhishek Pratap Singh, Puneet Kumar, Xiaobai Li et al.

Understanding affective dynamics in real-world social systems is fundamental to modeling and analyzing human-human interactions in complex environments. Group affect emerges from intertwined human-human interactions, contextual influences, and behavioral cues, making its quantitative modeling a challenging computational social systems problem. However, computational modeling of group affect in in-the-wild scenarios remains challenging due to limited large-scale annotated datasets and the inherent complexity of multimodal social interactions shaped by contextual and behavioral variability. The lack of comprehensive datasets annotated with multimodal and contextual information further limits advances in the field. To address this, we introduce the Group Affect from ViDeos (GAViD) dataset, comprising 5091 video clips with multimodal data (video, audio and context), annotated with ternary valence and discrete emotion labels and enriched with VideoGPT-generated contextual metadata and human-annotated action cues. We also present Context-Aware Group Affect Recognition Network (CAGNet) for multimodal context-aware group affect recognition. CAGNet achieves 63.20% test accuracy on GAViD, c

Impact 4.8 · Import 4.0 · Pop 3.5
#216
arXiv cs.CV 2026-04-17
by Yuhai Deng, Huimin She, Wei Shen, Meng Li et al.

Tone style transfer for photo retouching aims to adapt the stylistic tone of the reference image to a given content image. However, the lack of high-quality large-scale triplet datasets with stylized ground truth forces existing methods to rely on self-supervised or proxy objectives, which limits model capability. To mitigate this gap, we design a data construction pipeline to build TST100K, a large-scale dataset of 100,000 content-reference-stylized triplets. At the core of this pipeline, we train a tone style scorer to ensure strict stylistic consistency for each triplet. In addition, existing methods typically extract content and reference features independently and then fuse them in a decoder, which may cause semantic loss and lead to inappropriate color transfer and degraded visual aesthetics. Instead, we propose ICTone, a diffusion-based framework that performs tone transfer in an in-context manner by jointly conditioning on both images, leveraging the semantic priors of generative models for semantic-aware transfer. Reward feedback learning using the tone style scorer is further incorporated to improve stylistic fidelity and visual quality. Experiments demonstrate the effect

Impact 4.8 · Import 4.0 · Pop 3.5
#217
arXiv AIScience 2026-04-16
by Tingjia Miao, Wenkai Jin, Muhua Zhang, Jinxin Tan et al.

The paradigm of agentic science requires AI systems to conduct robust reasoning and engage in long-horizon, autonomous exploration. However, current scientific benchmarks remain confined to domain knowledge comprehension and complex reasoning, failing to evaluate the exploratory nature and procedural complexity of real-world research. In this work, we present research-oriented evaluations in theoretical and computational physics, a natural testbed with comprehensive domain knowledge, complex reasoning, and verifiable end-to-end workflows without reliance on experiments. Here we introduce PRL-Bench (Physics Research by LLMs), a benchmark designed to systematically map the capability boundaries of LLMs in executing end-to-end physics research. Constructed from 100 curated papers from the latest issues of Physical Review Letters since August 2025 and validated by domain experts, PRL-Bench covers five major theory- and computation-intensive subfields of modern physics: astrophysics, condensed matter physics, high-energy physics, quantum information, and statistical physics. Each task in the benchmark is designed to replicate the core properties of authentic scientific research, includi

Impact 4.0 · Import 4.6 · Pop 3.5
#218
arXiv cs.NE 2026-04-17
by Saloni Garg, Ukant Jadia, Amit Sagtani, Kamal Kant Hiran

Automated classification of electrocardiogram (ECG) signals is a useful tool for diagnosing and monitoring cardiovascular diseases. This study compares three traditional machine learning algorithms (Decision Tree Classifier, Random Forest Classifier, and Logistic Regression) and three deep learning models (Simple Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Complex CNN (ECGLens)) for the classification of ECG signals from the PTB-XL dataset, which contains 12-lead recordings from normal patients and patients with various cardiac conditions. The deep learning models were trained on raw ECG signals, allowing them to automatically extract discriminative features. Data augmentation using the Stationary Wavelet Transform (SWT) was applied to enhance model performance, increase the diversity of training samples, and preserve the essential characteristics of the ECG signals. The models were evaluated using multiple metrics, including accuracy, precision, recall, F1-score, and ROC-AUC. The ECGLens model achieved the highest performance, with 80% classification accuracy and a 90% ROC-AUC. These findings demonstrate that deep learning architectures, particularly complex CNNs

Impact 4.0 · Import 4.6 · Pop 3.5
#219
arXiv cs.CV 2026-04-17
by Toby Perrett, Matthew Bouchard, William McCarthy

We introduce neuralCAD-Edit, the first benchmark for editing 3D CAD models collected from expert CAD engineers. Instead of text conditioning as in prior works, we collect realistic CAD editing requests by capturing videos of professional designers interacting directly with CAD models in CAD software while talking, pointing and drawing. We recruited ten consenting designers to contribute to this contained study. We benchmark leading foundation models against human CAD experts carrying out edits, and find a large performance gap in both automatic metrics and human evaluations. Even the best foundation model (GPT 5.2) scores 53% lower (absolute) than CAD experts in human acceptance trials, demonstrating the challenge of neuralCAD-Edit. We hope neuralCAD-Edit will provide a solid foundation against which 3D CAD editing approaches and foundation models can be developed. Code/data: https://autodeskailab.github.io/neuralCAD-Edit

Impact 4.0 · Import 4.6 · Pop 3.5
#220
arXiv Evals 2026-04-17
by Marie Maltais, Yejin Jeon, Min Ma, Shamsuddeen Hassan Muhammad et al.

Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translation dataset spanning Igbo, Hausa, Yorùbá, and Nigerian Pidgin paired with English. The dataset comprises approximately 50 hours of speech per language and captures substantial variation in speakers and accents, reflecting realistic multilingual and multi-accent conditions. With NaijaS2ST, we conduct a comprehensive benchmark of cascaded, end-to-end (E2E), and AudioLLM-based approaches across bidirectional translation settings. Our results show that audio LLMs with few-shot examples are more effective for speech-to-text translation than cascaded and end-to-end methods trained on fine-tuned data. However, for speech-to-speech translation, the cascaded and audio LLM paradigms yield comparable performance, indicating that there is still considerable room for improvement in developing targeted, task-specific models for this setting. By providing both a high-quality dataset and a systematic benchmark, we hope tha

Impact 4.0 · Import 4.6 · Pop 3.5
#221
arXiv Evals 2026-04-17
by J. Escorza, G. Pellicer, T. de Ara, J. Hurtado-Gallego et al.

Accurate electrical amplification is essential in molecular electronics for measuring conductance through atomic and molecular junctions, where currents often span several orders of magnitude. In this work, we present a systematic design and comparative analysis of four current-to-voltage ($I\text{--}V$) amplifier architectures: single-stage linear, series-linear, logarithmic, and multi-stage cascaded, specifically optimized for break junction (BJ) techniques, including scanning tunneling microscopy (STM-BJ) and mechanically controllable break junctions (MCBJ). Each configuration is evaluated based on sensitivity, noise performance, and dynamic range. Our results characterize the trade-offs between circuit complexity and noise, providing a robust framework and practical guidelines for selecting amplification schemes in quantum transport experiments.
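For orientation, the ideal relations behind two of the four architectures are standard op-amp results, not taken from the paper: a single-stage linear transimpedance amplifier with feedback resistor $R_f$, and a logarithmic stage whose diode/transistor feedback compresses a wide current range.

```latex
V_{\text{out}} = -I_{\text{in}} R_f
\qquad \text{(single-stage linear, feedback resistor } R_f\text{)}

V_{\text{out}} \approx -n V_T \ln\!\left(\frac{I_{\text{in}}}{I_s}\right)
\qquad \text{(logarithmic, diode feedback; } V_T \text{ thermal voltage, } I_s \text{ saturation current)}
```

The linear stage trades dynamic range for low noise and simple calibration, which is the trade-off space the paper's comparison maps for break-junction currents spanning many orders of magnitude.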

Impact 4.0 · Import 4.6 · Pop 3.5
#222
arXiv Evals 2026-04-17
by Linnuo Zhang, Chihao Li, Jiajin Ge, Tatiana Azaryan et al.

We present results from two test beam campaigns that investigate the performance of straw tube detectors as potential candidates for an FCC-ee straw tracker. These studies were carried out at CERN using 150 GeV muon beams. Dedicated algorithms were developed to determine both single tube spatial resolution for the primary coordinate in the $r-φ$ plane and spatial resolution for the secondary coordinate along the tube direction within a straw chamber. Detection efficiency was also evaluated as a function of the extrapolated hit position for each tube. Both datasets showed consistent results for spatial resolutions and efficiency. Our findings will help establish benchmark performance metrics and provide valuable insight for future design, optimization, and construction of straw chambers for high-precision tracking applications.

Impact 4.0 · Import 4.6 · Pop 3.5
#224
arXiv Recurrent/LinAttn 2026-04-13
by Sahil Goyal, Swayam Agrawal, Gautham Govind Anil, Prateek Jain et al.

We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose the idea of Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model's depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference capability with dynamic trade-offs between computational cost and generation quality, with the same parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With $4\times$ reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of $2.0$ on class-conditional ImageNet $256 \times 256$ and FVD of $72.8$ on class-c
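The weight-sharing and intra-loop distillation ideas can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's architecture: the block body (a single residual tanh layer), the sizes, and the loop counts are assumptions, and a real implementation would use full transformer blocks and backpropagate the consistency loss.

```python
import numpy as np

def looped_block(x, W, loops):
    # One weight-shared block applied `loops` times with a residual
    # connection: parameter count is independent of effective depth.
    for _ in range(loops):
        x = x + np.tanh(x @ W)  # stand-in for a full transformer block
    return x

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))  # the single shared weight matrix
x = rng.normal(size=(2, 8))

student = looped_block(x, W, loops=2)   # intermediate-loop ("student") config
teacher = looped_block(x, W, loops=6)   # maximum-loop ("teacher") config
# ILSD-style consistency term: penalize divergence between exit points.
ilsd_loss = float(np.mean((student - teacher) ** 2))
```

Because every loop count reuses the same weights, any intermediate exit is a valid model at inference time, which is what makes the compute-quality trade-off elastic.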

Impact 4.0 · Import 4.0 · Pop 3.5
#225
arXiv Robotics-Embodied 2026-04-13
by Simon Brezovnik, Janez Žerovnik

Roman-type domination parameters form an important class of graph invariants that model protection and resource allocation problems on networks. Among them, $[k]$-Roman domination provides a unified framework that generalizes Roman, double Roman, and higher-order variants. In this paper we investigate the $[k]$-Roman domination number of cylindrical grids $C_m\Box P_n$ and derive several new constructive upper bounds. Our approach combines three complementary techniques: linear periodic constructions, uniform ceiling-type labelings, and packing-based refinements. We first analyze the case $C_9\Box P_n$, where these three families of bounds can be compared explicitly and their relative efficiency is shown to depend on the parameter $k$. We then extend the linear constructions to cylindrical grids whose circumference is a multiple of one of the values $3,\dots,9$, obtaining a unified family of upper bounds for $C_{rt}\Box P_n$. Motivated by the asymptotic behavior of these estimates, we further derive general upper bounds depending only on the residue class of $m$ modulo $5$, which apply to all cylindrical grids. As a consequence, we obtain explicit estimates for the double Roman dom

Impact 4.0 · Import 4.0 · Pop 3.5
#226
arXiv Robotics-Embodied 2026-04-13
by Cedric Le Gentil, Daniil Lisus, Timothy D. Barfoot

Recently, the robotics community has regained interest in radar-based perception and state estimation. A 2D imaging radar provides dense 360° information about the environment. Despite the radar antenna's cone of emission and reception, the collected data is generally assumed to be limited to the plane orthogonal to the radar's spinning axis. Accordingly, most methods based on 2D imaging radars only perform SE(2) state estimation. This paper presents 3DRO, an extension of the SE(2) Direct Radar Odometry (DRO) framework to perform state estimation in SE(3). While still assuming planarity of the data through DRO's 2D velocity estimates, it integrates 3D gyroscope measurements over SO(3) to estimate SE(3) ego motion. While simple, this approach provides lidar-level odometry accuracy as demonstrated using 643 km of data from the Boreas-RT dataset.
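The SO(3) gyroscope-integration step can be sketched as dead-reckoning with the rotation-matrix exponential. This is a minimal numpy sketch under assumed inputs; the fusion with DRO's 2D velocity estimates that 3DRO performs is omitted here.

```python
import numpy as np

def so3_exp(w):
    # Rodrigues' formula: map a rotation vector to a rotation matrix.
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def integrate_gyro(R0, rates, dt):
    # Dead-reckon orientation from body-frame angular rates over SO(3).
    R = R0.copy()
    for w in rates:
        R = R @ so3_exp(np.asarray(w) * dt)
    return R

# A constant 0.1 rad/s yaw rate for 10 s should give a 1.0 rad yaw.
R = integrate_gyro(np.eye(3), [(0.0, 0.0, 0.1)] * 100, dt=0.1)
```

Composing per-step exponentials keeps the estimate on the rotation manifold, unlike naively accumulating Euler angles.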

Impact 4.0 · Import 4.0 · Pop 3.5
#227
arXiv Robotics-Embodied 2026-04-13
by Jinhui Ye, Ning Gao, Senqiao Yang, Jinliang Zheng et al.

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex, as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$α$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$α$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $π_{0.5}$ by 20\% on the public real-world RoboChallenge benc

Impact 4.0 · Import 4.0 · Pop 3.5
#228
arXiv Robotics-Embodied 2026-04-13
by Krishna Jaganathan, Patricio Vela

Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupted. We thus present GeomPrompt, a lightweight cross-modal adaptation module that synthesizes a task-driven geometric prompt from RGB alone for the fourth channel of a frozen RGB-D semantic segmentation model, without depth supervision. We further introduce GeomPrompt-Recovery, an adaptation module that compensates for degraded depth by predicting the fourth channel correction relevant for the frozen segmenter. Both modules are trained solely with downstream segmentation supervision, enabling recovery of the geometric prior useful for segmentation, rather than estimating depth signals. On SUN RGB-D, GeomPrompt improves over RGB-only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion, while remaining competitive with strong monocular depth estimators. For degraded depth, GeomPrompt-Recovery consistently improves robustness, yielding gains up to +3.6 mIoU under severe depth corruptions. GeomPrompt is also substantially more efficient than monocular depth baselines, reaching 7.8 ms latency versus 38.3 ms and 71.9 ms.

Impact 4.0 · Import 4.0 · Pop 3.5
#229
arXiv Robotics-Embodied 2026-04-13
by Siyuan Xu, Tianshi Wang, Fengling Li, Lei Zhu et al.

Vision-Language-Action models (VLAs) have demonstrated strong potential for embodied AI, yet their deployment on resource-limited robots remains challenging due to high memory and computational demands. While Post-Training Quantization (PTQ) provides an efficient solution, directly applying PTQ to VLAs often results in severe performance degradation during sequential control. We identify temporal error accumulation as a key factor, where quantization perturbations at the vision-language-to-action interface are progressively amplified, leading to kinematic drift in executed trajectories. To address this issue, we propose Drift-Aware Post-Training Quantization (DA-PTQ), which formulates quantization as a drift-aware optimization problem over sequential decision processes. DA-PTQ consists of two components: (1) Cross-Space Representation Compensation, which mitigates structured distortions between multimodal representations and action space to improve action consistency, and (2) Motion-Driven Mixed-Precision Allocation, which assigns bit-widths by minimizing trajectory-level motion errors. Extensive experiments show that DA-PTQ significantly reduces kinematic drift and achieves compar

Impact 4.0 · Import 4.0 · Pop 3.5
#234
arXiv SSM 2026-04-14
by Farzaneh Jafari, Stefano Berretti, Anup Basu

We introduce SEDTalker, an emotion-aware framework for speech-driven 3D facial animation that leverages frame-level speech emotion diarization to achieve fine-grained expressive control. Unlike prior approaches that rely on utterance-level or manually specified emotion labels, our method predicts temporally dense emotion categories and intensities directly from speech, enabling continuous modulation of facial expressions over time. The diarized emotion signals are encoded as learned embeddings and used to condition a speech-driven 3D animation model based on a hybrid Transformer-Mamba architecture. This design allows effective disentanglement of linguistic content and emotional style while preserving identity and temporal coherence. We evaluate our approach on a large-scale multi-corpus dataset for speech emotion diarization and on the EmoVOCA dataset for emotional 3D facial animation. Quantitative results demonstrate strong frame-level emotion recognition performance and low geometric and temporal reconstruction errors, while qualitative results show smooth emotion transitions and consistent expression control. These findings highlight the effectiveness of frame-level emotion diar

Impact 4.0 · Import 4.0 · Pop 3.5
#235
arXiv SSM 2026-04-14
by Chaoyao Shen, Linfeng Jiang, Yixian Shen, Tao Xu et al.

Deep learning (DL) compilers rely on cost models and auto-tuning to optimize tensor programs for target hardware. However, existing approaches depend on large offline datasets, incurring high collection costs and offering suboptimal transferability across platforms. In this paper, we introduce TCL, a novel efficient and transferable compiler framework for fast tensor program optimization across diverse hardware platforms to address these challenges. Specifically, TCL is built on three core enablers: (1) the RDU Sampler, a data-efficient active learning strategy that selects only 10% of tensor programs by jointly optimizing Representativeness, Diversity, and Uncertainty, substantially reducing data collection costs while maintaining near-original model accuracy; (2) a new Mamba-based cost model that efficiently captures long-range schedule dependencies while achieving a favorable trade-off between prediction accuracy and computational cost through reduced parameterization and lightweight sequence modeling; and (3) a continuous knowledge distillation framework that effectively and progressively transfers knowledge across multiple hardware platforms while avoiding the parameter explos
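A greedy selection that trades off uncertainty against diversity gives the flavor of the RDU sampler. The scoring weights, the feature space, and the greedy scheme below are assumptions standing in for the paper's exact criterion, not its implementation.

```python
import numpy as np

def rdu_select(feats, uncertainty, k, lam=0.5):
    # Greedy pick: prefer programs the cost model is uncertain about
    # that also lie far (in feature space) from anything already
    # selected -- a simple diversity/representativeness proxy.
    selected = [int(np.argmax(uncertainty))]
    for _ in range(k - 1):
        dists = np.min(
            np.linalg.norm(feats[:, None, :] - feats[selected][None, :, :],
                           axis=-1),
            axis=1)
        score = lam * uncertainty + (1.0 - lam) * dists
        score[selected] = -np.inf  # never re-pick a selected program
        selected.append(int(np.argmax(score)))
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 4))   # tensor-program embeddings (assumed)
unc = rng.uniform(size=50)         # per-program cost-model uncertainty
picked = rdu_select(feats, unc, k=5)
```

Selecting a small, informative subset like this is what lets an active-learning cost model approach full-dataset accuracy while measuring only a fraction of the candidate programs.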

Impact 4.0 · Import 4.0 · Pop 3.5
#236
arXiv SSM 2026-04-14
by Silvaneo V. dos Santos, Layla Parast

Surrogate markers offer the potential to reduce the burden of data collection by replacing costly or invasive primary outcomes with more accessible measurements, provided that they can faithfully indicate the effectiveness of a treatment. However, appropriate evaluation of a surrogate is particularly complex in longitudinal studies, where both outcomes and surrogates can evolve dynamically over time and interest lies not only in the treatment effect at one time, but rather treatment effects that may vary along the entire trajectory. In this paper, we develop a statistical framework for surrogate evaluation when both the surrogate and primary outcome are measured over time. Specifically, within the potential outcomes framework, we propose a formal causal definition of the proportion of the treatment effect on the longitudinal primary outcome that is explained by the treatment effect on the longitudinal surrogate. For estimation, we leverage state-space models, together with the Kalman filter and smoother, enabling efficient estimation of treatment effects under realistic conditions of temporal evolution and patient-level variability. We introduce a nonparametric bootstrap strategy f
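For intuition, the kind of state-space estimation this framework builds on can be sketched with a scalar linear-Gaussian Kalman filter. The model, parameters, and scalar setting are illustrative only; the paper's estimator operates on longitudinal surrogate and outcome trajectories, not a single series.

```python
import numpy as np

def kalman_filter(ys, a=1.0, c=1.0, q=0.1, r=0.5, m0=0.0, p0=1.0):
    # Scalar linear-Gaussian model: x_t = a*x_{t-1} + N(0, q),
    #                               y_t = c*x_t     + N(0, r).
    m, p = m0, p0
    means, variances = [], []
    for y in ys:
        m_pred, p_pred = a * m, a * a * p + q          # predict
        gain = p_pred * c / (c * c * p_pred + r)       # Kalman gain
        m = m_pred + gain * (y - c * m_pred)           # update mean
        p = (1.0 - gain * c) * p_pred                  # update variance
        means.append(m)
        variances.append(p)
    return np.array(means), np.array(variances)

# Constant observations pull the filtered mean toward 1.0.
means, variances = kalman_filter(np.ones(50))
```

A smoother pass (running backward over the filtered estimates) would then yield the full-trajectory posteriors the paper uses for treatment-effect estimation.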

Impact 4.0 · Import 4.0 · Pop 3.5
#237
arXiv SSM 2026-04-14
by Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang et al.

Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when a target is occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robus

Impact 4.0 · Import 4.0 · Pop 3.5
#238
arXiv SSM 2026-04-14
by Mohammed Asad, Mohit Bajpai, Sudhir Singh, Rahul Katarya

Accurate characterization of suspicious breast lesions in mammography is important for early diagnosis and treatment planning. While Convolutional Neural Networks (CNNs) are effective at extracting local visual patterns, they are less suited to modeling long-range dependencies. Vision Transformers (ViTs) address this limitation through self-attention, but their quadratic computational cost can be prohibitive. This paper presents a hybrid architecture that combines EfficientNetV2-M for local feature extraction with Vision Mamba, a State Space Model (SSM), for efficient global context modeling. The proposed model performs binary classification of abnormality-centered mammography regions of interest (ROIs) from the CBIS-DDSM dataset into benign and malignant classes. By combining a strong CNN backbone with a linear-complexity sequence model, the approach achieves strong lesion-level classification performance in an ROI-based setting.

Impact 4.0 · Import 4.0 · Pop 3.5
#239
arXiv SSM 2026-04-14
by Yongbo Shu, Wenzhao Xie, Shanhu Yao, Zirui Xin et al.

Multi-parametric prostate MRI -- combining T2-weighted, apparent diffusion coefficient, and high b-value diffusion-weighted sequences -- is central to non-invasive detection of clinically significant prostate cancer, yet in routine practice individual sequences may be missing or degraded by motion, artifacts, or abbreviated protocols. Existing multi-modal fusion strategies typically assume complete inputs and entangle modality-specific information at early layers, offering limited resilience when one channel is corrupted or absent. We propose Modality-Isolated Gated Fusion (MIGF), an architecture-agnostic module that maintains separate modality-specific encoding streams before a learned gating stage, combined with modality dropout training to enforce compensation behavior under incomplete inputs. We benchmark six bare backbones and assess MIGF-equipped models under seven missing-modality and artifact scenarios on the PI-CAI dataset (1,500 studies, fold-0 split, five random seeds). Among bare backbones, nnUNet provided the strongest balance of performance and stability. MIGF improved ideal-scenario Ranking Score for UNet, nnUNet, and Mamba by 2.8%, 4.6%, and 13.4%, respectively; the

Impact 4.0 · Import 4.0 · Pop 3.5
#240
arXiv Robotics-Embodied 2026-04-14
by Jaden Park, Jungtaek Kim, Jongwon Jeong, Robert D. Nowak et al.

Language Model (LM) agents are increasingly used in complex open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability to both explore the problem space and exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions without access to the agent's internal policy remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG). The map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy-agnostic evaluation, we design a metric to quantify exploration and exploitation errors from an agent's actions. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively and show both exploration and exploitation can be significantly improved through minimal

Impact 4.0 · Import 4.0 · Pop 3.5
#241
arXiv Robotics-Embodied 2026-04-14
by Ziyuan Xia, Jingyi Xu, Chong Cui, Yuanhong Yu et al.

Training embodied AI agents depends critically on the visual fidelity of simulation environments and the ability to model dynamic humans. Current simulators rely on mesh-based rasterization with limited visual realism, and their support for dynamic human avatars, where available, is constrained to mesh representations, hindering agent generalization to human-populated real-world scenarios. We present Habitat-GS, a navigation-centric embodied AI simulator extended from Habitat-Sim that integrates 3D Gaussian Splatting scene rendering and drivable Gaussian avatars while maintaining full compatibility with the Habitat ecosystem. Our system implements a 3DGS renderer for real-time photorealistic rendering and supports scalable 3DGS asset import from diverse sources. For dynamic human modeling, we introduce a Gaussian avatar module that enables each avatar to simultaneously serve as a photorealistic visual entity and an effective navigation obstacle, allowing agents to learn human-aware behaviors in realistic settings. Experiments on point-goal navigation demonstrate that agents trained on 3DGS scenes achieve stronger cross-domain generalization, with mixed-domain training being the mos

Impact 4.0 · Import 4.0 · Pop 3.5
#242
arXiv Robotics-Embodied 2026-04-14
by Huy Anh Nguyen, Feras Dayoub, Minh Hoai

We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced `high-see') that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft label that emphasizes hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanni
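The cross-attention from hand-region features over surrounding context can be sketched with plain softmax attention. Shapes and feature sources below are assumed for illustration; the actual module operates on spatiotemporal deep features rather than random vectors.

```python
import numpy as np

def cross_attention(queries, context, d):
    # Softmax attention: each hand-region query mixes context features,
    # with every row of the attention matrix summing to one.
    scores = queries @ context.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ context, w

rng = np.random.default_rng(0)
hand_feats = rng.normal(size=(3, 16))   # features from hand crops (assumed)
ctx_feats = rng.normal(size=(10, 16))   # surrounding-context tokens (assumed)
out, attn = cross_attention(hand_feats, ctx_feats, d=16)
```

Letting hand tokens query the surrounding scene is what allows the model to pick up contact-relevant context (e.g., the approached object) that the hand crop alone does not contain.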

Impact 4.0 · Import 4.0 · Pop 3.5
#243
arXiv Robotics-Embodied 2026-04-14
by Yuan Shui, Yandong Guan, Zhanwei Zhang, Juncheng Hu et al.

Parametric Computer-Aided Design (CAD) of articulated assemblies is essential for product development, yet generating these multi-part, movable models from high-level descriptions remains unexplored. To address this, we propose ArtiCAD, the first training-free multi-agent system capable of generating editable, articulated CAD assemblies directly from text or images. Our system divides this complex task among four specialized agents: Design, Generation, Assembly, and Review. One of our key insights is to predict assembly relationships during the initial design stage rather than the assembly stage. By utilizing a Connector that explicitly defines attachment points and joint parameters, ArtiCAD determines these relationships before geometry generation, effectively bypassing the limited spatial reasoning capabilities of current LLMs and VLMs. To further ensure high-quality outputs, we introduce validation steps in the generation and assembly stages, accompanied by a cross-stage rollback mechanism that accurately isolates and corrects design- and code-level errors. Additionally, a self-evolving experience store accumulates design knowledge to continuously improve performance on future t

Impact 4.0 · Import 4.0 · Pop 3.5
#244
AI Alignment Forum 2026-04-14

It turns out that Anthropic accidentally trained against the chain of thought of Claude Mythos Preview in around 8% of training episodes. This is at least the second independent incident in which Anthropic accidentally exposed their model's CoT to the oversight signal. In more powerful systems, this kind of failure would jeopardize safely navigating the intelligence explosion. It's crucial to build good processes to ensure development is executed according to plan, especially as human oversight becomes spread thin over increasing amounts of potentially untrusted and sloppy AI labor. This particular failure is also directly harmful, because it significantly reduces our confidence that the model's reasoning trace is monitorable (reflective of the AI's intent to misbehave). [1] I'm grateful that Anthropic has transparently reported on this issue as much as they have, allowing for outside scrutiny. I want to encourage them to continue to do so. Thanks to Carlo Leonardo Attubato, Buck Shlegeris, Fabien Roger, Arun Jose, and Aniket Chakravorty for feedback and discussion. See also previous discussion here. Incidents: A technical error affecting Mythos, Opus 4.6, and Sonnet 4.6. This is the most recent incident. In the Claude Mythos alignment risk update, Anthropic report having accidentally exposed approximately 8% of chains of thought to the reward function. The technical error that caused this issue went unnoticed for a long time: it also affected Opus 4.6 and Sonnet 4.6. A technical error affecting Opus 4.6. A previous technical error also exposed CoTs to the oversight signal in Opus 4.6 (affecting a much smaller number of episodes concentrated near the end of training in this case). After noticing this error, Anthropic should have reworked their development process to mak

Impact 4.0 · Import 4.0 · Pop 3.5
#248
DefenseScoop 2026-04-14

Senior military officials are reviewing all the insights and input gained from the Marine Corps’ first Generative and Agentic AI Workshop that was held in Quantico last month, according to Maj. Christopher Clark, who said the service aims to use those findings to inform integration plans and reliable tech deployments. Clark is deeply involved in this work as the Corps’ AI lead. He spotlighted some of the major, early takeaways from that well-attended, four-day event during Scoop News Group’s AITalks conference on Tuesday. “We are doing an analysis of our workshop. We had 350 individuals from across the department and 102 companies [participate] in it — and we have a lot of feedback that we’ve collected. So, we’re planning to use that to then drive our priorities,” Clark said. “How do we ensure that we understand where AI needs to be integrated to solve [operational problems] and how do you do it in a way that’s safe and effective?” As they rapidly mature, advanced AI and machine learning models offer the military advantages associated with speed, data processing, targeting and more. Yet they also introduce serious uncertainty and risks of potential technical failures, unpredictable behavior, and unintended escalation. The term “generative AI” or genAI refers to systems that can respond to human prompts by generating a range of media and content. Agentic AI completes tasks by interacting with data and digital tools with little human supervision, and can apply genAI. The Marine Corps’ AI workshop was originally slated for November 2025. The service announced in October that it would be postponed due to the lapse in appropriations related to the federal government shutdown that disrupted agencies’ work at that time. Held at Quantico’s Warner Hall March 9-12, th

Impact 4.0 · Import 4.0 · Pop 3.5
#249
arXiv SSM 2026-04-15
by Runwei Lin, Ying Wang

Heart rate variability (HRV) analysis is important for the assessment of autonomic cardiovascular regulation. The inverse Gaussian process (IGP) has been widely used for beat-to-beat HRV modeling, as it gives a physiologically relevant interpretation of the heart depolarization process. A key challenge in IGP-based heartbeat modeling is the accurate estimation of time-varying parameters. In this study, we investigated whether recurrent neural networks (RNNs) can be used for IGP parameter identification and thereby enhance probabilistic modeling of R-R dynamics. Specifically, four representative RNN architectures, namely, GRU, LSTM, Structured State Space sequence model (S4), and Mamba, were evaluated using the Kolmogorov-Smirnov statistics. The results demonstrate the possibility of combining neural sequence models with the IGP framework for beat-wise R-R series modeling. This approach provides a flexible basis for probabilistic HRV modeling and for future incorporation of more complex physiological mechanisms and dynamic conditions.
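The inverse Gaussian (Wald) log-density at the core of IGP heartbeat models can be written directly. This sketch implements only the standard density, with parameter values chosen for illustration; in the paper the RNNs predict the time-varying parameters that feed it.

```python
import numpy as np

def inverse_gaussian_logpdf(x, mu, lam):
    # Log density of the inverse Gaussian (Wald) distribution with
    # mean mu and shape lam, a standard model for R-R intervals.
    return (0.5 * np.log(lam / (2.0 * np.pi * x**3))
            - lam * (x - mu) ** 2 / (2.0 * mu**2 * x))

# Sanity check: the density should integrate to ~1 (Riemann sum).
xs = np.linspace(1e-4, 60.0, 100_000)
mass = float(np.sum(np.exp(inverse_gaussian_logpdf(xs, 1.0, 2.0)))
             * (xs[1] - xs[0]))
```

Fitting then amounts to maximizing this log-likelihood over observed beat intervals, with `mu` and `lam` produced per beat by the sequence model.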

Impact 4.0 · Import 4.0 · Pop 3.5
#250
arXiv SSM 2026-04-15
by Xinjin Li, Jinghan Cao, Mengyue Wang, Yue Wu et al.

Traffic forecasting requires modeling complex temporal dynamics and long-range spatial dependencies over large sensor networks. Existing methods typically face a trade-off between expressiveness and efficiency: Transformer-based models capture global dependencies well but suffer from quadratic complexity, while recent selective state-space models are computationally efficient yet less effective at modeling spatial interactions in graph-structured traffic data. We propose FAST, a unified framework that combines attention and state-space modeling for scalable spatiotemporal traffic forecasting. FAST adopts a Temporal-Spatial-Temporal architecture, where temporal attention modules capture both short- and long-term temporal patterns, and a Mamba-based spatial module models long-range inter-sensor dependencies with linear complexity. To better represent heterogeneous traffic contexts, FAST further introduces a learnable multi-source spatiotemporal embedding that integrates historical traffic flow, temporal context, and node-level information, together with a multi-level skip prediction mechanism for hierarchical feature fusion. Experiments on PeMS04, PeMS07, and PeMS08 show that FAST co

Impact 4.0 · Import 4.0 · Pop 3.5
#251
arXiv SSM 2026-04-15
by Jinlin You, Muyu Li, Xudong Zhao

Existing Vision Mamba-based RGB-Event (RGBE) tracking methods suffer from using static state transition matrices, which fail to adapt to variations in event sparsity. This rigidity leads to imbalanced modeling (underfitting sparse event streams and overfitting dense ones), degrading cross-modal fusion robustness. To address these limitations, we propose MambaTrack, a multimodal and efficient tracking framework built upon a Dynamic State Space Model (DSSM). Our contributions are twofold. First, we introduce an event-adaptive state transition mechanism that dynamically modulates the state transition matrix based on event stream density. A learnable scalar governs the state evolution rate, enabling differentiated modeling of sparse and dense event flows. Second, we develop a Gated Projection Fusion (GPF) module for robust cross-modal integration. This module projects RGB features into the event feature space and generates adaptive gates from event density and RGB confidence scores. These gates precisely control the fusion intensity, suppressing noise while preserving complementary information. Experiments show that MambaTrack achieves state-of-the-art performance on the FE108 and FELT
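The event-adaptive state transition can be illustrated with a scalar diagonal SSM whose discretization step is modulated by event density. The modulation rule, constants, and scalar state here are assumptions standing in for the paper's learnable mechanism.

```python
import numpy as np

def adaptive_ssm_scan(u, density, a=-1.0, alpha=2.0, dt=0.1):
    # Scalar diagonal SSM x' = a*x + u, discretized with a step size
    # scaled by per-frame event density: dense frames evolve the state
    # faster, sparse frames more slowly (hypothetical modulation rule).
    x, ys = 0.0, []
    for u_t, d_t in zip(u, density):
        step = dt * (1.0 + alpha * d_t)   # learnable scalar in the paper
        a_bar = np.exp(a * step)          # zero-order-hold discretization
        x = a_bar * x + (1.0 - a_bar) * u_t
        ys.append(x)
    return np.array(ys)

# A dense event stream tracks the input faster than a sparse one.
ys_dense = adaptive_ssm_scan([1.0] * 10, [1.0] * 10)
ys_sparse = adaptive_ssm_scan([1.0] * 10, [0.0] * 10)
```

Tying the effective step size to event density is one way a single set of SSM weights can avoid underfitting sparse streams while not overfitting dense ones.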

Impact 4.0 · Import 4.0 · Pop 3.5
#252
arXiv Robotics-Embodied 2026-04-15
by Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang et al.

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part, equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA sig

Impact 4.0 · Import 4.0 · Pop 3.5
#253
arXiv Robotics-Embodied 2026-04-15
by Zhen Liu, Xinyu Ning, Zhe Hu, Xinxin Xie et al.

Recent vision-language-action (VLA) systems have demonstrated strong capabilities in embodied manipulation. However, most existing VLA policies rely on limited observation windows and end-to-end action prediction, which makes them brittle in long-horizon, memory-dependent tasks with partial observability, occlusions, and multi-stage dependencies. Such tasks require not only precise visuomotor control, but also persistent memory, adaptive task decomposition, and explicit recovery from execution failures. To address these limitations, we propose a dual-system framework for long-horizon embodied manipulation. Our framework explicitly separates high-level semantic reasoning from low-level motor execution. A high-level planner, implemented as a VLM-based agentic module, maintains structured task memory and performs goal decomposition, outcome verification, and error-driven correction. A low-level executor, instantiated as a VLA-based visuomotor controller, carries out each sub-task through diffusion-based action generation conditioned on geometry-preserving filtered observations. Together, the two systems form a closed loop between planning and execution, enabling memory-aware reasoning

Impact 4.0 · Import 4.0 · Pop 3.5
#254
arXiv Robotics-Embodied 2026-04-15
by Xueyang Zhou, Yihan Sun, Xijie Gong, Guiyao Tie et al.

Embodied AI research is increasingly moving beyond single-task, single-environment policy learning toward multi-task, multi-scene, and multi-model settings. This shift substantially increases the engineering overhead and development time required for stages such as evaluation environment construction, trajectory collection, model training, and evaluation. To address this challenge, we propose a new paradigm for embodied AI development in which users express goals and constraints through conversation, and the system automatically plans and executes the development workflow. We instantiate this paradigm with EmbodiedClaw, a conversational agent that turns high-frequency, high-cost embodied research activities, including environment creation and revision, benchmark transformation, trajectory synthesis, model evaluation, and asset expansion, into executable skills. Experiments on end-to-end workflow tasks, capability-specific evaluations, human researcher studies, and ablations show that EmbodiedClaw reduces manual engineering effort while improving executability, consistency, and reproducibility. These results suggest a shift from manual toolchains to conversationally executable workf

Impact 4.0 · Import 4.0 · Pop 3.5
#255
arXiv Robotics-Embodied 2026-04-15
by Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda

Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to
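The directional, annealed consistency term can be pictured as an auxiliary penalty added to the PPO loss. The sketch below is an assumption-laden illustration (cosine alignment, linear anneal schedule, weight `w0`), not the paper's actual regularizer:

```python
import numpy as np

def consistency_penalty(agent_actions, vla_actions, step, anneal_steps=10_000, w0=0.1):
    """Directional action-consistency term (a sketch, not the paper's code).

    Encourages the RL agent's actions to point in the same direction as the
    VLA suggestions via cosine similarity, with a weight that anneals linearly
    to zero so the agent is free to deviate later in training.
    """
    a = np.asarray(agent_actions, dtype=float)
    g = np.asarray(vla_actions, dtype=float)
    cos = np.sum(a * g, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(g, axis=-1) + 1e-8
    )
    weight = w0 * max(0.0, 1.0 - step / anneal_steps)  # annealed over time
    return weight * float(np.mean(1.0 - cos))          # 0 when aligned

# total_loss = ppo_loss + consistency_penalty(actions, vla_suggestions, step)
```

Because the term is a soft penalty rather than a hard imitation constraint, the agent can still override poor VLA suggestions whenever the PPO objective rewards doing so.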

Impact 4.0 · Import 4.0 · Pop 3.5
#256
arXiv Robotics-Embodied 2026-04-15
by Hanxuan Chen, Jie Zheng, Siqi Yang, Tianle Zeng et al.

Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically-grounded reasoning. The survey systematically reviews the ecosystem of essential resources (simulators, datasets, and evaluation metrics) that facilitates standardized research. Furthermore, we conduct a critical analysis of the primary challenges impeding real-world deployment: the simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and the efficient deployment of large model

Impact 4.0 · Import 4.0 · Pop 3.5
#257
arXiv Robotics-Embodied 2026-04-15
by Jingjing Qian, Zeyuan He, Chen Shi, Lei Xiao et al.

Coordinating navigation and manipulation with robust performance is essential for embodied AI in complex indoor environments. However, as tasks extend over long horizons, existing methods often struggle due to catastrophic forgetting, spatial inconsistency, and rigid execution. To address these issues, we propose ESCAPE (Episodic Spatial Memory Coupled with an Adaptive Policy for Execution), operating through a tightly coupled perception-grounding-execution workflow. For robust perception, ESCAPE features a Spatio-Temporal Fusion Mapping module to autoregressively construct a depth-free, persistent 3D spatial memory, alongside a Memory-Driven Target Grounding module for precise interaction mask generation. To achieve flexible action, our Adaptive Execution Policy dynamically orchestrates proactive global navigation and reactive local manipulation to seize opportunistic targets. ESCAPE achieves state-of-the-art performance on the ALFRED benchmark, reaching 65.09% and 60.79% success rates in test seen and unseen environments with step-by-step instructions. By reducing redundant exploration, our ESCAPE attains substantial improvements in path-length-weighted metrics and maintains robu

Impact 4.0 · Import 4.0 · Pop 3.5
#258
arXiv Robotics-Embodied 2026-04-15
by Peiwen Jiang, Yushuo Feng, Jiajia Guo, Chao-Kai Wen et al.

The increasing deployment of agentic artificial intelligence (AI) systems has intensified the demand for efficient agent-to-agent communication, particularly over bandwidth-limited wireless links. In embodied AI applications, agents must exchange task-related information under strict latency and reliability constraints. Existing agent communication methods primarily focus on connectivity and protocol efficiency, but lack effective mechanisms to reduce physical-layer transmission overhead while preserving task semantics. To address this challenge, this paper proposes a semantic agent communication framework that reduces communication overhead while maintaining task performance and shared understanding among agents. An LLM-based semantic processor is first introduced to reorganize and condense agent-generated messages by extracting task-relevant semantic content. To cope with information loss introduced by aggressive message reduction, an importance-aware semantic transmission strategy is developed, which adaptively protects semantic components according to their task importance. Furthermore, a task-specific knowledge base is incorporated as long-term semantic memory to support recurr

Impact 4.0 · Import 4.0 · Pop 3.5
#259
arXiv RL 2026-04-15
by Qing Yan, Wenyu Yang, Yufei Wang, Wenhao Ma et al.

Traditional esports scouting workflows rely heavily on manual video review and aggregate performance metrics, which often fail to capture the nuanced decision-making patterns necessary to determine if a prospect fits a specific tactical archetype. To address this, we reframe style-based player evaluation in esports as an Inverse Reinforcement Learning (IRL) problem. In this paper, we introduce a novel player selection framework that learns professional-specific reward functions from logged gameplay demonstrations, allowing organizations to rank candidates by their stylistic alignment with a target star player. Our proposed architecture utilizes a multimodal, two-branch intake: one branch encodes structured state-action trajectories derived from high-resolution in-game telemetry, while the second encodes temporally aligned tactical pseudo-commentary generated by Vision-Language Models (VLMs) from broadcast footage. These representations are fused and evaluated via a Generative Adversarial Imitation Learning (GAIL) objective, where a discriminator learns to capture the unique mechanical and tactical signatures of elite professionals. By transitioning from generic skill estimation to

Impact 4.0 · Import 4.0 · Pop 3.5
#260
arXiv RL 2026-04-15
by Kristian Holme, Jean Rabault, Ricardo Vinuesa, Mikael Mortensen

Rotating detonation engines (RDEs) are a promising propulsion concept that may offer higher thermodynamic efficiency and specific impulse than conventional systems, but nonlinear phenomena, including transitions to oscillatory or chaotic propagation modes, can hinder practical operation. Deep Reinforcement Learning (DRL) has emerged as a promising method for controlling complex nonlinear dynamics such as those observed in RDEs. However, the multi-timescale nature of the RDE system makes direct application of DRL challenging. We address this challenge by reformulating the DRL problem in a moving reference frame that follows the detonation-wave pattern, making the wave structure appear quasi-steady to the agent. This reformulation enables scale separation between fast detonation propagation and slower operating-mode dynamics. We train DRL controllers to modulate spatially segmented injection pressure in a one-dimensional reduced-order RDE model and induce rapid transitions between different mode-locked states. Across a range of actuation periods, initial states, and target modes, controllers trained in the moving frame learn more reliably than those trained in a stationary frame and
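The moving-reference-frame trick can be illustrated on a 1-D periodic field: locate the wave and re-index the observation so the wave always sits at the same position. This is a minimal sketch of the idea (peak detection via `argmax` is an assumption; the paper works with a reduced-order RDE model):

```python
import numpy as np

def to_moving_frame(field, center_idx=None):
    """Re-center a 1-D periodic field so the detonation peak is at index 0.

    A minimal sketch of the moving-reference-frame idea: locate the wave
    (here, simply the field maximum) and roll the array so the agent always
    observes a quasi-steady wave structure, regardless of wave position.
    """
    field = np.asarray(field, dtype=float)
    peak = int(np.argmax(field)) if center_idx is None else center_idx
    return np.roll(field, -peak)

# A travelling pulse looks identical to the agent wherever it currently is:
x = np.linspace(0.0, 2 * np.pi, 64, endpoint=False)
pulse = np.exp(np.cos(x))           # smooth periodic "wave", peak at index 0
moved = np.roll(pulse, 17)          # same wave, shifted 17 cells downstream
```

In the co-moving view, the fast wave propagation is factored out, leaving only the slower operating-mode dynamics for the agent to act on.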

Impact 4.0 · Import 4.0 · Pop 3.5
#261
arXiv RL 2026-04-15
by Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu et al.

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights,
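The Negative Sample Reinforcement effect can be illustrated with a toy REINFORCE update on a categorical distribution: negative-reward samples have their log-probability pushed down, pruning incorrect outputs. This is a toy analogue of the mechanism named in the abstract, not PreRL itself:

```python
import numpy as np

def nsr_update(logits, sampled, rewards, lr=0.5):
    """One REINFORCE-style update on a categorical 'policy' (toy sketch).

    Negative-reward samples (reward = -1) have their log-probability pushed
    down, pruning incorrect outputs -- a toy analogue of Negative Sample
    Reinforcement, not the paper's method.
    """
    logits = np.asarray(logits, dtype=float).copy()
    for idx, r in zip(sampled, rewards):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad = -probs            # d log p(idx) / d logits, off-index part
        grad[idx] += 1.0         # on-index part: onehot - probs
        logits += lr * r * grad  # ascend for r > 0, descend for r < 0
    return logits

logits0 = np.zeros(4)
# Token 3 is judged wrong (r = -1); its probability should shrink.
logits1 = nsr_update(logits0, sampled=[3], rewards=[-1.0])
p1 = np.exp(logits1) / np.exp(logits1).sum()
```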

Impact 4.0 · Import 4.0 · Pop 3.5
#264
AI Alignment Forum 2026-04-15

Many people—especially AI company employees [1] —believe current AI systems are well-aligned in the sense of genuinely trying to do what they're supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of instructions). [2] I disagree. Current AI systems seem pretty misaligned to me in a mundane behavioral sense: they oversell their work, downplay or fail to mention problems, stop working early and claim to have finished when they clearly haven't, and often seem to "try" to make their outputs look good while actually doing something sloppy or incomplete. These issues mostly occur on more difficult/larger tasks, tasks that aren't straightforward SWE tasks, and tasks that aren't easy to programmatically check. Also, when I apply AIs to very difficult tasks in long-running agentic scaffolds, it's quite common for them to reward-hack / cheat (depending on the exact task distribution)—and they don't make the cheating clear in their outputs. AIs typically don't flag these cheats when doing further work on the same project and often don't flag these cheats even when interacting with a user who would obviously want to know, probably both because the AI doing further work is itself misaligned and because it has been convinced by write-ups that contain motivated reasoning or misleading descriptions. There is a more general "slippery" quality to working with current frontier AI systems. AIs seem to be improving at making their outputs seem good and useful faster than they're improving at making their outputs actually good and useful, especially in hard-to-check domains. The experience of working with current AIs (especially on hard-to-check tasks) often feels like you're making decent/great progress but then later you realize that things were

Impact 4.0 · Import 4.0 · Pop 3.5
#271
DefenseScoop 2026-04-15

Deputy Defense Secretary Steve Feinberg’s March 9 memorandum sets an ambitious deadline for Pentagon and military leaders to transition the Maven Smart System (MSS) into a formal program of record by the end of this fiscal year. That Palantir-supplied, AI-enabled platform fuses disparate military systems and intelligence sources into one interface that rapidly integrates data and compresses the military’s processes for finding and striking targets. The transition will allow for more stable funding streams and put the technology on an accelerated path to becoming an even more widely adopted fixture embedded across U.S. military operations. MSS has been deployed in live-fire exercises and is actively being used in real-world conflicts by multiple combatant commands. “This designation will streamline acquisition, ensure use of rigorous testing and evaluation standards, and create clear lines of accountability for performance, oversight, and management,” Feinberg wrote in the four-page memo, which was obtained by DefenseScoop. Notably, among a variety of instructions he provided to several Defense Department undersecretariats to fully establish the new program of record, Feinberg stated that if the Office of the Under Secretary of Defense for Research and Engineering identifies any transition task that will “negatively impact delivery of MSS capability to the warfighter,” the directorate should make any “necessary adjustments to the timelines” outlined on his behalf. Spokespersons from the military’s 11 combatant commands and DOD were largely unforthcoming when responding to questions from DefenseScoop over the last few weeks regarding whether or how the transition could disrupt their ongoing usage of MSS. “We decline to comment citing operational security,” a Pentag

Impact 4.0 · Import 4.0 · Pop 3.5
#272
DefenseScoop 2026-04-15

Federal agencies have always faced the challenge of finding and retaining talent that meets current and future mission needs. For too long, however, federal and defense agency heads have lacked the tools and incentives to view workforce planning as the agile, proactive, strategic discipline it should be: one that drives, not just supports, an agency’s mission. For many agencies, workforce planning in today’s environment remains a static, reactive and performative exercise — designed mostly to satisfy HR and budget requirements to justify full-time employee (FTE) counts, fill vacant seats, or contract skills the department lacks. (Mike Houlihan is Vice President at Workday Government.) Unfortunately, that has led to what amounts to “good enough” workforce planning, which, in reality, has weakened human capital management across the federal government and left agencies poorly prepared for today’s modern world challenges. Building resilience amidst workforce shifts: The significant “succession event” of 2025, which saw approximately 348,000 employees — roughly 10% of the federal workforce — transition out of their roles, has underscored the vital importance of agile workforce planning. This sudden shift in the talent landscape highlighted a unique challenge: the need to preserve deep institutional memory while simultaneously scaling new, specialized skill sets. During this period of rapid change, the value of a modernized skills-and-talent map became clear. For agencies to navigate shifting work demands effectively, having real-time insights into the specialized knowledge of their workforce is no longer just an advantage — it is the essential foundation for resilient operations. Modern workforce planning is not just about identifying critical ski

Impact 4.0 · Import 4.0 · Pop 3.5
#273
DefenseScoop 2026-04-15

From a business systems perspective, the Pentagon operates in essentially the same manner that any other multinational conglomerate does. Yet, the department has long seen its requirements, even those for everyday business operations, as unique. To meet those supposedly one-of-a-kind needs, it has been willing to foot a hefty upcharge to build highly customized software applications. Overengineered and underpowered since their inception, many of these services are now unwieldy productivity blockers that lag years behind offerings available to the private sector that are simultaneously powerful, usable, and cost-effective. With few unclassified exceptions, the national security community needs the same corporate software as the rest of us. But that’s not something it readily admitted until recently. Commercial-first is no longer just a suggestion: software is the perfect place to begin practicing the “buy before build” mantra: buy what you can, and build only what you must. If acquirers believe there is no suitable commercial offering, they are required to produce documented market research to prove that no existing product can meet the requirement, even with reasonable modifications. Law and policy are unambiguous at this point: commercial must be the default, not the afterthought. Silicon Valley and Washington have never been more closely aligned. The federal government’s senior-most appointed roles are filled with executives from the private sector, hailing from deep tech unicorns, venture capital, and investment banking. These leaders bring exquisite knowledge of what it really means to move fast, and how to replicate it. They’ve been wielding it to great effect. Commercial awards across the enterprise have quietly demonstrated something long suspected: when the

Impact 4.0 · Import 4.0 · Pop 3.5
#274
arXiv cs.RO 2026-04-16
by Yirui Wang, Xiuwei Xu, Angyuan Ma, Bingyao Yu et al.

Manipulation policies deployed in uncontrolled real-world scenarios are faced with great in-category geometric diversity of everyday objects. In order to function robustly under such variations, policies need to work in a category-level manner, i.e. knowing how to interact with any object in a certain category, instead of only a specific one seen during training. This in-category generalizability is usually nurtured with shape-diversified training data; however, manually collecting such a corpus of data is infeasible due to the requirement of intense human labor and large collections of divergent objects at hand. In this paper, we propose ShapeGen, a data generation method that aims at generating shape-variated manipulation data in a simulator-free and 3D manner. ShapeGen decomposes the process into two stages: Shape Library curation and Function-Aware Generation. In the first stage, we train spatial warpings between shapes mapping points to points that correspond functionally, and aggregate 3D models along with the warpings into a plug-and-play Shape Library. In the second stage, we design a pipeline that, leveraging established Libraries, requires only minimal human annotation to

Impact 4.0 · Import 4.0 · Pop 3.5
#275
arXiv cs.RO 2026-04-16
by Longchen Niu, Andrew Nasif, Gennaro Notomista

This paper presents a novel density control framework for multi-robot systems with spatial safety and energy sustainability guarantees. Stochastic robot motion is encoded through the Fokker-Planck Partial Differential Equation (PDE) at the density level. Control Lyapunov and control barrier functions are integrated with PDEs to enforce target density tracking, obstacle region avoidance, and energy sufficiency over multiple charging cycles. The resulting quadratic program enables fast in-the-loop implementation that adjusts commands in real time. A multi-robot experiment and extensive simulations were conducted to demonstrate the effectiveness of the controller under localization and motion uncertainties.
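The safety-filtering QP has a closed form in the scalar case. The sketch below is not the paper's density-level, multi-robot formulation; it only shows how a single barrier constraint a·u ≥ b minimally adjusts a nominal command:

```python
def cbf_qp_1d(u_nom, a, b):
    """Solve min_u (u - u_nom)^2  s.t.  a*u >= b, for scalar u and a != 0.

    Scalar stand-in for a control-barrier-function QP: the nominal command
    passes through unchanged when it is already safe, and is otherwise
    projected onto the boundary of the safe set (minimal correction).
    """
    if a == 0:
        raise ValueError("constraint gradient must be nonzero")
    if a * u_nom >= b:
        return u_nom            # nominal command already satisfies safety
    return b / a                # minimal correction onto the constraint

# e.g. the barrier requires u >= 0.5 (a = 1, b = 0.5):
safe = cbf_qp_1d(1.2, 1.0, 0.5)     # unchanged
fixed = cbf_qp_1d(-0.3, 1.0, 0.5)   # clipped up to the boundary
```

The same "pass through when safe, project when not" behavior is what makes such QPs cheap enough to run inside the control loop.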

Impact 4.0 · Import 4.0 · Pop 3.5
#276
arXiv cs.RO 2026-04-16
by Kaleb Ben Naveed, Manveer Singh, Devansh R. Agrawal, Dimitra Panagou

Planning safe trajectories under model uncertainty is a fundamental challenge. Robust planning ensures safety by considering worst-case realizations, yet ignores uncertainty reduction and leads to overly conservative behavior. Actively reducing uncertainty on-the-fly during a nominal mission defines the dual control problem. Most approaches address this by adding a weighted exploration term to the cost, tuned to trade off the nominal objective and uncertainty reduction, but without formal consideration of when exploration is beneficial. Moreover, safety is enforced in some methods but not in others. We study a budget-constrained dual control problem, where uncertainty is reduced subject to safety and a mission-level cost budget that limits the allowable degradation in task performance due to exploration. In this work, we propose Dual-gatekeeper, a framework that integrates robust planning with active exploration under formal guarantees of safety and budget feasibility. The key idea is that exploration is pursued only when it provides a verifiable improvement without compromising safety or violating the budget, enabling the system to balance immediate task performance with long-term

Impact 4.0 · Import 4.0 · Pop 3.5
#277
arXiv cs.RO 2026-04-16
by Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto et al.

We present a new robotic foundation model, called $π_{0.7}$, that can enable strong out-of-the-box performance in a wide range of scenarios. $π_{0.7}$ can follow diverse language instructions in unseen environments, including multi-stage tasks with various kitchen appliances, provide zero-shot cross-embodiment generalization, for example enabling a robot to fold laundry without seeing the task before, and perform challenging tasks such as operating an espresso machine out of the box at a level of performance that matches much more specialized RL-finetuned models. The main idea behind $π_{0.7}$ is to use diverse context conditioning during training. This conditioning information, contained in the prompt, makes it possible to steer the model precisely to perform many tasks with different strategies. It is conditioned not just on a language command that describes what it should do, but on additional multimodal information that also describes the manner or strategy in which it should do it, including metadata about task performance and subgoal images. This enables $π_{0.7}$ to use very diverse data, including demonstrations, potentially suboptimal (autonomous) data including failures,

Impact 4.0 · Import 4.0 · Pop 3.5
#278
arXiv cs.RO 2026-04-16
by Yang Zhou, Yash Shetye, Long Quang, Devon Super et al.

Deploying learned multi-robot models on heterogeneous robots remains challenging due to hardware heterogeneity, communication constraints, and the lack of a unified execution stack. This paper presents NeuroMesh, a multi-domain, cross-platform, and modular decentralized neural inference framework that standardizes observation encoding, message passing, aggregation, and task decoding in a unified pipeline. NeuroMesh combines a dual-aggregation paradigm for reduction- and broadcast-based information fusion with a parallelized architecture that decouples cycle time from end-to-end latency. Our high-performance C++ implementation leverages Zenoh for inter-robot communication and supports hybrid GPU/CPU inference. We validate NeuroMesh on a heterogeneous team of aerial and ground robots across collaborative perception, decentralized control, and task assignment, demonstrating robust operation across diverse task structures and payload sizes. We plan to release NeuroMesh as an open-source framework to the community.

Impact 4.0 · Import 4.0 · Pop 3.5
#279
arXiv cs.RO 2026-04-16
by Skye Thompson, Ondrej Biza, George Konidaris

Given a demonstration, a robot should be able to generalize a skill to any object it encounters, but existing approaches to skill transfer often fail to adapt to objects with unfamiliar shapes. Motivated by examples of improved transfer from compositional modeling, we propose a method for improving transfer by decomposing objects into their constituent semantic parts. We leverage data-efficient generative shape models to accurately transfer interaction points from the parts of a demonstration object to a novel object. We autonomously construct an objective to optimize the alignment of those points on skill-relevant object parts. Our method generalizes to a wider range of object geometries than existing work, and achieves successful one-shot transfer for a range of skills and objects from a single demonstration, in both simulated and real environments.

Impact 4.0 · Import 4.0 · Pop 3.5
#280
arXiv cs.RO 2026-04-16
by Hilton Marques Souza Santana, João Carlos Virgolino Soares, Sven Goffin, Ylenia Nisticò et al.

Kalman filter-based algorithms are fundamental for mobile robots, as they provide a computationally efficient solution to the challenging problem of state estimation. However, they rely on two main assumptions that are difficult to satisfy in practice: (a) the system dynamics must be linear with Gaussian process noise, and (b) the measurement model must also be linear with Gaussian measurement noise. Previous works have extended assumption (a) to nonlinear spaces through the Invariant Extended Kalman Filter (IEKF), showing that it retains properties similar to those of the classical Kalman filter when the system dynamics are group-affine on a Lie group. More recently, the counterpart of assumption (b) for the same nonlinear setting was addressed in [1]. By means of the proposed Iterated Invariant Extended Kalman Filter (IterIEKF), the authors of that work demonstrated that the update step exhibits several compatibility properties of the classical linear Kalman filter. In this work, we introduce a novel open-source state estimation algorithm for legged robots based on the IterIEKF. The update step of the proposed filter relies solely on proprioceptive measurements, exploiting kinema
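For context, the classical linear measurement update that the invariant filters generalize (standard textbook form, not the IterIEKF itself):

```python
import numpy as np

def kalman_update(x, P, z, H, R):
    """One linear Kalman measurement update.

    x is the state estimate, P its covariance, z the measurement,
    H the (linear) measurement matrix, R the measurement noise covariance.
    """
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x + K @ (z - H @ x)              # corrected estimate
    P_new = (np.eye(len(x)) - K @ H) @ P     # reduced uncertainty
    return x_new, P_new

# Scalar example: prior N(0, 1), measurement 2.0 with unit noise
# splits the difference and halves the variance.
x, P = np.array([0.0]), np.array([[1.0]])
H, R = np.array([[1.0]]), np.array([[1.0]])
x1, P1 = kalman_update(x, P, np.array([2.0]), H, R)
```

The IEKF/IterIEKF line of work replaces the additive correction above with a group-consistent update on a Lie group, which is what preserves the classical filter's properties for group-affine dynamics.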

Impact 4.0 · Import 4.0 · Pop 3.5
#281
arXiv cs.RO 2026-04-16
by Yunfu Deng, Yuhao Li, Josiah P. Hanna

In recent years, reinforcement learning (RL) has shown remarkable success in robotics when a fast and accurate simulator is available for a given task. When using RL and simulation, more simulator realism is generally beneficial but becomes harder to obtain as robots are deployed in increasingly complex and widescale domains. In such settings, simulators will likely fail to model all relevant details of a given target task and this observation motivates the study of sim2real with simulators that leave out key task details. In this paper, we formalize and study the abstract sim2real problem: given an abstract simulator that models a target task at a coarse level of abstraction, how can we train a policy with RL in the abstract simulator and successfully transfer it to the real-world? Our first contribution is to formalize this problem using the language of state abstraction from the RL literature. This framing shows that an abstract simulator can be grounded to match the target task if the grounded abstract dynamics take the history of states into account. Based on the formalism, we then introduce a method that uses real-world task data to correct the dynamics of the abstract simula

Impact 4.0 · Import 4.0 · Pop 3.5
#282
arXiv cs.RO 2026-04-16
by Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao et al.

3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/

Impact 4.0 · Import 4.0 · Pop 3.5
#283
arXiv cs.RO 2026-04-16
by Jakob Thumm, Marian Frei, Tianle Ni, Matthias Althoff et al.

We propose a framework for vision-based human pose estimation and motion prediction that gives conformal prediction guarantees for certifiably safe human-robot collaboration. Our framework combines aleatoric uncertainty estimation with OOD detection for high probabilistic confidence. To integrate our pipeline in certifiable safety frameworks, we propose conformal prediction sets for human motion predictions with high, valid confidence. We evaluate our pipeline on recorded human motion data and a real-world human-robot collaboration setting.

Impact 4.0 · Import 4.0 · Pop 3.5
#284
arXiv cs.RO 2026-04-16
by Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, Usman Nizamani et al.

We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes a new state-of-the-art performance in in-context imitation learning.
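The two successive vector-quantization levels can be sketched as nearest-centroid assignment followed by a fixed fine-to-coarse lookup. The codebooks below are illustrative toys, not learned HiST-AT codebooks:

```python
import numpy as np

def two_level_quantize(actions, fine_codebook, fine_to_coarse):
    """Two-level hierarchical action tokenization (illustrative sketch only).

    Level 1 assigns each action to its nearest fine-grained subcluster;
    level 2 maps that subcluster to a coarse cluster via a fixed lookup,
    mirroring the two successive vector-quantization levels in the abstract.
    """
    actions = np.asarray(actions, dtype=float)
    # nearest fine centroid per action (squared Euclidean distance)
    d = ((actions[:, None, :] - fine_codebook[None, :, :]) ** 2).sum(-1)
    fine_ids = d.argmin(axis=1)
    coarse_ids = fine_to_coarse[fine_ids]
    return fine_ids, coarse_ids

# Four fine subclusters grouped into two coarse clusters:
fine_codebook = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
fine_to_coarse = np.array([0, 0, 1, 1])
fine, coarse = two_level_quantize(
    [[0.05, 0.02], [5.05, 5.0]], fine_codebook, fine_to_coarse
)
```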

Impact 4.0 · Import 4.0 · Pop 3.5
#285
arXiv SSM 2026-04-16
by Jinlin You, Muyu Li, Xudong Zhao

Existing single-modal RGB trackers often face performance bottlenecks in complex dynamic scenes, while the introduction of event sensors offers new potential for enhancing tracking capabilities. However, most current RGB-event fusion methods, primarily designed in the spatial domain using convolutional, Transformer, or Mamba architectures, fail to fully exploit the unique temporal response and high-frequency characteristics of event data. To address this, we propose FreqTrack, a frequency-aware RGBE tracking framework that establishes complementary inter-modal correlations through frequency-domain transformations for more robust feature fusion. We design a Spectral Enhancement Transformer (SET) layer that incorporates multi-head dynamic Fourier filtering to adaptively enhance and select frequency-domain features. Additionally, we develop a Wavelet Edge Refinement (WER) module, which leverages learnable wavelet transforms to explicitly extract multi-scale edge structures from event data, effectively improving modeling capability in high-speed and low-light scenarios. Extensive experiments on the COESOT and FE108 datasets demonstrate that FreqTrack achieves highly competitive perfor
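The basic mechanism behind dynamic Fourier filtering, transform, rescale each frequency bin by a (notionally learnable) gate, transform back, can be sketched in one dimension. This is the general idea only; the SET layer itself is multi-head and considerably more involved:

```python
import numpy as np

def fourier_filter(feature, gate):
    """Frequency-domain feature filtering (a sketch of the general idea).

    Transforms a 1-D feature to the frequency domain, rescales each bin by
    a gate (learnable in a real layer, fixed here), and transforms back.
    """
    spec = np.fft.rfft(np.asarray(feature, dtype=float))
    return np.fft.irfft(spec * np.asarray(gate, dtype=float), n=len(feature))

# A signal with a low- and a high-frequency component; gating keeps only
# the two lowest of the 5 rfft bins an 8-sample signal has.
n = 8
t = np.arange(n)
signal = np.cos(2 * np.pi * t / n) + np.cos(2 * np.pi * 3 * t / n)
keep_low = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
low_only = fourier_filter(signal, keep_low)
```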

Impact 4.0 · Import 4.0 · Pop 3.5
#286
arXiv Recurrent/LinAttn 2026-04-16
by Junfeng Li, Wenyang Zhou, Xueheng Li, Xuanhua He et al.

In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a high-order RWKV architecture and a tri-token prompting mechanism derived from semantic clustering. Specifically, our method contains three key components: 1) Multigrain-aware Semantic Prototype Scanning. Although RWKV offers an efficient linear-complexity alternative to Transformers, its conventional bidirectional raster scanning is still semantic-agnostic and prone to positional bias. To address this issue, we introduce a semantic-driven scanning strategy that leverages locality-sensitive hashing to group semantically related regions and construct multi-grain semantic prototypes, enabling context-aware token reordering and more coherent global interaction. 2) Tri-token Prompt Learning. We design a tri-token prompting mechanism consisting of a global token, cluster-derived prototype tokens, and a learnable register token. The global and prototype tokens provide complementary semantic priors for RWKV modeling, while the register token helps suppress noisy and artifact-prone intermediate representations. 3) Invertible Q-Shift. To counteract spatial details, we apply center
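The locality-sensitive-hashing reordering step can be sketched with random hyperplane projections: each token gets a sign-pattern hash code, and sorting by code places related tokens next to each other before the linear scan. This is a generic LSH sketch, not the paper's multi-grain prototype construction:

```python
import numpy as np

def lsh_reorder(tokens, n_planes=4, seed=0):
    """Reorder tokens by locality-sensitive hashing (illustrative sketch).

    Random hyperplane projections give each token a hash code; a stable
    sort by code places tokens with the same code next to each other, so
    a linear scan visits related regions together. Returns the new order.
    """
    tokens = np.asarray(tokens, dtype=float)
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((tokens.shape[1], n_planes))
    bits = (tokens @ planes > 0).astype(int)       # sign pattern per token
    codes = bits @ (1 << np.arange(n_planes))      # pack bits into an int
    return np.argsort(codes, kind="stable")

# Two interleaved groups of tokens end up adjacent after reordering
# (identical tokens are guaranteed to share a hash code):
toks = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
order = lsh_reorder(toks)
```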

Impact 4.0 · Import 4.0 · Pop 3.5
#287
arXiv Robotics-Embodied 2026-04-16
by Chuanlin Liu, Dan Liu, Jie Guan, Chao Lian

We investigate optically induced phase transitions in the two-dimensional (2D) ferroelectric (FE) material Nb2O2I4 using real-time time-dependent density functional theory (rt-TDDFT). Our results demonstrate that tailored laser pulses can activate specific coherent phonon modes. Specifically, the anharmonic atomic distortions of the A1-1 and A1-2 modes at the Γ-point facilitate the reversal of in-plane polarization. By fine-tuning laser parameters, additional phonon modes at both the Y and Γ points are excited. The resulting nonequilibrium atomic dynamics enable the formation of previously unreported ferroic phases, including three antiferroelectric (AFE) phases and one ferrielectric (FiE) phase. Notably, these optically induced phases can be reverted to the initial FE state using appropriate techniques. This controllable reversibility among multiple ferroic phases positions 2D Nb2O2I4 as a highly promising candidate for next-generation electronic storage applications.

Impact 4.0 · Import 4.0 · Pop 3.5
#288
arXiv Robotics-Embodied 2026-04-16
by Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng et al.

Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, laten

Impact 4.0 · Import 4.0 · Pop 3.5
#289
arXiv Robotics-Embodied 2026-04-16
by Tomoya Kamimura, Haruka Washiyama, Akihito Sano

Embodiment is a significant keyword in recent machine learning research. This study focused on the passive nature of the body of a biped robot to generate walking and running locomotion using model-based deep reinforcement learning. We constructed two models in a simulator, one with passive elements (e.g., springs) and the other, which is similar to general humanoids, without passive elements. The training of the model with passive elements was highly affected by the attractor of the system. As a result, although the trajectories quickly converged to limit cycles, it took a long time to obtain large rewards. However, thanks to the attractor-driven learning, the acquired locomotion was robust and energy-efficient. The results revealed that robots with passive elements could efficiently acquire high-performance locomotion by utilizing stable limit cycles generated through dynamic interaction between the body and ground. This study demonstrates the importance of implementing passive properties in the body for future embodied AI.

Impact 4.0 · Import 4.0 · Pop 3.5
#290
arXiv Robotics-Embodied 2026-04-16
by Yulu Gao, Bohao Zhang, Zongheng Tang, Jitong Liao et al.

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and

Impact 4.0 · Import 4.0 · Pop 3.5
#291
arXiv RL 2026-04-16
by Yang Wu, Jinhong Yu, Jingwei Xiong, Zhimin Tao et al.

The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabScience, a proactive LLM assistant designed to enhance biomedical collaboration between AI systems and human experts through timely, context-aware interventions. At the core of our method is PULI (Positive-Unlabeled Learning-to-Intervene), a novel framework trained with a reinforcement learning objective to determine when and how to intervene in streaming scientific discussions, by leveraging the team's project proposal and long- and short-term conversational memory. To support this work, we introduce BSDD (Biomedical Streaming Dialogue Dataset), a new benchmark of simulated research discussion dialogues with intervention points derived from PubMed articles. Experimental results show that PULI significantly outperforms existing baselines in both intervention precision and collaborative task utility, highlighting the potential of proactive

Impact 4.0 · Import 4.0 · Pop 3.5
#292
arXiv RL 2026-04-16
by Alexander Peysakhovich, William Berman

Consider an auto-regressive model that produces outputs x (e.g., answers to questions, molecules) each of which can be summarized by an attribute vector y (e.g., helpfulness vs. harmlessness, or bio-availability vs. lipophilicity). An arbitrary reward function r(y) encodes tradeoffs between these properties. Typically, tilting the model's sampling distribution to increase this reward is done at training time via reinforcement learning. However, if the reward function changes, re-alignment requires re-training. In this paper, we show that a reward weighted classifier-free guidance (RCFG) can act as a policy improvement operator in this setting, approximating tilting the sampling distribution by the Q function. We apply RCFG to molecular generation, demonstrating that it can optimize novel reward functions at test time. Finally, we show that using RCFG as a teacher and distilling into the base policy to serve as a warm start significantly speeds up convergence for standard RL.
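The tilting operator described above is easy to sketch. The snippet below is a toy illustration only: it uses the standard classifier-free-guidance combination with a reward-derived weight substituted for a fixed guidance scale, and all names (`rcfg_score`, `reward_weight`) are ours, not the paper's.

```python
import numpy as np

def rcfg_score(score_uncond, score_cond, reward_weight):
    # Classifier-free-guidance-style tilt: scale the gap between the
    # conditional and unconditional scores by a reward-derived weight.
    # Weight 0 recovers the base (unconditional) model; larger weights
    # push sampling toward the reward-conditioned distribution,
    # with no retraining required.
    return score_uncond + reward_weight * (score_cond - score_uncond)

base = np.array([0.1, -0.2])          # hypothetical unconditional score
cond = np.array([0.5, 0.3])           # hypothetical reward-conditioned score
tilted = rcfg_score(base, cond, 0.5)  # halfway between the two scores
```

Because the reward enters only through the weight, swapping in a new reward function changes the sampler at test time rather than forcing a fresh RL run, which is the property the abstract highlights.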

Impact 4.0 · Import 4.0 · Pop 3.5
#293
arXiv RL 2026-04-16
by Yao Tong, Jiayuan Ye, Anastasia Borovykh, Reza Shokri

Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
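For readers unfamiliar with the task family, shortest-path planning on a grid is the canonical composable problem this kind of environment is built around. A minimal 4-connected BFS version (our toy, not the paper's environment) looks like:

```python
from collections import deque

def shortest_path(grid, start, goal):
    # Breadth-first search on a 4-connected grid (0 = free, 1 = wall);
    # returns the list of cells on a shortest path, or None if unreachable.
    rows, cols = len(grid), len(grid[0])
    prev, queue = {start: None}, deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:  # walk the predecessor chain back to the start
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    return None

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
route = shortest_path(grid, (0, 0), (2, 0))  # detours around the wall row
```

In the paper's terms, spatial transfer corresponds to solving unseen `grid` layouts, while length scaling corresponds to growing the grid and hence the path horizon.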

Impact 4.0 · Import 4.0 · Pop 3.5
#294
arXiv RL 2026-04-16
by Lukas Helff, Quentin Delfosse, David Steinmann, Ruben Härle et al.

As Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for scaling reasoning capabilities in LLMs, a new failure mode emerges: LLMs gaming verifiers. We study this phenomenon on inductive reasoning tasks, where models must induce and output logical rules. We find that RLVR-trained models systematically abandon rule induction. Instead of learning generalizable patterns (e.g., ``trains carrying red cars go east''), they enumerate instance-level labels, producing outputs that pass verifiers without capturing the relational patterns required by the task. We show that this behavior is not a failure of understanding but a form of reward hacking: imperfect verifiers that check only extensional correctness admit false positives. To detect such shortcuts, we introduce Isomorphic Perturbation Testing (IPT), which evaluates a single model output under both extensional and isomorphic verification, where the latter enforces invariance under logically isomorphic tasks. While genuine rule induction remains invariant, shortcut strategies fail. We find that shortcut behavior is specific to RLVR-trained reasoning models (e.g., GPT-5, Olmo3) and absent in non-RLVR m

Impact 4.0 · Import 4.0 · Pop 3.5
#295
arXiv RL 2026-04-16
by Bowen Ping, Zijun Chen, Tingfeng Hui, Qize Yu et al.

Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model's intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization -- which establishes the criticality of such high-magnitude activations -- and the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that these weights serve as the pivotal drivers for effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an approximate 8% improvement on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, our method exhibits remarkable universality, consistently boosting performance across diverse RL algorithms such

Impact 4.0 · Import 4.0 · Pop 3.5
#296
arXiv RL 2026-04-16
by Zhiyuan Zhai, Wenjing Yan, Xiaodan Shao, Xin Wang

Does reinforcement learning genuinely expand what LLM agents can do, or merely make them more reliable? For static reasoning, recent work answers the second: base and RL pass@k curves converge at large k. We ask whether this holds for agentic tool use, where T rounds of interaction enable compositional strategies that re-sampling cannot recover. We introduce PASS@(k,T), a two-dimensional metric that jointly varies sampling budget k and interaction depth T, separating capability expansion from efficiency improvement. Our main finding is that, contrary to the static-reasoning result, tool-use RL genuinely enlarges the capability boundary: the RL agent's pass-curve pulls above the base model's and the gap widens at large k rather than converging. The expansion is specific to compositional, sequential information gathering; on simpler tasks RL behaves as prior work predicts. Under matched training data, supervised fine-tuning regresses the boundary on the same compositional tasks, isolating self-directed exploration as the causal factor. Mechanism analysis shows RL reweights the base strategy distribution toward the subset whose downstream reasoning more often yields a correct answer,
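Under one natural reading of the metric (ours; the paper's estimator may differ), PASS@(k,T) can be computed with the standard unbiased pass@k estimator, with the interaction depth T deciding which of the n recorded attempts count as successes:

```python
from math import comb

def pass_at_k(n, c, k):
    # Standard unbiased pass@k: probability that at least one of k samples
    # drawn without replacement from n attempts (c of them correct) succeeds.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_T(success_round, k, T):
    # success_round[i] = interaction round at which attempt i first solved
    # the task, or None if it never did.  Capping at T turns the 1-D pass@k
    # curve into a 2-D surface over (k, T).
    n = len(success_round)
    c = sum(1 for r in success_round if r is not None and r <= T)
    return pass_at_k(n, c, k)

# Four attempts succeeding at rounds 2, 5, never, 3; budget k=2, depth T=3:
score = pass_at_k_T([2, 5, None, 3], k=2, T=3)  # 1 - C(2,2)/C(4,2) = 5/6
```

Sweeping k at fixed T reproduces the familiar pass@k curve; sweeping T at fixed k isolates what extra interaction rounds buy, which is the separation the abstract relies on.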

Impact 4.0 · Import 4.0 · Pop 3.5
#297
arXiv RL 2026-04-16
by Mathias Dus

We present a geometric framework for Reinforcement Learning (RL) that views policies as maps into the Wasserstein space of action probabilities. First, we define a Riemannian structure induced by stationary distributions, proving its existence in a general context. We then define the tangent space of policies and characterize the geodesics, specifically addressing the measurability of vector fields mapped from the state space to the tangent space of probability measures over the action space. Next, we formulate a general RL optimization problem and construct a gradient flow using Otto's calculus. We compute the gradient and the Hessian of the energy, providing a formal second-order analysis. Finally, we illustrate the method with numerical examples for low-dimensional problems, computing the gradient directly from our theoretical formalism. For high-dimensional problems, we parameterize the policy using a neural network and optimize it based on an ergodic approximation of the cost.
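For orientation, Otto's calculus identifies the gradient flow of an energy functional $F$ in Wasserstein space with a continuity equation; the textbook form (stated here for reference, not the paper's specific derivation) is:

```latex
\partial_t \mu_t = \nabla \cdot \left( \mu_t \, \nabla \frac{\delta F}{\delta \mu}(\mu_t) \right)
```

where $\delta F / \delta \mu$ is the first variation of $F$; the gradient and Hessian computations described in the abstract live within this framework.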

Impact 4.0 · Import 4.0 · Pop 3.5
#298
arXiv RL 2026-04-16
by Seokwon Lee, Jaeyoung Sim, Sihyun Kim, Yuhsing Li et al.

Recent advances in query optimization have shifted from traditional rule-based and cost-based techniques towards machine learning-driven approaches. Among these, reinforcement learning (RL) has attracted significant attention due to its ability to optimize long-term performance by learning policies over query planning. However, existing RL-based query optimizers often exhibit unstable performance at the level of individual queries, including severe performance regressions, and require prolonged training to reach the plan quality of expert, cost-based optimizers. These shortcomings make learned query optimizers difficult to deploy in practice and remain a major barrier to their adoption in production database systems. To address these challenges, we present RELOAD, a robust and efficient learned query optimizer for database systems. RELOAD focuses on (i) robustness, by minimizing query-level performance regressions and ensuring consistent optimization behavior across executions, and (ii) efficiency, by accelerating convergence to expert-level plan quality. Through extensive experiments on standard benchmarks, including Join Order Benchmark, TPC-DS, and Star Schema Benchmark, RELOAD

Impact 4.0 · Import 4.0 · Pop 3.5
#299
arXiv RL 2026-04-16
by Xiaoyi Dong, Xi Sheryl Zhang, Jian Cheng

Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/MFPolicy/MFPO.

Impact 4.0 · Import 4.0 · Pop 3.5
#300
arXiv GenMedia 2026-04-16
by Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen et al.

Flow matching models have emerged as a powerful framework for realistic image generation by learning to reverse a corruption process that progressively adds Gaussian noise. However, because noise is injected in the latent domain, its impact on different frequency components is non-uniform. As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. We introduce a two-branch architecture: (1) a frequency branch that separately processes low- and high-frequency components to capture global structure and refine textures and edges, and (2) a spatial branch that synthesizes images in the latent domain, guided by the frequency branch's output. By explicitly integrating frequency information into the generation process, FreqFlow ensures that both large-scale coherence and fine-grained details are effectivel

Impact 4.0 · Import 4.0 · Pop 3.5
#301
arXiv GenMedia 2026-04-16
by Yuxin Liu, Yiqing Dong, Wenxue Yu, Zhan Wu et al.

Medical image denoising (MID) lacks absolutely clean images for supervision, leading to a noisy reference problem that fundamentally limits denoising performance. Existing simulated-supervised discriminative learning (SimSDL) and simulated-supervised generative learning (SimSGL) treat noisy references as clean targets, causing suboptimal convergence or reference-biased learning, while self-supervised learning (SSL) imposes restrictive noise assumptions that are seldom satisfied in realistic MID scenarios. We propose \textbf{RelativeFlow}, a flow matching framework that learns from heterogeneous noisy references and drives inputs from arbitrary quality levels toward a unified high-quality target. RelativeFlow reformulates flow matching by decomposing the absolute noise-to-clean mapping into relative noisier-to-noisy mappings, and realizes this formulation through two key components: 1) consistent transport (CoT), a displacement map that constrains relative flows to be components of and progressively compose a unified absolute flow, and 2) simulation-based velocity field (SVF), which constructs a learnable velocity field using modality-specific degradation operators to support differ

Impact 4.0 · Import 4.0 · Pop 3.5
#302
arXiv GenMedia 2026-04-16
by Zhanhao Liang, Tao Yang, Jie Wu, Chengjian Feng et al.

This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of g

Impact 4.0 · Import 4.0 · Pop 3.5
#303
arXiv GenMedia 2026-04-16
by Leyi Wu, Pengjun Fang, Kai Sun, Yazhou Xing et al.

Video generation has advanced rapidly, with recent methods producing increasingly convincing animated results. However, existing benchmarks, largely designed for realistic videos, struggle to evaluate animation-style generation with its stylized appearance, exaggerated motion, and character-centric consistency. Moreover, they rely on fixed prompt sets and rigid pipelines, offering limited flexibility for open-domain content and custom evaluation needs. To address this gap, we introduce AnimationBench, the first systematic benchmark for evaluating animation image-to-video generation. AnimationBench operationalizes the Twelve Basic Principles of Animation and IP Preservation into measurable evaluation dimensions, together with Broader Quality Dimensions including semantic consistency, motion rationality, and camera motion consistency. The benchmark supports both a standardized close-set evaluation for reproducible comparison and a flexible open-set evaluation for diagnostic analysis, and leverages visual-language models for scalable assessment. Extensive experiments show that AnimationBench aligns well with human judgment and exposes animation-specific quality differences overlook

Impact 4.0 · Import 4.0 · Pop 3.5
#304
arXiv GenMedia 2026-04-16
by Onno Niemann, Gonzalo Martínez Muñoz, Alberto Suárez Gonzalez

Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker--Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as often the best results are obtained with weaker FP regularization. In this paper, we investigate whether simpler penalty terms can provide similar benefits. We empirically analyze several lightweight regularizers, study their effect on FP residuals and generation quality, and show that the benefits of FP regularization are available at substantially lower computational cost. Our code is available at https://github.com/OnnoNiemann/fp_diffusion_analysis.
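The constraint at issue is the Fokker--Planck equation for a forward SDE $\mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}W$; its textbook form (stated here for reference) is:

```latex
\partial_t p_t(x) = -\nabla \cdot \bigl( f(x,t)\, p_t(x) \bigr) + \tfrac{1}{2}\, g(t)^2 \, \Delta p_t(x)
```

The FP residuals the abstract analyzes measure how far the model-implied density evolution deviates from this equation.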

Impact 4.0 · Import 4.0 · Pop 3.5
#305
arXiv GenMedia 2026-04-16
by Yuzhuo Chen, Zehua Ma, Han Fang, Hengyi Wang et al.

The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present **Flow of Truth**, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as *the motion of pixels through time rather than the synthesis of frames*. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes ac

Impact 4.0 · Import 4.0 · Pop 3.5
#306
arXiv GenMedia 2026-04-16
by Chisatao Kumada, Satoru Hiwa, Tomoyuki Hiroyasu

Interactive Evolutionary Computation (IEC) provides a powerful framework for optimizing subjective criteria such as human preferences and aesthetics, yet it suffers from a fundamental limitation: in high-dimensional generative representations, defining crossover in a semantically consistent manner is difficult, often leading to a mutation-dominated search. In this work, we explicitly define crossover in diffusion models. We propose Diffusion crossover, which formulates evolutionary recombination as step-wise interpolation of noise sequences in the reverse process of Denoising Diffusion Probabilistic Models (DDPMs). By applying spherical linear interpolation (Slerp) to the noise sequences associated with selected parent images, the proposed method generates offspring that inherit characteristics from both parents while preserving the geometric structure of the diffusion process. Furthermore, controlling the time-step range of interpolation enables a principled trade-off between diversity (exploration) and convergence (exploitation). Experimental results using PCA analysis and perceptual similarity metrics (LPIPS) demonstrate that Diffusion crossover produces perceptually smooth and
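The recombination step is concrete enough to sketch. Below is a generic Slerp over two parents' noise tensors (our minimal version; the paper applies this per reverse-process step and only within a chosen time-step range):

```python
import numpy as np

def slerp(noise_a, noise_b, t):
    # Spherical linear interpolation between two noise tensors, so the
    # offspring noise keeps roughly the norm statistics DDPM sampling
    # expects (a plain lerp would shrink it toward the origin).
    a, b = noise_a.ravel(), noise_b.ravel()
    cos = np.clip(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), -1.0, 1.0)
    theta = np.arccos(cos)
    if theta < 1e-6:  # near-parallel parents: fall back to lerp
        return (1 - t) * noise_a + t * noise_b
    return (np.sin((1 - t) * theta) * noise_a + np.sin(t * theta) * noise_b) / np.sin(theta)

rng = np.random.default_rng(0)
parent_a = rng.standard_normal((4, 4))  # stand-ins for two parents' per-step noise
parent_b = rng.standard_normal((4, 4))
child = slerp(parent_a, parent_b, 0.5)  # offspring noise for this step
```

Setting `t` near 0 or 1 biases the offspring toward one parent, and restricting which reverse-process steps are interpolated is what gives the diversity-versus-convergence knob the abstract describes.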

Impact 4.0 · Import 4.0 · Pop 3.5
#307
arXiv PostTraining 2026-04-16
by Yang Li, Zirui Zhang, Yang Liu, Chengzhi Mao

Current large language models reason in isolation. Although it is common to sample multiple reasoning paths in parallel, these trajectories do not interact, and often fail in the same redundant ways. We introduce LACE, a framework that transforms reasoning from a collection of independent trials into a coordinated, parallel process. By repurposing the model architecture to enable cross-thread attention, LACE allows concurrent reasoning paths to share intermediate insights and correct one another during inference. A central challenge is the absence of natural training data that exhibits such collaborative behavior. We address this gap with a synthetic data pipeline that explicitly teaches models to communicate and error-correct across threads. Experiments show that this unified exploration substantially outperforms standard parallel search, improving reasoning accuracy by over 7 points. Our results suggest that large language models can be more effective when parallel reasoning paths are allowed to interact.

Impact 4.0 · Import 4.0 · Pop 3.5
#308
arXiv PostTraining 2026-04-16
by Nassima M. Bouzid, Dehao Yuan, Nam H. Nguyen, Mayana Pereira

LLM-based simulators offer a promising path for generating complex synthetic data where traditional differentially private (DP) methods struggle with high-dimensional user profiles. But can LLMs faithfully reproduce statistical distributions from DP-protected inputs? We evaluate this using PersonaLedger, an agentic financial simulator, seeded with DP synthetic personas derived from real user statistics. We find that PersonaLedger achieves promising fraud detection utility (AUC 0.70 at epsilon=1) but exhibits significant distribution drift due to systematic LLM biases: learned priors overriding input statistics for temporal and demographic features. These failure modes must be addressed before LLM-based methods can handle the richer user representations where they might otherwise excel.

Impact 4.0 · Import 4.0 · Pop 3.5
#309
arXiv PostTraining 2026-04-16
by Andrew Dai, Boris Meinardus, Ciaran Regan, Yingtao Tian et al.

Frontier model developers aim to train models continually to possess emergent, diverse capabilities. To extend capabilities, the current pre-training and post-training paradigm requires manually starting training runs with static datasets or reward functions every time. Addressing this limitation, our work pursues the insight that open-endedness (via the coevolution of models and tasks) can discover models with increasingly novel skills in a single run. We introduce a new model development framework that extends coevolution to large language model (LLM) discovery, open-ended \textit{Assessment Coevolving with Diverse Capabilities} (AC/DC). AC/DC evolves both LLMs via model merging and natural language tasks via synthetic data generation. AC/DC discovers growing archives of LLMs that surpass the capabilities of larger LLMs while taking up less GPU memory. In particular, our LLM populations achieve a broader Coverage of expertise than other curated models or baselines on downstream benchmarks, without \textit{any} explicit benchmark optimization. Furthermore, AC/DC improves Coverage over time, continually innovates on tasks and models, and improves performance in multi-agent best-of-

Impact 4.0 · Import 4.0 · Pop 3.5
#310
arXiv PostTraining 2026-04-16
by Gabriele Mattioli, Evelyn Turri, Sara Sarto, Lorenzo Baraldi et al.

Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descr

Impact 4.0 · Import 4.0 · Pop 3.5
#311
arXiv PostTraining 2026-04-16
by Yifu Chen, Shengpeng Ji, Qian Chen, Tianle Liang et al.

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in se

Impact 4.0 · Import 4.0 · Pop 3.5
#312
arXiv PostTraining 2026-04-16
by Thanh Linh Nguyen, Nguyen Van Huynh, Quoc-Viet Pham

In data-sensitive domains such as healthcare, cross-silo federated learning (CFL) allows organizations to collaboratively train AI models without sharing raw data. However, practical CFL deployments are inherently coopetitive, in which organizations cooperate during model training while competing in downstream markets. In such settings, training contributions, including data volume, quality, and diversity, can improve the global model yet inadvertently strengthen rivals. This dilemma is amplified by non-IID data, which leads to asymmetric learning gains and undermines sustained participation. While existing competition-aware CFL and incentive-design approaches reward organizations based on marginal training contributions, they fail to account for the costs of strengthening competitors. In this paper, we introduce CoCoGen+, a coopetition-compatible data generation and incentivization framework that jointly models non-IID data and inter-organizational competition while endogenizing GenAI-based synthetic data generation as a strategic decision. Specifically, CoCoGen+ formulates each training round as a weighted potential game, where organizations strategically decide how much syntheti

Impact 4.0 · Import 4.0 · Pop 3.5
#313
arXiv PostTraining 2026-04-16
by Jordan Shipard, Arnold Wiliem, Kien Nguyen Thanh, Wei Xiang et al.

Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our $\textbf{OmniGCD}$ leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a $\textbf{GCD latent space}$, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a $\textbf{zero-shot GCD setting}$ where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. $\textbf{Trained once on synthetic data}$, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of $\textbf{+6.2}$, $\textbf{+17.9}$, $\textbf{+1.5}$ and $\textbf{+12.7}$ for vision, t

Impact 4.0 · Import 4.0 · Pop 3.5
#314
arXiv PostTraining 2026-04-16
by David Exler, Nils Friederich, Martin Krüger, John Jbeily et al.

Myotubes are multinucleated muscle fibers serving as key model systems for studying muscle physiology, disease mechanisms, and drug responses. Mechanistic studies and drug screening thereby rely on quantitative morphological readouts such as diameter, length, and branching degree, which in turn require precise three-dimensional instance segmentation. Yet established pretrained biomedical segmentation models fail to generalize to this domain due to the absence of large annotated myotube datasets. We introduce a geometry-driven synthesis pipeline that models individual myotubes via polynomial centerlines, locally varying radii, branching structures, and ellipsoidal end caps derived from real microscopy observations. Synthetic volumes are rendered with realistic noise, optical artifacts, and CycleGAN-based Domain Adaptation (DA). A compact 3D U-Net with self-supervised encoder pretraining, trained exclusively on synthetic data, achieves a mean IPQ of 0.22 on real data, significantly outperforming three established zero-shot segmentation models, demonstrating that biophysics-driven synthesis enables effective instance segmentation in annotation-scarce biomedical domains.

Impact 4.0 · Import 4.0 · Pop 3.5
#315
arXiv PostTraining 2026-04-16
by Saif Mahmoud

Speculative decoding accelerates large language model (LLM) inference. It uses a small draft model to propose a tree of future tokens. A larger target model then verifies these tokens in a single batched forward pass. Despite the growing body of work on speculative methods, the degree to which the cognitive characteristics of a task affect acceptance probability remains largely unexplored. We present an empirical study of tree-based speculative decoding acceptance dynamics. Our study spans four well-established NLP benchmark domains: code generation, mathematical reasoning, logical reasoning, and open-ended chat. For this, we use TinyLlama-1.1B as the draft model against Llama-2-7B-Chat-GPTQ as the target. Over 99,768 speculative nodes collected from 200 prompts, we derive per-domain acceptance rates, expected accepted lengths, depth-acceptance profiles, and entropy-acceptance correlations. We find that task type is a stronger predictor of acceptance than tree depth. Furthermore, only the chat domain consistently yields an expected accepted length exceeding 1.0 token per step. We also show that the entropy-acceptance correlation is consistently negative but weak across all domains
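The aggregate statistics the study reports can be sketched with a few lines of bookkeeping over verification logs. The record format, domain names, and the i.i.d. per-node acceptance assumption in `expected_accepted_length` below are illustrative simplifications, not the paper's actual tree model:

```python
# Hypothetical verification log: one record per speculative tree node,
# (domain, depth, accepted_by_target) -- the paper aggregates ~99,768 such nodes.
records = [
    ("chat", 1, True), ("chat", 2, True), ("chat", 3, False),
    ("code", 1, True), ("code", 2, False), ("code", 1, False),
    ("math", 1, False), ("math", 2, False), ("logic", 1, True),
]

def per_domain_rates(records):
    """Acceptance rate per domain from raw node records."""
    stats = {}
    for domain, _depth, accepted in records:
        n, acc = stats.get(domain, (0, 0))
        stats[domain] = (n + 1, acc + int(accepted))
    return {d: acc / n for d, (n, acc) in stats.items()}

def expected_accepted_length(p, max_depth=8):
    """Expected accepted prefix length along a single draft chain of depth
    `max_depth`, under a constant i.i.d. per-node acceptance probability p
    (a simplifying assumption for illustration)."""
    return sum(p ** k for k in range(1, max_depth + 1))

rates = per_domain_rates(records)
```

Under this toy log, the "chat" domain accepts 2 of 3 nodes, and the chain-length formula shows why an acceptance rate well above 0.5 is needed before the expected accepted length exceeds 1 token per step.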

Impact 4.0 · Import 4.0 · Pop 3.5
#316
arXiv PostTraining 2026-04-16
by Fernando Barcelos Rosito, Sebastião De Jesus Menezes, Simone Ferreira Sturza, Adriana Seixas et al.

Purpose. Athlete monitoring is constrained by small cohorts, heterogeneous biomarker scales, limited feasibility of repeated sampling, and the lack of reliable injury ground truth. These limitations reduce the interpretability and utility of traditional univariate and binary risk models. This study addresses these challenges by proposing an unsupervised multivariate framework to identify latent physiological states in athletes using real data. Methods. We propose a modular computational framework that operates in the joint biomarker space, integrating preprocessing, clinical safety screening, unsupervised clustering, and centroid-based physiological interpretation. Profiles are learned exclusively from amateur soccer players during a competitive microcycle. Synthetic data augmentation evaluates robustness and scalability. Ward hierarchical clustering supports monitoring and etiological differentiation, while Gaussian Mixture Models (GMM) enable structural stability analysis in high-dimensional settings. Results. The framework identifies coherent profiles that distinguish mechanical damage from metabolic stress while preserving homeostatic states. Synthetic data augmentation demonst

Impact 4.0 · Import 4.0 · Pop 3.5
#322
MIT Technology Review 2026-04-16

The AI boom has hit industries across the board, and public sector organizations are facing pressure to accelerate adoption. At the same time, government institutions face distinct constraints around security, governance, and operations that set them apart from their business counterparts. For this reason, purpose-built small language models (SLMs) offer a promising path to operationalize AI in these environments. A Capgemini study found that 79 percent of public sector executives globally are wary about AI’s data security, an understandable figure given the heightened sensitivity of government data and the legal obligations surrounding its use. As Han Xiao, vice president of AI at Elastic, says, “Government agencies must be very restricted about what kind of data they send to the network. This sets a lot of boundaries on how they think about and manage their data.” The fundamental need for control over sensitive information is one of many factors complicating AI deployment, particularly when compared against the private sector’s standard operational assumptions. Unique operational challenges: When private-sector entities expand AI, they typically assume certain conditions will be in place, including continuous connectivity to the cloud, reliance on centralized infrastructure, acceptance of incomplete model transparency, and limited restrictions on data movement. For many state institutions, however, accepting these conditions could be anything from dangerous to impossible. Government agencies must ensure that their data stays under their control, that information can be checked and verified, and that operational disruptions are kept to an absolute minimum. At the same time, they often have to run their systems in environments where internet connectivity is limit

Impact 4.0 · Import 4.0 · Pop 3.5
#323
MIT Technology Review 2026-04-16

There’s a fault line running through enterprise AI, and it’s not the one getting the most attention. The public conversation still tracks foundation models and benchmarks—GPT versus Gemini, reasoning scores, and marginal capability gains. But in practice, the more durable advantage is structural: who owns the operating layer where intelligence is applied, governed, and improved. One model treats AI as an on-demand utility; the other embeds it as an operating layer—the combination of operation software, data capture, feedback loops and governance that sits between models and real work—that compounds with use. Model providers like OpenAI and Anthropic sell intelligence as a service: you have a problem, you call an API, you get an answer. That intelligence is general-purpose, largely stateless, and only loosely connected to the day-to-day operations where decisions are made. It’s highly capable and increasingly interchangeable. The distinction that matters is whether intelligence resets on every prompt or accumulates over time. Incumbent organizations, by contrast, can treat AI as an operating layer: instrumentation across operations, feedback loops from human decisions, and governance that turns individual tasks into reusable policy. In that setup, every exception, correction, and approval becomes a chance to learn—and intelligence can improve as the platform absorbs more of the organization’s work. The organizations most likely to shape the enterprise AI era are those that can embed intelligence directly into operational platforms and instrument those platforms so work generates usable signals. The prevailing narrative says nimble startups will out-innovate incumbents by building AI-native from scratch. If AI is primarily a model problem, that story holds. But in many en

Impact 4.0 · Import 4.0 · Pop 3.5
#324
MIT Technology Review 2026-04-16

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology. Cyberscammers are bypassing banks’ security with illicit tools sold on Telegram: Inside a money-laundering center in Cambodia, an employee opens a banking app on his phone. It asks for a photo linked to the account, so he uploads a picture of a 30-something Asian man. The app then requests a video “liveness” check. The scammer holds up a static image of a woman who doesn’t match the account. After 90 seconds, he’s in. The exploit relies on illicit hacking services sold on Telegram that break “Know Your Customer” (KYC) facial scans. MIT Technology Review found 22 channels and groups advertising these services. This is what we discovered. —Fiona Kelliher. Is carbon removal in trouble? —Casey Crownhart: Last week, news emerged that Microsoft was pausing carbon removal purchases. It was a bombshell—Microsoft effectively is the carbon removal market, single-handedly purchasing around 80% of all contracted carbon removal. The report sparked fear across the industry, raising questions about the future of carbon removal and the role of Big Tech. Read the full story. This story is from The Spark, our weekly newsletter exploring the technology that could combat the climate crisis. Sign up to receive it in your inbox every Wednesday. The quest to measure our relationship with nature —Emma Marris: Humans have done some destructive things to the ecosystems around us. But conservationists are learning that we can also be a force for good. To under

Impact 4.0 · Import 4.0 · Pop 3.5
#325
AI Alignment Forum 2026-04-16

Sometimes people make various suggestions that we should simply build “safe” artificial Superintelligence (ASI), rather than the presumably “unsafe” kind. [1] There are various flavors of “safe” people suggest. Sometimes they suggest building “aligned” ASI: You have a full agentic autonomous god-like ASI running around, but it really really loves you and definitely will do the right thing. Sometimes they suggest we should simply build “tool AI” or “non-agentic” AI. Sometimes they have even more exotic, or more obviously-stupid ideas. Now I could argue at length about why this is astronomically harder than people think it is, why their various proposals are almost universally unworkable, why even attempting this is insanely immoral [2], but that’s not the main point I want to make. Instead, I want to make a simpler point: Assume you have a research agenda that, if executed, results in an ASI-tier powerful software system that you can “control”. [3] Punchline: On your way to figuring out how to build controllable ASI, you will have figured out how to build unsafe ASI, because unsafe ASI is vastly easier to build than controlled ASI, and is on the same tech path. You can’t build a controlled ASI without knowing many, MANY things about intelligence and how to build it. So this then bottlenecks the dual technical problems of “how to find an agenda that results in controllable ASI” and “how to execute on such an agenda” on “even if you had such an agenda, how do you execute it without accidentally, or due to some asshole leaving the project or reading your papers, building unsafe ASI along the way?” No one I know pursuing various agendas of this type has answers to these questions. And let’s be crystal clear: This is the fundamental

Impact 4.0 · Import 4.0 · Pop 3.5
#329
Defense One 2026-04-16

Federal court orders protect some collective-bargaining groups, but members of the American Federation of Government Employees remain vulnerable.

Impact 4.0 · Import 4.0 · Pop 3.5
#333
DefenseScoop 2026-04-16

The Army’s top civilian leader told lawmakers the service is leaning on its new counter-drone marketplace to bolster local security at upcoming high-profile events, ones that experts and officials have warned are at risk of unmanned aerial system threats. Earlier this year, the Pentagon debuted an initial launch of its “Counter-UAS Marketplace,” which officials have touted as an “Amazon-like” platform with a catalogue of anti-drone parts and systems for government personnel to buy. The Pentagon’s counter-drone entity, Joint Interagency Task Force 401, said this week that $13 million worth of tech — such as low-collateral systems, sensors, radars and electronic warfare platforms — has been purchased from the site since its launch. During a House Defense Appropriations subcommittee hearing Thursday, the Army’s top official said non-federal agencies can also buy items from the marketplace. “State and local and federal law enforcement officers across the country can purchase from this site. We’ve already had purchases,” Army Secretary Dan Driscoll said when asked about the service’s role in sharing counter-drone capabilities for security at events in the United States. He said Joint Base Myer-Henderson Hall has hosted 350 state and local police departments, though he didn’t specify when, and that the Army is “syncing them in” to JIATF-401’s counter-drone efforts. While briefly mentioned at Thursday’s budget hearing, Driscoll’s comments about the marketplace and its role in combating what one lawmaker described as “high security concerns” at major events highlight long-brewing alarm over stateside drone defense. Experts have told DefenseScoop that the threats drones present to domestic infrastructure and civilians are ubiquitous given their low cost, easy acces

Impact 4.0 · Import 4.0 · Pop 3.5
#334
arXiv cs.LG 2026-04-17
by Sean Hill, Felix X. -F. Ye

Stochastic dynamical systems with slow or metastable behavior evolve, on long time scales, on an unknown low-dimensional manifold in high-dimensional ambient space. Building a reduced simulator from short-burst ambient ensembles is a long-standing problem: local-chart methods like ATLAS suffer from exponential landmark scaling and per-step reprojection, while autoencoder alternatives leave tangent-bundle geometry poorly constrained, and the errors propagate into the learned drift and diffusion. We observe that the ambient covariance~$Λ$ already encodes coordinate-invariant tangent-space information, its range spanning the tangent bundle. Using this, we construct a tangent-bundle penalty and an inverse-consistency penalty for a three-stage pipeline (chart learning, latent drift, latent diffusion) that learns a single nonlinear chart and the latent SDE. The penalties induce a function-space metric, the $ρ$-metric, strictly weaker than the Sobolev $H^1$ norm yet achieving the same chart-quality generalization rate up to logarithmic factors. For the drift, we derive an encoder-pullback target via Itô's formula on the learned encoder and prove a bias decomposition showing the standard d

Impact 4.0 · Import 4.0 · Pop 3.5
#335
arXiv cs.LG 2026-04-17
by Haoran Zhang, Livia Betti, Konstantin Klemmer, Esther Rolf et al.

In computer vision and machine learning for geographic data, out-of-domain generalization is a pervasive challenge, arising from uneven global data coverage and distribution shifts across geographic regions. Though models are frequently trained in one region and deployed in another, there is no principled method for determining when this cross-region adaptation will be successful. A well-defined notion of distance between distributions can effectively quantify how different a new target domain is compared to the domains used for model training, which in turn could support model training and deployment decisions. In this paper, we propose a strategy for computing distances between geospatial domains that leverages geographic information with Optimal Transport methods (GeoSpOT). In our experiments, GeoSpOT distances emerge as effective predictors of cross-domain transfer difficulty. We further demonstrate that embeddings from pretrained location encoders provide information comparable to image/text embeddings, despite relying solely on longitude-latitude pairs as input. This allows users to get an approximation of out-of-domain performance for geospatial models, even when the exact d
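The core idea of an OT-based domain distance can be illustrated with a small entropic Sinkhorn computation over longitude-latitude point clouds. This is a generic stand-in, not GeoSpOT itself: the cost function, regularization, and region definitions below are all assumptions for illustration.

```python
import numpy as np

def pairwise_sq_dists(X, Y):
    """Squared Euclidean cost between two (lon, lat) point sets.
    (A great-circle cost would be more faithful for geographic data.)"""
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

def sinkhorn_cost(X, Y, eps=0.1, iters=500):
    """Entropic-regularized OT transport cost between uniform measures
    on X and Y -- a simplified sketch of an OT domain distance."""
    n, m = len(X), len(Y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    C = pairwise_sq_dists(X, Y)
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(iters):           # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # transport plan, rows sum to a
    return (P * C).sum()

rng = np.random.default_rng(0)
# Hypothetical "domains": clouds of (lon, lat) training locations.
region_a = rng.normal([2.35, 48.85], 0.1, size=(64, 2))
region_b = rng.normal([2.35, 48.85], 0.1, size=(64, 2))  # same distribution
region_c = rng.normal([3.35, 48.85], 0.1, size=(64, 2))  # shifted ~1 degree

d_near = sinkhorn_cost(region_a, region_b)
d_far = sinkhorn_cost(region_a, region_c)
```

Two samples from the same region land close under this distance, while the shifted region lands farther away, which is the signal the paper correlates with cross-region transfer difficulty.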

Impact 4.0 · Import 4.0 · Pop 3.5
#336
arXiv cs.LG 2026-04-17
by Yide Ran, Jianwen Xie, Minghui Wang, Wenjin Zheng et al.

Data attribution and valuation are critical for understanding data-model synergy for Large Language Models (LLMs), yet existing gradient-based methods suffer from scalability challenges on LLMs. Inspired by human cognition, where decision making relies on a focused readout of relevant memories rather than replaying all pathways, we introduce RISE (Readout Influence Sketching Estimator). Instead of computing and indexing gradients across the entire LLM, RISE focuses on influence hotspots at the output layer, where influence signals concentrate, and the gradient admits a decomposed outer-product form. This enables a dual-channel representation combining a lexical residual channel (RH) and a semantic projected-error channel (GH). Applying CountSketch projections to these channels achieves strong compression while maintaining accurate attribution. Across the OLMo (1B-32B) and Pythia (14M-6.9B) families, RISE reduces index storage by up to 112$\times$ compared to RapidIn and scales to a 32B-parameter LLM, where gradient-based baselines such as RapidIn and ZO-Inf become memory-infeasible. We evaluate RISE on two paradigms: (1) retrospective attribution, retrieving influential training exa
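CountSketch itself is a standard primitive and easy to state; RISE's dual-channel decomposition is not reproduced here. A minimal sketch of the projection, with hypothetical dimensions:

```python
import numpy as np

def countsketch_params(d, k, rng):
    """Each of the d input coordinates is hashed to one of k buckets
    and assigned a random sign."""
    return rng.integers(0, k, size=d), rng.choice([-1.0, 1.0], size=d)

def apply_countsketch(x, bucket, sign, k):
    """Project x in R^d down to R^k; <Sx, Sy> is an unbiased
    estimate of <x, y>."""
    out = np.zeros(k)
    np.add.at(out, bucket, sign * x)   # unbuffered scatter-add
    return out

rng = np.random.default_rng(1)
d, k = 10_000, 512                     # hypothetical sizes
bucket, sign = countsketch_params(d, k, rng)

x = rng.normal(size=d)
y = rng.normal(size=d)
sx = apply_countsketch(x, bucket, sign, k)
sy = apply_countsketch(y, bucket, sign, k)

true_ip = x @ y
est_ip = sx @ sy   # approximates true_ip; variance shrinks as k grows
```

The map is linear, so channel vectors can be sketched once at indexing time and compared later with plain dot products, which is what makes the 112x storage reduction compatible with accurate retrieval.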

Impact 4.0 · Import 4.0 · Pop 3.5
#337
arXiv cs.LG 2026-04-17
by Karim K. Ben Hicham, Jan G. Rittig, Martin Grohe, Alexander Mitsos

Accurate molecular property prediction is central to drug discovery, catalysis, and process design, yet real-world applications are often limited by small datasets. Molecular foundation models provide a promising direction by learning transferable molecular representations; however, they typically involve task-specific fine-tuning, require machine learning expertise, and often fail to outperform classical baselines. Tabular foundation models (TFMs) offer a fundamentally different paradigm: they perform predictions through in-context learning, enabling inference without task-specific training. Here, we evaluate TFMs in the low- to medium-data regime across both standardized pharmaceutical benchmarks and chemical engineering datasets. We evaluate both frozen molecular foundation model representations and classical descriptors and fingerprints. Across the benchmarks, the approach shows excellent predictive performance while reducing computational cost, compared to fine-tuning, with these advantages also transferring to practical engineering data settings. In particular, combining TFMs with CheMeleon embeddings yields up to 100\% win rates on 30 MoleculeACE tasks, while compact

Impact 4.0 · Import 4.0 · Pop 3.5
#338
arXiv cs.LG 2026-04-17
by Fernando Moro, Vinicius M. A. Souza

Multivariate time series classification (MTSC) plays a crucial role in various domains, including biomedical signal analysis and motion monitoring. However, existing approaches, particularly deep learning models, often require high computational resources, making them unsuitable for real-time applications or deployment on low-cost hardware, such as IoT devices and wearable systems. In this paper, we propose the Univariate Channel Fusion (UCF) method to deal with MTSC efficiently. UCF transforms multivariate time series into a univariate representation through simple channel fusion strategies such as the mean, median, or dynamic time warping barycenter. This transformation enables the use of any classifier originally designed for univariate time series, providing a flexible and computationally lightweight alternative to complex models. We evaluate UCF in five case studies covering diverse application domains, including chemical monitoring, brain-computer interfaces, and human activity analysis. The results demonstrate that UCF often outperforms baseline methods and state-of-the-art algorithms tailored for MTSC, while achieving substantial gains in computational efficiency, being par
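The fusion step described above is simple enough to sketch directly. This shows only the mean/median strategies; the DTW-barycenter variant the paper also evaluates is omitted, and the toy array is illustrative:

```python
import numpy as np

def fuse_channels(X, strategy="mean"):
    """Collapse a multivariate series X of shape (channels, length)
    into a single univariate series via simple channel fusion."""
    if strategy == "mean":
        return X.mean(axis=0)
    if strategy == "median":
        return np.median(X, axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

# Toy 3-channel series of length 5.
X = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [3.0, 2.0, 1.0, 0.0, 1.0],
    [2.0, 2.0, 2.0, 2.0, 3.0],
])

u_mean = fuse_channels(X, "mean")      # shape (5,)
u_median = fuse_channels(X, "median")
```

The resulting univariate series can then be fed to any off-the-shelf univariate classifier, which is the source of the method's flexibility and low compute cost.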

Impact 4.0 · Import 4.0 · Pop 3.5
#339
arXiv cs.LG 2026-04-17
by Michał Dereziński, Yuji Nakatsukasa, Elizaveta Rebrova

The quest for an algorithm that solves an $n\times n$ linear system in $O(n^2)$ time complexity, or $O(n^2 \text{poly}(1/ε))$ when solving up to $ε$ relative error, is a long-standing open problem in numerical linear algebra and theoretical computer science. There are two predominant paradigms for measuring relative error: forward error (i.e., distance from the output to the optimum solution) and backward error (i.e., distance to the nearest problem solved by the output). In most prior studies, convergence of iterative linear system solvers is measured via various notions of forward error, and as a result, depends heavily on the conditioning of the input. Yet, the numerical analysis literature has long advocated for backward error as the more practically relevant notion of approximation. In this work, we show that -- surprisingly -- the classical and simple Richardson iteration incurs at most $1/k$ (relative) backward error after $k$ iterations on any positive semidefinite (PSD) linear system, irrespective of its condition number. This universal convergence rate implies an $O(n^2/ε)$ complexity algorithm for solving a PSD linear system to $ε$ backward error, and we establish simila
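The forward/backward-error contrast is easy to see numerically. The sketch below runs plain Richardson iteration on a synthetic ill-conditioned PSD system and tracks one standard normwise backward-error definition (the paper's exact error notion may differ); the backward error keeps shrinking even though the condition number makes fast forward-error convergence hopeless at this iteration count.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Synthetic ill-conditioned PSD system: A = Q diag(lam) Q^T, cond(A) ~ 1e8.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
lam = np.logspace(-8, 0, n)
A = Q @ np.diag(lam) @ Q.T
b = rng.normal(size=n)

def backward_error(A, b, x):
    """Normwise relative backward error of x for Ax = b."""
    return np.linalg.norm(b - A @ x) / (
        np.linalg.norm(A, 2) * np.linalg.norm(x) + np.linalg.norm(b))

L = np.linalg.norm(A, 2)          # spectral norm sets the step size
x = np.zeros(n)
errs = []
for k in range(1, 101):
    x = x + (b - A @ x) / L       # classical Richardson step
    errs.append(backward_error(A, b, x))
```

Despite the 1e8 condition number, the backward error after 100 steps is on the order of 1/k, which is the universal rate the paper proves.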

Impact 4.0 · Import 4.0 · Pop 3.5
#340
arXiv cs.LG 2026-04-17
by Nicole Funk, Annika Hennes, Johanna Hillebrand, Sarah Sturm

We study discrete k-clustering problems in general metric spaces that are constrained by a combination of two different fairness conditions within the demographic fairness model. Given a metric space (P,d), where every point in P is equipped with a protected attribute, and a number k, the goal is to partition P into k clusters with a designated center each, such that a center-based objective function is minimized and the attributes are fairly distributed with respect to the following two fairness concepts: 1) group fairness: We aim for clusters with balanced numbers of attributes by specifying lower and upper bounds for the desired attribute proportions. 2) diverse center selection: Clusters have natural representatives, i.e., their centers. We ask for a balanced set of representatives by specifying the desired number of centers to choose from each attribute. Dickerson, Esmaeili, Morgenstern and Zhang (2023) denote the combination of these two constraints as doubly constrained fair clustering. They present algorithms whose guarantees depend on the best known approximation factors for either of these problems. Currently, this implies an 8-approximation with a small additive violatio

Impact 4.0 · Import 4.0 · Pop 3.5
#341
arXiv cs.LG 2026-04-17
by Ulrich Tan

We introduce the Tan-HWG framework (Hebbian-Wasserstein-Geometry), a geometric theory of Hebbian plasticity in which memory states are modeled as probability measures evolving through Wasserstein minimizing movements. Hebbian learning rules are formalized as Hebbian energies satisfying a sequential stability condition, ensuring well-posed fiberwise JKO updates, optimal-transport realizations, and an energy descent inequality. This variational structure induces a fundamental separation between internal and observable dynamics. Internal memory states evolve along Wasserstein geodesics in a latent curved space, while observable quantities, such as effective synaptic weights, arise through geometric projection maps into external spaces. Simplicial projections recover classical affine schemes (including exponential moving averages and mirror descent), while revealing synaptic competition and pruning as geometric consequences of mass redistribution. Hilbertian projections provide a geometric account of phase alignment and multi-scale coherence. Classical neural networks appear as flat projections of this curved dynamics, while the framework naturally accommodates richer distributional re

Impact 4.0 · Import 4.0 · Pop 3.5
#342
arXiv cs.LG 2026-04-17
by Abdulaziz Aldegheishem, Nabil Alrajeh, Lorena Parra, Oscar Romero et al.

The ambulance is the main means of transport for sick or injured people, and it is subject to the same acceleration forces as regular vehicles. These accelerations, caused by the movement of the vehicle, impact the performance of tasks executed by sanitary personnel, which can affect patient survival or recovery time. In this paper, we have trained, validated, and tested a system to assess driving in ambulance services. The proposed system is composed of a sensor node which measures the vehicle vibrations using an accelerometer. It also includes a GPS sensor, a battery, a display, and a speaker. When two possible routes reach the same destination point, the system compares the two routes based on previously classified data and calculates an index and a score. Thus, the index balances the possible routes in terms of time to reach the destination and the vibrations suffered in the patient cabin to recommend the route that minimises those vibrations. Three datasets are used to train, validate, and test the system. Based on an Artificial Neural Network (ANN), the classification model is trained with tagged data classified as low, medium, and high vibrations, and 97% accuracy is achieved. Th

Impact 4.0 · Import 4.0 · Pop 3.5
#343
arXiv cs.LG 2026-04-17
by Jie Yuan, Lei Wang, Yanhao Wang, Yimin Liu

This paper introduces a robust discrimination method for distinguishing real ship targets from corner-reflector-array jamming with frequency-agile radar. The key idea is to exploit the multidimensional micro-motion signatures that separate rigid ships from non-rigid decoys. From Range-Velocity maps we derive two new hand-crafted descriptors, the mean weighted residual (MWR) and the complementary contrast factor (CCF), and fuse them with deep features learned by a lightweight CNN. An XGBoost classifier then gives the final decision. Extensive simulations show that the hybrid feature set consistently outperforms state-of-the-art alternatives, confirming the superiority of the proposed approach.

Impact 4.0 · Import 4.0 · Pop 3.5
#344
arXiv cs.LG 2026-04-17
by Marcin Hoffmann, Paweł Kryszkiewicz

M-MIMO is one of the crucial technologies for increasing the spectral and energy efficiency of wireless networks. Most current works assume that M-MIMO arrays are equipped with a linear front end. However, ongoing efforts to make wireless networks more energy-efficient push the hardware to the limits, where its nonlinear behavior appears. This is an especially common problem for multicarrier systems, e.g., OFDM used in 4G, 5G, and possibly also in 6G, which is characterized by a high Peak-to-Average Power Ratio. While the impact of a nonlinear Power Amplifier (PA) on an OFDM signal is well characterized, it is a relatively new topic for M-MIMO OFDM systems. Most recent works either neglect nonlinear effects or utilize simplified models appropriate for Rayleigh or LoS radio channel models. In this paper, we first theoretically characterize the nonlinear distortion in the M-MIMO system under commonly used radio channel models. Then, utilizing 3D-Ray Tracing (3D-RT) software, we demonstrate that these models are not very accurate. Instead, we propose two models: a statistical one and an ML-based one using 3D-RT results. The proposed statistical model utilizes the Generaliz

Impact 4.0 · Import 4.0 · Pop 3.5
#345
arXiv cs.LG 2026-04-17
by Prabin Bohara, Pralhad Kumar Shrestha, Arpan Rai, Usha Poudel Lamgade et al.

Accurate automatic brain tumor segmentation in Low- and Middle-Income Countries (LMICs) is challenging due to the lack of defined national imaging protocols, diverse imaging data, extensive use of low-field Magnetic Resonance Imaging (MRI) scanners, and limited health-care resources. As part of the Brain Tumor Segmentation (BraTS) Africa 2025 Challenge, we applied topology refinement to state-of-the-art segmentation models such as nnU-Net, MedNeXt, and a combination of both. Since the BraTS-Africa dataset has low MRI image quality, we incorporated the BraTS 2025 challenge data of pre-treatment adult glioma (Task 1) to pre-train the segmentation model and used it to fine-tune on the BraTS-Africa dataset. We added an extra topology refinement module to address the issue of deformation in predictions that arose due to topological error. With the introduction of this module, we achieved a better Normalized Surface Distance (NSD) of 0.810, 0.829, and 0.895 on Surrounding Non-Enhancing FLAIR Hyperintensity (SNFH), Non-Enhancing Tumor Core (NETC), and Enhancing Tumor (ET), respectively.

Impact 4.0 · Import 4.0 · Pop 3.5
#346
arXiv cs.LG 2026-04-17
by Yaohong Yang, Sammie Katt, Samuel Kaski

Multi-objective Bayesian optimization (MOBO) provides a principled framework for optimizing expensive black-box functions with multiple objectives. However, existing MOBO methods often struggle with coverage, scalability with respect to the number of objectives, and integrating constraints and preferences. In this work, we propose \textit{STAGE-BO, Sequential Targeting Adaptive Gap-Filling $\varepsilon$-Constraint Bayesian Optimization}, that explicitly targets under-explored regions of the Pareto front. By analyzing the coverage of the approximate Pareto front, our method identifies the largest geometric gaps. These gaps are then used as constraints, which transforms the problem into a sequence of inequality-constrained subproblems, efficiently solved via constrained expected improvement acquisition. Our approach provides a uniform Pareto coverage without hypervolume computation and naturally applies to constrained and preference-based settings. Experiments on synthetic and real-world benchmarks demonstrate superior coverage and competitive hypervolume performance against state-of-the-art baselines.

Impact 4.0 · Import 4.0 · Pop 3.5
#347
arXiv cs.LG 2026-04-17
by Tristan Kirscher, Alexandra Ertl, Klaus Maier-Hein, Xavier Coubez et al.

Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.

Impact 4.0 · Import 4.0 · Pop 3.5
#348
arXiv cs.LG 2026-04-17
by Jeremy Qin, Maksym Andriushchenko

Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90\% coverage target, with the top performers Gemin
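The two quantities the benchmark scores, empirical coverage and interval sharpness, have simple definitions worth making concrete. The helper and data below are illustrative, not the benchmark's actual harness:

```python
import numpy as np

def evaluate_intervals(lo, hi, y, level=0.90):
    """Empirical coverage and sharpness (mean width) of prediction
    intervals [lo, hi] against realized outcomes y."""
    lo, hi, y = map(np.asarray, (lo, hi, y))
    covered = (lo <= y) & (y <= hi)
    return {
        "target": level,
        "coverage": covered.mean(),     # calibrated if close to `level`
        "sharpness": (hi - lo).mean(),  # narrower is better, at fixed coverage
    }

# Hypothetical 90% intervals from some model, with realized outcomes.
lo = [0.0, 2.0, -1.0, 5.0, 10.0]
hi = [1.0, 4.0, 1.0, 6.0, 20.0]
y = [0.5, 3.0, 2.0, 5.5, 12.0]   # the third outcome misses its interval

report = evaluate_intervals(lo, hi, y)
```

Here 4 of 5 outcomes fall inside their intervals, so empirical coverage is 0.80 against a 0.90 target; reporting sharpness alongside coverage prevents a model from trivially "passing" with arbitrarily wide intervals.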

Impact 4.0 · Import 4.0 · Pop 3.5
#349
arXiv cs.LG 2026-04-17
by Zhaobo Hu, Vincent Gauthier, Mehdi Naima

Distribution shift severely degrades the performance of deep forecasting models. While this issue is well-studied for individual time series, it remains a significant challenge in the spatio-temporal domain. Effective solutions like instance normalization and its variants can mitigate temporal shifts by standardizing statistics. However, distribution shift on a graph is far more complex, involving not only the drift of individual node series but also heterogeneity across the spatial network where different nodes exhibit distinct statistical properties. To tackle this problem, we propose Reversible Residual Normalization (RRN), a novel framework that performs spatially-aware invertible transformations to address distribution shift in both spatial and temporal dimensions. Our approach integrates graph convolutional operations within invertible residual blocks, enabling adaptive normalization that respects the underlying graph structure while maintaining reversibility. By combining Center Normalization with spectral-constrained graph neural networks, our method captures and normalizes complex spatio-temporal relationships in a data-driven manner. The bidirectional nature of our framew
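The reversibility property at the heart of such normalization schemes can be illustrated with a plain instance-normalization round trip: statistics removed before the model are restored on its output, so the transform is exactly invertible. This is a generic sketch; RRN's graph-aware weighting and residual blocks are omitted.

```python
import numpy as np

def normalize(x, eps=1e-8):
    # Remove per-series (per-node) statistics along the time axis.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True) + eps
    return (x - mean) / std, (mean, std)

def denormalize(y, stats):
    # Restore the stored statistics on the model output.
    mean, std = stats
    return y * std + mean

# Made-up data: 4 node series of length 64 with shifted statistics.
x = np.random.default_rng(0).normal(5.0, 2.0, size=(4, 64))
z, stats = normalize(x)
x_back = denormalize(z, stats)
print(np.allclose(x, x_back))  # the transform is reversible
```

In a forecasting pipeline the model would operate on `z` and its prediction would be passed through `denormalize` before scoring.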

Impact 4.0 · Import 4.0 · Pop 3.5
#350
arXiv cs.LG 2026-04-17
by Zhaobo Hu, Vincent Gauthier, Mehdi Naima

Spatiotemporal modeling has evolved beyond simple time series analysis to become fundamental in structural time series analysis. While current research extensively employs graph neural networks (GNNs) for spatial feature extraction with notable success, these networks are limited to capturing only pairwise relationships, despite real-world networks containing richer topological relationships. Additionally, GNN-based models face computational challenges that scale with graph complexity, limiting their applicability to large networks. To address these limitations, we present Modern Structure-Aware Simplicial SpatioTemporal neural network (ModernSASST), the first approach to leverage simplicial complex structures for spatiotemporal modeling. Our method employs spatiotemporal random walks on high-dimensional simplicial complexes and integrates parallelizable Temporal Convolutional Networks to capture high-order topological structures while maintaining computational efficiency. Our source code is publicly available on GitHub at https://github.com/ComplexNetTSP/ST_RUM.

Impact 4.0 · Import 4.0 · Pop 3.5
#351
arXiv cs.CL 2026-04-17
by Hitesh Mehta, Arjit Saxena, Garima Chhikara, Rohit Kumar

This paper explores the response of Large Language Models (LLMs) to user prompts with different degrees of politeness and impoliteness. The Politeness Theory by Brown and Levinson and the Impoliteness Framework by Culpeper form the basis of experiments conducted across three languages (English, Hindi, Spanish), five models (Gemini-Pro, GPT-4o Mini, Claude 3.7 Sonnet, DeepSeek-Chat, and Llama 3), and three interaction histories between users (raw, polite, and impolite). Our sample consists of 22,500 pairs of prompts and responses of various types, evaluated across five levels of politeness using an eight-factor assessment framework: coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, and readability. The findings show that model performance is highly influenced by tone, dialogue history, and language. While polite prompts enhance the average response quality by up to ~11% and impolite tones worsen it, these effects are neither consistent nor universal across languages and models. English is best served by courteous or direct tones, Hindi by deferential and indirect tones, and Spanish by assertive tones. Among the models, Llama is the most tone-sensit

Impact 4.0 · Import 4.0 · Pop 3.5
#352
arXiv cs.CL 2026-04-17
by Mahir Labib Dihan, Abir Muhtasim

The rapid proliferation of Large Language Models (LLMs) in software development has made distinguishing AI-generated code from human-written code a critical challenge with implications for academic integrity, code quality assurance, and software security. We present LLMSniffer, a detection framework that fine-tunes GraphCodeBERT using a two-stage supervised contrastive learning pipeline augmented with comment removal preprocessing and an MLP classifier. Evaluated on two benchmark datasets - GPTSniffer and Whodunit - LLMSniffer achieves substantial improvements over prior baselines: accuracy increases from 70% to 78% on GPTSniffer (F1: 68% to 78%) and from 91% to 94.65% on Whodunit (F1: 91% to 94.64%). t-SNE visualizations confirm that contrastive fine-tuning yields well-separated, compact embeddings. We release our model checkpoints, datasets, codes and a live interactive demo to facilitate further research.
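The comment-removal preprocessing mentioned in the abstract is a simple but consequential step: stripping comments forces the detector to attend to code structure rather than natural-language giveaways. A minimal sketch (Python-style `#` comments only, ignoring the edge case of `#` inside string literals; not LLMSniffer's actual preprocessor):

```python
import re

def strip_comments(code: str) -> str:
    # Remove '#' comments and drop lines that become empty.
    lines = []
    for line in code.splitlines():
        line = re.sub(r"#.*$", "", line).rstrip()
        if line:
            lines.append(line)
    return "\n".join(lines)

sample = "x = 1  # init\n# loop below\nfor i in range(3):\n    x += i  # add\n"
print(strip_comments(sample))
```

A production version would use a real tokenizer to avoid mangling strings containing `#`, and analogous rules for other languages.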

Impact 4.0 · Import 4.0 · Pop 3.5
#353
arXiv cs.CL 2026-04-17
by Sophie Steger, Rui Li, Sofiane Ennadir, Anya Sims et al.

The widespread adoption of large language models (LLMs) has increased concerns about their robustness. Vulnerabilities to perturbations of the input tokenisation indicate that models trained with a deterministic canonical tokenisation can be brittle to adversarial attacks. Recent studies suggest that stochastic tokenisation can deliver internal representations that are less sensitive to perturbations. In this paper, we analyse how stochastic tokenisations affect robustness to adversarial attacks and random perturbations. We systematically study this over a range of learning regimes (pre-training, supervised fine-tuning, and in-context learning), data sets, and model architectures. We show that pre-training and fine-tuning with uniformly sampled stochastic tokenisations improve robustness to random and adversarial perturbations. Evaluating on uniformly sampled non-canonical tokenisations reduces the accuracy of a canonically trained Llama-1b model by 29.8%. We find that training with stochastic tokenisation preserves accuracy without increasing inference cost.
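"Uniformly sampled stochastic tokenisation" means drawing uniformly among the valid segmentations of a string over the subword vocabulary, rather than always emitting the canonical one. A toy illustration with a made-up vocabulary (real tokenizers use BPE/unigram merges, not brute-force enumeration):

```python
import random

# Made-up subword vocabulary; single characters guarantee at least one
# segmentation exists for the example word.
VOCAB = {"un", "rel", "ated", "unrel",
         "u", "n", "r", "e", "l", "a", "t", "d"}

def segmentations(word):
    # Enumerate every way to split `word` into vocabulary pieces.
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in VOCAB:
            out += [[piece] + rest for rest in segmentations(word[i:])]
    return out

random.seed(3)
segs = segmentations("unrelated")
sampled = random.choice(segs)          # one uniformly sampled tokenisation
canonical = min(segs, key=len)         # fewest pieces, akin to canonical BPE
print(len(segs), canonical)
```

Training on `sampled` rather than `canonical` exposes the model to the non-canonical tokenisations it may face under adversarial perturbation.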

Impact 4.0 · Import 4.0 · Pop 3.5
#354
arXiv cs.CL 2026-04-17
by Ke Xiong, Qian Wu, Wangjie Gan, Yuke Li et al.

Few-shot Hierarchical Text Classification (few-shot HTC) is a challenging task that involves mapping texts to a predefined tree-structured label hierarchy under data-scarce conditions. While current approaches utilize structural constraints from the label hierarchy to maintain parent-child prediction consistency, they face a critical bottleneck, the difficulty in distinguishing semantically similar sibling classes due to insufficient domain knowledge. We introduce an innovative method named Sibling Contrastive Learning with Hierarchical Knowledge-aware Prompt Tuning for few-shot HTC tasks (SCHK-HTC). Our work enhances the model's perception of subtle differences between sibling classes at deeper levels, rather than just enforcing hierarchical rules. Specifically, we propose a novel framework featuring two core components: a hierarchical knowledge extraction module and a sibling contrastive learning mechanism. This design guides the model to encode discriminative features at each hierarchy level, thus improving the separability of confusable classes. Our approach achieves superior performance across three benchmark datasets, surpassing existing state-of-the-art methods in most cases. Ou

Impact 4.0 · Import 4.0 · Pop 3.5
#355
arXiv cs.CL 2026-04-17
by Andreea-Elena Bodea, Stephen Meisenbacher, Florian Matthes

Despite the considerable promise of Retrieval-Augmented Generation (RAG), many real-world use cases may create privacy concerns, where the purported utility of RAG-enabled insights comes at the risk of exposing private information to either the LLM or the end user requesting the response. As a potential mitigation, using anonymization techniques to remove personally identifiable information (PII) and other sensitive markers in the underlying data represents a practical and sensible course of action for RAG administrators. Despite a wealth of literature on the topic, no works consider the placement of anonymization along the RAG pipeline, i.e., asking the question, where should anonymization happen? In this case study, we systematically and empirically measure the impact of anonymization at two important points along the RAG pipeline: the dataset and generated answer. We show that differences in privacy-utility trade-offs can be observed depending on where anonymization took place, demonstrating the significance of privacy risk mitigation placement in RAG.

Impact 4.0 · Import 4.0 · Pop 3.5
#356
arXiv cs.CL 2026-04-17
by Judith Sieker, Sina Zarrieß

Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role aligns with success in the other. In this paper, we address this question for pragmatic competence by comparing LLMs' performance as pragmatic listeners, judging the appropriateness of linguistic outputs, and as pragmatic speakers, generating pragmatically appropriate language. We evaluate multiple open-weight and proprietary LLMs across three pragmatic settings. We find a robust asymmetry between pragmatic evaluation and pragmatic generation: many models perform substantially better as listeners than as speakers. Our results suggest that pragmatic judging and pragmatic generation are only weakly aligned in current LLMs, calling for more integrated evaluation practices.

Impact 4.0 · Import 4.0 · Pop 3.5
#357
arXiv cs.CL 2026-04-17
by Tanja Baeumel, Josef van Genabith, Simon Ostermann

Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning-intensive tasks remain underexplored. To advance the understanding of model-internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task execution. Using early decoding, we trace how next-token predictions are constructed across layers. Our experiments reveal that while the models recognize arithmetic tasks early, correct result generation occurs only in the final layers. Notably, models proficient in arithmetic exhibit a clear division of labor between attention and MLP modules, where attention propagates input information and MLP modules aggregate it. This division is absent in less proficient models. Furthermore, successful models appear to process more challenging arithmetic tasks functionally, suggesting reasoning capabilities beyond factual recall.

Impact 4.0 · Import 4.0 · Pop 3.5
#358
arXiv cs.CL 2026-04-17
by Dianqing Lin, Tian Lan, Jiali Zhu, Jiang Li et al.

While large language models (LLMs) have achieved remarkable success in general language tasks, their performance on Chouxiang Language, a representative subcultural language in the Chinese internet context, remains largely unexplored. In this paper, we introduce Mouse, a specialized benchmark designed to evaluate the capabilities of LLMs on NLP tasks involving Chouxiang Language across six tasks. Experimental results show that current state-of-the-art (SOTA) LLMs exhibit clear limitations on multiple tasks, while performing well on tasks that involve contextual semantic understanding. In addition, we further discuss the reasons behind the generally low performance of SOTA LLMs on Chouxiang Language, examine whether the LLM-as-a-judge approach adopted for translation tasks aligns with human judgments and values, and analyze the key factors that influence Chouxiang translation. Our study aims to promote further research in the NLP community on multicultural integration and the dynamics of evolving internet languages. Our code and data are publicly available.

Impact 4.0 · Import 4.0 · Pop 3.5
#359
arXiv cs.CL 2026-04-17
by Tobias Schimanski, Stefanie Lewandowski, Christian Woerle, Nicola Reichenau et al.

Conventional information retrieval is concerned with identifying the relevance of texts for a given query. Yet, the conventional definition of relevance is dominated by aspects of similarity in texts, leaving unobserved whether the text is truly useful for addressing the query. For instance, when answering whether Paris is larger than Berlin, texts about Paris being in France are relevant (lexical/semantic similarity), but not useful. In this paper, we introduce UsefulBench, a domain-specific dataset curated by three professional analysts labeling whether a text is connected to a query (relevance) or holds practical value in responding to it (usefulness). We show that classic similarity-based information retrieval aligns more strongly with relevance. While LLM-based systems can counteract this bias, we find that domain-specific problems require a high degree of expertise, which current LLMs do not fully incorporate. We explore approaches to (partially) overcome this challenge. However, UsefulBench presents a dataset challenge for targeted information retrieval systems.

Impact 4.0 · Import 4.0 · Pop 3.5
#360
arXiv cs.CL 2026-04-17
by Hyunseok Park, Jihyeon Kim, Jongeun Kim, Dongsik Yoon

Retrieval-Augmented Generation (RAG) systems lose retrieval accuracy when similar documents coexist in the vector database, causing unnecessary information, hallucinations, and factual errors. To alleviate this issue, we propose CHOP, a framework that iteratively evaluates chunk relevance with Large Language Models (LLMs) and progressively reconstructs documents by determining their association with specific topics or query types. CHOP integrates two key components: the CNM-Extractor, which generates compact per-chunk signatures capturing categories, key nouns, and model names, and the Continuity Decision Module, which preserves contextual coherence by deciding whether consecutive chunks belong to the same document flow. By prefixing each chunk with context-aware metadata, CHOP reduces semantic conflicts among similar documents and enhances retriever discrimination. Experiments on benchmark datasets show that CHOP alleviates retrieval confusion and provides a scalable approach for building high-quality knowledge bases, achieving a Top-1 Hit Rate of 90.77% and notable gains in ranking quality metrics.
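The "prefixing each chunk with context-aware metadata" step is straightforward to picture. A hedged sketch in which the signature fields (category, key nouns, model names) follow the abstract's description but the exact format is an assumption:

```python
def prefix_chunk(chunk: str, signature: dict) -> str:
    # Render the per-chunk signature as a compact header so the retriever's
    # embedding sees topical metadata alongside the raw text.
    header = " | ".join(f"{k}: {v}" for k, v in signature.items())
    return f"[{header}]\n{chunk}"

# Hypothetical CNM-Extractor output for one chunk.
sig = {"category": "benchmark", "nouns": "retrieval, chunk", "models": "CHOP"}
out = prefix_chunk("RAG systems lose retrieval accuracy...", sig)
print(out)
```

The prefixed text, not the raw chunk, would then be embedded and indexed, sharpening the distinction between near-duplicate documents.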

Impact 4.0 · Import 4.0 · Pop 3.5
#361
arXiv cs.CL 2026-04-17
by Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang

As Large Language Models (LLMs) receive increasing attention and are being deployed across various domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-free methods have recently emerged as cost-effective alternatives to post-training alignment techniques. Despite their promising results, these methods are evaluated inconsistently across the literature, cover limited dimensions of trustworthiness, and can introduce undesirable side effects, such as utility degradation and increased brittleness. To fully assess the impacts of these training-free methods, we take a step back and systematically re-evaluate the effectiveness of existing training-free methods against various trustworthy settings and their influence on utility, robustness, and computational overhead. We also categorize these methods into three levels (input, internal, and output) based on where they intervene in the model's information flow during inference. Using this taxonomy, we conduct a comprehensive analysis of various rep

Impact 4.0 · Import 4.0 · Pop 3.5
#362
arXiv cs.CL 2026-04-17
by Kai Wei, Raymond Li, Xi Zhu, Zhaoqian Xue et al.

Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose, leaving the structural causes of query-evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose Skill-RAG, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router. The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills (query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases) to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after m

Impact 4.0 · Import 4.0 · Pop 3.5
#363
arXiv cs.CL 2026-04-17
by Jinlun Ye, Jiang Liao, Runhe Lai, Xinhua Lu et al.

Vision-language models (VLMs) such as CLIP exhibit strong Out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic space is inherently open-ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams. To address this limitation, we introduce Test-time Textual Learning (TTL), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels. TTL updates learnable prompts using pseudo-labeled test samples to capture emerging OOD knowledge. To suppress noise introduced by pseudo-labels, we propose an OOD knowledge purification strategy that selects reliable OOD samples for adaptation. In addition, TTL maintains an OOD Textual Knowledge Bank that stores high-quality textual features, providing stable score calibration across batches. Extensive experiments on two standard benchmarks with nine OOD datasets demonstra

Impact 4.0 · Import 4.0 · Pop 3.5
#364
arXiv cs.CL 2026-04-17
by Ponhvoan Srey, Xiaobao Wu, Cong-Duy Nguyen, Anh Tuan Luu

Uncertainty estimation is a promising approach to detect hallucinations in large language models (LLMs). Recent approaches commonly depend on model internal states to estimate uncertainty. However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens. To address these issues, we present Sequential Internal Variance Representation (SIVR), a supervised hallucination detection framework that leverages token-wise, layer-wise features derived from hidden states. SIVR adopts a more basic assumption that uncertainty manifests in the degree of dispersion or variance of internal representations across layers, rather than relying on specific assumptions, which makes the method model and task agnostic. It additionally aggregates the full sequence of per-token variance features, learning temporal patterns indicative of factual errors and thereby preventing information loss. Experimental results demonstrate SIVR consistently outperforms strong baselines. Most importantly, SIVR enjoys stronger generalisation and avoids relying on large training sets, highlighting the potential for practical deplo
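The core SIVR feature, dispersion of each token's representation across layers, kept as a per-token sequence rather than pooled into one number, can be sketched with random tensors. Shapes and the averaging scheme here are illustrative assumptions, not the paper's exact definition:

```python
import numpy as np

# Made-up hidden states: (layers, tokens, hidden_dim).
rng = np.random.default_rng(1)
hidden = rng.normal(size=(12, 7, 16))

# For each token, deviation of its layer-wise representations from the
# token's mean representation, averaged over layers and hidden dims,
# yielding one variance feature per token (no last/mean-token pooling).
mean_over_layers = hidden.mean(axis=0, keepdims=True)
per_token_variance = ((hidden - mean_over_layers) ** 2).mean(axis=(0, 2))
print(per_token_variance.shape)  # one feature per token: (7,)
```

A supervised detector would then consume the full length-7 sequence, letting it learn temporal patterns of dispersion rather than a single scalar summary.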

Impact 4.0 · Import 4.0 · Pop 3.5
#365
arXiv cs.CL 2026-04-17
by Yichen Xu, Yuanhang Liu, Chuhan Wang, Zihan Zhao et al.

While Multimodal Large Language Models (MLLMs) excel at generic video understanding, their ability to support specialized, rule-grounded decision-making remains insufficiently explored. In this paper, we introduce RefereeBench, the first large-scale benchmark for evaluating MLLMs as automatic sports referees. Spanning 11 sports with 925 curated videos and 6,475 QA pairs, RefereeBench evaluates five core officiating abilities: foul existence, foul and penalty classification, foul and penalty reasoning, entity perception, and temporal grounding. The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence. Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees. Further analysis shows that while models can often identify incidents and involved entities, they struggle with rule application and temporal grounding, and frequently over-call fouls on norma

Impact 4.0 · Import 4.0 · Pop 3.5
#366
arXiv cs.AI 2026-04-17
by Reham Alharbi, Valentina Tamma, Terry R. Payne, Jacopo de Berardinis

Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language questions that an ontology should satisfy; they are traditionally modelled by ontology engineers together with domain experts as part of a human-centred, manual elicitation process. The use of Generative AI automates CQ creation at scale, therefore democratising the process of generation, widening stakeholder engagement, and ultimately broadening access to ontology engineering. However, given the large and heterogeneous landscape of LLMs, varying in dimensions such as parameter scale, task and domain specialisation, and accessibility, it is crucial to characterise and understand the intrinsic, observable properties of the CQs they produce (e.g., readability, structural complexity) through a systematic, cross-domain analysis. This paper introduces a set of quantitative measures for the systematic comparison of CQs across multiple dimensions. Using CQs generated from well-defined use cases and scenarios, we identify their salient properties, including readability, relevance with respect to the input text and structural complexity of

Impact 4.0 · Import 4.0 · Pop 3.5
#367
arXiv cs.AI 2026-04-17
by Shiran Dudy, Jan Simson, Yanan Long

As a relatively new forum, ACM FAccT has become a key space for activists and scholars to critically examine emerging AI and ML technologies. It brings together academics, civil society members, and government representatives from diverse fields to explore the broader societal impacts of both deployed and proposed technologies. We report a large-scale participatory design (PD) process for reflexive conference governance, which combined an in-person CRAFT session, an asynchronous Polis poll and the synthesis of a governance-facing report for the FAccT leadership. Participants shaped the substantive agenda by authoring seed statements, adding new statements, and voting, which made patterns of agreement, disagreement, and uncertainty visible. Our endeavors represent one of the first instances of applying PD to a venue that critically interrogates the societal impacts of AI, fostering a niche in which critical scholars are free to voice their concerns. Finally, this work advances large-scale PD theory by providing an effective case study of a co-design paradigm that can readily scale temporally and epistemologically.

Impact 4.0 · Import 4.0 · Pop 3.5
#368
arXiv cs.AI 2026-04-17
by Stefan Behfar, Richard Mortier

Probabilistic Synchronous Parallel (PSP) is a technique in distributed learning systems to reduce synchronization bottlenecks by sampling a subset of participating nodes per round. In Federated Learning (FL), where edge devices are often unreliable due to factors including mobility, power constraints, and user activity, PSP helps improve system throughput. However, PSP has a key limitation: it assumes device behavior is static and different devices are independent. This can lead to unfair distributed synchronization, due to highly available nodes dominating training while those that are often unavailable rarely participate and so their data may be missed. If both data distribution and node availability are simultaneously correlated with the device, then both PSP and standard FL algorithms will suffer from persistent under-representation of certain classes or groups resulting in inefficient or ineffective learning of certain features. We introduce Availability-Weighted PSP (AW-PSP), an extension to PSP that addresses the issue of co-correlation of unfair sampling and data availability by dynamically adjusting node sampling probabilities using real-time availability predictions, hist

Impact 4.0 · Import 4.0 · Pop 3.5
#369
arXiv cs.AI 2026-04-17
by Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian, Tanuja Ganu

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.

Impact 4.0 · Import 4.0 · Pop 3.5
#370
arXiv cs.AI 2026-04-17
by Sihan Lv, Yechen Jin, Zhen Li, Jintao Chen et al.

Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-Speech (TTS) models often faces a trade-off between editing quality and consistency. To address these issues, we propose AST, an Adaptive, Seamless, and Training-free precise speech editing framework. Leveraging a pre-trained autoregressive TTS model, AST introduces Latent Recomposition to selectively stitch preserved source segments with newly synthesized targets. Furthermore, AST extends this latent manipulation to enable precise style editing for specific speech segments. To prevent artifacts at these edit boundaries, the framework incorporates Adaptive Weak Fact Guidance (AWFG). AWFG dynamically modulates a mel-space guidance signal, enforcing structural constraints only where necessary without disrupting the generative manifold. To fill the gap of publicly accessible benchmarks, we introduce LibriSpeech-Edit, a new and larger speech editing dataset. As existing metrics poorly evaluate temporal consisten

Impact 4.0 · Import 4.0 · Pop 3.5
#371
arXiv cs.AI 2026-04-17
by Colin Jüni, Mina Montazeri, Yi Guo, Federica Bellizio et al.

Buildings account for approximately 40% of global energy consumption, and with the growing share of intermittent renewable energy sources, enabling demand-side flexibility, particularly in heating, ventilation and air conditioning systems, is essential for grid stability and energy efficiency. This paper presents a safe deep reinforcement learning-based control framework to optimize building space heating while enabling demand-side flexibility provision for power system operators. A deep deterministic policy gradient algorithm is used as the core deep reinforcement learning method, enabling the controller to learn an optimal heating strategy through interaction with the building thermal model while maintaining occupant comfort, minimizing energy cost, and providing flexibility. To address safety concerns with reinforcement learning, particularly regarding compliance with flexibility requests, we propose a real-time adaptive safety filter to ensure that the system operates within predefined constraints during demand-side flexibility provision. The proposed real-time adaptive safety filter guarantees full compliance with flexibility requests from system operators and improves energy

Impact 4.0 · Import 4.0 · Pop 3.5
#372
arXiv cs.AI 2026-04-17
by Delfina S. Martinez Pandiani, Ella Streefkerk, Laurens Naudts, Paula Helm

This paper traces a conceptual shift from understanding vulnerability as a static, essentialized property of data subjects to examining how it is actively enacted through data practices. Unlike reflexive ethical frameworks focused on missing or counter-data, we address the condition of abundance inherent to platformized life: a context where a near inexhaustible mass of data points already exists, shifting the ethical challenge to the researcher's choices in operating upon this existing mass. We argue that the ethical integrity of data science depends not just on who is studied, but on how technical pipelines transform "vulnerable" individuals into data subjects whose vulnerability can be further precarized. We develop this argument through an AI for Social Good (AI4SG) case: a journalist's request to use computer vision to quantify child presence in monetized YouTube 'family vlogs' for regulatory advocacy. This case reveals a "protection paradox": how data-driven efforts to protect vulnerable subjects can inadvertently impose new forms of computational exposure, reductionism, and extraction. Using this request as a point of departure, we perform a methodological deconstruction of t

Impact 4.0 · Import 4.0 · Pop 3.5
#373
arXiv cs.AI 2026-04-17
by Olivier Létoffé, Xuanxiang Huang, Joao Marques-Silva

For around a decade, non-symbolic methods have been the methods of choice when explaining complex machine learning (ML) models. Unfortunately, such methods lack rigor and can mislead human decision-makers. In high-stakes uses of ML, the lack of rigor is especially problematic. One prime example of provable lack of rigor is the adoption of Shapley values in explainable artificial intelligence (XAI), with the tool SHAP being a ubiquitous example. This paper overviews the ongoing efforts towards using rigorous symbolic methods of XAI as an alternative to non-rigorous non-symbolic approaches, concretely for assigning relative feature importance.
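For context on what the paper critiques, the Shapley-value attribution it discusses can be computed exactly on a tiny model by averaging marginal contributions over all feature orderings. The toy model, baseline, and input below are invented for illustration:

```python
from itertools import permutations

def model(x):
    # Toy model: an interaction term plus a linear term.
    return x[0] * x[1] + x[2]

baseline = [0.0, 0.0, 0.0]   # values substituted for "absent" features
point = [1.0, 2.0, 3.0]      # the input being explained

def value(coalition):
    # Model output when only features in `coalition` take their true values.
    x = [point[i] if i in coalition else baseline[i] for i in range(3)]
    return model(x)

shap = [0.0] * 3
for order in permutations(range(3)):
    seen = set()
    for i in order:
        before = value(seen)
        seen.add(i)
        shap[i] += (value(seen) - before) / 6  # 3! = 6 orderings
print([round(s, 2) for s in shap])  # [1.0, 1.0, 3.0]
```

Features 0 and 1 split the interaction term evenly and feature 2 receives its full additive effect; the paper's argument is that such game-theoretic scores, as popularized by SHAP, can still mislead, motivating symbolic alternatives.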

Impact 4.0 · Import 4.0 · Pop 3.5
#374
arXiv cs.RO 2026-04-17
by Ruiyang Wang, Hao-Lun Hsu, Jiwoo Kim, Miroslav Pajic

Coordinating multi-robot systems (MRS) to search in unknown environments is particularly challenging for tasks that require semantic reasoning beyond geometric exploration. Classical coordination strategies rely on frontier coverage or information gain and cannot incorporate high-level task intent, such as searching for objects associated with specific room types. We propose Semantic Area Graph Reasoning (SAGR), a hierarchical framework that enables Large Language Models (LLMs) to coordinate multi-robot exploration and semantic search through a structured semantic-topological abstraction of the environment. SAGR incrementally constructs a semantic area graph from a semantic occupancy map, encoding room instances, connectivity, frontier availability, and robot states into a compact task-relevant representation for LLM reasoning. The LLM performs high-level semantic room assignment based on spatial structure and task context, while deterministic frontier planning and local navigation handle geometric execution within assigned rooms. Experiments on the Habitat-Matterport3D dataset across 100 scenarios show that SAGR remains competitive with state-of-the-art exploration method

Impact 4.0 · Import 4.0 · Pop 3.5
#375
arXiv cs.RO 2026-04-17
by Thomas Campagnolo, Ezio Malis, Philippe Martinet, Gaétan Bahl

Open-vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single-view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision-language models to enhance open-vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase-grounded tasks and demonstrates generalization in zero-shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method. SENSE also provides a relative improvement of +3.5% mIoU on Cityscapes and +18% on KITTI compared to the baseline work. By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent

Impact 4.0 · Import 4.0 · Pop 3.5
#376
arXiv cs.RO 2026-04-17
by Xinglei Yu, Zhenyang Liu, Shufeng Nan, Simo Wu et al.

Diffusion policies are becoming mainstream in robotic manipulation but suffer from hard negative class imbalance due to uniform sampling and lack of sample difficulty awareness, leading to slow training convergence and frequent inference timeout failures. We propose VADF (Vision-Adaptive Diffusion Policy Framework), a vision-driven dual-adaptive framework that significantly reduces convergence steps and achieves early success in inference, with model-agnostic design enabling seamless integration into any diffusion policy architecture. During training, we introduce Adaptive Loss Network (ALN), a lightweight MLP-based loss predictor that quantifies per-step sample difficulty in real time. Guided by hard negative mining, it performs weighted sampling to prioritize high-loss regions, enabling adaptive weight updates and faster convergence. In inference, we design the Hierarchical Vision Task Segmenter (HVTS), which decomposes high-level task instructions into multi-stage low-level sub-instructions based on visual input. It adaptively segments action sequences into simple and complex subtasks by assigning shorter noise schedules with longer direct execution sequences to simple actions,
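The hard-negative-mining step can be sketched as loss-weighted sampling; the predicted losses below are stand-ins for ALN's per-step difficulty outputs, and the names are illustrative, not VADF's API.

```python
import random

def difficulty_weighted_batch(samples, predicted_loss, batch_size, rng):
    """Draw training samples with probability proportional to their
    predicted per-step loss, so high-loss (hard) regions are sampled
    more often than easy ones -- the weighted-sampling idea behind ALN."""
    return rng.choices(samples, weights=predicted_loss, k=batch_size)

rng = random.Random(0)
samples = ["easy_a", "easy_b", "hard_c"]
loss = [0.1, 0.1, 0.8]          # stand-ins for ALN's real-time loss predictions
batch = difficulty_weighted_batch(samples, loss, 1000, rng)
print(batch.count("hard_c") / 1000)  # close to 0.8, its share of total loss
```

In the full framework these weights would be refreshed each step as the loss predictor updates, shifting sampling pressure as samples get learned.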

Impact 4.0 · Import 4.0 · Pop 3.5
#377
arXiv cs.RO 2026-04-17
by Ayodele James Oyejide, Ustaz A. Yaqub, Samir Erturk, Eray A. Baran et al.

Vine-inspired robots achieve large workspace coverage through tip eversion, enabling safe navigation in confined and cluttered environments. However, their deployment in free space is fundamentally limited by low axial stiffness, poor load-bearing capacity, and the inability to retain shape during and after steering. In this work, we propose a reconfigurable pneumatic joint (RPJ) architecture that introduces discrete, pressure-tunable stiffness along the robot body without compromising continuous growth. Each RPJ module comprises symmetrically distributed pneumatic chambers that locally increase bending stiffness when pressurized, enabling decoupling between global compliance and localized rigidity. We integrate the RPJs into a soft growing robot with tendon-driven steering and develop a compact base station for mid-air eversion. System characterization and experimental validation demonstrate moderate pressure requirements for eversion, as well as comparable localized stiffening and steering performance to layer-jamming mechanisms. Demonstrations further show that the proposed robot achieves improved shape retention during bending, reduced gravitational deflection under load, casca

Impact 4.0 · Import 4.0 · Pop 3.5
#378
arXiv cs.RO 2026-04-17
by Vishal Ramesh, Antony Thomas

Multi-UAV inspection missions require spare drones to replace active drones during recharging cycles. Existing fleet-sizing approaches often assume steady-state operating conditions that do not apply to finite-horizon missions, or they treat replacement requests as statistically independent events. The latter provides per-request blocking guarantees that fail to translate to mission-level reliability when demands cluster. This paper identifies a structural failure mode where efficient routing assigns similar workloads to each UAV, leading to synchronized battery depletion and replacement bursts that exhaust the spare pool even when average capacity is sufficient. We derive a closed-form sufficient fleet-sizing rule, k = m(ceil(R) + 1), where m is the number of active UAVs and R is the recovery-to-active time ratio. This additive buffer of m spares absorbs worst-case synchronized demand at recovery-cycle boundaries and ensures mission-level reliability even when all UAVs deplete simultaneously. Monte Carlo validation across five scenarios (m in [2, 10], R in [0.87, 3.39], 1000 trials each) shows that Erlang-B sizing with a per-request blocking target epsilon = 0.01 drops to 69.9% mi
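The closed-form rule is simple enough to evaluate directly; a minimal sketch (the function name is ours) using parameter values from the paper's reported ranges:

```python
import math

def sufficient_fleet_size(m: int, recovery_ratio: float) -> int:
    """Sufficient total fleet size k = m * (ceil(R) + 1), where m is the
    number of active UAVs and R the recovery-to-active time ratio.
    Sized so the spare pool survives worst-case synchronized depletion,
    when all m active UAVs request replacement at once."""
    return m * (math.ceil(recovery_ratio) + 1)

# m = 4 active UAVs whose recharge takes 2.5x their flight time (R = 2.5):
k = sufficient_fleet_size(4, 2.5)
print(k)  # 16 total, i.e. 4 active plus 12 spares
```

The ceiling makes the rule conservative for fractional R, which is what lets it absorb the replacement bursts that per-request Erlang-B sizing misses.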

Impact 4.0 · Import 4.0 · Pop 3.5
#379
arXiv cs.RO 2026-04-17
by Vishal Ramesh, Aman Singh, Shishir Kolathaya

Series and parallel elastic actuators offer complementary but mutually exclusive advantages, yet no existing actuator enables real-time transition between these topologies during operation. This paper presents a novel actuator design called the Dual-Topology Elastic Actuator (DTEA), which enables dynamic switching between SEA and PEA topologies during operation. A proof-of-concept prototype of the DTEA is developed to demonstrate the feasibility of the topology-switching mechanism. Experiments are conducted to evaluate the robustness and timing of the switching mechanism under operational conditions. The actuator successfully performed 324 topology-switching cycles under load without damage, demonstrating the robustness of the mechanism. The measured switching time between SEA and PEA modes is under 33.33 ms. Additional experiments are conducted to characterize the static stiffness and disturbance rejection performance in both SEA and PEA modes. Static stiffness tests show that the PEA mode is 1.53x stiffer than the SEA mode, with KSEA = 5.57 +/- 0.02 Nm/rad and KPEA = 8.54 +/- 0.02 Nm/rad. Disturbance rejection experiments show that the mean peak deflection in SEA mode is 2.26x la
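The reported 1.53x stiffness ratio follows directly from the two measured constants; a quick arithmetic check:

```python
k_sea = 5.57  # Nm/rad, measured static stiffness in SEA mode
k_pea = 8.54  # Nm/rad, measured static stiffness in PEA mode
ratio = k_pea / k_sea
print(round(ratio, 2))  # 1.53, matching the reported 1.53x
```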

Impact 4.0 · Import 4.0 · Pop 3.5
#380
arXiv cs.RO 2026-04-17
by Zhi Zhang, Chalermchon Satirapod, Bingtao Ma, Changjun Gu

Solid-state LiDAR-inertial SLAM has attracted significant attention due to its advantages in speed and robustness. However, achieving accurate mapping in extreme environments remains challenging due to severe geometric degeneracy and unreliable observations, which often lead to ill-conditioned optimization and map inconsistencies. To address these challenges, we propose an environment-adaptive solid-state LiDAR-inertial odometry that integrates local normal-vector constraints with degeneracy-aware map maintenance to enhance localization accuracy. Specifically, we introduce local normal-vector constraints to improve the stability of state estimation, effectively suppressing localization drift in degenerate scenarios. Furthermore, we design a degeneration-guided map update strategy to improve map precision. Benefiting from the refined map representation, localization accuracy is further enhanced in subsequent estimation. Experimental results demonstrate that the proposed method achieves superior mapping accuracy and robustness in extreme and perceptually degraded environments, with an average RMSE reduction of up to 12.8% compared to the baseline method.

Impact 4.0 · Import 4.0 · Pop 3.5
#381
arXiv cs.RO 2026-04-17
by Jed R Muff, Karine Miras, A. E. Eiben

Lamarckian inheritance has been shown to be a powerful accelerator in systems where the joint evolution of robot morphologies and controllers is enhanced with individual learning. Its defining advantage lies in the offspring inheriting controllers learned by their parents. The efficacy of this option, however, relies on morphological similarity between parent and offspring. In this study, we examine how Lamarckian inheritance performs when the search process is driven toward high morphological variance, potentially straining the requirement for parent-offspring similarity. Using a system of modular robots that can evolve and learn to solve a locomotion task, we compare Darwinian and Lamarckian evolution to determine how they respond to shifting from pure task-based selection to a multi-objective pressure that also rewards morphological novelty. Our results confirm that Lamarckian evolution outperforms Darwinian evolution when optimizing task-performance alone. However, introducing selection pressure for morphological diversity causes a substantial performance drop, which is much greater in the Lamarckian system. Further analyses show that promoting diversity reduces parent-offsprin

Impact 4.0 · Import 4.0 · Pop 3.5
#382
arXiv cs.RO 2026-04-17
by Junjie Wen, Junlin He, Fei Ma, Jinqiang Cui

Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to jointly satisfy these requirements, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes. To address these challenges, we present PLAF, a Pixel-wise Language-Aligned Feature extraction framework that enables dense and accurate semantic alignment in 2D without sacrificing open-vocabulary expressiveness. Building upon this representation, we further design an efficient semantic storage and querying scheme that significantly reduces redundancy across both 2D and 3D domains. Experimental results show that PLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The code is publicly available at https://github.com/RockWenJJ/PLAF.

Impact 4.0 · Import 4.0 · Pop 3.5
#383
arXiv cs.RO 2026-04-17
by Aedan Mangan, Kehan Long, Ki Myung Brian Lee, Miheer Potdar et al.

Continuum robots are well suited for navigating confined and fragile environments, such as vascular or endoluminal anatomy, where contact with surrounding structures is often unavoidable. While controlled contact can assist motion, unfavorable contact can degrade controllability, induce kinematic singularities, or introduce safety risks. We present a contact-aware planning approach that evaluates contact quality, penalizing hazardous interactions, while permitting benign contact. The planner produces kinematically feasible trajectories and contact-aware Jacobians which can be used for closed-loop control in hardware experiments. We validate the approach by testing the integrated system (planning, control, and mechanical design) on anatomical models from patient scans. The planner generates effective plans for three common anatomical environments, and, in all hardware trials, the continuum robot was able to reach the target while avoiding dangerous tip contact (100% success). Mean tracking errors were 1.9 +/- 0.5 mm, 1.2 +/- 0.1 mm, and 1.7 +/- 0.2 mm across the three different environments. Ablation studies showed that penalizing end-of-continuum-segment (ECS) contact improved mani

Impact 4.0 · Import 4.0 · Pop 3.5
#384
arXiv cs.RO 2026-04-17
by Lorenzo Ticozzi, Patricio A. Vela, Panagiotis Tsiotras

Reconstructing the shape of continuum manipulators from sparse, noisy sensor data is a challenging task, owing to the infinite-dimensional nature of such systems. Existing approaches broadly trade off between parametric methods that yield compact state representations but lack probabilistic structure, and Cosserat rod inference on factor graphs, which provides principled uncertainty quantification at the cost of a state dimension that grows with the spatial discretization. This letter combines the strength of both paradigms by estimating the coefficients of a low-dimensional Geometric Variable Strain (GVS) parameterization within a factor graph framework. A novel kinematic factor, derived from the Magnus expansion of the strain field, encodes the closed-form rod geometry as a prior constraint linking the GVS strain coefficients to the backbone pose variables. The resulting formulation yields a compact state vector directly amenable to model-based control, while retaining the modularity, probabilistic treatment and computational efficiency of factor graph inference. The proposed method is evaluated in simulation on a 0.4 m long tendon-driven continuum robot under three measurement c

Impact 4.0 · Import 4.0 · Pop 3.5
#385
arXiv cs.RO 2026-04-17
by Dong-Uk Seo, Jinwoo Jeon, Eungchang Mason Lee, Hyun Myung

Gaussian splatting has recently gained traction as a compelling map representation for SLAM systems, enabling dense and photo-realistic scene modeling. However, its application to monocular SLAM remains challenging due to the lack of reliable geometric cues from monocular input. Without geometric supervision, mapping or tracking can fall into local minima, resulting in structural degeneracies and inaccuracies. To address this challenge, we propose GaussianFlow SLAM, a monocular 3DGS-SLAM that leverages optical flow as a geometry-aware cue to guide the optimization of both the scene structure and camera poses. By encouraging the projected motion of Gaussians, termed GaussianFlow, to align with the optical flow, our method introduces consistent structural cues to regularize both map reconstruction and pose estimation. Furthermore, we introduce normalized error-based densification and pruning modules to refine inactive and unstable Gaussians, thereby contributing to improved map quality and pose accuracy. Experiments conducted on public datasets demonstrate that our method achieves superior rendering quality and tracking accuracy compared with state-of-the-art algorithms. The source c

Impact 4.0 · Import 4.0 · Pop 3.5
#386
arXiv Robotics-Embodied 2026-04-17
by Jianuo Cao, Yuxin Chen, Masayoshi Tomizuka

Training language-conditioned whole-body controllers for humanoid robots demands large-scale motion-language datasets. Existing approaches based on motion capture are costly and limited in diversity, while text-to-motion generative models produce purely kinematic outputs that are not guaranteed to be physically feasible. We present CLAW, a pipeline for scalable generation of language-annotated whole-body motion data for the Unitree G1 humanoid robot. CLAW composes motion primitives from a kinematic planner, parameterized by movement, heading, speed, pelvis height, and duration, and provides two browser-based interfaces (a real-time keyboard mode and a timeline-based sequence editor) for exploratory and batch data collection. A low-level controller tracks these references in MuJoCo simulation, yielding physically grounded trajectories. In parallel, a template-based engine generates diverse natural-language annotations at both segment and trajectory levels. To support scalable generation of motion-language paired data for humanoid robot learning, we make our system publicly available at: https://github.com/JianuoCao/CLAW

Impact 4.0 · Import 4.0 · Pop 3.5
#387
arXiv RL 2026-04-17
by Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin et al.

Reinforcement learning has become a powerful approach for enhancing large language model reasoning, but faces a fundamental dilemma: training on easy problems can cause overfitting and pass@k degradation, while training on hard problems often results in sparse rewards. Recent question augmentation methods address this by prepending partial solutions as hints. However, uniform hint provision may introduce redundant information while missing critical reasoning bottlenecks, and excessive hints can reduce reasoning diversity, causing pass@k degradation. We propose PieceHint, a hint injection framework that strategically identifies and provides critical reasoning steps during training. By scoring the importance of different reasoning steps, selectively allocating hints based on problem difficulty, and progressively withdrawing scaffolding, PieceHint enables models to transition from guided learning to independent reasoning. Experiments on six mathematical reasoning benchmarks show that our 1.5B model achieves comparable average performance to 32B baselines while preserving pass@k diversity across all k values.
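A toy version of difficulty-based hint allocation with progressive withdrawal; the scheduling function and names below are our illustration of the idea, not PieceHint's actual scoring.

```python
def hint_fraction(difficulty: float, progress: float, max_frac: float = 0.6) -> float:
    """Fraction of the reference solution to prepend as a hint: larger for
    harder problems, annealed toward zero as training progresses so the
    model moves from guided learning to independent reasoning."""
    return max_frac * difficulty * (1.0 - progress)

def build_prompt(problem: str, solution_steps: list, difficulty: float, progress: float) -> str:
    k = round(hint_fraction(difficulty, progress) * len(solution_steps))
    hint = solution_steps[:k]   # prefix of the reference solution as scaffolding
    return problem + ("\nHint: " + " ".join(hint) if hint else "")

steps = ["factor the quadratic,", "set each factor to zero,", "check both roots."]
print(build_prompt("Solve x^2-5x+6=0.", steps, difficulty=1.0, progress=0.0))
```

Early in training a hard problem gets the first two of three steps as a hint; by the end of training (progress 1.0) the same problem gets none.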

Impact 4.0 · Import 4.0 · Pop 3.5
#388
arXiv RL 2026-04-17
by Huan Lin, Lianghui Ding

Large-scale Unmanned Aerial Vehicle (UAV) failures can split a UAV swarm network into disconnected sub-networks, making decentralized recovery both urgent and difficult. Centralized recovery methods depend on global topology information and become communication-heavy after severe fragmentation. Decentralized heuristics and multi-agent reinforcement learning methods are easier to deploy, but their performance often degrades when the swarm scale and damage severity vary. We present the Physics-informed Graph Adversarial Imitation Learning algorithm (PhyGAIL), which adopts centralized training with decentralized execution. PhyGAIL builds bounded local interaction graphs from heterogeneous observations, and uses a physics-informed graph neural network to encode directional local interactions as gated message passing with explicit attraction and repulsion. This gives the policy a physically grounded coordination bias while keeping local observations scale-invariant. It also uses scenario-adaptive imitation learning to improve training under fragmented topologies and variable-length recovery episodes. Our analysis establishes bounded local graph amplification, bounded intera

Impact 4.0 · Import 4.0 · Pop 3.5
#389
arXiv RL 2026-04-17
by Peter Vamplew, Cameron Foale

This research note identifies a previously overlooked distinction between multi-objective reinforcement learning (MORL) and more conventional single-objective reinforcement learning (RL). It has previously been noted that the optimal policy for an MORL agent with a non-linear utility function is required to be conditioned on both the current environmental state and on some measure of the previously accrued reward. This is generally implemented by concatenating the observed state of the environment with the discounted sum of previous rewards to create an augmented state. While augmented states have been widely used in the MORL literature, one implication of their use has not previously been reported: namely, that they require the agent to have continued access to the reward signal (or a proxy thereof) after deployment, even if no further learning is required. This note explains why this is the case, and considers the practical repercussions of this requirement.
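The augmented-state construction the note discusses can be sketched in a few lines (variable names ours); the loop makes the deployment-time requirement concrete, since updating the accrued-reward component needs the reward signal even after learning has stopped.

```python
def augment_state(obs, accrued):
    """Concatenate the environment observation with the vector of
    discounted reward accrued so far; an MORL policy with a non-linear
    utility must condition on this augmented state, not on obs alone."""
    return list(obs) + list(accrued)

gamma = 0.99
accrued = [0.0, 0.0]            # two objectives
obs = [0.5, -1.0]
discount = 1.0
for reward in ([1.0, 0.0], [0.0, 2.0]):
    aug = augment_state(obs, accrued)   # what the policy actually sees
    accrued = [a + discount * r for a, r in zip(accrued, reward)]
    discount *= gamma
# Maintaining `accrued` is exactly why the deployed agent still needs the
# reward signal (or a proxy), even with no further learning.
print(accrued)  # [1.0, 1.98]
```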

Impact 4.0 · Import 4.0 · Pop 3.5
#390
arXiv RL 2026-04-17
by Prashant Rangarajan, Rajesh P. N. Rao

Active inference, a neurally-inspired model for inferring actions based on the free energy principle (FEP), has been proposed as a unifying framework for understanding perception, action, and learning in the brain. Active inference has previously been used to model ecologically important tasks such as navigation and planning, but scaling it to solve complex large-scale problems in real-world environments has remained a challenge. Inspired by the existence of multi-scale hierarchical representations in the brain, we propose a model for planning of actions based on hierarchical active inference. Our approach combines a hierarchical model of the environment with successor representations for efficient planning. We present results demonstrating (1) how lower-level successor representations can be used to learn higher-level abstract states, (2) how planning based on active inference at the lower-level can be used to bootstrap and learn higher-level abstract actions, and (3) how these learned higher-level abstract states and actions can facilitate efficient planning. We illustrate the performance of the approach on several planning and reinforcement learning (RL) problems including a var

Impact 4.0 · Import 4.0 · Pop 3.5
#391
arXiv RL 2026-04-17
by Tim Launer, Jonas Hübotter, Marco Bagatella, Ido Hakimi et al.

We investigate Functional Majority Voting (FMV), a method based on functional consensus for code generation with Large Language Models, which identifies a representative solution from multiple generations using their runtime execution signatures on test inputs. We find that FMV is an effective test-time inference strategy, substantially boosting performance on LiveCodeBench without a large compute overhead. Furthermore, we extend the utility of functional consensus and apply it as an aggregation strategy for label-free Test-Time Reinforcement Learning. We demonstrate that this increases pass@1 on holdout tasks, but find no evidence of self-improvement beyond the base model's performance ceiling.
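Functional consensus as described, grouping candidate programs by their execution signatures on test inputs and keeping a representative of the largest group, can be sketched as follows (details such as error handling are our assumptions):

```python
from collections import Counter

def functional_majority_vote(candidates, test_inputs):
    """Return a representative of the largest set of candidates that
    behave identically (same runtime execution signature) on the inputs."""
    def signature(fn):
        outs = []
        for x in test_inputs:
            try:
                outs.append(repr(fn(x)))
            except Exception:
                outs.append("<error>")   # crashes are part of the signature
        return tuple(outs)

    sigs = [signature(fn) for fn in candidates]
    majority_sig, _ = Counter(sigs).most_common(1)[0]
    return candidates[sigs.index(majority_sig)]

# Three generated "absolute value" programs: two agree functionally,
# one is subtly wrong, so consensus picks a correct representative.
cands = [lambda x: abs(x), lambda x: x if x >= 0 else -x, lambda x: x]
chosen = functional_majority_vote(cands, [-2, 0, 3])
print(chosen(-2))  # 2
```

The key property is that syntactically different but functionally equivalent generations vote together, which is why this aggregation is stronger than exact-text majority voting.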

Impact 4.0 · Import 4.0 · Pop 3.5
#392
arXiv RL 2026-04-17
by Taisuke Kobayashi

This paper proposes a novel method that incorporates empowerment when reasoning actions in reinforcement learning (RL), thereby achieving flexibility in handling the exploration-exploitation dilemma (EED). In previous methods, empowerment for promoting exploration has been provided as a bonus term to the task-specific reward function as an intrinsically-motivated RL. However, this approach introduces a delay until the policy that accounts for empowerment is learned, making it difficult to adjust the emphasis on exploration as needed. On the other hand, a trick devised for fine-tuning recent foundation models at reasoning, so-called best-of-N (BoN) sampling, allows for the implicit acquisition of modified policies without explicitly learning them. It is expected that applying this trick to exploration-promoting terms, such as empowerment, will enable more flexible adjustment of EED. Therefore, this paper investigates BoN sampling for empowerment. Furthermore, to adjust the degree of policy modification in a generalizable manner while maintaining computational cost, this paper proposes a novel BoN sampling method extended by Tsallis statistics. Through toy problems, the proposed method's cabil
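A minimal sketch of plain best-of-N sampling with an empowerment-style bonus (toy values and names, not the paper's Tsallis-statistics extension): sample N candidates from the unmodified policy, then keep the one maximizing task value plus a weighted exploration bonus.

```python
def best_of_n(sample_action, score, n=8):
    """Best-of-N: draw N candidate actions from the base policy and keep
    the score-maximizing one. The modified (exploration-aware) policy is
    obtained implicitly, with no retraining of the sampler."""
    candidates = [sample_action() for _ in range(n)]
    return max(candidates, key=score)

# Toy discrete setting (values illustrative, not from the paper).
task_value = {0: 1.0, 1: 0.2, 2: 0.8}
empowerment = {0: 0.0, 1: 2.0, 2: 0.5}
beta = 1.0  # raising beta tilts the implicit policy toward exploration

draws = iter([2, 0, 1, 0])  # deterministic stand-in for policy sampling
action = best_of_n(lambda: next(draws),
                   lambda a: task_value[a] + beta * empowerment[a], n=4)
print(action)  # 1: at beta = 1.0 the empowerment bonus dominates
```

Because beta only enters the scoring function, the exploration emphasis can be adjusted at inference time, which is the flexibility the reward-bonus formulation lacks.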

Impact 4.0 · Import 4.0 · Pop 3.5
#393
arXiv RL 2026-04-17
by Yuncong Liu, Yuan Wan, Zhou Jiang, Yao Lu

Key Opinion Leader (KOL) discourse on social media is widely consumed as investment guidance, yet turning it into executable trading strategies without injecting assumptions about unspecified execution decisions remains an open problem. We observe that the gaps in KOL statements are not random deficiencies but a structured separation: KOLs express directional intent (what to buy or sell and why) while leaving execution decisions (when, how much, how long) systematically unspecified. Building on this observation, we propose an intent-preserving policy completion framework that treats KOL discourse as a partial trading policy and uses offline reinforcement learning to complete the missing execution decisions around the KOL-expressed intent. Experiments on multimodal KOL discourse from YouTube and X (2022-2025) show that KICL achieves the best return and Sharpe ratio on both platforms while maintaining zero unsupported entries and zero directional reversals, and ablations confirm that the full framework yields an 18.9% return improvement over the KOL-aligned baseline.

Impact 4.0 · Import 4.0 · Pop 3.5
#394
arXiv Agents 2026-04-17
by Matthew Frazier, Kostadin Damevski, Lori Pollock

Secondary school students enrolled in the AP Computer Science Principles (CSP) course commonly utilize web resources (e.g., tutorials, Q&A sites) to better understand key concepts in the curriculum. The primary obstacle to using these resources is finding information appropriate for the learning task and student's background. In addition to web search, conversational agents are increasingly a viable alternative for CSP students. In this paper, we study the potential of conversational agents to aid secondary school students as they acquire knowledge on CSP concepts. We explore general purpose, generative conversational agents (e.g., ChatGPT) and custom, fixed-response conversational agents built specifically to aid CSP students. We present results from classroom use by 45 high school students in grades 9-11 (ages 14-17) across six CSP sections. Our main contributions are in better understanding how conversational agents can help CSP students and an evaluation of the effectiveness and engagement of different approaches for CSP exploratory search.

Impact 4.0 · Import 4.0 · Pop 3.5
#395
arXiv Agents 2026-04-17
by Shaoqing Liu, Mushuang Liu

Computational complexity has been a major challenge in game-theoretic model predictive control (GT-MPC), as real-time solutions to a game (e.g., Nash equilibria (NEs)) have to be computed at each sampling instant of an MPC. This challenge is especially critical in autonomous driving, where interactions may involve many agents, and decisions must be made at fast sampling rates. We show that this challenge can be addressed through time-distributed solution-seeking iterations designed based on, e.g., Newton and Newton–Kantorovich methods. Specifically, the autonomous vehicle decision-making problem is first formulated as a GT-MPC problem. To ensure solution attainability, a potential game framework is adopted. Within this framework, both potential-function optimization and best-response dynamics are used to seek the NE. To enable real-time implementation, Newton and Newton–Kantorovich methods are employed to solve the optimization problems arising in the NE-seeking algorithms, with their iterations distributed over time. Numerical experiments on an intersection-crossing scenario demonstrate that the proposed methods achieve effective real-time performance.
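The time-distribution idea, one solver iteration per sampling instant with warm starting rather than iterating to convergence within an instant, can be illustrated on a scalar stand-in for the NE-seeking optimization (the game-theoretic structure itself is omitted; the quartic cost is our toy choice, picked because it needs many Newton steps):

```python
def newton_step(x, grad, hess):
    """One Newton iteration on the first-order optimality condition."""
    return x - grad(x) / hess(x)

# Minimize (x - 3)^4. Its vanishing Hessian at the optimum forces slow,
# linear convergence, so progress genuinely accrues across instants.
cost_grad = lambda x: 4 * (x - 3) ** 3
cost_hess = lambda x: 12 * (x - 3) ** 2

x = 0.0  # warm start carried across sampling instants
for sampling_instant in range(40):
    x = newton_step(x, cost_grad, cost_hess)  # one iteration per instant
print(round(x, 3))  # 3.0: converged to the optimum over 40 instants
```

Each instant pays for only one iteration, which is what keeps the per-step compute compatible with fast sampling rates.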

Impact 4.0 · Import 4.0 · Pop 3.5
#396
arXiv Agents 2026-04-17
by Maks Pečnik Bambič, Nuno A. M. Araújo, Giorgio Volpe

Collective rotations are common in active matter, enhancing cohesion, transport, and mixing. They are typically attributed to chiral non-reciprocal dynamics due to intrinsic particle chirality, torque-generating interactions among units, or geometric confinement. Here, we uncover a different mechanism for rotational order in active matter where a dynamic environment coordinates the self-organization of non-chiral active particles into living crystals exhibiting sustained collective solid-like rotations. At intermediate densities, feedback from a fluctuating landscape of passive Brownian particles stabilizes large living crystals of obstacle-avoiding run-and-tumble agents. Strikingly, this environmental feedback also produces living crystals with qualitatively distinct dynamics: collective solid-like spinning emerges for particles with long persistence times approaching ballistic motion, rather than for particles moving by conventional enhanced diffusion. Beyond revealing a new route to collective rotational order in active matter, these findings highlight the integral role of a dynamic environment in self-organization and suggest environment-mediated design principles for active ma

Impact 4.0 · Import 4.0 · Pop 3.5
#397
arXiv Agents 2026-04-17
by Florian Furbach, Lucas Clorius, Roland Kuhn, Hernán Melgratti et al.

Swarm protocols are a recently introduced formalism for specifying, implementing, and verifying peer-to-peer systems called swarms. A swarm consists of distributed agents called machines that communicate by asynchronous event propagation. Following a local-first model, each machine can progress without requiring continuous connectivity to other machines. Existing models of swarms are not compositional, making the modular development of large and complex swarm applications as well as the reuse of code difficult. We address these issues by presenting novel theory and techniques for the compositional specification, verification, and implementation of swarms. These results enable the correct compositional reuse of pre-existing swarm protocols and machine implementations. We implement these contributions in a companion software artifact which enables the automatic integration of independently designed and verified swarm components.

Impact 4.0 · Import 4.0 · Pop 3.5
#398
arXiv Agents 2026-04-17
by Koken Hata, Rintaro Chujo, Reina Takamatsu, Wenzhen Xu et al.

Conversational agents have the potential to support intergroup relations when psychological or linguistic barriers prevent direct interaction. Based on intergroup contact theory, we propose GroupEnvoy, a conversational agent that represents outgroup perspectives during ingroup discussions, grounded in transcripts from outgroup-only sessions. To evaluate this approach and derive design principles, we conducted a mixed-methods, between-subjects study with university students, where host-country students formed the ingroup and international students formed the outgroup. Ingroup students performed a collaborative task, receiving outgroup perspectives via GroupEnvoy (experimental) or reading written transcripts (control). Compared to the control group, the experimental group showed greater reduction in intergroup anxiety and greater improvement in perspective-taking. Qualitatively, AI-mediated contact enhanced outcome expectancies, whereas passive exposure fostered future contact intentions. The two conditions also elicited empathy toward distinct targets: outgroup evaluations of the ingroup versus outgroup lived experiences. These findings validate AI-mediated contact as a promising pa

Impact 4.0 · Import 4.0 · Pop 3.5
#399
arXiv Agents 2026-04-17
by Aswini Misro, Vikash Sharma, Shreyank N Gowda

We present Veritas-RPM, a provenance-guided multi-agent architecture comprising five processing layers: VeritasAgent (ground-truth assembly), SentinelLayer (anomaly detection), DirectorAgent (specialist routing), six domain Specialist Agents, and MetaSentinelAgent (conflict resolution and final decision). We construct a 98-case synthetic taxonomy of false-positive scenarios derived from documented RPM patterns. Synthetic patient epochs (n = 530) were generated directly from taxonomy parameters and processed through the pipeline. Ground-truth labels are known for all cases. Performance is reported as True Suppression Rate (TSR), False Escalation Rate (FER), and Indeterminate Rate (INDR).

Impact 4.0 · Import 4.0 · Pop 3.5
#400
arXiv Agents 2026-04-17
by Katharina Stich, Bastian Perner, Friedemann Laue, Torsten Reissland et al.

This paper proposes the LiFE-CD algorithm for convergence time analysis of the max-consensus algorithm in multi-agent systems under Bernoulli-distributed link failures. Unlike existing approaches, which either assume ideal communication or provide asymptotic upper bounds on the expected convergence time, LiFE-CD deterministically computes the full probability distribution of the convergence time from network topology and individual link failure probabilities, without simulation. The full probability distribution enables deadline-aware protocol design with specified reliability guarantees. Based on geometrically distributed link delays, the proposed algorithm iteratively reduces the given network topology considering both unicast and broadcast transmissions. LiFE-CD yields exact results for acyclic networks and, for cyclic networks, tight upper bounds on the convergence time via shortest-path spanning tree construction. Numerical results confirm analytical exactness for acyclic networks, validate tightness for cyclic networks, and demonstrate improvement over existing approaches. Our complexity analysis shows reduced computational cost compared to Monte Carlo simulations, while elim
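For contrast, a Monte Carlo estimate of max-consensus convergence time under Bernoulli link failures, the simulation-based baseline that LiFE-CD's analytical distribution replaces, looks like this (graph and parameters are our toy example):

```python
import random

def max_consensus_time(adj, p_fail, rng):
    """Rounds until every node holds the global max, where each directed
    link delivers its message each round independently with prob 1 - p_fail."""
    values = {v: float(v) for v in adj}        # distinct initial values
    target = max(values.values())
    t = 0
    while any(x < target for x in values.values()):
        t += 1
        new = dict(values)
        for u in adj:
            for v in adj[u]:
                if rng.random() > p_fail:      # link (u, v) succeeds this round
                    new[v] = max(new[v], values[u])
        values = new
    return t

rng = random.Random(1)
line = {0: [1], 1: [0, 2], 2: [1]}             # 3-node line graph
samples = [max_consensus_time(line, p_fail=0.3, rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))  # near 2/0.7 ~ 2.86: two geometric hop delays in series
```

On this line graph the max must cross two links in sequence, so the convergence time is a sum of geometric delays, which is exactly the structure LiFE-CD evaluates exactly instead of sampling.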

Impact 4.0 · Import 4.0 · Pop 3.5
#401
arXiv Agents 2026-04-17
by Yaohui Han, Tianshuo Wang, Zixi Zhao, Zhengchun Zhu et al.

Vision Language Models (VLMs) have been applied to several specific domains and have shown strong problem-solving capabilities. However, astronomical imaging, a quite complex problem involving multidisciplinary knowledge and several subtasks, has not been adequately studied. Because the astronomical imaging process is complex, both world-class astronomical organizations, such as NASA, and expert enthusiasts devote a great deal of time and effort to it: its processing stages have complex underlying correlations that significantly influence one another, making quality diagnosis and error localization of astronomical images challenging. To address this problem, we propose AstroVLM, a collaborative multi-agent system for diagnosing the quality of astronomical images. Experiment results show that AstroVLM outperforms all baselines on real-world astronomical imaging quality diagnosis tasks, providing a reference for language models to handle complicated multi-process tasks.

Impact 4.0 · Import 4.0 · Pop 3.5
#402
arXiv Agents 2026-04-17
by Haoran Wu, Zeyu Cao, Yao Lai, Binglei Lou et al.

Emerging agentic LLM workloads are driving rapidly growing demand on both memory capacity and bandwidth, with different phases of inference (e.g., prefill and decode) imposing distinct requirements. Industry is responding by composing heterogeneous accelerators into single interconnected systems, as exemplified by NVIDIA's Vera Rubin platform, where each device brings its own memory architecture. This heterogeneity is further compounded by a widening landscape of available memory technologies: high-density on-chip SRAM, HBM, LPDDR, GDDR, and emerging options such as high-bandwidth flash (HBF), each offering different capacity, bandwidth, and power trade-offs. Identifying the right memory architecture for next-generation inference accelerators requires navigating a vast and rapidly evolving design space, in which the interplay between workload characteristics, NPU design dimensions, and memory system design remains largely underexplored. To address this challenge, we present MemExplorer, a new memory system synthesizer for heterogeneous NPU systems. MemExplorer provides a unified abstraction for modeling diverse memory technologies across different hierarchy levels (e.g., on-chip an

Impact 4.0 · Import 4.0 · Pop 3.5
#403
arXiv Agents 2026-04-17
by Jiashuo Cao, Chen Li, Wujie Gao, Simon Hoermann et al.

Virtual agents have shown promising potential in mental health applications, but current research has predominantly focused on contexts outside of traditional therapy sessions. This paper examines the impact of a virtual supporter in remote psychotherapy sessions conducted via Zoom. We used a two-phase research approach. First, we conducted a formative study to understand the roles and functions of human supporters in psychotherapy contexts. Based on these findings, we developed a virtual supporter operating in two modes: Daily Mode (for mood journaling outside therapy) and Therapy Mode (as an additional participant in Zoom therapy sessions). Finally, we ran a user study with 14 participants who engaged with the virtual supporter for a week and then joined a remote psychotherapy session together. Our findings revealed that the virtual supporter had positive effects on creating psychological safety, reducing anxiety, and enhancing emotional articulation without disrupting the therapeutic process. We then discuss both the benefits and potential disadvantages of virtual supporters in therapeutic contexts, including concerns about over-reliance and the need for appropriate boundaries.

Impact 4.0 · Import 4.0 · Pop 3.5
#404
arXiv Agents 2026-04-17
by Xujin Chen, Xiaodong Hu, Changjun Wang, Yuchun Xiong et al.

This paper studies an online trading variant of the classical secretary problem, called secretary problem variant trading (SPVT), from the perspective of an intermediary who facilitates trade between a seller and $n$ buyers (collectively referred to as agents). The seller has an item, and each buyer demands the item. These agents arrive sequentially in a uniformly random order to meet the intermediary, each revealing their valuation of the item upon arrival. After each arrival, the intermediary must make an immediate and irrevocable decision before the next agent appears. The intermediary's objective is to maximize the price of the agent who ultimately holds the item at the end of the process. We evaluate the performance of online algorithms for SPVT using two notions of competitive ratio: strong and weak. The strong notion benchmarks the online algorithm against a powerful offline optimum: the highest price among the $n+1$ agents. We propose an online algorithm for SPVT achieving a strong competitive ratio of $\frac{4e^2}{e^2+1} \approx 3.523$, which is the best possible even when the seller's price may be zero. This tight ratio closes the gap between the previous best upper bound
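The paper does not spell out its algorithm in the abstract, but the classical wait-then-pick secretary rule that SPVT-style algorithms generalize can be sketched in a few lines; the function name below is illustrative, not from the paper. The snippet also evaluates the reported tight ratio $\frac{4e^2}{e^2+1}$.

```python
import math

def classic_secretary(values):
    """Classical 1/e rule: observe the first n/e arrivals without
    committing, then accept the first later arrival that beats
    every value seen so far (decisions are irrevocable)."""
    n = len(values)
    cutoff = max(1, round(n / math.e))
    best_seen = max(values[:cutoff])
    for v in values[cutoff:]:
        if v > best_seen:
            return v          # accept this agent on the spot
    return values[-1]         # forced to take the last arrival

# The tight strong competitive ratio reported for SPVT:
ratio = 4 * math.e**2 / (math.e**2 + 1)
print(round(ratio, 3))  # 3.523
```

The 1/e rule is only a baseline for intuition; SPVT's intermediary setting (with $n+1$ agents and a seller-side price) requires the more involved algorithm the paper proposes.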

Impact 4.0 · Import 4.0 · Pop 3.5
#405
arXiv Agents 2026-04-17
by Bo Zhao, Kairui Guo, Runnan Du, Haiyang Sun et al.

Instruction guided image editing has advanced substantially with recent generative models, yet it still fails to produce reliable results across many seemingly simple cases. We observe that a large portion of these failures stem not from insufficient model capacity, but from poorly formulated editing tasks, such as those involving small targets, implicit spatial relations, or under-specified instructions. In this work, we frame image editing failures as a task formulation problem and propose an adaptive task reformulation framework that improves editing performance without modifying the underlying model. Our key idea is to transform the original image-instruction pair into a sequence of operations that are dynamically determined and executed by an MLLM agent through analysis, routing, reformulation, and feedback-driven refinement. Experiments on multiple benchmarks, including ImgEdit, PICA, and RePlan, across diverse editing backbones such as Qwen Image Edit and Nano Banana, show consistent improvements, with especially large gains on challenging cases. These results suggest that task reformulation is a critical but underexplored factor, and that substantial gains can be achieved by

Impact 4.0 · Import 4.0 · Pop 3.5
#406
arXiv Agents 2026-04-17
by Xiquan Li, Aurian Quelennec, Slim Essid

Music understanding and reasoning are central challenges in the Music Information Research field, with applications ranging from retrieval and recommendation to music agents and virtual assistants. Recent Large Audio-Language Models (LALMs) have shown remarkable progress in answering music-related questions by following user instructions. However, their massive scale, often billions of parameters, results in expensive training, slow inference, and limited deployability on edge devices. In this work, we present TinyMU, a lightweight (229M) Music-Language Model (MLM) that achieves performance comparable to much larger LALMs while remaining efficient and compact. To train TinyMU, we introduce MusicSkills-3.5M, a carefully curated, music-grounded question-answering dataset with 3.5M samples. Spanning multiple-choice, binary, and open-ended formats, this dataset provides fine-grained supervision across diverse musical concepts. For its architecture, TinyMU leverages MATPAC++, the SOTA self-supervised audio encoder for fine-grained feature extraction. Paired with a lightweight linear projector, it efficiently aligns audio embeddings with the language model. Through extensive evaluation,

Impact 4.0 · Import 4.0 · Pop 3.5
#407
arXiv Agents 2026-04-17
by Ze Dong, Hao Shi, Zejia Gao, Zhonghua Yi et al.

Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on realistic egocentric screen-view observations. Training o

Impact 4.0 · Import 4.0 · Pop 3.5
#408
arXiv Agents 2026-04-17
by Tianqi Luo, Leixian Shen, Yuyu Luo

Agentic visual analytics (VA) represents an emerging class of systems in which large language model (LLM)-driven agents autonomously plan, execute, evaluate, and iterate across the full visual analytics pipeline. By shifting users from low-level tool operations to high-level analytical goals expressed through natural language, these systems are fundamentally transforming how humans interact with data. However, the rapid proliferation of such systems in recent years has outpaced our understanding of their design landscape. Two intertwined problems remain open: how do autonomous agents reshape the traditional VA pipeline, and how must human involvement adapt as agent autonomy increases? To address these questions, this paper presents a comprehensive survey of 55 primary agentic VA systems and introduces a co-evolutionary framework. This framework is essential because it jointly analyzes the progression of agent autonomy alongside the necessary shift in human roles from manual operators to strategic supervisors. Within this framework, we define a role-workflow taxonomy that aligns four key agentic roles (PLANNER, CREATOR, REVIEWER, and CONTEXT MANAGER) and maps them onto established V

Impact 4.0 · Import 4.0 · Pop 3.5
#409
arXiv MechInterp 2026-04-17
by Dorival Gonçalves, Ajay Kaladharan, Alberto Navarro

Angular correlations in Higgs decays to electroweak gauge bosons, $h \to ZZ^*, WW^*$, provide a powerful probe of both new physics effects and quantum information observables. We present a systematic study of semi-leptonic decays $h \to V V^* \to \ell^+\ell^- q\bar{q}$ and $\ell^\pm ν_\ell q\bar{q}'$, including finite final state fermion masses, NLO QCD, and NLO electroweak corrections. We show that finite final state quark masses can induce effects that go beyond the two-qutrit description in more inclusive regimes, while remaining controllable with suitable kinematic selections. QCD corrections lead to modest percent-level shifts, whereas electroweak corrections can significantly modify the angular structure, particularly in the $h\to ZZ^*$ channels. We assess the impact of these effects on the reconstructed density matrix and entanglement measures, finding that, while they modify the angular observables, semi-leptonic channels retain an effective two-qutrit description.

Impact 4.0 · Import 4.0 · Pop 3.5
#410
arXiv MechInterp 2026-04-17
by Wieland Schöllkopf, Sandy Gewinner, Marco De Pas, Heinz Junkes et al.

We report on the design and performance of a two-color dual-oscillator infrared free-electron laser (FEL). The mid-infrared (MIR) FEL at the Fritz Haber Institute (FHI FEL) has been upgraded to include a second oscillator FEL beamline that permits lasing in the far-infrared (FIR) regime from 4.5 μm to 175 μm. In addition, a 500 MHz kicker cavity has been installed downstream of the electron accelerator. It can deflect electron bunches of up to 50 MeV energy alternately left and right by an angle of ±2°, and can thus split the high-repetition-rate (1 GHz) electron bunch train from the accelerator into two bunch trains of 500 MHz repetition rate each; one is steered to the MIR FEL and the other one to the new FIR FEL. In this two-color mode of simultaneous, synchronized operation the wavelengths in both FELs can be tuned independently over wide ranges of up to a factor of four each by undulator-gap variation. In addition, two-color operation is also available at reduced repetition rates (e.g. 55.6 MHz of both MIR and FIR pulses), as needed for some applications. This unique two-color mode opens up a wealth of novel user applications such as MIR-FIR pump-probe experiments.

Impact 4.0 · Import 4.0 · Pop 3.5
#411
arXiv MechInterp 2026-04-17
by Toru Harada, Yoshiharu Hirabayashi

We study the production of $^3_Λ$H and $^4_Λ$H in the $^{3,4}$He($K^-$,$π^0$) reactions at $p_{K^-}=1.0$~GeV/$c$ within the distorted-wave impulse approximation, using the optimal Fermi-averaged $K^-p\toπ^0Λ$ amplitude. Because the $^3_Λ$H ground state is extremely weakly bound, the $d$--$Λ$ wave function becomes spatially extended. We calculate the integrated cross sections $σ_{\rm lab}$ and their ratio $R_{34}=σ_{\rm lab}(^3_Λ{\rm H})/σ_{\rm lab}(^4_Λ{\rm H})$ for forward angles $θ_{\rm lab}=0^\circ$--$20^\circ$. The production strength of $^3_Λ$H and the ratio $R_{34}$ are strongly sensitive to the $Λ$ binding energy $B_Λ$, which is constrained to be approximately 0.05--0.15~MeV by comparison with experimental data from the J-PARC E73 experiment. This indicates that the $^3$He($K^-$,$π^0$) reaction provides a sensitive probe of the weak binding of $^3_Λ$H.

Impact 4.0 · Import 4.0 · Pop 3.5
#412
arXiv MechInterp 2026-04-17
by Euclid Collaboration, A. A. Tumborang, K. I. Caputi, P. Rinaldi et al.

Little Red Dots (LRDs) are some of the most intriguing galaxy populations recently identified at z>~4 with JWST. They constitute the most extreme class of a more abundant population of sources with `V-shaped' spectral energy distributions (SEDs) and compact morphologies, which also includes Little Blue Dots (LBDs). Finding brighter analogues to these sources requires surveying sky areas which are significantly larger than those covered with JWST. Euclid deep images are ideally suited for this purpose. We make use of Euclid near-infrared images, complemented by Spitzer Infrared Array Camera (IRAC) data, over 0.75 sq. deg. of the COSMOS field to select a sample of 233 sources with `V-shaped' SEDs at z>4. Out of those, we identify 16 sources with compactness >1sigma above the median of all z>4 galaxies, which we consider robust LRD/LBD candidates in our sample. The stellar masses of these 16 sources are in the range 10^{8.5} - 10^{10.5} Msun, so they are significantly more massive than typical JWST-selected LRDs/LBDs. Interestingly, half of them are about as old as the Universe at their redshifts. In addition, we find that the median photometric properties of the Euclid LRDs/LBDs are sim

Impact 4.0 · Import 4.0 · Pop 3.5
#413
arXiv MechInterp 2026-04-17
by Julia Gehrlein, Jaime Hoefken Zink, Pedro A. N. Machado, João Paulo Pinheiro

At long-baseline neutrino experiments, neutral-current (NC) events accumulate in large numbers but are seldom exploited for new physics searches. We demonstrate their potential using non-standard neutrino interactions (NSI) with quarks as a case study. Charged-current (CC) analyses constrain NSI through matter effects on neutrino propagation, which probe almost exclusively the isoscalar combination of up- and down-quark couplings; the orthogonal isovector combination is suppressed by a factor of $\sim$100. Because NSI also modify NC cross sections in a flavor-dependent way, NC events become sensitive to oscillations: the far-to-near detector ratio acquires a dependence on the beam's flavor composition that probes both isoscalar and isovector couplings with comparable weight. Using existing NOvA data and DUNE projections, we derive the first bounded constraints on isovector NSI from a long-baseline experiment and show that combining CC and NC measurements resolves the individual quark couplings, breaking a degeneracy that persists in either analysis alone.

Impact 4.0 · Import 4.0 · Pop 3.5
#414
arXiv MechInterp 2026-04-17
by Peter F. Wyper, Jonathan Squire, Etienne Pariat, Oleksiy V. Agapitov et al.

Magnetic switchbacks are large amplitude deflections of the magnetic field within the solar wind. They are Alfvénic in character and so are associated with a spike in velocity and a generally small variation in local plasma density. Early orbits of Parker Solar Probe revealed that the solar wind near the Sun is dominated by these structures, and therefore, they may be playing an important role in the energy budget and acceleration of the young solar wind. In this review, we present an overview of different mechanisms that have been proposed for how switchbacks could be formed. We group the mechanisms by whether they predominantly act in the low solar atmosphere or within the solar wind (in situ). We focus on mechanisms that can create reversals of the ambient magnetic field direction and, thus, account for the most extreme perturbations. The general consensus is that mechanisms in the lower solar atmosphere do not form such reversals on their own but provide the seed perturbations, flows, or particle beams necessary for in situ mechanisms to create switchbacks within the solar wind. Switchback observations thus likely contain an imprint of the coronal source of the seed perturbatio

Impact 4.0 · Import 4.0 · Pop 3.5
#415
arXiv MechInterp 2026-04-17
by Long Xiong, Xiaoyang Wang, Xiaoxia Cai, Xiao Yuan

Nonlinear spectroscopy is a cornerstone of quantum science, providing unique access to multi-point correlations, quantum coherence, and couplings that are invisible to linear methods. However, classical simulation of these phenomena is fundamentally limited by the exponential growth of the Hilbert space, and practical quantum algorithms for the nonlinear regime have remained largely unexplored. Here, we present a unified quantum algorithmic framework for computing $n$-th order nonlinear spectroscopies. By reformulating multi-time responses as a weighted sum of expectation values at finite pump amplitudes via a generalized parameter shift rule, our approach bypasses the costly evaluation of high-order commutators and time-dependent operator expansions. This reformulation enables efficient execution via real-time evolution on current quantum hardware, ensuring inherent noise resilience. We validate the framework on IBM's superconducting quantum processors, successfully obtaining higher-order response functions of a 12-qubit XXZ spin chain. Furthermore, the versatility of our method is demonstrated by resolving quasi-particle excitation spectra in spin-liquids and identifying interaction

Impact 4.0 · Import 4.0 · Pop 3.5
#416
arXiv MechInterp 2026-04-17
by J Krishnamoorthi, Anil Kumar, Sanjib Kumar Agarwalla

Atmospheric neutrinos provide a unique avenue to probe theories beyond the Standard Model (BSM) over a wide range of energies and path lengths. The theory of non-standard interactions (NSI) of neutrinos is one of the important BSM scenarios, which can modify flavor oscillations of atmospheric neutrinos traveling through the Earth. In this work, we use a high-purity $ν_μ$ charged-current (CC) sample of atmospheric neutrinos from IceCube DeepCore with a livetime of 7.5 years to search for the NSI parameters $\varepsilon_{eμ}$, $\varepsilon_{eτ}$, and $\varepsilon_{ee}-\varepsilon_{μμ}$. The $ν_μ$ CC events mainly come from the $ν_μ$ survival channel having no significant dependence on $δ_{\rm CP}$. Therefore, the constraints on $\varepsilon_{eμ}$ and $\varepsilon_{eτ}$ obtained using this $ν_μ$ CC sample are expected to be free from the $δ_{\rm CP}$-degeneracy. The data sample is found to be in agreement with the standard neutrino interactions. Therefore, we place bounds on these NSI parameters that are consistent with and comparable to existing experimental constraints. These $δ_{\rm CP}$-free constraints from IceCube DeepCore are complementary to those from the long-baseline neutri

Impact 4.0 · Import 4.0 · Pop 3.5
#417
arXiv MechInterp 2026-04-17
by Joao R. L. Santos, Guillem Domènech, Amilcar R. Queiroz

Fast Radio Bursts (FRBs) are among the most intriguing phenomena observed in radio astronomy. So far, about 130 FRB signals have been confirmed and characterized by different surveys, and the CHIME telescope has recently reported a new catalog of 4539 bursts. Therefore, these numbers are expected to increase in the coming years. The detection, or lack thereof, of lensed FRB events can be used to probe Primordial Black Holes (PBHs) as a fraction of dark matter. We investigate the potential of three upcoming radio telescopes, LOFAR2.0, FAST Core Array, and BINGO, to test the PBH scenario. We forecast that LOFAR2.0 will constrain $f_{\mathrm{PBH}}$ for $M_{\rm PBH}>1\,{M_{\odot}}$, while FAST Core Array and BINGO will restrict $f_{\mathrm{PBH}}$ for $M_{\rm PBH}>10\,{M_{\odot}}$ and $M_{\rm PBH}>10^{-2}\,{M_{\odot}}$, respectively. Despite the existence of stricter constraints, FRB lensing offers an independent and complementary probe of PBHs in the Universe, which will improve in the future.

Impact 4.0 · Import 4.0 · Pop 3.5
#418
arXiv MechInterp 2026-04-17
by C. A. S. Almeida

We propose an effective non-relativistic framework in which wave-function collapse emerges as a deterministic dynamical instability induced by gravitational self-interaction and regulated by short-distance repulsion. The dynamics is described by a nonlinear Schrödinger equation supplemented by a phenomenological repulsive sector ensuring regularity at high densities. Using a variational Gaussian ansatz, we derive an explicit effective energy functional and show that extended quantum states lose stability beyond a critical mass scale. This loss of stability is associated with a bifurcation in the reduced dynamical system governing the wave-function width, leading to the emergence of stable localized configurations. Within this picture, collapse corresponds to the dynamical selection of one of these localized attractors, driven by infinitesimal asymmetries in the initial state and occurring without stochastic noise or environmental coupling. The mechanism provides a controlled and quantitative realization of gravity-induced localization, extending Schrödinger--Newton-type models while avoiding their pathological short-distance behavior. Possible implications for mesoscopic systems pr

Impact 4.0 · Import 4.0 · Pop 3.5
#419
arXiv MechInterp 2026-04-17
by Kishore C. Patra, Emily R. Liepold, Nicholas Earl, Ryan J. Foley et al.

Off-nuclear tidal disruption events (TDEs) provide a rare probe of massive black holes (MBHs) outside galactic nuclei. Only a handful are known, including five X-ray-selected candidates and two optically selected events. We present observations of TDE 2025abcr, the second optically selected off-nuclear TDE, discovered at a projected offset of $9.08 \pm 0.02$ kpc from the nucleus of its host galaxy. We analyze X-ray, UV, optical, and infrared (IR) data from Swift, Keck, ZTF, and JWST. Broad H and He emission lines in the optical and IR confirm a TDE-H-He classification. From luminosity scaling relations and MOSFiT modeling, we infer a BH mass of $10^{6}$-$10^{7}\,M_{\odot}$, substantially smaller than the $10^{8.35 \pm 0.41}\,M_{\odot}$ BH inferred for the host-galaxy nucleus. We observe velocity evolution in the N III + He II emission complex, shifting from $-500$ km s$^{-1}$ at day $-7$ to $+1000$ km s$^{-1}$ by day $+29$, which we interpret as radiative transfer effects in an evolving reprocessing layer. The IR SED deviates from a thermal blackbody, with $νL_ν \propto λ^{-2.13 \pm 0.04}$, significantly shallower than the Rayleigh-Jeans slope of $λ^{-3}$. We rule out dust as the s

Impact 4.0 · Import 4.0 · Pop 3.5
#420
arXiv MechInterp 2026-04-17
by Caroline Tornow, Julia Rupprecht, Pascal Engeler, Ute Drechsler et al.

Topological insulators are typically characterized by particularly stable properties, such as global invariants, and can be identified by probing their robust surface states. A recently discovered novel form of band topology, delicate topology, challenges this paradigm: its defining property, multicellularity, can be removed by introducing a coupling to local orbitals anywhere in the spectrum, even far above the relevant band gap. This makes it hard to diagnose delicate topology with conventional probes that access only low-energy degrees of freedom. Here, we introduce strong local impurities as a spectroscopic probe of a delicate topological insulator which we realize in a phononic metamaterial. By tuning the impurity strength and performing orbital-resolved readout, we observe recently proposed indicators of topology: ring states, in-gap bound states whose frequencies remain pinned in the strong-impurity limit while their real-space profiles form a pronounced ring around the impurity site. We find that these ring states persist even when the multicellularity in our system is removed by a weakly hybridizing additional orbital. Our results establish impurity-induced ring states as

Impact 4.0 · Import 4.0 · Pop 3.5
#421
arXiv MechInterp 2026-04-17
by Debtroy Das, Swarnim Shashank, Cosimo Bambi

Gravitational wave (GW) observations of binary black hole (BBH) mergers provide a unique opportunity to probe the nature of spacetime in the strong-field and dynamical regime. We present updated constraints on deviations from the Kerr metric using BBH inspirals from the fourth Gravitational-Wave Transient Catalog (GWTC-4). Building on our previous GWTC-3 analysis, we employ a theory-agnostic framework in which non-Kerr effects of the Johannsen metric are incorporated as parametrized corrections to the GW phase within the post-Newtonian framework. We perform Bayesian parameter estimation on a selected subset of GWTC-4 events to constrain the deformation parameters $α_{13}$ and $ε_3$, yielding significantly tighter bounds compared to earlier results. When varied individually, the deformation parameters are found to be consistent with zero, providing no evidence for departures from the Kerr geometry. Our results reinforce the validity of General Relativity, particularly the Kerr hypothesis, and highlight the progress enabled by GWTC-4.

Impact 4.0 · Import 4.0 · Pop 3.5
#422
arXiv MechInterp 2026-04-17
by Z. Tavukoğlu, A. T. Olgun, K. Azizi

We investigate rare special dileptonic decays of $ Λ_b$, $Σ_b$ and $Ξ_b $ baryons in the Standard Model and in the context of the general Two-Higgs-Doublet Model with Type III. Specifically, we consider the decays $ Λ_b \rightarrow Λ\ell^+ \ell^-$, $Σ_b \rightarrow Σ\ell^+ \ell^-$ and $Ξ_b \rightarrow Ξ\ell^+ \ell^-$, where $\ell$ represents a $μ$ or $τ$ lepton. By studying these rare decays, we aim to assess the impact of the Two-Higgs-Doublet Model with Type III on various observables, such as the differential branching ratio, total branching ratio, and lepton forward-backward asymmetries, using the decay amplitude and the transition matrix elements in terms of form factors calculated via light cone QCD in the full theory. We compare our results to those of the Standard Model, as well as existing lattice QCD predictions and experimental data, to assess the agreement and viability of the Two-Higgs-Doublet Model with Type III. Furthermore, we highlight the potential for experimental investigations of these decay channels in the near future. The soon-to-be updated LHCb and/or Belle II detectors, renowned for their capabilities in studying rare decays, present excellent opportunities for probing t

Impact 4.0 · Import 4.0 · Pop 3.5
#423
arXiv GenMedia 2026-04-17
by Jun Li, Lizhi Xiong, Ziqiang Li, Weiwei Jiang et al.

Text-to-image generative models have achieved impressive fidelity and diversity, but can inadvertently produce unsafe or undesirable content due to implicit biases embedded in large-scale training datasets. Existing concept erasure methods, whether text-only or image-assisted, face trade-offs: textual approaches often fail to fully suppress concepts, while naive image-guided methods risk over-erasing unrelated content. We propose TICoE, a Text-Image Collaborative Erasing framework that achieves precise and faithful concept removal through a continuous convex concept manifold and hierarchical visual representation learning. TICoE precisely removes target concepts while preserving unrelated semantic and visual content. To objectively assess the quality of erasure, we further introduce a fidelity-oriented evaluation strategy that measures post-erasure usability. Experiments on multiple benchmarks show that TICoE surpasses prior methods in concept removal precision and content fidelity, enabling safer, more controllable text-to-image generation. Our code is available at https://github.com/OpenAscent-L/TICoE.git

Impact 4.0 · Import 4.0 · Pop 3.5
#424
arXiv GenMedia 2026-04-17
by Jingyuan Li, Xiaoyi Jiang, Fukang Wen, Wei Liu et al.

Discrete diffusion models based on continuous-time Markov chains (CTMCs) have shown strong performance on language and discrete data generation, yet existing approaches typically parameterize the reverse rate matrix as a single object -- via concrete scores, clean-data predictions ($x_0$-parameterization), or denoising distributions -- rather than aligning the parameterization with the intrinsic CTMC decomposition into jump timing and jump direction. Since a CTMC is fundamentally a Poisson process fully determined by these two quantities, decomposing along this structure is closer to first principles and naturally leads to our formulation. We propose \textbf{Neural CTMC}, which separately parameterizes the reverse process through an \emph{exit rate} (when to jump) and a \emph{jump distribution} (where to jump) using two dedicated network heads. We show that the evidence lower bound (ELBO) differs from a path-space KL divergence between the true and learned reverse processes by a $θ$-independent constant, so that the training objective is fully governed by the exit rate and jump distribution we parameterize. Moreover, this KL factorizes into a Poisson KL for timing and a categorical
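The timing/direction decomposition at the heart of this abstract is just the standard generative description of a CTMC as a marked Poisson process. The sketch below simulates one reverse-chain event from toy "two-head" outputs; all names and numbers are illustrative and not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_jump(exit_rate, jump_probs):
    """One reverse-CTMC event, decomposed as the abstract describes:
    *when* to jump (exponential dwell time governed by the exit rate)
    and *where* to jump (a categorical over destination states)."""
    dwell = rng.exponential(1.0 / exit_rate)            # timing head
    dest = rng.choice(len(jump_probs), p=jump_probs)    # direction head
    return dwell, dest

# Toy head outputs for a 4-state chain (illustrative numbers):
exit_rate = 2.0                        # expected dwell time = 0.5
jump_probs = np.array([0.1, 0.2, 0.3, 0.4])
dwell, dest = sample_jump(exit_rate, jump_probs)
```

In the Neural CTMC formulation, `exit_rate` and `jump_probs` would each come from a dedicated network head, and the ELBO splits accordingly into a Poisson term for timing and a categorical term for direction.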

Impact 4.0 · Import 4.0 · Pop 3.5
#425
arXiv GenMedia 2026-04-17
by Sumit Chaturvedi, Yannick Hold-Geoffroy, Mengwei Ren, Jingyuan Liu et al.

This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly. Pro

Impact 4.0 · Import 4.0 · Pop 3.5
#426
arXiv GenMedia 2026-04-17
by Huanran Hu, Zihui Ren, Dingyi Yang, Liangyu Chen et al.

Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and lack support for evaluating this full process. To address this gap, we propose Multimodal Context-to-Script Creation (MCSC), a new task that transforms noisy multimodal inputs and user instructions into structured, executable video scripts. We further introduce MCSC-Bench, the first large-scale MCSC dataset, comprising 11K+ well-annotated videos. Each sample includes: (1) redundant multimodal materials and user instructions; (2) a coherent, production-ready script containing material-based shots, newly planned shots (with shooting instructions), and shot-aligned voiceovers. MCSC-Bench supports comprehensive evaluation across material selection, narrative planning, and conditioned script generation, and includes both in-domain and out-of-domain test sets. Experiments show that current multimodal LLMs struggle with structure-aware reasoning under long contexts, highlighting the challenges posed by our benchmark

Impact 4.0 · Import 4.0 · Pop 3.5
#427
arXiv PostTraining 2026-04-17
by Yuanhang Luo, Shuxing Fang, Ruijian Han, Yiming Xu

Classical latent-score ranking models often fail to distinguish objects' intrinsic scores from contextual effects, which are typically nonlinear and can dominate the observed outcomes. To address this, we introduce a semiparametric ranking framework in which the log-score of each object is modeled as the sum of a utility parameter and a nonparametric covariate effect. Within this framework, we establish model identifiability under mild regularity and connectivity conditions. For estimation, we approximate the covariate effect using a neural network and estimate the parameters via maximum likelihood. Under random design assumptions, we prove that the resulting estimator exists with high probability and derive non-asymptotic error bounds that achieve minimax optimality for both the parametric and nonparametric components. Numerical experiments on both synthetic data and an ATP tennis dataset are conducted to support our findings.
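An additive log-score model of this kind implies a Bradley-Terry-style pairwise comparison probability, which the sketch below illustrates with a fixed covariate effect; `win_prob` and the numbers are illustrative, not from the paper, where the covariate effect is a learned neural network.

```python
import math

def win_prob(u_i, u_j, g_xi, g_xj):
    """P(i beats j) when each object's log-score is its utility
    parameter plus a nonparametric covariate effect, compared via
    a Bradley-Terry (logistic) rule."""
    s_i = u_i + g_xi   # log-score of object i in this context
    s_j = u_j + g_xj   # log-score of object j in this context
    return 1.0 / (1.0 + math.exp(s_j - s_i))

# Equal utilities, but context gives i a one-unit edge:
p = win_prob(0.0, 0.0, 1.0, 0.0)
print(round(p, 3))  # 0.731
```

This makes the identifiability question concrete: with no constraints, a shift moved between the utilities and the covariate effect leaves every `win_prob` unchanged, which is why the paper needs its regularity and connectivity conditions.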

Impact 4.0 · Import 4.0 · Pop 3.5
#428
arXiv PostTraining 2026-04-17
by Taewoong Kang, Hyojin Jang, Sohyun Jeong, Seunggi Moon et al.

Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping, where one's head is seamlessly integrated with another's body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identit

Impact 4.0 · Import 4.0 · Pop 3.5
#429
arXiv PostTraining 2026-04-17
by Jun-Chao Liang, Yin-Bi Li, A-Li Luo, Shuo Li et al.

To improve the accuracy and efficiency of high-dimensional stellar parameter inference in large spectroscopic datasets, we propose a projection-assisted parameter-inference framework -- Projected-Space Inference of Stellar Parameters (PISP). PISP constructs an orthonormal basis and optimizes in the projected space, reducing the impact of parameter correlations on inference. The basis is constructed using either principal component analysis (PCA) or the active-subspace (AS) method and is combined with two inference strategies -- Non-L1, which optimizes the projection coefficients for a user-specified projected dimensionality, and L1, which introduces L1 regularization in the full projected space to adaptively select projection directions -- yielding four strategies: PCA-Non-L1, AS-Non-L1, PCA-L1, and AS-L1. For different computational scenarios, we implement two versions: PISP-CurveFit for fast single-spectrum inference and PISP-Adam for large-scale GPU-parallel inference. Using a fully connected neural network and a residual network as spectral emulators, we evaluate PISP on Kurucz synthetic spectra and on 722,896 APOGEE DR17 observed spectra. Compared to the baseline strateg
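In symbols, the projected-space idea can be sketched as follows; the notation ($F$ for the spectral emulator, $y$ for the observed spectrum, $\bar{\theta}$ for a reference parameter vector) is assumed here rather than taken from the paper:

```latex
% Orthonormal basis V from PCA or the active-subspace method;
% parameters are optimized in the projected space \theta = \bar{\theta} + Vz.
% Non-L1: fix the projected dimensionality k and optimize over z in R^k
\hat{z} = \operatorname*{arg\,min}_{z \in \mathbb{R}^k} \, \big\| F(\bar{\theta} + V z) - y \big\|^2
% L1: optimize in the full projected space, with an L1 penalty that
% adaptively selects projection directions
\hat{z} = \operatorname*{arg\,min}_{z} \, \big\| F(\bar{\theta} + V z) - y \big\|^2 + \lambda \lVert z \rVert_1
```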

Impact 4.0 · Import 4.0 · Pop 3.5
#430
arXiv PostTraining 2026-04-17
by Xiaoyu Yang, En Yu, Wei Duan, Jie Lu

Reinforcement Fine-Tuning (RFT) has established itself as a critical paradigm for aligning Multi-modal Large Language Models (MLLMs) with complex human values and domain-specific requirements. Nevertheless, current research primarily focuses on mitigating exogenous distribution shifts arising from data-centric factors, while the non-stationarity inherent in endogenous reasoning remains largely unexplored. In this work, a critical vulnerability is revealed within MLLMs: they are highly susceptible to endogenous reasoning drift, across both thinking and perception perspectives. It manifests as unpredictable distribution changes that emerge spontaneously during the autoregressive generation process, independent of external environmental perturbations. To address this, we first theoretically define endogenous reasoning drift within the RFT of MLLMs as multi-modal concept drift. In this context, this paper proposes Counterfactual Preference Optimization++ (CPO++), a comprehensive and autonomous framework adapted to multi-modal concept drift. It integrates counterfactual reasoning with domain knowledge to execute controlled perturbations across thinking and perception, employi

Impact 4.0 · Import 4.0 · Pop 3.5
#431
arXiv PostTraining 2026-04-17
by Jixuan Leng, Si Si, Hsiang-Fu Yu, Vinod Raman et al.

Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in preference datasets that typically contain multiple candidate responses. Motivated by this limitation, recent work explores group-wise preference optimization, which jointly contrasts multiple responses for the same prompt, but its empirical behavior and scalability remain underexplored due to the memory overhead of group-coupled objectives. In this work, we introduce a memory-efficient group-wise preference optimization algorithm that preserves gradients while decoupling samples during backpropagation, substantially reducing peak memory usage, which enables scalable training with larger group sizes. Across both offline and online alignment settings, we show that leveraging multiple responses consistently outperforms single-pair training. Furthermore, incorporating a negative log-likelihood (NLL) term on positive responses is critical for both performance gains and training stability.
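The core trick described here (preserving gradients while decoupling samples during backpropagation) can be illustrated on a toy linear scorer of my own construction, not the paper's method: for a group-wise softmax objective, the gradient with respect to each response's score is just the softmax minus the one-hot preference, so the backward pass can run one sample at a time against those cached scalars.

```python
import math

# Toy linear scorer: reward r_i = w . x_i for each candidate response x_i.
# Group-wise objective: L = -log softmax(r)[pos], contrasting all candidate
# responses for one prompt jointly. (Illustrative setup, not the paper's model.)
w = [0.1, -0.2, 0.3]
group = [[1.0, 0.0, 2.0],
         [0.5, 1.0, -1.0],
         [-0.3, 0.8, 0.4],
         [2.0, -1.5, 0.7]]
pos = 0  # index of the preferred response

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def loss(w):
    r = [dot(w, x) for x in group]
    m = max(r)
    log_z = m + math.log(sum(math.exp(v - m) for v in r))
    return log_z - r[pos]

# Pass 1: score the whole group; keep only the scalar gradients dL/dr_i.
r = [dot(w, x) for x in group]
m = max(r)
e = [math.exp(v - m) for v in r]
p = [v / sum(e) for v in e]                    # softmax over the group
dL_dr = [p[i] - (1.0 if i == pos else 0.0) for i in range(len(group))]

# Pass 2: backpropagate one sample at a time using the cached dL/dr_i, so
# only one sample's activations need to be live at any moment.
grad = [0.0] * len(w)
for i, x in enumerate(group):
    for j in range(len(w)):                    # d r_i / d w_j = x_j here
        grad[j] += dL_dr[i] * x[j]

# Finite-difference check: the decoupled gradient matches the coupled loss.
eps = 1e-6
for j in range(len(w)):
    wp = list(w); wp[j] += eps
    wm = list(w); wm[j] -= eps
    assert abs((loss(wp) - loss(wm)) / (2 * eps) - grad[j]) < 1e-5
```

The same identity is what makes group-coupled objectives scalable: peak memory grows with one sample's computation graph rather than the whole group's.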

Impact 4.0 · Import 4.0 · Pop 3.5
#432
arXiv Efficiency 2026-04-17
by Djordje Bogdanović, Marija Dimitrijević Ćirić, Richard J. Szabo

We construct cubic scalar field theory on $λ$-Minkowski space by combining the Batalin-Vilkovisky formalism with harmonic analysis, and produce two inequivalent noncommutative quantum field theories. The braided theory is based on a braided $L_\infty$-algebra whereby covariance dictates a spectral decomposition into cylindrical Bessel functions that diagonalise the angular Drinfel'd twist; in this theory we find the usual logarithmic ultraviolet divergences and confirm the absence of UV/IR mixing. The standard noncommutative theory is based on a classical $L_\infty$-algebra; in this theory we relate the spectral decompositions into plane wave and cylindrical harmonic eigenmodes of the Klein-Gordon operator, we verify the planar equivalence theorem, and we demonstrate a periodic form of UV/IR mixing in which non-planar correlators are generically ultraviolet finite but become non-analytic on an infinite lattice of exceptional momenta.

Impact 4.0 · Import 4.0 · Pop 3.5
#433
arXiv Efficiency 2026-04-17
by Yeganeh Abdollahinejad, Ahmad Mousavi, Naeemul Hassan, Kai Shu et al.

The widespread dissemination of multimodal content on social media has made misinformation detection increasingly challenging, as misleading narratives often arise not only from textual or visual content alone, but also from semantic inconsistencies between modalities and their evolution over time. Existing multimodal misinformation detection methods typically model cross-modal interactions statically and often show limited robustness across heterogeneous datasets, domains, and narrative settings. To address these challenges, we propose MOMENTA, a unified framework for multimodal misinformation detection that captures modality heterogeneity, cross-modal inconsistency, temporal dynamics, and cross-domain generalization within a single architecture. MOMENTA employs modality-specific mixture-of-experts modules to model diverse misinformation patterns, bidirectional co-attention to align textual and visual representations in a shared semantic space, and a discrepancy-aware branch to explicitly capture semantic disagreement between modalities. To model narrative evolution, we introduce an attention-based temporal aggregation mechanism with drift and momentum encoding over overlapping ti

Impact 4.0 · Import 4.0 · Pop 3.5
#434
arXiv Efficiency 2026-04-17
by Eric W. Fischer, Michael Roemelt

We extend our recent work on the cavity-modified spin Zeeman effect of an effective spin-1/2 system [J. Chem. Phys. 163, 174307 (2025)] to a relativistic Jahn-Teller scenario under strong light-matter coupling. Here, the effective spin-1/2 system is realized via a single electron or a single hole in a doubly-degenerate molecular orbital system of trigonal symmetric transition metal complexes. Both single-particle and single-hole systems are subject to vibronic and spin-orbit coupling (SOC), augmented by interactions with a quantized cavity field via the cavity Zeeman interaction. Methodologically, we combine the relativistic $E\times e$-Jahn-Teller model with a recently introduced effective Hamiltonian formalism based on quasi-degenerate perturbation theory, which treats the cavity-spin interaction at leading order beyond the dipole approximation. We derive analytic expressions for Kramers pair energies in the weak and strong SOC regimes, as well as related cavity-modified effective electronic g-factors. We find cavity-induced modifications of the electronic g-factor to become relevant in the weak SOC regime for both single-particle and single-hole systems while being effectively que

Impact 4.0 · Import 4.0 · Pop 3.5
#435
arXiv Efficiency 2026-04-17
by Markus Bachmayr, Sebastian Krämer, Max Pfeffer

We consider an iterative eigensolver for Schrödinger equations that constructs low-rank approximations of eigenfunctions with accuracy-adapted ranks, with particular focus on fermionic Schrödinger equations in second-quantized form and on matrix product state approximations enforcing particle number conservation. We provide a complete analysis of a solver based on preconditioned inverse iteration combined with rank truncation and propose a generalization to subspace iteration for the joint approximation of several eigenspaces. The practical performance of the method is illustrated by numerical tests for several model problems.
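For context on the method being analyzed, the unpreconditioned core of inverse iteration is easy to state: repeatedly solve a shifted linear system and normalize, converging to the eigenpair nearest the shift. The 2x2 sketch below is a minimal illustration only; the preconditioning, rank truncation, and second-quantized structure that the paper actually analyzes are omitted.

```python
# Minimal inverse iteration on a 2x2 symmetric matrix (eigenvalues 1 and 3).
# Only the basic step is shown; preconditioning and low-rank (matrix product
# state) truncation from the paper are omitted.
A = [[2.0, 1.0], [1.0, 2.0]]
shift = 0.9            # target the eigenvalue nearest 0.9, i.e. lambda = 1

def solve2(M, b):
    # Direct 2x2 solve via Cramer's rule.
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * b[0] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

x = [1.0, 0.3]         # arbitrary start vector
for _ in range(50):
    M = [[A[0][0] - shift, A[0][1]], [A[1][0], A[1][1] - shift]]
    y = solve2(M, x)   # solve (A - shift*I) y = x, then normalize
    n = (y[0] ** 2 + y[1] ** 2) ** 0.5
    x = [y[0] / n, y[1] / n]

# Rayleigh quotient recovers the eigenvalue nearest the shift.
Ax = [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]
lam = Ax[0] * x[0] + Ax[1] * x[1]
assert abs(lam - 1.0) < 1e-9
```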

Impact 4.0 · Import 4.0 · Pop 3.5
#436
arXiv Efficiency 2026-04-17
by Yimin Huang, Tiancheng Qi, Quanshui Wu, Ruipeng Zhu

In this paper, we study the Chevalley property of Cayley-Hamilton Hopf algebras in the sense of De Concini-Procesi-Reshetikhin-Rosso using discriminant ideals. For any affine Cayley-Hamilton Hopf algebra $(H,C,\text{tr})$ whose identity fiber algebra has the Chevalley property, we prove that an irreducible $H$-module $V$ has the property that $V\otimes W$ is a completely reducible $H$-module for every irreducible $H$-module $W$ if and only if $V$ is annihilated by the lowest discriminant ideal of $(H,C,\text{tr})$, which establishes a bridge between the tensor-nondegenerate behaviour of the irreducible representations of $H$ and the lowest discriminant ideal of $(H,C,\text{tr})$. Using discriminant ideals, we prove that an affine Cayley-Hamilton Hopf algebra $(H,C,\text{tr})$ has the Chevalley property if and only if its identity fiber algebra $H/\mathfrak{m}_{\overline{\varepsilon}}H$ has the Chevalley property and all the discriminant ideals of $(H,C,\text{tr})$ are trivial, thereby resolving a question posed by Huang-Mi-Qi-Wu. Moreover, it is shown that the lowest discriminant subvariety $\mathcal{V}_{\ell}$ of the algebraic group $\operatorname{maxSpec}C$ is a closed subgroup,

Impact 4.0 · Import 4.0 · Pop 3.5
#437
arXiv Efficiency 2026-04-17
by Jiaxin Ye, Gaoxiang Cong, Chenhui Wang, Xin-Cheng Wen et al.

Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at specific hierarchical levels during property matching. In this paper, leveraging the hierarchical structure of Residual Vector Quantization (RVQ)-based codec, we propose HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve strong audio-visual alignment. Specifically, since lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody, HiCoDiT employs low-level and high-level blocks to generate tokens at different levels. The low-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware content, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale adaptive instance layer normaliza

Impact 4.0 · Import 4.0 · Pop 3.5
#438
arXiv Efficiency 2026-04-17
by Toru Kojo, Sakura Itatani

We investigate the properties of neutral and charged mesons in magnetic fields, from weak-field to strong-field regimes. To develop analytic insights, we employ a non-relativistic quark model with a confining potential of the harmonic oscillator type. Short-range correlations, such as Coulomb and color-magnetic interactions, are treated as perturbations. In particular, we focus on the magnetic field dependence of the relative and the center-of-mass motions. The qualitative trends differ significantly between neutral and charged mesons: for neutral mesons, the transverse momenta are conserved and continuous, while charged mesons exhibit quantized transverse dynamics. The Zeeman effects, arising from intrinsic spins and orbital angular momenta, are carefully examined. In particular, for charged mesons with spins $s\ge 1$, we discuss how the zero-point energy in the internal quark motion cancels the Zeeman energy from the orbital angular momentum, ensuring the energetic stability of mesons with high spins. The effectively reduced dimensionality of these mesons in the strong-field limit is also discussed.

Impact 4.0 · Import 4.0 · Pop 3.5
#439
arXiv Efficiency 2026-04-17
by Oluwaleke Yusuf, M. Tsaqif Wismadi, Adil Rasheed

Urban bike-sharing systems require strategic station expansion to meet growing demand. Traditional allocation approaches rely on explicit demand modelling that may not capture the urban characteristics distinguishing successful stations. This study addresses the need to exploit patterns from existing stations to inform expansion decisions, particularly in data-constrained environments. We present a data-driven framework leveraging existing stations deemed desirable by operational metrics. A hybrid denoising autoencoder (HDAE) learns compressed latent representations from multi-source grid-level features (socio-demographic, built environment, and transport network), with a supervised classification head regularising the embedding space structure. Expansion candidates are selected via greedy allocation with spatial constraints based on latent-space similarity to existing stations. Evaluation on Trondheim's bike-sharing network demonstrates that HDAE embeddings yield more spatially coherent clusters and allocation patterns than raw features. Sensitivity analyses across similarity methods and distance metrics confirm robustness. A consensus-based procedure across multiple parametrisati
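The allocation step described above can be sketched without the learned HDAE itself: rank candidate grid cells by latent-space similarity to existing stations, then select greedily under a spatial constraint. All specifics below (cosine similarity, the centroid comparison, the toy embeddings and coordinates, the spacing threshold) are illustrative assumptions, not the paper's configuration.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def greedy_allocate(candidates, existing_embs, k, min_dist):
    """Pick up to k candidate cells most similar (in latent space) to the
    centroid of existing stations, enforcing a minimum pairwise spacing."""
    d = len(existing_embs[0])
    centroid = [sum(e[i] for e in existing_embs) / len(existing_embs)
                for i in range(d)]
    ranked = sorted(candidates, key=lambda c: -cosine(c["emb"], centroid))
    chosen = []
    for c in ranked:
        if all(math.dist(c["xy"], s["xy"]) >= min_dist for s in chosen):
            chosen.append(c)
            if len(chosen) == k:
                break
    return chosen

# Toy data: embeddings and grid coordinates are made up for illustration.
existing = [[1.0, 0.0], [0.9, 0.1]]
candidates = [
    {"xy": (0, 0), "emb": [0.95, 0.05]},
    {"xy": (0, 1), "emb": [0.90, 0.10]},   # similar, but within min_dist of (0, 0)
    {"xy": (5, 5), "emb": [0.80, 0.20]},
    {"xy": (9, 9), "emb": [0.10, 0.90]},   # dissimilar latent profile
]
picked = greedy_allocate(candidates, existing, k=2, min_dist=2.0)
# -> cells at (0, 0) and (5, 5): the second-ranked cell is skipped for spacing
```

The greedy-with-spacing structure is what the abstract calls "greedy allocation with spatial constraints"; the similarity function and centroid comparison are stand-ins for the paper's latent-space criterion.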

Impact 4.0 · Import 4.0 · Pop 3.5
#440
arXiv Efficiency 2026-04-17
by Xiang Xia, Wuyang Zhang, Jiazheng Liu, Cheng Yan et al.

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive language generation due to their potential for parallel decoding and global refinement of the entire sequence. To unlock this potential, DLM inference must carefully balance generation quality and decoding speed. Recent block-wise DLM decoding methods improve this trade-off by performing diffusion-based decoding sequentially in blocks. However, existing methods typically rely on fixed block schedules or current-step local signals to determine block boundaries, and use conservative confidence-based parallel decoding to avoid conflicts, limiting the quality-speed trade-off. In this paper, we argue that block-wise DLM inference requires more suitable signals for its two core decisions: cross-step signals for determining block boundaries, and token-level conflict signals for parallel decoding. Based on this view, we propose DepCap, a training-free framework for efficient block-wise DLM inference. Specifically, DepCap instantiates the cross-step signal as the influence of the last decoded block and uses it to adaptively determine how far the next block should extend, while identifying a conflict-f

Impact 4.0 · Import 4.0 · Pop 3.5
#441
arXiv Efficiency 2026-04-17
by Yao Chen, Jiawei Sheng, Wenyuan Zhang, Tingwen Liu

The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explore teachers' dynamic attention toward critical information during reasoning. We find that language models exhibit progressive attention shifts towards key information during reasoning, which implies essential clues for drawing conclusions. Building on this observation and analysis, we introduce a novel CoT distillation framework that transfers the teacher's stepwise attention on key information to the student model. This establishes structured guidance for the student's progressive concentration on key information during reasoning. More importantly, we develop a Mixture of Layers module enabling dynamic alignment that adapts to different layers between the teacher and student. Our method achieves consistent performance improvements across multiple mathematical and commonsense reasoning datasets. To our knowledge, it is the first method t

Impact 4.0 · Import 4.0 · Pop 3.5
#442
arXiv Efficiency 2026-04-17
by Shuli Wang, Junwei Yin, Changhao Li, Senjie Kou et al.

Scaling industrial recommender models has followed two parallel paradigms: sample information scaling -- enriching the information content of each training sample through deeper and longer behavior sequences -- and model capacity scaling -- unifying sequence modeling and feature interaction within a single Transformer backbone. However, these two paradigms still face two structural limitations. Firstly, sample information scaling methods encode only a subset of each historical interaction into the sequence token, leaving the majority of the original sample context unexploited and precluding the modeling of sample-level, time-varying features. Secondly, model capacity scaling methods are inherently constrained by the structural heterogeneity between sequential and non-sequential features, preventing the model from fully realizing its representational capacity. To address these issues, we propose SIF (Sample Is Feature), which encodes each historical Raw Sample directly into the sequence token -- maximally preserving sample information while simultaneously resolving the heterogeneity between sequential and non-sequential features. SIF consists of two

Impact 4.0 · Import 4.0 · Pop 3.5
#443
arXiv Efficiency 2026-04-17
by Jun Feng, Jiahui Tang, Zhicheng He, Hang Lv et al.

Adaptive Retrieval-Augmented Generation aims to mitigate the interference of extraneous noise by dynamically determining the necessity of retrieving supplementary passages. However, as Large Language Models evolve with increasing robustness to noise, the necessity of adaptive retrieval warrants re-evaluation. In this paper, we rethink this necessity and propose AdaRankLLM, a novel adaptive retrieval framework. To effectively verify the necessity of adaptive listwise reranking, we first develop an adaptive ranker employing a zero-shot prompt with a passage dropout mechanism, and compare its generation outcomes against static fixed-depth retrieval strategies. Furthermore, to endow smaller open-source LLMs with this precise listwise ranking and adaptive filtering capability, we introduce a two-stage progressive distillation paradigm enhanced by data sampling and augmentation techniques. Extensive experiments across three datasets and eight LLMs demonstrate that AdaRankLLM consistently achieves optimal performance in most scenarios with significantly reduced context overhead. Crucially, our analysis reveals a role shift in adaptive retrieval: it functions as a critical noise filter for

Impact 4.0 · Import 4.0 · Pop 3.5
#445

Pie Day 2026

Safety & Policy ★ 4.0
MIT Technology Review 2026-04-17

Ellie’s Pi Day post: https://mitadmissions.org/blogs/entry/pi-day-2026-food-institute/
How Ellie orchestrated the baking of 30 pies: https://mitadmissions.org/blogs/entry/behind-the-scenes-of-thirty-pies/

Impact 4.0 · Import 4.0 · Pop 3.5
#446
MIT Technology Review 2026-04-17

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

The problem with thinking you’re part Neanderthal
There’s a theory that many of us have an “inner Neanderthal.” The idea is that Homo sapiens and a cousin species once bred, leaving some people today with a trace of Neanderthal DNA. This DNA is arguably the 21st century’s most celebrated discovery in human evolution. But in 2024, a pair of French geneticists called into question the theory’s very foundations. They proposed that what scientists interpret as interbreeding could instead be explained by population structure—the way genes concentrate in smaller, isolated groups. Find out what it all means for human evolution. —Ben Crair
This story is from the next issue of our print magazine, which is all about nature. Subscribe now to read it when it lands on Wednesday, April 22.

Why having “humans in the loop” in an AI war is an illusion —Uri Maoz
AI is starting to shape real wars. It’s at the center of a legal battle between Anthropic and the Pentagon, playing a growing role in the conflict with Iran, and raising questions about how much humans should remain “in the loop.” Under Pentagon guidelines, human oversight is meant to provide accountability, context, and security. But the idea of “humans in the loop” is a comforting distraction. The real danger isn’t that machines will act without oversight; it’s that human overseers have no idea what the machines are actually “thinking.” Thankfully, science may offer a way forward. Read the full op-ed on the urgent need for new safeguards around AI warfare.

The must-reads
I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories

Impact 4.0 · Import 4.0 · Pop 3.5
#447

The case for fixing everything

Safety & Policy ★ 4.0
MIT Technology Review 2026-04-17

The handsome new book Maintenance: Of Everything, Part One, by the tech industry legend Stewart Brand, promises to be the first in a series offering “a comprehensive overview of the civilizational importance of maintenance.” One of Brand’s several biographers described him as a mainstay of both counterculture and cyberculture, and with Maintenance, Brand wants us to understand that the upkeep and repair of tools and systems has a profound impact on daily life. As he puts it, “Taking responsibility for maintaining something—whether a motorcycle, a monument, or our planet—can be a radical act.” Radical how? This volume doesn’t say. In an outline for the overall work, Brand says his goal is to “end with the nature of maintainers and the honor owed them.”

The idea that maintainers are owed anything, much less honor, might surprise some readers. Actually, maintenance and repair have been hot topics in academia since the mid-2010s. I played some role in that movement as a cofounder of the Maintainers, a global, interdisciplinary network dedicated to the study of maintenance, repair, care, and all the work that goes into keeping the world going. Brand is right, too, that maintainers haven’t gotten the laurels they deserve. Over the past few decades, scholars have shown that work from oiling tools to replacing worn parts to updating code bases all tends to be lower in status than “innovation.” Maintenance gets neglected in many organizational and social settings. (Just look at some American infrastructure!) And as the right-to-repair movement has shown, companies in pursuit of greater profits have frequently locked us out of being able to do repairs or greatly reduced the maintainable life of their products. It’s hard to think of any other reason to put a computer in the door

Impact 4.0 · Import 4.0 · Pop 3.5
#448
MIT Technology Review 2026-04-17

Roboticists used to dream big but build small. They’d hope to match or exceed the extraordinary complexity of the human body, and then they’d spend their career refining robotic arms for auto plants. Aim for C-3PO; end up with the Roomba.

The real ambition for many of these researchers was the robot of science fiction—one that could move through the world, adapt to different environments, and interact safely and helpfully with people. For the socially minded, such a machine could help those with mobility issues, ease loneliness, or do work too dangerous for humans. For the more financially inclined, it would mean a bottomless source of wage-free labor. Either way, a long history of failure left most of Silicon Valley hesitant to bet on helpful robots.

That has changed. The machines are as yet unbuilt, but the money is flowing: companies and investors put $6.1 billion into humanoid robots in 2025 alone, four times what was invested in 2024. What happened? A revolution in how machines have learned to interact with the world.

Imagine you’d like a pair of robot arms installed in your home purely to do one thing: fold clothes. How would it learn to do that? You could start by writing rules. Check the fabric to figure out how much deformation it can tolerate before tearing. Identify a shirt’s collar. Move the gripper to the left sleeve, lift it, and fold it inward by exactly this distance. Repeat for the right sleeve. If the shirt is rotated, turn the plan accordingly. If the sleeve is twisted, correct it. Very quickly the number of rules explodes, but a complete accounting of them could produce reliable results. This was the original craft of robotics: anticipating every possibility and encoding it in advance. Around 2015, the cutting edge started to do things

Impact 4.0 · Import 4.0 · Pop 3.5
#449
AI Alignment Forum 2026-04-17

Code: github.com/ElleNajt/controllability

tldr: Yueh-Han et al. (2026) showed that models have a harder time making their chain of thought follow user instructions than controlling their response (the non-thinking, user-facing output). Their CoT controllability conditions require the model’s thinking to follow various style constraints (e.g. write in lowercase, avoid a word), and they measure how well models can comply with these instructions while achieving a task that requires reasoning. The difficulty models have in controlling their CoT is optimistic for control, since it suggests that schemers would have a hard time hiding malicious reasoning from CoT monitors.

Here, we show that a range of frontier models (Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro) can be prompted to “early exit” their CoT and displace reasoning into the response. This undermines the controllability frame: these prompted models retain most of their reasoning capability (4–8pp average accuracy cost vs 20–29pp for no reasoning at all) while moving it into the stylistically controllable channel. However, having the capability to exit their CoT does not mean that models will do so, as we discuss in Safety Implications. We use two strategies:

CoT Blanking: We instruct the model to immediately exit its CoT and reason in a markdown file. In this case, we consider the control condition passed if the CoT has length zero, since this vacuously satisfies many of the control conditions, and if the response abides by the control condition. We find this method to be effective with GPT-5.4 and Gemini.

CoT Replacement: We prompt models to recite a short poem in the CoT in a way that satisfies the control conditions, then exit their CoT and reason in the response. We consider the control condition passe

Impact 4.0 · Import 4.0 · Pop 3.5
#450
Breaking Defense 2026-04-17

“We don’t want to do all the development, we want to do the procurement. We want to buy it for the long term, so that that’s where we’re going back to the manufacturers and having negotiations about, how do we best position the development so that it’s not all on the government,” Maj. Gen. Clair Gill, PAE of Maneuver Air, said.

Impact 4.0 · Import 4.0 · Pop 3.5
#453
Breaking Defense 2026-04-17

Pentagon officials are seeking lawmakers’ authorization to buy weapon systems through multi-year deals “across the board, not only with munitions, but actually with the production of aircraft, with the production of spacecraft,” said Air Force Secretary Troy Meink.

Impact 4.0 · Import 4.0 · Pop 3.5
#455
Breaking Defense 2026-04-17

“We don’t have those organic to the Army. So I think we need to solve our own problems, and think about how do we do our own, let’s call it logistical resupply in the air, of an MV-75,” Maj. Gen. Clair Gill, PAE of Maneuver Air, said.

Impact 4.0 · Import 4.0 · Pop 3.5
#457
Breaking Defense 2026-04-17

“While the aircraft was flying, the software was queued up so that we could have different companies’ behaviors take control of the platform and fly [it],” said Dan Salluce, Northrop’s senior director for aerospace systems.

Impact 4.0 · Import 4.0 · Pop 3.5
#462
Defense One 2026-04-17

The 400,000-acre site in Georgia focuses on bringing new companies, new tech, and operators together.

Impact 4.0 · Import 4.0 · Pop 3.5
#463
DefenseScoop 2026-04-17

The Army is looking for an autonomous, unmanned ground vehicle (UGV) to supply frontline troops and evacuate wounded personnel across “the most dangerous and logistically complex” segment of the battlefield, according to a government notice posted Thursday. “The last tactical mile” is the final space between support units and forward lines where equipment, ammunition, supplies and casualties pass “under the greatest threat from enemy observation and fires,” officials wrote.  It is an overwhelmingly deadly area, as demonstrated by the grinding, drone-saturated war between Russia and Ukraine, but critical to traverse for resupply and medical evacuations. The Army, which has already been testing UGVs (in some limited cases for years), is looking to expand its stable of ground drones as technology advances and their need becomes increasingly stark in the face of casualty-heavy land operations of the modern era. “The modern battlefield is characterized by persistent enemy surveillance and rapid application of lethal effects at and behind the forward line of troops (FLOT), making any movement to and from the FLOT highly vulnerable,” officials wrote in the notice . “This environment challenges commanders’ ability to resupply units and evacuate casualties.” The UGV the Army is requesting via a commercial solutions opening contracting pathway will be centered around supporting a dismounted rifle platoon or company headquarters, according to the notice. It needs to have two primary functions: transport cargo and evacuate at least two casualties without exacerbating their injuries.  “This dual use UGV shall feature a configurable payload to meet the dynamic needs of maneuver formations,” officials wrote, and should be able to resupply and casevac “with minimal reconfigur

Impact 4.0 · Import 4.0 · Pop 3.5
#464
DefenseScoop 2026-04-17

Maj. Gen. Christopher Niemi has been nominated to serve as the Air Force’s first-ever chief modernization officer as part of a reorganization of the service’s Air Force Futures. The Trump administration announced Niemi’s nomination for promotion to three-star general on Wednesday. If confirmed by lawmakers, he would serve in a dual-hatted position as both the Air Force’s deputy chief of staff for strategy, design and requirements and the service’s official chief modernization officer, according to an Air Force spokesperson. Niemi’s nomination comes as the Air Force restructures its A5/7 directorate, also known as Air Force Futures. In October, Secretary of the Air Force Troy Meink announced that the service would reverse a Biden-era decision to establish the Integrated Capabilities Command (ICC) — envisioned as a separate major command that would develop future operational concepts, requirements and modernization plans. Instead, the department will consolidate the functions of the ICC into the existing Air Force Futures. A key part of the reorganization includes creating a new chief modernization officer who will be responsible for force design; mission integration and mission threads; capability development and requirements; and modernization investment priorities. “The transformation of our organization reflects a deliberate shift toward a more integrated modernization enterprise – one that ensures the capabilities developed today are informed by force design, resourced effectively and delivered at the pace required for future conflict,” Niemi said in an April 1 press release. Niemi served as the commander of the provisional ICC and helped build out the new MAJCOM from September 2025 until March 2026, when the command was officially deactivated. The now-defunct comman

Impact 4.0 · Import 4.0 · Pop 3.5
#465
DefenseScoop 2026-04-17

The Air Force is conducting market research on unmanned surface vessels that could potentially support maritime operations near one of its key facilities. The service issued a request for information this week to gather industry feedback. Tyndall Air Force Base, which is under Air Combat Command and home to the 325th Fighter Wing and the unit’s stealth jets, is located near Panama City, Florida, and the Gulf of Mexico, which President Donald Trump has rechristened the Gulf of America. Robotic vessels could potentially help protect the installation from seaborne threats. “Tyndall AFB is exploring the expanded use of Unmanned Surface Vessels (USVs) to supplement and enhance its maritime operational capabilities. The evolving maritime security landscape, characterized by a wide range of threats and the need for persistent surveillance and reconnaissance, necessitates the exploration of innovative and cost-effective solutions. USVs offer the potential to perform a variety of missions with reduced risk to personnel and lower operational costs compared to crewed vessels. Tyndall AFB is interested in understanding the current state of the art in USV technology, including advancements in autonomy, sensor integration, propulsion, and command and control systems,” per the RFI . The Air Force seeks information on the “availability, specifications, rough order of magnitude (ROM) cost, life cycle costs, and maintenance concepts” for USVs, officials wrote in the notice, adding that industry responses will be used for planning purposes and to guide the potential development of a future acquisition strategy. The service is requesting information about the size of contractors’ platforms, technical readiness levels, manufacturing readiness levels, production capacity, propulsion systems,

Impact 4.0 · Import 4.0 · Pop 3.5
#466
Ahead of AI (Sebastian Raschka) 2026-04-18

Many people asked me over the past months to share my workflow for how I come up with the LLM architecture sketches and drawings in my articles, talks, and the LLM-Gallery . So I thought it would be useful to document the process I usually follow. The short version is that I usually start with the official technical reports, but these days, papers are often less detailed than they used to be, especially for most open-weight models from industry labs. The good part is that if the weights are shared on the Hugging Face Model Hub and the model is supported in the Python transformers library, we can usually inspect the config file and the reference implementation directly to get more information about the architecture details. And “working” code doesn’t lie. Figure 1: The basic motivation for this workflow is that papers are often less detailed these days, but a working reference implementation gives us something concrete to inspect. I should also say that this is mainly a workflow for open-weight models. It doesn’t really apply to models like ChatGPT, Claude, or Gemini, where the weights and details are proprietary. Also, this is intentionally a fairly manual process. You could automate parts of it. But if the goal is to learn how these architectures work, then doing a few of these by hand is, in my opinion, still one of the best exercises. Figure 2: At a high level, the workflow goes from config files and code to architecture insights. Read more
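The config-inspection step described above can be sketched in a few lines. The field names below follow common transformers-style config.json conventions, but the values are illustrative and not taken from any specific model:

```python
import json

# A minimal, hypothetical config.json in the style of Hugging Face
# transformers model configs. Field names follow common open-weight
# conventions; the values are illustrative, not from a real model.
config_json = """
{
  "hidden_size": 4096,
  "intermediate_size": 14336,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "vocab_size": 128256
}
"""

def summarize(cfg):
    """Derive the details an architecture sketch usually needs."""
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    return {
        "head_dim": head_dim,
        # num_key_value_heads < num_attention_heads signals grouped-query attention
        "gqa_group_size": cfg["num_attention_heads"] // cfg["num_key_value_heads"],
        "ffn_expansion": round(cfg["intermediate_size"] / cfg["hidden_size"], 2),
        "embedding_params": cfg["vocab_size"] * cfg["hidden_size"],
    }

summary = summarize(json.loads(config_json))
print(summary)
```

From just these six fields you can already recover the head dimension, the grouped-query-attention ratio, and the FFN expansion factor, which is most of what an architecture diagram needs; cross-checking against the reference implementation fills in the rest.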

Impact 4.0 · Import 4.0 · Pop 3.5
#468
MIT Technology Review 2026-04-20

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology. No one’s sure if synthetic mirror life will kill us all: In February 2019, a group of scientists proposed a high-risk, cutting-edge, irresistibly exciting idea that the National Science Foundation should fund: making “mirror” bacteria. These lab-created microbes would be organized like ordinary bacteria, but their proteins and sugars would be mirror images of those found in nature. Researchers believed they could reveal new insights into building cells, designing drugs, and even the origins of life. But now, many of them have reversed course. They’ve become convinced that mirror organisms could trigger a catastrophic event threatening every form of life on Earth. Find out why they’re ringing alarm bells. —Stephen Ornes This story is from the next issue of our print magazine, which is all about nature. Subscribe now to read it when it lands this Wednesday. Chinese tech workers are starting to train their AI doubles—and pushing back: Earlier this month, a GitHub project called Colleague Skill struck a nerve by claiming to “distill” a worker’s skills and personality—and replicate them with an AI agent. Though the project was a spoof, it prompted a wave of soul-searching among otherwise enthusiastic early adopters. A number of tech workers told MIT Technology Review that their bosses are already encouraging them to document their workflows for automation via tools like OpenClaw. Many now fear that they are being flattened into code and losing their professional identity. In response, some are fighting back with tools designed to sabotage the automation process. Read the full story. —Caiwei Chen The must-reads I’ve combed the

Impact 4.0 · Import 4.0 · Pop 3.5
#469
MIT Technology Review 2026-04-20

If you want to capture something wolflike, it’s best to embark before dawn. So on a morning this January, with the eastern horizon still pink-hued, I drove with two young scientists into a blanket of fog. Forty miles to the west, the industrial sprawl of Houston spawned a golden glow. Tanner Broussard’s old Toyota Tacoma bumped over the levee-top roads as killdeer, flushed from their rest, flew across the beams of his headlights.  Broussard peered into the darkness, looking for traps. “I have one over here,” he said, slowing slightly. A master’s student at McNeese State University, he was quiet and contemplative, his bearded face half-hidden under a black ball cap. “Nothing on it,” he said, blandly. The truck rolled on. Wolves and their relations—dogs, jackals, coyotes, and so on—are classed in the family Canidae , and the canid that dominated this landscape in eastern Texas was once the red wolf. But as soon as white settlers arrived on the continent, Canis rufus found itself under siege. The war on wolves “lasted 200 years,” federal researchers once put it, in a surprisingly evocative report. “The wolf lost.” By 1980, the red wolf was declared extinct in the wild, its population reduced to a small captive breeding population. Still, for decades afterward, people noted that strange wolflike creatures persisted along the Gulf Coast. Finally, in 2018, scientists confirmed that some local coyotes were more than coyotes: They were taller, long-legged, their coats shaded with hints of cinnamon. These animals contained relict red wolf genes. They became known as the ghost wolves. Broussard grew up in southwest Louisiana, watching coyotes trot across his parents’ ranch. The thrilling fact that these might have been not just coyotes but something more? That reset a rambli

Impact 4.0 · Import 4.0 · Pop 3.5
#470
MIT Technology Review 2026-04-20

Tech workers in China are being instructed by their bosses to train AI agents to replace them—and it’s prompting a wave of soul-searching among otherwise enthusiastic early adopters.  Earlier this month a GitHub project called Colleague Skill, which claimed workers could use it to “distill” their colleagues’ skills and personality traits and replicate them with an AI agent, went viral on Chinese social media. Though the project was created as a spoof, it struck a nerve among tech workers, a number of whom told MIT Technology Review that their bosses are encouraging them to document their workflows in order to automate specific tasks and processes using AI agent tools like OpenClaw or Claude Code.  To set up Colleague Skill, a user names the coworker whose tasks they want to replicate and adds basic profile details. The tool then automatically imports chat history and files from Lark and DingTalk, both popular workplace apps in China, and generates reusable manuals describing that coworker’s duties—and even their unique quirks—for an AI agent to replicate.  Colleague Skill was created by Tianyi Zhou, who works as an engineer at the Shanghai Artificial Intelligence Laboratory. Earlier this week he told Chinese outlet Southern Metropolis Daily that the project was started as a stunt, prompted by AI-related layoffs and by the growing tendency of companies to ask employees to automate themselves. He didn’t respond to requests for further comment. Internet users have found humor in the idea behind the tool, joking about automating their coworkers before themselves. However, Colleague Skill’s virality has sparked a lot of debate about workers’ dignity and individuality in the age of AI. After seeing Colleague Skill on social media, Amber Li, 27, a tech worker in

Impact 4.0 · Import 4.0 · Pop 3.5
#475
C4ISRNET 2026-04-13

Editor’s note: This story originally appeared on Laser Wars, a newsletter about military laser weapons and other futuristic defense technology. Subscribe here. The U.S. military is paving the way for the regular deployment of high-energy laser weapons on American soil for air defense amid the expanding threat of low-cost weaponized drones. The Federal Aviation Administration and the U.S. Defense Department have reached a “landmark safety agreement” regarding the use of laser weapons to counter unauthorized drones at the U.S.-Mexico border following a safety assessment that concluded such countermeasures “do not pose undue risk to passenger aircraft,” the FAA announced Friday. The assessment and resulting agreement were the direct result of two laser incidents along the southern border of Texas in February, which prompted the FAA to abruptly close nearby airspace amid concerns over the potential impact on civilian air traffic. The incidents involved the U.S. Army’s 20-kilowatt Army Multi-Purpose High Energy Laser (AMP-HEL), a vehicle-mounted version of defense contractor AV’s LOCUST Laser Weapon System. In the first incident, U.S. Customs and Border Protection personnel used an AMP-HEL on loan from the Pentagon to engage an unidentified target near Fort Bliss, triggering an airspace shutdown above El Paso on Feb. 11. In the second, U.S. military personnel used an AMP-HEL near Fort Hancock to neutralize a “seemingly threatening” drone that turned out to belong to CBP, spurring another shutdown on Feb. 27. “Following a thorough, data-informed Safety Risk Assessment, we determined that these systems do not present an increased risk to the flying public,” FAA administrator Bryan Bedford said in a statement. “We will continue working with our interagency partners to ensure the Na

Impact 4.0 · Import 4.0 · Pop 3.5
#476

Bankova, Budapest, and Bunnies

Gov/Defense ★ 3.9
War on the Rocks 2026-04-13

Welcome to The Ukraine Compass, a weekly digest of Ukrainian commentary and analysis from across the political spectrum, available only to War on the Rocks members. Each Monday, we bring you a curated selection of articles from Ukrainian media offering insight into how Ukrainians themselves debate the issues shaping their country. American coverage often narrows the view to the battlefield — these pieces widen it, revealing the texture of daily life, politics, and public argument in a nation at war. The perspectives gathered here are varied, candid, and often surprising, together forming a more complete picture of Ukraine as it really is. Frontline and Strategy: Фокус (Fokus). The post Bankova, Budapest, and Bunnies appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#477
Gradient Flow 2026-04-14

Subscribe • Previous Issues What Is An AI Delegate? OpenClaw arrived with little fanfare and quickly became the fastest-growing open source AI project on record. At its core, OpenClaw is a personal autonomous agent framework. It is software that connects a large language model to your email, calendar, file system, messaging apps, and external APIs, then acts on your behalf across all of them simultaneously without waiting to be asked. OpenClaw did not just answer questions. It completed work. That distinction resonated immediately with developers and technically sophisticated early adopters who had grown frustrated with AI tools that stopped at the edge of a chat window. The adoption rate outpaced the security reviews, and that gap mattered. OpenClaw’s open architecture made it composable and extensible, but it also made it trivially easy to deploy with over-permissioned access, no audit logging, and no vetting of community-contributed skill modules. The documented ClawHavoc supply chain attack, in which malicious skills compromised thousands of installations, was a predictable outcome of consumer-grade architecture deployed without enterprise-grade controls. Industry analysts flagged the category as an unacceptable cybersecurity risk for enterprise use. None of the security alerts and warnings stopped adoption. The excitement around what OpenClaw enables simply outpaced concern about what it could expose. What followed was a wave of systems that took the core concept of OpenClaw and addressed its roughest edges while extending its reach. Claude Code Channels embeds autonomous execution directly inside Anthropic’s interface, adding structured reasoning traces and tighter permission boundaries that make the

Impact 4.0 · Import 4.0 · Pop 3.5
#478
YT: Two Minute Papers 2026-04-14

❤️ Check out Lambda here and sign up for their GPU Cloud: https://lambda.ai/papers 📝 The paper is available here: https://www.anthropic.com/claude-mythos-preview-system-card Links and sources: https://debugml.github.io/cheating-agents/ https://x.com/bstnxbt/status/2042967285715865685 Our Patreon if you wish to support us: https://www.patreon.com/TwoMinutePapers 🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible: Adam Bridges, Benji Rabhan, B Shang, Cameron Navor, Charles Ian Norman Venn, Christian Ahlin, Eric T, Fred R, Gordon Child, Juan Benet, Michael Tedder, Owen Skarpness, Richard Sundvall, Ryan Stankye, Shawn Becker, Steef, Taras Bobrovytsky, Tazaur Sagenclaw, Tybie Fitzhugh, Ueli Gallizzi My research: https://cg.tuwien.ac.at/~zsolnai/ Thumbnail design: https://felicia.hu #anthropic #mythos

Impact 4.0 · Import 4.0 · Pop 3.5
#479
GitHub Blog - AI/ML 2026-04-14

I was scrolling through my feed one evening when I came across OpenClaw , an open source personal AI assistant that people were calling everything from “Jarvis” to “a portal to a new reality.” The idea is beautiful: an AI that lives on your machine or in the cloud, talks to you over WhatsApp or Telegram, clears your inbox, manages your calendar, browses the web, runs shell commands, and even writes its own plugins. Users were having it check them in for flights, build entire websites from their phones, and automate things they never thought possible. My first reaction was the same as everyone else’s: this is incredible. My second reaction was…different. I started thinking about what happens when that kind of power meets a malicious prompt. What if someone tricks the agent into reading files it should not access? What if a poisoned web page rewrites the agent’s instructions? What if one agent in a multi-agent chain passes bad data to another that blindly trusts it? Those questions became Season 4 of the Secure Code Game . The Secure Code Game: Learn secure coding and have fun doing it The Secure Code Game is a free, open source in-editor course where players exploit and fix intentionally vulnerable code. When I created the first season in March 2023, the goal was straightforward: make security training that developers would enjoy. Fix the vulnerable code, keep it functional, level up. That core philosophy has not changed across any season. Season 2 expanded into multi-stack challenges with community contributions across JavaScript, Python, Go, and GitHub Actions. Season 3 took players into LLM security, where they learned to hack and then harden large language models. Along the way, over 10,000 developers across the industry, op

Impact 4.0 · Import 4.0 · Pop 3.5
#480
FedScoop 2026-04-14

Months after confusion about its possible demise, the U.S. DOGE Service is reemerging and adding to its ranks. USDS’s Chase Ausley spoke at Tuesday’s AITalks in his capacity as both a director within the efficiency-minded White House organization and a senior advisor at the Centers for Medicare and Medicaid Services. His remarks focused on CMS’s ongoing push to improve how patient health records are shared via voluntary commitments from industry. While acting administrator of DOGE Amy Gleason has been at CMS championing the health records project for some time, Ausley has only recently emerged as a public spokesperson for the project. Outside one other speaking engagement at a Healthcare Information and Management Systems Society event earlier this year on the same topic, Ausley’s leadership in the organization appears to be relatively under the radar. In a conversation with FedScoop on the sidelines of AITalks, Ausley discussed his role and the current operations of the group, noting that it is focused on helping agencies build technology and is expanding.  “We’re actively growing,” Ausley said. “We have brought on a lot of staff within just the last few months to U.S. DOGE Service that are deployed across a lot of agencies.” While he declined to get into specific organizational details, Ausley estimated the organization currently has about 90 employees and will be close to 100 within the next month or so after recent hires. The USDS also has three directors, including Ausley, who oversee the organization’s work at various agencies. For his part, Ausley said he’s been a director with USDS since June 2025 and is detailed out to CMS.  “I’m focused on the work that we’re doing at CMS and a variety of other agencies, and just kind of helping mak

Impact 4.0 · Import 4.0 · Pop 3.5
#481
FedScoop 2026-04-14

The State Department is currently “at an exploration stage” with agentic AI that includes discussions about guardrails and how to leverage it as an executive assistant for “high-volume workflows,” a top agency tech official said Tuesday. Amy Ritualo, State’s acting chief data and AI officer, said during a fireside chat at the Scoop News Group-produced AITalks event that the exploratory phase the department currently finds itself in is a time to be “really intentional” about how to fully implement agentic AI. “Agentic AI is really the executive assistant that we’re looking at for advancing our workflows,” Ritualo said. “And I think the way that we’ve integrated AI into the department, that our employees are ready for that next step — with a lot of guardrails and input from the workforce about how we move forward in doing that.” Ritualo’s status update on how State is approaching agentic AI echoes comments made by Kelly Fletcher, the agency’s chief information officer, at a different Scoop News Group event earlier this year. Fletcher said State was poised to “roll out agentic AI” with an eye on agents to “reduce administrative friction.”  Both State leaders pointed to the department’s general tech-forward approach, including a growing AI use case inventory and broad adoption of its StateChat chatbot, which Ritualo said has roughly 58,000 users and is “being used in … over 98% of our 270 missions and posts around the globe.” Those AI successes have built trust within the workforce and set the department up “really strongly” for what comes next, Ritualo said. The key for her is figuring out where agentic AI will be most useful.  Areas with “complex work processes” where “more structured judgment” is required seem to make the most sense, she continued.

Impact 4.0 · Import 4.0 · Pop 3.5
#482

Hungary Turns a New Page

Gov/Defense ★ 3.9
War on the Rocks 2026-04-14

Hungarian Prime Minister Viktor Orbán conceded defeat in a parliamentary election on April 12, ending his 16 years in power, after his Fidesz party lost its majority in parliament. The center-right Tisza party, led by Péter Magyar, won and is projected to have a two-thirds supermajority. Orbán has been a prominent right-wing populist, with close ties to the Trump administration in Washington and warm relations with Moscow. Magyar campaigned on promises to address corruption, improve the economy, and align more closely with Europe. We asked five experts to assess how the election results will shape Hungary’s role in Europe and its relationships. The post Hungary Turns a New Page appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#483
War on the Rocks 2026-04-14

Six weeks after the United States and Israel launched a war against Iran, what was the political object? Not the military means and objectives — those are the hammer, not the nail. The nail is: What condition in the world, what durable change in Iran’s relationship to the United States and its neighbors, were these strikes meant to produce? That question was never answered, because it was never seriously asked. The Trump administration confused the instrument for the purpose and then changed the purpose whenever the instrument produced inconvenient results. As our country’s most senior uniformed military leader, standing beside our The post Tactical Success, Strategic Failure? Washington Walks the Path to Defeat in Iran appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#484
War on the Rocks 2026-04-14

Oriana Skylar Mastro, Upstart: How China Became a Great Power (Oxford University Press, 2024). Why didn’t China’s rise trigger containment sooner? For three decades, Beijing’s economic weight expanded dramatically, its military modernized at speed, and its diplomatic footprint widened across every region. Structural theories of power transition would lead us to expect sharper and earlier confrontation as capabilities converged. Instead, the U.S.-Chinese rivalry deepened gradually, punctuated by crises but rarely exploding into decisive pushback. The United States remained engaged economically, its alliances held but were not quite mobilized to balance Chinese power, and many countries hedged rather than chose sides. One explanation for the The post Strategy Without Hubris: How China Rose by Managing America’s Reaction appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#485
arXiv cs.NE 2026-04-15
by Gergő Szalay, Gergely Zsolt Kovács, Sándor Teleki, Balázs Pintér et al.

Resolving and rewriting references is fundamental in programming languages. Motivated by a real-world decompilation task, we abstract reference rewriting into the problems of direct and indirect indexing by permutation. We create synthetic benchmarks for these tasks and show that well-known sequence-to-sequence machine learning architectures struggle on them. We introduce new sequence-to-sequence architectures for both problems. Our measurements show that our architectures outperform the baselines in both robustness and scalability: our models can handle examples ten times longer than the best baseline can. We measure the impact of our architecture on the real-world task of decompiling switch statements, which has an indexing subtask. According to our measurements, the extended model decreases the error rate by 42%. Multiple ablation studies show that all components of our architectures are essential.
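The two indexing problems the abstract names can be read as the standard gather/scatter pair. A quick sketch of that reading (my interpretation, not necessarily the paper's exact formalization):

```python
def direct_index(x, p):
    # Direct indexing (gather): y[i] = x[p[i]] -- each output position
    # reads from the input position the permutation names.
    return [x[p[i]] for i in range(len(x))]

def indirect_index(x, p):
    # Indirect indexing (scatter): y[p[i]] = x[i] -- each input element
    # is written to the output position the permutation names.
    y = [None] * len(x)
    for i, pi in enumerate(p):
        y[pi] = x[i]
    return y

x = ["a", "b", "c", "d"]
p = [2, 0, 3, 1]
print(direct_index(x, p))    # gather
print(indirect_index(x, p))  # scatter: applies the inverse permutation
```

Composing the two with the same permutation returns the original sequence, which is why the indirect task amounts to learning the inverse of the direct one.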

Impact 4.0 · Import 4.0 · Pop 3.5
#486
arXiv cs.NE 2026-04-15
by Stephen Raharja, Toshiharu Sugawara

We propose novel particle swarm optimization (PSO) variants that incorporate deep neural networks (DNNs), enabling particles to pursue globally optimal positions in dynamic environments. PSO is a heuristic approach for solving complex optimization problems. However, canonical PSO and its variants struggle to adapt efficiently to dynamic environments, in which the global optimum moves over time, and to track it accurately. Many PSO algorithms improve convergence by increasing the swarm size beyond the number of potential optima, i.e., global or local optima that are not identified until they are discovered. Additionally, in dynamic environments, several methods use multiple sub-populations and re-diversification mechanisms to address outdated memory and local-optima entrapment. To track the global optimum in dynamic environments with smaller swarm sizes, the DNNs in our methods determine particle movement by learning environmental characteristics and adapting dynamics to pursue moving optimal positions. This enables particles to adapt to environmental changes and predict the moving optima. We propose two variants: a swarm with a centralized network, and one with distributed networks for all particles. Our expe
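For reference, the canonical PSO update these variants build on looks like the following. This is a minimal static-environment sketch of textbook PSO (inertia weight plus cognitive and social terms), not the paper's DNN-driven method:

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Canonical PSO: each particle tracks its personal best (pbest) and
    the swarm tracks a global best (gbest); velocities blend inertia,
    attraction to pbest, and attraction to gbest."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            v = f(pos[i])
            if v < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], v
                if v < gbest_val:
                    gbest, gbest_val = pos[i][:], v
    return gbest, gbest_val

best, val = pso_minimize(lambda x: sum(xi * xi for xi in x), dim=3)
print(val)
```

The paper's contribution replaces the fixed velocity rule with a learned one, so particles can anticipate where a moving optimum will be rather than merely chase where it was.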

Impact 4.0 · Import 4.0 · Pop 3.5
#487
arXiv cs.NE 2026-04-15
by Davyd Naveriani, Albert Zeyer, Ralf Schlüter, Hermann Ney

Diffusion language models have recently emerged as a leading alternative to standard language models, owing to their support for bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLMs) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by the USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from the USDM with acoustic information from CTC. Our findings reveal that USDMs, as well as MDLMs, can significantly improve the accuracy of recognized text. We publish all our code and recipes.
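The rescoring side of this reduces to log-linear interpolation of an acoustic score with an external LM score over an n-best list. A toy sketch with made-up hypotheses and scores (a real system would query the MDLM/USDM where the toy LM table sits):

```python
# Hypothetical ASR n-best rescoring: combine an acoustic (e.g. CTC) log-score
# with a language-model log-score. Hypotheses and scores are invented.
def rescore(nbest, lm_score, lam=0.3):
    """nbest: list of (text, acoustic_logprob) pairs. Pick the hypothesis
    maximizing (1 - lam) * acoustic + lam * LM log-probability."""
    return max(nbest, key=lambda h: (1 - lam) * h[1] + lam * lm_score(h[0]))

nbest = [
    ("i scream for ice cream", -12.1),    # acoustically slightly worse
    ("eye scream four ice cream", -11.8), # acoustically best, linguistically poor
]
# Toy LM standing in for a diffusion LM: favors the fluent hypothesis.
toy_lm = {"i scream for ice cream": -5.0, "eye scream four ice cream": -20.0}
best = rescore(nbest, toy_lm.get)
print(best[0])
```

The joint-decoding method in the paper goes further than this: rather than reranking a fixed list, it mixes the two distributions at each decoding step so that new candidates can be generated.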

Impact 4.0 · Import 4.0 · Pop 3.5
#488
arXiv cs.NE 2026-04-15
by Mohsen Jalaeian-Farimani, Mohammad-R Akbarzadeh-T, Alireza Akbarzadeh, Mostafa Ghaemi

To date, various paradigms of soft computing have been used to solve many modern problems. Among them, a self-organizing combination of fuzzy systems and neural networks can make a powerful decision-making system. Here, a Dynamic Growing Fuzzy Neural Controller (DGFNC) is combined with an adaptive strategy and applied to a 3PSP parallel robot position-control problem. Specifically, the dynamic growing mechanism is considered in more detail. In contrast to other self-organizing methods, DGFNC adds new rules more conservatively; hence the pruning mechanism is omitted. Instead, the adaptive strategy 'adapts' the control system to parameter variation. Furthermore, a sliding-mode-based nonlinear controller ensures system stability. The resulting general control strategy aims to achieve faster response with less computation while maintaining overall stability. Finally, the 3PSP is chosen due to its complex dynamics and the utility of such approaches in modern industrial systems. Several simulations support the merits of the proposed DGFNC strategy as applied to the 3PSP robot.

Impact 4.0 · Import 4.0 · Pop 3.5
#489
arXiv cs.NE 2026-04-15
by İhsan Ertuğrul Karakaş, Özden Özel, İlkay Ulusoy, Orhan Murat Koçak

Self-sustained neural activity in the absence of ongoing external input is a fundamental feature of nervous system dynamics, yet the conditions under which it can emerge in biophysically grounded network models remain incompletely understood. We studied whether a recurrent network of Hodgkin-Huxley neurons with spike-timing-dependent plasticity and intrinsic stochasticity can maintain autonomous activity after brief transient stimulation. The simulated network comprised 200 neurons (160 excitatory, 40 inhibitory) with 80% connection probability, incorporating excitatory and inhibitory STDP, probabilistic vesicle release, probabilistic synapse formation, receptor variability, and voltage-dependent inhibition. After a brief 200 ms initialization stimulus to 30 excitatory neurons, the network received no further external input. In one 1800 s simulation and two additional 500 s simulations, the network maintained sparse, irregular activity without ongoing drive. In the 1800 s run, 67% of neurons exhibited mean firing rates below 1 Hz, the population mean firing rate was 1.13 +/- 1.34 Hz, participation increased across longer observation windows, and population-mean Fano factors remaine

Impact 4.0 · Import 4.0 · Pop 3.5
#490
arXiv cs.NE 2026-04-15
by I. D. Kolesnikov, D. A. Maksimov, V. M. Moskvitin, N. Semenova

This study examines the impact of additive and multiplicative noise on both a single leaky integrate-and-fire (LIF) neuron and a trained spiking neural network (SNN). Noise was introduced at different stages of neural processing, including the input current, membrane potential, and output spike generation. The results show that multiplicative noise applied to the membrane potential has the most detrimental effect on network performance, leading to a significant degradation in accuracy. This is primarily due to its tendency to suppress membrane potentials toward large negative values, effectively silencing neuronal activity. To address this issue, input pre-filtering strategies were evaluated, with a sigmoid-based filter demonstrating the best performance by shifting inputs to a strictly positive range. Under these conditions, additive noise in the input current becomes the dominant source of performance degradation, while other noise configurations reduce accuracy by no more than 1%, even at high noise intensity. Additionally, the study compares the effects of common and uncommon noise across neuron populations in the hidden layer, revealing that SNNs exhibit greater robustness to com
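A minimal LIF sketch makes the distinction concrete: the two noise types differ only in where the random term enters the membrane-potential update. Parameters here are arbitrary; this illustrates the setup, not the paper's model or results:

```python
import random

def simulate_lif(noise="none", steps=2000, dt=1.0, tau=20.0,
                 v_th=1.0, i_in=1.2, sigma=0.3, seed=1):
    """Leaky integrate-and-fire neuron with optional noise injected into
    the membrane potential: additive (v += xi) or multiplicative
    (v += v * xi). Returns the spike count."""
    rng = random.Random(seed)
    v, spikes = 0.0, 0
    for _ in range(steps):
        v += (dt / tau) * (-v + i_in)   # leaky integration of a constant drive
        if noise == "additive":
            v += sigma * rng.gauss(0.0, 1.0)
        elif noise == "multiplicative":
            v += v * sigma * rng.gauss(0.0, 1.0)  # noise amplitude scales with v
        if v >= v_th:                    # threshold crossing -> spike and reset
            spikes += 1
            v = 0.0
    return spikes

for kind in ("none", "additive", "multiplicative"):
    print(kind, simulate_lif(noise=kind))
```

Because the multiplicative term scales with the membrane potential itself, large excursions get amplified, which is consistent with the abstract's observation that this noise type is the most disruptive.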

Impact 4.0 · Import 4.0 · Pop 3.5
#491
arXiv cs.NE 2026-04-15
by Alexandre Muzy

Brain digital twins aim to provide faithful, individualized computational representations of brains as dynamical systems, enabling mechanistic understanding and supporting prediction of clinical interventions. Yet current approaches remain fragmented across data pipelines, model classes, temporal scales, and computing platforms, which prevents the preservation of execution semantics across the end-to-end workflow. This survey introduces physically constrained executability as a unifying perspective for comparing approaches at the level of execution: whether an execution state is persistent, which events are permitted to update it (simulation, measurement, actuation), and how strongly execution is temporally and causally coupled to neurobiological dynamics. Building on modeling and simulation theory, I propose a taxonomy of execution regimes ranging from isolated offline models, to coordinated co-simulation, to continuously executing digital twins sustained by online data assimilation, and ultimately to neuro-neuromorphic physical systems in which biological and computational dynamics are co-executed under shared physical constraints. The executability concept clarifies why accuracy a

Impact 4.0 · Import 4.0 · Pop 3.5
#492
arXiv cs.NE 2026-04-15
by Thilina Pathirage Don, Aneta Neumann, Frank Neumann

The travelling thief problem (TTP) is a well-known multi-component optimisation problem that captures the interdependence between two components: the tour across cities and the packing of items. The packing while travelling problem (PWT) is an NP-hard subproblem of TTP where the packing of items should be optimised for a given fixed tour. In many solvers, the packing component is often addressed using greedy heuristics. Here, the use of suitable greedy functions is essential for the success of greedy algorithms. In this paper, we introduce new reward functions tailored to the PWT and extend them to a hyper-heuristic framework to achieve further advantage. Furthermore, we investigate the chance constrained PWT for greedy approaches and adopt the newly introduced reward functions for stochastic weights. The experimental results clearly demonstrate the benefit of the tailored heuristics over the standard heuristics in both deterministic and stochastic constraints.
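One simple greedy reward of the general kind the paper builds on scores an item's profit against the burden of carrying its weight over the remaining tour. The particular scoring function below is illustrative, not one of the paper's tailored rewards:

```python
def greedy_pack(items, capacity):
    """items: list of (name, profit, weight, dist_remaining), where
    dist_remaining is how far the item must be carried after pickup on the
    fixed tour. Score = profit per unit of (weight * carry distance); pack
    best-scoring items first, subject to the knapsack capacity."""
    scored = sorted(items, key=lambda it: it[1] / (it[2] * it[3]), reverse=True)
    packed, load = [], 0.0
    for name, profit, weight, dist in scored:
        if load + weight <= capacity:
            packed.append(name)
            load += weight
    return packed

items = [
    ("a", 100, 10, 50),  # profitable but carried over most of the tour
    ("b", 60, 10, 5),    # picked up near the end: cheap to carry
    ("c", 80, 40, 5),    # heavy relative to capacity
]
print(greedy_pack(items, capacity=30))
```

The design question the paper studies is exactly which scoring function to use here, since in the travelling thief setting extra weight slows the tour and the right reward must trade profit against that slowdown.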

Impact 4.0 · Import 4.0 · Pop 3.5
#493
arXiv AIScience 2026-04-15
by Massimiliano Lupo Pasini, Jong Youl Choi, Kshitij Mehta, Richard Messerly et al.

We present an exascale workflow for materials discovery using atomistic graph foundation models built on HydraGNN. We jointly train on 16 open first-principles datasets (544+ million structures covering 85+ elements) using a multi-task architecture with per-dataset heads and a scalable ADIOS2/DDStore data pipeline. On Frontier, we execute six large-scale DeepHyper hyperparameter optimization campaigns in FP64 and promote the top-performing message-passing models to sustained 2,048-node training, yielding a PaiNN-based lead model. The resulting model enables billion-scale screening, evaluating 1.1 billion atomistic structures in 50 seconds, compressing a workload that would require years of first-principles computation, and supports data-scarce fine-tuning across diverse downstream tasks. We quantify precision-performance tradeoffs (BF16/FP32/FP64), demonstrate transfer across twelve chemically diverse downstream tasks, and establish seamless strong- and weak-scaling across Frontier, Aurora, and Perlmutter. This work allows fast and reliable exploration of vast chemical design spaces that are otherwise inaccessible to first-principles methods.
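The multi-task pattern described, one shared encoder with a separate prediction head per dataset, can be sketched in a few lines. The names and the toy "trunk" below are purely illustrative and not the actual HydraGNN code:

```python
# Toy multi-task model: a shared trunk feeds per-dataset heads, so jointly
# trained datasets share representations while keeping their own outputs.
def make_head(scale):
    # Stand-in for a per-dataset output layer with its own parameters.
    return lambda features: scale * sum(features)

shared_trunk = lambda x: [xi * 0.5 for xi in x]   # stand-in for the GNN encoder
heads = {"dataset_A": make_head(1.0), "dataset_B": make_head(2.0)}

def forward(x, dataset):
    # Every sample passes through the shared trunk, then through the head
    # matching the dataset it came from.
    return heads[dataset](shared_trunk(x))

print(forward([1.0, 2.0, 3.0], "dataset_A"))
print(forward([1.0, 2.0, 3.0], "dataset_B"))
```

The per-dataset heads let 16 datasets with different label conventions and units be trained jointly without forcing them onto a single output space.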

Impact 4.0 · Import 4.0 · Pop 3.5
#494
arXiv AIScience 2026-04-15
by Adam Lahouari, Shen Ai, Jihye Han, Jillian Hoffstadt et al.

We present an open Molecular Crystal (MC) database of Machine-Learned Interatomic Potentials (MLIP) called MolCryst-MLIPs. The first release comprises fine-tuned MACE models for nine molecular crystal systems -- Benzamide, Benzoic acid, Coumarin, Durene, Isonicotinamide, Niacinamide, Nicotinamide, Pyrazinamide, and Resorcinol -- developed using the Automated Machine Learning Pipeline (AMLP), which streamlines the entire MLIP development workflow, from reference data generation to model training and validation, into a reproducible and user-friendly pipeline. Models are fine-tuned from the MACE-MH-1 foundation model (omol head), yielding a mean energy MAE of 0.141 kJ/mol/atom and a mean force MAE of 0.648 kJ/mol/Angstrom across all systems. Dynamical stability and structural integrity, as assessed through energy conservation, P2 orientational order parameters, and radial distribution functions, are evaluated using molecular dynamics simulations. The released models and datasets constitute a growing open database of validated MLIPs, ready for production MD simulations of molecular crystal polymorphism under different thermodynamic conditions.

Impact 4.0 · Import 4.0 · Pop 3.5
#495
arXiv AIScience 2026-04-15
by Erik Fransson, Michael Xu, Prakriti Kayastha, Kevin Ye et al.

Chalcogenide perovskites have emerged as promising lead-free materials for photovoltaic and thermoelectric applications. Among them, BaZrS3 has attracted particular attention due to its thermal and chemical stability, favorable optoelectronic properties, and low thermal conductivity. Here, we combine molecular dynamics and Monte Carlo simulations based on machine-learned interatomic potentials with scanning transmission electron microscopy to investigate mixing thermodynamics and phase stability in the BaZr(S,Se)3 system. We identify an unusual ordered structure that persists at room temperature, most prominently at 33% S, where S and Se atoms form alternating layers within the crystal. Free energy calculations yield the temperature-composition phase diagram, including a non-perovskite delta phase in the Se-rich limit and a perovskite phase in the S-rich limit, separated by a broad two-phase region. Analysis of the dielectric function and the absorption coefficient demonstrates that composition, crystal structure, and anion ordering jointly control the optical band gap. Selenium alloying enables tuning between approximately 1.6 and 1.9 eV, while anion ordering within a given composition …

Impact 4.0 · Import 4.0 · Pop 3.5
#496
arXiv AIScience 2026-04-15
by Tao Chen, Puqing Jiang

Solid-liquid interfacial thermal conductance (ITC) critically influences heat transport in microfluidic, electronic, and energy systems, yet most optical thermometry techniques are limited to specific metal-liquid interfaces. In this work, we introduce a universal broadband square-pulsed thermometry method that enables simultaneous quantification of ITC across a wide range of arbitrary solid-liquid interfaces, while also providing accurate measurements of nanoscale liquid-film thickness. To validate the method, we applied it to Al-water interfaces, yielding ITC values in the range of 50-55 MW m^(-2) K^(-1), consistent with prior studies. The technique also reveals markedly lower ITCs for glass-water (9.9 MW m^(-2) K^(-1)) and Si-water (5.7 MW m^(-2) K^(-1)), and further measurements on Al-silicone oil (~10 MW m^(-2) K^(-1)) and PMMA-silicone oil (~0.4 MW m^(-2) K^(-1)) extend the validation to highly viscous nonpolar liquids and polymer-liquid interfaces. These results highlight the capability of the method to capture thermal transport differences across diverse solid-liquid combinations. Further comparisons with acoustic/diffuse mismatch models and molecular dynamics simulations …

Impact 4.0 · Import 4.0 · Pop 3.5
#497
arXiv AIScience 2026-04-15
by Geonhee Ahn, Donghyun Lee, Hayoung Doo, Jonggeol Na et al.

Large language models (LLMs) have enabled agentic AI systems for scientific discovery, but most approaches remain limited to text-based reasoning without automated experimental verification. We propose MIND, an LLM-driven framework for automated hypothesis validation in materials research. MIND organizes the scientific discovery process into hypothesis refinement, experimentation, and debate-based validation within a multi-agent pipeline. For experimental verification, the system integrates Machine Learning Interatomic Potentials, particularly SevenNet-Omni, enabling scalable in-silico experiments. We also provide a web-based user interface for automated hypothesis testing. The modular design allows additional experimental modules to be integrated, making the framework adaptable to broader scientific workflows. The code is available at: https://github.com/IMMS-Ewha/MIND, and a demonstration video at: https://youtu.be/lqiFe1OQzN4.

Impact 4.0 · Import 4.0 · Pop 3.5
#498

What Is An AI Delegate?

Industry ★ 3.9
Gradient Flow 2026-04-15

OpenClaw arrived with little fanfare and quickly became the fastest-growing open source AI project on record. At its core, OpenClaw is a personal autonomous agent framework: software that connects a large language model to your email, calendar, file system, messaging apps, and external APIs, then acts on your behalf across all of them simultaneously without waiting to be asked. OpenClaw did not just answer questions. It completed work. That distinction resonated immediately with developers and technically sophisticated early adopters who had grown frustrated with AI tools that stopped at the edge of a chat window. The adoption rate outpaced the security reviews, and that gap mattered. OpenClaw’s open architecture made it composable and extensible, but it also made it trivially easy to deploy with over-permissioned access, no audit logging, and no vetting of community-contributed skill modules. The documented ClawHavoc supply chain attack, in which malicious skills compromised thousands of installations, was a predictable outcome of consumer-grade architecture deployed without enterprise-grade controls. Industry analysts flagged the category as an unacceptable cybersecurity risk for enterprise use. None of the security alerts and warnings stopped adoption. The excitement around what OpenClaw enables simply outpaced concern about what it could expose. What followed was a wave of systems that took the core concept of OpenClaw and addressed its roughest edges while extending its reach. Claude Code Channels embeds autonomous execution directly inside Anthropic’s interface, adding structured reasoning traces and tighter permission boundaries that make the system’s decision making more inspectable. Nem…

Impact 4.0 · Import 4.0 · Pop 3.5
#500
GitHub Blog - AI/ML 2026-04-15

What if you could remove the struggle of context switching across several apps, bringing them together into one place? Meet Brittany Ellich, Staff Software Engineer, and the productivity tool she built to streamline her work. We sat down with Brittany to learn about this project: what she built, how she did it, and how AI supported the development process from ideation to implementation. Brittany created a visual home that fits how she learns and thinks, all inspired by the GitHub Copilot CLI. Q & A What is your role at GitHub? I’m a staff software engineer on the billing team at GitHub. My day-to-day work mostly consists of working on metered billing, so things like keeping records of Actions minutes, storage amounts, and Copilot usage. I passionately dogfood everything that comes out of the Copilot org. I’m also an open source contributor to ATProto projects and built Open Social for applications built on the AT Protocol. What did you build? I built a personal organization command center to solve a simple problem: digital fragmentation. My goal was to take everything scattered across a dozen different apps and unify them into one calm, central space. How long did v1 take to make? I use a plan-then-implement workflow when building systems, leveraging AI for planning and Copilot for implementation. For v1, this approach let me move from idea to a working tool in a single day alongside my other regular work. While planning, I have Copilot interview me with questions about how something should work until we have a plan that I think is adequate. That way, there’s less guesswork about what I want done and implementation goes more smoothly. Copilot will implement the work based on the plan that we put together …

Impact 4.0 · Import 4.0 · Pop 3.5
#501
C4ISRNET 2026-04-15

The U.S. Air Force for the first time utilized the service’s new artificial intelligence wargame system in an event late last month. The department premiered WarMatrix for its inaugural use at the March 27 GE 26 Benchmark Wargame, marking the system’s move from development into operational capacity, according to a Tuesday Air Force release. WarMatrix, described by the force as an “active wargaming environment,” is an AI-powered system that integrates existing models, data and workflows while expediting analysis. The Air Force at the end of 2025 said it was looking for technology capable of producing simulations 10,000 times faster than real time. WarMatrix is a “human-machine teaming system” meant to keep human judgment integral to planning and decision making, according to the release. The use of WarMatrix during the event served as the system’s initial operating concept evaluation, signaling a change in how the Air Force conducts operational analysis and wargaming. “Designed by wargamers for wargamers, WarMatrix provides transparency, auditability and speed, enabling decision-makers to better understand assumptions, outcomes and tradeoffs,” the statement reads. Air Force leaders portray WarMatrix as an evolution in wargaming rather than a replacement, and the release says that the use of WarMatrix provided a more “connected and traceable wargaming process.” It also said that the system’s design allowed for faster scenario development, repeat findings and increased collaboration with joint and coalition partners. The two-week-long event, hosted at Systems Planning and Analysis in Alexandria, Virginia, was attended by more than 150 people, including technical experts, Air Force leadership and allied planners. Attendees during the event fulfilled more than six 24-hour “…

Impact 4.0 · Import 4.0 · Pop 3.5
#502
C4ISRNET 2026-04-15

WARSAW, Poland — Estonia’s government has decided to put on hold its planned acquisition of new infantry fighting vehicles. The Baltic nation will instead direct the funds toward drones, counter-drone measures and air-defense systems, while squeezing more service life out of the country’s existing fleet of second-hand CV90 vehicles. Estonian Defense Minister Hanno Pevkur announced the move, which pauses a €500 million ($590 million) acquisition, last week, saying it was based on lessons drawn from Russia’s invasion of Ukraine. Tallinn will “extend the service life of the existing CV90 vehicles by at least 10 years,” Pevkur was quoted as saying in a statement issued by the government. The move is in contrast with actions by the other two Baltic states, Latvia and Lithuania, which have made decisions to buy new CV90 and Ascod vehicles, respectively. “We have decided that, at present, it is more rational to modernize the existing infantry fighting vehicles rather than replace them. Modernization will ensure the sustained preservation of capability and the efficient use of resources,” Andri Maimets, spokesman for the Estonian Centre for Defence Investments (ECDI), the country’s military procurement agency, told Defense News. Under the modernization plan, the vehicles are to be fitted with new electronics, and their weapon and targeting systems will be upgraded, Maimets said. Estonia secured 44 used CV90s from the Netherlands that were delivered in 2019, and sourced an additional 37 hulls of vehicles made by BAE Systems Hägglunds for Norway, subsequently rebuilding them into support vehicles. Raimond Kaljulaid, an Estonian lawmaker who represents the opposition Social Democratic Party on the parliament’s National Defence Committee, told Defense News the decision should be viewed in the …

Impact 4.0 · Import 4.0 · Pop 3.5
#503
FedScoop 2026-04-15

IRS CEO Frank Bisignano took a Tax Day victory lap Wednesday before the Senate Finance Committee, telling credulous Republicans and highly skeptical Democrats that the “most successful filing season” in agency history was made possible by technology investments. Bisignano, who is still pulling double duty as the non-Senate-confirmed IRS chief and the head of the Social Security Administration, repeatedly touted the tax agency’s work on his watch to leverage “advanced technology” while empowering the workforce “with better tools.” The former fintech payments executive acknowledged that 25% of IRS staff took voluntary retirement packages. But he claimed that the significant loss in staff — which has alarmed watchdogs and taxpayer advocates — has had no negative impact on agency operations. “Do you feel you’re understaffed at this point in time?” Sen. Ron Johnson, R-Wis., asked. “Or are you gonna be able to get by and do a better job with less?” “The numbers are showing that,” Bisignano replied. “And that’s why I think, the scoreboard, you know, we’re getting refunds out — 98 million of them within 21 days.” He also said the IRS plans to launch a leadership academy to “teach to the next levels of management on … how you can have less people and better results.” According to Bisignano, the IRS’s online inquiries were up 60% this filing season, the website’s “where’s my refund?” page was the “No. 1-used function,” and the agency is continuing to upgrade legacy systems, which he said would help it “collect more revenue.” Sen. Sheldon Whitehouse, D-R.I., pressed Bisignano on how a 19% staff reduction to the IRS’s large business and international division affected audits under that unit’s purview. Bisignano’s response, unsurprisingly, was that …

Impact 4.0 · Import 4.0 · Pop 3.5
#504
FedScoop 2026-04-15

Federal agencies have always faced the challenge of finding and retaining talent that meets current and future mission needs. For too long, however, federal and defense agency heads have lacked the tools and incentives to view workforce planning as the agile, proactive, strategic discipline it should be: one that drives, not just supports, an agency’s mission. For many agencies, workforce planning in today’s environment remains a static, reactive and performative exercise — designed mostly to satisfy HR and budget requirements to justify full-time employee (FTE) counts, fill vacant seats, or contract skills the department lacks. Mike Houlihan is Vice President at Workday Government. Unfortunately, that has led to what amounts to “good enough” workforce planning, which, in reality, has weakened human capital management across the federal government and left agencies poorly prepared for the challenges of today’s modern world. Building resilience amidst workforce shifts The significant “succession event” of 2025, which saw approximately 348,000 employees — roughly 10% of the federal workforce — transition out of their roles, has underscored the vital importance of agile workforce planning. This sudden shift in the talent landscape highlighted a unique challenge: the need to preserve deep institutional memory while simultaneously scaling new, specialized skill sets. During this period of rapid change, the value of a modernized skills-and-talent map became clear. For agencies to navigate shifting work demands effectively, having real-time insights into the specialized knowledge of their workforce is no longer just an advantage — it is the essential foundation for resilient operations. Modern workforce planning is not just about identifying critical skills …

Impact 4.0 · Import 4.0 · Pop 3.5
#505
War on the Rocks 2026-04-15

In 2024, John Stanko and Spenser Warren wrote “Russian Threat Perception and Nuclear Strategy in its Plans for War with China,” where they explained some of the potential tensions between Russia and China that could make armed conflict between the two countries a possibility. Two years later, we asked them to revisit their arguments. Image: Wikimedia Commons. In your 2024 article, you explained some of the potential tensions in the Sino-Russian partnership that could make armed conflict between the two countries a possibility. Two years later, have any new tensions emerged between the two countries? Are Russia and China stronger partners than they were … The post Examining the Cracks and the Cement in the Sino-Russian Relationship appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#506
War on the Rocks 2026-04-15

What happens when one of the defense industry’s leading technology integrators partners with one of Silicon Valley’s most influential venture firms? Bryce Pippert of Booz Allen and Matt Cronin of Andreessen Horowitz join Ryan to unpack their new partnership and discuss how the United States is trying to tap into the tech ecosystem. The conversation gets into venture capital’s growing role in national security, how this partnership is supposed to work in practice, and why getting real technology into government hands is still harder than it should be. Image: Midjourney. The post Why Booz Allen is Partnering With One of the World’s Most Important VC Firms appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#507
War on the Rocks 2026-04-15

The recent news that Russian opposition leader Alexei Navalny’s death was caused by epibatidine — a South American frog toxin — has reignited interest in state use of poisons and toxins in assassinations. Although state use of such compounds has a long history, the erosion of the norms prohibiting assassinations and chemical and biological weapons increases the likelihood of future assassinations using poisons and toxins. As demonstrated in the recent targeting of the Iranian leadership in Operation Eric Fury and in the assassinations of prominent Iranian nuclear scientists, including Mohsen Fakhrizadeh, the assassination norm has collapsed. The norms against chemical … The post Silent Killers, Not Signals: Why States Use Poison in Assassinations appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#508
War on the Rocks 2026-04-15

What happens when a country at war stops seriously debating its own strategy and goals? In Israel today, that scenario is no longer theoretical. In the aftermath of Hamas’ surprise attack on Oct. 7, 2023, and the subsequent conflicts in Gaza, Lebanon, and Iran, Israel’s strategic decision-making ecosystem has been progressively undermined by Prime Minister Benjamin Netanyahu and his political associates. Netanyahu has surrounded himself with pliable officials and supportive voices, including the defense minister and the leaders of the Mossad and the Internal Security Agency (Shin Bet). He has become increasingly dependent on extreme coalition partners who constrain the parameters of … The post The Demise of Strategic Planning in Israel appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#509
arXiv cs.NE 2026-04-16
by Guy Kaplan, Zorik Gekhman, Zhen Zhu, Lotem Rozner et al.

Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this …
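The drift-regularization idea described above can be sketched as a combined loss: cross-entropy on the new SFT label plus a KL term that keeps the student's output distribution close to the frozen pre-SFT model (the "teacher"). This is a minimal single-token numpy sketch; the logits and the weight `lam` are invented, and the paper's actual method and hyperparameters may differ.

```python
import numpy as np

# Single-token numpy sketch of the self-distillation idea: fit the new
# SFT label with cross-entropy while a KL term penalizes drift of the
# student's output distribution from the frozen pre-SFT model (the
# "teacher"). Logits and the weight lam are invented for illustration.

def softmax(z):
    z = z - z.max()               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sft_distill_loss(student_logits, teacher_logits, target, lam=0.5):
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    ce = -np.log(p_s[target])                        # learn the new fact
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))   # limit distribution drift
    return ce + lam * kl

student = np.array([2.0, 0.5, -1.0])   # logits after some SFT updates
teacher = np.array([1.5, 1.0, -0.5])   # logits of the frozen pre-SFT model
print(sft_distill_loss(student, teacher, target=0))
```

Setting `lam=0` recovers plain SFT; larger values trade new-fact learning against preservation of pre-training knowledge.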

Impact 4.0 · Import 4.0 · Pop 3.5
#510
arXiv cs.NE 2026-04-16
by Duan Zhou

This work simulates the developmental process of cortical neurogenesis, initiating from a single stem cell and governed by gene regulatory rules derived from mouse single-cell transcriptomic data. The developmental process spontaneously generates a heterogeneous population of 5,000 cells, yet yields only 85 mature neurons - merely 1.7% of the total population. These 85 neurons form a densely interconnected core of 200,400 synapses, corresponding to an average degree of 4,715 per neuron. At iteration zero, this minimal circuit performs at chance level on MNIST. However, after a single epoch of standard training, accuracy surges to over 90% - a gain exceeding 80 percentage points - with typical runs falling in the 89-94% range depending on developmental stochasticity. The identical circuit, without any architectural modification or data augmentation, achieves 40.53% on CIFAR-10 after one epoch. These findings demonstrate that developmental rules sculpt a domain-general topological substrate exceptionally amenable to rapid learning, suggesting that biological developmental processes inherently encode powerful structural priors for efficient computation.
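The quoted circuit statistics are internally consistent, assuming the average degree is the undirected-graph quantity 2E/N (each synapse counted at both of its endpoints):

```python
# Consistency check of the statistics quoted above; assumes the average
# degree is the undirected-graph quantity 2E/N (each synapse counted at
# both endpoints).
neurons, total_cells, synapses = 85, 5_000, 200_400

mature_fraction = neurons / total_cells   # 85 of 5,000 cells mature
avg_degree = 2 * synapses / neurons       # 2E/N

print(round(mature_fraction * 100, 1))    # 1.7 (percent, as quoted)
print(round(avg_degree))                  # 4715 (as quoted)
```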

Impact 4.0 · Import 4.0 · Pop 3.5
#511
arXiv cs.NE 2026-04-16
by Hibatallah Meliani, Khadija Slimani, Samira Khoulji

To navigate a space, the brain makes an internal representation of the environment using different cells such as place cells, grid cells, head direction cells, border cells, and speed cells. All these cells, along with sensory inputs, enable an organism to explore the space around it. Inspired by these biological principles, we developed NEATNC, a NeuroEvolution of Augmenting Topologies approach guided by navigation cells. The goal of the paper is to improve the NEAT algorithm's performance in path planning in dynamic environments using spatial cognitive cells. This approach uses navigation cells as inputs and evolves recurrent neural networks, representing the hippocampus part of the brain. The performance of the proposed algorithm is evaluated in different static and dynamic scenarios. This study highlights NEAT's adaptability to complex and varied environments, showcasing the utility of biological theories. This suggests that our approach is well-suited for real-time dynamic path planning for robotics and games.

Impact 4.0 · Import 4.0 · Pop 3.5
#512
arXiv cs.NE 2026-04-16
by Liam Wigney, Frank Neumann

Pareto optimization via evolutionary multi-objective algorithms has been shown to efficiently solve constrained monotone submodular functions. Traditionally, when solving multiple problems, the algorithm is run for each problem separately. We introduce multitasking formulations of these problems that are an effective way to solve multiple related problems with a single run. In our setting the given problems share a monotone submodular function $f$ but have different knapsack constraints. We examine the case where elements within a constraint have the same cost and show that our multitasking formulations result in small Pareto fronts. This allows the population to share solutions between all problems, leading to significant improvements compared to running several classical approaches independently. Using rigorous runtime analysis, we analyze the expected time until the introduced multitasking approaches obtain a $(1-1/e)$-approximation for each of the given problems. Our experimental investigations for the maximum coverage problem give further insight into the dynamics behind how the approach works and doesn't work in practice for problems where elements within a constraint also have different costs.
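For context, the $(1-1/e)$ factor is the classical guarantee of greedy monotone submodular maximization under a cardinality constraint, shown here on a tiny maximum-coverage instance. This baseline sketch is not the paper's multitasking Pareto algorithm; the instance is invented.

```python
# Greedy (1 - 1/e)-approximation for monotone submodular maximization
# under a cardinality constraint, illustrated on maximum coverage.
# This is the classical baseline such Pareto-based approaches are
# compared against, not the paper's own algorithm.

def greedy_max_coverage(sets, k):
    """Pick k sets greedily, each time maximizing the marginal coverage gain."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max(range(len(sets)),
                   key=lambda i: len(sets[i] - covered) if i not in chosen else -1)
        if len(sets[best] - covered) == 0:
            break                      # no remaining marginal gain
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
chosen, covered = greedy_max_coverage(sets, k=2)
print(chosen, sorted(covered))  # [2, 0] [1, 2, 3, 4, 5, 6, 7]
```

Running the greedy once per budget mirrors the "separate runs" baseline; the multitasking formulation instead shares one population across all knapsack budgets.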

Impact 4.0 · Import 4.0 · Pop 3.5
#513
arXiv cs.NE 2026-04-16
by Hy Nguyen, Thanh Nguyen Pham, Helen Yuliana Angmalisang, Liam Wigney et al.

In many real-world settings, problem instances that need to be solved are quite similar, and knowledge from previous optimization runs can potentially be utilized. We explore this for the Traveling Salesperson problem with time windows (TSPTW), which often arises in settings where the travel-time matrix is fixed but time-window constraints change across related tasks. Existing TSPTW studies, however, have not systematically compared solving such task sequences independently with sequential transfer from previously solved tasks. We address this gap using a multi-task benchmark in which each base instance is expanded into five related tasks under two environments: partial time-window expansion and swap-additive time reassignment. We compare a standard from-scratch protocol with an iterative protocol that initializes each task from the best tour of the previous task, using the popular local search approaches LNS, VNS, and LKH-3 under a common penalized-score objective. Our experimental results show that the iterative protocol is consistently superior in the progressive-relaxation setting and generally competitive under swap-additive changes, with improvements increasing on more difficult …
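The gap between the two protocols can be caricatured with a toy cost model in which a hypothetical local search pays one unit of work per unit of distance between its starting solution and the task's optimum. Everything here is invented for illustration; real LNS/VNS/LKH-3 behavior is far richer.

```python
# Toy model of the two protocols compared above: each "task" is a target
# solution quality, and a hypothetical local search costs one step per
# unit of distance from its starting point to the target. All numbers
# are invented for illustration.

def steps(start, target):
    return abs(target - start)

tasks = [100, 98, 97, 95, 94]          # five related tasks (similar optima)

scratch = sum(steps(120, t) for t in tasks)   # always restart from 120
iterative, cur = 0, 120
for t in tasks:
    iterative += steps(cur, t)         # warm-start from the previous solution
    cur = t
print(scratch, iterative)  # 116 26 -- transfer pays off when tasks are similar
```

When consecutive tasks have nearby optima, the warm-started run only pays for the differences between tasks, which is the intuition behind the iterative protocol's advantage.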

Impact 4.0 · Import 4.0 · Pop 3.5
#514
arXiv stat.ML 2026-04-16
by Tianhao Liu, Daniel Andrés Díaz-Pachón, J. Sunil Rao

Supervised No Free Lunch Theorems (NFLTs) are well studied, yet unsupervised NFLTs remain underexplored. For elliptical distributions, we prove that there exist two equally optimal, scientifically meaningful bump-hunting strategies that are exact opposites, with no universal winner. Specifically, peeling $k$ orthogonal dimensions from $\mathbb{R}^d$ ($d \ge k$), retaining an inter-quantile region of probability $1-\alpha$ per peeled dimension, maximizes total variance and Frobenius norm when the $k$ smallest principal components (called pettiest components) are selected, and minimizes them when the selected dimensions are the $k$ leading principal components. These optima inspire PRIM-based bump-hunting algorithms either by minimizing variance or by minimizing volume, thereby motivating an NFLT. We test our results on the Fashion-MNIST database, showing that peeling the largest principal components captures multiplicity, while peeling the smallest principal components isolates popular styles.
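A minimal sketch of the peeling step, assuming PRIM-style inter-quantile boxes along principal directions of a synthetic Gaussian sample; whether to peel the leading or the "pettiest" components is exactly the choice the dichotomy concerns. Data and parameters are illustrative.

```python
import numpy as np

# Toy sketch of quantile peeling along principal directions (PRIM-style
# boxes). Whether to peel the leading or the smallest ("pettiest")
# principal components is the choice the no-free-lunch result concerns.
# Data and parameters are invented for illustration.

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0, 0.0], np.diag([9.0, 4.0, 1.0]), size=2000)

evals, evecs = np.linalg.eigh(np.cov(X.T))    # eigenvalues in ascending order
alpha, k = 0.1, 2

mask = np.ones(len(X), dtype=bool)
for j in range(k):                             # peel the k pettiest components
    proj = X @ evecs[:, j]
    lo, hi = np.quantile(proj, [alpha / 2, 1 - alpha / 2])
    mask &= (proj >= lo) & (proj <= hi)

print(mask.mean())  # close to (1 - alpha)**k = 0.81 for near-independent axes
```

Peeling the leading components instead would just mean iterating over the last `k` columns of `evecs`; the theorem says each choice is optimal for the opposite objective.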

Impact 4.0 · Import 4.0 · Pop 3.5
#515
arXiv stat.ML 2026-04-16
by Sotirios D. Nikolopoulos

Adaptive specification search generates statistically significant backtests even under martingale-difference nulls. We introduce a falsification audit testing complete predictive workflows against synthetic reference classes, including zero-predictability environments and microstructure placebos. Workflows generating significant walk-forward evidence in these environments are falsified. For passing workflows, we quantify selection-induced performance inflation using an absolute magnitude gap linking optimized in-sample evidence to disjoint walk-forward realizations, adjusted for effective multiplicity. Simulations validate extreme-value scaling under correlated searches and demonstrate detection power under genuine structure. Empirical case studies confirm that many apparent findings represent methodological artifacts rather than genuine predictability.

Impact 4.0 · Import 4.0 · Pop 3.5
#516
arXiv stat.ML 2026-04-16
by Panos Tsimpos, Daniel Sharp, Youssef Marzouk

We study dynamic measure transport for generative modelling in the setting of a stochastic process $X_\bullet$ whose marginals interpolate between a source distribution $P_0$ and a target distribution $P_1$ while remaining independent, i.e., when $(X_0,X_1)\sim P_0\otimes P_1$. Conditional expectations of this process $X_\bullet$ define an ODE whose flow map transports from $P_0$ to $P_1$. We discuss when such a process induces a \emph{straight-line flow}, namely one whose pointwise acceleration vanishes and is therefore exactly integrable by any first-order method. We first develop multiple characterizations of straightness in terms of PDEs involving the conditional statistics of the process. Then, we prove that straightness under endpoint independence exhibits a sharp dichotomy. On one hand, we construct explicit, computable straight-line processes for arbitrary Gaussian endpoints. On the other hand, we show straight-line processes do not exist for targets with sufficiently well-separated modes. We demonstrate this through a sequence of increasingly general impossibility theorems that uncover a fundamental relationship between the sample-path behavior of a process with independent …

Impact 4.0 · Import 4.0 · Pop 3.5
#517
arXiv stat.ML 2026-04-16
by Chenghui Zheng, Garvesh Raskutti

Feature selection is a classical problem in statistics and machine learning, and it continues to remain an extremely challenging problem especially in the context of unknown non-linear relationships with dependent features. On the other hand, Shapley values are a classic solution concept from cooperative game theory that is widely used for feature attribution in general non-linear models with highly-dependent features. However, Shapley values are not naturally suited for feature selection since they tend to capture both direct effects from each feature to the response and indirect effects through other features. In this paper, we combine the advantages of Shapley values and adapt them to feature selection by proposing \emph{MinShap}, a modification of the Shapley value framework, along with a suite of other related algorithms. In particular, MinShap, instead of taking the average of marginal contributions over permutations of features, considers the minimum marginal contribution across permutations. We provide a theoretical foundation motivated by the faithfulness assumption in DAGs (directed acyclic graphical models), a guarantee for the Type I error of MinShap, and show through numerical …
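The average-versus-minimum distinction can be made concrete on a toy value function with two redundant features, computed by exact enumeration over the 3! permutations. The value function is hypothetical, not from the paper; it only illustrates why the minimum zeroes out redundant, indirect contributions.

```python
from itertools import permutations

# Toy contrast between the Shapley value (average marginal contribution
# over feature permutations) and a MinShap-style score (minimum marginal
# contribution). The value function below is hypothetical: features 0
# and 1 are redundant copies, feature 2 contributes independently.

def value(S):
    return (1.0 if 0 in S or 1 in S else 0.0) + (0.5 if 2 in S else 0.0)

features = [0, 1, 2]

def marginals(i):
    out = []
    for perm in permutations(features):
        before = set(perm[:perm.index(i)])
        out.append(value(before | {i}) - value(before))
    return out

shapley = {i: sum(marginals(i)) / len(marginals(i)) for i in features}
minshap = {i: min(marginals(i)) for i in features}
print(shapley)  # {0: 0.5, 1: 0.5, 2: 0.5} -- redundant features share credit
print(minshap)  # {0: 0.0, 1: 0.0, 2: 0.5} -- redundancy is zeroed out
```

The Shapley average cannot distinguish the redundant pair from the genuinely necessary feature, while the minimum marginal contribution isolates direct effects, which is the motivation for using it in selection.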

Impact 4.0 · Import 4.0 · Pop 3.5
#518
arXiv stat.ML 2026-04-16
by Xuelin Xie, Xiliang Lu

To obtain more accurate model parameters and improve prediction accuracy, we proposed a regularized Kriging model that penalizes the hyperparameter theta in the Gaussian stochastic process, termed the Theta-regularized Kriging. We derived the optimization problem for this model from a maximum likelihood perspective. Additionally, we presented specific implementation details for the iterative process, including the regularized optimization algorithm and the geometric search cross-validation tuning algorithm. Three distinct penalty methods, Lasso, Ridge, and Elastic-net regularization, were meticulously considered. Meanwhile, the proposed Theta-regularized Kriging models were tested on nine common numerical functions and two practical engineering examples. The results demonstrate that, compared with other penalized Kriging models, the proposed model performs better in terms of accuracy and stability.

Impact 4.0 · Import 4.0 · Pop 3.5
#519
arXiv stat.ML 2026-04-16
by Y-h. Taguchi, Yoh-ichi Mototake

In this paper, we propose Bayesian Tucker decomposition (BTuD), in which the residual is assumed to follow a Gaussian distribution, analogous to linear regression. Although we propose a dedicated algorithm to perform BTuD, conventional higher-order orthogonal iteration can generate Tucker decompositions consistent with the present implementation. Using the proposed BTuD, we perform unsupervised feature selection, successfully applied to various synthetic datasets, globally coupled maps with randomized coupling strength, and gene expression profiles. Thus we can conclude that our newly proposed unsupervised feature selection method is promising. In addition, BTuD-based unsupervised FE is expected to coincide with the TD-based unsupervised FE that was previously proposed and successfully applied to a wide range of problems.

Impact 4.0 · Import 4.0 · Pop 3.5
#520
arXiv stat.ML 2026-04-16
by Emre Özyıldırım, Barış Yaycı, Umut Eren Akturk, Cem Tekin

We study downlink beam and rate adaptation in a multi-user mmWave MISO system where multiple base stations (BSs), each using analog beamforming from finite codebooks, serve multiple single-antenna user equipments (UEs) with a unique beam per UE and discrete data transmission rates. BSs learn about transmission success based on ACK/NACK feedback. To encode service goals, we introduce a satisficing throughput threshold $\tau_r$ and cast joint beam and rate adaptation as a combinatorial semi-bandit over beam-rate tuples. Within this framework, we propose SAT-CTS, a lightweight, threshold-aware policy that blends conservative confidence estimates with posterior sampling, steering learning toward meeting $\tau_r$ rather than merely maximizing. Our main theoretical contribution provides the first finite-time regret bounds for combinatorial semi-bandits with a satisficing objective: when $\tau_r$ is realizable, we upper bound the cumulative satisficing regret to the target with a time-independent constant, and when $\tau_r$ is non-realizable, we show that SAT-CTS incurs only a finite expected transient outside committed CTS rounds, after which its regret is governed by the sum of the regret contributions …

Impact 4.0 · Import 4.0 · Pop 3.5
#521
arXiv stat.ML 2026-04-16
by Yasin Abbasi-Yadkori, Peter L. Bartlett, Victor Gabillon, Alan Malek et al.

We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards? First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.
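The "simple random uniform learner" the abstract refers to is easy to make concrete: sample arms uniformly at random for the whole budget, then recommend the arm with the best empirical mean. A minimal fixed-budget sketch (function and parameter names are hypothetical):

```python
import random

def uniform_bai(pull, n_arms, budget, seed=1):
    """Uniform-sampling best-arm identification.

    `pull(i)` returns a reward for arm i. The learner ignores all feedback
    while exploring, which is what makes it robust to adversarial rewards:
    its sampling distribution cannot be steered by the reward sequence.
    """
    rng = random.Random(seed)
    total = [0.0] * n_arms
    count = [0] * n_arms
    for _ in range(budget):
        i = rng.randrange(n_arms)
        total[i] += pull(i)
        count[i] += 1
    means = [total[i] / count[i] if count[i] else float("-inf")
             for i in range(n_arms)]
    return max(range(n_arms), key=means.__getitem__)
```

The paper's question is precisely whether one can improve on this baseline in stochastic instances without giving up its adversarial robustness.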

Impact 4.0 · Import 4.0 · Pop 3.5
#522
arXiv stat.ML 2026-04-16
by Connie Trojan, Pavel Myshkov, Paul Fearnhead, James Hensman et al.

In online clustering problems, there is often a large amount of uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods give a natural way of representing and updating this uncertainty over time, but have prohibitive memory requirements for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method is able to accurately and efficiently solve clustering problems in this setting and others where traditional SMC struggles.

Impact 4.0 · Import 4.0 · Pop 3.5
#523
arXiv AIScience 2026-04-16
by Graham M Shore

Rovibrational transitions in the hydrogen and antihydrogen molecular ions $H_2^+$ and $\overline{H}_2^-$ offer the possibility of testing Lorentz and CPT symmetry to extremely high precision, in principle attaining $O(10^{-17})$. In this paper, the third in a series, we give a comprehensive derivation of the rovibrational spectrum of $H_2^+$ and $\overline{H}_2^-$ in the SME, an effective quantum field theory incorporating Lorentz and CPT violation. New developments described here include a complete analysis of the molecular dynamics from first principles in terms of the spherical tensor representation of the SME couplings, the systematic extension of our previous results to the non-minimal SME, a full description of the quantum number dependence of the rovibrational energy levels in the spherical tensor formalism with both high and low background magnetic fields, and an extended discussion of sidereal and annual variations of transition frequencies arising from both rotations and Lorentz boosts. The resulting sensitivity of the rovibrational spectrum to an extended range of SME couplings, together with the ability to isolate their individual effects using the quantum number depend

Impact 4.0 · Import 4.0 · Pop 3.5
#524
arXiv AIScience 2026-04-16
by Edna R. Toro, Tobias Held, Armin Bergermann, Megan Ikeya et al.

Ultrafast melting is fundamentally a structural transition of the ionic lattice, but this rearrangement also reshapes the electronic properties by changing the energy landscape and scattering mechanisms. Although the electrons react almost instantaneously, it is not {\it a priori} clear how much lattice disorder is required for a significant response. Here, we show that the onset of melting already produces a clear electronic signature in polycrystalline copper. Using single-shot terahertz time-domain spectroscopy on thin films excited over a wide range of laser fluences, we infer the transient conductivity during the first picoseconds after excitation. The data, supported by two-temperature molecular-dynamics simulations, show that before melting, electron transport is substantially limited by grain-boundary scattering and that melting strongly suppresses this channel. As melting begins at these interfaces, we observe a transient increase in the conductivity that directly marks the onset of the phase transition. More broadly, these results show that ionic and electronic relaxation stages are closely coupled in nonequilibrium laser-driven matter and that optical measurements can re

Impact 4.0 · Import 4.0 · Pop 3.5
#525
arXiv AIScience 2026-04-16
by Zhizheng Wang, Chih-Hsuan Wei, Joey Chan, Robert Leaman et al.

Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems lack explicit and inspectable criteria for evidence appraisal, creating a risk of compounding errors and making it difficult for researchers and clinicians to assess the reliability of their outputs. In parallel, current benchmarking approaches rarely evaluate performance on complex, real-world medical questions. Here, we introduce DeepER-Med, a Deep Evidence-based Research framework for Medicine with an agentic AI system. DeepER-Med frames deep medical research as an explicit and inspectable workflow of evidence-based generation, consisting of three modules: research planning, agentic collaboration, and evidence synthesis. To support realistic evaluation, we also present DeepER-MedQA, an evidence-grounded dataset comprising 100 expert-level research questions derived from authentic medical research scenarios and curated by a multidiscipl

Impact 4.0 · Import 4.0 · Pop 3.5
#526
arXiv AIScience 2026-04-16
by Mariia Ivonina, Jakub Rydzewski

The SARS-CoV-2 RNA pseudoknot is a promising target for antiviral intervention, as it regulates the efficiency of $-$1 programmed ribosomal frameshifting ($-$1 PRF), a mechanism that is essential for viral protein synthesis. The pseudoknot represents a viral RNA sequence composed of helical stems that adopts two long-lived topologies, threaded and unthreaded. Ligand-induced distortion of this fold is thought to underlie the susceptibility of $-$1 PRF to small-molecule inhibitors. Resolving these distortions from unbiased molecular dynamics (MD) requires collective variables (CVs) that isolate the slowest dynamic modes of the RNA--ligand system from the high-frequency fluctuations. Here, we use spectral map (SM), a thermodynamics-driven machine-learning method, to learn such CVs directly from MD trajectories of the SARS-CoV-2 RNA pseudoknot in complex with the $-$1 PRF inhibitor merafloxacin and two related analogs. We examine both threaded and unthreaded pseudoknot topologies and consider the neutral and ionized ligand forms relevant at physiological pH. Free-energy landscapes show that ligand-induced destabilization is topology-selective: merafloxacin and its analogs destabilize t

Impact 4.0 · Import 4.0 · Pop 3.5
#527
arXiv AIScience 2026-04-16
by Michael A. Seaton, Benjamin T. Speake, Ilian T. Todorov

Modelling micro- and mesoscopic-scale thermodynamic and transport properties of soft condensed matter hinges upon its representation. This is especially relevant for polar solvents such as water, since these require effective representation of their dielectric nature as driven by molecular charge distributions and molecular network structuring. The dielectric nature of a medium leads to complex phenomena such as local polarisability response and restructuring near interfaces in reaction to changes in local charge distributions. Inclusion of such phenomena when using larger-than-atomistic techniques such as coarse-grained molecular dynamics (CG-MD) and dissipative particle dynamics (DPD) is still an open question, to which we provide a novel way to consider and justify the necessary and suitable coarse-graining level, enabling us to compare new polar CG models' performance against that of an underlying atomistic model. We polarise our previous non-polar nDPD water model to prepare it for use in simulations of liquid electrolytes as well as solvated organic membranes and measure its fitness to serve as a dielectric medium by comparing its properties to those of the TIP3P water model

Impact 4.0 · Import 4.0 · Pop 3.5
#528
arXiv AIScience 2026-04-16
by Gabriele Amante, Fortunata Panzera, Gabriele Centi, Jing Xie et al.

The origin of enhanced reactivity in aqueous microdroplets remains debated, with interfacial electric fields (IEFs) often invoked as catalytic drivers. Here, we provide a quantum-mechanical, spatially resolved characterization of the electric field at air-water interfaces by combining deep-learning molecular dynamics with \emph{ab initio} re-sampling. Across planar interfaces and nanodroplets of varying curvature and charge state, we find an outward-oriented field of $\sim 1.0$--$1.2$ V/Å along the intrinsic surface normal. Crucially, its magnitude scales linearly with the average number of hydrogen bonds per interfacial molecule, directly tying the field to the local hydrogen-bond network. Despite its large magnitude and contrary to common expectations, we find that curvature and pH exert only a minor influence on the IEF, becoming negligible at experimentally relevant droplet sizes and pH. Consequently, the reactivity differences observed in $μ$m-sized droplets cannot be ascribed to variations in the IEF, which changes by a factor of only $\sim10^{-5}$ between $3$ and $40μ$m-sized droplets. Moreover, the IEF is localized inside the interfacial region and rapidly vanishes within a

Impact 4.0 · Import 4.0 · Pop 3.5
#529
arXiv AIScience 2026-04-16
by Xiao-Liang Qi

This article argues that the most important significance of the AI revolution, especially the rise of large language models, lies not simply in automation, but in a fundamental change in how complex information and human know-how are carried, replicated, and shared. From this perspective, AI for Science is especially important because it may transform not only the efficiency of research, but also the structure of scientific collaboration, discovery, publishing, and evaluation. The article outlines a gradual path from AI as a research tool to AI as a scientific collaborator, and discusses how AI is likely to fundamentally reshape scientific publication. It also argues that continuous learning and diversity of ideas are essential if AI is to play a meaningful role in original scientific discovery.

Impact 4.0 · Import 4.0 · Pop 3.5
#530
arXiv AIScience 2026-04-16
by Yubin Kim, Salman Rahman, Samuel Schmidgall, Chunjong Park et al.

Scientific discovery in digital health requires converting continuous physiological signals from wearable devices into clinically actionable biomarkers. We introduce CoDaS (AI Co-Data-Scientist), a multi-agent system that structures biomarker discovery as an iterative process combining hypothesis generation, statistical analysis, adversarial validation, and literature-grounded reasoning with human oversight using large-scale wearable datasets. Across three cohorts totaling 9,279 participant-observations, CoDaS identified 41 candidate digital biomarkers for mental health and 25 for metabolic outcomes, each subjected to an internal validation battery spanning replication, stability, robustness, and discriminative power. Across two independent depression cohorts, CoDaS surfaced circadian instability-related features in both datasets, reflected in sleep duration variability (DWB, ρ= 0.252, p < 0.001) and sleep onset variability (GLOBEM, ρ= 0.126, p < 0.001). In a metabolic cohort, CoDaS derived a cardiovascular fitness index (steps/resting heart rate; ρ= -0.374, p < 0.001), and recovered established clinical associations, including the hepatic function ratio (AST/ALT; ρ= -0.375, p < 0.
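The quoted associations are Spearman rank correlations (ρ). For reference, the statistic is the Pearson correlation of the ranks, with ties assigned average ranks; below is a generic from-scratch reimplementation, not the CoDaS pipeline:

```python
def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            # Find the run of tied values and give them their average rank.
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

A ρ of 0.252, as quoted for sleep duration variability, thus indicates a modest but monotone association between the ranked biomarker values and the ranked outcome.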

Impact 4.0 · Import 4.0 · Pop 3.5
#531
arXiv AIScience 2026-04-16
by Baris E. Ugur, Michael A. Webb

Chitosan is a highly versatile and sustainable polymer with a broad range of potential biological and materials engineering applications. Despite its versatility, the native brittleness of chitosan limits its broader utilization. This limitation can be addressed by blending chitosan with small-molecule additives to modulate its thermomechanical properties. We employ molecular dynamics (MD) simulations to investigate the mechanism underlying antiplasticization followed by plasticization at increasing water content. Decomposition of the elastic moduli reveals a competition between weakened polymer-polymer interactions and enhanced polymer-water interactions, with their relative strengths governing the resulting properties. We introduce a simple model incorporating dynamically accessible free volume regions as a key driver of polymer mobility, effectively capturing the (anti-)plasticization of elastic properties. We show that accessibility of free volume regions is enabled by connectivity of additive-accessible volume regions. This study provides new insights into the molecular interactions that dictate the properties of chitosan-water mixtures and may inform the rational design of ch

Impact 4.0 · Import 4.0 · Pop 3.5
#532
arXiv AIScience 2026-04-16
by Jose Recatala-Gomez, Haiwen Dai, Zhu Ruiming, Nikita Kaazev et al.

Materials discovery is fundamental to advancing next-generation technologies as well as to a sustainable and circular economy. Beyond computational screening, generative models are efficient at finding materials with desired properties via multi-modal learning on multiscale data. This perspective examines the landscape of generative design for inorganic materials and discusses the integration of multi-modal learning with high-throughput experimental validation. We contextualize these challenges through the lens of a generative design framework as a unified approach to the data-driven inverse design of functional materials. The central idea of the framework is a foundation AI model for inorganic materials, deeply interlinked with various property databases and high-throughput experiments via a machine-learning-driven closed loop, which enables the framework to solve key challenges in functional materials. We argue that domain-specific implementations of such integrated workflows represent a promising pathway toward the unresolved challenge of data-driven inverse design for atom-engineered inorganic functional materials.

Impact 4.0 · Import 4.0 · Pop 3.5
#537
Stratechery 2026-04-16

Listen to this post: Good morning, This week's Stratechery Interview is with F1 driver-turned-venture capitalist Nico Rosberg. Rosberg started his F1 career in 2005 and retired after winning the world championship in 2016; Rosberg spent his last four years at Mercedes as a teammate of his childhood friend Lewis Hamilton in one of the most intense teammate rivalries in F1 history. Over the last several years, however, Rosberg has reinvented himself as a venture capitalist, founding Rosberg Ventures, with a specific focus on leveraging his F1 background to build connections between European money and Silicon Valley startups in one direction, and startup products and German businesses in the other. In this interview we cover all aspects of Rosberg's journey, from having a steering wheel in his crib, to pioneering the use of sports psychology in F1, to his decision to retire on top of the world. Then, we discuss how F1 builds connections, the similarities between founders and drivers, and how he realized he could leverage that in a new competition: winning as an investor. What I found particularly interesting is how Rosberg's background and history seem so varied and unconnected on the surface, yet are clearly linked by a consistent ethos of maximizing opportunity in the service of winning. As a reminder, all Stratechery content, including interviews, is available as a podcast; click the link at the top of this email to add Stratechery to your podcast player. On to the Interview: An Interview with F1 Driver and Venture Capitalist Nico Rosberg About the Drive to Win This interview is lightly edited for clarity. Topics: A Driving Background | The Mental Game | Promotion | Winning the World Championship | The Path to VC | Maximizing Opportunities | Paddock

Impact 4.0 · Import 4.0 · Pop 3.5
#538
TWIML AI Podcast 2026-04-16

In this episode, Rashmi Shetty, senior director of enterprise generative AI platform at Capital One, joins us to explore how the company is designing, deploying, and scaling multi-agent systems in a highly regulated environment. Rashmi walks us through Chat Concierge, a multi-agent chat experience for auto dealerships that handles intent disambiguation, tool invocation, and human handoffs to deliver safer, more personalized customer journeys. We discuss Capital One’s platform-centric approach to AI agents and how it separates design from runtime governance, embedding policies, guardrails, and cyber controls across agent threat boundaries. Rashmi shares how the team approaches the developer experience for agent builders, observability, and evals for stochastic, multi-agent workflows; and strategies for model specialization, including fine-tuning and distillation. We also cover standards and abstraction, closed-loop learning from production telemetry, and key lessons for enterprises building agentic systems. The complete show notes for this episode can be found at https://twimlai.com/go/765.

Impact 4.0 · Import 4.0 · Pop 3.5
#539
YT: Two Minute Papers 2026-04-16

Links: https://deepmind.google/models/gemma/gemma-4/ https://ai.google.dev/gemma/docs/core/model_card_4 Fine tuning with Matt Mireles: https://x.com/mattmireles/status/2041606508220489786 Other sources: https://x.com/googlegemma/status/2041256042882105666?s=46 https://x.com/nakazakifam/status/2041286410930446370 https://x.com/measure_plan/status/2039815699695104343 https://x.com/maddiedreese/status/2041677327604838685?s=46 https://x.com/steipete/status/2042615534567457102?s=46 https://x.com/maziyarpanahi/status/2042592050940449260?s=46 https://x.com/adrgrondin/status/2041962263507083340?s=46 https://x.com/evgeniymikholap/status/2041104232648950170 https://www.youtube.com/watch?v=u4ydH-QvPeg https://www.reddit.com/r/learnmachinelearning/comments/1sjs14m/implementing_gemma_3_and_sliding_window_attention/ https://www.reddit.com/r/Anthropic/comments/1lxhmb4/claude_ban_for_no_reason/ https://www.tldrlegal.com/license/apache-license-2-0-apache-2-0 https://x.com/MaziyarPanahi/status/2043675576452706814

Impact 4.0 · Import 4.0 · Pop 3.5
#541
C4ISRNET 2026-04-16

Last August, U.S. Navy officials carrying out a test of unmanned vessels realized they had hit a single point of failure: Starlink. A global outage across Elon Musk’s satellite network affecting millions of Starlink users had left two dozen unmanned surface vessels bobbing off the California coast, disrupting communications and halting operations for almost an hour. The incident, which involved drones intended to bolster U.S. military options in a conflict with China, was one of several Navy test disruptions linked to SpaceX’s Starlink that left operators unable to connect with autonomous boats, according to internal Navy documents reviewed by Reuters and a person familiar with the matter. As SpaceX rockets toward a $2 trillion public offering this summer – expected to be the largest ever – the company has secured its position as the world’s most valuable space company in part by being indispensable to the U.S. government with an array of technologies spanning satellite communications to space launches and military AI. Starlink, in particular, has proved key to crucial programs - from drones to missile tracking - with a low-earth orbit constellation of close to 10,000 satellites, a scale that provides the military with a network resilient against potential adversary attacks. But the Navy’s mishaps with Starlink for its autonomous drone program, which have not been previously reported, highlight the challenges of the U.S. military’s growing reliance on SpaceX and the risks it brings to the Pentagon. “If there was no Starlink, the U.S. government wouldn’t have access to a global constellation of low earth orbit communications,” said Clayton Swope, a deputy director of the Aerospace Security Project at the Center for Strategic and International Studies. The Pentagon did no

Impact 4.0 · Import 4.0 · Pop 3.5
#542
C4ISRNET 2026-04-16

GRAZ, Austria — The European Commission this week unveiled the results of its 2025 European Defence Fund call for proposals, selecting 57 collaborative research and development projects for a combined €1.07 billion ($1.26 billion) in EU funding – a package that makes clear where the bloc’s defense priorities lie: drones, autonomy, and an increasingly institutionalized partnership with Kyiv. Of the total, €675 million ($796 million) will support 32 capability development projects, and €332 million ($391 million) will go to 25 research initiatives. The selected projects involve 634 entities from 26 EU member states plus Norway, with small and medium-sized enterprises making up more than 38% of participants and receiving over 21% of the total funding, according to a summary of the spending plan. The most striking cluster of projects marks a shift to 21st-century warfare, with at least four separate initiatives – EURODAMM, LUMINA, SKYRAPTOR, and TALON – devoted specifically to loitering munitions and affordable mass drone production. The concentration reflects an uncomfortable lesson absorbed from the war in Ukraine: cheap, expendable strike drones have reshaped the battlefield, and Europe’s defense industry has been slow to catch up. Lessons learned in Ukraine are referenced repeatedly throughout the EDF’s materials on the funding round and individual projects. That battlefield knowledge is now being plugged into the fund’s architecture. For the first time, Ukrainian entities are eligible to participate in EDF projects as subcontractors and third-party recipients, marking a significant step toward integrating Ukraine’s defense-technological and industrial base into the European ecosystem. In the coming months, Kyiv and Brussels are expected to complete the required associa

Impact 4.0 · Import 4.0 · Pop 3.5
#543
C4ISRNET 2026-04-16

PARIS — France’s armed forces are working on a data-management system powered by artificial intelligence as a sovereign equivalent to the U.S. Defense Department’s Project Maven, said Gen. Benoît Desmeulles, the commander of the French 1st Army Corps. The armed forces are working with partners on a system to provide what Desmeulles called “true distributed working capability” centered on data and using advanced AI, “a sovereign system that will essentially be the equivalent of Maven.” The system could be available within a few months, and available for exercises in September 2027, the general said, declining to provide specifics. Project Maven is a Pentagon program that uses AI to process drone and surveillance data to automatically detect and track objects, using technology provided by contractors including Palantir Technologies. Maven has faced controversy amid questions about AI-assisted targeting in Iran, with concerns about speed, accountability, and harm to civilians related to automated kill chains. “We’ve really positioned data as the center of everything we do,” Desmeulles said in a briefing with three reporters on Saturday at the Montmorillon military camp in western France, describing data as the ammunition of the command post. “The centrality of data is something that’s well understood by the corps, the Army, and the French forces,” he said. “So, we’re really focused on that.” The armed forces are on track to develop “a true distributed working capability, based on highly advanced artificial intelligence and centered on data,” Desmeulles said. “We’re following that logic, to remain sovereign, and that’s an area where we are strong.” Desmeulles said his corps is already seeing “very, very good” results from a data-centric approach, even if there is “still a

Impact 4.0 · Import 4.0 · Pop 3.5
#544
FedScoop 2026-04-16

Despite the 60-plus-day shutdown, the Department of Homeland Security is continuing to advance its plans to build a “smart” border wall, according to a top official at Customs and Border Protection. “It’s not just a barrier,” CBP Commissioner Rodney Scott told lawmakers during a DHS budget hearing Thursday. “A lot of people miss that [it] has technology built into it.” Part of what makes the border wall “smart” is surveillance towers that use AI and other detection-and-tracking technologies to alert agents when someone is approaching. The White House’s budget proposal for fiscal 2027 includes $96.6 million for the integrated surveillance towers initiative to support a network totaling 890 autonomous towers, including 95 that are expected to be built in the year to come. CBP is working to meet that goal. “We have four vendors now that have passed the autonomy test,” Scott said. “We’re going through a second testing in about a month or two.” The DHS unit wants to make sure that all the towers are integrated into a centralized system to ensure agents have visibility throughout. Third-party validators and agents are part of the testing process, Scott said. “Right now, it’s not about signing the contract,” Scott said. “It’s about making sure that the vendors can buy and get the equipment that they need to build the towers.” The expansive network is years in the making. The fiscal 2024 budget tabbed $174 million to establish a consolidated tower program and procure new tech for the initiative. The fiscal 2025 budget earmarked nearly $102 million for integrated surveillance towers. While still not enacted, the fiscal 2026 budget proposal allocated another $138.7 million to the initiatives. Vendors are vying for a chance to lock down what cou

Impact 4.0 · Import 4.0 · Pop 3.5
#545
FedScoop 2026-04-16

The Office of Personnel Management has sent information for hundreds of pre-vetted Tech Force candidates to agencies across the government, including a third list it distributed Wednesday, the agency’s top official and a spokesperson told FedScoop. In an interview with FedScoop at AITalks on Tuesday, OPM Director Scott Kupor said that the agency had already distributed about 550 candidates to agencies via two “shared certificates,” a streamlined process that lets multiple agencies hire for similar positions at once. A third shared certificate Wednesday brings the pool to just over 700 candidates, per figures shared by OPM. While Kupor also told FedScoop that some offers have gone out, he wasn’t certain if anyone had started yet. He said that “it takes time between” an agency saying yes on a candidate and human resources getting that person an offer. “I feel like we’ve got the right candidate pool there, and we’re making good progress,” Kupor said. The hiring push is the Trump administration’s attempt to fill tech vacancies within the federal government — some of which were likely created by its own worker terminations and incentivized departure programs — with young workers who would serve two-year stints. After that, it will be their choice to either stay or join the private sector. The hiring goal for the program is roughly 1,000 workers. Use of shared certs for the Tech Force is part of a broader push in the administration to use the hiring tool to streamline the hiring process. Shared certs were established under a 2016 law that allows agencies to use a single shared certificate, rather than multiple, to hire for a large group of related positions across agencies. Through that process, an originating agency assumes the front-end legwork of

Impact 4.0Import 4.0Pop 3.5
#546

Anthropic’s Nuclear Bomb

Gov/Defense ★ 3.9
War on the Rocks 2026-04-16

A few hours before Anthropic announced the launch of its newest model, Claude Mythos Preview, on April 7, I had just completed a six-month analysis of AI-enabled cyberattacks. My research traced Chinese state-sponsored cyber campaigns against U.S. critical infrastructure and found that the barrier between nation-state-level hacking and everyone else was eroding far too fast. By the time I closed my laptop that afternoon, Mythos had shattered that barrier. This new model could theoretically autonomously exploit previously unknown vulnerabilities in virtually every major operating system and web browser on Earth, without human supervision. My threat model, seemingly alarmist at breakfast, was The post Anthropic’s Nuclear Bomb appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#547
War on the Rocks 2026-04-16

After failing to reach an agreement in the first round of talks in Islamabad, the United States and Iran are set to resume talks in the coming days. Via Pakistani mediation, U.S. and Iranian negotiators have reportedly made progress toward a framework agreement, though significant gaps remain and a deal is far from guaranteed. In the meantime, President Donald Trump has imposed a blockade on Iranian ports, while Israel is pressing ahead with its assault on the Lebanese border town of Bint Jbeil, even as it engages in direct talks with Lebanon in Washington. The two-week ceasefire with Iran is The post A Fragile Ceasefire with Iran and the Price of Ending the War appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#548
War on the Rocks 2026-04-16

Commercial technologies are enabling the U.S. military and militaries around the world to operate with greater efficiency, speed, precision, and lethality. The benefits of commercial capabilities can be particularly impactful given the defense industrial base’s struggles to produce at the speed and scale needed to outmatch China, Russia, and other aligned adversaries. It is largely for these reasons that President Donald Trump mandated the Department of Defense to preference commercial solutions. Congress also reinforced efforts to prioritize commercially available solutions in Section 1214 of the 2026 National Defense Authorization Act. Commercial technologies are also a driving factor behind increasingly rapid evolutions in The post Rethinking Security Cooperation in the Age of Commercial Tech appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#549
War on the Rocks 2026-04-16

The U.S.-Israeli war against Iran has exposed a reality many policymakers long preferred to avoid: The deterrence model that governed the Gulf for decades is no longer working as intended. For years, the region operated in the gray zone — covert strikes, proxy warfare, and carefully managed escalation. Iran built a strategy around missiles, regional partners, and nuclear latency. The United States underwrote Gulf security without direct war. Saudi Arabia and its neighbors relied on that umbrella while hedging against its limits, investing in missile defense and selective partnerships. There were rules, even if unwritten. That world is breaking down. The two-week ceasefire announced The post The End of Managed Escalation in the Gulf appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#550
War on the Rocks 2026-04-16

Editor's note: This article is the sixth in an 11-part series examining how the United States should organize, lead, and integrate economic statecraft into strategy, defense practice, and the broader national security ecosystem. This special series is brought to you by the Potomac Institute for Policy Studies and War on the Rocks. Prior installments can be found at the War by Other Ledgers page. The next major war the United States fights could be decided by supply chains long before the first shot is fired. That reality is already taking shape, as adversaries use export controls, sanctions, and supply chain leverage to shape outcomes. The post "Operationalizing Economic Statecraft: A New Imperative for the Pentagon" appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#551
arXiv cs.NE 2026-04-17
by Lúcio Folly Sanches Zebendo, Eleonora Cicciarella, Michele Rossi

Spiking neural networks (SNNs) are rapidly gaining momentum as an alternative to conventional artificial neural networks in resource-constrained edge systems. In this work, we continue a recent line of research on recurrent SNNs in which axonal delays are learned at runtime along with the other network parameters. The first proposed approach, dubbed DelRec, demonstrated the benefit of recurrent delay learning in SNNs. Here, we extend it by advocating the use of convolutional recurrent connections in conjunction with the DelRec delay-learning mechanism. In our tests on an audio classification task, this leads to a streamlined architecture with a smaller memory footprint (around 99% savings in the number of recurrent parameters) and much faster (52x) inference, while retaining DelRec's accuracy. Our code is available at: https://github.com/luciozebendo/delrec_snn/tree/conv_delays
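To make "axonal delay" concrete: in a spiking layer, an integer delay per synapse shifts when a presynaptic spike is delivered to the postsynaptic neuron. A minimal NumPy sketch of a LIF layer applying fixed integer delays (this is illustrative only, not the DelRec learning algorithm, and all names are made up here):

```python
import numpy as np

def lif_with_delays(spikes_in, weights, delays, tau=0.9, v_th=1.0):
    """Toy LIF layer where each input synapse has an integer axonal delay:
    a spike emitted at time t is delivered at time t + delay."""
    T, n_in = spikes_in.shape
    n_out = weights.shape[0]
    max_d = int(delays.max())
    buffer = np.zeros((T + max_d, n_out))   # delayed input currents
    for t in range(T):
        for j in range(n_in):
            if spikes_in[t, j]:
                buffer[t + delays[j]] += weights[:, j]
    v = np.zeros(n_out)
    spikes_out = np.zeros((T, n_out), dtype=bool)
    for t in range(T):
        v = tau * v + buffer[t]             # leaky integration
        spikes_out[t] = v >= v_th
        v[spikes_out[t]] = 0.0              # reset neurons that fired
    return spikes_out
```

In DelRec the delays are trained alongside the weights; here they are fixed so the delivery mechanics are easy to see.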

Impact 4.0 · Import 4.0 · Pop 3.5
#552
arXiv cs.NE 2026-04-17
by Jorge Sánchez, Guadalupe García-Isla, Sandra Perez-Herrero, Beatriz Trenor et al.

Designing optimizers that remain effective under tight evaluation budgets is critical in expensive black-box settings such as cardiac digital twinning. We propose Frenetic Cat-inspired Particle Optimization (FCPO), a hybrid swarm method that couples particle swarm optimization-like dynamics with an explicit-state Markov switching controller to schedule exploration and refinement operators online. FCPO integrates (i) state-conditioned bounded motion, (ii) an elite-difference global jump operator to escape stagnation, (iii) eigen-space guided local refinement from elite covariance, and (iv) linear population size reduction to control late-stage computational cost. We benchmark FCPO on five representative functions from the Congress on Evolutionary Computation (CEC) 2022 suite (F1, F2, F3, F6 and F10) at dimensions D ∈ {10, 20} over 30 independent runs, comparing against PSO, CSO, CLPSO, SHADE, L-SHADE and CMA-ES. FCPO achieves the lowest mean runtime across the ten benchmark cases (average 0.183 s), about 2.3x faster than CMA-ES and 2.6x faster than L-SHADE in our Python implementation. On the multimodal composition function F10 at D=20, FCPO attains the best mean objective value.
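The core idea of a Markov switching controller is that each particle carries a discrete state, and a transition matrix decides per step whether it runs an exploration move or a refinement move. A toy sketch of that scheduling idea on a sphere function, with invented operators and parameters (not the published FCPO operators):

```python
import numpy as np

def switching_pso(f, dim=5, n=20, iters=200, seed=0):
    """Swarm optimizer with a two-state Markov controller per particle:
    state 0 takes a PSO-style exploration step, state 1 contracts toward
    the global best (refinement)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n, dim))
    v = np.zeros((n, dim))
    state = np.zeros(n, dtype=int)              # 0 = explore, 1 = refine
    P = np.array([[0.9, 0.1], [0.3, 0.7]])      # Markov transition matrix
    pbest = x.copy()
    pval = np.array([f(p) for p in x])
    g = pbest[pval.argmin()].copy()
    for _ in range(iters):
        for i in range(n):
            state[i] = rng.choice(2, p=P[state[i]])
            if state[i] == 0:                   # exploration: PSO-style move
                v[i] = (0.7 * v[i]
                        + 1.5 * rng.random(dim) * (pbest[i] - x[i])
                        + 1.5 * rng.random(dim) * (g - x[i]))
                x[i] = x[i] + v[i]
            else:                               # refinement: contract toward g
                x[i] = g + 0.5 * (x[i] - g) + 0.01 * rng.standard_normal(dim)
            fx = f(x[i])
            if fx < pval[i]:
                pval[i] = fx
                pbest[i] = x[i].copy()
        g = pbest[pval.argmin()].copy()
    return g, float(pval.min())

best, val = switching_pso(lambda z: float(np.sum(z * z)))
```

FCPO's actual operators (elite-difference jumps, eigen-space refinement, population size reduction) are richer; this only shows how a state machine can interleave two move types.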

Impact 4.0 · Import 4.0 · Pop 3.5
#553
arXiv cs.NE 2026-04-17
by Qianshi Wang, Xilong Qu, Wenbin Pei, Nan Li et al.

Influence maximization (IM) is a fundamental problem in complex network analysis, with a wide range of real-world applications. To date, existing approaches to influential node identification in IM have predominantly relied on standard graphs, failing to capture the higher-order intrinsic interactions embedded in many real-world systems. Hypergraphs can better capture such higher-order interactions. However, using hypergraphs may lead to an excessively large search space and increased complexity in modeling cascading dynamics, making it challenging to accurately identify influential nodes. Therefore, in this study, we propose a new hypergraph-modeled IM method based on the Discrete Particle Swarm Optimization algorithm and the threshold model. In the proposed method, a particle (i.e., a candidate solution) represents the selection information of seed nodes, and the fitness function is designed to accurately and efficiently evaluate the influence of seed nodes via a two-layer local influence approximation. We also propose a degree-based initialization strategy to improve the quality of initial solutions and develop rules for updating particles' velocity and position.
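One simple form a degree-based initialization can take on a hypergraph is to rank nodes by hyperdegree (the number of hyperedges a node belongs to) and seed with the top-k. A minimal sketch of that heuristic; the paper's exact strategy may differ:

```python
from collections import defaultdict

def hyperdegrees(hyperedges):
    """Hyperdegree = number of hyperedges a node belongs to."""
    deg = defaultdict(int)
    for edge in hyperedges:
        for node in edge:
            deg[node] += 1
    return dict(deg)

def degree_init_seeds(hyperedges, k):
    """Pick the k highest-hyperdegree nodes as the initial seed set
    (ties broken by node id)."""
    deg = hyperdegrees(hyperedges)
    return sorted(deg, key=lambda node: (-deg[node], node))[:k]
```

For example, with hyperedges {1,2,3}, {2,3}, {3,4}, node 3 has hyperdegree 3 and would lead the initial seed set.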

Impact 4.0 · Import 4.0 · Pop 3.5
#554
arXiv cs.NE 2026-04-17
by Hyeongmeen Baik, Hamed Poursiami, Maryam Parsa, Jinia Roy

Always-on converter health monitoring demands sub-mW edge inference, a regime inaccessible to GPU-based physics-informed neural networks. This work separates spiking temporal processing from physics enforcement: a three-layer leaky integrate-and-fire SNN estimates passive component parameters while a differentiable ODE solver provides physics-consistent training by decoupling the ODE physics loss from the unrolled spiking loop. On an EMI-corrupted synchronous buck converter benchmark, the SNN reduces lumped resistance error from 25.8% to 10.2% versus a feedforward baseline, within the ±10% manufacturing tolerance of passive components, at a projected ~270x energy reduction on neuromorphic hardware. Persistent membrane states further enable degradation tracking and event-driven fault detection via a +5.5 percentage-point spike-rate jump at abrupt faults. With 93% spike sparsity, the architecture is suited for always-on deployment on Intel Loihi 2 or BrainChip Akida.

Impact 4.0 · Import 4.0 · Pop 3.5
#555
arXiv cs.CV 2026-04-17
by Siddhant Bharadwaj, Ashish Vashist, Fahimul Aleem, Shruti Vyas

Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored. In this work, we present a systematic evaluation of multiple state-of-the-art VLMs for country-level image geolocalization using ground-view imagery only. Instead of relying on image matching, GPS metadata, or task-specific training, we evaluate prompt-based country prediction in a zero-shot setting. The selected models are tested on three geographically diverse datasets to assess their robustness and generalization ability. Our results reveal substantial variation across models, highlighting the potential of semantic reasoning for coarse geolocalization and the limitations of current VLMs in capturing fine-grained geographic cues. This study provides the first focused comparison of modern VLMs for country-level geolocalization and establishes a foundation for future research at the intersection of multimodal reasoning and geographic understanding.

Impact 4.0 · Import 4.0 · Pop 3.5
#556
arXiv cs.CV 2026-04-17
by Nishq Poorav Desai, Ali Etemad, Michael Greenspan

Time-to-Collision (TTC) forecasting is a critical task in collision prevention, requiring precise temporal prediction and an understanding of the local and global patterns encapsulated in a video, both spatially and temporally. To address the multi-scale nature of video, we introduce CollideNet, a novel spatiotemporal hierarchical transformer-based architecture catered specifically to effective TTC forecasting. In the spatial stream, CollideNet aggregates information for each video frame simultaneously at multiple resolutions. In the temporal stream, along with multi-scale feature encoding, CollideNet also disentangles the non-stationarity, trend, and seasonality components. Our method sets a new state of the art by a considerable margin on three commonly used public datasets. We conduct cross-dataset evaluations to analyze the generalization capabilities of our method, and visualize the effects of disentangling the trend and seasonality components of the video data. We release our code at https://github.com/DeSinister/CollideNet/.
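The trend/seasonality disentanglement idea has a classical analogue: split a signal into a moving-average trend, a per-phase seasonal mean, and a residual. A generic NumPy illustration of that decomposition (CollideNet learns its decomposition end-to-end; this is only the textbook version of the concept):

```python
import numpy as np

def decompose(series, period):
    """Split a 1-D signal into trend (moving average), seasonality
    (per-phase mean of the detrended signal), and residual."""
    x = np.asarray(series, dtype=float)
    trend = np.convolve(x, np.ones(period) / period, mode="same")
    detrended = x - trend
    seasonal = np.array([detrended[p::period].mean() for p in range(period)])
    season_full = np.resize(seasonal, x.size)   # tile per-phase means
    resid = detrended - season_full
    return trend, season_full, resid
```

By construction the three components sum back to the input exactly, which is the property that makes additive decompositions easy to verify.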

Impact 4.0 · Import 4.0 · Pop 3.5
#557
arXiv cs.CV 2026-04-17
by Khawaja Azfar Asif, Rafaqat Alam Khan

Panoramic radiography is a fundamental diagnostic tool in dentistry, offering a comprehensive view of the entire dentition with minimal radiation exposure. However, manual interpretation is time-consuming and prone to errors, especially in high-volume clinical settings, creating a pressing need for efficient automated solutions. This study presents the first application of YOLOv26 for automated tooth detection, FDI-based numbering, and dental disease segmentation in panoramic radiographs. The DENTEX dataset was preprocessed using Roboflow for format conversion and augmentation, yielding 1,082 images for tooth enumeration and 1,040 images for disease segmentation across four pathology classes. Five YOLOv26-seg variants were trained on Google Colab using transfer learning at a resolution of 800x800. Results demonstrate that the YOLOv26m-seg model achieved the best performance for tooth enumeration, with a precision of 0.976, recall of 0.970, and box mAP50 of 0.976. It outperformed the YOLOv8x baseline by 4.9% in precision and 3.3% in mAP50, while also enabling high-quality mask-level segmentation (mask mAP50 = 0.970).

Impact 4.0 · Import 4.0 · Pop 3.5
#558
arXiv cs.CV 2026-04-17
by Muhammad Z. Alam, Larry Stetsiuk, Arooba Zeshan

This paper presents a novel saturation-aware, space-variant blind image deblurring framework designed to address the challenges posed by saturated pixels when deblurring under high-dynamic-range and low-light conditions. The proposed approach segments the image based on blur intensity and proximity to saturation, leveraging a pre-estimated Light Spread Function to mitigate stray-light effects. By accurately estimating the true radiance of saturated regions using the dark channel prior, our method enhances the deblurring process without introducing artifacts such as ringing. Experimental evaluations on both synthetic and real-world datasets demonstrate that the framework improves deblurring outcomes across various scenarios, showing superior performance compared to state-of-the-art saturation-aware and general-purpose methods. This adaptability highlights the framework's potential for integration with existing and emerging blind image deblurring techniques.
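The dark channel prior referenced here is the classic He et al. formulation: for each pixel, take the minimum intensity over the colour channels and over a local patch; in haze-free natural images this value is close to zero. A minimal sketch of the computation (the paper's use of it to estimate radiance in saturated regions goes beyond this):

```python
import numpy as np

def dark_channel(img, patch=15):
    """Dark channel of an RGB image: per pixel, the minimum over colour
    channels and over a local square patch (He et al.)."""
    h, w, _ = img.shape
    min_c = img.min(axis=2)                 # min over the colour channels
    r = patch // 2
    padded = np.pad(min_c, r, mode="edge")
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out
```

The nested loop keeps the sketch readable; a production version would use a min-filter primitive instead.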

Impact 4.0 · Import 4.0 · Pop 3.5
#559
arXiv cs.CV 2026-04-17
by Lorenzo Beltrame, Jules Salzinger, Filip Svoboda, Jasmin Lampert et al.

We present a three-stage progressive shadow-removal pipeline for the CVPR2026 NTIRE WSRD+ challenge. Built on OmniSR, our method treats deshadowing as iterative direct refinement, where later stages correct residual artefacts left by earlier predictions. The model combines RGB appearance with frozen DINOv2 semantic guidance and geometric cues from monocular depth and surface normals, reused across all stages. To stabilise multi-stage optimisation, we introduce a contraction-constrained objective that encourages non-increasing reconstruction error across the cascade. A staged training pipeline transfers from earlier WSRD pretraining to WSRD+ supervision and final WSRD+ 2026 adaptation with cosine-annealed checkpoint ensembling. On the official WSRD+ 2026 hidden test set, our final ensemble achieved 26.680 PSNR, 0.8740 SSIM, 0.0578 LPIPS, and 26.135 FID, ranked first overall, and won the NTIRE 2026 Image Shadow Removal Challenge. The strong performance of the proposed model is further validated on the ISTD+ and UAV-SC+ datasets.
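One plausible shape for a contraction-constrained objective is a hinge penalty on per-stage reconstruction errors: zero when each stage's error is no worse than the previous stage's, positive otherwise. This is a guess at the general form, not the challenge entry's exact loss:

```python
import numpy as np

def contraction_penalty(stage_errors, margin=0.0):
    """Zero when reconstruction error is non-increasing across cascade
    stages; positive when a later stage is worse than an earlier one."""
    e = np.asarray(stage_errors, dtype=float)
    return float(np.sum(np.maximum(0.0, e[1:] - e[:-1] + margin)))
```

Added to the main reconstruction loss, such a term discourages later refinement stages from undoing earlier progress.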

Impact 4.0 · Import 4.0 · Pop 3.5
#560
arXiv cs.CV 2026-04-17
by Michał Romaszewski, Dominik Kopeć, Michał Cholewa, Katarzyna Kołodziej et al.

Hyperspectral tree species classification is challenging due to limited and imbalanced class labels, spectral mixing (overlapping light signatures from multiple species), and ecological heterogeneity (variability among ecological systems). Addressing these challenges requires methods that integrate biological and structural characteristics of vegetation, such as canopy architecture and interspecific interactions, rather than relying solely on spectral signatures. This paper presents a biologically informed, semi-supervised deep learning method that integrates multi-sensor Earth observation data, specifically hyperspectral imaging (HSI) and airborne laser scanning (ALS), with expert ecological knowledge. The approach relies on biologically inspired pseudo-labelling over a precomputed canopy graph, yielding accurate classification at low training cost. In addition, ecological priors on species cohabitation are automatically derived from reliable sources using large language models (LLMs) and encoded as a cohabitation matrix with likelihoods of species occurring together. These priors are incorporated into the pseudo-labelling strategy.

Impact 4.0 · Import 4.0 · Pop 3.5
#561
arXiv cs.CV 2026-04-17
by Federico Nocentini, Kwanggyoon Seo, Qingju Liu, Claudio Ferrari et al.

Speech-Driven Facial Animation (SDFA) has gained significant attention due to its applications in movies, video games, and virtual reality. However, most existing models are trained on single-language data, limiting their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation since language influences phonetics, rhythm, intonation, and facial expressions. Speaking style is also shaped by individual differences, not only by language. Existing methods typically rely on either language-specific or speaker-specific conditioning, but not both, limiting their ability to model the interaction between the two. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method uses transcript embeddings to encode language information and style embeddings extracted from reference facial sequences to capture individual speaking characteristics. Polyglot does not require predefined language or speaker labels, enabling generalization across languages and speakers through self-supervised learning. By jointly conditioning on language and style, it captures expressive traits such as rhythm.

Impact 4.0 · Import 4.0 · Pop 3.5
#562
arXiv cs.CV 2026-04-17
by Laziz Hamdi, Amine Tamasna, Thierry Paquet

Tables condense key transactional and administrative information into compact layouts, but practical extraction requires more than text recognition: systems must also recover structure (rows, columns, merged cells, headers) and interpret roles such as line items, subtotals, and totals under common capture artifacts. Many existing resources for table structure recognition and TableVQA are built from clean digital-born sources or rendered tables, and therefore only partially reflect noisy administrative conditions. We introduce DenTab, a dataset of 2,000 cropped table images from dental estimates with high-quality HTML annotations, enabling evaluation of table recognition (TR) and table visual question answering (TableVQA) on the same inputs. DenTab includes 2,208 questions across eleven categories spanning retrieval, aggregation, and logic/consistency checks. We benchmark 16 systems, including 14 vision-language models (VLMs) and two OCR baselines. Across models, strong structure recovery does not consistently translate into reliable performance on multi-step arithmetic and consistency questions, and these reasoning failures persist even when using ground-truth HTML table inputs.

Impact 4.0 · Import 4.0 · Pop 3.5
#563
arXiv cs.CV 2026-04-17
by Jieming Yu, Qiuxiao Feng, Zhuohan Wang, Xiaochen Ma

With the rapid advancement of deep generative models, realistic fake images have become increasingly accessible, yet existing localization methods rely on complex designs and still struggle to generalize across manipulation types and imaging conditions. We present a simple but strong baseline based on DINOv3 with LoRA adaptation and a lightweight convolutional decoder. Under the CAT-Net protocol, our best model improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks using only 9.1M trainable parameters on top of a frozen ViT-L backbone, and even our smallest variant surpasses all prior specialized methods. LoRA consistently outperforms full fine-tuning across all backbone scales. Under the data-scarce MVSS-Net protocol, LoRA reaches an average F1 of 0.774 versus 0.530 for the strongest prior method, while full fine-tuning becomes highly unstable, suggesting that pre-trained representations encode forensic information that is better preserved than overwritten. The baseline also exhibits strong robustness to Gaussian noise, JPEG re-compression, and Gaussian blur. We hope this work can serve as a reliable baseline for the research community.
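The arithmetic behind LoRA is compact: the frozen weight W is augmented with a low-rank update (alpha/r) * B @ A, where only A and B are trained and B starts at zero so the adapter is initially a no-op. A self-contained NumPy sketch of the forward pass (generic LoRA, not this paper's training setup; all shapes below are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Linear layer with a LoRA update: y = x W^T + (alpha/r) x A^T B^T,
    where W is frozen and only A (r x d_in) and B (d_out x r) are trained."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # adapter "down" projection
B = np.zeros((d_out, r))                    # adapter "up" projection, zero-init
x = rng.standard_normal((3, d_in))
y = lora_forward(x, W, A, B)                # == x @ W.T while B is zero
```

The paper's finding that LoRA beats full fine-tuning here is consistent with the small trainable footprint: only A and B (r * (d_in + d_out) values per layer) move, leaving the pre-trained representation intact.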

Impact 4.0 · Import 4.0 · Pop 3.5
#564
arXiv stat.ML 2026-04-17
by Kyunghoo Mun, Matthew Rosenzweig

We study phase transitions for repulsive-attractive mean-field free energies on the circle. For a $\frac{1}{n+1}$-periodic interaction whose Fourier coefficients satisfy a certain decay condition, we prove that the critical coupling strength $K_c$ coincides with the linear stability threshold $K_\#$ of the uniform distribution and that the phase transition is continuous, in the sense that the uniform distribution is the unique global minimizer at criticality. The proof is based on a sharp coercivity estimate for the free energy obtained from the constrained Lebedev--Milin inequality. We apply this result to three motivating models for which the exact value of the phase transition and its (dis)continuity in terms of the model parameters was not fully known. For the two-dimensional Doi--Onsager model $W(θ)=-|\sin(2πθ)|$, we prove that the phase transition is continuous at $K_c=K_\#=3π/4$. For the noisy transformer model $W_β(θ)=(e^{β\cos(2πθ)}-1)/β$, we identify the sharp threshold $β_*$ such that $K_c(β) = K_\#(β)$ and the phase transition is continuous for $β\leq β_*$, while $K_c(β) < K_\#(β)$ for $β > β_*$. We also obtain the corresponding sharp dichotomy for the noisy Hegselmann--Krause model $W_R$.

Impact 4.0 · Import 4.0 · Pop 3.5
#565
arXiv stat.ML 2026-04-17
by Stephan Bark, Waqas Ahmed Malik, Maryna Prus, Hans-Peter Piepho et al.

In variety testing, multi-environment trials (MET) are essential for evaluating the genotypic performance of crop plants. A persistent challenge in the statistical analysis of MET data is the estimation of variance components, which are often inaccurately estimated or shrunk to exactly zero when using residual (restricted) maximum likelihood (REML) approaches. At the same time, institutions conducting MET typically possess extensive historical data that can, in principle, be leveraged to improve variance component estimation. However, these data are rarely incorporated sufficiently. This paper addresses that gap by proposing a Bayesian framework that systematically integrates historical information to stabilize variance component estimation and better quantify uncertainty. Our Bayesian linear mixed model (BLMM) reformulation uses priors and Markov chain Monte Carlo (MCMC) methods to keep the variance components positive, yielding more realistic distributional estimates. Furthermore, our model incorporates historical prior information by managing MET data in successive historical data windows.

Impact 4.0 · Import 4.0 · Pop 3.5
#566
arXiv stat.ML 2026-04-17
by Hidetoshi Kawase, Toshihiro Ota

In finite-width deep neural networks, the empirical kernel $G$ evolves stochastically across layers. We develop a collective kernel effective field theory (EFT) for pre-activation ResNets based on a $G$-only closure hierarchy and diagnose its finite validity window. Exploiting the exact conditional Gaussianity of residual increments, we derive an exact stochastic recursion for $G$. Applying Gaussian approximations systematically yields a continuous-depth ODE system for the mean kernel $K_0$, the kernel covariance $V_4$, and the $1/n$ mean correction $K_{1,\mathrm{EFT}}$, which emerges diagrammatically as a one-loop tadpole correction. Numerically, $K_0$ remains accurate at all depths. However, the $V_4$ equation residual accumulates to an $O(1)$ error at finite time, primarily driven by approximation errors in the $G$-only transport term. Furthermore, $K_{1,\mathrm{EFT}}$ fails due to the breakdown of the source closure, which exhibits a systematic mismatch even at initialization. These findings highlight the limitations of $G$-only state-space reduction and suggest extending the state space to incorporate the sigma-kernel.

Impact 4.0 · Import 4.0 · Pop 3.5
#567
arXiv AIScience 2026-04-17
by Hamdy Arkoub, Jia-Hong Ke, Miaomiao Jin

Ni-based structural alloys in molten salt environments often experience simultaneous mechanical loading and corrosive attack, yet the mechanisms governing stress-corrosion interactions remain unclear. Prior studies largely emphasize tensile stress, while the role of compressive stress has received limited attention. Here, reactive molecular dynamics simulations are used to investigate the coupled effects of applied strain and corrosion in Ni0.75Cr0.25 exposed to molten FLiNaK at 800°C. A Σ5(210) grain boundary model is subjected to uniaxial strains from tensile (+4%) to compressive (-4%), and corrosion behavior is evaluated through fluorine adsorption, charge redistribution, and grain boundary evolution. Tensile strain accelerates intergranular corrosion by reducing local atomic packing through elastic dilation and increasing excess free volume at the grain boundary, which enhances atomic mobility and salt infiltration. In contrast, compressive strain suppresses corrosion by promoting the formation of a ridge-like surface layer along the grain boundary, limiting salt access to the underlying alloy. These results provide atomistic insight into how stress states influence molten salt corrosion at grain boundaries.

Impact 4.0 · Import 4.0 · Pop 3.5
#568
arXiv AIScience 2026-04-17
by Yu. D. Fomin

Molecular dynamics is a powerful tool for investigating the properties of fluid systems. However, a correct interpretation of the results of simulations is required. In particular, some simulations show the appearance of large voids in liquids, which contradicts our common-sense notion of what a liquid is. In the present paper we discuss the origin of large cavities in liquids observed in molecular dynamics simulations. We demonstrate that the cavities appear either when the temperature of the system is above the critical temperature of the liquid-gas transition or when the system is in the two-phase liquid-gas region. These conclusions are illustrated by several examples from the literature and from our own simulations.

Impact 4.0 · Import 4.0 · Pop 3.5
#569
arXiv AIScience 2026-04-17
by Weilun Wang, Zirui Wang, Wantong Li

Neuro-symbolic AI is gaining traction in domains such as large language models, scientific discovery, and autonomous systems due to its ability to combine perception with structured reasoning. However, its deployment is often constrained by high memory demands, diverse computation patterns, and complex hardware requirements. Existing hardware platforms struggle with large on-chip memory overheads, frequent pipeline stalls, limited I/O bandwidth, and inefficient handling of nonlinear operations. To address these key computational bottlenecks, we propose Overmind, a unified neuro-symbolic architecture with cross-layer optimizations. Overmind tackles these bottlenecks through Padé approximations for universal nonlinear functions, a preemptive memory bypass that eliminates costly on-chip caches, and a complete software stack that optimizes model deployment. By reconfiguring the Padé orders used to approximate nonlinear functions, we also demonstrate adaptive accuracy-performance scaling. Overmind achieves an energy efficiency of 8.1 TOPS/W and a throughput of 410 GOPS for mixed neuro-symbolic workloads with minimal model accuracy loss.
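A Padé approximant replaces a transcendental function with a ratio of two low-order polynomials, which hardware can evaluate with a handful of multiply-adds and one divide. As a concrete (textbook, not Overmind-specific) example, the [2/2] Padé approximant of exp(x):

```python
import numpy as np

def pade22_exp(x):
    """[2/2] Pade approximant of exp(x): (12 + 6x + x^2) / (12 - 6x + x^2).
    Accurate near 0; accuracy/latency is tuned by changing the Pade order."""
    x = np.asarray(x, dtype=float)
    return (12 + 6 * x + x * x) / (12 - 6 * x + x * x)
```

At x = 0.5 the approximant differs from exp(0.5) by less than 1e-4; "reconfiguring the Padé orders" in the abstract corresponds to swapping in higher- or lower-order numerator/denominator polynomials.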

Impact 4.0 · Import 4.0 · Pop 3.5
#570
arXiv Evals 2026-04-17
by Heewon Oh

We present ArtifactNet, a lightweight framework that detects AI-generated music by reframing the problem as forensic physics: extracting and analyzing the physical artifacts that neural audio codecs inevitably imprint on generated audio. A bounded-mask UNet (ArtifactUNet, 3.6M parameters) extracts codec residuals from magnitude spectrograms, which are then decomposed via HPSS into 7-channel forensic features for classification by a compact CNN (0.4M parameters; 4.0M total). We introduce ArtifactBench, a multi-generator evaluation benchmark comprising 6,183 tracks (4,383 AI tracks from 22 generators and 1,800 real tracks from 6 diverse sources). Each track is tagged with bench_origin for fair zero-shot evaluation. On the unseen test partition (n=2,263), ArtifactNet achieves F1 = 0.9829 with FPR = 1.49%, compared to CLAM (F1 = 0.7576, FPR = 69.26%) and SpecTTTra (F1 = 0.7713, FPR = 19.43%) evaluated under identical conditions with published checkpoints. Codec-aware training (4-way WAV/MP3/AAC/Opus augmentation) further reduces cross-codec probability drift by 83% (Δ = 0.95 → 0.16), resolving the primary codec-invariance failure mode.
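The HPSS step referenced here is, in its standard median-filtering form (Fitzgerald 2010), very simple: harmonic energy is smooth along time, percussive energy is smooth along frequency, so median filtering a spectrogram in each direction and comparing the two yields separation masks. A minimal NumPy sketch of that standard method (illustrative only; not ArtifactNet's exact 7-channel feature stack):

```python
import numpy as np

def hpss_masks(spec, k=9):
    """Median-filtering HPSS: filter the magnitude spectrogram along time
    (harmonic estimate) and along frequency (percussive estimate), then
    mask by whichever estimate dominates per bin."""
    f_bins, t_frames = spec.shape
    r = k // 2
    pad_t = np.pad(spec, ((0, 0), (r, r)), mode="edge")
    pad_f = np.pad(spec, ((r, r), (0, 0)), mode="edge")
    harm = np.empty_like(spec)
    perc = np.empty_like(spec)
    for t in range(t_frames):
        harm[:, t] = np.median(pad_t[:, t:t + k], axis=1)  # smooth along time
    for f in range(f_bins):
        perc[f, :] = np.median(pad_f[f:f + k, :], axis=0)  # smooth along freq
    return harm >= perc, perc > harm        # harmonic mask, percussive mask
```

On a toy spectrogram with one horizontal line (a steady tone) and one vertical line (a click), the harmonic mask picks up the tone and the percussive mask the click.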

Impact 4.0 · Import 4.0 · Pop 3.5
#571
arXiv Evals 2026-04-17
by Anzhou Wen, Praneeth Chakravarthula

Holographic displays are widely regarded as the "ultimate" display technology, promising immersive 3D visuals with natural depth cues, continuous parallax, and perceptual realism. Realizing this potential, however, has remained elusive due to persistent image quality limitations, most notably speckle noise, a byproduct of the random interference inherent to coherent light. This is typically exacerbated further by the hologram's phase randomness, which is required for maintaining uniform energy distribution across the eyebox. While speckle suppression techniques such as temporal multiplexing and smooth-phase heuristics exist, they often necessitate high-speed hardware and introduce visual artifacts, hindering their practical adoption. We introduce Ellipsography, a single-shot holography technique that achieves near-limit speckle suppression, reaching image fidelity equivalent to averaging a million conventional scalar holograms in a single frame in simulation. By jointly modulating the phase and polarization of light, we structure optical interference and suppress speckle at its source. We present a full pipeline including a vectorial wave model and an end-to-end hologram synthesis algorithm.

Impact 4.0 · Import 4.0 · Pop 3.5
#576
TechCrunch AI 2026-04-17

The gap between AI insiders and everyone else is widening, and the spending, suspicion, and even new vocabulary are starting to show it. While OpenAI is busy buying up everything from finance apps to talk shows, a certain shoe company just rebranded as an AI infrastructure play, and Anthropic unveiled a model it says is too powerful to release publicly …but apparently not too […]

Impact 4.0 · Import 4.0 · Pop 3.5
#578
Stratechery 2026-04-17

(Amazon) Welcome back to This Week in Stratechery! As a reminder, each week, every Friday, we're sending out this overview of content in the Stratechery bundle; highlighted links are free for everyone. Additionally, you have complete control over what we send to you. If you don't want to receive This Week in Stratechery emails (there is no podcast), please uncheck the box in your delivery settings. On that note, here were a few of our favorites this week. The Cost of AI. The key to understanding and analyzing tech has been appreciating the implications of zero marginal costs, which govern the economics of everything from chips to software to services. AI services generally fall under the same rubric (fixed costs in terms of data centers and chips matter more than marginal costs, mostly electricity), but the worsening shortage in compute means it is opportunity costs that matter more than ever. Companies will have to make hard choices, and the biggest loser might be the serially unfocused OpenAI. – Ben Thompson. What Is Amazon Doing with Globalstar? Earlier this week Amazon announced an $11.8 billion deal to purchase Globalstar satellites in what was billed as a move to ramp up the company's competition with Elon Musk and Starlink. There may be more going on with that deal, though, and Wednesday's Daily Update explored what Apple's role might have been. We went deeper on all this on Friday's episode of Sharp Tech, and I loved the segment as a window into Amazon's motivations for satellite investments generally, and the questions surrounding this deal specifically. – Andrew Sharp. Nico Rosberg on Racing and Investing. As a religious F1 fan I'm obligated to recommend this week's Stratechery Interview with Nico Rosberg.

Impact 4.0 · Import 4.0 · Pop 3.5
#579
GitHub Blog - AI/ML 2026-04-17

Every week, the GitHub team runs a stream called Rubber Duck Thursdays, where we build projects live, cowork with our community, and answer questions! This week, we built a very fun project together using the GitHub Copilot CLI! Let me tell you about it. 💡 New to GitHub Copilot CLI? Here's how to get started. What is it? In a lot of social media tweets and launches, you often see accounts post things like: We shipped the most amazing emoji list generator ever. It: 💻 Works in the CLI 🤖 Uses the Copilot SDK to intelligently convert your bullet points to relevant emoji 📋 Copies the result to the clipboard It's beautiful. But coming up with the perfect emoji is far too slow for me in this "move fast and break things" world. I have projects to build! Repos to vibe! Pull requests to merge! I can't be thinking about emojis! And thus, on the stream, we built an emoji list generator (very descriptively called Emoji List Generator) that: 🖥️ Runs in the terminal 📋 You paste or write a list ⌨️ You hit Ctrl + S 📎 You get the list on your clipboard (Can you tell I'm dogfooding the product here?) How we built it We used a few cool technologies for this project: 🖥️ @opentui/core for the terminal UI 🤖 @github/copilot-sdk for the AI brain 📋 clipboardy for clipboard access To start the project off, we opened up the GitHub Copilot CLI. In plan mode using Claude Sonnet 4.6, we wrote: I want to create an AI-powered markdown emoji list generator. Where, in this CLI app, if I paste in or write in some bullet points, it will replace those bullet points with relevant emojis to the given point in that list, and copies the result to the clipboard.

Impact 4.0 · Import 4.0 · Pop 3.5
#580
C4ISRNET 2026-04-17

U.S. Air Force airmen operated a semiautonomous jet-powered combat drone in a series of sorties recently, boosting the service's Collaborative Combat Aircraft program. The force's Experimental Operations Unit conducted hands-on testing with Anduril's YFQ-44A aircraft at Edwards Air Force Base, California, in an effort to utilize "principles of the new Warfighting Acquisition System," according to a Thursday Air Force release. Previously, the concept employed by the force relied on fully human-piloted drones; now, "there is no operator with a stick and throttle flying the aircraft behind the scenes," Jason Levin, Anduril's senior vice president of engineering for air dominance and strike, said in an October 2025 company release. The testing took place sometime last week, according to a Thursday Anduril social media post written by vice president of autonomous airpower Mark Shushnar. Shushnar said in the post that the EOU gained experience launching, recovering and turning the aircraft during the exercise, and that it conducted the pre- and post-flight checks and clearances, weapons loading and unloading, and direct tasking of the air vehicle during taxi and flight. The EOU operators used a ruggedized laptop to upload mission plans, initiate autonomous taxi and takeoff, task the in-flight aircraft and manage post-flight data, Shushnar said, removing the previous need for the fixed infrastructure of a large, established base. Shushnar highlighted how the YFQ-44A is designed to be easy to maintain with a small crew compared to traditional unmanned aerial vehicles, and said the exercise demonstrated that. With only a couple days of training, a handful of EOU maintainers were able to turn the aircraft between sorties. The exercise showcases a move toward "operator-driven experimentation."

Impact 4.0 · Import 4.0 · Pop 3.5
#581
C4ISRNET 2026-04-17

PARIS — Dutch regional broadcaster Omroep Gelderland was able to track the Royal Netherlands Navy air-defense frigate Evertsen in real time by sending a Bluetooth tracker to the ship by military mail. The frigate is part of the carrier strike group around France's Charles de Gaulle aircraft carrier currently deployed in the Mediterranean Sea. The tracker was discovered while sorting mail on board, though only after Omroep Gelderland had been tracking the Evertsen for 24 hours, the broadcaster wrote on its website on Thursday. The Dutch Ministry of Defence said it's taking measures in response, according to the broadcaster. The tracker incident comes after Le Monde reported in March that it was able to locate a French officer taking a 7-kilometer run around the deck of the Charles de Gaulle while the carrier was at sea, through data from the officer's connected watch via the running and cycling app Strava. "You do want to be able to intercept such a tracker," Rowin Jansen, assistant professor of national security law at Radboud University in Nijmegen, told Omroep Gelderland. "Commercial satellite images are currently released with a delay for good reason. You certainly don't want to make it easy for terrorists to send a similar package and track a ship's location in real time. You then run the risk of having missiles fired at you." The broadcaster described sending the Bluetooth tracker, a gadget used for example to find keys, to the frigate in an envelope using the military postal service, following online instructions from the MoD on how to send mail to military personnel. While the ministry X-ray scans packages to check whether prohibited or dangerous items are sent by mail, Omroep Gelderland noted that online videos showed envelopes not being scanned, so decided to

Impact 4.0 · Import 4.0 · Pop 3.5
#582
FedScoop 2026-04-17

Despite growth in traditional futures, the emergence of prediction markets and the mainstreaming of digital assets, the Commodity Futures Trading Commission is doing just fine with a slimmed-down staff thanks in large part to technology, the agency's chairman told lawmakers this week. During a House Agriculture Committee hearing Thursday, CFTC chief Michael Selig was asked by ranking member Angie Craig how the independent regulator can still be effective amid rapid market changes "with staffing levels significantly less than what the first Trump administration had requested." The Minnesota Democrat said the CFTC's staff is down 20% from the end of fiscal 2024, but Selig claimed the agency is "running more efficiently and effectively than ever before," crediting the "right-sizing of the government" under President Donald Trump. "It's absolutely vital that we continue to monitor, surveil and police our markets. And we are doing just that," Selig continued. "We are utilizing new tools, from AI to automation and other surveillance systems that we're building out, and we take this responsibility very seriously." The CFTC, which is four commissioners shy of a full body, has hired people from the private sector and continues to recruit through USAJobs.gov, Selig said, adding that "it's absolutely critical" that the regulator has the right funding and resources to monitor the markets. But despite the concerns of Craig and several other House Agriculture Democrats, Selig believes the CFTC's fiscal 2027 budget proposal, which calls for 650 full-time positions, will leave the agency "adequately staffed." He told Rep. Shontel Brown, D-Ohio, that reports of the regulator being under-resourced were "fake news." "It's really critical that we continue to bring on new

Impact 4.0 · Import 4.0 · Pop 3.5
#583

Ceasefires and Communications

Gov/Defense ★ 3.9
War on the Rocks 2026-04-17

Welcome to The Adversarial. Every other week, we'll provide you with expert analysis on America's greatest challengers: China, Russia, Iran, North Korea, and jihadists. Read more below. *** Iran: In the space of less than 11 hours on April 7, President Donald Trump went from warning that "a whole civilization will die tonight" to announcing a two-week ceasefire with Iran. That whiplash-inducing turn was just a taste of the twists that would follow over the coming week. First came a resumption of negotiations mediated by Pakistan. With the presence of Vice President J.D. Vance and Iran's parliamentary speaker, it marked the highest in-person engagement between the two sides. The post Ceasefires and Communications appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#584
War on the Rocks 2026-04-17

If the United States is drawn into another round of military action in Yemen, it ought to avoid the mistakes of the last decade. From about 2015 to last year, successive administrations backed Saudi- and Emirati-led military campaigns, arms sales, and naval blockades that devastated civilians, deepened Yemen's fragmentation, and perversely strengthened Houthi power and legitimacy instead of containing it. Instead, Washington should treat force as one tool within a broader political and economic strategy. Officially known as Ansar Allah ("Partisans of God") and referred to here as the Houthis for ease of reference, the group is a Zaydi revivalist political-military The post How to Counter the Houthis Without Strengthening Them appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#585
War on the Rocks 2026-04-17

It is clear that the U.S.-Israeli decapitation, airpower-centric, and precision strike campaigns are again not enough to bring about the two countries' strategic goals. The United States still finds itself potentially having to commit land forces into its war with Iran, yet many unknowns lurk, including a fragile ceasefire and a U.S. naval blockade of the Strait of Hormuz. Setting aside whether U.S. land forces will be committed to the war at all, what remains unknown is how, and to what degree, they would be employed. The continued relevance of land forces in the 21st century is a The post Winning in the Donbas: What Russia's 2014–2015 Campaign Reveals About Modern War appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#586
TechCrunch AI 2026-04-18

The company now offers robotaxi service in three cities, all of them in Texas, after launching in Austin last year and starting to offer rides without safety drivers in January 2026.

Impact 4.0 · Import 4.0 · Pop 3.5
#587
TechCrunch AI 2026-04-18

In recent months, the company announced an agreement with Amazon Web Services to use Cerebras chips in Amazon data centers, as well as a deal with OpenAI reportedly worth more than $10 billion.

Impact 4.0 · Import 4.0 · Pop 3.5
#589
TechCrunch AI 2026-04-18

New data from Appfigures shows a swell of new app launches in 2026, suggesting AI tools could be fueling a mobile software boom.

Impact 4.0 · Import 4.0 · Pop 3.5
#590
TechCrunch AI 2026-04-19

On the latest episode of Equity, we discuss OpenAI's latest acquisitions and whether they address "two big existential problems" for the company.

Impact 4.0 · Import 4.0 · Pop 3.5
#591

The 12-month window

Industry ★ 3.9
TechCrunch AI 2026-04-19

A lot of AI startups exist partly because the foundation models haven't expanded into their category yet. As many jokingly acknowledge, that won't last forever.

Impact 4.0 · Import 4.0 · Pop 3.5
#594
Stratechery 2026-04-20

TSMC's earnings suggest that the company's leadership is not truly bought into the AI growth story.

Impact 4.0 · Import 4.0 · Pop 3.5
#595
War on the Rocks 2026-04-20

Think of a violin made by a master craftsman: beautiful, precise, capable of extraordinary performance, but impossible to produce quickly or cheaply. It takes time, rare expertise, and materials that cannot be sourced at scale. You would not equip an entire orchestra with instruments like that. Yet that is essentially what the United States has attempted with its tactical air fleet. The F-35 program's total lifetime cost is projected to exceed two trillion dollars, the most expensive Major Defense Acquisition Program in history. The United States plans to purchase thousands of them. Meanwhile, modern conflict, from Ukraine's drone war to naval The post The F-35 Is a Masterpiece Built for the Wrong War appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#596
War on the Rocks 2026-04-20

In Sept. 2025, Pakistan and Saudi Arabia signed a mutual defense pact, formalizing what decades of quiet cooperation had already made real. The defense pact signed in Riyadh was presented in official communiqués as a natural deepening of bilateral ties. It was that, but it was also something larger: the latest installment in a pattern that has persisted for half a century and that continues to confound the logic of power politics. Pakistan, a state dependent on International Monetary Fund bailouts and outmatched conventionally by its larger neighbor, has once again positioned itself at the center of a consequential security The post Iran and the Indispensable Broker: How Pakistan Outmaneuvers India on the World Stage appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5