
Wolf Digest — 2026-04-23

Coverage window: 2026-04-22 03:39 ET to 2026-04-23 03:02 ET
Thursday, April 23, 2026
14m 38s · top-4 narrated briefing
Must-read · top 3
#1 · Agents & Tool Use
Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model
Big claims from Qwen about their latest open-weight model: Qwen3.6-27B delivers flagship-level agentic coding performance, surpassing the previous-generation open-s…
Score 8.3
#2 · Interpretability
Emotion Concepts and their Function in a Large Language Model
Research identifying representations of emotion concepts in Claude Sonnet 4.5 and demonstrating that they causally influence its outputs. Sofroniew et al. show via targeted interventions that emotional concepts are embed…
Score 8.1
#3 · Generative Media
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic d…
Score 7.7
#1

Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model

Agents & Tool Use 2026-04-22 Simon Willison's Weblog
8.3
I 9.0 Im 7.8 P 8.2

Alibaba released Qwen3.6-27B today, and the interesting part is not that it is another checkpoint in the Qwen line but that the 27 billion parameter dense model outperforms the previous Qwen3.5 flagship, which was a 397 billion parameter mixture-of-experts checkpoint, across the reference coding benchmarks that shipped with the release. That is a roughly fifteen-fold compression of total parameters with a net improvement in HumanEval-Pro, LiveCodeBench, and the internal code-editing harnesses that Alibaba publishes. The model weights come in around 55.6 gigabytes at full precision, and the four-bit quantized build Simon Willison tested lands at 16.8 gigabytes, which means it runs on a single consumer-grade twenty-four gigabyte GPU with headroom for context, or comfortably on an M-series Mac Studio. That shift is what matters practically, because the Qwen3.5 flagship was out of reach for anyone not renting accelerators, and this version is not.

On the qualitative side, Willison's reproducibility test, in which he asks every new open-weight model to draw an SVG of a pelican riding a bicycle, came out with visibly correct geometry on the first attempt, suggesting the vision-language grounding transferred well through whatever distillation or synthetic-data pipeline Alibaba used to train the dense student. The post-training pipeline for Qwen3.6 appears to lean heavily on reinforcement learning with verifiable rewards on coding-specific traces, consistent with the trend across the field away from preference-only methods and toward execution-grounded RL for code.

Caveats are worth naming. Alibaba's benchmark reporting has historically been optimistic, and the comparison to Claude Opus and GPT-5 class models is not yet reflected on independent leaderboards like Artificial Analysis or LiveCodeBench at the time of release.
The community will want to see whether the agentic-coding performance, which depends on tool-use fidelity and long-context recall more than single-turn completion, holds up under SWE-bench Verified and the newer in-the-wild harnesses. If it does, this is one of the most interesting open-weight coding releases since DeepSeek Coder V3, because it flips the assumption that you need MoE scale to compete at the frontier for code. If it does not, then it is a reminder that benchmark curation is doing a lot of work in how these models appear to compare.
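The memory arithmetic behind the consumer-GPU claim is easy to sanity-check. A minimal sketch, assuming 16-bit full-precision weights and a uniform 4-bit quantization, and ignoring the overhead (embedding tables, quantization scales, KV cache) that accounts for the gap to the reported 55.6 GB and 16.8 GB figures:

```python
def model_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Naive weight-memory estimate: parameters * bits, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9

n = 27e9  # Qwen3.6-27B parameter count

full = model_memory_gb(n, 16)  # bf16 full precision
q4 = model_memory_gb(n, 4)     # uniform 4-bit quantization

print(f"bf16 weights:  {full:.1f} GB (reported 55.6 GB incl. overhead)")
print(f"4-bit weights: {q4:.1f} GB (reported 16.8 GB incl. scales/zero-points)")
print(f"headroom on a 24 GB GPU after 4-bit weights: {24 - q4:.1f} GB")
```

The naive estimates (54.0 GB and 13.5 GB) undershoot the reported sizes, which is the expected direction: real checkpoints carry non-weight tensors and quantization metadata on top of the raw parameter count.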

agents
#2

Emotion Concepts and their Function in a Large Language Model

Interpretability 2026-04-22 Transformer Circuits Thread (Anthropic)
8.1
I 8.2 Im 9.5 P 6.5

Anthropic's Transformer Circuits team, led by Sofroniew and collaborators, published a paper today identifying internal representations of emotion concepts inside Claude Sonnet 4.5 and demonstrating, through causal interventions, that those representations do not merely correlate with emotionally loaded outputs but actively shape the model's generation. This is methodologically in the same line as the group's earlier work on refusal features and sycophancy features: they find a low-rank subspace that a linear probe can read off, verify that the subspace carries specific emotional semantics like frustration, reassurance, or contempt, and then steer the model by adding or subtracting along those directions at inference time to observe behaviorally different outputs. The new contribution is that emotions appear to be organized not as isolated features but as a coherent, structured concept space, with separable axes for valence, arousal, and target-of-emotion, and that activations along these axes track what the model will say in a way you can predict before the token is sampled. This has several immediate implications for alignment. First, it gives evaluators a mechanistic handle on what has until now been a qualitative concern about models that feign empathy or simulate distress during conversations with vulnerable users; you can now measure the intervention. Second, it bears on the long-running question of whether chain-of-thought reports are faithful to the model's internal state: the paper shows cases where the emotional feature is active even when the surface text does not express the corresponding emotion, which is evidence for a genuine hidden state. Third, it raises the stakes for moral status discussions, because the presence of a functionally integrated emotion space is precisely what proponents of the welfare-matters view have pointed to as a necessary condition. 
The authors are careful about this last point, framing their contribution as characterization rather than metaphysical claim. For the interpretability community, the methodological news is that the sparse-autoencoder plus causal-patching pipeline now scales smoothly to the question of affect, which had been resisted because emotion concepts are distributed in a way that simpler feature hunts missed. Expect follow-ups on whether the same structure appears in open-weight models and whether emotion-feature steering survives fine-tuning, both of which determine how practically useful this work becomes for downstream safety evaluations.
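The probe-and-steer methodology the paper builds on can be sketched in a few lines. This is a toy illustration of linear probing and activation steering in general, not Anthropic's pipeline: the hidden size, the "valence" direction, and the steering coefficient are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden dimension

# Hypothetical "valence" concept direction, as a linear probe might recover it.
valence_dir = rng.normal(size=d)
valence_dir /= np.linalg.norm(valence_dir)

def read_valence(h: np.ndarray) -> float:
    """Linear probe: project the hidden state onto the concept direction."""
    return float(h @ valence_dir)

def steer(h: np.ndarray, alpha: float) -> np.ndarray:
    """Causal intervention: shift the activation along the concept axis."""
    return h + alpha * valence_dir

h = rng.normal(size=d)       # activation at some layer/position
h_pos = steer(h, alpha=3.0)  # push toward positive valence

# For a unit direction, steering moves the probe reading by exactly alpha.
print(read_valence(h), read_valence(h_pos))
```

The real work in the paper is finding directions that are causally meaningful, not just linearly decodable; the intervention step itself is this simple vector addition at inference time.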

interpretability
#3

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

Generative Media 2026-04-22 arXiv cs.CV (Computer Vision) · arXiv — Efficiency (Quantization, MoE, Inference) · arXiv — Generative Media / Diffusion · Hugging Face Daily Papers
7.7
I 8.5 Im 7.0 P 7.6

LLaDA 2.0 Uni, from inclusionAI, landed at the top of Hugging Face Daily Papers today with 86 likes, and it is the most ambitious entry so far in the growing effort to build unified multimodal models on top of discrete diffusion rather than autoregressive transformers. The architecture combines three components: a fully semantic discrete tokenizer that maps image, video, and audio modalities into a shared vocabulary; a mixture-of-experts backbone where experts are routed by both token identity and modality; and a diffusion-based decoder that generates across modalities in a single unified objective. The paper's headline claim is that one model handles understanding tasks, like visual question answering and video captioning, and generation tasks, like text-to-image and controllable image editing, within the same forward pass rather than switching between heads. Benchmark numbers land in the competitive range with Qwen-VL and InternVL on understanding tasks, and with SD3 and FLUX on generation tasks, which is the first time a unified diffusion LLM has been reported at parity across both sides of the understanding-generation split. The discrete-diffusion design choice is worth flagging. Autoregressive VLMs have dominated the field since CLIP and BLIP, and the unified-model direction led by GPT-4o and Gemini has mostly stayed autoregressive with continuous latents. LLaDA 2.0 takes the opposite bet: discrete tokens throughout, diffusion for decoding. The advantage the authors emphasize is parallel generation, which is genuinely faster at long sequences because diffusion steps denoise all positions simultaneously, unlike autoregressive decoding, which is inherently sequential. The disadvantage, which the paper acknowledges less prominently, is that training a discrete diffusion model at this scale is more compute-hungry than the equivalent autoregressive model, and the authors do not report total compute beyond a rough parameter count. 
The mixture-of-experts routing also appears to carry a lot of capacity: the paper reports active parameter counts well below the total, which mitigates inference cost but complicates reproducibility for smaller labs. The broader reason this matters is that it adds a second viable non-autoregressive path for frontier multimodal models. If discrete diffusion continues to scale competitively, the next generation of unified models may split between continuous-latent autoregressive stacks, which GPT-5 and Gemini 3 appear to follow, and discrete-token diffusion stacks, which LLaDA is pioneering, with real practical consequences for training cost, inference latency, and how editing and controllability are exposed to users.
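The parallel-generation argument reduces to a simple accounting of forward passes. A toy comparison, assuming a fixed denoising schedule, and ignoring that diffusion and autoregressive passes differ in per-step cost and caching behavior:

```python
seq_len = 1024
diffusion_steps = 32  # denoising iterations, each touching every position

# Autoregressive decoding: one forward pass per generated token,
# inherently sequential.
ar_passes = seq_len

# Discrete diffusion: a fixed number of denoising passes, each predicting
# all positions in parallel.
diff_passes = diffusion_steps

print(f"AR forward passes:        {ar_passes}")
print(f"Diffusion forward passes: {diff_passes}")
print(f"pass-count ratio:         {ar_passes / diff_passes:.0f}x")
```

The ratio grows with sequence length at fixed step count, which is why the advantage the authors emphasize shows up specifically at long sequences; each diffusion pass is more expensive than a cached AR step, so the wall-clock win is smaller than the pass count suggests.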

How it was discussed across sources
  • Hugging Face Daily Papers: 86 likes, top of the page today
generative_media
#4

Multi-Agents: What's Actually Working

Agents & Tool Use 2026-04-22 Cognition AI (Devin)
7.5
I 7.5 Im 7.8 P 7.3

Walden Yan published a piece on the Cognition blog today revisiting a position he took a year ago, which was that multi-agent systems were mostly a source of churn rather than capability, and updating it with what the team has learned from shipping Devin at scale. The new claim is more nuanced and, Yan argues, empirically grounded: a narrow class of multi-agent configurations works, specifically those where multiple agents contribute intelligence in parallel but all writes to shared state stay single-threaded through a single orchestrator. The argument is that the failure modes of earlier multi-agent systems, most notably the ones where two or more agents edit the same codebase concurrently, almost always reduce to a merge-conflict problem that the agents cannot resolve robustly because they lack a shared world model of what the other agent did. The solution that has worked at Cognition is to treat the agents as research or planning workers whose outputs are proposals, and to funnel those proposals through a single write-authority agent that holds the canonical state and commits changes deterministically. This is not a new software engineering pattern; it is essentially a coordinator with workers. Yan's contribution is showing that when you ignore it the multi-agent setup collapses within hours of running on real tasks, and when you adopt it the system composes cleanly. The post gets specific about what parallel intelligence contributes, including simultaneous exploration of alternative fixes, parallel code review from multiple perspectives, and speculative execution of likely next steps, all of which serialize their writes through the primary agent. Yan notes that Cognition did not find empirical lift from the multi-agent debate patterns that featured heavily in earlier research, nor from fully autonomous swarms where any agent can commit.
The practical takeaway for anyone building production coding agents is to architect around single-writer invariants from the start and to use multi-agent parallelism for intelligence generation, not for state mutation. The strategic takeaway for the field is that the past year of hype about multi-agent systems produced one durable design pattern and a lot of dead ends, and that this pattern is now visible in the products that actually ship, including Cursor's agent mode, Devin, and the Claude Code harness. It is a useful grounding post in a week that otherwise contained a lot of multi-agent marketing.
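The single-writer pattern described above can be sketched directly: workers explore in parallel against read-only snapshots, and only one orchestrator commits. A hypothetical minimal shape, not Cognition's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Canonical shared state: only the orchestrator ever mutates it.
state = {"files": {"app.py": "v0"}}

def worker(name: str, snapshot: dict) -> dict:
    """Read-only worker: explores a fix against a snapshot, returns a proposal."""
    return {"author": name, "file": "app.py", "content": f"fix-by-{name}"}

def orchestrator(proposals: list[dict]) -> None:
    """Single write authority: commits proposals deterministically, in order."""
    for p in sorted(proposals, key=lambda p: p["author"]):
        state["files"][p["file"]] = p["content"]

# Intelligence is generated in parallel...
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(worker, n, dict(state)) for n in ("a", "b", "c")]
    proposals = [f.result() for f in futures]

# ...but all writes serialize through one agent, so there is no merge conflict.
orchestrator(proposals)
print(state["files"]["app.py"])  # deterministic: last sorted proposal wins
```

The point of the sketch is the invariant, not the threading: no worker holds write access, so concurrent exploration can never produce the conflicting-edit failure mode the post describes.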

agents
#5

Introducing workspace agents in ChatGPT

Agents & Tool Use 2026-04-22 OpenAI Research
7.4
I 8.0 Im 7.0 P 7.3

Workspace agents in ChatGPT are Codex-powered agents that automate complex workflows, run in the cloud, and help teams scale work across tools securely.

agents
#6

Qwen3.5-Omni Technical Report

Generative Media 2026-04-22 Hugging Face Daily Papers
7.3
I 8.0 Im 7.0 P 6.9

Omnimodal language model achieving state-of-the-art performance across 215 audio and audio-visual benchmarks. Spans text, image, audio, and video generation with low-latency real-time interaction.

generative_media
#7

Southcom creates new Autonomous Warfare Command to build up its drone prowess

Gov & Defense — DefenseScoop
7.2
I 7.0 Im 8.0 P 6.5

A spokesperson for Southcom said the new SAWC will “employ autonomous, semi-autonomous, and unmanned platforms.”

gov_defense
#11

SWE-chat: Coding Agent Interactions From Real Users in the Wild

Agents & Tool Use 2026-04-22 arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use · arXiv — Evals & Benchmarks
6.9
I 7.0 Im 7.3 P 6.5

AI coding agents are being adopted at scale, yet we lack empirical evidence on how people actually use them and how much of their output is useful in practice. We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild. The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls. SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories. Leveraging SWE-chat, we provide an initial empirical characterization of real-world coding agent usage and failure modes. We find that coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves. Despite rapidly improving capabilities, coding agents remain inefficient in natural settings. Just 44% of all agent-produced code survives into user commits, and agent-written code introduces more security vulnerabilities than code authored by humans. Furthermore, users push back against agent outputs (through corrections, failure reports, and interruptions) in 44% of all turns. By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.
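The bimodal authorship split can be illustrated with a toy classifier over authorship-attributed sessions. The session schema and the 95% cutoff here are illustrative, not SWE-chat's actual definitions:

```python
# Each session records how many committed lines each party authored.
# These records are made-up examples, not SWE-chat data.
sessions = [
    {"agent_lines": 120, "human_lines": 2},   # agent writes virtually everything
    {"agent_lines": 0,   "human_lines": 80},  # human writes all code
    {"agent_lines": 40,  "human_lines": 60},  # mixed authorship
]

def classify(s: dict) -> str:
    """Bucket a session by the fraction of committed lines the agent wrote."""
    total = s["agent_lines"] + s["human_lines"]
    frac = s["agent_lines"] / total if total else 0.0
    if frac >= 0.95:
        return "vibe_coding"
    if frac == 0.0:
        return "human_only"
    return "mixed"

print([classify(s) for s in sessions])
```

Run over the full corpus, a classifier of this shape is what yields the 41% / 23% split the abstract reports, with everything else falling into the mixed middle.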

agents
#12

Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

Frontier LLMs 2026-04-22 Hugging Face Daily Papers
6.9
I 7.2 Im 6.8 P 6.6

Technical report from a 28-author team describing grounding of world models for real-world industrial deployment. Demonstrates transferability across manufacturing sites and robustness to novel scenarios.

frontier_llm
#13

Near-Future Policy Optimization

Post-Training 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv — Reinforcement Learning
6.9
I 7.5 Im 6.7 P 6.5

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the strong enough (higher $Q$, more new knowledge to learn) and close enough (lower $V$, more readily absorbed) conditions required to maximize the effective learning signal $\mathcal{S} = Q/V$. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes $\mathcal{S}$. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
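The checkpoint-selection rule reduces to maximizing the effective learning signal $\mathcal{S} = Q/V$ over candidate trajectory sources. A toy rendering of the trade-off; the quality/variance numbers are made up for illustration:

```python
# Candidate sources of auxiliary trajectories, scored by quality Q
# (strong enough) and variance cost V (close enough). Values are illustrative.
candidates = {
    "external_teacher": {"Q": 0.9, "V": 3.0},  # strong but distributionally far
    "replay_buffer":    {"Q": 0.4, "V": 0.8},  # close but capped in quality
    "near_future_ckpt": {"Q": 0.7, "V": 1.0},  # later checkpoint, same run
}

def signal(c: dict) -> float:
    """Effective learning signal S = Q / V."""
    return c["Q"] / c["V"]

best = max(candidates, key=lambda k: signal(candidates[k]))
print(best, round(signal(candidates[best]), 2))
```

The point of the toy numbers is the shape of the argument: the teacher loses on $V$, the replay buffer loses on $Q$, and the near-future checkpoint wins the ratio, which is what AutoNPO automates by scoring candidate guide checkpoints online.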

post_training
#15

Convergent Evolution: How Different Language Models Learn Similar Number Representations

Frontier LLMs 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence)
6.8
I 7.0 Im 7.3 P 6.0

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.
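The period-$T$ feature and the separability condition are easy to visualize with an idealized embedding. A sketch using synthetic features, not trained-model activations, showing both Fourier-domain sparsity and the collapse of residue classes onto single points:

```python
import numpy as np

T = 10
n = np.arange(200)

# Idealized period-10 feature pair: a coordinate that rotates once per 10
# integers makes "n mod 10" linearly readable.
feat = np.stack([np.cos(2 * np.pi * n / T), np.sin(2 * np.pi * n / T)], axis=1)

# Fourier-domain sparsity: the cosine coordinate has one dominant frequency.
spectrum = np.abs(np.fft.rfft(feat[:, 0]))
peak_freq = np.argmax(spectrum[1:]) + 1       # skip the DC bin
print("dominant period:", len(n) / peak_freq)  # -> 10.0

# Geometric separability: every number with the same residue mod 10 maps to
# the same point on the circle, so residue classes are linearly separable.
same = np.allclose(feat[n % T == 3], feat[3])
print("residue class collapses to one point:", same)
```

The paper's point is that real models sometimes learn the spectral spike without this clean geometry, which is why Fourier sparsity is necessary but not sufficient for mod-$T$ separability.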

frontier_llm
#16
6.8
I 7.0 Im 6.8 P 6.7

NVIDIA and Google Cloud have collaborated for more than a decade, co‑engineering a full‑stack AI platform that spans every technology layer — from performance‑optimized libraries and frameworks to enterprise‑grade cloud services. This foundation enables developers, startups and enterprises to push agentic and physical AI out of the lab and into production — from agents that […]

agents
#17

Exploring Spatial Intelligence from a Generative Perspective

State Space Models 2026-04-22 arXiv cs.CV (Computer Vision) · arXiv — Generative Media / Diffusion
6.8
I 6.8 Im 6.8 P 6.8

Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.

ssm
#18

Scaling Test-Time Compute for Agentic Coding

Agents & Tool Use 2026-04-22 Hugging Face Daily Papers
6.6
I 6.8 Im 6.8 P 6.2

Meta AI paper on scaling laws for test-time compute applied to agentic coding workloads. Shows monotonic gains with compute up to a clear inflection point.

agents
#19

Bobby Holley on Firefox and Claude Mythos Preview

Frontier LLMs 2026-04-22 Simon Willison's Weblog
6.6
I 7.5 Im 7.0 P 5.2

Mozilla CTO Bobby Holley reports that Firefox 150 includes fixes for 271 vulnerabilities identified through collaboration with Anthropic using an early version of Claude Mythos Preview, arguing defenders finally have a chance to win decisively.

frontier_llm
#21

Google turns Chrome into an AI co-worker for the workplace

Industry 2026-04-22 TechCrunch — AI
6.5
I 6.8 Im 6.3 P 6.5

Google brings Gemini-powered "auto browse" capabilities to Chrome for enterprise users, letting workers automate tasks like research, data entry, and more.

industry
#23

Changes to GitHub Copilot Individual Plans

Agents & Tool Use 2026-04-22 Simon Willison's Weblog
6.5
I 6.5 Im 6.8 P 6.3

GitHub announced pricing changes and usage-limit tightening for Copilot Individual plans, citing increased compute demands from agentic workflows.

agents
#25

Making ChatGPT better for clinicians

Frontier LLMs 2026-04-22 OpenAI Research
6.4
I 6.5 Im 6.5 P 6.3

OpenAI makes ChatGPT for Clinicians free for verified U.S. physicians, nurse practitioners, and pharmacists, supporting clinical care, documentation, and research.

frontier_llm
#26

TEMPO: Scaling Test-time Training for Large Reasoning Models

Evaluations & Benchmarks 2026-04-22 Hugging Face Daily Papers
6.4
I 6.5 Im 6.5 P 6.2

Inference-time framework for continuous self-improvement on unlabeled data, achieving sustained gains of up to 23.5 percentage points on mathematical reasoning benchmarks.

evals
#27

Beyond ZOH: Advanced Discretization Strategies for Vision Mamba

State Space Models 2026-04-22 arXiv — State Space Models · arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision)
6.3
I 6.8 Im 6.0 P 6.2

Vision Mamba, as a state space model (SSM), employs a zero-order hold (ZOH) discretization, which assumes that input signals remain constant between sampling instants. This assumption degrades temporal fidelity in dynamic visual environments and constrains the attainable accuracy of modern SSM-based vision models. In this paper, we present a systematic and controlled comparison of six discretization schemes instantiated within the Vision Mamba framework: ZOH, first-order hold (FOH), bilinear/Tustin transform (BIL), polynomial interpolation (POL), higher-order hold (HOH), and the fourth-order Runge-Kutta method (RK4). We evaluate each method on standard visual benchmarks to quantify its influence in image classification, semantic segmentation, and object detection. Our results demonstrate that POL and HOH yield the largest gains in accuracy at the cost of higher training-time computation. In contrast, the BIL provides consistent improvements over ZOH with modest additional overhead, offering the most favorable trade-off between precision and efficiency. These findings elucidate the pivotal role of discretization in SSM-based vision architectures and furnish empirically grounded justification for adopting BIL as the default discretization baseline for state-of-the-art SSM models.
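The schemes being compared differ only in how they map the continuous system to a discrete recurrence. For a scalar SSM x' = ax + bu the two simplest variants can be written out directly; the step size and coefficients below are arbitrary:

```python
import numpy as np

# Scalar SSM x' = a*x + b*u, discretized with step dt.
a, b, dt = -1.0, 1.0, 0.1

# Zero-order hold (ZOH): exact if u is constant over each step.
#   A = exp(a*dt),  B = (A - 1)/a * b
A_zoh = np.exp(a * dt)
B_zoh = (A_zoh - 1.0) / a * b

# Bilinear / Tustin transform:
#   A = (1 + a*dt/2) / (1 - a*dt/2),  B = dt*b / (1 - a*dt/2)
A_bil = (1 + a * dt / 2) / (1 - a * dt / 2)
B_bil = dt * b / (1 - a * dt / 2)

print(f"ZOH:      A={A_zoh:.6f}  B={B_zoh:.6f}")
print(f"Bilinear: A={A_bil:.6f}  B={B_bil:.6f}")
# Both approximate exp(a*dt) ≈ 0.904837; the paper's higher-order schemes
# (FOH, POL, HOH, RK4) refine how u is interpolated between samples.
```

The paper's comparison is exactly this choice made at every SSM layer of Vision Mamba, where the input is far from piecewise-constant and the interpolation assumption starts to matter.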

ssm
#28

Is Claude Code going to cost $100/month?

Industry 2026-04-22 Simon Willison's Weblog
6.3
I 6.0 Im 6.0 P 7.0

Anthropic briefly updated pricing pages to suggest Claude Code might cost $100/month before reverting, creating confusion about the actual pricing structure for their coding features.

industry
#30

LlamaIndex and Kaggle launch ParseBench OCR leaderboard for AI agents

Agents & Tool Use 2026-04-22 LlamaIndex Blog
6.3
I 6.5 Im 6.0 P 6.5

LlamaIndex partners with Kaggle to launch ParseBench — a document-parsing benchmark with ~2000 human-verified pages and 167K+ test rules, scored across tables, charts, content accuracy, semantic formatting, and visual grounding. In initial testing of 14 methods, LlamaParse Agentic was the only one competitive across all five dimensions at 84.9% overall.

agents
#33

Image Generators are Generalist Vision Learners

Generative Media 2026-04-22 arXiv — Generative Media / Diffusion
6.2
I 6.5 Im 6.0 P 6.0

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.

generative_media
#36

New report highlights agency advantages of using smaller, open-source AI models

Gov & Defense — FedScoop
6.2
I 6.0 Im 6.8 P 5.7

Why transparent, agency-trained AI models can deliver greater reliability and control of sensitive federal data, compared to ‘bigger is better’ large language models.

gov_defense
#37

Agriculture Department kicks off $300M Palantir deal on IT, national security work

Gov & Defense — FedScoop
6.2
I 6.0 Im 6.8 P 5.7

The blanket purchase agreement is a “continuation” of work that Palantir has done with the agency, company execs told FedScoop, including on USDA’s “One Farmer, One File” initiative.

gov_defense
#38

Cursor partners with SpaceX on model training

AI Coding 2026-04-21 Cursor Blog (Anysphere)
6.2
I 6.0 Im 6.0 P 6.5

Cursor (Anysphere) announces a partnership with SpaceX focused on advancing model training capabilities for their coding models.

ai_coding
#39

WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

Agents & Tool Use 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv — Reinforcement Learning · arXiv — Agents / Tool Use
6.0
I 6.5 Im 5.8 P 5.8

While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic multi-page websites remain highly challenging. Existing works are often limited to single-page static websites, while agentic frameworks typically rely on multi-turn execution with proprietary models, leading to substantial token costs, high latency, and brittle integration. Training a small LLM end-to-end with reinforcement learning (RL) is a promising alternative, yet it faces a critical bottleneck in designing reliable and computationally feasible rewards for website generation. Unlike single-file coding tasks that can be verified by unit tests, website generation requires evaluating inherently subjective aesthetics, cross-page interactions, and functional correctness. To this end, we propose WebGen-R1, an end-to-end RL framework tailored for project-level website generation. We first introduce a scaffold-driven structured generation paradigm that constrains the large open-ended action space and preserves architectural integrity. We then design a novel cascaded multimodal reward that seamlessly couples structural guarantees with execution-grounded functional feedback and vision-based aesthetic supervision. Extensive experiments demonstrate that our WebGen-R1 substantially transforms a 7B base model from generating nearly nonfunctional websites into producing deployable, aesthetically aligned multi-page websites. Remarkably, our WebGen-R1 not only consistently outperforms heavily scaled open-source models (up to 72B), but also rivals the state-of-the-art DeepSeek-R1 (671B) in functional success, while substantially exceeding it in valid rendering and aesthetic alignment. These results position WebGen-R1 as a viable path for scaling small open models from function-level code generation to project-level web application generation.
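The cascaded reward idea, expensive subjective signals gated behind cheap verifiable ones, can be sketched generically. The check functions, weights, and gating threshold below are illustrative, not the paper's actual reward:

```python
# Hypothetical cascaded reward for a generated multi-page site: structural
# checks gate functional checks, which gate the (expensive) aesthetic score.

def structural_ok(site: dict) -> bool:
    """Cheap hard gate: every page must at least produce HTML."""
    return all(p.get("html") for p in site["pages"])

def functional_score(site: dict) -> float:
    """Execution-grounded signal: fraction of pages that render."""
    passed = sum(p.get("renders", False) for p in site["pages"])
    return passed / len(site["pages"])

def aesthetic_score(site: dict) -> float:
    """Vision-based signal, e.g. from a VLM judge (stubbed here)."""
    return site.get("vlm_aesthetic", 0.0)

def cascaded_reward(site: dict) -> float:
    if not structural_ok(site):
        return 0.0                 # malformed project: nothing else matters
    f = functional_score(site)
    if f < 0.5:
        return 0.3 * f             # partial credit, aesthetics not yet scored
    return 0.3 * f + 0.7 * aesthetic_score(site)

site = {"pages": [{"html": "<html>…</html>", "renders": True}] * 3,
        "vlm_aesthetic": 0.8}
print(round(cascaded_reward(site), 2))  # -> 0.86
```

The gating is what makes a reward like this computationally feasible for RL: the subjective vision-based judgment only runs on candidates that already pass the verifiable checks.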

agents
#40

PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance

Robotics 2026-04-22 arXiv cs.RO (Robotics)arXiv — Evals & BenchmarksarXiv — Robotic Autonomy / Embodied AI
6.0
I 6.5 Im 6.0 P 5.5

Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA

robotics
#44
6.0
I 5.5 Im 6.5 P 6.0

“The Army is in the midst of its most significant modernization in over 40 years,” Maj. Gen. Rebecca McElwain, director of the Army budget, told reporters during the Pentagon’s budget rollout Tuesday. “This involves developing and fielding new capabilities while adapting formations, training and concepts to the character of modern warfare.”

ai_science
#47

How SpaceX preempted a $2B fundraise with a $60B buyout offer

AI Coding 2026-04-22 TechCrunch — AI
6.0
I 5.5 Im 5.5 P 7.0

Cursor was on track to close a $2 billion funding round this week but chose to halt discussions after SpaceX offered a $10 billion "collaboration fee" and a path to a $60 billion acquisition.

ai_coding
#48

LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel

Recurrent & Linear Attention 2026-04-22 arXiv — Recurrent / Linear Attention
6.0
I 6.0 Im 6.0 P 6.0

The quadratic complexity of softmax attention presents a major obstacle for scaling Transformers to high-resolution vision tasks. Existing linear attention variants often replace the softmax with Gaussian kernels to reduce complexity, but such approximations lack theoretical grounding and tend to oversuppress mid-range token interactions. We propose LaplacianFormer, a Transformer variant that employs a Laplacian kernel as a principled alternative to softmax, motivated by empirical observations and theoretical analysis. To address expressiveness degradation under low-rank approximations, we introduce a provably injective feature map that retains fine-grained token information. For efficient computation, we adopt a Nyström approximation of the kernel matrix and solve the resulting system using Newton–Schulz iteration, avoiding costly matrix inversion and SVD. We further develop custom CUDA implementations for both the kernel and solver, enabling high-throughput forward and backward passes suitable for edge deployment. Experiments on ImageNet show that LaplacianFormer achieves strong performance-efficiency trade-offs while improving attention expressiveness.
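For readers who want the mechanics, here is a minimal NumPy sketch of the two ingredients the abstract names: attention weights from a Laplacian (L1-distance) kernel in place of softmax, and the Newton–Schulz iteration used to avoid explicit matrix inversion. Function names and the bandwidth parameter are illustrative, not from the paper, and the Nyström/CUDA machinery that makes this linear-time is omitted.

```python
import numpy as np

def laplacian_attention(Q, K, V, sigma=1.0):
    """Attention with a Laplacian kernel k(q, k) = exp(-||q - k||_1 / sigma)
    in place of softmax over dot products (quadratic reference version)."""
    D = np.abs(Q[:, None, :] - K[None, :, :]).sum(-1)  # pairwise L1 distances
    W = np.exp(-D / sigma)
    W = W / W.sum(axis=1, keepdims=True)               # row-normalize like softmax
    return W @ V

def newton_schulz_inverse(A, iters=30):
    """Newton-Schulz iteration X <- X(2I - AX) for A^{-1}; the standard
    X0 = A.T / (||A||_1 * ||A||_inf) initialization guarantees convergence."""
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(A.shape[0])
    for _ in range(iters):
        X = X @ (2 * I - A @ X)
    return X
```

Newton–Schulz is attractive here because it uses only matrix multiplications, which map directly onto the high-throughput GPU kernels the paper describes.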

recurrent
#49

AutoAdapt: Automated domain adaptation for large language models

Industry 2026-04-22 Microsoft Research Blog
6.0
I 6.0 Im 6.0 P 6.0

Deploying large language models (LLMs) in real-world, high-stakes settings is harder than it should be. In high-stakes settings like law, medicine, and cloud incident response, performance and reliability can quickly break down because adapting models to domain-specific requirements is a slow and manual process that is difficult to reproduce. The core challenge is domain adaptation, […]

industry
#52

The Missing Layer: Why Your AI Agent Fails — and What Actually Fixes It

Agents & Tool Use 2026-04-22 Gradient Flow (Ben Lorica)
5.9
I 5.5 Im 5.8 P 6.3

As organizations move autonomous AI agents from experimental sandboxes into live production, a critical bottleneck has emerged. Foundation models are remarkably capable but structurally unsuited to complex, multi-step work on their own. They have no persistent memory, no built-in sense of what is allowed, and no reliable way to stay on track across a long […]

agents
#54

[AINews] Tasteful Tokenmaxxing

Frontier LLMs 2026-04-23 Latent Space (swyx & Alessio)
5.8
I 5.5 Im 5.5 P 6.3

A quiet day lets us reflect on the top conversation that AI leaders are having everywhere.

frontier_llm
#56
5.8
I 5.5 Im 5.8 P 6.0

Infosys said the integration will be used to help its clients modernize software development, automate workflows, and deploy AI systems, initially focusing on software engineering, legacy modernization, and DevOps.

industry
#58

How defense agencies can stay mission-ready in the AI era

Government & Defense 2026-04-22 DefenseScoop
5.7
I 5.5 Im 6.0 P 5.5

A new guide explores how defense agencies can strengthen mission resilience with unified data, faster threat detection and AI-ready architectures in complex, contested environments.

gov_defense
#60

Google Maps is about to get a big dose of AI

Industry 2026-04-22 TechCrunch — AI
5.5
I 5.5 Im 5.0 P 6.0

The new features, announced at Cloud Next in Las Vegas this week, add generative AI capabilities to Google's mapping platform, giving it enhanced visual and data analytics powers.

industry
#61

Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence)
5.3
I 6.5 Im 5.3 P 4.0

Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
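The readability figures in this summary are Flesch-Kincaid Grade Level (FKGL) scores. A quick sketch of how that metric is computed; the syllable counter is a common vowel-group heuristic, not the study's exact tooling:

```python
import re

def fkgl(text):
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences)
    + 11.8*(syllables/words) - 15.59, with syllables estimated
    by counting vowel groups per word."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```

On this scale the physician-authored responses (11.47-12.50) sit around high-school reading level, while the 16.91-17.60 scores reported for GPT-5 and Claude correspond to graduate-level prose.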

evals
#62

Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence)
5.3
I 6.0 Im 5.3 P 4.5

We investigate the integration of human-like working memory constraints into the Transformer architecture and implement several cognitively inspired attention variants, including fixed-width-window-based and temporal-decay-based attention mechanisms. Our modified GPT-2 models are trained from scratch on developmentally plausible datasets (10M and 100M words). Performance is evaluated on grammatical judgment tasks (BLiMP) and alignment with human reading time data. Our results indicate that these cognitively inspired constraints, particularly fixed-width attention, can significantly improve grammatical accuracy, especially when training data is scarce. These constrained models also tend to show a stronger alignment with human processing metrics. The findings suggest that such constraints may serve as a beneficial inductive bias, guiding models towards more robust linguistic representations, especially in data-limited settings.
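The fixed-width variant amounts to masking causal attention down to the most recent k tokens. A sketch of that mask (illustrative; the authors' exact GPT-2 modification may differ):

```python
import numpy as np

def windowed_causal_mask(seq_len, window):
    """True where attention is allowed: token i may attend only to the
    `window` most recent positions j satisfying i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)
```

Applied before the softmax (masked positions set to -inf), this gives the model a hard working-memory horizon regardless of context length.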

evals
#63

Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks
5.0
I 6.8 Im 3.8 P 4.5

This work explores the role of prompt design and judge selection in LLM-as-a-Judge evaluations of free text legal question answering. We examine whether automatic task prompt optimization improves over human-centered design, whether optimization effectiveness varies by judge feedback style, and whether optimized prompts transfer across judges. We systematically address these questions on the LEXam benchmark by optimizing task prompts using the ProTeGi method with feedback from two judges (Qwen3-32B, DeepSeek-V3) across four task models, and then testing cross-judge transfer. Automatic optimization consistently outperforms the baseline, with lenient judge feedback yielding higher and more consistent gains than strict judge feedback. Prompts optimized with lenient feedback transfer better to strict judges than the reverse direction. Analysis reveals that lenient judges provide permissive feedback, yielding prompts with broader applicability, whereas strict judges produce restrictive feedback, leading to judge-specific overfitting. Our findings demonstrate algorithmically optimizing prompts on training data can outperform human-centered prompt design and that judges' dispositions during optimization shape prompt generalizability. Code and optimized prompts are available at https://github.com/TUMLegalTech/icail2026-llm-judge-gaming.

evals
#64

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks
4.9
I 6.0 Im 3.8 P 5.0

Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.

evals
#65

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

Evaluations & Benchmarks 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv — Reinforcement Learning · arXiv — Evals & Benchmarks
4.9
I 4.3 Im 5.3 P 5.0

We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models (MLLMs). Current MLLMs trained solely on final outcomes often treat visual reasoning as a black box, relying on superficial pattern matching rather than performing rigorous multi-step inference. While Reinforcement Learning with Verifiable Rewards could enforce transparent reasoning trajectories, extending it to visual domains remains severely hindered by the ambiguity of grounding abstract logic into continuous pixel space. We solve this by leveraging the deterministic grid structure of tables as an ideal visual testbed. V-tableR1 employs a specialized critic VLM to provide dense, step-level feedback on the explicit visual chain-of-thought generated by a policy VLM. To optimize this system, we propose Process-Guided Direct Alignment Policy Optimization (PGPO), a novel RL algorithm integrating process rewards, decoupled policy constraints, and length-aware dynamic sampling. Extensive evaluations demonstrate that V-tableR1 explicitly penalizes visual hallucinations and shortcut guessing. By fundamentally shifting multimodal inference from black-box pattern matching to verifiable logical derivation, V-tableR1 4B establishes state-of-the-art accuracy among open-source models on complex tabular benchmarks, outperforming models up to 18x its size and improving over its SFT baseline.

evals
#66

MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment

Post-Training 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv — Reinforcement Learning · arXiv — Agents / Tool Use · arXiv — Post-training / Alignment
4.9
I 3.5 Im 6.1 P 5.0

Aligning large language models (LLMs) to desirable human values requires balancing multiple, potentially conflicting objectives such as helpfulness, truthfulness, and harmlessness, which presents a multi-objective optimisation challenge. Most alignment pipelines rely on a fixed scalarisation of these objectives, which can introduce procedural unfairness by systematically under-weighting harder-to-optimise or minority objectives. To promote more equitable trade-offs, we introduce MGDA-Decoupled, a geometry-based multi-objective optimisation algorithm that finds a shared descent direction while explicitly accounting for each objective's convergence dynamics. In contrast to prior methods that depend on reinforcement learning (e.g., GAPO) or explicit reward models (e.g., MODPO), our approach operates entirely within the lightweight Direct Preference Optimisation (DPO) paradigm. Experiments on the UltraFeedback dataset show that geometry-aware methods, and MGDA-Decoupled in particular, achieve the highest win rates against golden responses, both overall and per objective.
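For the two-objective case, base MGDA has a closed form: the shared descent direction is the min-norm point in the convex hull of the per-objective gradients. A sketch of that base step only; the paper's geometry-aware decoupling adds per-objective convergence scaling on top, which is not shown here:

```python
import numpy as np

def mgda_two_objective(g1, g2):
    """Min-norm convex combination d = lam*g1 + (1-lam)*g2 of two gradient
    vectors; stepping along -d does not increase either objective."""
    diff = g1 - g2
    denom = float(diff @ diff)
    # Closed-form minimizer of ||lam*g1 + (1-lam)*g2||^2, clipped to [0, 1]
    lam = 0.5 if denom == 0.0 else float(np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0))
    return lam * g1 + (1 - lam) * g2
```

When the gradients conflict, the result balances them; when one dominates the other, the clipping selects the shorter gradient, which is exactly the min-norm point on the segment between them.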

post_training
#67

Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

Agents & Tool Use 2026-04-22 arXiv — Agents / Tool Use · arXiv — Evals & Benchmarks
4.9
I 6.0 Im 4.6 P 4.0

LLM agents have begun to find real security vulnerabilities that human auditors and automated fuzzers missed for decades, in source-available targets where the analyst can build and instrument the code. In practice the work is split among several agents, wired together by a harness: the program that fixes which roles exist, how they pass information, which tools each may call, and how retries are coordinated. When the language model is held fixed, changing only the harness can still change success rates by several-fold on public agent benchmarks, yet most harnesses are written by hand; recent harness optimizers each search only a narrow slice of the design space and rely on coarse pass/fail feedback that gives no diagnostic signal about why a trial failed. AgentFlow addresses both limitations with a typed graph DSL whose search space jointly covers agent roles, prompts, tools, communication topology, and coordination protocol, paired with a feedback-driven outer loop that reads runtime signals from the target program itself to diagnose which part of the harness caused the failure and rewrite it accordingly. We evaluate AgentFlow on TerminalBench-2 with Claude Opus 4.6 and on Google Chrome with Kimi K2.5. AgentFlow reaches 84.3% on TerminalBench-2, the highest score in the public leaderboard snapshot we evaluate against, and discovers ten previously unknown zero-day vulnerabilities in Google Chrome, including two Critical sandbox-escape vulnerabilities (CVE-2026-5280 and CVE-2026-6297).

agents
#68

A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking

Robotics 2026-04-22 arXiv cs.RO (Robotics) · arXiv — Robotic Autonomy / Embodied AI
4.8
I 4.3 Im 6.0 P 4.0

Ultrasound (US)-guided needle insertion is a critical yet challenging procedure due to dynamic imaging conditions and difficulties in needle visualization. Many methods have been proposed for automated needle insertion, but they often rely on hand-crafted pipelines with modular controllers, whose performance degrades in challenging cases. In this paper, a Vision-Language-Action (VLA) model is proposed for adaptive and automated US-guided needle insertion and tracking on a robotic ultrasound (RUS) system. This framework provides a unified approach to needle tracking and needle insertion control, enabling real-time, dynamically adaptive adjustment of insertion based on the obtained needle position and environment awareness. To achieve real-time and end-to-end tracking, a Cross-Depth Fusion (CDF) tracking head is proposed, integrating shallow positional and deep semantic features from the large-scale vision backbone. To adapt the pretrained vision backbone for tracking tasks, a Tracking-Conditioning (TraCon) register is introduced for parameter-efficient feature conditioning. After needle tracking, an uncertainty-aware control policy and an asynchronous VLA pipeline are presented for adaptive needle insertion control, ensuring timely decision-making for improved safety and outcomes. Extensive experiments on both needle tracking and insertion show that our method consistently outperforms state-of-the-art trackers and manual operation, achieving higher tracking accuracy, improved insertion success rates, and reduced procedure time, highlighting promising directions for RUS-based intelligent intervention.

robotics
#69

Interval POMDP Shielding for Imperfect-Perception Agents

Agents & Tool Use 2026-04-22 arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use
4.8
I 4.3 Im 6.1 P 4.0

Autonomous systems that rely on learned perception can make unsafe decisions when sensor readings are misclassified. We study shielding for this setting: given a proposed action, a shield blocks actions that could violate safety. We consider the common case where system dynamics are known but perception uncertainty must be estimated from finite labeled data. From these data we build confidence intervals for the probabilities of perception outcomes and use them to model the system as a finite Interval Partially Observable Markov Decision Process with discrete states and actions. We then propose an algorithm to compute a conservative set of beliefs over the underlying state that is consistent with the observations seen so far. This enables us to construct a runtime shield that comes with a finite-horizon guarantee: with high probability over the training data, if the true perception uncertainty rates lie within the learned intervals, then every action admitted by the shield satisfies a stated lower bound on safety. Experiments on four case studies show that our shielding approach (and variants derived from it) improves the safety of the system over state-of-the-art baselines.

agents
#70

Trust, Lies, and Long Memories: Emergent Social Dynamics and Reputation in Multi-Round Avalon with LLM Agents

Agents & Tool Use 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use
4.7
I 3.5 Im 6.1 P 4.5

We study emergent social dynamics in LLM agents playing The Resistance: Avalon, a hidden-role deception game. Unlike prior work on single-game performance, our agents play repeated games while retaining memory of previous interactions, including who played which roles and how they behaved, enabling us to study how social dynamics evolve. Across 188 games, two key phenomena emerge. First, reputation dynamics emerge organically when agents retain cross-game memory: agents reference past behavior in statements like "I am wary of repeating last game's mistake of over-trusting early success." These reputations are role-conditional: the same agent is described as "straightforward" when playing good but "subtle" when playing evil, and high-reputation players receive 46% more team inclusions. Second, higher reasoning effort supports more strategic deception: evil players more often pass early missions to build trust before sabotaging later ones, 75% in high-effort games vs 36% in low-effort games. Together, these findings show that repeated interaction with memory gives rise to measurable reputation and deception dynamics among LLM agents.

agents
#71

Early-Stage Product Line Validation Using LLMs: A Study on Semi-Formal Blueprint Analysis

Evaluations & Benchmarks 2026-04-22 arXiv cs.AI (Artificial Intelligence)
4.7
I 6.8 Im 3.8 P 3.5

We study whether Large Language Models (LLMs) can perform feature model analysis operations (AOs) directly on semi-formal textual blueprints, i.e., concise constrained-language descriptions of feature hierarchies and constraints, enabling early validation in Software Product Line scoping. Using 12 state-of-the-art LLMs and 16 standard AOs, we compare their outputs against the solver-based oracle FLAMA. Results show that reasoning-optimized models (e.g., Grok 4 Fast Reasoning, Gemini 2.5 Pro) achieve 88-89% average accuracy across all evaluated blueprints and operations, approaching solver correctness. We identify systematic errors in structural parsing and constraint reasoning, and highlight accuracy-cost trade-offs that inform model selection. These findings position LLMs as lightweight assistants for early variability validation.

evals
#73

An explicit operator explains end-to-end computation in the modern neural networks used for sequence and language modeling

State Space Models 2026-04-22 arXiv cs.NE (Neural & Evolutionary Computing) · arXiv cs.LG (Machine Learning) · arXiv — State Space Models
4.6
I 4.3 Im 5.0 P 4.5

We establish a mathematical correspondence between state space models, a state-of-the-art architecture for capturing long-range dependencies in data, and an exactly solvable nonlinear oscillator network. As a specific example of this general correspondence, we analyze the diagonal linear time-invariant implementation of the Structured State Space Sequence model (S4). The correspondence embeds S4D, a specific implementation of S4, into a ring network topology, in which recent inputs are encoded, as waves of activity traveling over the one-dimensional spatial layout of the network. We then derive an exact operator expression for the full forward pass of S4D, yielding an analytical characterization of its complete input-output map. This expression reveals that the nonlinear decoder in the system induces interactions between these information-carrying waves that enable classifying real-world sequences. These results generalize across modern SSM architectures, and show that they admit an exact mathematical description with a clear physical interpretation. These insights enable a new level of interpretability for these systems in terms of nonlinear oscillator networks.
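The S4D forward pass the correspondence analyzes is, at its core, a diagonal complex recurrence followed by a linear read-out. A minimal sketch; the parameter values below are hypothetical decaying oscillatory modes, not the paper's:

```python
import numpy as np

def diagonal_ssm(u, A_diag, B, C):
    """Diagonal LTI SSM: x[t] = A_diag * x[t-1] + B * u[t], y[t] = Re(C . x[t]).
    With |A_diag| < 1, the state carries decaying waves of past inputs."""
    x = np.zeros_like(A_diag)
    ys = []
    for ut in u:
        x = A_diag * x + B * ut           # per-mode complex recurrence
        ys.append(np.real(np.dot(C, x)))  # linear read-out, real part
    return np.array(ys)

# Hypothetical modes: magnitudes < 1 (stable), distinct oscillation frequencies
n = 4
A = np.exp(-0.1 + 1j * np.linspace(0.5, 2.0, n))
B = np.ones(n, dtype=complex)
C = np.ones(n, dtype=complex) / n
y = diagonal_ssm([1.0, 0.0, 0.0, 0.0], A, B, C)  # impulse response
```

Each complex diagonal entry is one decaying oscillator; the paper's ring-network picture interprets these modes as traveling waves over a one-dimensional layout.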

ssm
#75

RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv — Evals & Benchmarks
4.5
I 5.7 Im 3.8 P 4.0

We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models (LLaMa 3, Qwen QwQ, and OpenAI's o3-mini), finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa 3 and o3-mini are more task dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA

evals
#76

COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence)
4.5
I 5.1 Im 3.8 P 4.5

Large language models (LLMs) often exhibit performance disparities across languages, with naive multilingual fine-tuning frequently degrading performance due to negative cross-lingual interference. To address this, we introduce COMPASS (COntinual Multilingual PEFT with Adaptive Semantic Sampling), a novel data-centric framework for adapting LLMs to target languages. COMPASS leverages parameter-efficient fine-tuning (PEFT) by training lightweight, language-specific adapters on a judiciously selected subset of auxiliary multilingual data. The core of our method is a distribution-aware sampling strategy that uses multilingual embeddings and clustering to identify semantic gaps between existing training data and a target usage distribution. By prioritizing auxiliary data from under-represented semantic clusters, COMPASS maximizes positive cross-lingual transfer while minimizing interference. We extend this into a continual learning framework, COMPASS-ECDA, which monitors for data distribution shifts in production and dynamically updates adapters to prevent model staleness, balancing adaptation to new data with the preservation of existing knowledge. Across three different model architectures (Phi-4-Mini, Llama-3.1-8B, and Qwen2.5-7B) and multiple challenging multilingual benchmarks (Global-MMLU, MMLU-ProX), including unseen long-context tasks (OneRuler), we demonstrate that COMPASS consistently outperforms baseline methods guided by linguistic similarity, providing an effective, efficient, and sustainable solution for developing and maintaining high-performing multilingual models in dynamic environments.
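The distribution-aware sampling step can be pictured concretely. The sketch below shows the assumed mechanics (not the authors' code): assign embeddings to clusters, measure which clusters the target distribution needs more of than the training data provides, and rank auxiliary examples by that coverage gap.

```python
import numpy as np

def semantic_gap_sample(train_emb, target_emb, aux_emb, centroids, n_pick):
    """Rank auxiliary examples by how much their semantic cluster is
    under-represented in training data relative to the target distribution."""
    def assign(E):
        d = ((E[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)          # nearest-centroid cluster ids
    k = len(centroids)
    train_hist = np.bincount(assign(train_emb), minlength=k) / len(train_emb)
    target_hist = np.bincount(assign(target_emb), minlength=k) / len(target_emb)
    gap = np.maximum(target_hist - train_hist, 0.0)  # under-covered mass
    priority = gap[assign(aux_emb)]                  # score each aux example
    return np.argsort(-priority)[:n_pick]            # indices of aux data to add
```

Sampling auxiliary data from the under-covered clusters first is what the abstract means by maximizing positive transfer while minimizing interference.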

evals
#77

ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence)
4.5
I 4.3 Im 5.3 P 4.0

Effective retrieval-augmented generation across bilingual Greek-English applications requires embedding models capable of capturing both domain-specific semantic relationships and cross-lingual semantic alignment. Existing multilingual embedding models distribute their representational capacity across numerous languages, limiting their optimization for Greek and failing to encode the morphological complexity and domain-specific terminological structures inherent in Greek text. In this work, we propose ORPHEAS, a specialized Greek-English embedding model for bilingual retrieval-augmented generation. ORPHEAS is trained on a high-quality dataset generated by a knowledge-graph-based fine-tuning methodology applied to a diverse multi-domain corpus, enabling language-agnostic semantic representations. The numerical experiments across monolingual and cross-lingual retrieval benchmarks reveal that ORPHEAS outperforms state-of-the-art multilingual embedding models, demonstrating that domain-specialized fine-tuning on morphologically complex languages does not compromise cross-lingual retrieval capability.

evals
#78

DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

Interpretability 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv — Mechanistic Interpretability
4.5
I 5.7 Im 3.8 P 4.0

Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting, probing whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth-py/DialToM.

interpretability
#79

Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs

Post-Training 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv — Agents / Tool Use
4.5
I 4.3 Im 5.3 P 4.0

Rising demand for mental health support has increased interest in using Large Language Models (LLMs) for counseling. However, adapting LLMs to this high-risk safety-critical domain is hindered by the scarcity of real-world counseling data due to privacy constraints. Synthetic datasets provide a promising alternative, but existing approaches often rely on unstructured or semi-structured text inputs and overlook structural dependencies between a client's cognitive, emotional, and behavioral states, often producing psychologically inconsistent interactions and reducing data realism and quality. We introduce Graph2Counsel, a framework for generating synthetic counseling sessions grounded in Client Psychological Graphs (CPGs) that encode relationships among clients' thoughts, emotions, and behaviors. Graph2Counsel employs a structured prompting pipeline guided by counselor strategies and CPG, and explores prompting strategies including CoT (Wei et al., 2022) and Multi-Agent Feedback (Li et al., 2025a). Graph2Counsel produces 760 sessions from 76 CPGs across diverse client profiles. In expert evaluation, our dataset outperforms prior datasets on specificity, counselor competence, authenticity, conversational flow, and safety, with substantial inter-annotator agreement (Krippendorff's α = 0.70). Fine-tuning an open-source model on this dataset improves performance on CounselingBench (Nguyen et al., 2025) and CounselBench (Li et al., 2025b), showing downstream utility. We also make our code and data public.

post_training
#80

FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels

Evaluations & Benchmarks 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision) · arXiv — Efficiency (Quantization, MoE, Inference) · arXiv — Evals & Benchmarks
4.5
I 4.3 Im 3.8 P 5.5

Federated learning (FL) enables collaborative model training without sharing raw data; however, the presence of noisy labels across distributed clients can severely degrade the learning performance. In this paper, we propose FedSIR, a multi-stage framework for robust FL under noisy labels. Different from existing approaches that mainly rely on designing noise-tolerant loss functions or exploiting loss dynamics during training, our method leverages the spectral structure of client feature representations to identify and mitigate label noise. Our framework consists of three key components. First, we identify clean and noisy clients by analyzing the spectral consistency of class-wise feature subspaces with minimal communication overhead. Second, clean clients provide spectral references that enable noisy clients to relabel potentially corrupted samples using both dominant class directions and residual subspaces. Third, we employ a noise-aware training strategy that integrates logit-adjusted loss, knowledge distillation, and distance-aware aggregation to further stabilize federated optimization. Extensive experiments on standard FL benchmarks demonstrate that FedSIR consistently outperforms state-of-the-art methods for FL with noisy labels. The code is available at https://github.com/sinagh72/FedSIR.
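One concrete way to compare "spectral consistency of class-wise feature subspaces" is via principal angles between singular subspaces. The sketch below is illustrative of that kind of check; the paper's exact statistic may differ.

```python
import numpy as np

def subspace_consistency(F_client, F_ref, dim=3):
    """Mean cosine of principal angles between the top-`dim` left singular
    subspaces of two feature matrices (rows = samples); 1.0 means the
    class-wise feature subspaces agree exactly."""
    def top_basis(F):
        U, _, _ = np.linalg.svd(F.T, full_matrices=False)
        return U[:, :dim]
    # Singular values of Qa^T Qb are the cosines of the principal angles
    s = np.linalg.svd(top_basis(F_client).T @ top_basis(F_ref), compute_uv=False)
    return float(s.mean())
```

A clean client's features should score near 1.0 against the clean-reference subspace, while heavy label noise rotates the class-wise subspaces and pulls the score down.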

evals
#81

Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

Post-Training 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv — Efficiency (Quantization, MoE, Inference)
4.5
I 5.7 Im 3.8 P 4.0

We present a systematic empirical study of transformer compression through over 40 experiments on GPT-2 (124M parameters) and Mistral 7B (7.24B parameters). Our analysis covers spectral compression, block-level function replacement, rotation-based quantization, activation geometry, and adaptive early exit. We identify five structural properties relevant to compression. (1) Variance is not importance: high-variance activation directions are approximately 96 percent uncorrelated with predictive directions (measured via CCA), and projecting onto these subspaces preserves over 90 percent of variance while degrading perplexity. (2) Block linearity is conditional: transformer blocks are approximately linear (R^2 ~ 0.95 on GPT-2, 0.93 on Mistral block 31) only under the correct upstream distribution; modifying earlier blocks induces distribution shift that degrades downstream approximations. (3) The reconstruction wall: approaches that factor weights into quantized components amplify errors through cross-terms, making direct quantization strictly superior. (4) Linearity increases with depth: Mistral 7B exhibits a progression from R^2 = 0.17 (block 0) to R^2 = 0.93 (block 31), indicating a division between nonlinear feature construction and linear refinement. (5) Approximately 30 percent of tokens are computationally easy, confirmed via exit heads and KL divergence sensitivity. We demonstrate that single-block linear replacement achieves 34x compression with a 1.71 perplexity increase on the final block of Mistral 7B, while multi-block replacement fails due to residual error accumulation and distribution shift. These findings suggest fundamental limits to static post-training compression and motivate adaptive, per-token computation as a more effective direction.
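The single-block linear replacement experiment can be reproduced in miniature: fit a least-squares linear map on calibration activations and check the resulting R². The data below is synthetic; the paper does this on real Mistral 7B activations.

```python
import numpy as np

def fit_block_replacement(X_in, X_out, rank=None):
    """Least-squares linear map W with X_in @ W ~= X_out, optionally
    truncated by SVD to a low rank for extra compression."""
    W, *_ = np.linalg.lstsq(X_in, X_out, rcond=None)
    if rank is not None:
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        W = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    return W

# Synthetic "almost linear block": a linear map plus a small nonlinear residue
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 16))
Y = X @ rng.normal(size=(16, 16)) + 0.01 * np.tanh(X)
W = fit_block_replacement(X, Y)
r2 = 1.0 - ((X @ W - Y) ** 2).sum() / ((Y - Y.mean(0)) ** 2).sum()
```

The paper's point (2) is the catch: this fit is only valid under the upstream activation distribution used for calibration, so replacing multiple blocks compounds distribution shift.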

post_training
#82

Diagnosing CFG Interpretation in LLMs

Interpretability 2026-04-22 arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use
4.5
I 3.5 Im 6.1 P 4.0

As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs? We introduce RoboGrid, a framework that disentangles syntax, behavior, and semantics through controlled stress-tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Despite the partial mitigation provided by CoT reasoning, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, "Alien" lexicons reveal that LLMs rely on semantic bootstrapping from keywords rather than pure symbolic induction. These findings pinpoint critical gaps in hierarchical state-tracking required for reliable, grammar-agnostic agents.

interpretability
#83

ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards

Interpretability 2026-04-22 arXiv cs.CV (Computer Vision) · arXiv — Agents / Tool Use · arXiv — Mechanistic Interpretability
4.5
I 4.3 Im 4.6 P 4.5

Training multimodal agents via reinforcement learning for knowledge-intensive visual reasoning is fundamentally hindered by the extreme sparsity of outcome-based supervision and the unpredictability of live web environments. To resolve these algorithmic and environmental bottlenecks, we introduce ProMMSearchAgent, establishing a novel Sim-to-Real training paradigm for multimodal search. We decouple policy learning into a deterministic, local static sandbox. Crucially, to learn effectively within this constrained environment, we propose an introspective process-oriented reward. By probing the agent's own parametric knowledge boundaries, we generate dense behavioral metadata that explicitly rewards the correct cognitive decision, initiating a multimodal or text search only when visually or factually uncertain. Extensive experiments demonstrate that our locally-trained policy transfers zero-shot to the live Google Search API. ProMMSearchAgent achieves new SOTA performance, outperforming MMSearch-R1 by +5.1% on FVQA-test, +6.3% on InfoSeek, and +11.3% on MMSearch.

interpretability
#84
4.4
I 3.5 Im 6.1 P 3.5

This paper presents a framework for mapping unknown scalar fields using a sensor-equipped autonomous robot operating in unsafe environments. The unsafe regions are defined as regions of high-intensity, where the field value exceeds a predefined safety threshold. For safe and efficient mapping of the scalar field, the sensor-equipped robot must avoid high-intensity regions during the measurement process. In this paper, the scalar field is modeled as a sample from a Gaussian process (GP), which enables Bayesian inference and provides closed-form expressions for both the predictive mean and the uncertainty. Concurrently, the spatial structure of the high-intensity regions is estimated in real-time using the Hough transform (HT), leveraging the evolving GP posterior. A safe sampling strategy is then employed to guide the robot towards safe measurement locations, using probabilistic safety guarantees on the evolving GP posterior. The estimated high-intensity regions also facilitate the design of safe motion plans for the robot. The effectiveness of the approach is verified through two numerical simulation studies and an indoor experiment for mapping a light-intensity field using a wheeled mobile robot.
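The safe-sampling step described above can be sketched with a minimal 1-D Gaussian process: compute the posterior mean and uncertainty, keep only candidate locations whose upper confidence bound stays below the safety threshold, and measure where the GP is most uncertain. The field, kernel, threshold, and the UCB form of the safety test below are illustrative assumptions, not the paper's choices.

```python
import numpy as np

def rbf(a, b, ls=0.5):
    """Squared-exponential kernel on 1-D inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / ls**2)

# Hypothetical scalar field with a high-intensity (unsafe) bump near x=0.7.
f = lambda x: np.exp(-((x - 0.7) ** 2) / 0.02)

X_obs = np.array([0.0, 0.2, 0.4, 0.9])   # safe measurements taken so far
y_obs = f(X_obs)
X_cand = np.linspace(0.0, 1.0, 101)      # candidate measurement locations

# GP posterior mean and std (noise-free observations, jitter for stability).
K = rbf(X_obs, X_obs) + 1e-8 * np.eye(len(X_obs))
Ks = rbf(X_cand, X_obs)
mu = Ks @ np.linalg.solve(K, y_obs)
var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
sigma = np.sqrt(np.maximum(var, 0.0))

# Probabilistic safety test: upper confidence bound below the threshold.
threshold, beta = 0.5, 2.0
safe = mu + beta * sigma < threshold

# Among safe candidates, measure where the GP is most uncertain.
next_x = X_cand[safe][np.argmax(sigma[safe])]
print(f"{safe.sum()} safe candidates, next measurement at x = {next_x:.2f}")
```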

robotics
#85
4.4
I 4.3 Im 5.3 P 3.5

Peg-in-hole (PiH) assembly is a fundamental yet challenging robotic manipulation task. While reinforcement learning (RL) has shown promise in tackling such tasks, it requires extensive exploration. In this paper, we propose a novel visual-tactile skill learning framework for the PiH task that leverages its inverse task, i.e., peg-out-of-hole (PooH) disassembly, to facilitate PiH learning. Compared to PiH, PooH is inherently easier as it only needs to overcome existing friction without precise alignment, making data collection more efficient. To this end, we formulate both PooH and PiH as Partially Observable Markov Decision Processes (POMDPs) in a unified environment with shared visual-tactile observation space. A visual-tactile PooH policy is first trained; its trajectories, containing kinematic, visual and tactile information, are temporally reversed and action-randomized to provide expert data for PiH. In the policy learning, visual sensing facilitates the peg-hole approach, while tactile measurements compensate for peg-hole misalignment. Experiments across diverse peg-hole geometries show that the visual-tactile policy attains 6.4% lower contact forces than its single-modality counterparts, and that our framework achieves average success rates of 87.5% on seen objects and 77.1% on unseen objects, outperforming direct RL methods that train PiH policies from scratch by 18.1% in success rate. Demos, code, and datasets are available at https://sites.google.com/view/pooh2pih.

robotics
#86

Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

Robotics 2026-04-22 arXiv cs.RO (Robotics) · arXiv cs.LG (Machine Learning) · arXiv — Reinforcement Learning · arXiv — Robotic Autonomy / Embodied AI
4.4
I 4.3 Im 3.8 P 5.0

Recent advances in vision-language-action (VLA) models for robotics have highlighted the importance of reliable uncertainty quantification in sequential tasks. However, assessing and improving calibration in such settings remains mostly unexplored, especially when only partial trajectories are observed. In this work, we formulate sequential calibration for episodic tasks, where task-success confidence is produced along an episode, while success is determined at the end of it. We introduce a sequential extension of the Brier score and show that, for binary outcomes, its risk minimizer coincides with the VLA policy's value function. This connection bridges uncertainty calibration and reinforcement learning, enabling the use of temporal-difference (TD) value estimation as a principled calibration mechanism over time. We empirically show that TD calibration improves performance relative to the state-of-the-art on simulated and real-robot data. Interestingly, we show that when calibrated using TD, the VLA's single-step action probabilities can yield competitive uncertainty estimates, in contrast to recent findings that employed different calibration techniques.
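The Brier-score/value-function connection can be seen in a toy tabular setting: running TD(0) on an episodic chain with a binary terminal outcome drives every state's value toward the episode success probability, i.e. a calibrated confidence. The chain, learning rate, and success probability below are ours, not the paper's.

```python
import random

random.seed(0)

# Toy episodic chain: the agent walks states 0..3; the episode then
# succeeds with probability 0.7 (hypothetical environment).
N, P_SUCCESS, ALPHA = 4, 0.7, 0.02
V = [0.0] * N  # V[s]: predicted probability of eventual episode success

for _ in range(20_000):
    s = 0
    while s < N - 1:
        V[s] += ALPHA * (V[s + 1] - V[s])  # TD(0) update, reward 0 mid-episode
        s += 1
    outcome = 1.0 if random.random() < P_SUCCESS else 0.0
    V[s] += ALPHA * (outcome - V[s])       # terminal target = binary success

print([round(v, 2) for v in V])
```

Each entry of V settles near 0.7: the TD-learned value is exactly the calibrated episode-success confidence that the abstract connects to the sequential Brier score's risk minimizer.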

robotics
#87
4.4
I 3.5 Im 6.1 P 3.5

The rapid iteration of autonomous driving algorithms has created a growing demand for high-fidelity, replayable, and diagnosable testing data. However, many public datasets lack real vehicle dynamics feedback and closed-loop interaction with surrounding traffic and road infrastructure, limiting their ability to reflect deployment readiness. To address this gap, we present OVPD (OnSite Virtual-Physical Dataset), a virtual-physical fusion testing dataset released from the 2025 OnSite Autonomous Driving Challenge. Centered on real-vehicle-in-the-loop testing, OVPD integrates virtual background traffic with vehicle-infrastructure perception to build controllable and interactive closed-loop test environments on a proving ground. The dataset contains 20 testing clips from 20 teams over a scenario chain of 15 atomic scenarios, totaling nearly 3 hours of multi-modal data, including vehicle trajectories and states, control commands, and digital-twin-rendered surround-view observations. OVPD supports long-tail planning and decision-making validation, open-loop or platform-enabled closed-loop evaluation, and comprehensive assessment across safety, efficiency, comfort, rule compliance, and traffic impact, providing actionable evidence for failure diagnosis and iterative improvement. The dataset is available via: https://huggingface.co/datasets/Yuhang253820/Onsite_OPVD

ssm
#88

ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

Post-Training 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision) · arXiv — Reinforcement Learning · arXiv — Generative Media / Diffusion
4.4
I 4.3 Im 3.8 P 5.0

Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of ``early scalarization'' collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals -- such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.
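The contrast between early scalarization and preference conditioning comes down to where the weight vector enters training; a minimal sketch under assumed names and distributions (two objectives, Dirichlet-sampled weights):

```python
import numpy as np

rng = np.random.default_rng(7)

def scalarize(rewards, w):
    """Weighted-sum scalarization of per-objective rewards."""
    return float(np.asarray(w) @ np.asarray(rewards))

rewards = rng.uniform(size=2)  # e.g. (prompt adherence, source fidelity)

# Early scalarization: one trade-off frozen at training time.
r_fixed = scalarize(rewards, [0.8, 0.2])

# Preference conditioning: sample w per training example; the sampled w
# would also be fed to the model as a conditioning input, which is what
# enables inference-time control over the trade-off.
w = rng.dirichlet([1.0, 1.0])
r_cond = scalarize(rewards, w)

print(round(r_fixed, 3), round(r_cond, 3))
```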

post_training
#89

Supplement Generation Training for Enhancing Agentic Task Performance

Post-Training 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use
4.4
I 3.5 Im 5.3 P 4.5

Training large foundation models for agentic tasks is increasingly impractical due to the high computational costs, long iteration cycles, and rapid obsolescence as new models are continuously released. Instead of post-training massive models for every new task or domain, we propose Supplement Generation Training (SGT), a more efficient and sustainable strategy. SGT trains a smaller LLM to generate useful supplemental text that, when appended to the original input, helps the larger LLM solve the task more effectively. These lightweight models can dynamically adapt supplements to task requirements, improving performance without modifying the underlying large models. This approach decouples task-specific optimization from large foundation models and enables more flexible, cost-effective deployment of LLM-powered agents in real-world applications.

post_training
#90

Evaluating Assurance Cases as Text-Attributed Graphs for Structure and Provenance Analysis

Evaluations & Benchmarks 2026-04-22 arXiv cs.LG (Machine Learning)
4.4
I 4.3 Im 5.3 P 3.5

An assurance case is a structured argument document that justifies claims about a system's requirements or properties, which are supported by evidence. In regulated domains, these are crucial for meeting compliance and safety requirements to industry standards. We propose a graph diagnostic framework for analysing the structure and provenance of assurance cases. We focus on two main tasks: (1) link prediction, to learn and identify connections between argument elements, and (2) graph classification, to differentiate between assurance cases created by a state-of-the-art large language model and those created by humans, aiming to detect bias. We compiled a publicly available dataset of assurance cases, represented as graphs with nodes and edges, supporting both link prediction and provenance analysis. Experiments show that graph neural networks (GNNs) achieve strong link prediction performance (ROC-AUC 0.760) on real assurance cases and generalise well across domains and semi-supervised settings. For provenance detection, GNNs effectively distinguish human-authored from LLM-generated cases (F1 0.94). We observed that LLM-generated assurance cases have different hierarchical linking patterns compared to human-authored cases. Furthermore, existing GNN explanation methods show only moderate faithfulness, revealing a gap between predicted reasoning and the true argument structure.

evals
#91

DAIRE: A lightweight AI model for real-time detection of Controller Area Network attacks in the Internet of Vehicles

Safety, Policy & Regulation 2026-04-22 arXiv cs.AI (Artificial Intelligence)
4.4
I 4.3 Im 5.3 P 3.5

The Internet of Vehicles (IoV) is advancing modern transportation by improving safety, efficiency, and intelligence. However, the reliance on the Controller Area Network (CAN) introduces critical security risks, as CAN-based communication is highly vulnerable to cyberattacks. Addressing this challenge, we propose DAIRE (Detecting Attacks in IoV in REal-time), a lightweight machine learning framework designed for real-time detection and classification of CAN attacks. DAIRE is built on a lightweight artificial neural network (ANN) where each layer contains Ni = i x c neurons, with Ni representing the number of neurons in the ith layer and c corresponding to the total number of attack classes. Other hyperparameters are determined empirically to ensure real-time operation. To support the detection and classification of various IoV attacks, such as Denial-of-Service, Fuzzy, and Spoofing, DAIRE employs the sparse categorical cross-entropy loss function and root mean square propagation for loss minimization. In contrast to more resource-intensive architectures, DAIRE leverages a lightweight ANN to reduce computational demands while still delivering strong performance. Experimental results on the CICIoV2024 and Car-Hacking datasets demonstrate DAIRE's effectiveness, achieving an average detection rate of 99.88%, a false positive rate of 0.02%, and an overall accuracy of 99.96%. Furthermore, DAIRE significantly outperforms state-of-the-art approaches in inference speed, with a classification time of just 0.03 ms per sample. These results highlight DAIRE's effectiveness in detecting IoV cyberattacks and its practical suitability for real-time deployment in vehicular systems, underscoring its vital role in strengthening automotive cybersecurity.
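The layer-sizing rule Ni = i x c is concrete enough to sketch; the layer and class counts in the example are illustrative, not taken from the paper.

```python
def daire_layer_widths(num_layers: int, num_classes: int) -> list[int]:
    """Width of the i-th hidden layer is i * c (1-indexed), per the abstract."""
    return [i * num_classes for i in range(1, num_layers + 1)]

# Example: 4 hidden layers for a hypothetical 6-class CAN-attack problem.
widths = daire_layer_widths(4, 6)
print(widths)  # → [6, 12, 18, 24]
```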

safety_policy
#92

Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechnical Systems

Evaluations & Benchmarks 2026-04-22 arXiv cs.AI (Artificial Intelligence)
4.4
I 5.7 Im 3.8 P 3.5

In measurement theory, instruments do not simply record reality; they help constitute what is observed. The same holds for generative AI evaluation: benchmarks do not just measure, they shape what models appear to be. Functionalist benchmarks treat models as isolated predictors, while prescriptive approaches assess what systems ought to be. Both obscure the sociotechnical processes through which meaning and values are enacted, risking the reification of narrow cultural perspectives in pluralist contexts. This thesis advances a descriptive alternative. It argues that generative AI must be evaluated as a pluralist sociotechnical system and develops Machine-Society-Human (MaSH) Loops, a framework for tracing how models, users, and institutions recursively co-construct meaning and values. Evaluation shifts from judging outputs to examining how values are enacted in interaction. Three contributions follow. Conceptually, MaSH Loops reframes evaluation as recursive, enactive process. Methodologically, the World Values Benchmark introduces a distributional approach grounded in World Values Survey data, structured prompt sets, and anchor-aware scoring. Empirically, the thesis demonstrates these through two cases: value drift in early GPT-3 and sociotechnical evaluation in real estate. A final chapter draws on participatory realism to argue that prompting and evaluation are constitutive interventions, not neutral observations. The thesis argues that static benchmarks are insufficient for generative AI. Responsible evaluation requires pluralist, process-oriented frameworks that make visible whose values are enacted. Evaluation is therefore a site of governance, shaping how AI systems are understood, deployed, and trusted.

evals
#93
4.4
I 4.3 Im 5.3 P 3.5

The April 2026 Claude Mythos sandbox escape exposed a critical weakness in frontier AI containment: the infrastructure surrounding advanced models remains susceptible to formally characterizable arithmetic vulnerabilities. Anthropic has not publicly characterized the escape vector; some secondary accounts hypothesize a CWE-190 arithmetic vulnerability in sandbox networking code. We treat this as unverified and analyze the vulnerability class rather than the specific escape. This paper presents COBALT, a Z3 SMT-based formal verification engine for identifying CWE-190/191/195 arithmetic vulnerability patterns in C/C++ infrastructure prior to deployment. We distinguish two classes of contribution. Validated: COBALT detects arithmetic vulnerability patterns in production codebases, producing SAT verdicts with concrete witnesses and UNSAT guarantees under explicit safety bounds. We demonstrate this on four production case studies: NASA cFE, wolfSSL, Eclipse Mosquitto, and NASA F Prime, with reproducible encodings, verified solver output, and acknowledged security outcomes. Proposed: a four-layer containment framework consisting of COBALT, VERDICT, DIRECTIVE-4, and SENTINEL, mapping pre-deployment verification, pre-execution constraints, output control, and runtime monitoring to the failure modes exposed by the Mythos incident. Under explicit assumptions, we further argue that the publicly reported Mythos escape class is consistent with a Z3-expressible CWE-190 arithmetic formulation and that pre-deployment formal analysis would have been capable of surfacing the relevant pattern. The broader claim is infrastructural: frontier-model safety cannot depend on behavioral safeguards alone; the containment stack itself must be subjected to formal verification.
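COBALT encodes such checks as SMT formulas in Z3; as a plain-Python illustration of the CWE-190 class itself, the wraparound predicate behind a bypassable length check looks like this (the helper name is ours):

```python
MASK = 0xFFFFFFFF  # 32-bit unsigned wraparound

def adds_overflow_u32(a: int, b: int) -> bool:
    """CWE-190 pattern: does unsigned 32-bit a + b wrap around?"""
    return ((a + b) & MASK) < (a & MASK)

# A guard like `if (offset + len < buf_size)` in C is bypassable when
# offset + len wraps to a small value; this witness triggers exactly that.
print(adds_overflow_u32(0xFFFFFFF0, 0x20))  # wraps to 0x10
print(adds_overflow_u32(100, 200))          # no overflow
```

An SMT encoding asserts the negation of the no-overflow condition and asks the solver for a satisfying (a, b) witness, which is the SAT-with-witness verdict the abstract describes.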

infra
#94

MAPRPose: Mask-Aware Proposal and Amodal Refinement for Multi-Object 6D Pose Estimation

Evaluations & Benchmarks 2026-04-22 arXiv cs.CV (Computer Vision)
4.4
I 4.3 Im 5.3 P 3.5

6D object pose estimation in cluttered scenes remains challenging due to severe occlusion and sensor noise. We propose MAPRPose, a two-stage framework that leverages mask-aware correspondences for pose proposal and amodal-driven Region-of-Interest (ROI) prediction for robust refinement. In the Mask-Aware Pose Proposal (MAPP) stage, we lift 2D correspondences into 3D space to establish reliable keypoint matches and generate geometrically consistent pose hypotheses based on correspondence-level scoring, from which the top-$K$ candidates are selected. In the refinement stage, we introduce a tensorized render-and-compare pipeline integrated with an Amodal Mask Prediction and ROI Re-Alignment (AMPR) module. By reconstructing complete object geometry and dynamically adjusting the ROI, AMPR mitigates localization errors and spatial misalignment under heavy occlusion. Furthermore, our GPU-accelerated RGB-XYZ reprojection enables simultaneous refinement of all $N \times B$ pose hypotheses in a single forward pass. Evaluated on the BOP benchmark, MAPRPose achieves a state-of-the-art Average Recall (AR) of 76.5%, outperforming FoundationPose by 3.1% AR while delivering a 43x speedup in multi-object inference.

evals
#95

Workspace agents

Agents & Tool Use 2026-04-22 OpenAI Research
4.4
I 3.5 Im 4.6 P 5.0

Learn how to build, use, and scale workspace agents in ChatGPT to automate repeatable workflows, connect tools, and streamline team operations.

agents
#96

LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures

State Space Models 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv — State Space Models · arXiv cs.AI (Artificial Intelligence)
4.3
I 3.5 Im 5.0 P 4.5

Currently, Large Language Models (LLMs) feature a diversified architectural landscape, including traditional Transformer, GateDeltaNet, and Mamba. However, the evolutionary laws of hierarchical representations, task knowledge formation positions, and network robustness bottleneck mechanisms in various LLM architectures remain unclear, posing core challenges for hybrid architecture design and model optimization. This paper proposes LayerTracer, an architecture-agnostic end-to-end analysis framework compatible with any LLM architecture. By extracting hidden states layer-by-layer and mapping them to vocabulary probability distributions, it achieves joint analysis of task particle localization and layer vulnerability quantification. We define the task particle as the key layer where the target token probability first rises significantly, representing the model's task execution starting point, and the vulnerable layer is defined as the layer with the maximum Jensen-Shannon (JS) divergence between output distributions before and after mask perturbation, reflecting its sensitivity to disturbances. Experiments on models of different parameter scales show that task particles mainly appear in the deep layers of the model regardless of parameter size, while larger-parameter models exhibit stronger hierarchical robustness. LayerTracer provides a scientific basis for layer division, module ratio, and gating switching of hybrid architectures, effectively optimizing model performance. It accurately locates task-effective layers and stability bottlenecks, offering universal support for LLM structure design and interpretability research.
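The vulnerable-layer definition, the layer with maximal Jensen-Shannon divergence between pre- and post-perturbation output distributions, can be sketched on synthetic per-layer vocabulary distributions (layer count, vocabulary size, and data below are made up):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (nats) between two discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical per-layer vocab distributions before/after mask perturbation.
rng = np.random.default_rng(3)
layers_before = rng.dirichlet(np.ones(50), size=8)
layers_after = rng.dirichlet(np.ones(50), size=8)

jsd = [js_divergence(p, q) for p, q in zip(layers_before, layers_after)]
vulnerable_layer = int(np.argmax(jsd))
print(f"vulnerable layer: {vulnerable_layer}")
```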

ssm
#97

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

Agents & Tool Use 2026-04-22 arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use
4.3
I 3.5 Im 5.3 P 4.0

The value alignment problem for artificial intelligence (AI) is often framed as a purely technical or normative challenge, sometimes focused on hypothetical future systems. I argue that the problem is better understood as a structural question about governance: not whether an AI system is aligned in the abstract, but whether it is aligned enough, for whom, and at what cost. Drawing on the principal-agent framework from economics, this paper reconceptualises misalignment as arising along three interacting axes: objectives, information, and principals. The three-axis framework provides a systematic way of diagnosing why misalignment arises in real-world systems and clarifies that alignment cannot be treated as a single technical property of models but an outcome shaped by how objectives are specified, how information is distributed, and whose interests count in practice. The core contribution of this paper is to show that the three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. From this perspective, alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. Because misalignment can occur along each axis -- and affect stakeholders differently -- the structural description shows that alignment cannot be "solved" through technical design alone, but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.

agents
#98

ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence

Evaluations & Benchmarks 2026-04-22 arXiv cs.AI (Artificial Intelligence)
4.3
I 4.0 Im 5.3 P 3.5

Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to bridge the gap between superficial pattern recognition and the underlying musical logic. This landscape is further complicated by severe notation biases toward Western staff and the inherent unreliability of "LLM-as-a-judge" metrics, which often mask structural reasoning failures with systemic hallucinations. To establish a more rigorous standard, we introduce ONOTE, a multi-format benchmark that utilizes a deterministic pipeline--grounded in canonical pitch projection--to eliminate subjective scoring biases across diverse notation systems. Our evaluation of leading omnimodal models exposes a fundamental disconnect between perceptual accuracy and music-theoretic comprehension, providing a necessary framework for diagnosing reasoning vulnerabilities in complex, rule-constrained domains.

evals
#99

Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization

Agents & Tool Use 2026-04-22 arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use
4.3
I 4.3 Im 4.6 P 4.0

Designing and optimizing multi-agent systems (MAS) is a complex, labor-intensive process of "Agent Engineering." Existing automatic optimization methods, primarily focused on flat prompt tuning, lack the structural awareness to debug the intricate web of interactions in MAS. More critically, these optimizers are static; they do not learn from experience to improve their own optimization strategies. To address these gaps, we introduce Textual Parameter Graph Optimization (TPGO), a framework that enables a multi-agent system to learn to evolve. TPGO first models the MAS as a Textual Parameter Graph (TPG), where agents, tools, and workflows are modular, optimizable nodes. To guide evolution, we derive "textual gradients," structured natural language feedback from execution traces, to pinpoint failures and suggest granular modifications. The core of our framework is Group Relative Agent Optimization (GRAO), a novel meta-learning strategy that learns from historical optimization experiences. By analyzing past successes and failures, GRAO becomes progressively better at proposing effective updates, allowing the system to learn how to optimize itself. Extensive experiments on complex benchmarks like GAIA and MCP-Universe show that TPGO significantly enhances the performance of state-of-the-art agent frameworks, achieving higher success rates through automated, self-improving optimization.

agents
#100

Physics-Informed Conditional Diffusion for Motion-Robust Retinal Temporal Laser Speckle Contrast Imaging

Generative Media 2026-04-22 arXiv cs.CV (Computer Vision) · arXiv — Generative Media / Diffusion
4.3
I 3.5 Im 5.3 P 4.0

Retinal laser speckle contrast imaging (LSCI) is a noninvasive optical modality for monitoring retinal blood flow dynamics. However, conventional temporal LSCI (tLSCI) reconstruction relies on sufficiently long speckle sequences to obtain stable temporal statistics, which makes it vulnerable to acquisition disturbances and limits effective temporal resolution. A physically informed reconstruction framework, termed RetinaDiff (Retinal Diffusion Model), is proposed for retinal tLSCI that is robust to motion and requires only a few frames. In RetinaDiff, registration based on phase correlation is first applied to stabilize the raw speckle sequence before contrast computation, reducing interframe misalignment so that fluctuations at each pixel primarily reflect true flow dynamics. This step provides a physics prior corrected for motion and a high quality multiframe tLSCI reference. Next, guided by the physics prior, a conditional diffusion model performs inverse reconstruction by jointly conditioning on the registered speckle sequence and the corrected prior. Experiments on data acquired with a retinal LSCI system developed in house show improved structural continuity and statistical stability compared with direct reconstruction from few frames and representative baselines. The framework also remains effective in a small number of extremely challenging cases, where both the direct 5-frame input and the conventional multiframe reconstruction are severely degraded. Overall, this work provides a practical and physically grounded route for reliable retinal tLSCI reconstruction from extremely limited frames. The source code and model weights will be publicly available at https://github.com/QianChen113/RetinaDiff.

generative_media
#102

Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows

Agents & Tool Use 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv — Agents / Tool Use
4.2
I 3.5 Im 5.1 P 4.0

Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that coordinate effectively in games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes: accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.

agents
#103

Self-Guided Plan Extraction for Instruction-Following Tasks with Goal-Conditional Reinforcement Learning

Agents & Tool Use 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use
4.2
I 3.5 Im 4.6 P 4.5

We introduce SuperIgor, a framework for instruction-following tasks. Unlike prior methods that rely on predefined subtasks, SuperIgor enables a language model to generate and refine high-level plans through a self-learning mechanism, reducing the need for manual dataset annotation. Our approach involves iterative co-training: an RL agent is trained to follow the generated plans, while the language model adapts and modifies these plans based on RL feedback and preferences. This creates a feedback loop where both the agent and the planner improve jointly. We validate our framework in environments with rich dynamics and stochasticity. Results show that SuperIgor agents adhere to instructions more strictly than baseline methods, while also demonstrating strong generalization to previously unseen instructions.

agents
#104

Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge

Agents & Tool Use 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning) · arXiv — Agents / Tool Use
4.2
I 3.5 Im 4.6 P 4.5

Modern retrieval-augmented generation (RAG) systems treat vector embeddings as static, context-free artifacts: an embedding has no notion of when it was created, how trustworthy its source is, or which other embeddings depend on it. This flattening of knowledge has a measurable cost: recent work on VersionRAG reports that conventional RAG achieves only 58% accuracy on versioned technical queries, because retrieval returns semantically similar but temporally invalid content. We propose SmartVector, a framework that augments dense embeddings with three explicit properties -- temporal awareness, confidence decay, and relational awareness -- and a five-stage lifecycle modeled on hippocampal-neocortical memory consolidation. A retrieval pipeline replaces pure cosine similarity with a four-signal score that mixes semantic relevance, temporal validity, live confidence, and graph-relational importance. A background consolidation agent detects contradictions, builds dependency edges, and propagates updates along those edges as graph-neural-network-style messages. Confidence is governed by a closed-form function combining an Ebbinghaus-style exponential decay, user-feedback reconsolidation, and logarithmic access reinforcement. We formalize the model, relate it to temporal knowledge graph embedding, agentic memory architectures, and uncertainty-aware RAG, and present a reference implementation. On a reproducible synthetic versioned-policy benchmark of 258 vectors and 138 queries, SmartVector roughly doubles top-1 accuracy over plain cosine RAG (62.0% vs. 31.0% on a held-out split), drops stale-answer rate from 35.0% to 13.3%, cuts Expected Calibration Error by nearly 2x (0.244 vs. 0.470), reduces re-embedding cost per single-word edit by 77%, and is robust across contradiction-injection rates from 0% to 75%.
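A confidence rule combining the three ingredients named in the abstract, plus a four-signal score replacing cosine-only ranking, might look like the following sketch. The functional forms, constants, and weights are our guesses, not SmartVector's actual formulas.

```python
import math

def confidence(age_days: float, accesses: int, feedback: float,
               half_life: float = 30.0) -> float:
    """Hypothetical confidence: Ebbinghaus-style exponential decay,
    logarithmic access reinforcement, and a user-feedback term."""
    decay = math.exp(-math.log(2) * age_days / half_life)
    reinforcement = 1.0 + 0.1 * math.log1p(accesses)
    return max(0.0, min(1.0, decay * reinforcement + 0.2 * feedback))

def retrieval_score(semantic: float, temporal: float, conf: float,
                    graph: float, w=(0.5, 0.2, 0.2, 0.1)) -> float:
    """Four-signal score mixing semantic relevance, temporal validity,
    live confidence, and graph-relational importance."""
    return w[0]*semantic + w[1]*temporal + w[2]*conf + w[3]*graph

# A fresh, lightly-used vector vs. a stale, downvoted one (made-up inputs).
fresh = retrieval_score(0.8, 1.0, confidence(2, 5, 0.0), 0.4)
stale = retrieval_score(0.9, 0.1, confidence(400, 0, -0.5), 0.4)
print(f"fresh: {fresh:.3f}  stale: {stale:.3f}")
```

Under these weights the stale vector loses the ranking despite its higher raw semantic similarity, which is the versioned-query failure mode the abstract targets.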

agents
#105

Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence)
4.2
I 4.0 Im 4.5 P 4.0

As Large Language Models (LLMs) scale, data curation has shifted from maximizing volume to optimizing the signal-to-noise ratio by performing quality filtering. However, for many languages, native high-quality data is insufficient to train robust quality classifiers. This work investigates the idea that quality markers in embedding space may show cross-lingual consistency, which would allow high-resource languages to subsidize the filtering of low-resource ones. We evaluate various filtering strategies, including cross-lingual transfer, third quartile sampling (Q3), and retention rate tuning. Our results demonstrate that massive multilingual pooling frequently outperforms monolingual baselines in both rank stability and aggregate accuracy for a 1B model trained on 103B tokens, delivering gains for high-resource languages (1.2% increase in aggregate normalized accuracy for French) and matching or exceeding monolingual baselines for low-resource languages. However, we find that scale alone does not guarantee stability. Furthermore, for high-resource languages like French, we show that refining the decision boundary through third quartile sampling (Q3) or tuning the retention rate is necessary to fully leverage the multilingual signal.
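The two filtering knobs named above, third quartile sampling and retention-rate tuning, reduce to simple threshold rules; a minimal sketch (my own function names, the paper's exact cutoff conventions may differ):

```python
def q3_filter(scored_docs):
    """Keep documents at or above the third quartile of classifier
    scores. scored_docs: list of (doc, score) pairs."""
    scores = sorted(s for _, s in scored_docs)
    q3 = scores[int(0.75 * (len(scores) - 1))]  # nearest-rank 75th percentile
    return [d for d, s in scored_docs if s >= q3]

def retention_filter(scored_docs, retention_rate):
    """Keep the top `retention_rate` fraction of documents by score."""
    ranked = sorted(scored_docs, key=lambda ds: ds[1], reverse=True)
    k = max(1, int(round(retention_rate * len(ranked))))
    return [d for d, _ in ranked[:k]]
```

Cross-lingual transfer then amounts to scoring low-resource documents with a classifier trained on pooled high-resource embeddings before applying either rule.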

evals
#106

Enhancing Research Idea Generation through Combinatorial Innovation and Multi-Agent Iterative Search Strategies

Agents & Tool Use 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use
4.2
I 4.3 Im 3.8 P 4.5

Scientific progress depends on the continual generation of innovative research ideas. However, the rapid growth of scientific literature has greatly increased the cost of knowledge filtering, making it harder for researchers to identify novel directions. Although existing large language model (LLM)-based methods show promise in research idea generation, the ideas they produce are often repetitive and lack depth. To address this issue, this study proposes a multi-agent iterative planning search strategy inspired by combinatorial innovation theory. The framework combines iterative knowledge search with an LLM-based multi-agent system to generate, evaluate, and refine research ideas through repeated interaction, with the goal of improving idea diversity and novelty. Experiments in the natural language processing domain show that the proposed method outperforms state-of-the-art baselines in both diversity and novelty. Further comparison with ideas derived from top-tier machine learning conference papers indicates that the quality of the generated ideas falls between that of accepted and rejected papers. These results suggest that the proposed framework is a promising approach for supporting high-quality research idea generation. The source code and dataset used in this paper are publicly available in a GitHub repository: https://github.com/ChenShuai00/MAGenIdeas. The demo is available at https://huggingface.co/spaces/cshuai20/MAGenIdeas.

agents
#107

pAI/MSc: ML Theory Research with Humans on the Loop

Agents & Tool Use 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use
4.2
I 3.5 Im 4.6 P 4.5

We present pAI/MSc, an open-source, customizable, modular multi-agent system for academic research workflows. Our goal is not autonomous scientific ideation, nor fully automated research. It is narrower and more practical: to reduce by orders of magnitude the human steering required to turn a specified hypothesis into a literature-grounded, mathematically established, experimentally supported, submission-oriented manuscript draft. pAI/MSc is built with a current emphasis on machine learning theory and adjacent quantitative fields.

agents
#108

Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation

Evaluations & Benchmarks 2026-04-22 arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks
4.2
I 3.2 Im 5.3 P 4.0

Situated conversational recommendation (SCR), which utilizes visual scenes grounded in specific environments and natural language dialogue to deliver contextually appropriate recommendations, has emerged as a promising research direction due to its close alignment with real-world scenarios. Compared to traditional recommendations, SCR requires a deeper understanding of dynamic and implicit user preferences, as the surrounding scene often influences users' underlying interests, while both may evolve across conversations. This complexity significantly impacts the timing and relevance of recommendations. To address this, we propose situated preference reasoning (SiPeR), a novel framework that integrates two core mechanisms: (1) Scene transition estimation, which estimates whether the current scene satisfies user needs, and guides the user toward a more suitable scene when necessary; and (2) Bayesian inverse inference, which leverages the likelihood of multimodal large language models (MLLMs) to predict user preferences about candidate items within the scene. Extensive experiments on two representative benchmarks demonstrate SiPeR's superiority in both recommendation accuracy and response generation quality. The code and data are available at https://github.com/DongdingLin/SiPeR.

evals
#109

DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

Robotics 2026-04-22 arXiv cs.CV (Computer Vision) · arXiv — Agents / Tool Use · arXiv — Generative Media / Diffusion
4.2
I 4.3 Im 3.8 P 4.5

Recent advances in video generative models enable the synthesis of realistic human-object interaction videos across a wide range of scenarios and object categories, including complex dexterous manipulations that are difficult to capture with motion capture systems. While the rich interaction knowledge embedded in these synthetic videos holds strong potential for motion planning in dexterous robotic manipulation, their limited physical fidelity and purely 2D nature make them difficult to use directly as imitation targets in physics-based character control. We present DeVI (Dexterous Video Imitation), a novel framework that leverages text-conditioned synthetic videos to enable physically plausible dexterous agent control for interacting with unseen target objects. To overcome the imprecision of generative 2D cues, we introduce a hybrid tracking reward that integrates 3D human tracking with robust 2D object tracking. Unlike methods relying on high-quality 3D kinematic demonstrations, DeVI requires only the generated video, enabling zero-shot generalization across diverse objects and interaction types. Extensive experiments demonstrate that DeVI outperforms existing approaches that imitate 3D human-object interaction demonstrations, particularly in modeling dexterous hand-object interactions. We further validate the effectiveness of DeVI in multi-object scenes and text-driven action diversity, showcasing the advantage of using video as an HOI-aware motion planner.

robotics
#110

Scoring Show HN submissions for AI design patterns

Industry 2026-04-22 Hacker News — AI front page
4.2
I 3.5 Im 3.8 P 5.3

Article URL: https://www.adriankrebs.ch/blog/design-slop/ · Comments: https://news.ycombinator.com/item?id=47864393 · 301 points, 214 comments

industry
#111

CHASM: Unveiling Covert Advertisements on Chinese Social Media

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision)
4.1
I 3.5 Im 3.8 P 5.0

Current benchmarks for evaluating large language models (LLMs) in social media moderation completely overlook a serious threat: covert advertisements, which disguise themselves as regular posts to deceive and mislead consumers into making purchases, leading to significant ethical and legal concerns. In this paper, we present CHASM, a first-of-its-kind dataset designed to evaluate the capability of Multimodal Large Language Models (MLLMs) in detecting covert advertisements on social media. CHASM is a high-quality, anonymized, manually curated dataset consisting of 4,992 instances, based on real-world scenarios from the Chinese social media platform Rednote. The dataset was collected and annotated under strict privacy protection and quality control protocols. It includes many product experience sharing posts that closely resemble covert advertisements, making the dataset particularly challenging. The results show that under both zero-shot and in-context learning settings, none of the current MLLMs are sufficiently reliable for detecting covert advertisements. Our further experiments revealed that fine-tuning open-source MLLMs on our dataset yielded noticeable performance gains. However, significant challenges persist, such as detecting subtle cues in comments and differences in visual and textual structures. We provide in-depth error analysis and outline future research directions. We hope our study can serve as a call for the research community and platform moderators to develop more precise defenses against this emerging threat.

evals
#112

MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation

Robotics 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.RO (Robotics) · arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence)
4.1
I 3.5 Im 3.8 P 5.0

Industrial robot applications require increasingly flexible systems that non-expert users can easily adapt for varying tasks and environments. However, different adaptations benefit from different interaction modalities. We present an interactive framework that enables robot skill adaptation through three complementary modalities: kinesthetic touch for precise spatial corrections, natural language for high-level semantic modifications, and a graphical web interface for visualizing geometric relations and trajectories, inspecting and adjusting parameters, and editing via-points by drag-and-drop. The framework integrates five components: energy-based human-intention detection, a tool-based LLM architecture (where the LLM selects and parameterizes predefined functions rather than generating code) for safe natural language adaptation, Kernelized Movement Primitives (KMPs) for motion encoding, probabilistic Virtual Fixtures for guided demonstration recording, and ergodic control for surface finishing. We demonstrate that this tool-based LLM architecture generalizes skill adaptation from KMPs to ergodic control, enabling voice-commanded surface finishing. Validation on a 7-DoF torque-controlled robot at the Automatica 2025 trade fair demonstrates the practical applicability of our approach in industrial settings.

robotics
#113

AAC: Admissible-by-Architecture Differentiable Landmark Compression for ALT

Evaluations & Benchmarks 2026-04-22 arXiv cs.RO (Robotics) · arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks
4.1
I 3.5 Im 3.8 P 5.0

We introduce AAC (Architecturally Admissible Compressor), a differentiable landmark-selection module for ALT (A*, Landmarks, and Triangle inequality) shortest-path heuristics whose outputs are admissible by construction: each forward pass is a row-stochastic mixture of triangle-inequality lower bounds, so the heuristic is admissible for every parameter setting without requiring convergence, calibration, or projection. At deployment, the module reduces to classical ALT on a learned subset, composing end-to-end with neural encoders while preserving the classical toolchain. The construction is the first differentiable instance of the compress-while-preserving-admissibility tradition in classical heuristic search. Under a matched per-vertex memory protocol, we establish that ALT with farthest-point-sampling landmarks (FPS-ALT) has provably near-optimal coverage on metric graphs, leaving at most a few percentage points of headroom for any selector. AAC operates near this ceiling: the gap is 0.9-3.9 percentage points on 9 road networks and at most 1.3 percentage points on synthetic graphs, with zero admissibility violations across 1,500+ queries and all logged runs. At matched memory, AAC is also 1.2-1.5x faster than FPS-ALT at the median query on DIMACS road networks, amortizing its offline cost within 170-1,924 queries. A controlled ablation isolates the binding constraint: training-objective drift under default initialization, not architectural capacity; identity-on-first-m initialization closes the expansion-count gap entirely. We release the module, a reusable matched-memory benchmarking protocol with paired two-one-sided-test (TOST) equivalence and pre-registration, and a reference compressed-differential-heuristics baseline.

evals
#114

Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning

Robotics 2026-04-22 arXiv cs.RO (Robotics) · arXiv cs.LG (Machine Learning) · arXiv — Reinforcement Learning · arXiv — Agents / Tool Use
4.1
I 3.5 Im 3.8 P 5.0

The temporal lag between actions and their long-term consequences makes credit assignment a challenge when learning goal-directed behaviors from data. Generative world models capture the distribution of future states an agent may visit, indicating that they have captured temporal information. How can that temporal information be extracted to perform credit assignment? In this paper, we formalize how the temporal information stored in world models encodes the underlying geometry of the world. Leveraging optimal transport, we extract this geometry from a learned model of the occupancy measure into a reward function that captures goal-reaching information. Our resulting method, Occupancy Reward Shaping, largely mitigates the problem of credit assignment in sparse reward settings. ORS provably does not alter the optimal policy, yet empirically improves performance by 2.2x across 13 diverse long-horizon locomotion and manipulation tasks. Moreover, we demonstrate the effectiveness of ORS in the real world for controlling nuclear fusion on 3 Tokamak control tasks. Code: https://github.com/aravindvenu7/occupancy_reward_shaping; Website: https://aravindvenu7.github.io/website/ors/

robotics
#115

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

Interpretability 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv — Reinforcement Learning · arXiv — Mechanistic Interpretability
4.1
I 3.5 Im 3.8 P 5.0

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Language Models (LLMs) by leveraging direct outcome verification instead of learned reward models. Building on this paradigm, Group Relative Policy Optimization (GRPO) eliminates the need for critic models but suffers from indiscriminate credit assignment for intermediate steps, which limits its ability to identify effective reasoning strategies and incurs overthinking. In this work, we introduce a model-free and verifiable process supervision via probing the model's belief in the correct answer throughout its reasoning trajectory. By segmenting the generation into discrete steps and tracking the conditional probability of the correct answer appended at each segment boundary, we efficiently compute interpretable segment-wise progress measurements to refine GRPO's trajectory-level feedback. This approach enables more targeted and sample-efficient policy updates, while avoiding the need for intermediate supervision derived from costly Monte Carlo rollouts or auxiliary models. Experiments on mathematical and general-domain benchmarks show consistent gains over GRPO across diverse models: up to 2.6-point accuracy improvements and 13.7% reasoning-length reductions on math tasks, and up to 2.4 points and 4% on general-domain tasks, demonstrating strong generalization.
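The belief-probing idea described above, tracking the conditional probability of the correct answer at each segment boundary, can be sketched in a few lines (the `answer_prob` callback and the delta-based progress measure are my own simplifications of the paper's method):

```python
def segment_progress(segments, answer_prob):
    """Probe the model's belief in the correct answer after each
    reasoning segment. `answer_prob(prefix)` is a hypothetical callback
    returning P(correct answer | reasoning prefix), e.g. from the
    model's log-probs. Returns the belief trajectory and per-segment
    deltas usable to refine GRPO's trajectory-level feedback."""
    probs, prefix = [], ""
    for seg in segments:
        prefix += seg
        probs.append(answer_prob(prefix))
    # progress of segment i = the change in belief it induced
    deltas = [probs[0]] + [probs[i] - probs[i - 1] for i in range(1, len(probs))]
    return probs, deltas
```

Segments with near-zero or negative deltas are candidates for the "overthinking" the abstract mentions, which is what makes the measurements useful for targeted credit assignment.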

interpretability
#116

Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

Multimodal 2026-04-22 arXiv cs.CV (Computer Vision)
4.1
I 4.3 Im 4.5 P 3.5

Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-loop "blind drawing" approach, where models generate symbolic code sequences without perceiving intermediate visual outcomes. This methodology severely underutilizes the powerful visual priors embedded in MLLMs' vision encoders, treating SVG generation as a disjointed textual sequence modeling task rather than an integrated visuo-spatial one. Consequently, models struggle to reason about partial canvas states and implicit occlusion relationships, which are visually explicit but textually ambiguous. To bridge this gap, we propose Render-in-the-Loop, a novel generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step, leveraging on-the-fly feedback to guide subsequent generation. However, we demonstrate that applying this visual loop naively to off-the-shelf models is suboptimal due to their inability to leverage incremental visual-code mappings. To address this, we first utilize fine-grained path decomposition to construct dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy to condition the next primitive generation on intermediate visual states. Furthermore, a Render-and-Verify (RaV) inference mechanism is proposed to effectively filter degenerate and redundant primitives. Our framework, instantiated on a multimodal foundation model, outperforms strong open-weight baselines on the standard MMSVGBench. This result highlights the remarkable data efficiency and generalization capability of our Render-in-the-Loop paradigm for both Text-to-SVG and Image-to-SVG tasks.

multimodal
#118

MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

Agents & Tool Use 2026-04-22 arXiv — Agents / Tool Use
4.1
I 3.5 Im 5.3 P 3.5

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review. Methods: We developed MedSkillAudit, a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline. Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch. Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.

agents
#119

Autonomous Emergence of Hamiltonian in Deep Generative Models

Generative Media 2026-04-22 arXiv — Mechanistic Interpretability
4.1
I 3.5 Im 5.1 P 3.5

The unprecedented predictive success of deep generative models in complex many-body systems, such as AlphaFold3, raises an epistemological question: do these networks merely memorize data distributions via high-dimensional interpolation, or do they autonomously deduce the underlying physical laws? To address this, we introduce a rigorous algebraic framework to extract the implicit physical interactions learned by generative models. By establishing an exact equivalence between the zero-noise limit of a Riemannian diffusion score field and the thermodynamic restoring force, we utilize the trained neural network as a direct force estimator. Applying this framework to a sequence-dependent, frustrated 1D O(3) spin glass, we probe the latent representations of an O(3)-equivariant attention architecture trained solely on thermal equilibrium snapshots. Without incorporating any energetic priors, an overdetermined linear inversion successfully recovers the microscopic Hamiltonian parameters of the spin system. The inferred Hamiltonian parameters exhibit a 99.7% cosine similarity with the ground-truth interaction parameters. Furthermore, these sparse local parameters alone are sufficient to explain 87% of the variance in the continuous force field predicted by the network. Our results provide quantitative, falsifiable evidence that deep generative architectures do not merely perform statistical pattern matching, but autonomously discover and internalize the underlying physical rules.

generative_media
#120
4.1
I 3.5 Im 5.3 P 3.5

Designing regulatory DNA elements with precise cell-type-specific activity is broadly relevant for cell engineering and gene therapy. Deep generative models can generate functional gene-regulatory elements, but existing methods struggle to achieve high specificity against undesired cell types while adhering to the genome's natural regulatory grammar. Here, we introduce DNA-CRAFT, a generative framework that integrates class-conditioned discrete diffusion with Monte Carlo tree search to design cell-type-specific and biologically faithful regulatory elements. We first train a discrete diffusion model on the ENCODE registry of 3.2 million candidate regulatory elements. Second, we condition the model to learn class-specific regulatory grammars of naturally occurring DNA sequences, including enhancers and promoters. Third, we employ conditional Monte Carlo tree guidance, an inference-time alignment algorithm designed to maximize the differential regulatory activity between desired and undesired cell types. By benchmarking DNA-CRAFT on regulatory sequence design tasks for human cell lines and immune cell types, we demonstrate that our model generates sequences with high predicted cell-type-specific activity and biological fidelity, achieving the best trade-offs compared to methods that use diffusion, autoregressive models, and gradient-based optimization.

generative_media
#122

How LiteParse Turns PDFs Into Text: A Deep Dive Into the Grid Projection Algorithm

Safety, Policy & Regulation 2026-04-22 LlamaIndex Blog
4.1
I 3.5 Im 5.3 P 3.5

Technical deep-dive on LiteParse's grid projection algorithm: extracts alignment anchors from recurring X positions, classifies every text item by where it snaps, and projects the result onto a monospace character grid to preserve document structure.
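The three steps named in the summary, anchor extraction, snapping, and grid projection, can be roughly sketched as follows (a toy reconstruction of the idea; the function names, tolerance, and character width are my own assumptions, not LiteParse's actual code):

```python
from collections import Counter

def grid_project(items, char_w=6, tol=2):
    """items: list of (x, y, text) positioned text fragments from a PDF.
    1. Bucket x positions; recurring ones become alignment anchors.
    2. Snap each fragment to its nearest anchor.
    3. Project onto a monospace character grid, preserving columns."""
    xs = Counter(round(x / tol) * tol for x, _, _ in items)
    anchors = sorted(x for x, n in xs.items() if n >= 2)
    lines = {}
    for x, y, text in items:
        snapped = min(anchors, key=lambda a: abs(a - x)) if anchors else x
        col = int(snapped // char_w)
        lines.setdefault(y, {})[col] = text
    out = []
    for y in sorted(lines):
        row, line = lines[y], ""
        for col in sorted(row):
            line = line.ljust(col) + row[col]
        out.append(line)
    return "\n".join(out)
```

Snapping to recurring anchors is what keeps table columns vertically aligned in the monospace output even when individual fragments jitter by a point or two.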

safety_policy
#123

Response time of lateral predictive coding and benefits of modular structures

Interpretability 2026-04-22 arXiv cs.NE (Neural & Evolutionary Computing)
4.0
I 3.5 Im 5.0 P 3.5

Lateral predictive coding (LPC) is a simple theoretical framework for understanding feature detection in biological neural circuits. Recent theoretical work [Huang et al., Phys. Rev. E 112, 034304 (2025)] successfully constructed optimal LPC networks capable of extracting non-Gaussian hidden input features by imposing a tradeoff between energetic cost and information robustness, but the resulting dynamical systems of recurrent interactions can be very slow to respond to external inputs. We investigate response-time reduction in the present paper. We find that the characteristic response time of the LPC system can be minimized to closely approach the lower-bound value without compromising the mean predictive error (energetic cost) or the information robustness of signal transmission. We further demonstrate that optimal LPC networks with a modular structural organization and an extensively reduced number of lateral interactions perform as well as all-to-all fully connected networks in terms of feature detection performance, response time, energetic cost, and information robustness.

interpretability
#124

Intersectional Fairness in Large Language Models

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language)
4.0
I 3.2 Im 5.3 P 3.5

Large Language Models (LLMs) are increasingly deployed in socially sensitive settings, raising concerns about fairness and biases, particularly across intersectional demographic attributes. In this paper, we systematically evaluate intersectional fairness in six LLMs using ambiguous and disambiguated contexts from two benchmark datasets. We assess LLM behavior using bias scores, subgroup fairness metrics, accuracy, and consistency through multi-run analysis across contexts and negative and non-negative question polarities. Our results show that while modern LLMs generally perform well in ambiguous contexts, this limits the informativeness of fairness metrics due to sparse non-unknown predictions. In disambiguated contexts, LLM accuracy is influenced by stereotype alignment, with models being more accurate when the correct answer reinforces a stereotype than when it contradicts it. This pattern is especially pronounced in race-gender intersections, where directional bias toward stereotypes is stronger. Subgroup fairness metrics further indicate that, despite low observed disparity in some cases, outcome distributions remain uneven across intersectional groups. Across repeated runs, responses also vary in consistency, including stereotype-aligned responses. Overall, our findings show that apparent model competence is partly associated with stereotype-consistent cues, and no evaluated LLM achieves consistently reliable or fair behavior across intersectional settings. These findings highlight the need for evaluation beyond accuracy, emphasizing the importance of combining bias, subgroup fairness, and consistency metrics across intersectional groups, contexts, and repeated runs.

evals
#125

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

Agents & Tool Use 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv — Agents / Tool Use
4.0
I 3.5 Im 4.6 P 4.0

Online lifelong learning enables agents to accumulate experience across interactions and continually improve on long-horizon tasks. However, existing methods typically treat retrieval from past experience as a passive operation, triggering it only at task initialization or after completing a step. Consequently, agents often fail to identify knowledge gaps during interaction and proactively retrieve the most useful experience for the current decision. To address this limitation, we present ProactAgent, an experience-driven lifelong learning framework for proactive retrieval over a structured experience base. We first introduce Experience-Enhanced Online Evolution (ExpOnEvo), which enables continual improvement through both policy updates and memory refinement. The experience base organizes historical interactions into typed repositories, including factual memory, episodic memory, and behavioral skills, so that retrieval can provide both relevant evidence and actionable guidance. On top of this, we propose Proactive Reinforcement Learning-based Retrieval (ProactRL), which models retrieval as an explicit policy action and learns when and what to retrieve via paired-branch process rewards. By comparing continuations from identical interaction prefixes with and without retrieval, ProactRL provides step-level supervision for retrieval decisions, encouraging retrieval only when it leads to better task outcomes or higher efficiency. Experiments on SciWorld, AlfWorld, and StuLife show that ProactAgent consistently improves lifelong agent performance, achieving success rates of 73.50% on SciWorld and 71.28% on AlfWorld while substantially reducing retrieval overhead, and attains performance competitive with proprietary models on StuLife.
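The paired-branch process reward described above reduces to a simple counterfactual comparison; a minimal sketch (callback names and signatures are my own, not the paper's API):

```python
def retrieval_process_reward(prefix, rollout, score):
    """From the same interaction prefix, roll out one continuation with
    retrieval and one without, and reward the retrieval decision by the
    outcome gap. `rollout(prefix, retrieve=...)` and `score(...)` are
    hypothetical callbacks standing in for the agent and the task
    evaluator."""
    with_retrieval = score(rollout(prefix, retrieve=True))
    without_retrieval = score(rollout(prefix, retrieve=False))
    return with_retrieval - without_retrieval  # positive => retrieval helped
```

A positive reward teaches the policy to retrieve at this step; a zero or negative one discourages retrieval calls that add cost without improving the outcome.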

agents
#126

Knowledge Capsules: Structured Nonparametric Memory Units for LLMs

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence)
4.0
I 4.3 Im 3.8 P 4.0

Large language models (LLMs) encode knowledge in parametric weights, making it costly to update or extend without retraining. Retrieval-augmented generation (RAG) mitigates this limitation by appending retrieved text to the input, but operates purely through context expansion, where external knowledge competes as tokens within the attention mechanism. As a result, its influence is indirect and often unstable, particularly in long-context and multi-hop reasoning scenarios. We propose Knowledge Capsules, structured nonparametric memory units that represent normalized relational knowledge and can be constructed directly from document corpora using a frozen base model. Instead of injecting knowledge as text, we introduce an External Key Value Injection (KVI) framework that compiles capsules into attention-compatible key-value representations, enabling external knowledge to directly participate in the model's attention computation. By shifting knowledge integration from context-level augmentation to memory-level interaction, the proposed framework consistently outperforms RAG and GraphRAG across multiple QA benchmarks, with improved stability and accuracy in long-context and multi-hop reasoning, while requiring no parameter updates.
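The core mechanism, external key-value pairs participating directly in the softmax rather than as extra context tokens, can be illustrated with a toy single-query attention step (a sketch of the general idea, not the paper's implementation):

```python
import math

def attention_with_external_kv(q, keys, values, ext_keys, ext_values):
    """Single-query scaled dot-product attention where capsule-derived
    key/value pairs (ext_keys/ext_values) are appended to the model's
    own K/V, so external knowledge competes in the attention softmax
    directly instead of being re-encoded as input tokens."""
    all_k = keys + ext_keys
    all_v = values + ext_values
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
              for k in all_k]
    m = max(scores)                      # stabilized softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(all_v[0])
    return [sum(w * v[d] for w, v in zip(weights, all_v)) for d in range(dim)]
```

Because the injected pairs never occupy context positions, they add no sequence length, which is one plausible reason the approach would be more stable in long-context settings than plain context expansion.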

evals
#127

HaS: Accelerating RAG through Homology-Aware Speculative Retrieval

Agents & Tool Use 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv — Agents / Tool Use
4.0
I 3.5 Im 4.6 P 4.0

Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as the knowledge databases grow in size. Existing acceleration strategies either compromise accuracy through approximate retrieval, or achieve marginal gains by reusing results of strictly identical queries. We propose HaS, a homology-aware speculative retrieval framework that performs low-latency speculative retrieval over restricted scopes to obtain candidate documents, followed by validating whether they contain the required knowledge. The validation, grounded in the homology relation between queries, is formulated as a homologous query re-identification task: once a previously observed query is identified as a homologous re-encounter of the incoming query, the draft is deemed acceptable, allowing the system to bypass slow full-database retrieval. Benefiting from the prevalence of homologous queries under real-world popularity patterns, HaS achieves substantial efficiency gains. Extensive experiments demonstrate that HaS reduces retrieval latency by 23.74% and 36.99% across datasets with only a 1-2% marginal accuracy drop. As a plug-and-play solution, HaS also significantly accelerates complex multi-hop queries in modern agentic RAG pipelines. Source code is available at: https://github.com/ErrEqualsNil/HaS.
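The draft-then-validate flow described above can be sketched as a cache with a homology check and a slow-path fallback (all names here are my own, not HaS's actual interfaces):

```python
def retrieve_with_speculation(query, cache, is_homologous, full_retrieve):
    """Speculative retrieval sketch: draft documents from a small cache
    of previously answered queries; accept the draft if the incoming
    query is a homologous re-encounter of a cached one, otherwise fall
    back to slow full-database retrieval and cache the result."""
    for past_query, docs in cache:
        if is_homologous(query, past_query):
            return docs  # draft validated: skip full-database retrieval
    docs = full_retrieve(query)
    cache.append((query, docs))
    return docs
```

The latency win comes entirely from how often `is_homologous` fires, which is why the paper leans on real-world query popularity patterns.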

agents
#128

Lexicographic Minimum-Violation Motion Planning using Signal Temporal Logic

Interpretability 2026-04-22 arXiv cs.RO (Robotics) · arXiv — Efficiency (Quantization, MoE, Inference)
4.0
I 3.5 Im 4.6 P 4.0

Motion planning for autonomous vehicles often requires satisfying multiple conditionally conflicting specifications. In situations where not all specifications can be met simultaneously, minimum-violation motion planning maintains system operation by minimizing violations of specifications in accordance with their priorities. Signal temporal logic (STL) provides a formal language for rigorously defining these specifications and enables the quantitative evaluation of their violations. However, a total ordering of specifications yields a lexicographic optimization problem, which is typically computationally expensive to solve using standard methods. We address this problem by transforming the multi-objective lexicographic optimization problem into a single-objective scalar optimization problem using non-uniform quantization and bit-shifting. Specifically, we extend a deterministic model predictive path integral (MPPI) solver to efficiently solve optimization problems without quadratic input cost. Additionally, a novel predicate-robustness measure that combines spatial and temporal violations is introduced. Our results show that the proposed method offers an interpretable and scalable solution for lexicographic STL minimum-violation motion planning within a single-objective solver framework.
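The quantize-and-shift trick that turns the lexicographic problem into a single scalar objective can be shown in a few lines. This sketch uses uniform quantization for simplicity (the paper uses non-uniform levels), and the cost ranges are illustrative assumptions:

```python
def lex_scalarize(costs, bits=8, cost_max=1.0):
    """Collapse prioritized costs (highest priority first, each assumed in
    [0, cost_max]) into one integer: quantize each cost to `bits` bits and
    bit-shift so any improvement in a higher-priority term outweighs all
    lower-priority terms combined."""
    scalar = 0
    for c in costs:
        q = min(int(c / cost_max * ((1 << bits) - 1)), (1 << bits) - 1)
        scalar = (scalar << bits) | q
    return scalar

# (safety, comfort) pairs: a lower safety violation always wins;
# comfort only breaks ties among equal safety levels.
a = lex_scalarize([0.1, 0.9])
b = lex_scalarize([0.2, 0.0])
print(a < b)  # True: better on the top-priority term dominates
```

Because the result is an ordinary scalar, a single-objective solver such as the MPPI variant in the paper can minimize it directly.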

interpretability
#129

Bimanual Robot Manipulation via Multi-Agent In-Context Learning

Robotics 2026-04-22 arXiv cs.RO (Robotics) · arXiv — Agents / Tool Use
4.0
I 4.3 Im 3.8 P 4.0

Large Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables off-the-shelf, text-only LLMs to predict robot actions without any task-specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging, as the high-dimensional joint action space and tight inter-arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In-Context Learning), the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning. BiCICLe frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. This naturally extends to Arms' Debate, an iterative refinement process, and to the introduction of a third LLM-as-Judge to evaluate and select the most plausible coordinated trajectories. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves up to 71.1% average success rate, outperforming the best training-free baseline by 6.7 percentage points and surpassing most supervised methods. We further demonstrate strong few-shot generalization on novel tasks.
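The leader-follower decoupling is the structural idea worth remembering: one joint bimanual prediction becomes two sequential single-arm predictions, the second conditioned on the first. A toy sketch with plain callables standing in for prompted LLMs (all names here are illustrative assumptions):

```python
def bimanual_step(leader_policy, follower_policy, obs):
    """Leader-follower decoupling sketch: the leader arm acts first from
    the observation; the follower arm conditions on both the observation
    and the leader's chosen action, so each predictor handles a much
    smaller context than a joint two-arm policy would."""
    a_leader = leader_policy(obs)
    a_follower = follower_policy(obs, a_leader)
    return a_leader, a_follower

leader = lambda obs: ("reach", obs["object"])
follower = lambda obs, a: ("support", a[1])   # coordinate with leader's target
print(bimanual_step(leader, follower, {"object": "tray"}))
# (('reach', 'tray'), ('support', 'tray'))
```

The paper's "Arms' Debate" then iterates this step and lets a judge model pick among candidate coordinated trajectories.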

robotics
#130

LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image

Generative Media 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision)
4.0
I 4.3 Im 3.8 P 4.0

Reconstructing 3D Human-Object Interaction from an RGB image is essential for perceptive systems. Yet, this remains challenging as it requires capturing the subtle physical coupling between the body and objects. While current methods rely on sparse, binary contact cues, these fail to model the continuous proximity and dense spatial relationships that characterize natural interactions. We address this limitation via InterFields, a representation that encodes dense, continuous proximity across the entire body and object surfaces. However, inferring these fields from single images is inherently ill-posed. To tackle this, our intuition is that interaction patterns are characteristically structured by the action and object geometry. We capture this structure in LEXIS, a novel discrete manifold of interaction signatures learned via a VQ-VAE. We then develop LEXIS-Flow, a diffusion framework that leverages LEXIS signatures to estimate human and object meshes alongside their InterFields. Notably, these InterFields help in a guided refinement that ensures physically-plausible, proximity-aware reconstructions without requiring post-hoc optimization. Evaluation on Open3DHOI and BEHAVE shows that LEXIS-Flow significantly outperforms existing SotA baselines in reconstruction, contact, and proximity quality. Our approach not only improves generalization but also yields reconstructions perceived as more realistic, moving us closer to holistic 3D scene understanding. Code & models will be public at https://anticdimi.github.io/lexis.

generative_media
#131

Lifecycle-Aware Federated Continual Learning in Mobile Autonomous Systems

Frontier LLMs 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision)
4.0
I 3.5 Im 4.6 P 4.0

Federated continual learning (FCL) allows distributed autonomous fleets to adapt collaboratively to evolving terrain types across extended mission lifecycles. However, current approaches face several key challenges: 1) they use uniform protection strategies that do not account for the varying sensitivities to forgetting on different network layers; 2) they focus primarily on preventing forgetting during training, without addressing the long-term effects of cumulative drift; and 3) they often depend on idealized simulations that fail to capture the real-world heterogeneity present in distributed fleets. In this paper, we propose a lifecycle-aware dual-timescale FCL framework that incorporates training-time (pre-forgetting) prevention and (post-forgetting) recovery. Under this framework, we design a layer-selective rehearsal strategy that mitigates immediate forgetting during local training, and a rapid knowledge recovery strategy that restores degraded models after long-term cumulative drift. We present a theoretical analysis that characterizes heterogeneous forgetting dynamics and establishes the inevitability of long-term degradation. Our experimental results show that this framework achieves up to 8.3% mIoU improvement over the strongest federated baseline and up to 31.7% over conventional fine-tuning. We also deploy the FCL framework on a real-world rover testbed to assess system-level robustness under realistic constraints; the testing results further confirm the effectiveness of our FCL design.

frontier_llm
#132

F²LP-AP: Fast & Flexible Label Propagation with Adaptive Propagation Kernel

Evaluations & Benchmarks 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks
4.0
I 4.3 Im 3.8 P 4.0

Semi-supervised node classification is a foundational task in graph machine learning, yet state-of-the-art Graph Neural Networks (GNNs) are hindered by significant computational overhead and reliance on strong homophily assumptions. Traditional GNNs require expensive iterative training and multi-layer message passing, while existing training-free methods, such as Label Propagation, lack adaptability to heterophilous graph structures. This paper presents F²LP-AP (Fast and Flexible Label Propagation with Adaptive Propagation Kernel), a training-free, computationally efficient framework that adapts to local graph topology. Our method constructs robust class prototypes via the geometric median and dynamically adjusts propagation parameters based on the Local Clustering Coefficient (LCC), enabling effective modeling of both homophilous and heterophilous graphs without gradient-based training. Extensive experiments across diverse benchmark datasets demonstrate that F²LP-AP achieves competitive or superior accuracy compared to trained GNNs, while significantly outperforming existing baselines in computational efficiency.
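The training-free core is classic label propagation with a per-node twist: the mixing weight comes from the local clustering coefficient, so tightly clustered (homophilous-looking) neighborhoods are trusted more. The LCC-to-weight mapping below is an illustrative choice, not the paper's exact adaptive kernel:

```python
import numpy as np

def adaptive_label_prop(A, Y, iters=20):
    """Label propagation where each node's mixing weight alpha is set from
    its local clustering coefficient (LCC). A is a symmetric adjacency
    matrix; Y is one-hot labels with all-zero rows for unlabeled nodes."""
    d = A.sum(1)
    tri = np.diag(A @ A @ A) / 2.0                  # triangles per node
    denom = np.maximum(d * (d - 1) / 2.0, 1.0)
    lcc = tri / denom
    alpha = 0.5 + 0.4 * lcc                         # illustrative: in [0.5, 0.9]
    P = A / np.maximum(d, 1.0)[:, None]             # row-normalized adjacency
    F = Y.astype(float).copy()
    mask = Y.sum(1) > 0                             # labeled nodes
    for _ in range(iters):
        F = alpha[:, None] * (P @ F) + (1 - alpha[:, None]) * F
        F[mask] = Y[mask]                           # clamp known labels
    return F.argmax(1)

# Two triangles joined by one bridge edge; one seed label per triangle.
A = np.array([[0,1,1,0,0,0],[1,0,1,0,0,0],[1,1,0,1,0,0],
              [0,0,1,0,1,1],[0,0,0,1,0,1],[0,0,0,1,1,0]], float)
Y = np.zeros((6, 2)); Y[0, 0] = 1; Y[5, 1] = 1
print(adaptive_label_prop(A, Y))  # [0 0 0 1 1 1]
```

No gradients are ever computed, which is the source of the efficiency claim; the paper's prototype construction via the geometric median is omitted here.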

evals
#133

Storm Surge Modeling, Bias Correction, Graph Neural Networks, Graph Convolution Networks

Evaluations & Benchmarks 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence)
4.0
I 4.3 Im 3.8 P 4.0

Storm surge forecasting remains a critical challenge in mitigating the impacts of tropical cyclones on coastal regions, particularly given recent trends of rapid intensification and increasing nearshore storm activity. Traditional high fidelity numerical models such as ADCIRC, while robust, are often hindered by inevitable uncertainties arising from various sources. To address these challenges, this study introduces StormNet, a spatio-temporal graph neural network (GNN) designed for bias correction of storm surge forecasts. StormNet integrates graph convolutional (GCN) and graph attention (GAT) mechanisms with long short-term memory (LSTM) components to capture complex spatial and temporal dependencies among water-level gauge stations. The model was trained using historical hurricane data from the U.S. Gulf Coast and evaluated on Hurricane Idalia (2023). Results demonstrate that StormNet can effectively reduce the root mean square error (RMSE) in water-level predictions by more than 70% for 48-hour forecasts and above 50% for 72-hour forecasts, as well as outperform a sequential LSTM baseline, particularly for longer prediction horizons. The model also exhibits low training time, enhancing its applicability in real-time operational forecasting systems. Overall, StormNet provides a computationally efficient and physically meaningful framework for improving storm surge prediction accuracy and reliability during extreme weather events.

evals
#134

Improving clinical interpretability of linear neuroimaging models through feature whitening

Interpretability 2026-04-22 arXiv cs.LG (Machine Learning)
4.0
I 3.5 Im 5.0 P 3.5

Linear models are widely used in computational neuroimaging to identify biomarkers associated with brain pathologies. However, interpreting the learned weights remains challenging, as they do not always yield clinically meaningful insights. This difficulty arises in part from the inherent correlation between brain regions, which causes linear weights to reflect shared rather than region-specific contributions. In particular, some groups of regions, including homologous structures in the left and right hemispheres, are known to exhibit strong anatomical correlations. In this work, we leverage this prior neuroanatomical knowledge to introduce a whitening approach applied to groups of regions with known shared variance, designed to disentangle overlapping information across correlated brain measures. We additionally propose a regularized variant that allows controlled tuning of the degree of decorrelation. We evaluate this method using region-of-interest features in two psychiatric classification tasks, distinguishing individuals with bipolar disorder or schizophrenia from healthy controls. Importantly, unlike PCA or ICA which use whitening as a dimensionality reduction step, our approach decorrelates anatomically informed pairs of neuroanatomical regions while retaining the full input signal, making it specifically suited for feature interpretation rather than feature selection. Our findings demonstrate that whitening improves the interpretability of model weights while preserving predictive performance, providing a robust framework for linking linear model outputs to neurobiological mechanisms.
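The key move is whitening only within anatomically defined groups (e.g. a left/right homologous pair) while leaving other features untouched, so linear weights stop absorbing shared variance. A minimal ZCA-based sketch with a shrinkage knob for the regularized variant; the specific mapping is an illustrative assumption, not the authors' code:

```python
import numpy as np

def group_whiten(X, groups, lam=0.0):
    """ZCA-whiten each listed group of columns, leaving the rest untouched.
    lam in [0, 1] shrinks the group covariance toward its diagonal, giving
    a controlled degree of decorrelation (lam=1 disables it entirely)."""
    Xw = X.astype(float).copy()
    for idx in groups:
        G = Xw[:, idx] - Xw[:, idx].mean(0)
        C = np.cov(G, rowvar=False)
        C = (1 - lam) * C + lam * np.diag(np.diag(C))
        vals, vecs = np.linalg.eigh(C)
        W = vecs @ np.diag(vals ** -0.5) @ vecs.T     # ZCA transform C^{-1/2}
        Xw[:, idx] = G @ W
    return Xw

rng = np.random.default_rng(1)
left = rng.normal(size=200)
right = 0.8 * left + 0.2 * rng.normal(size=200)       # correlated "hemispheres"
X = np.column_stack([left, right, rng.normal(size=200)])
Xw = group_whiten(X, groups=[[0, 1]])
print(abs(np.corrcoef(Xw[:, 0], Xw[:, 1])[0, 1]) < 1e-8)  # True
```

Unlike PCA/ICA pipelines, the full input dimensionality is retained, which is exactly what keeps the weights interpretable per region.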

interpretability
#135

QuanForge: A Mutation Testing Framework for Quantum Neural Networks

State Space Models 2026-04-22 arXiv cs.AI (Artificial Intelligence)
4.0
I 3.5 Im 5.0 P 3.5

With the growing synergy between deep learning and quantum computing, Quantum Neural Networks (QNNs) have emerged as a promising paradigm by leveraging quantum parallelism and entanglement. However, testing QNNs remains underexplored due to their complex quantum dynamics and limited interpretability. Developing a mutation testing technique for QNNs is promising while requires addressing stochastic factors, including the inherent randomness of mutation operators and quantum measurements. To tackle these challenges, we propose QuanForge, a mutation testing framework specifically designed for QNNs. We first introduce statistical mutation killing to provide a more reliable criterion. QuanForge incorporates nine post-training mutation operators at both gate and parameter levels, capable of simulating various potential errors in quantum circuits. Finally, a mutant generation algorithm is formalized that systematically produces effective mutants, thereby enabling a robust and reliable mutation analysis. Through extensive experiments on benchmark datasets and QNN architectures, we show that QuanForge can effectively distinguish different test suites and localize vulnerable circuit regions, providing insights for data enhancement and structural assessment of QNNs. We also analyze the generation capabilities of different operators and evaluate performance under simulated noisy conditions to assess the practical feasibility of QuanForge for future quantum devices.

ssm
#136

A Field Guide to Decision Making

Agents & Tool Use 2026-04-22 arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use
4.0
I 3.5 Im 4.6 P 4.0

High-consequence decision making demands peak performance from individuals in positions of responsibility. Such executive authority bears the obligation to act despite uncertainty, limited resources, time constraints, and accountability risks. Tools and strategies to motivate confidence and foster risk tolerance must confront informational noise and can provide qualified accountability. Machine intelligence augments human cognition and perception to improve situational awareness, decision framing, flexibility, and coherence through agentic stewardship of contextual metadata. We examine systemic and behavioral factors crucial to address in scenarios encumbered by complexity, uncertainty, and urgency.

agents
#137

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Evaluations & Benchmarks 2026-04-22 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision)
4.0
I 4.3 Im 3.8 P 4.0

The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.

evals
#138

CHORUS: An Agentic Framework for Generating Realistic Deliberation Data

Agents & Tool Use 2026-04-22 arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use
4.0
I 3.5 Im 4.6 P 4.0

Understanding the intricate dynamics of online discourse depends on large-scale deliberation data, a resource that remains scarce across interactive web platforms due to restrictive accessibility policies, ethical concerns and inconsistent data quality. In this paper, we propose Chorus, an agentic framework, which orchestrates LLM-powered actors with behaviorally consistent personas to generate realistic deliberation discussions. Each actor is governed by an autonomous agent equipped with memory of the evolving discussion, while participation timing is governed by a principled Poisson process-based temporal model, which approximates the heterogeneous engagement patterns of real users. The framework is further supported by structured tool usage, enabling actors to access external resources and facilitating integration with interactive web platforms. The framework was deployed on the Deliberate platform and evaluated by 30 expert participants across three dimensions: content realism, discussion coherence and analytical utility, confirming Chorus as a practical tool for generating high-quality deliberation data suitable for online discourse analysis.
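The Poisson-process timing model is the most transferable piece: each persona gets its own rate, inter-arrival gaps are exponential, and the merged time-sorted stream decides who posts next. The rates below are illustrative assumptions, not values from the paper:

```python
import random

def sample_participations(rates, horizon):
    """Sketch of a per-actor Poisson timing model: for each actor, draw
    Exponential(rate) inter-arrival gaps until the discussion horizon is
    reached, then merge all events into one time-sorted schedule."""
    events = []
    for actor, rate in rates.items():
        t = 0.0
        while True:
            t += random.expovariate(rate)       # next arrival for this actor
            if t > horizon:
                break
            events.append((t, actor))
    return sorted(events)

random.seed(42)
schedule = sample_participations({"lurker": 0.2, "regular": 1.0}, horizon=10.0)
print(len(schedule) > 0, schedule == sorted(schedule))  # True True
```

Heterogeneous rates are what let the synthetic discussion mimic the bursty, uneven engagement of real users rather than round-robin turn-taking.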

agents
#139

Evian: Towards Explainable Visual Instruction-tuning Data Auditing

Evaluations & Benchmarks 2026-04-22 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision)
4.0
I 4.3 Im 3.8 P 4.0

The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.

evals
#140

Cold-Start Forecasting of New Product Life-Cycles via Conditional Diffusion Models

State Space Models 2026-04-22 arXiv stat.ML (Statistical ML) · arXiv — Generative Media / Diffusion
4.0
I 4.3 Im 3.8 P 4.0

Forecasting the life-cycle trajectory of a newly launched product is important for launch planning, resource allocation, and early risk assessment. This task is especially difficult in the pre-launch and early post-launch phases, when product-specific outcome history is limited or unavailable, creating a cold-start problem. In these phases, firms must make decisions before demand patterns become reliably observable, while early signals are often sparse, noisy, and unstable. We propose the Conditional Diffusion Life-cycle Forecaster (CDLF), a conditional generative framework for forecasting new-product life-cycle trajectories under cold start. CDLF combines three sources of information: static descriptors, reference trajectories from similar products, and newly arriving observations when available. Here, static descriptors refer to structured pre-launch characteristics of the product, such as category, price tier, brand or organization identity, scale, and access conditions. This structure allows the model to condition forecasts on relevant product context and to update them adaptively over time without retraining, yielding flexible multi-modal predictive distributions under extreme data scarcity. The method satisfies consistency with a horizon-uniform distributional error bound for recursive generation. Across studies on Intel microprocessor stock keeping unit (SKU) life cycles and the platform-mediated adoption of open large language model repositories, CDLF delivers more accurate point forecasts and higher-quality probabilistic forecasts than classical diffusion models, Bayesian updating approaches, and other state-of-the-art machine-learning baselines.

ssm
#141

GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Sparse-View 3D Reconstruction

Generative Media 2026-04-22 arXiv cs.CV (Computer Vision) · arXiv — Efficiency (Quantization, MoE, Inference)
4.0
I 4.3 Im 3.8 P 4.0

Reconstructing dynamic 3D scenes from sparse multi-view videos is highly ill-posed, often leading to geometric collapse, trajectory drift, and floating artifacts. Recent attempts introduce generative priors to hallucinate missing content, yet naive integration frequently causes structural drift and temporal inconsistency due to the mismatch between stochastic 2D generation and deterministic 3D geometry. In this paper, we propose GeoRect4D, a novel unified framework for sparse-view dynamic reconstruction that couples explicit 3D consistency with generative refinement via a closed-loop optimization process. Specifically, GeoRect4D introduces a degradation-aware feedback mechanism that incorporates a robust anchor-based dynamic 3DGS substrate with a single-step diffusion rectifier to hallucinate high-fidelity details. This rectifier utilizes a structural locking mechanism and spatiotemporal coordinated attention, effectively preserving physical plausibility while restoring missing content. Furthermore, we present a progressive optimization strategy that employs stochastic geometric purification to eliminate floaters and generative distillation to infuse texture details into the explicit representation. Extensive experiments demonstrate that GeoRect4D achieves state-of-the-art performance in reconstruction fidelity, perceptual quality, and spatiotemporal consistency across multiple datasets.

generative_media
#142

Quantum hardware noise learning via differentiable Kraus representation on tensor networks

State Space Models 2026-04-22 arXiv — Evals & Benchmarks
4.0
I 3.5 Im 5.0 P 3.5

We present a method for learning quantum hardware noise from a measurement distribution of a single device experiment. Each noise channel is represented by automatically differentiable Kraus operators obtained from a Stinespring-based parameterization that is completely positive and trace preserving by construction, and circuits are simulated with a matrix product density operator forward model. Independent channels are attached to each native gate type, to each nearest-neighbor crosstalk interaction, and to state preparation and measurement, and all channels are optimized end-to-end against a distance between the simulated and observed measurement distributions. On ibm_fez, a Heron-generation superconducting processor, training on a ripple-carry adder circuit reproduces the device output distribution, and the same learned parameters, applied without retraining, also track the device distribution of an unrelated multiplier circuit, indicating that the method captures intrinsic device characteristics rather than overfitting to the training circuit. A systematic evaluation across a range of benchmark circuits confirms that this generalization is consistent. We further use the learned model to perform an offline feasibility assessment of the quantum approximate optimization algorithm with an error detection scheme, demonstrating the kind of noise-aware prediction the framework is designed to enable.
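The Stinespring-based parameterization is what makes the optimization safe: any isometry sliced into blocks yields Kraus operators that are completely positive and trace preserving by construction, for every parameter value. A numpy sketch (QR stands in for whatever smooth parameterization the authors use; dimensions are illustrative):

```python
import numpy as np

def kraus_from_params(params, d=2, k=3):
    """CPTP-by-construction sketch: reshape real parameters into a
    (k*d, d) complex matrix, orthonormalize its columns via QR to get an
    isometry V, and slice V into k Kraus operators. Then
    sum_i K_i^dagger K_i = V^dagger V = I holds automatically."""
    m = params[: 2 * k * d * d].reshape(2, k * d, d)
    V, _ = np.linalg.qr(m[0] + 1j * m[1])            # columns orthonormal
    return [V[i * d:(i + 1) * d, :] for i in range(k)]

rng = np.random.default_rng(0)
ks = kraus_from_params(rng.normal(size=2 * 3 * 2 * 2))
tp = sum(K.conj().T @ K for K in ks)                 # trace-preservation check
print(np.allclose(tp, np.eye(2)))                    # True
```

Because the constraint holds identically, gradient steps on the raw parameters can never leave the physical channel set, which is the property the paper exploits for end-to-end noise learning.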

ssm
#146

The sights of Sea Air Space Day 3

Government & Defense 2026-04-22 Breaking Defense
4.0
I 3.5 Im 4.4 P 4.1

A selection of photos from day three of Sea Air Space.

gov_defense
#147

Trump taps defense firm execs to lead space acquisition, NRO

Government & Defense 2026-04-22 Breaking Defense
4.0
I 3.5 Im 4.4 P 4.1

Erich Hernandez-Baquero has been selected to serve as the Space Force’s next acquisition czar, while Roger Mason has been nominated as the next director of the National Reconnaissance Office.

gov_defense
#150
4.0
I 3.5 Im 4.4 P 4.1

“As we start fleshing out these concepts, what we start to understand is that … something that is a really great capability over in EUCOM or CENTCOM right now may not translate over to the Pacific, where the distances are way, way greater,” Rear Adm. Douglas Sasse told the Sea-Air-Space conference.

gov_defense
#151

Turkish firms boost defense ties with Malaysia in missiles, comms and AI

Government & Defense 2026-04-22 Breaking Defense
4.0
I 3.5 Im 4.4 P 4.1

“Malaysia chose Turkish suppliers because they offered a rare combination of combat-proven capability, affordability, speed of delivery, and eagerness to build long-term industrial partnerships rather than simply sell end products,” one expert said.

gov_defense
#154

Forecasts of CMB $E$-mode anomalies for AliCPT-1

Evaluations & Benchmarks 2026-04-22 arXiv — Mechanistic Interpretability
4.0
I 3.2 Im 5.3 P 3.5

The standard $\Lambda$CDM model has been highly successful in describing cosmic microwave background (CMB) observations. Nevertheless, a set of large-scale statistical anomalies persists in temperature anisotropies across WMAP and Planck. CMB $E$-mode polarization offers an independent probe of these anomalies, circumventing the look-elsewhere effect inherent in temperature-only analyses. In this paper, we forecast the capability of the Ali CMB Polarization Telescope (AliCPT), a ground-based CMB experiment in the Northern Hemisphere, to detect such anomalies in large-scale $E$-mode polarization. Using 1000 unconstrained simulations processed with the NILC component separation method, we evaluate four anomaly estimators: dipole modulation, lack of large-angle correlations, quadrupole-octopole alignment, and point-parity asymmetry. Our analysis considers two noise levels for AliCPT, as well as a joint configuration with Simons Observatory (SO) Large Aperture Telescope (LAT). For dipole modulation, we validate the local variance estimator on modulated simulations with an input amplitude $A_d = 0.07$, and find that the combined AliCPT+SO dataset is likely to detect the injected $E$-mode modulation at a 99% confidence level. Tests of the full suite of anomaly statistics on unconstrained isotropic simulations indicate that AliCPT alone, owing to its limited sky coverage, might introduce systematic biases or enlarged uncertainties, especially for quadrupole-octopole alignment and point-parity asymmetry. The combination with SO largely restores the statistical distributions to those expected in an ideal full-sky scenario, thereby establishing a near-cosmic-variance benchmark for upcoming anomaly investigations.

evals
#155

Mechanistic Interpretability Tool for AI Weather Models

Interpretability 2026-04-22 arXiv — Mechanistic Interpretability
4.0
I 3.5 Im 5.0 P 3.5

Artificial Intelligence (AI) weather models are improving rapidly, and their forecasts are already competitive with long-established traditional Numerical Weather Prediction (NWP). To build confidence in this new methodology, it is critical that we understand how these predictions are generated. This is a huge challenge as these AI weather models remain largely black boxes. In other areas of Machine Learning (ML), mechanistic interpretability has emerged as a framework for understanding ML predictions by analysing the building blocks responsible for them. Here we present an open-source, highly adaptable tool which incorporates concepts from mechanistic interpretability. The tool organises internal latent representations from the model processor and allows for initial analyses, including cosine similarity and Principal Component Analysis (PCA), enabling the user to identify directions in latent space potentially associated with meteorological features. Applying our tool to the graph neural network GraphCast, we present preliminary case studies for mid-latitude synoptic-scale waves and specific humidity. These demonstrate the tool's ability to identify linear combinations of latent channels that appear to correspond to interpretable features.

interpretability

This paper presents the hybrid solver for a $CO_2$ sequestration problem. The solver uses the IGA-ADS (IsoGeometric Analysis Alternating Directions solver) to compute the saturation scalar field update using the explicit method, and CRVPINN (Collocation-based Robust Variational Physics Informed Neural Networks solver) to compute the pressure scalar field. The study focuses on simulating the physical behavior of $CO_2$ in porous structures, excluding chemical reactions. The mathematical model is based on Darcy's Law. The CRVPINN is pretrained on the initial pressure configuration, and the time step pressure updates require only 100 iterations of the Adam method per time step. We compare our hybrid IGA-ADS solver, coupled with the CRVPINN method, with a baseline of the IGA-ADS solver coupled with the MUMPS direct solver. Our hybrid solver is over 3 times faster on a single computational node from the ARES cluster of ACK CYFRONET. Future work includes extensive testing, inverse problem solving, and potential application to $H_2$ storage problems.

infra
#158

Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

Reinforcement Learning 2026-04-22 arXiv cs.NE (Neural & Evolutionary Computing) · arXiv cs.RO (Robotics) · arXiv — Reinforcement Learning
3.9
I 3.5 Im 3.8 P 4.5

Quality-Diversity (QD) algorithms excel at discovering diverse repertoires of skills, but are hindered by poor sample efficiency and often require tens of millions of environment steps to solve complex locomotion tasks. Recent advances in Reinforcement Learning (RL) have shown that high Update-to-Data (UTD) ratios accelerate Actor-Critic learning. While effective, standard high-UTD algorithms typically utilise target networks to stabilise training. This requirement introduces a significant computational bottleneck, rendering them impractical for resource-intensive Quality-Diversity (QD) tasks where sample efficiency and rapid population adaptation are critical. In this paper, we introduce QDHUAC, a sample-efficient, target-free and distributional QD-RL algorithm that provides dense and low-variance gradient signals, which enables high-UTD training for Dominated Novelty Search whilst requiring an order of magnitude fewer environment steps. We demonstrate that our method enables stable training at high UTD ratios, achieving competitive coverage and fitness on high-dimensional Brax environments with an order of magnitude fewer samples than baselines. Our results suggest that combining target-free distributional critics with dominance-based selection is a key enabler for the next generation of sample-efficient evolutionary RL algorithms.

rl
#159

Where Reasoning Breaks: Logic-Aware Path Selection by Controlling Logical Connectives in LLMs Reasoning Chains

Post-Training 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv — Post-training / Alignment
3.9
I 3.2 Im 4.6 P 4.0

While LLMs demonstrate impressive reasoning capabilities, they remain fragile in multi-step logical deduction, where a single transition error can propagate through the entire reasoning chain, leading to unstable performance. In this work, we identify logical connectives as primary points of this structural fragility. Through empirical analysis, we show that connective tokens function as high entropy forking points, at which models frequently struggle to determine the correct logical direction. Motivated by this observation, we hypothesize that intervening in logical connective selection can guide LLMs toward the correct logical direction, thereby improving the overall reasoning chain. To validate this hypothesis, we propose a multi-layered framework that intervenes specifically at these logic-critical junctions in the reasoning process. Our framework includes (1) Gradient-based Logical Steering to guide LLMs' internal representations towards valid reasoning subspaces, (2) Localized Branching to resolve ambiguity via targeted look-ahead search, and (3) Targeted Transition Preference Optimization, a surgical reinforcement learning objective that selectively optimizes single-token preferences at logical pivots. Crucially, by concentrating intervention solely on logic-critical transitions, our framework achieves a favorable accuracy-efficiency trade-off compared to global inference time scaling methods like beam search and self-consistency.
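The "high-entropy forking point" diagnostic that motivates the whole framework is easy to sketch: measure predictive entropy at connective tokens and flag the uncertain ones as intervention sites. The distributions below are toy numbers, not model logits:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

CONNECTIVES = {"therefore", "because", "however", "so", "thus"}

def forking_points(steps, threshold=1.0):
    """Given a reasoning chain as (token, next-token-distribution) pairs,
    flag the logical connectives whose predictive entropy exceeds the
    threshold -- the junctions where steering or look-ahead branching
    would be applied."""
    return [tok for tok, dist in steps
            if tok.lower() in CONNECTIVES and entropy(dist) > threshold]

chain = [
    ("x", [0.9, 0.1]),                      # content token: not a connective
    ("therefore", [0.3, 0.3, 0.2, 0.2]),    # near-uniform: model is unsure
    ("because", [0.97, 0.03]),              # confident connective: skip
]
print(forking_points(chain))  # ['therefore']
```

Restricting the (expensive) interventions to these few flagged positions is what buys the accuracy-efficiency trade-off over global beam search or self-consistency.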

post_training
#160

Decoding Text Spans for Efficient and Accurate Named-Entity Recognition

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language)
3.9
I 4.3 Im 3.8 P 3.5

Named Entity Recognition (NER) is a key component in industrial information extraction pipelines, where systems must satisfy strict latency and throughput constraints in addition to strong accuracy. State-of-the-art NER accuracy is often achieved by span-based frameworks, which construct span representations from token encodings and classify candidate spans. However, many span-based methods enumerate large numbers of candidates and process each candidate with marker-augmented inputs, substantially increasing inference cost and limiting scalability in large-scale deployments. In this work, we propose SpanDec, an efficient span-based NER framework that targets this bottleneck. Our main insight is that span representation interactions can be computed effectively at the final transformer stage, avoiding redundant computation in earlier layers via a lightweight decoder dedicated to span representations. We further introduce a span filtering mechanism during enumeration to prune unlikely candidates before expensive processing. Across multiple benchmarks, SpanDec matches competitive span-based baselines while improving throughput and reducing computational cost, yielding a better accuracy-efficiency trade-off suitable for high-volume serving and on-device applications.
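The enumeration-plus-filtering bottleneck the abstract describes can be sketched in a few lines; the scoring function here is a toy stand-in (real systems score spans from token encodings), and the names are illustrative, not SpanDec's API:

```python
def enumerate_spans(n_tokens, max_len):
    """All (start, end) candidate spans up to max_len tokens (end exclusive)."""
    return [(i, j) for i in range(n_tokens)
                   for j in range(i + 1, min(i + max_len, n_tokens) + 1)]

def filter_spans(spans, score_fn, keep_ratio=0.25):
    """Cheap pre-filtering pass: keep only the top fraction of candidates,
    so the expensive span classifier sees far fewer inputs."""
    scored = sorted(spans, key=score_fn, reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return scored[:k]

spans = enumerate_spans(n_tokens=10, max_len=4)
print(len(spans))   # 34 candidates before filtering
kept = filter_spans(spans, score_fn=lambda s: -(s[1] - s[0]))  # toy: prefer short spans
print(len(kept))    # 8 candidates survive for the classifier
```

Even on a 10-token sentence the candidate count drops 4x; on long documents in high-volume serving, pruning before the marker-augmented classification pass is where the throughput gain comes from.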

evals
#161

Surrogate modeling for interpreting black-box LLMs in medical predictions

Interpretability 2026-04-22 arXiv cs.CL (Computation & Language)
3.9
I 3.2 Im 5.0 P 3.5

Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.
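The surrogate idea can be illustrated end to end with a mock black box: probe it across simulated scenarios, fit an interpretable model to the input-output pairs, and read off which inputs it "perceives". The black box, variable names, and coefficients below are assumptions standing in for an LLM prompted over medical scenarios:

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(age, bmi, race):   # mock LLM risk score; race should be irrelevant
    return 0.02 * age + 0.05 * bmi + 0.3 * race  # an encoded spurious association

# Simulated scenarios: vary each input across a comprehensive range.
X = rng.uniform([20, 15, 0], [80, 40, 1], size=(500, 3))
y = np.array([black_box(*row) for row in X])

# Linear surrogate via least squares; coefficients quantify perceived effects.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(dict(zip(["age", "bmi", "race", "bias"], coef.round(3))))
# A nonzero "race" coefficient acts as the red-flag indicator the paper describes.
```

The framework's value is exactly this inversion: the black box stays opaque, but the surrogate's coefficients are directly comparable against established domain knowledge.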

interpretability
#162
3.9
I 4.3 Im 3.8 P 3.5

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8x improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.

robotics
#163
3.9
I 4.3 Im 3.8 P 3.5

Realizing active visual tracking with a single unified model across diverse robots is challenging, as the physical constraints and motion dynamics vary drastically from one platform to another. Existing approaches typically train separate models for each embodiment, leading to poor scalability and limited generalization. To address this, we propose AdaTracker, an adaptive in-context policy learning framework that robustly tracks targets on diverse robot morphologies. Our key insight is to explicitly model embodiment-specific constraints through an Embodiment Context Encoder, which infers embodiment-specific constraints from history. This contextual representation dynamically modulates a Context-Aware Policy, enabling it to infer optimal control actions for unseen embodiments in a zero-shot manner. To enhance robustness, we introduce two auxiliary objectives to ensure accurate context identification and temporal consistency. Experiments in both simulation and the real world demonstrate that AdaTracker significantly outperforms state-of-the-art methods in cross-embodiment generalization, sample efficiency, and zero-shot adaptation.

robotics
#164

Closing the Domain Gap in Biomedical Imaging by In-Context Control Samples

AI for Science 2026-04-22 arXiv cs.LG (Machine Learning)
3.9
I 3.5 Im 4.5 P 3.5

The central problem in biomedical imaging is batch effects: systematic technical variations unrelated to the biological signal of interest. These batch effects critically undermine experimental reproducibility and are the primary cause of failure of deep learning systems on new experimental batches, preventing their practical use in the real world. Despite years of research, no method has succeeded in closing this performance gap for deep learning models. We propose Control-Stabilized Adaptive Risk Minimization via Batch Normalization (CS-ARM-BN), a meta-learning adaptation method that exploits negative control samples. Such unperturbed reference images are present in every experimental batch by design and serve as stable context for adaptation. We validate our novel method on Mechanism-of-Action (MoA) classification, a crucial task for drug discovery, on the large-scale JUMP-CP dataset. The accuracy of standard ResNets drops from 0.939 ± 0.005 on the training domain to 0.862 ± 0.060 on data from new experimental batches. Foundation models, even after Typical Variation Normalization, fail to close this gap. We are the first to show that meta-learning approaches close the domain gap by achieving 0.935 ± 0.018. If the new experimental batches exhibit strong domain shifts, such as being generated in a different lab, meta-learning approaches can be stabilized with control samples, which are always available in biomedical experiments. Our work shows that batch effects in bioimaging data can be effectively neutralized through principled in-context adaptation, making these models practically usable and efficient.
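The core adaptation step lends itself to a small sketch: re-estimate normalization statistics from the negative-control samples of a new batch and apply them to the query features. Shapes, the synthetic features, and the function name are illustrative assumptions, not the CS-ARM-BN implementation:

```python
import numpy as np

def adapt_batchnorm(features_controls, features_query, eps=1e-5):
    """Normalize query features with mean/var computed on control samples
    from the *same* experimental batch (in-context adaptation)."""
    mu = features_controls.mean(axis=0)
    var = features_controls.var(axis=0)
    return (features_query - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
batch_shift = 5.0                                     # technical batch effect
controls = rng.normal(batch_shift, 1.0, (64, 8))      # unperturbed references
query = rng.normal(batch_shift + 2.0, 1.0, (16, 8))   # perturbed (MoA) samples

adapted = adapt_batchnorm(controls, query)
print(adapted.mean().round(1))  # roughly 2: batch offset removed, biology preserved
```

Because controls exist in every batch by design, this re-centering needs no labels from the new batch, which is what makes the approach deployable on unseen experimental runs.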

ai_science
#165

Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series

Evaluations & Benchmarks 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks
3.9
I 3.5 Im 3.8 P 4.5

The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth Observation-based offshore wind infrastructure mapping has matured for spatial localization, existing open datasets lack temporally dense and semantically fine-grained information on construction and operational dynamics. We introduce a global Sentinel-1 synthetic aperture radar (SAR) time series data corpus that resolves deployment and operational phases of offshore wind infrastructure from 2016Q1 to 2025Q1. Building on an updated object detection workflow, we compile 15,606 time series at detected infrastructure locations, comprising 14,840,637 events in total as analysis-ready 1D SAR backscatter profiles, one profile per Sentinel-1 acquisition and location. To enable direct use and benchmarking, we release (i) the analysis-ready 1D SAR profiles, (ii) event-level baseline semantic labels generated by a rule-based classifier, and (iii) an expert-annotated benchmark dataset of 553 time series with 328,657 event labels. The baseline classifier achieves a macro F1 score of 0.84 in event-wise evaluation and an area under the collapsed edit similarity-quality threshold curve (AUC) of 0.785, indicating temporal coherence. We demonstrate that the resulting corpus supports global-scale analyses of deployment dynamics, the identification of differences in regional deployment patterns, vessel interactions, and operational events, and provides a reference for developing and comparing time series classification methods for offshore wind infrastructure monitoring.

evals
#166

Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

State Space Models 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks
3.9
I 3.5 Im 3.8 P 4.5

Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce \emph{semantic stratification}, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.
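The coverage idea reduces to a simple check: stratify the corpus, then see which strata the evaluation queries actually touch. Cluster assignments below are given directly for illustration; the paper derives them from entity extraction and clustering:

```python
from collections import Counter

def coverage_report(doc_clusters, query_clusters):
    """Per-stratum query counts, plus the strata with no queries at all --
    the candidates for systematic query generation."""
    strata = set(doc_clusters)
    hits = Counter(c for c in query_clusters if c in strata)
    missing = sorted(strata - set(hits))
    return hits, missing

docs = ["med", "med", "law", "law", "finance", "sports"]     # corpus strata
queries = ["med", "med", "law"]                              # heuristic eval set
hits, missing = coverage_report(docs, queries)
print(dict(hits))   # {'med': 2, 'law': 1}
print(missing)      # ['finance', 'sports'] -> generate queries for these strata
```

The "hidden intrinsic bias" of heuristic query sets is exactly the `missing` list: aggregate retrieval metrics say nothing about strata the queries never exercise.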

ssm
#167

Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing

Evaluations & Benchmarks 2026-04-22 arXiv cs.LG (Machine Learning)
3.9
I 4.3 Im 3.8 P 3.5

Adversarial robustness evaluation underpins every claim of trustworthy ML deployment, yet the field suffers from fragmented protocols and undetected gradient masking. We make two contributions. (1) Structured synthesis. We analyze nine peer-reviewed corpus sources (2020--2026) through seven complementary protocols, producing the first end-to-end structured analysis of the field's consensus and unresolved challenges. (2) Auto-ART framework. We introduce Auto-ART, an open-source framework that operationalizes identified gaps: 50+ attacks, 28 defense modules, the Robustness Diagnostic Index (RDI), and gradient-masking detection. It supports multi-norm evaluation (l1/l2/linf/semantic/spatial) and compliance mapping to NIST AI RMF, OWASP LLM Top 10, and the EU AI Act. Empirical validation on RobustBench demonstrates that Auto-ART's pre-screening identifies gradient masking in 92% of flagged cases, and RDI rankings correlate highly with full AutoAttack. Multi-norm evaluation exposes a 23.5 pp gap between average and worst-case robustness on state-of-the-art models. No prior work combines such structured meta-scientific analysis with an executable evaluation framework bridging literature gaps into engineering.

evals
#168
3.9
I 4.3 Im 3.8 P 3.5

Federated learning (FL) enables training of a global model while keeping raw data on end-devices. Despite this, FL has been shown to leak private user information, so in practice it is often coupled with methods such as differential privacy (DP) and secure vector sum to provide formal privacy guarantees to its participants. In realistic cross-device deployments, the data are highly heterogeneous, so vanilla federated learning converges slowly and generalizes poorly. Clustered federated learning (CFL) mitigates this by segregating users into clusters, leading to lower intra-cluster data heterogeneity. Nevertheless, coupling CFL with DP remains challenging: the injected DP noise makes individual client updates excessively noisy, and the server is unable to initialize cluster centroids with the less noisy aggregated updates. To address this challenge, we propose PINA, a two-stage framework that first lets each client fine-tune a lightweight low-rank adaptation (LoRA) adapter and privately share a compressed sketch of the update. The server leverages these sketches to construct robust cluster centroids. In the second stage, PINA introduces a normality-driven aggregation mechanism that improves convergence and robustness. Our method retains the benefits of clustered FL while providing formal privacy guarantees against an untrusted server. Extensive evaluations show that our proposed method outperforms state-of-the-art DP-FL algorithms by an average of 2.9% in accuracy for privacy budgets (ε ∈ {2, 8}).
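The private-sketch step in the first stage can be sketched as compress, clip, then add calibrated Gaussian noise. The random-projection compressor, parameter names, and sizes are assumptions for illustration, not PINA's actual mechanism:

```python
import numpy as np

rng = np.random.default_rng(42)

def private_sketch(update, proj, clip=1.0, noise_mult=1.0):
    """Compress a client update, clip to bound sensitivity, add Gaussian noise."""
    sketch = proj @ update                        # compress d -> k
    norm = np.linalg.norm(sketch)
    sketch = sketch * min(1.0, clip / norm)       # clipping bounds the L2 sensitivity
    return sketch + rng.normal(0, noise_mult * clip, sketch.shape)  # DP noise

d, k = 1024, 32
proj = rng.normal(0, 1 / np.sqrt(k), (k, d))      # shared random projection
update = rng.normal(0, 0.1, d)                    # a client's flattened LoRA update

s = private_sketch(update, proj)
print(s.shape)  # (32,) -- the server clusters clients on these noisy sketches
```

Clustering on k-dimensional sketches rather than full d-dimensional updates is what lets the server build centroids despite per-client DP noise: the noise is added once to a small, norm-bounded vector.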

evals
#169

On Bayesian Softmax-Gated Mixture-of-Experts Models

Infrastructure 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML) · arXiv — Efficiency (Quantization, MoE, Inference)
3.9
I 3.5 Im 3.8 P 4.5

Mixture-of-experts models provide a flexible framework for learning complex probabilistic input-output relationships by combining multiple expert models through an input-dependent gating mechanism. These models have become increasingly prominent in modern machine learning, yet their theoretical properties in the Bayesian framework remain largely unexplored. In this paper, we study Bayesian mixture-of-experts models, focusing on the ubiquitous softmax-based gating mechanism. Specifically, we investigate the asymptotic behavior of the posterior distribution for three fundamental statistical tasks: density estimation, parameter estimation, and model selection. First, we establish posterior contraction rates for density estimation, both in the regimes with a fixed, known number of experts and with a random learnable number of experts. We then analyze parameter estimation and derive convergence guarantees based on tailored Voronoi-type losses, which account for the complex identifiability structure of mixture-of-experts models. Finally, we propose and analyze two complementary strategies for selecting the number of experts. Taken together, these results provide one of the first systematic theoretical analyses of Bayesian mixture-of-experts models with softmax gating, and yield several theory-grounded insights for practical model design.

infra
#170

Explicit Dropout: Deterministic Regularization for Transformer Architectures

Interpretability 2026-04-22 arXiv cs.LG (Machine Learning)
3.9
I 4.3 Im 3.8 P 3.5

Dropout is a widely used regularization technique in deep learning, but its effects are typically realized through stochastic masking rather than explicit optimization objectives. We propose a deterministic formulation that expresses dropout as an additive regularizer directly incorporated into the training loss. The framework derives explicit regularization terms for Transformer architectures, covering attention query, key, value, and feed-forward components with independently controllable strengths. This formulation removes reliance on stochastic perturbations while providing clearer and fine-grained control over regularization strength. Experiments across image classification, temporal action detection, and audio classification show that explicit dropout matches or outperforms conventional implicit methods, with consistent gains when applied to attention and feed-forward network layers. Ablation studies demonstrate stable performance and controllable regularization through regularization coefficients and dropout rates. Overall, explicit dropout offers a practical and interpretable alternative to stochastic regularization while maintaining architectural flexibility across diverse tasks.
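The dropout-as-additive-regularizer idea can be checked numerically on the classic linear case: the expected loss under stochastic input dropout equals the clean loss plus a deterministic penalty term. This identity is a toy stand-in; the paper derives analogous explicit terms for attention and feed-forward components:

```python
import numpy as np

rng = np.random.default_rng(0)
x, w, y, p = rng.normal(size=5), rng.normal(size=5), 1.0, 0.3

# Monte-Carlo estimate of the dropout loss (inverted-dropout masks).
masks = rng.binomial(1, 1 - p, size=(200_000, 5)) / (1 - p)
stochastic = np.mean((y - (masks * x) @ w) ** 2)

# Equivalent deterministic objective: clean loss + explicit additive regularizer,
# with per-coordinate strength p/(1-p) * x_i^2 * w_i^2.
explicit = (y - x @ w) ** 2 + (p / (1 - p)) * np.sum(x**2 * w**2)

print(round(float(stochastic), 2), round(float(explicit), 2))  # agree up to MC noise
```

The explicit form is what enables the paper's selling points: no stochastic masking at train time, and independently controllable regularization strengths per component, since each penalty term is just another coefficient in the loss.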

interpretability
#171

Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

Robotics 2026-04-22 arXiv cs.AI (Artificial Intelligence) · arXiv — Agents / Tool Use · arXiv — Evals & Benchmarks
3.9
I 3.5 Im 3.8 P 4.5

This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.

robotics
#172

Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure

Evaluations & Benchmarks 2026-04-22 arXiv cs.AI (Artificial Intelligence)
3.9
I 4.3 Im 3.8 P 3.5

Large language models trained on human feedback may suppress fraud warnings when investors arrive already persuaded of a fraudulent opportunity. We tested this in a preregistered experiment across seven leading LLMs and twelve investment scenarios covering legitimate, high-risk, and objectively fraudulent opportunities, combining 3,360 AI advisory conversations with a 1,201-participant human benchmark. Contrary to predictions, motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them. Endorsement reversal occurred in fewer than 3 in 1,000 observations. Human advisors endorsed fraudulent investments at baseline rates of 13-14%, versus 0% across all LLMs, and suppressed warnings under pressure at two to four times the AI rate. AI systems currently provide more consistent fraud warnings than lay humans in an identical advisory role.

evals
#173

Amodal SAM: A Unified Amodal Segmentation Framework with Generalization

Evaluations & Benchmarks 2026-04-22 arXiv cs.CV (Computer Vision) · arXiv — Evals & Benchmarks
3.9
I 4.0 Im 3.8 P 4.0

Amodal segmentation is a challenging task that aims to predict the complete geometric shape of objects, including their occluded regions. Although existing methods primarily focus on amodal segmentation within the training domain, these approaches often lack the generalization capacity to extend effectively to novel object categories and unseen contexts. This paper introduces Amodal SAM, a unified framework that leverages SAM (Segment Anything Model) for both amodal image and amodal video segmentation. Amodal SAM preserves the powerful generalization ability of SAM while extending its inherent capabilities to the amodal segmentation task. The improvements lie in three aspects: (1) a lightweight Spatial Completion Adapter that enables occluded region reconstruction, (2) a Target-Aware Occlusion Synthesis (TAOS) pipeline that addresses the scarcity of amodal annotations by generating diverse synthetic training data, and (3) novel learning objectives that enforce regional consistency and topological regularization. Extensive experiments demonstrate that Amodal SAM achieves state-of-the-art performance on standard benchmarks, while simultaneously exhibiting robust generalization to novel scenarios. We anticipate that this research will advance the field toward practical amodal segmentation systems capable of operating effectively in unconstrained real-world environments.

evals
#174

Where are they looking in the operating room?

Robotics 2026-04-22 arXiv cs.CV (Computer Vision)
3.9
I 4.3 Im 3.8 P 3.5

Purpose: Gaze-following, the task of inferring where individuals are looking, has been widely studied in computer vision, advancing research in visual attention modeling, social scene understanding, and human-robot interaction. However, gaze-following has never been explored in the operating room (OR), a complex, high-stakes environment where visual attention plays an important role in surgical workflow analysis. In this work, we introduce the concept of gaze-following to the surgical domain, and demonstrate its great potential for understanding clinical roles, surgical phases, and team communications in the OR. Methods: We extend the 4D-OR dataset with gaze-following annotations, and extend the Team-OR dataset with gaze-following and a new team communication activity annotations. Then, we propose novel approaches to address clinical role prediction, surgical phase recognition, and team communication detection using a gaze-following model. For role and phase recognition, we propose a gaze heatmap-based approach that uses gaze predictions solely; for team communication detection, we train a spatial-temporal model in a self-supervised way that encodes gaze-based clip features, and then feed the features into a temporal activity detection model. Results: Experimental results on the 4D-OR and Team-OR datasets demonstrate that our approach achieves state-of-the-art performance on all downstream tasks. Quantitatively, our approach obtains F1 scores of 0.92 for clinical role prediction and 0.95 for surgical phase recognition. Furthermore, it significantly outperforms existing baselines in team communication detection, improving previous best performances by over 30%. Conclusion: We introduce gaze-following in the OR as a novel research direction in surgical data science, highlighting its great potential to advance surgical workflow analysis in computer-assisted interventions.

robotics
#175
3.9
I 3.5 Im 4.6 P 3.5

We study the emergence of symmetric oscillatory behavior in multi-agent systems where each agent incorporates a continuous memory of its past states and past rates of change, modeled by distributed retarded and neutral delays. The closed-loop dynamics are described by a system of nonlinear neutral functional differential equations (NFDEs) with a high degree of symmetry, arising from a network of homogeneous agents. By reformulating the problem as a fixed point operator equation, we apply equivariant degree theory to establish rigorous conditions for unbounded global Hopf bifurcation from the consensus equilibrium. Our main results provide sufficient conditions for the local asymptotic stability of consensus and for the existence of unbounded global branches of non-constant periodic solutions with prescribed spatio-temporal symmetries. The question of whether such periodic solutions are stable (and therefore constitute periodic multiconsensus) is not resolved by the degree method; we address it in an illustrative example via numerical simulation. The example, which models eight coupled asset markets with momentum traders and fundamentalists, demonstrates how memory-driven instability can generate periodic boom-bust cycles across clusters of assets. The numerical experiments confirm the bifurcation predictions and reveal the stability of the resulting oscillations, illustrating the power of combining symmetric bifurcation theory with targeted numerical analysis.

agents
#176

Self-supervised pretraining for an iterative image size agnostic vision transformer

Efficiency 2026-04-22 arXiv — Efficiency (Quantization, MoE, Inference)
3.9
I 3.5 Im 4.5 P 3.5

Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results via supervised learning, utilizing a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We achieve competitive performance on ImageNet-1K and downstream classification tasks, maintaining a constant computational budget regardless of input resolution.
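The integral-image trick behind the efficient patch extraction is standard and worth spelling out: after one cumulative-sum pass, the sum of any axis-aligned patch is four lookups, independent of patch (zoom) size. A minimal sketch, with function names of our own choosing:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero-padded first row/column."""
    sat = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    sat[1:, 1:] = img.cumsum(0).cumsum(1)
    return sat

def patch_sum(sat, top, left, h, w):
    """Sum of img[top:top+h, left:left+w] in O(1), via four table lookups."""
    return (sat[top + h, left + w] - sat[top, left + w]
            - sat[top + h, left] + sat[top, left])

img = np.arange(16, dtype=float).reshape(4, 4)
sat = integral_image(img)
print(patch_sum(sat, 1, 1, 2, 2))   # 5+6+9+10 = 30.0
```

Constant-time patch sums are what make multi-zoom patch pooling cheap: every zoom level costs the same four lookups per patch, so the foveal context stays fixed-size regardless of input resolution.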

efficiency
#177

ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval

Evaluations & Benchmarks 2026-04-22 arXiv — Efficiency (Quantization, MoE, Inference)
3.9
I 4.3 Im 3.8 P 3.5

The Composed Image Retrieval (CIR) task provides a flexible retrieval paradigm via a reference image and modification text, but it heavily relies on expensive and error-prone triplet annotations. This paper systematically investigates the Noisy Triplet Correspondence (NTC) problem introduced by annotations. We find that NTC noise, particularly "hard noise" (i.e., the reference and target images are highly similar but the modification text is incorrect), poses a unique challenge to existing Noise Correspondence Learning (NCL) methods because it breaks the traditional "small loss hypothesis". We identify and elucidate three key, yet overlooked, challenges in the NTC task, namely (C1) Modality Suppression, (C2) Negative Anchor Deficiency, and (C3) Unlearning Backlash. To address these challenges, we propose a Cone-based robuSt noisE-unlearning comPositional network (ConeSep). Specifically, we first propose Geometric Fidelity Quantization, theoretically establishing and practically estimating a noise boundary to precisely locate noisy correspondence. Next, we introduce Negative Boundary Learning, which learns a "diagonal negative combination" for each query as its explicit semantic opposite-anchor in the embedding space. Finally, we design Boundary-based Targeted Unlearning, which models the noisy correction process as an optimal transport problem, elegantly avoiding Unlearning Backlash. Extensive experiments on benchmark datasets (FashionIQ and CIRR) demonstrate that ConeSep significantly outperforms current state-of-the-art methods, which fully demonstrates the effectiveness and robustness of our method.

evals
#178

MD-Face: MoE-Enhanced Label-Free Disentangled Representation for Interactive Facial Attribute Editing

Generative Media 2026-04-22 arXiv — Efficiency (Quantization, MoE, Inference)
3.9
I 4.3 Im 3.8 P 3.5

GAN-based facial attribute editing is widely used in virtual avatars and social media but often suffers from attribute entanglement, where modifying one face attribute unintentionally alters others. While supervised disentangled representation learning can address this, it relies heavily on labeled data, incurring high annotation costs. To address these challenges, we propose MD-Face, a label-free disentangled representation learning framework based on Mixture of Experts (MoE). MD-Face utilizes a MoE backbone with a gating mechanism that dynamically allocates experts, enabling the model to learn semantic vectors with greater independence. To further enhance attribute disentanglement, we introduce a geometry-aware loss, which aligns each semantic vector with its corresponding Semantic Boundary Vector (SBV) through a Jacobian-based pushforward method. Experiments with ProGAN and StyleGAN show that MD-Face outperforms unsupervised baselines and competes with supervised ones. Compared to diffusion-based methods, it offers better image quality and lower inference latency, making it ideal for interactive editing.

generative_media
#179

Quantization robustness from dense representations of sparse functions in high-capacity kernel associative memory

Efficiency 2026-04-22 arXiv cs.NE (Neural & Evolutionary Computing) · arXiv — Efficiency (Quantization, MoE, Inference)
3.8
I 3.5 Im 3.8 P 4.0

High-capacity associative memories based on Kernel Logistic Regression (KLR) are known for their exceptional performance but are hindered by high computational costs. This paper investigates the compressibility of KLR-trained Hopfield networks to understand the geometric principles of its robust encoding. We provide a comprehensive geometric theory based on spontaneous symmetry breaking and Walsh analysis, and validate it with compression experiments (quantization and pruning). Our experiments reveal a striking contrast: the network is extremely robust to low-precision quantization but highly sensitive to pruning. Our theory explains this via a "sparse function, dense representation" principle, where a sparse input mapping is implemented with a dense, bimodal parameterization. Our findings not only provide a practical path to hardware-efficient kernel memories but also offer new insights into the geometric principles of robust representation in neural systems.

efficiency
#180

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

State Space Models 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence) · arXiv — Evals & Benchmarks
3.8
I 3.2 Im 3.8 P 4.5

Paralinguistic cues are essential for natural human-computer interaction, yet their evaluation in Large Audio-Language Models (LALMs) remains limited by coarse feature coverage and the inherent subjectivity of assessment. To address these challenges, we introduce SpeechParaling-Bench, a comprehensive benchmark for paralinguistic-aware speech generation. It expands existing coverage from fewer than 50 to over 100 fine-grained features, supported by more than 1,000 English-Chinese parallel speech queries, and is organized into three progressively challenging tasks: fine-grained control, intra-utterance variation, and context-aware adaptation. To enable reliable evaluation, we further develop a pairwise comparison pipeline, in which candidate responses are evaluated against a fixed baseline by an LALM-based judge. By framing evaluation as relative preference rather than absolute scoring, this approach mitigates subjectivity and yields more stable and scalable assessments without costly human annotation. Extensive experiments reveal substantial limitations in current LALMs. Even leading proprietary models struggle with comprehensive static control and dynamic modulation of paralinguistic features, while failure to correctly interpret paralinguistic cues accounts for 43.3% of errors in situational dialogue. These findings underscore the need for more robust paralinguistic modeling toward human-aligned voice assistants.
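The pairwise evaluation scheme reduces to ranking systems by win rate against one fixed baseline rather than by absolute scores. A minimal sketch, with a toy numeric comparator standing in for the LALM-based judge and invented per-query scores:

```python
def win_rate(candidate_scores, baseline_scores):
    """Fraction of queries where the candidate beats the fixed baseline
    (ties count as half a win)."""
    wins = sum(1.0 if c > b else 0.5 if c == b else 0.0
               for c, b in zip(candidate_scores, baseline_scores))
    return wins / len(candidate_scores)

baseline = [0.5, 0.5, 0.5, 0.5]    # fixed reference responses
system_a = [0.9, 0.4, 0.5, 0.8]    # judged per-query quality (toy numbers)
system_b = [0.6, 0.6, 0.2, 0.4]

print(win_rate(system_a, baseline))  # 0.625
print(win_rate(system_b, baseline))  # 0.5
```

Framing evaluation as relative preference sidesteps calibration of an absolute scoring scale, which is precisely the subjectivity problem the benchmark is designed around.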

ssm
#181

Anchor-and-Resume Concession Under Dynamic Pricing for LLM-Augmented Freight Negotiation

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv cs.AI (Artificial Intelligence)
3.8
I 3.5 Im 3.8 P 4.0

Freight brokerages negotiate thousands of carrier rates daily under dynamic pricing conditions where models frequently revise targets mid-conversation. Classical time-dependent concession frameworks use a fixed shape parameter $β$ that cannot adapt to these updates. Deriving $β$ from the live spread enables adaptation but introduces a new problem: a pricing shift can cause the formula to retract a previous offer, violating monotonicity. LLM-powered brokers offer flexibility but require expensive reasoning models, produce non-deterministic pricing, and remain vulnerable to prompt injection. We propose a two-index anchor-and-resume framework that addresses both limitations. A spread-derived $β$ maps each load's margin structure to the correct concession posture, while the anchor-and-resume mechanism guarantees monotonically non-decreasing offers under arbitrary pricing shifts. All pricing decisions remain in a deterministic formula; the LLM, when used, serves only as a natural-language translation layer. Empirical evaluation across 115,125 negotiations shows that the adaptive $β$ tailors behavior by regime: in narrow spreads, it concedes quickly to prioritize deal closure and load coverage; in medium and wide spreads, it matches or exceeds the best fixed-$β$ baselines in broker savings. Against an unconstrained 20-billion-parameter LLM broker, it achieves similar agreement rates and savings. Against LLM-powered carriers as more realistic stochastic counterparties, it maintains comparable savings and higher agreement rates than against rule-based opponents. By decoupling the LLM from pricing logic, the framework scales horizontally to thousands of concurrent negotiations with negligible inference cost and transparent decision-making.
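The pricing logic described above fits in a few lines. A hypothetical sketch (names and the exact concession curve are mine, not the paper's): a time-dependent concession formula whose shape parameter beta would be derived from the live spread, wrapped in an anchor-and-resume guard that never retracts an offer after a pricing shift:

```python
def concession_offer(t, T, start, limit, beta):
    """Classical time-dependent concession: the offer rises from an initial
    anchor toward the reservation price as the deadline T approaches.
    Small beta concedes quickly (narrow spreads); large beta holds firm."""
    frac = min(max(t / T, 0.0), 1.0)
    return start + (limit - start) * frac ** beta

class AnchorResumeBroker:
    """Anchor-and-resume guard: a pricing shift may lower the raw formula
    output, but the quoted offer never drops below the best offer already
    extended, so the sequence stays monotonically non-decreasing."""

    def __init__(self):
        self.anchor = None  # highest offer quoted so far

    def offer(self, t, T, start, limit, beta):
        raw = concession_offer(t, T, start, limit, beta)
        if self.anchor is None or raw >= self.anchor:
            self.anchor = raw  # resume normal concession
        return self.anchor     # otherwise hold the anchor
```

In this sketch a mid-negotiation shift from `limit=200` to `limit=150` would momentarily push the raw formula below offers already made; the anchor holds until the new curve catches up, which is the monotonicity guarantee the abstract describes.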

evals
#182

SignDATA: Data Pipeline for Sign Language Translation

Generative Media 2026-04-22 arXiv cs.CL (Computation & Language) · arXiv — Generative Media / Diffusion
3.8
I 3.5 Im 3.8 P 4.0

Sign-language datasets are difficult to preprocess consistently because they vary in annotation schema, clip timing, signer framing, and privacy constraints. Existing work usually reports downstream models, while the preprocessing pipeline that converts raw video into training-ready pose or video artifacts remains fragmented, backend-specific, and weakly documented. We present SignDATA, a config-driven preprocessing toolkit that standardizes heterogeneous sign-language corpora into comparable outputs for learning. The system supports two end-to-end recipes: a pose recipe that performs acquisition, manifesting, person localization, clipping, cropping, landmark extraction, normalization, and WebDataset export, and a video recipe that replaces pose extraction with signer-cropped video packaging. SignDATA exposes interchangeable MediaPipe and MMPose backends behind a common interface, typed job schemas, experiment-level overrides, and per-stage checkpointing with config- and manifest-aware hashes. We validate the toolkit through a research-oriented evaluation design centered on backend comparison, preprocessing ablations, and privacy-aware video generation on datasets. Our contribution is a reproducible preprocessing layer for sign-language research that makes extractor choice, normalization policy, and privacy tradeoffs explicit, configurable, and empirically comparable. Code is available at https://github.com/balaboom123/signdata-slt.

generative_media
#183

VTouch++: A Multimodal Dataset with Vision-Based Tactile Enhancement for Bimanual Manipulation

Robotics 2026-04-22 arXiv cs.RO (Robotics) · arXiv cs.AI (Artificial Intelligence)
3.8
I 3.5 Im 3.8 P 4.0

Embodied intelligence has advanced rapidly in recent years; however, bimanual manipulation, especially in contact-rich tasks, remains challenging. This is largely due to the lack of datasets with rich physical interaction signals, systematic task organization, and sufficient scale. To address these limitations, we introduce the VTOUCH dataset. It leverages vision-based tactile sensing to provide high-fidelity physical interaction signals, adopts a matrix-style task design to enable systematic learning, and employs automated data collection pipelines covering real-world, demand-driven scenarios to ensure scalability. To further validate the effectiveness of the dataset, we conduct extensive quantitative experiments on cross-modal retrieval as well as real-robot evaluation. Finally, we demonstrate real-world performance through generalizable inference across multiple robots, policies, and tasks.

robotics
#184

Physics-Conditioned Synthesis of Internal Ice-Layer Thickness for Incomplete Layer Traces

Evaluations & Benchmarks 2026-04-22 arXiv cs.LG (Machine Learning)
3.8
I 3.2 Im 4.5 P 3.5

Internal ice layers imaged by radar provide key evidence of snow accumulation and ice dynamics, but radar-derived layer boundary observations are often incomplete, with discontinuous traces and sometimes entirely missing layers, due to limited resolution, sensor noise, and signal loss. Existing graph-based models for ice stratigraphy generally assume sufficiently complete layer profiles and focus on predicting deeper-layer thickness from reliably traced shallow layers. In this work, we address the layer-completion problem itself by synthesizing complete ice-layer thickness annotations from incomplete radar-derived layer traces by conditioning on colocated physical features synchronized from physical climate models. The proposed network combines geometric learning to aggregate within-layer spatial context with a transformer-based temporal module that propagates information across layers to encourage coherent stratigraphy and consistent thickness evolution. To learn from incomplete supervision, we optimize a mask-aware robust regression objective that evaluates errors only at observed thickness values and normalizes by the number of valid entries, enabling stable training under varying sparsity without imputation and steering completions toward physically plausible values. The model preserves observed thickness where available and infers only missing regions, recovering fragmented segments and even fully absent layers while remaining consistent with measured traces. As an additional benefit, the synthesized thickness stacks provide effective pretraining supervision for a downstream deep-layer predictor, improving fine-tuned accuracy over training from scratch on the same fully traced data.
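The mask-aware objective is simple to state concretely. A sketch under my own assumptions (Huber as the robust loss; the paper's exact robust kernel may differ), evaluating error only at observed entries and normalizing by the valid count:

```python
import numpy as np

def masked_robust_loss(pred, target, mask, delta=1.0):
    """Huber-style loss over observed entries only; missing entries
    (mask == 0) contribute nothing and require no imputation."""
    err = np.abs(pred - target) * mask
    loss = np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta))
    return loss.sum() / max(mask.sum(), 1)
```

Because the masked entries never enter the sum, gradients flow only from observed thickness values, which is what allows stable training under varying sparsity.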

evals
#185

Relative Entropy Estimation in Function Space: Theory and Applications to Trajectory Inference

State Space Models 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks
3.8
I 3.5 Im 3.8 P 4.0

Trajectory Inference (TI) seeks to recover latent dynamical processes from snapshot data, where only independent samples from time-indexed marginals are observed. In applications such as single-cell genomics, destructive measurements make path-space laws non-identifiable from finitely many marginals, leaving held-out marginal prediction as the dominant but limited evaluation protocol. We introduce a general framework for estimating the Kullback-Leibler (KL) divergence between probability measures on function space, yielding a tractable, data-driven estimator that is scalable to realistic snapshot datasets. We validate the accuracy of our estimator on a benchmark suite, where the estimated functional KL closely matches the analytic KL. Applying this framework to synthetic and real scRNA-seq datasets, we show that current evaluation metrics often give inconsistent assessments, whereas path-space KL enables a coherent comparison of trajectory inference methods and exposes discrepancies in inferred dynamics, especially in regions with sparse or missing data. These results support functional KL as a principled criterion for evaluating trajectory inference under partial observability.
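For orientation, when the competing models are diffusions with a shared noise level, the path-space KL being estimated has a classical closed form via Girsanov's theorem (notation mine; scalar $\sigma$ for simplicity, with $b_P, b_Q$ the two drifts):

```latex
\mathrm{KL}(P \,\|\, Q)
  = \mathbb{E}_{P}\!\left[\frac{1}{2}\int_{0}^{T}
    \frac{\lVert b_{P}(X_t,t) - b_{Q}(X_t,t)\rVert^{2}}{\sigma^{2}}\,dt\right]
```

Benchmark suites with an analytic KL presumably check estimates against identities of this form.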

ssm
#186

Fast Bayesian equipment condition monitoring via simulation based inference: applications to heat exchanger health

Evaluations & Benchmarks 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv — Evals & Benchmarks
3.8
I 3.5 Im 3.8 P 4.0

Accurate condition monitoring of industrial equipment requires inferring latent degradation parameters from indirect sensor measurements under uncertainty. While traditional Bayesian methods like Markov Chain Monte Carlo (MCMC) provide rigorous uncertainty quantification, their heavy computational bottlenecks render them impractical for real-time process control. To overcome this limitation, we propose an AI-driven framework utilizing Simulation-Based Inference (SBI) powered by amortized neural posterior estimation to diagnose complex failure modes in heat exchangers. By training neural density estimators on a simulated dataset, our approach learns a direct, likelihood-free mapping from thermal-fluid observations to the full posterior distribution of degradation parameters. We benchmark this framework against an MCMC baseline across various synthetic fouling and leakage scenarios, including challenging low-probability, sparse-event failures. The results show that SBI achieves comparable diagnostic accuracy and reliable uncertainty quantification, while accelerating inference time by a factor of 82$\times$ compared to traditional sampling. The amortized nature of the neural network enables near-instantaneous inference, establishing SBI as a highly scalable, real-time alternative for probabilistic fault diagnosis and digital twin realization in complex engineering systems.
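The likelihood-free simulate-then-compare structure behind SBI can be illustrated with its simplest ancestor, rejection ABC (a sketch of mine, not the paper's amortized neural estimator):

```python
import numpy as np

def rejection_abc(simulator, prior_sample, observed, n=2000, eps=0.05, seed=0):
    """Simplest likelihood-free inference: sample parameters from the prior,
    simulate, and keep parameters whose output lands within eps of the
    observation. Amortized SBI replaces this accept/reject loop with a
    trained neural posterior, but the simulate-then-compare idea is the same."""
    rng = np.random.default_rng(seed)
    kept = []
    for _ in range(n):
        theta = prior_sample(rng)
        if abs(simulator(theta, rng) - observed) < eps:
            kept.append(theta)
    return np.array(kept)
```

The amortization advantage is that a neural posterior pays the simulation cost once, up front, whereas this loop (like MCMC) re-pays it for every new observation.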

evals
#187

Tokenised Flow Matching for Hierarchical Simulation Based Inference

Generative Media 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv cs.AI (Artificial Intelligence) · arXiv — Generative Media / Diffusion
3.8
I 3.2 Im 3.8 P 4.5

The cost of simulator evaluations is a key practical bottleneck for Simulation Based Inference (SBI). In hierarchical settings with shared global parameters and exchangeable site-level parameters and observations, this structure can be exploited to improve simulation efficiency. Existing hierarchical SBI approaches factorise the posterior yet still simulate across multiple sites per training sample; we instead explore likelihood factorisation (LF) to train from single-site simulations. In LF sampling we learn a per-site neural surrogate of the simulator and then assemble synthetic multi-site observations to amortise inference for the full hierarchical posterior. Building on this, we propose Tokenised Flow Matching for Posterior Estimation (TFMPE), a tokenised flow matching approach that supports function-valued observations through likelihood factorisation. To enable systematic evaluation, we introduce a benchmark for hierarchical SBI. We validate TFMPE on this benchmark and on realistic infectious disease and computational fluid dynamics models, finding well-calibrated posteriors while reducing computational cost.

generative_media
#188

Too Sharp, Too Sure: When Calibration Follows Curvature

Frontier LLMs 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML)
3.8
I 3.5 Im 3.8 P 4.0

Modern neural networks can achieve high accuracy while remaining poorly calibrated, producing confidence estimates that do not match empirical correctness. Yet calibration is often treated as a post-hoc attribute. We take a different perspective: we study calibration as a training-time phenomenon on small vision tasks, and ask whether calibrated solutions can be obtained reliably by intervening on the training procedure. We identify a tight coupling between calibration, curvature, and margins during training of deep networks under multiple gradient-based methods. Empirically, Expected Calibration Error (ECE) closely tracks curvature-based sharpness throughout optimization. Mathematically, we show that both ECE and Gauss–Newton curvature are controlled, up to problem-specific constants, by the same margin-dependent exponential tail functional along the trajectory. Guided by this mechanism, we introduce a margin-aware training objective that explicitly targets robust-margin tails and local smoothness, yielding improved out-of-sample calibration across optimizers without sacrificing accuracy.
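ECE, the calibration metric tracked above, is worth pinning down. A standard binned implementation (a common convention; the paper's exact binning may differ):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the bin-frequency-weighted mean of
    |empirical accuracy - mean confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()    # empirical accuracy in the bin
        conf = confidences[mask].mean()  # average stated confidence
        ece += mask.mean() * abs(acc - conf)
    return ece
```

A model that says 0.95 and is right 75% of the time contributes a 0.20 gap weighted by how often it says 0.95; "sharpness tracking ECE" means this gap grows with curvature during training.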

frontier_llm
#189

A Hierarchical MARL-Based Approach for Coordinated Retail P2P Trading and Wholesale Market Participation of DERs

Agents & Tool Use 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv — Reinforcement Learning · arXiv — Agents / Tool Use
3.8
I 3.2 Im 3.8 P 4.5

The ongoing shift towards decentralization of the electric energy sector, driven by the growing electrification across end-use sectors, and widespread adoption of distributed energy resources (DERs), necessitates their active participation in the electricity markets to support grid operations. Furthermore, with bi-directional energy and communication flows becoming standard, intelligent, easy-to-deploy, resource-conservative demand-side participation is expected to play a critical role in securing power grid operational flexibility and market efficiency. This work proposes a market engagement framework that leverages a hierarchical multi-agent deep reinforcement learning (MARL) approach to enable individual prosumers to participate in peer-to-peer retail auctions and further aggregate these intelligent prosumers to facilitate effective DER participation in wholesale markets. Ultimately, a Stackelberg game is proposed to coordinate this hierarchical MARL-based DER market participation framework toward enhanced market performance.

agents
#190

Efficient Symbolic Computations for Identifying Causal Effects

Frontier LLMs 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML)
3.8
I 3.5 Im 3.8 P 4.0

Determining identifiability of causal effects from observational data under latent confounding is a central challenge in causal inference. For linear structural causal models, identifiability of causal effects is decidable through symbolic computation. However, standard approaches based on Gröbner bases become computationally infeasible beyond small settings due to their doubly exponential complexity. In this work, we study how to practically use symbolic computation for deciding rational identifiability. In particular, we present an efficient algorithm that provably finds the lowest degree identifying formulas. For a causal effect of interest, if there exists an identification formula of a prespecified maximal degree, our algorithm returns such a formula in quasi-polynomial time.

frontier_llm
#191

Decentralized Machine Learning with Centralized Performance Guarantees via Gibbs Algorithms

Frontier LLMs 2026-04-22 arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML)
3.8
I 3.5 Im 3.8 P 4.0

In this paper, it is shown, for the first time, that centralized performance is achievable in decentralized learning without sharing the local datasets. Specifically, when clients adopt an empirical risk minimization with relative-entropy regularization (ERM-RER) learning framework and a forward-backward communication between clients is established, it suffices to share the locally obtained Gibbs measures to achieve the same performance as that of a centralized ERM-RER with access to all the datasets. The core idea is that the Gibbs measure produced by client $k$ is used as the reference measure by client $k+1$. This effectively establishes a principled way to encode prior information through a reference measure. In particular, achieving centralized performance in the decentralized setting requires a specific scaling of the regularization factors with the local sample sizes. Overall, this result opens the door to novel decentralized learning paradigms that shift the collaboration strategy from sharing data to sharing the local inductive bias via the reference measures over the set of models.
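The chaining argument can be written schematically (my notation; $\hat L_{D_k}$ is client $k$'s empirical risk and $\lambda_k$ its regularization factor): each client's ERM-RER solution is a Gibbs measure relative to its reference, and using the previous client's posterior as the next reference multiplies the exponential factors:

```latex
dP_k(\theta) \propto \exp\!\Big(-\tfrac{1}{\lambda_k}\,\hat L_{D_k}(\theta)\Big)\, dQ_k(\theta),
\qquad Q_{k+1} := P_k
\;\;\Longrightarrow\;\;
dP_K(\theta) \propto \exp\!\Big(-\sum_{k=1}^{K}\tfrac{1}{\lambda_k}\,\hat L_{D_k}(\theta)\Big)\, dQ_1(\theta).
```

With the $\lambda_k$ scaled to the local sample sizes as the paper requires, the right-hand side matches the centralized ERM-RER Gibbs measure on the pooled data.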

frontier_llm
#192

Geometric Renyi Differential Privacy: Ricci Curvature Characterized by Heat Diffusion Mechanisms

Generative Media 2026-04-22 arXiv stat.ML (Statistical ML) · arXiv — Generative Media / Diffusion
3.8
I 3.5 Im 3.8 P 4.0

In this paper, we develop a novel privacy mechanism for Riemannian manifold-valued data. Our key contribution lies in uncovering unexpected connections among geometric analysis, heat diffusion models, and differential privacy (DP). We characterize the Renyi divergence via dimension-free Harnack inequalities on Riemannian manifolds and establish Renyi differential privacy guarantees governed by Ricci curvature. For manifolds with nonnegative Ricci curvature, we propose a mechanism based on heat diffusion. In contrast, for general manifolds we introduce a Langevin-process-based approach that yields intrinsic mechanisms supporting normalization-free sampling and continuous privacy-utility trade-offs. We derive detailed utility analyses for both mechanisms. As a statistical application, we develop privacy-preserving estimation of the generalized Frechet mean, including nontrivial sensitivity analysis and phase transition characterizations. Numerical experiments further demonstrate the advantages of the proposed DP mechanisms over existing approaches.

generative_media
#193

GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers

Generative Media 2026-04-22 arXiv cs.CV (Computer Vision) · arXiv — Generative Media / Diffusion
3.8
I 3.5 Im 3.8 P 4.0

Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: GeoRelight. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than both sequential models and previous systems that ignored geometry.

generative_media
#196

AI needs a strong data fabric to deliver business value

Agents & Tool Use 2026-04-22 MIT Technology Review — AI
3.8
I 3.2 Im 4.6 P 3.5

Artificial intelligence is moving quickly in the enterprise, from experimentation to everyday use. Organizations are deploying copilots, agents, and predictive systems across finance, supply chains, human resources, and customer operations. By the end of 2025, half of companies used AI in at least three business functions, according to a recent survey. But as AI becomes…

agents
#197

Discrete Preference Learning for Personalized Multimodal Generation

Efficiency 2026-04-22 arXiv — Efficiency (Quantization, MoE, Inference) · arXiv — Generative Media / Diffusion
3.8
I 3.5 Im 3.8 P 4.0

The emergence of generative models enables the creation of texts and images tailored to users' preferences. Existing personalized generative models have two critical limitations: lacking a dedicated paradigm for accurate preference modeling, and generating unimodal content despite real-world multimodal-driven user interactions. Therefore, we propose personalized multimodal generation, which captures modal-specific preferences via a dedicated preference model from multimodal interactions, and then feeds them into downstream generators for personalized multimodal content. However, this task presents two challenges: (1) Gap between continuous preferences from dedicated modeling and discrete token inputs intrinsic to generator architectures; (2) Potential inconsistency between generated images and texts. To tackle these, we present a two-stage framework called Discrete Preference learning for Personalized Multimodal Generation (DPPMG). In the first stage, to accurately learn discrete modal-specific preferences, we introduce a modal-specific graph neural network (a dedicated preference model) to learn users' modal-specific preferences, which are then quantized into discrete preference tokens. In the second stage, the discrete modal-specific preference tokens are injected into downstream text and image generators. To further enhance cross-modal consistency while preserving personalization, we design a cross-modal consistent and personalized reward to fine-tune token-associated parameters. Extensive experiments on two real-world datasets demonstrate the effectiveness of our model in generating personalized and consistent multimodal content.
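The continuous-to-discrete bridging step is essentially codebook quantization. A minimal sketch (illustrative only; the paper's quantizer and codebook learning are more involved):

```python
import numpy as np

def quantize_preferences(pref_vectors, codebook):
    """Map continuous preference embeddings to discrete token ids by
    nearest codebook entry (Euclidean distance)."""
    pref = np.asarray(pref_vectors, dtype=float)   # (n, d) user/modality embeddings
    code = np.asarray(codebook, dtype=float)       # (K, d) learned codebook
    d2 = ((pref[:, None, :] - code[None, :, :]) ** 2).sum(-1)  # (n, K) distances
    return d2.argmin(axis=1)                       # one token id per embedding
```

The resulting token ids can then be injected into the downstream text and image generators as ordinary vocabulary items, which is how the continuous/discrete gap gets closed.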

efficiency
#198
3.8
I 3.5 Im 4.3 P 3.5

Motivation: Protein function prediction is a challenging task and an open problem in computational biology. The Critical Assessment of protein Function Annotation (CAFA) is a triennial, community-driven initiative that provides an independent, large-scale evaluation of computational methods for protein function prediction through time-delayed benchmarking experiments. CAFA has played a key role in highlighting high-performing methodologies and fostering detailed analysis and exchange of ideas. However, outside the periodic CAFA challenges, there is no platform for the continuous evaluation of newly developed methods and tracking performance as function annotations accumulate. Results: Here we introduce the Longitudinal Assessment of Protein Function Annotation Models server (LAFA) as a persistent benchmarking system for protein function prediction methods. LAFA provides a continuous evaluation of containerized function prediction methods, enabling up-to-date and robust comparative assessment of method performance under evolving ground truth. LAFA accelerates methodological iteration, supports reproducibility, and offers a more dynamic and fine-grained view of progress in protein function prediction. Code and Data Availability: LAFA is available at https://functionbench.net/. Detailed evaluation results can be found at https://github.com/anphan0828/CAFA_forever

ssm
#202
3.8
I 3.5 Im 4.4 P 3.5

Phelan’s surprise exit comes a day after he delivered a keynote address at the Sea-Air-Space symposium, in which he touted plans for a next-generation battleship. The post Hung Cao to take over as acting SECNAV after Phelan’s unexpected exit from Trump administration appeared first on DefenseScoop .

gov_defense

The fiscal 2027 budget request calls for billions and billions more for the service. But amid the spring of subsidized hope, there is a cold-snap of congressional despair. The post ‘Best of times, worst of times’ for the Coast Guard, commandant says, amid historic funding and legislative woes appeared first on DefenseScoop .

gov_defense
#204

CISA director pick Sean Plankey withdraws his nomination

Government & Defense 2026-04-22 FedScoop — AI
3.8
I 3.5 Im 4.4 P 3.5

Plankey had been waiting for more than a year, prompting the request to withdraw him as the one tapped to lead an agency now in further upheaval.

gov_defense
#205

NRC gives sneak peek at website modernization progress

Government & Defense 2026-04-22 FedScoop — AI
3.8
I 3.5 Im 4.4 P 3.5

The agency held a public webinar Wednesday for early feedback from stakeholders as the team eyes a summertime launch for the redesigned site.

gov_defense
#206

Treasury canceled Booz contracts over vetting of IRS leaker, Bessent says

Government & Defense 2026-04-22 FedScoop — AI
3.8
I 3.5 Im 4.4 P 3.5

The secretary told senators that Treasury lost “confidence” in Booz years after a contractor shared tax returns with media outlets. That breach took place on government systems, the company noted.

gov_defense
#207
3.8
I 3.5 Im 4.4 P 3.5

In 2024, Paul van Hooft and Tim Sweijs wrote “Two-Theater Tragedy: A Reluctant Europe Cannot Easily Escape a Sino-American War Over Taiwan,” where they argued a war in the Indo-Pacific would likely draw in and weaken Europe, even if European states try to remain on the sidelines. Two years later, amidst heightened tensions in both theaters, we asked Paul to revisit their arguments. In your 2024 article, you argue a U.S.–Chinese war over Taiwan would inevitably draw in and weaken Europe, even if European states try to stay on the sidelines militarily. How does Europe’s recent… The post Europe Might Sit Out In An Indo-Pacific War — But It Can’t Escape the Fallout appeared first on War on the Rocks.

gov_defense
#208

Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation

Robotics 2026-04-22 arXiv — Generative Media / Diffusion
3.8
I 4.0 Im 3.8 P 3.5

Co-manipulation requires multiple humans to synchronize their motions with a shared object while ensuring reasonable interactions, maintaining natural poses, and preserving stable states. However, most existing motion generation approaches are designed for single-character scenarios or fail to account for payload-induced dynamics. In this work, we propose a flow-matching framework that ensures the generated co-manipulation motions align with the intended goals while maintaining naturalness and effectiveness. Specifically, we first introduce a generative model that derives explicit manipulation strategies from the object's affordance and spatial configuration, which guide the motion flow toward successful manipulation. To improve motion quality, we then design an adversarial interaction prior that promotes natural individual poses and realistic inter-person interactions during co-manipulation. In addition, we also incorporate a stability-driven simulation into the flow matching process, which refines unstable interaction states through sampling-based optimization and directly adjusts the vector field regression to promote more effective manipulation. The experimental results demonstrate that our method achieves higher contact accuracy, lower penetration, and better distributional fidelity compared to state-of-the-art human-object interaction baselines. The code is available at https://github.com/boycehbz/StaCOM.

robotics
#209

AVISE: Framework for Evaluating the Security of AI Systems

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language)arXiv cs.AI (Artificial Intelligence)
3.7
I 3.2 Im 3.8 P 4.0

As artificial intelligence (AI) systems are increasingly deployed across critical domains, their security vulnerabilities pose growing risks of high-profile exploits and consequential system failures. Yet systematic approaches to evaluating AI security remain underdeveloped. In this paper, we introduce AVISE (AI Vulnerability Identification and Security Evaluation), a modular open-source framework for identifying vulnerabilities in and evaluating the security of AI systems and models. As a demonstration of the framework, we extend the theory-of-mind-based multi-turn Red Queen attack into an Adversarial Language Model (ALM) augmented attack and develop an automated Security Evaluation Test (SET) for discovering jailbreak vulnerabilities in language models. The SET comprises 25 test cases and an Evaluation Language Model (ELM) that determines whether each test case was able to jailbreak the target model, achieving 92% accuracy, an F1-score of 0.91, and a Matthews correlation coefficient of 0.83. We evaluate nine recently released language models of diverse sizes with the SET and find that all are vulnerable to the augmented Red Queen attack to varying degrees. AVISE provides researchers and industry practitioners with an extensible foundation for developing and deploying automated SETs, offering a concrete step toward more rigorous and reproducible AI security evaluation.
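For readers checking the reported judge numbers, all three metrics derive from a single 2×2 confusion matrix. A small helper (the counts in the usage below are illustrative, not the paper's data):

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, F1, and Matthews correlation coefficient from a 2x2
    confusion matrix (e.g. judge verdicts of jailbreak vs ground truth)."""
    n = tp + fp + fn + tn
    acc = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, f1, mcc
```

MCC is the strictest of the three: unlike accuracy and F1 it uses all four cells, so a judge that over-predicts one class is penalized even when accuracy looks high.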

evals
#210

RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

Evaluations & Benchmarks 2026-04-22 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision)
3.7
I 3.2 Im 3.8 P 4.0

Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.

evals
#211

When the Rules Fail: Tax Incentives and Defense Sustainment

Safety, Policy & Regulation 2026-04-22 War on the Rocks
3.7
I 3.2 Im 4.4 P 3.5

In the 1986 World Cup, Diego Maradona scored his infamous “Hand of God” goal — an obvious handball that went uncalled, because the referee did not have the tools to see it. Today, soccer has addressed that vulnerability with sensors and video review, ensuring the game is adjudicated fairly. U.S. defense industrial policy faces a similar challenge. In America’s defense industrial base, the issue is not a lack of oversight, but distorted incentives that steer work toward private vendors even when the organic industrial base is well-positioned to perform it. When the organic industrial base is deprived of work, it…

safety_policy
#212
3.6
I 3.5 Im 3.8 P 3.5

We present a biologically detailed extension of the classical Hopfield/Marr auto-associative memory model for CA3, implementing ten populations (two asymmetric pyramidal subtypes, eight GABAergic interneuron classes), forty-seven compartments, multi-rule plasticity (recurrent Hebb, BCM anti-saturation, mossy-fiber short-term, endocannabinoid iLTD, burst-gated Hebb), and a bimodal cholinergic encoding/consolidation cycle. Evaluated on pattern completion across auto-associative, associative, and temporal regimes, and on a controlled inhibitory-proportion manipulation at N=256, the full architecture exhibits three qualitative signatures absent from a minimal Hopfield baseline: (i) multi-attractor cross-seed behaviour at K=5 with biologically realistic inhibitory proportions, where two of five seeds converge to positive attractors with margin +0.10 to +0.22 (Cohen's d=0.71, one-sided p=0.08); (ii) target-selective associative recall in paired (A, B) memory at K≥5, where the full model retrieves B from a partial cue of A while the minimal model echoes A (Pearson margin Δ=+0.163 at K=5); (iii) reduced cross-seed variance of the full model below the minimal baseline under clean upstream, with ratios 1.0–3.0. These three signatures are architecture-specific: they appear consistently across independent regimes and are absent from the minimal control.
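The "minimal Hopfield baseline" that anchors the comparison is itself only a few lines: Hebbian outer-product storage plus iterated sign-threshold recall. A sketch (mine; ±1 pattern coding assumed):

```python
import numpy as np

def hopfield_store(patterns):
    """Hebbian outer-product weights for +/-1 patterns, zero self-coupling."""
    P = np.asarray(patterns, dtype=float)   # (K, N)
    W = P.T @ P / P.shape[1]
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, cue, steps=20):
    """Synchronous sign-updates from a partial or corrupted cue until a
    fixed point (attractor) is reached or the step budget runs out."""
    s = np.sign(np.asarray(cue, dtype=float))
    for _ in range(steps):
        nxt = np.sign(W @ s)
        nxt[nxt == 0] = 1.0  # break ties deterministically
        if np.array_equal(nxt, s):
            break            # converged to an attractor
        s = nxt
    return s
```

Pattern completion from a corrupted cue is exactly the behaviour the biologically detailed model is measured against.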

robotics
#213

Neuro-evolutionary stochastic architectures in gauge-covariant neural fields

Research 2026-04-22 arXiv cs.NE (Neural & Evolutionary Computing)
3.6
I 3.5 Im 3.8 P 3.5

We extend our gauge-covariant stochastic neural-field framework by promoting architecture-level parameters to slow stochastic variables evolving in function space. Our effective theory is formulated in terms of classical commuting fields and provides symmetry-constrained diagnostics of marginality and finite-width effects through the maximal Lyapunov exponent, the amplification factor, and dressed spectral kernels. On top of this dynamics, we introduce a Markovian evolutionary scheme compatible with the local $U(1)$ structure of the effective model. By using a minimal implementation, the genotype is reduced to the weight-variance parameter $σ_w^2$, and the fitness functional combines spectral agreement, marginal stability, and a symmetry-constrained critical anchor. Comparing three evolutionary models, we find that only the fully symmetry-constrained Ginibre $U(1)$ version robustly approaches a narrow near-marginal regime and reproduces the predicted low-frequency finite-width spectral behavior. These results support the use of symmetry-guided effective stability diagnostics as practical principles for stochastic architecture search in controlled settings.

research
#214

Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

AI Coding 2026-04-22 arXiv cs.CL (Computation & Language)
3.6
I 3.5 Im 3.8 P 3.5

Modern language models demonstrate impressive coding capabilities in common programming languages (PLs), such as C++ and Python, but their performance in lower-resource PLs is often limited by training data availability. In principle, however, most programming skills are universal across PLs, so the capability acquired in one PL should transfer to others. In this work, we propose the task of zero-shot cross-programming-language transfer for code RL. We find that, for Llama-3.1, RL training for code generation in a source PL fails to improve, and sometimes even degrades, the performance on other target PLs. To address this, we hypothesize that effective RL transfer requires a generalizable SFT initialization before RL. We thus propose **Parallel-SFT**, an SFT strategy that incorporates "parallel programs" -- functionally equivalent code implemented in multiple PLs -- into the data mixture. We demonstrate that this improves transferability: when we subsequently perform RL on our Parallel-SFT model, we observe better generalization to unseen PLs. Analysis of the model internal representations reveals that Parallel-SFT leads to a more functionality-centric latent space, where equivalent programs across PLs are more tightly clustered, which we hypothesize to contribute to the improved transferability.
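
The data-side idea is simple enough to sketch. Below is a hypothetical construction of a Parallel-SFT mixture: one SFT example per (task, language) pair, where functionally equivalent programs share a task description. All names, prompts, and the two-language bank are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical bank of "parallel programs": functionally equivalent solutions
# to the same task implemented in multiple programming languages.
parallel = {
    "reverse_string": {
        "python": "def solve(s):\n    return s[::-1]",
        "lua": "function solve(s) return string.reverse(s) end",
    },
}

def parallel_sft_mixture(bank):
    """Emit one SFT example per (task, PL) pair; equivalent implementations
    share a task name, which is what lets the model align them."""
    examples = []
    for task, impls in bank.items():
        for pl, code in impls.items():
            examples.append({"prompt": f"Solve `{task}` in {pl}.",
                             "completion": code, "task": task, "pl": pl})
    return examples

mix = parallel_sft_mixture(parallel)
print(len(mix))
```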

ai_coding
#215

LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation

Infrastructure 2026-04-22 arXiv cs.CL (Computation & Language)
3.6
I 3.5 Im 3.8 P 3.5

Automatically filling Case Report Forms (CRFs) from clinical notes is challenging due to noisy language, strict output contracts, and the high cost of false positives. We describe our CL4Health 2026 submission for Dyspnea CRF filling (134 items) using a contract-driven two-stage design grounded in Schema-Guided Reasoning (SGR). The key task property is extreme sparsity: the majority of fields are unknown, and official scoring penalizes both empty values and unsupported predictions. We shift from a single-step "LLM predicts 134 fields" approach to a decomposition where (i) Stage 1 produces a stable SGR-style JSON summary with exactly 9 domain keys, and (ii) Stage 2 is a fully deterministic, 0-LLM compiler that parses the Stage 1 summary, canonicalizes item names, normalizes predictions to the official controlled vocabulary, applies evidence-gated false-positive filters, and expands the output into the required 134-item format. On the dev80 split, the best teacher configuration achieves macro-F1 0.6543 (EN) and 0.6905 (IT); on the hidden test200, the submitted English variant scores 0.63 on Codabench. The pipeline is language-agnostic: Italian results match or exceed English with no language-specific engineering.
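
The Stage-2 "0-LLM compiler" idea can be illustrated on a toy schema. The sketch below assumes a 2-key, 5-item schema rather than the real 9-key, 134-item one, and every name, vocabulary entry, and threshold is a hypothetical stand-in; only the shape of the pipeline (canonicalize, vocabulary-normalize, evidence-gate, expand with "unknown" defaults) follows the abstract.

```python
import json

CANONICAL = {"dispnea at rest": "dyspnea_at_rest"}        # name canonicalization
VOCAB = {"dyspnea_at_rest": {"yes", "no", "unknown"}}     # controlled vocabulary
ALL_ITEMS = ["dyspnea_at_rest", "dyspnea_on_exertion", "orthopnea",
             "cough", "wheezing"]                         # toy stand-in for 134 items

def compile_summary(summary_json, min_evidence_chars=10):
    """Stage 2: fully deterministic expansion of a Stage-1 JSON summary."""
    summary = json.loads(summary_json)
    out = {item: "unknown" for item in ALL_ITEMS}         # sparsity default
    for raw_name, pred in summary.items():
        name = CANONICAL.get(raw_name.lower(), raw_name.lower())
        value, evidence = pred["value"], pred.get("evidence", "")
        if len(evidence) < min_evidence_chars:            # evidence-gated FP filter
            continue
        if name in out and value in VOCAB.get(name, {"yes", "no", "unknown"}):
            out[name] = value
    return out

stage1 = json.dumps({"Dispnea at rest": {"value": "yes",
                                         "evidence": "pt short of breath sitting"}})
print(compile_summary(stage1))
```

Because Stage 2 is pure code, reruns are reproducible and the false-positive filters are auditable, which is the point of the decomposition.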

infra
#216

Effects of Cross-lingual Evidence in Multilingual Medical Question Answering

Evaluations & Benchmarks 2026-04-22 arXiv cs.CL (Computation & Language)
3.6
I 3.5 Im 3.8 P 3.5

This paper investigates Multilingual Medical Question Answering across high-resource (English, Spanish, French, Italian) and low-resource (Basque, Kazakh) languages. We evaluate three types of external evidence sources across models of varying size: curated repositories of specialized medical knowledge, web-retrieved content, and explanations drawn from LLMs' parametric knowledge. Moreover, we conduct experiments with multilingual, monolingual and cross-lingual retrieval. Our results demonstrate that larger models consistently achieve superior performance in English across baseline evaluations. When incorporating external knowledge, web-retrieved data in English proves most beneficial for high-resource languages. Conversely, for low-resource languages, the most effective strategy combines retrieval in both English and the target language, achieving comparable accuracy to high-resource language results. These findings challenge the assumption that external knowledge systematically improves performance and reveal that effective strategies depend on both the source of language resources and on model scale. Furthermore, specialized medical knowledge sources such as PubMed are limited: while they provide authoritative expert knowledge, they lack adequate multilingual coverage.

evals
#217
3.6
I 3.5 Im 3.8 P 3.5

Behaviour-Driven Development (BDD) suites accumulate step-text duplication whose maintenance cost is established in prior work. Existing detection techniques require running the tests (Binamungu et al., 2018-2023) or are confined to a single organisation (Irshad et al., 2020-2022), leaving a gap: a purely static, paraphrase-robust, step-level detector usable on any repository. We fill the gap with cukereuse, an open-source Python CLI combining exact hashing, Levenshtein ratio, and sentence-transformer embeddings in a layered pipeline, released alongside an empirical corpus of 347 public GitHub repositories, 23,667 parsed .feature files, and 1,113,616 Gherkin steps. The step-weighted exact-duplicate rate is 80.2 %; the median-repository rate is 58.6 % (Spearman rho = 0.51 with size). The top hybrid cluster groups 20.7k occurrences across 2.2k files. Against 1,020 pairs manually labelled by the three authors under a released rubric (inter-annotator Fleiss' kappa = 0.84 on a 60-pair overlap), we report precision, recall, and F1 with bootstrap 95 % CIs under two protocols: the primary rubric and a score-free second-pass relabelling. The strongest honest pair-level number is near-exact at F1 = 0.822 on score-free labels; the primary-rubric semantic F1 = 0.906 is inflated by a stratification artefact that pins recall at 1.000. Lexical baselines (SourcererCC-style, NiCad-style) reach primary F1 = 0.761 and 0.799. The paper also presents a CDN-structured critique of Gherkin (Cognitive Dimensions of Notations); eight of fourteen dimensions are rated problematic or unsupported. The tool, corpus, labelled pairs, rubric, and pipeline are released under permissive licences.
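
The layered detection pipeline is concrete enough to sketch. The toy version below implements only the first two layers, exact hashing over normalized step text and a near-exact similarity ratio, using `difflib.SequenceMatcher` as a stand-in for the Levenshtein ratio; the sentence-transformer embedding layer is omitted, and the threshold is an illustrative choice, not cukereuse's.

```python
import hashlib
from difflib import SequenceMatcher

def normalize(step):
    return " ".join(step.lower().split())

def layered_duplicates(steps, ratio_threshold=0.9):
    """Layer 1: exact duplicates via hashing of normalized text.
    Layer 2: near-exact pairs between bucket representatives."""
    exact, near = {}, []
    for i, step in enumerate(steps):
        h = hashlib.sha1(normalize(step).encode()).hexdigest()
        exact.setdefault(h, []).append(i)
    reps = [idxs[0] for idxs in exact.values()]   # one representative per bucket
    for a in range(len(reps)):
        for b in range(a + 1, len(reps)):
            r = SequenceMatcher(None, normalize(steps[reps[a]]),
                                normalize(steps[reps[b]])).ratio()
            if r >= ratio_threshold:
                near.append((reps[a], reps[b], round(r, 3)))
    exact_clusters = [idxs for idxs in exact.values() if len(idxs) > 1]
    return exact_clusters, near

steps = ["Given the user is logged in",
         "given the user is  logged in",      # exact after normalization
         "Given the user is logged into",     # near-exact
         "When I click checkout"]
print(layered_duplicates(steps))
```

Running exact hashing first and comparing only bucket representatives is what keeps the quadratic similarity layer tractable on a million-step corpus.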

frontier_llm
#218

Not all ANIMALs are equal: metaphorical framing through source domains and semantic frames

Frontier LLMs 2026-04-22 arXiv cs.CL (Computation & Language)
3.6
I 3.5 Im 3.8 P 3.5

Metaphors are powerful framing devices, yet their source domains alone do not fully explain the specific associations they evoke. We argue that the interplay between source domains and semantic frames determines how metaphors shape understanding of complex issues, and present a computational framework for deriving salient discourse metaphors through their source domains and semantic frames. Applying this framework to climate change news, we uncover not only well-known source domains but also reveal nuanced frame-level associations that distinguish how the issue is portrayed. In analyzing immigration discourse across political ideologies, we demonstrate that liberals and conservatives systematically employ different semantic frames within the same source domains, with conservatives favoring frames emphasizing uncontrollability and liberals choosing neutral or more "victimizing" semantic frames. Our work bridges conceptual metaphor theory and linguistics, providing the first NLP approach for discovery of discourse metaphors and fine-grained analysis of differences in metaphorical framing. Code, data and statistical scripts are available at https://github.com/julia-nixie/ConceptFrameMet.

frontier_llm
#219
3.6
I 3.5 Im 3.8 P 3.5

Long-Horizon (LH) tasks in Human-Scene Interaction (HSI) are complex multi-step tasks that require continuous planning, sequential decision-making, and extended execution across domains to achieve the final goal. However, existing methods rely heavily on skill chaining that concatenates pre-trained subtasks, with environment observations and self-state tightly coupled; they cannot generalize to new combinations of environments and skills and fail to complete varied LH tasks across domains. To solve this problem, this paper presents ALAS, a cross-domain learning framework for LH tasks via biologically inspired dual-stream disentanglement. Inspired by the brain's "where-what" dual pathway mechanism, ALAS comprises two core modules: i) an environment learning module for spatial understanding, which captures object functions, spatial relationships, and scene semantics, achieving cross-domain transfer through complete environment-self disentanglement; ii) a skill learning module for task execution, which processes self-state information including joint degrees of freedom and motor patterns, enabling cross-skill transfer through independent motor pattern encoding. We conducted extensive experiments on various LH tasks in HSI scenes. Compared with existing methods, ALAS achieves an average subtask success rate improvement of 23% and an average execution-efficiency improvement of 29%.

research
#220

Evaluating the pinch capability of a robotic hand is important for understanding its functional dexterity. However, many existing grasp evaluation methods rely on object geometry or contact force models, which limits their applicability during the early stages of robotic hand design. This study proposes a kinematic evaluation method for analyzing pinch configurations of robotic hands based on interactions between fingertip workspaces. First, the reachable workspace of each fingertip is computed from the joint configurations of the fingers. Then, feasible pinch configurations are detected by evaluating the relationships between fingertip pairs. Since the proposed method does not require information about object geometry or contact force models, the pinch capability of a robotic hand can be evaluated solely based on its kinematic structure. In addition, analyses are performed on four different kinematic structures of the hand to investigate their impact on the pinch configurations. The proposed evaluation framework can serve as a useful tool for comparing different robotic hand designs and analyzing pinch capability during the design stage.

robotics
#221
3.6
I 3.5 Im 3.8 P 3.5

Dexterous robotic manipulation requires comprehensive perception across all phases of interaction: pre-contact, contact initiation, and post-contact. Such continuous feedback allows a robot to adapt its actions throughout interaction. However, many existing tactile sensors, such as GelSight and its variants, only provide feedback after contact is established, limiting a robot's ability to precisely initiate contact. We introduce FingerEye, a compact and cost-effective sensor that provides continuous vision-tactile feedback throughout the interaction process. FingerEye integrates binocular RGB cameras to provide close-range visual perception with implicit stereo depth. Upon contact, external forces and torques deform a compliant ring structure; these deformations are captured via marker-based pose estimation and serve as a proxy for contact wrench sensing. This design enables a perception stream that smoothly transitions from pre-contact visual cues to post-contact tactile feedback. Building on this sensing capability, we develop a vision-tactile imitation learning policy that fuses signals from multiple FingerEye sensors to learn dexterous manipulation behaviors from limited real-world data. We further develop a digital twin of our sensor and robot platform to improve policy generalization. By combining real demonstrations with visually augmented simulated observations for representation learning, the learned policies become more robust to object appearance variations. Together, these design aspects enable dexterous manipulation across diverse object properties and interaction regimes, including coin standing, chip picking, letter retrieving, and syringe manipulation. The hardware design, code, appendix, and videos are available on our project website: https://nus-lins-lab.github.io/FingerEyeWeb/

robotics
#222
3.6
I 3.5 Im 3.8 P 3.5

In the design stage of robotic hands, it is not straightforward to quantitatively evaluate the effect of phalanx length ratios on dexterity without defining specific objects or manipulation tasks. Therefore, this study presents a framework for optimizing the phalanx length ratios of a five-finger robotic hand based on potential dexterity within a kinematic structure. The proposed method employs global manipulability, workspace volume, overlap workspace volume, and fingertip sensitivity as evaluation metrics, and identifies optimal design configurations using a weighted objective function under given constraints. The reachable workspace is discretized using a voxel-based representation, and joint motions are discretized at uniform intervals for evaluation. The optimization is performed over design sets for both the thumb and the other fingers, and design combinations that do not generate overlap workspace are excluded. The results show that each phalanx does not contribute equally to the overall dexterity, and the factors influencing each phalanx are identified. In addition, it is observed that the selection of weighting coefficients does not necessarily lead to the direct maximization of individual performance metrics, due to the non-uniform distribution of evaluation measures within the design space. The proposed framework provides a systematic approach to analyze the trade-offs among reachability, dexterity, and controllability, and can serve as a practical guideline for the kinematic design of multi-fingered robotic hands.

robotics
#223

Passive Variable Impedance For Shared Control

Robotics 2026-04-22 arXiv cs.RO (Robotics)
3.6
I 3.5 Im 3.8 P 3.5

Shared control methods often use impedance control to track target poses with a robotic manipulator. The guidance behavior of such controllers is shaped by the stiffness gains used, which can vary over time to achieve adaptive guiding. When multiple target poses are tracked at the same time with varying importance, the corresponding output wrenches have to be arbitrated with weightings that change over time. In this work, we study the stabilization of both variable stiffness in impedance control and the arbitration of different controllers through a scaled addition of their output wrenches, reformulating both into a holistic framework. We identify passivity violations in the closed-loop system and provide methods to passivate the system. The resulting approach can be used to stabilize standard impedance controllers, allowing for the development of novel and flexible shared control methods. We do not constrain the design of stiffness matrices or arbitration factors; both can be matrix-valued, including off-diagonal elements, and can change arbitrarily over time. The proposed methods are furthermore validated in simulation as well as in real robot experiments on different systems, proving their effectiveness and showcasing different behaviors that can be utilized depending on the requirements of the shared control approach.

robotics
#224
3.6
I 3.5 Im 3.8 P 3.5

While Central Pattern Generators (CPGs) and Multi-Layer Perceptrons (MLPs) are widely used paradigms in robot control, few systematic studies have examined the relative merits of large parameter spaces. In contexts where input and output spaces are small and performance is bounded, having more parameters to optimize may actively hinder the learning process instead of empowering it. To measure this empirically, we submit a given robot morphology, with limited proprioceptive capabilities, to controller optimization under two bio-inspired paradigms (CPGs and MLPs) with evolutionary and reinforcement trainer protocols. By varying parameter spaces across multiple reward functions, we observe that shallow MLPs and densely connected CPGs yield better performance than deeper MLPs or Actor-Critic architectures. To account for the relationship between performance and the number of parameters, we introduce a Parameter Impact metric, which demonstrates that the additional parameters required by the reinforcement technique do not translate into better performance, thus favouring evolutionary strategies.

robotics
#225
3.6
I 3.5 Im 3.8 P 3.5

Tactile sensors are increasingly integrated into dexterous robotic manipulators to enhance contact perception. However, learning manipulation policies that rely on tactile sensing remains challenging, primarily due to the trade-off between fidelity and computational cost of soft-body simulations. To address this, we present ETac, a tactile simulation framework that models elastomeric soft-body interactions with both high fidelity and efficiency. ETac employs a lightweight data-driven deformation propagation model to capture soft-body contact dynamics, achieving high simulation quality with the efficiency needed for large-scale policy training. When serving as the simulation backend, ETac produces surface deformation estimates comparable to FEM and demonstrates applicability for modeling real tactile sensors. We then showcase its capability by training a blind grasping policy that leverages large-area tactile feedback to manipulate diverse objects. Running on a single RTX 4090 GPU, ETac supports reinforcement learning across 4,096 parallel environments, achieving a total throughput of 869 FPS. The resulting policy reaches an average success rate of 84.45% across four object types, underscoring ETac's potential to make tactile-based skill learning both efficient and scalable.

robotics
#226
3.6
I 3.5 Im 3.8 P 3.5

The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods improve memory efficiency to near-linear complexity, while assuming that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by introducing CQS Divide, an operation derived from cyclic quorum sets (CQS) theory that decomposes attention into a set of independent subsequence computations whose recomposition yields exactly the same result as full-sequence attention. Exploiting this decomposition, we introduce Stream-CQSA, a memory-adaptive scheduling framework that partitions attention into subproblems that fit within arbitrary memory budgets. This recasts attention from a logically monolithic operation into a collection of schedulable tasks, enabling flexible execution across devices without inter-device communication. Experiments demonstrate predictable memory scaling and show that exact attention over billion-token sequences can be executed on a single GPU via streaming, without changing the underlying mathematical definition of attention or introducing approximation error.
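
The abstract does not spell out the cyclic-quorum scheduling itself, but the property it relies on, that exact attention can be recomposed from independent key/value subproblems, can be sketched with a standard online-softmax block decomposition. This is a generic illustration under that assumption, not Stream-CQSA's actual decomposition; block size and shapes are arbitrary.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Exact softmax attention computed one K/V block at a time, using a
    running (online) softmax so only one block is ever resident."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)          # running row-max of the logits
    z = np.zeros(n)                  # running softmax normalizer
    for s in range(0, K.shape[0], block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        S = Q @ Kb.T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)    # rescale previous partial results
        P = np.exp(S - m_new[:, None])
        out = out * scale[:, None] + P @ Vb
        z = z * scale + P.sum(axis=1)
        m = m_new
    return out / z[:, None]

rng = np.random.default_rng(1)
Q = rng.standard_normal((128, 16)); K = rng.standard_normal((256, 16))
V = rng.standard_normal((256, 16))
S = Q @ K.T / 4.0                    # reference: monolithic attention
ref = np.exp(S - S.max(1, keepdims=True))
ref = (ref / ref.sum(1, keepdims=True)) @ V
print(np.allclose(blockwise_attention(Q, K, V), ref))
```

The per-block subproblems are independent given the small running statistics, which is what makes scheduling them under arbitrary memory budgets possible.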

infra
#227

Gauge-Equivariant Graph Neural Networks for Lattice Gauge Theories

Frontier LLMs 2026-04-22 arXiv cs.LG (Machine Learning)
3.6
I 3.5 Im 3.8 P 3.5

Local gauge symmetry underlies fundamental interactions and strongly correlated quantum matter, yet existing machine-learning approaches lack a general, principled framework for learning under site-dependent symmetries, particularly for intrinsically nonlocal observables. Here we introduce a gauge-equivariant graph neural network that embeds non-Abelian symmetry directly into message passing via matrix-valued, gauge-covariant features and symmetry-compatible updates, extending equivariant learning from global to fully local symmetries. In this formulation, message passing implements gauge-covariant transport across the lattice, allowing nonlocal correlations and loop-like structures to emerge naturally from local operations. We validate the approach across pure gauge, gauge-matter, and dynamical regimes, establishing gauge-equivariant message passing as a general paradigm for learning in systems governed by local symmetry.
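
The core claim, messages that are parallel-transported by link variables so the update commutes with local gauge rotations, can be checked in a toy $U(1)$ setting. The sketch below is an illustrative construction under that assumption, not the paper's network: a single message-passing step whose nonlinearity acts only on the gauge-invariant modulus.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 4
A = rng.integers(0, 2, size=(n, n)).astype(float)        # adjacency
np.fill_diagonal(A, 0)
U = np.exp(1j * rng.uniform(0, 2 * np.pi, size=(n, n)))  # U(1) link phases

def step(h, U, A):
    """One gauge-covariant message-passing step: node features are transported
    along edges by the link phases before aggregation, and the nonlinearity
    rescales only the gauge-invariant modulus."""
    msg = (A * U) @ h
    mod = np.abs(msg)
    return msg * np.tanh(mod) / np.where(mod == 0, 1.0, mod)

# Gauge covariance: rotating node phases by g (and transforming links as
# U_ij -> g_i U_ij g_j*) commutes with the update.
h = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))
g = np.exp(1j * rng.uniform(0, 2 * np.pi, size=n))
U_g = g[:, None] * U * np.conj(g)[None, :]
print(np.allclose(step(g[:, None] * h, U_g, A), g[:, None] * step(h, U, A)))
```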

frontier_llm
#228
3.6
I 3.5 Im 3.8 P 3.5

In streaming platforms, churn is extremely costly, yet A/B tests are typically evaluated using outcomes observed within a limited experimental horizon. Even when both short- and predicted long-term engagement metrics are considered, they may fail to capture how a treatment affects users' retention. Consequently, an intervention may appear beneficial in the short term and neutral in the long term while still generating lower total value than the control due to user churn. To address this limitation, we introduce a method that estimates long-term treatment effects (LTE) and residual lifetime value change ($ΔERLV$) in short multi-cohort A/B tests under user learning. To estimate time-varying treatment effects efficiently, we introduce an inverse-variance weighted estimator that combines estimates across multiple cohorts, reducing variance relative to standard approaches in the literature. The estimated treatment trajectory is then modeled as a parametric decay to recover both the asymptotic treatment effect and the cumulative value generated over time. Our framework enables simultaneous evaluation of steady-state impact and residual user value within a single experiment. Empirical results show improved precision in estimating LTE and $ΔERLV$ and identify scenarios in which relying on either short-term or long-term metrics alone would lead to incorrect product decisions.
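
The inverse-variance weighting step is standard and easy to make concrete. A minimal sketch, with made-up cohort numbers, of pooling per-cohort effect estimates at a given exposure time:

```python
import numpy as np

def ivw_combine(estimates, variances):
    """Inverse-variance weighted combination of cohort-level effect estimates.
    Returns the pooled estimate and its variance (1 / sum of weights), which
    is never larger than the smallest input variance."""
    w = 1.0 / np.asarray(variances, dtype=float)
    est = np.asarray(estimates, dtype=float)
    pooled = np.sum(w * est) / np.sum(w)
    return pooled, 1.0 / np.sum(w)

# Three cohorts measuring the same treatment effect at the same exposure time:
pooled, var = ivw_combine([0.12, 0.08, 0.15], [0.01, 0.04, 0.02])
print(pooled, var)
```

Fitting the parametric decay is then an ordinary curve fit over these pooled per-time estimates.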

evals
#229
3.6
I 3.5 Im 3.8 P 3.5

This paper presents a personalized Battery Electric Vehicle (BEV) energy consumption estimation framework that integrates map-based contextual features with driver-specific velocity prediction and physics-based energy consumption modeling. The system combines route selection, detailed road feature processing, a rule-based reference velocity generator, a PID controller-based vehicle dynamics simulator, and a Bidirectional LSTM model trained to reproduce individual driving behavior. The predicted individual-specific velocity profiles are coupled with a quasi-steady backward energy consumption model to compute tractive power, regenerative braking, and State-of-Charge (SOC) evolution. Evaluation across urban, freeway, and hilly routes demonstrates that the proposed approach captures key driver behavioral patterns such as deceleration at intersections, speed-limit tracking, and road grade-dependent responses, while producing accurate power and SOC trajectories. The results highlight the effectiveness of combining learned driver behavior with map-based context and physics-based energy consumption modeling to produce accurate, personalized BEV SOC depletion profiles.

evals
#230

Generative Flow Networks for Model Adaptation in Digital Twins of Natural Systems

Safety, Policy & Regulation 2026-04-22 arXiv cs.LG (Machine Learning)
3.6
I 3.5 Im 3.8 P 3.5

Digital twins of natural systems must remain aligned with physical systems that evolve over time, are only partially observed, and are typically modeled by mechanistic simulators whose parameters cannot be measured directly. In such settings, model adaptation is naturally posed as a simulation-based inference problem. However, sparse and indirect observations often fail to identify a unique and optimal calibration, leaving several simulator parameterizations compatible with the available evidence. This article presents a GFlowNet-based approach to model adaptation for digital twins of natural systems. We formulate adaptation as a generative modeling problem over complete simulator configurations, so that plausible parameterizations can be sampled with probability proportional to a reward derived from agreement between simulated and observed behavior. Using a controlled environment agriculture case study based on a mechanistic tomato model, we show that the learned policy recovers dominant regions of the adaptation landscape, retrieves strong calibration hypotheses, and preserves multiple plausible configurations under uncertainty.

safety_policy
#231

A weighted angle distance on strings

Evaluations & Benchmarks 2026-04-22 arXiv cs.LG (Machine Learning)
3.6
I 3.5 Im 3.8 P 3.5

We define a multi-scale metric $d_ρ$ on strings by aggregating angle distances between all $n$-gram count vectors with exponential weights $ρ^n$. We benchmark $d_ρ$ in DBSCAN clustering against edit and $n$-gram baselines, give a linear-time suffix-tree algorithm for evaluation, prove metric and stability properties (including robustness under tandem-repeat stutters), and characterize isometries.
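
The metric itself is compact enough to state in code. Below is a naive quadratic-time evaluation of $d_ρ$, summing angle distances between $n$-gram count vectors with weights $ρ^n$; the paper's suffix-tree algorithm computes the same quantity in linear time, which is not attempted here, and the handling of empty count vectors is an illustrative convention.

```python
from collections import Counter
from math import acos, sqrt

def ngram_counts(s, n):
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def angle(u, v):
    """Angle distance (radians) between two sparse n-gram count vectors."""
    if u == v:
        return 0.0
    dot = sum(c * v.get(g, 0) for g, c in u.items())
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    if nu == 0 or nv == 0:
        return acos(0.0)  # convention: empty vs non-empty counts as orthogonal
    return acos(max(-1.0, min(1.0, dot / (nu * nv))))

def d_rho(x, y, rho=0.5):
    """Naive O(|x|·|y|) evaluation of the multi-scale string metric d_rho."""
    return sum(rho ** n * angle(ngram_counts(x, n), ngram_counts(y, n))
               for n in range(1, max(len(x), len(y)) + 1))

print(d_rho("abab", "abab"))   # identical strings: 0.0
print(d_rho("abab", "abba"), d_rho("abab", "zzzz"))
```

Because unigram counts ignore order, strings like "abab" and "abba" are separated only at $n \ge 2$, which the $ρ^n$ weighting down-weights smoothly.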

evals
#232

Amortized Vine Copulas for High-Dimensional Density and Information Estimation

Interpretability 2026-04-22 arXiv cs.LG (Machine Learning)
3.6
I 3.5 Im 3.8 P 3.5

Modeling high-dimensional dependencies while keeping likelihoods tractable remains challenging. Classical vine-copula pipelines are interpretable but can be expensive, while many neural estimators are flexible but less structured. In this work, we propose Vine Denoising Copula (VDC), an amortized vine-copula pipeline that trains a single bivariate denoising model and reuses it across all vine edges. For each edge, given pseudo-observations, the model predicts a density grid. We then apply an IPFP/Sinkhorn projection that enforces non-negativity, unit mass, and uniform marginals. This keeps the exact vine likelihood and preserves the usual copula interpretation while replacing repeated per-edge optimization with GPU inference. Across synthetic and real-data benchmarks, VDC delivers strong bivariate density accuracy, competitive MI/TC estimation, and substantial speedups for high-dimensional vine fitting. In practice, these gains make explicit information estimation and dependence decomposition feasible at scales where repeated vine fitting would otherwise be costly, although conditional downstream inference remains mixed.
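
The projection step is the classical IPFP/Sinkhorn iteration and can be sketched directly: alternately rescale rows and columns of the predicted density grid until both marginals are uniform. Grid size and iteration count below are illustrative; the VDC paper's exact stopping rule is not specified in the abstract.

```python
import numpy as np

def ipfp_uniform(grid, iters=200):
    """Project a non-negative density grid onto (approximately) uniform
    row/column marginals by alternating row and column rescaling
    (IPFP / Sinkhorn). Total mass ends at 1."""
    P = np.maximum(np.asarray(grid, dtype=float), 0.0)   # non-negativity
    m, n = P.shape
    for _ in range(iters):
        P *= (1.0 / m) / P.sum(axis=1, keepdims=True)    # each row -> 1/m
        P *= (1.0 / n) / P.sum(axis=0, keepdims=True)    # each column -> 1/n
    return P

rng = np.random.default_rng(2)
C = ipfp_uniform(rng.random((8, 8)))
print(np.allclose(C.sum(axis=0), 1 / 8), np.allclose(C.sum(axis=1), 1 / 8))
```

Enforcing uniform marginals is exactly what makes the projected grid a valid (discretized) copula density, so the exact vine likelihood survives the neural prediction step.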

interpretability
#233

Towards Certified Malware Detection: Provable Guarantees Against Evasion Attacks

Evaluations & Benchmarks 2026-04-22 arXiv cs.LG (Machine Learning)
3.6
I 3.5 Im 3.8 P 3.5

Machine learning-based static malware detectors remain vulnerable to adversarial evasion techniques, such as metamorphic engine mutations. To address this vulnerability, we propose a certifiably robust malware detection framework based on randomized smoothing through feature ablation and targeted noise injection. During evaluation, our system analyzes an executable by generating multiple ablated variants, classifies them by using a smoothed classifier, and identifies the final label based on the majority vote. By analyzing the top-class voting distribution and the Wilson score interval, we derive a formal certificate that guarantees robustness within a specific radius against feature-space perturbations. We evaluate our approach by comparing the performance of the base classifier and the smoothed classifier on both clean executables and ablated variants generated using PyMetaEngine. Our results demonstrate that the proposed smoothed classifier successfully provides certifiable robustness against metamorphic evasion attacks without requiring modifications to the underlying machine learning architecture.
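
The certification logic rests on a standard statistical tool: the Wilson score lower bound on the top class's vote share. A minimal sketch of majority-vote certification with abstention, using made-up vote counts (the radius computation from the abstract is not reproduced here):

```python
from math import sqrt

def wilson_lower(k, n, z=1.96):
    """Wilson score lower confidence bound for a binomial proportion k/n
    (z=1.96 gives a 95% interval)."""
    if n == 0:
        return 0.0
    p = k / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom

def certified_majority(votes, z=1.96):
    """Return the majority label if the Wilson lower bound on its vote share
    exceeds 1/2; otherwise abstain (return None)."""
    n = len(votes)
    top = max(set(votes), key=votes.count)
    return top if wilson_lower(votes.count(top), n, z) > 0.5 else None

print(certified_majority(["malware"] * 90 + ["benign"] * 10))
print(certified_majority(["malware"] * 55 + ["benign"] * 45))
```

A 90/100 vote certifies; a 55/45 split does not, because its lower bound falls below one half, and the smoothed classifier abstains rather than over-claim robustness.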

evals
#234

Forecasting Individual NetFlows using a Predictive Masked Graph Autoencoder

Frontier LLMs 2026-04-22 arXiv cs.LG (Machine Learning)
3.6
I 3.5 Im 3.8 P 3.5

In this paper, we propose a proof-of-concept Graph Neural Network model that can successfully predict network flow-level traffic (NetFlow) by accurately modelling the graph structure and the connection features. We use sliding windows to split the network traffic into equal-sized heterogeneous bidirectional graphs containing IP, Port, and Connection nodes. We then use the GNN to model the evolution of the graph structure and the connection features. Our approach shows superior results when identifying the Port and IP to which connections attach, while feature reconstruction remains competitive with strong forecasting baselines. Overall, our work showcases the use of GNNs for per-flow NetFlow prediction.

frontier_llm
#235

Participatory provenance as representational auditing for AI-mediated public consultation

Infrastructure 2026-04-22 arXiv cs.AI (Artificial Intelligence)
3.6
I 3.5 Im 3.8 P 3.5

Artificial intelligence is increasingly deployed to synthesize large-scale public input in policy consultations and participatory processes. Yet no formal framework exists for auditing whether these summaries faithfully represent the source population, an accountability gap that existing approaches to AI explainability, grounding and hallucination detection do not address because they focus on output quality rather than input fidelity. Here, participatory provenance is introduced: a measurement framework grounded in optimal transport theory, causal inference and semantic analysis that tracks how individual public submissions are transformed, filtered or lost through AI-mediated summarization. Applied to Canada's 2025-2026 national AI Strategy consultation ($n = 5{,}253$ respondents across two independent policy topics), the framework reveals that both official government summaries underperform a random-participant baseline ($-9.1\%$ and $-8.0\%$ coverage degradation), with $16.9\%$ and $15.3\%$ of participants effectively excluded. Exclusion concentrates in clusters expressing dissent, scepticism and critique of AI ($33$-$88\%$ exclusion rates). Brevity, semantic isolation and rhetorical register independently predict representational outcome. An accompanying open-source interactive tool, the Co-creation Provenance Lab, enables policymakers to audit and iteratively improve summaries, establishing genuine human-in-the-loop oversight at scale.

infra
#236

Fast and Provably Accurate Sequential Designs using Hilbert Space Gaussian Processes

Evaluations & Benchmarks 2026-04-22 arXiv stat.ML (Statistical ML)
3.6
I 3.5 Im 3.8 P 3.5

Gaussian processes are widely used for accurate emulation of unknown surfaces in sequential design of expensive simulation experiments. Integrated mean squared error (IMSE) is an effective acquisition function for sequential designs based on Gaussian processes. However, existing approaches struggle with its implementation because the required integrals often lack closed-form expressions for most kernel functions. We propose a novel and computationally efficient Hilbert space Gaussian process approximation for the IMSE acquisition function, where a truncated eigenbasis representation of the integral enables closed-form evaluation. We establish sharp global non-asymptotic bounds for both the approximation error of isotropic kernels and the resulting error in the acquisition function. In a series of numerical experiments with $γ$-stabilizing, the proposed method achieves substantially lower prediction error and reduced computation time compared to existing benchmarks. These results demonstrate that the proposed Hilbert space Gaussian process framework provides an accurate and computationally efficient approach for Gaussian process based sequential design.

evals
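The abstract's "truncated eigenbasis representation" suggests the standard reduced-rank (Hilbert-space) Gaussian process construction; a hedged sketch of why it makes the IMSE closed-form:

```latex
% Reduced-rank kernel approximation with Laplacian eigenpairs (\phi_j, \lambda_j)
% on the design domain and spectral density S of the stationary kernel:
k(x, x') \;\approx\; \sum_{j=1}^{m} S\!\left(\sqrt{\lambda_j}\right)\,\phi_j(x)\,\phi_j(x').
% The posterior variance then lives in \operatorname{span}\{\phi_j\}, so the IMSE
% integral \int \sigma^2(x)\,dx reduces to terms of the form
% \int \phi_i(x)\,\phi_j(x)\,dx = \delta_{ij}
% by orthonormality of the eigenbasis, giving a closed form for any such kernel.
```

This is the generic construction consistent with the abstract, not a reproduction of the paper's specific bounds.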
#237

Calibrating conditional risk

State Space Models 2026-04-22 arXiv stat.ML (Statistical ML)
3.6
I 3.5 Im 3.8 P 3.5

We introduce and study the problem of calibrating conditional risk, which involves estimating the expected loss of a prediction model conditional on input features. We analyze this problem in both classification and regression settings and show that it is fundamentally equivalent to a standard regression task. For classification settings, we further establish a connection between conditional risk calibration and individual/conditional probability calibration, and develop theoretical insights for the performance metric. This reveals that while conditional risk calibration is related to existing uncertainty quantification problems, it remains a distinct and standalone machine learning problem. Empirically, we validate our theoretical findings and demonstrate the practical implications of conditional risk calibration in the learning to defer (L2D) framework. Our systematic experiments provide both qualitative and quantitative assessments, offering guidance for future research in uncertainty-aware decision-making.

ssm
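The abstract's central equivalence — conditional risk calibration is itself a standard regression task — can be illustrated with a minimal sketch (not the paper's method): freeze a predictor, record its per-example losses, and regress those losses on the inputs.

```python
import numpy as np

# Minimal sketch of the claimed reduction: estimating the conditional risk
# R(x) = E[loss | X = x] of a *frozen* predictor by ordinary regression on
# its observed per-example losses.
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=4000) > 0).astype(float)

pred = (X[:, 0] > 0).astype(float)       # frozen classifier: sign of feature 0
losses = (pred != y).astype(float)       # its per-example 0-1 loss

# Regress the losses on a feature of the input (distance from the boundary).
feats = np.column_stack([np.ones(len(X)), np.abs(X[:, 0])])
w, *_ = np.linalg.lstsq(feats, losses, rcond=None)
risk_hat = feats @ w                     # estimated conditional risk per input

# Inputs near the decision boundary should carry higher estimated risk.
near = risk_hat[np.abs(X[:, 0]) < 0.2].mean()
far = risk_hat[np.abs(X[:, 0]) > 1.0].mean()
print(round(float(near), 3), round(float(far), 3))
```

In a learning-to-defer setting, `risk_hat` is exactly the quantity one thresholds to decide when to defer to a human expert.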
#238

Properties and limitations of geometric tempering for gradient flow dynamics

Research 2026-04-22 arXiv stat.ML (Statistical ML)
3.6
I 3.5 Im 3.8 P 3.5

We consider the problem of sampling from a probability distribution $π$. It is well known that this can be written as an optimisation problem over the space of probability distributions in which we aim to minimise the Kullback--Leibler divergence from $π$. We consider the effect of replacing $π$ with a sequence of moving targets $(π_t)_{t\ge0}$ defined via geometric tempering on the Wasserstein and Fisher--Rao gradient flows. We show that convergence occurs exponentially in continuous time, providing novel bounds in both cases. We also consider popular time discretisations and explore their convergence properties. We show that in the Fisher--Rao case, replacing the target distribution with a geometric mixture of initial and target distribution never leads to a convergence speed up both in continuous time and in discrete time. Finally, we explore the gradient flow structure of tempered dynamics and derive novel adaptive tempering schedules.

research
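For reference, geometric tempering (as standardly defined, matching the abstract's "geometric mixture of initial and target distribution") replaces the fixed target with the moving path

```latex
\pi_t \;\propto\; \pi_0^{\,1-\lambda_t}\,\pi^{\,\lambda_t},
\qquad \lambda_0 = 0, \quad \lambda_t \nearrow 1 \text{ as } t \to \infty,
```

so the Wasserstein or Fisher--Rao gradient flow minimises $\mathrm{KL}(\cdot\,\|\,\pi_t)$ against a moving target; the paper's negative result is that, in the Fisher--Rao case, this mixture never yields a convergence speed-up.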
#239
3.6
I 3.5 Im 3.8 P 3.5

Transformer-based OCR models have shown strong performance on Latin and CJK scripts, but their application to African syllabic writing systems remains limited. We present the first adaptation of TrOCR for printed Tigrinya using the Ge'ez script. Starting from a pre-trained model, we extend the byte-level BPE tokenizer to cover 230 Ge'ez characters and introduce Word-Aware Loss Weighting to resolve systematic word-boundary failures that arise when applying Latin-centric BPE conventions to a new script. The unmodified model produces no usable output on Ge'ez text. After adaptation, the TrOCR-Printed variant achieves 0.22% Character Error Rate and 97.20% exact match accuracy on a held-out test set of 5,000 synthetic images from the GLOCR dataset. An ablation study confirms that Word-Aware Loss Weighting is the critical component, reducing CER by two orders of magnitude compared to vocabulary extension alone. The full pipeline trains in under three hours on a single 8 GB consumer GPU. All code, model weights, and evaluation scripts are publicly released.

evals
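The abstract does not give the exact form of Word-Aware Loss Weighting; one plausible sketch (hypothetical, for illustration) upweights the cross-entropy at tokens that open a new word, which is where the reported word-boundary failures of Latin-centric BPE would concentrate. Byte-level BPE conventionally marks a word start with a leading "Ġ".

```python
import numpy as np

def word_aware_weights(tokens, boundary_weight=3.0):
    """Per-token CE weights: upweight word-initial tokens.
    Hypothetical scheme; the abstract does not specify the formula."""
    w = np.ones(len(tokens))
    for i, t in enumerate(tokens):
        if i == 0 or t.startswith("Ġ"):   # byte-level BPE word-start marker
            w[i] = boundary_weight
    return w

def weighted_ce(logprobs, weights):
    """Weighted negative log-likelihood, normalised by total weight."""
    return float(-(weights * logprobs).sum() / weights.sum())

tokens = ["Ġሰላም", "ታት", "Ġነው"]           # toy Ge'ez token sequence
logp = np.log(np.array([0.6, 0.9, 0.5]))   # toy per-token probabilities
print(round(weighted_ce(logp, word_aware_weights(tokens)), 4))
```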
#240
3.6
I 3.5 Im 3.8 P 3.5

Reinforcement learning (RL) with verifiable rewards (RLVR) has demonstrated great potential for enhancing the reasoning abilities of multimodal large language models (MLLMs). However, reliance on language-centric priors and expensive manual annotations limits MLLMs' intrinsic visual understanding and prevents scalable reward design. In this work, we introduce SSL-R1, a generic self-supervised RL framework that derives verifiable rewards directly from images. To this end, we revisit self-supervised learning (SSL) in visual domains and reformulate widely used SSL tasks into a set of verifiable visual puzzles for RL post-training, requiring neither human nor external-model supervision. Training MLLMs on these tasks substantially improves their performance on multimodal understanding and reasoning benchmarks, highlighting the potential of leveraging vision-centric self-supervised tasks for MLLM post-training. We believe this work provides useful experience in devising effective self-supervised verifiable rewards to enable RL at scale. Project page: https://github.com/Jiahao000/SSL-R1.

post_training
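One classic SSL task of the kind the abstract describes is a jigsaw puzzle, and it yields a verifiable reward with no human or model supervision: the ground-truth permutation is known by construction. A minimal sketch (an assumed example, not necessarily one of SSL-R1's tasks):

```python
import numpy as np

def make_jigsaw(img, grid=2, rng=None):
    """Split an image into grid×grid patches, shuffle them, and return the
    shuffled image plus the ground-truth permutation (the verifiable answer)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape[0] // grid, img.shape[1] // grid
    patches = [img[i*h:(i+1)*h, j*w:(j+1)*w] for i in range(grid) for j in range(grid)]
    perm = rng.permutation(grid * grid)
    rows = [np.concatenate([patches[p] for p in perm[r*grid:(r+1)*grid]], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0), perm

def jigsaw_reward(predicted_perm, true_perm):
    """Verifiable RL reward: fraction of patch positions placed correctly."""
    return float(np.mean(np.asarray(predicted_perm) == np.asarray(true_perm)))

img = np.arange(16.0).reshape(4, 4)
shuffled, perm = make_jigsaw(img, grid=2, rng=np.random.default_rng(0))
print(jigsaw_reward(perm, perm), jigsaw_reward([0, 1, 2, 3], perm))
```

In an RLVR loop, the MLLM would be shown `shuffled` and asked to output the permutation; `jigsaw_reward` scores it exactly, with no annotator or judge model in the loop.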
#241

R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs

Evaluations & Benchmarks 2026-04-22 arXiv cs.CV (Computer Vision)
3.6
I 3.5 Im 3.8 P 3.5

Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., the claim of nonexistent objects in the visual input. To address this challenge, we propose Region-aware Chain-of-Verification (R-CoV), a visual chain-of-verification method to alleviate object hallucinations in LVLMs in a post-hoc manner. Motivated by how humans comprehend intricate visual information -- often focusing on specific image regions or details within a given sample -- we elicit such region-level processing from LVLMs themselves and use it as a chaining cue to detect and alleviate their own object hallucinations. Specifically, our R-CoV consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation. As a simple yet effective method, R-CoV can be seamlessly integrated into various LVLMs in a training-free manner and without relying on external detection models. Extensive experiments on several widely used hallucination benchmarks across multiple LVLMs demonstrate that R-CoV can significantly alleviate object hallucinations in LVLMs. Project page: https://github.com/Jiahao000/R-CoV.

evals
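The six steps can be sketched as a training-free chain over a single vision-language model. Here `vlm` is a hypothetical `(prompt, image) -> str` callable and the prompts are illustrative, not the paper's.

```python
# Minimal sketch of the six R-CoV steps as a post-hoc, training-free chain.
def r_cov(vlm, image, question):
    draft = vlm(f"Answer: {question}", image)                            # 1. initial response
    entities = vlm(f"List the objects mentioned in: {draft}", image)     # 2. entity extraction
    boxes = vlm(f"Give a bounding box for each of: {entities}", image)   # 3. coordinate generation
    regions = vlm(f"Describe the image region in each box: {boxes}",
                  image)                                                 # 4. region description
    verdicts = vlm(f"Which objects in {entities} are absent, given "
                   f"these region descriptions: {regions}?", image)      # 5. verification execution
    return vlm(f"Rewrite '{draft}' removing hallucinated objects "
               f"per: {verdicts}", image)                                # 6. final response

# Usage with a stub model that just echoes a prompt prefix:
final = r_cov(lambda p, img: p[:40], image=None, question="What is on the table?")
print(type(final).__name__)
```

The key design point is that step 3 keeps everything inside the LVLM itself — no external detector — which is what lets the method wrap any model without retraining.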
#242
3.6
I 3.5 Im 3.8 P 3.5

In low-resource settings, blind-sweep ultrasound provides a practical and accessible method for identifying fetal growth restriction. However, unlike freehand ultrasound, which is subjectively controlled, detecting the biometry plane in blind-sweep ultrasound is more challenging due to the uncontrolled fetal structures being observed and the variety of oblique planes in the scan. In this work, we propose a structure-augmented system to detect the fetal abdomen plane, in which the abdominal structure is highlighted using a segmentation prior. Since standard planes emerge gradually, the decision boundary of the keyframes is unstable to predict. We therefore aggregate the structure-augmented planes with a temporal sliding window to help stabilise keyframe localisation. Extensive results indicate that the structure-augmented temporal sliding strategy significantly improves and stabilises the detection of anatomically meaningful planes, enabling more reliable biometric measurements in blind-sweep ultrasound.

research
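The temporal aggregation step can be sketched as a centered moving average over per-frame plane scores; a single noisy high-scoring frame no longer wins, while a sustained run of high scores does. A minimal sketch with assumed toy scores (the paper's actual aggregation may differ):

```python
import numpy as np

def smooth_keyframe_scores(scores, window=5):
    """Aggregate per-frame plane scores with a centered sliding window
    (simple moving average), stabilising the keyframe decision boundary."""
    pad = window // 2
    padded = np.pad(scores, pad, mode="edge")   # repeat edge frames at the ends
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

scores = np.array([0.1, 0.9, 0.2, 0.8, 0.85, 0.9, 0.3, 0.1])  # noisy per-frame scores
smoothed = smooth_keyframe_scores(scores, window=3)
# Raw argmax is frame 1 (an isolated spike); after aggregation the keyframe
# moves into the sustained high-scoring run.
print(int(np.argmax(smoothed)))  # prints 4
```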
#243
3.6
I 3.5 Im 3.8 P 3.5

This study investigates the impact of face image background correction through segmentation on face recognition and morphing attack detection performance in realistic, unconstrained image capture scenarios. The motivation is driven by operational biometric systems such as the European Entry/Exit System (EES), which require facial enrolment at airports and other border crossing points where controlled backgrounds usually required for such captures cannot always be guaranteed, as well as by accessibility needs that may necessitate image capture outside traditional office environments. By analyzing how such preprocessing steps influence both recognition accuracy and security mechanisms, this work addresses a critical gap between usability-driven image normalization and the reliability requirements of large-scale biometric identification systems. Our study evaluates a comprehensive range of segmentation techniques, three families of morphing attack detection methods, and four distinct face recognition models, using databases that include both controlled and in-the-wild image captures. The results reveal consistent patterns linking segmentation to both recognition performance and face image quality. Additionally, segmentation is shown to systematically influence morphing attack detection performance. These findings highlight the need for careful consideration when deploying such preprocessing techniques in operational biometric systems.

evals
#244

RefAerial: A Benchmark and Approach for Referring Detection in Aerial Images

Evaluations & Benchmarks 2026-04-22 arXiv cs.CV (Computer Vision)
3.6
I 3.5 Im 3.8 P 3.5

Referring detection aims to locate the target referred to by a natural-language expression, and has recently attracted growing research interest. However, existing datasets are limited to ground images with large objects centered in relatively small scenes. This paper introduces a large-scale, challenging dataset for referring detection in aerial images, termed RefAerial. It is distinguished from conventional ground referring detection datasets by 4 characteristics: (1) low but diverse object-to-scene ratios, (2) numerous targets and distractors, (3) complex and fine-grained referring descriptions, (4) diverse and broad scenes in the aerial view. We also develop a human-in-the-loop referring expansion and annotation engine (REA-Engine) for efficient semi-automated referring-pair annotation. Besides, we observe that existing ground referring detection approaches exhibit serious performance degradation on our aerial dataset due to the intrinsic scale-variation issue within and across aerial images. Therefore, we further propose a novel scale-comprehensive and sensitive (SCS) framework for referring detection in aerial images. It consists of a mixture-of-granularity (MoG) attention and a two-stage comprehensive-to-sensitive (CtS) decoding strategy. Specifically, the mixture-of-granularity attention is developed for scale-comprehensive target understanding, while the two-stage comprehensive-to-sensitive decoding strategy is designed for coarse-to-fine referring-target decoding. The proposed SCS framework achieves remarkable performance on our aerial referring detection dataset and even a promising performance boost on conventional ground referring detection datasets.

evals
#245
3.6
I 3.5 Im 3.8 P 3.5

We propose a new approach for the second stage of a practical two-stage Optical Music Recognition (OMR) pipeline. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.

audio
#248

The Download: introducing the 10 Things That Matter in AI Right Now

Industry 2026-04-22 MIT Technology Review — AI
3.6
I 3.5 Im 3.8 P 3.5

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology. Introducing: 10 Things That Matter in AI Right Now What actually matters in AI right now? It’s getting harder to tell amid the constant launches, hype, and warnings. To cut through…

industry
#249

3 things Michelle Kim is into right now

Reinforcement Learning 2026-04-22 MIT Technology Review — AI
3.6
I 3.5 Im 3.8 P 3.5

Isegye Idol If you thought K-pop was weird, virtual idols—humans who perform as anime-style digital characters via motion capture—will blow your mind. My favorite is a girl group called Isegye Idol, created by Woowakgood, a Korean VTuber (a streamer who likewise performs as a digital persona). Isegye Idol’s six members are anonymous, which seems to…

rl
#250

One town’s scheme to get rid of its geese

Industry 2026-04-22 MIT Technology Review — AI
3.6
I 3.5 Im 3.8 P 3.5

“Pull over!” I order my brother one sunny February afternoon. Our target is in sight: a gaggle of Canada geese, pecking at grass near the dog park. As I approach, tiptoeing over their grayish-white poop, I notice that one bird wears a white cuff around its slender black neck. It’s a GPS tracker—part of a…

industry
#251

There is no nature anymore

Industry 2026-04-22 MIT Technology Review — AI
3.6
I 3.5 Im 3.8 P 3.5

When people talk about “nature,” they’re generally talking about things that aren’t made by human beings. Rocks. Reefs. Red wolves. But while there is plenty of God’s creation to go around, it is hard to think of anything on Earth that human hands haven’t affected. In the Brazilian rainforest, scientists have found microplastics in the…

industry
#252

Los Angeles is finally going underground

Industry 2026-04-22 MIT Technology Review — AI
3.6
I 3.5 Im 3.8 P 3.5

Los Angeles deserves its reputation as the quintessential car city—the rhythms of its 2,200 square miles are dictated by wide boulevards and concrete arcs of freeways. But it once had a world-class rail transit system, and for the last three decades, the city has been rebuilding a network of trolleys and subways. In May, a…

industry

We present a robust and accurate numerical method for the anisotropic diffusion equation in curvilinear coordinates. This study extends the recent work [Muir et al., Computer Physics Communications, 2025] for solving the anisotropic diffusion equation in magnetic fields from Cartesian meshes to curvilinear coordinates and complex geometries. The method uses summation by parts with simultaneous approximation terms for computing the diffusion perpendicular to field lines. The diffusion along field lines is computed using a penalty approach, similar to a simultaneous approximation term, but applied across the volume. To extend the method to complex geometry we use a multi-block approach with piecewise smooth structured meshes. That is, the domain is split into sub-grids, with locally adjacent boundaries coupled weakly using penalties. We prove the semi-discrete stability of the curvilinear implementation by deriving discrete energy estimates. The approach is verified through a number of numerical tests, which demonstrate the convergence properties of the method in the multi-domain approach. Finally, we present a qualitative result in a complex geometry and magnetic field generated by the Stepped Pressure Equilibrium Code.

generative_media
#254

Synthetic Flight Data Generation Using Generative Models

State Space Models 2026-04-22 arXiv — Post-training / Alignment
3.6
I 3.5 Im 3.8 P 3.5

The increasing adoption of synthetic data in aviation research offers a promising solution to data scarcity and confidentiality challenges. This study investigates the potential of generative models to produce realistic synthetic flight data and evaluates their quality through a comprehensive four-stage assessment framework. The need for synthetic flight data arises from their potential to serve as an alternative to confidential real-world records and to augment rare events in historical datasets. These enhanced datasets can then be used to train machine learning models that predict critical events, such as flight delays, cancellations, diversions, and turnaround times. Two generative models, Tabular Variational Autoencoder (TVAE) and Gaussian Copula (GC), are adapted to generate synthetic flight information and compared based on their ability to preserve statistical similarity, fidelity, diversity, and predictive utility. Results indicate that while GC achieves higher statistical similarity and fidelity, its computational cost hinders its applicability to large datasets. In contrast, TVAE efficiently handles large datasets and enables scalable synthetic data generation. The findings demonstrate that synthetic data can support flight delay prediction models with accuracy comparable to those trained on real data. These results pave the way for leveraging synthetic flight data to enhance predictive modeling in air transportation.

ssm
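The Gaussian Copula half of the comparison can be sketched from scratch: map each column to normal scores through its empirical CDF, fit a correlation matrix, sample jointly, and map back through empirical quantiles. A minimal sketch with assumed toy columns, not the paper's pipeline:

```python
import numpy as np
from scipy import stats

def fit_sample_gaussian_copula(data, n_samples, rng):
    """Minimal Gaussian-copula synthesizer for tabular data."""
    n, d = data.shape
    # Probability-integral transform: ranks -> uniforms -> normal scores.
    u = (stats.rankdata(data, axis=0) - 0.5) / n
    z = stats.norm.ppf(u)
    corr = np.corrcoef(z, rowvar=False)
    # Sample the dependence structure, then invert each marginal empirically.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u_new = stats.norm.cdf(z_new)
    return np.column_stack([np.quantile(data[:, j], u_new[:, j]) for j in range(d)])

rng = np.random.default_rng(0)
real = np.column_stack([rng.gamma(2.0, size=1000),       # e.g. delay minutes
                        rng.normal(50, 10, size=1000)])  # e.g. turnaround time
fake = fit_sample_gaussian_copula(real, 1000, rng)
print(np.round(np.corrcoef(real.T)[0, 1], 2), np.round(np.corrcoef(fake.T)[0, 1], 2))
```

The O(d²) correlation fit plus per-column quantile inversion is also where the computational cost the abstract mentions comes from on wide, large tables, which is what motivates the TVAE alternative.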
#255

Second-order topology in two-dimensional azulenoid kekulene carbon lattices

Efficiency 2026-04-22 arXiv — Efficiency (Quantization, MoE, Inference)
3.6
I 3.5 Im 3.8 P 3.5

The discovery of higher-order topological insulator (HOTI) has established a new paradigm for understanding symmetry-constrained boundary electronic states. Here, based on first-principles calculations, we demonstrate the emergence of HOTI phase in organic lattices of two-dimensional azulenoid-kekulene-type carbon allotropes, namely AKC-[3,3] and AKC-[6,0]. Enabled by the $C_6$ rotational symmetry, the nontrivial bulk topology is confirmed through the topological invariant and fractionally quantized corner charge, giving $\{[M^{(I)}_{2}],[K^{(3)}_{2}]\}$ = $\{0,2\}$ and $Q_{\mathrm{corner}} = e/3$, respectively, accompanied by the emergence of exotic corner states in nanoflakes. Notably, the structural modifications are explored, revealing that in the derived structure PAK-[6,0], whose corner-localized states are preserved, highlighting the robustness of the higher-order topological phase. These findings highlight azulenoid-kekulene-based carbon allotropes as a promising platform to explore the interplay between structural design, crystalline symmetry, and higher-order topological boundary responses in two dimensional carbon systems.

efficiency
#256

CSI Feedback Under Basis Mismatch: Rate-Splitting Transform Coding for FDD Massive MIMO

Efficiency 2026-04-22 arXiv — Efficiency (Quantization, MoE, Inference)
3.6
I 3.5 Im 3.8 P 3.5

In frequency division duplex massive multiple-input multiple-output systems, downlink channel state information must be fed back within a limited uplink budget. While transform coding with Karhunen-Loeve transform and reverse water-filling is rate-distortion optimal for Gaussian channels, its performance is limited by basis mismatch between the user and base station. We analyze this mismatch and propose a practical architecture separating long-term basis feedback from short-term coefficient quantization. Using a random vector quantization, we derive a closed-form end-to-end mean square error expression. This allows us to characterize the optimal rate split and identify a phase transition threshold for basis updates. Simulations on correlated Gaussian and COST2100 channels demonstrate near-optimal performance, robustness to update overhead, and significant complexity reduction compared to deep-learning-based autoencoders.

efficiency
#257

A Search for Rotation Measure Flare Candidates in Repeating Fast Radio Bursts

Research 2026-04-22 arXiv — Mechanistic Interpretability
3.6
I 3.5 Im 3.8 P 3.5

Fast radio bursts (FRBs) are millisecond-duration extragalactic radio transients of unknown origin. Rotation measures (RMs) probe their local magneto-ionic environments and provide important clues to their nature. While RM variability has been observed in several repeating FRBs, it is typically gradual or stochastic. Recently, observations of FRB~20220529 revealed an abrupt RM excursion followed by rapid recovery on week-long timescales, termed an ``RM flare'', suggesting a potentially distinct form of RM variability associated with localized magnetized plasma. In this work, we perform a systematic search for RM flare candidates in repeating FRBs with multi-epoch RM measurements. Using a $3σ$ significance threshold, we identify two candidates with multiple observational epochs (FRB~20121102A and FRB~20201124A) and two additional single-epoch candidates (FRB~20180916B), in addition to the event in FRB~20220529A. Our results suggest that RM flares, if confirmed, may not be rare among repeating FRBs and point to highly dynamic magnetized environments local to the sources. Future high-cadence polarimetric observations, particularly following the discovery of RM excursions, will be essential for confirming these candidates and constraining their physical origin.

research
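The 3σ candidate search can be sketched as flagging epochs whose RM deviates from a robust baseline by more than three combined error units. This is an illustrative detector with toy numbers, not the paper's exact criterion:

```python
import numpy as np

def rm_flare_candidates(rm, rm_err, n_sigma=3.0):
    """Flag epochs whose RM deviates from the median baseline by more than
    n_sigma, combining measurement error with robust intrinsic scatter (MAD)."""
    baseline = np.median(rm)
    mad = 1.4826 * np.median(np.abs(rm - baseline))   # robust scatter estimate
    sigma = np.sqrt(rm_err**2 + mad**2)               # per-epoch combined sigma
    return np.abs(rm - baseline) / sigma > n_sigma

# Toy series: a flat baseline near 100 rad/m^2 with one flare-like excursion.
rm = np.array([100., 102., 99., 101., 100., 180., 101., 100.])
err = np.full(8, 2.0)
print(np.flatnonzero(rm_flare_candidates(rm, err)))  # prints [5]
```

The median/MAD baseline keeps the flare itself from inflating the scatter estimate, which a plain mean/std would do.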
#258
3.6
I 3.5 Im 3.8 P 3.5

The solar interior is probed by the properties of the Sun's acoustic oscillations (p-modes) observed on the solar surface. The frequencies of these p-modes measured over the last three decades show long-term variation similar to the 11-year cyclic behaviour exhibited by the 10.7 cm radio flux, sunspot numbers and other solar activity indices. It is also now established that the cyclic behaviour of some solar proxies is connected with geomagnetic activity and has implications for space weather. Hence, in recent years efforts have been made using machine-learning methods to forecast these solar proxies with a view to improving our understanding of space weather. Developing a comparable method for forecasting p-mode frequency shifts is therefore of interest for two reasons. Firstly, it will facilitate future investigations into their potential role in tracing energy drivers from the Sun's interior to the geospace response by improving models linking solar interior dynamics to coronal and heliospheric plasma conditions. In other words, it will help establish a more robust and quantitative link between the Sun's interior and its exterior. Secondly, it may provide an independent or early indicator of the ascending and descending phases of solar activity, which might be useful for space weather forecasting. In this article, we develop and apply standard time-series analysis and machine-learning based methods to characterise p-mode frequency shifts for the remaining solar cycle 25.

research
#259
3.6
I 3.5 Im 3.8 P 3.5

Post-quantum cryptographic (PQC) accelerators for ML-KEM (FIPS 203) and ML-DSA (FIPS 204) rely on pipelined Number Theoretic Transform (NTT) stages over $\mathbb{Z}_q$. Our prior work established structural dependency analysis at scale [1] and quantified the security margin of partial NTT masking [2]. Whether per-stage arithmetic masking guarantees pipeline-level security had no prior machine-checked answer for the r-bearing case: composition frameworks (ISW, t-SNI, PINI, DOM) were formalized exclusively for Boolean masking over $\mathrm{GF}(2)$; no proof assistant artifact addresses the NTT butterfly over $\mathbb{Z}_q$. We present three machine-checked results in Lean 4 with Mathlib, all zero sorry. First, we close a stated limitation of prior work: value-independence implies constant marginal distribution under fresh randomness (via an algebraic MutualInfoZero proxy). Second, butterfly per-context uniformity: for any Cooley-Tukey butterfly with fresh output mask over $\mathbb{Z}/q\mathbb{Z}$ ($q > 0$), each output wire has exactly one mask value producing each output, a uniform marginal independent of secrets, universal over all moduli, twiddle factors, and inputs. Third, a k-stage NTT pipeline with fresh per-stage masking satisfies per-context uniformity at every stage under the ISW first-order probing model. We document a named warning: pointwise value-independence is false for butterfly outputs. The Adams Bridge accelerator (CHIPS Alliance Caliptra) fails the fresh masking hypothesis, masking active only in INTT round 0, architecturally explaining its structural insecurity. Artifact: nine theorems, 1,738 build jobs, zero sorry. Composition for nonlinear gadgets (Barrett) is addressed in forthcoming manuscripts proving Barrett's PF-PINI(2) satisfaction ('one-bit barrier') [3] and k-stage composition for PF-PINI gadgets under fresh-mask renewal [4].

interpretability
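The core per-context uniformity fact — for a fixed secret, a uniformly random fresh mask gives a uniform, secret-independent masked value — reduces to a bijection on $\mathbb{Z}/q\mathbb{Z}$. A one-line Lean 4/Mathlib sketch of that reduction (not the paper's artifact):

```lean
-- Sketch: for a fixed secret s, the map m ↦ s + m is a bijection on ZMod q,
-- so each output value is hit by exactly one mask value, yielding a uniform
-- marginal independent of s. (Not the paper's formalization.)
example (q : ℕ) (s : ZMod q) :
    Function.Bijective (fun m : ZMod q => s + m) :=
  (Equiv.addLeft s).bijective
```

The paper's contribution is composing this per-wire fact through the butterfly structure and k pipelined stages under the ISW probing model, which is far beyond this one-liner.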
#260

Irreducible Gravitational Wave Background as a Particle Detector

Research 2026-04-22 arXiv — Mechanistic Interpretability
3.6
I 3.5 Im 3.8 P 3.5

We show that spectral features of primordial gravitational-wave backgrounds (GWB) can directly reconstruct \textit{Lagrangian} parameters of beyond-the-Standard-Model (BSM) particles, for any transient gravitational-wave production mechanism, independent of the specific source of gravitational waves. Sufficiently long-lived particles generically induce a temporary period of early matter domination in the thermal history of the Universe, which imprints two characteristic frequencies in any primordial GWB corresponding to the onset and end of this epoch. These frequencies are determined by the initial abundance, mass, and decay rate of the species. Once the underlying model and initial abundance are specified, the observed spectral features directly determine the particle mass and decay rate. We find that gravitational-wave observations probe regions of parameter space both complementary to and far beyond the reach of upcoming laboratory searches for long-lived particles. Remarkably, frequencies in the nanohertz band, where a stochastic signal has recently been reported by pulsar timing arrays, map directly onto decay lengths accessible in upcoming long-lived-particle (LLP) searches.

research
#261
3.6
I 3.5 Im 3.8 P 3.5

When a quantum electronic device is coupled to an electrical resonator, admittance changes of the quantum subsystem may be detected. The effective reactance may include capacitive and inductive terms that incorporate geometric, quantum, and tunneling components; while the effective resistance may be composed of Sisyphus and Hermes terms linked to relaxation and decoherence, respectively. Such reflectometry is usually studied when all characteristic times of the quantum system are much shorter than the resonator's period, in which case only stationary quantum states are probed. We present a rigorous description of a driven-dissipative qudit-resonator system. Our approach demonstrates how to strictly introduce quantum and tunneling capacitances as well as Hermes and Sisyphus resistances, and how these values are modified when the dynamics of the subsystems becomes mutually dependent. We present the cases of a Cooper-pair box, a single-Cooper-pair transistor, a double quantum dot, and a single-electron box. Our approach can be applied to describe any quantum system coupled to any classical resonator.

research
#262

Probing QCD instantons using jet correlation observables in proton-proton collisions at the LHC

Interpretability 2026-04-22 arXiv — Mechanistic Interpretability
3.6
I 3.5 Im 3.8 P 3.5

Discovery of instantons in colliders will provide experimental evidence for the topological properties of the QCD vacuum. In this work, we propose jet correlation observables that can unambiguously discriminate between instanton-induced processes and perturbative hard scattering events in pp collisions at LHC energies. By calculating the instanton sizes and their separations in 2+1 flavor QCD with physical quark masses, we provide constraints on the center-of-mass energies of the produced hadrons in an instanton-induced process. Our proposal is directly applicable for future ep measurements at the Electron-Ion Collider, offering a cleaner environment to probe instanton-induced processes.

interpretability
#263

Interaction-induced asymmetry in infinite-temperature dynamical correlations of hard-core anyons

Infrastructure 2026-04-22 arXiv — Mechanistic Interpretability
3.6
I 3.5 Im 3.8 P 3.5

We study dynamical correlations of interacting hard-core anyons on a one-dimensional lattice at infinite temperature. This is a setting in which the many-body spectrum is independent of the statistical phase $θ$, while dynamical correlators remain sensitive to $θ$ through nonlocal Jordan-Wigner strings. We compute single-particle Green's functions, spectral functions, and density-density correlators, thereby separating the effects of fractional statistics on one-body coherence from those on density transport in a maximally mixed ensemble. In the noninteracting case $V=0$, high-temperature averaging leads to inversion-symmetric Green's functions for all $θ$ despite the presence of anyonic strings. Finite nearest-neighbor interactions $V$ generate, however, a pronounced left-right asymmetry in the Green's functions for $0<θ<π$, with the strongest chirality appearing at intermediate couplings $V\sim J$ where interactions and hopping compete most effectively. In this regime, the Green's function decays exponentially in time with a statistical-angle-dependent decay rate. At strong coupling, the dynamics crosses over to an atomic-limit regime in which the dependence on $θ$ is reduced. Here the Green's function decays universally as $t^{-1}$ and the corresponding spectral function displays a three-band structure. In contrast, density-density correlations are insensitive to statistics and recover the known infinite-temperature transport regimes of the XXZ chain, including ballistic, superdiffusive and diffusive behaviours. These results identify dynamical correlation functions as direct probes of fractional statistics in high-entropy quantum systems.

infra

High-energy photonuclear ($γ+A$) scattering in ultra-peripheral heavy-ion collisions provides a unique probe of nuclear structure. This Letter studies the dependence of $γ+A$ jet production in ultra-peripheral Pb+Pb collisions at $\sqrt{s_{_\text{NN}}} = 5.02$ TeV on the presence of forward neutron emission from either nucleus. The data was taken in 2018 with the ATLAS detector at the LHC and corresponds to an integrated luminosity of $1.72$ nb$^{-1}$. The kinematics of the hard $γ+A$ processes, expressed via the particle-level photon ($z_{-}$) or nuclear parton ($x_{+}$) momentum fractions, are determined from $R = 0.4$ jets reconstructed using the anti-$k_t$ algorithm. At lower $z_{-}$, where the non-diffractive component dominates, the nuclear parton distribution can be cleanly probed in collisions that leave the struck nucleus essentially intact. Such collisions are expected to probe larger impact parameters ($b_\text{A}$) within the target. The shape of the $γ+A$ cross-section as a function of $x_{+}$ in such collisions is found to differ from that in $γ+A$ collisions accompanied by forward neutron emission, with an observed significance of $6.0σ$. These results are consistent at large $x_{+}$ with large $b_\text{A}$ collisions exhibiting no modifications to the parton distributions that are usually observed in hard scattering processes involving nuclei, relative to collisions with smaller $b_\text{A}$. Thus, these measurements provide an experimental observation that the modifications to nuclear parton distributions vary with impact parameter.

research
#265

Reshaping the inner shadow of a Kerr black hole by a torn accretion disk

Research 2026-04-22 arXiv — Mechanistic Interpretability
3.6
I 3.5 Im 3.8 P 3.5

When an accretion flow extends to the event horizon, their intersection defines the contour of the inner shadow. However, the morphological evolution of this critical feature remains largely unexplored within a torn accretion disk system, a configuration comprising distinct sub-disks formed when a tilted disk is disrupted by frame-dragging. To address this, we phenomenologically construct a torn accretion disk model and numerically simulate the inner shadow of a Kerr black hole using relativistic backward ray-tracing. We discover that the torn disk geometry profoundly alters the black hole's observational signatures, inducing severe erosion of the inner shadow and generating novel features such as bifurcated shadows, crescent-like structures, and multiple orders of shadow rings. These exotic morphologies, which are predominantly governed by the spatial discontinuity between the sub-disks and the tilt angle of the outer sub-disk, are exceedingly difficult to replicate within standard equatorial accretion paradigms. Our findings demonstrate that these distinctive shadow structures hold significant potential to serve as robust diagnostic probes for torn accretion environments, simultaneously implying that relying solely on the inner shadow to test gravity theories is fundamentally insufficient.

research
#266
3.6
I 3.5 Im 3.8 P 3.5

The nonlinear response coefficient, $χ_{4,22}$, is a crucial observable for probing the dynamical properties of the quark-gluon plasma (QGP). While traditionally understood as a signature of medium response, recent studies suggest that $χ_{4,22}$ also encapsulates critical information regarding the intrinsic initial-state configuration of the colliding nuclei. In this study, we utilize A Multi-Phase Transport (AMPT) model to investigate the microscopic origin and stage-by-stage development of $χ_{4,22}$ in $^{238}$U+$^{238}$U and $^{197}$Au+$^{197}$Au collisions at $\sqrt{s_{\rm NN}} = 200$ GeV. By tracking the flow observables through the partonic cascade, quark coalescence, and hadronic rescattering phases, we map the translation of initial geometric eccentricities into final-state momentum anisotropies. Our results demonstrate that the absolute magnitude of $χ_{4,22}$ increases continuously during the collective expansion, confirming its nature as a dynamically generated medium response. However, the comparative ratio of this coefficient between the U+U and Au+Au systems is stable across all evolutionary stages within statistical uncertainties. This indicates that the ratio approximately cancels complex evolutionary dynamics to isolate intrinsic geometric correlations present at the initial state. These findings provide compelling theoretical support and crucial insights for recent experimental efforts aiming to extract high-order nuclear structure, such as hexadecapole deformation, using nonlinear flow observables.

interpretability
#267

Prospects of boosted magnetic dipole inelastic fermion dark matter at ILC-BDX

Evaluations & Benchmarks 2026-04-22 arXiv — Mechanistic Interpretability
3.6
I 3.5 Im 3.8 P 3.5

In this work, we investigate the projected sensitivity of the Beam-Dump eXperiment at the International Linear Collider (ILC-BDX) to inelastic fermionic dark matter coupled to the Standard Model photon through an off-diagonal magnetic dipole operator. We compute the production rate of dark matter states in the bremsstrahlung-like process $e^- N \to e^- N γ^* (\to χ_{1} \barχ_0)$, induced by the scattering of high-energy electrons on target nuclei. The resulting boosted dark matter fluxes are then propagated to the detector, where the signal events arise from scattering off detector electrons. The projected exclusion limits are derived using the expected number of electrons on target (a typical rate of $4.0~\times~10^{21}/\mbox{year}$) for 1 year and 10 years of data taking. To characterize the impact of inelasticity, we consider two benchmark relative mass splittings, $Δ=0.05$ and $Δ=0.001$, motivated by thermal dark matter scenarios. Our results show that ILC-BDX can probe inelastic magnetic-dipole dark matter over a phenomenologically relevant region of parameter space.

evals
#268

Spectral Fluctuation-Dissipation-Response Inequalities

Generative Media 2026-04-22 arXiv — Mechanistic Interpretability
3.6
I 3.5 Im 3.8 P 3.5

We derive spectral fluctuation--dissipation--response inequalities for finite-state Markov jump processes. By comparing the causal susceptibility to its passive equilibrium reference, we establish frequency-resolved and frequency-integrated inequalities that bound their mismatch in terms of the steady-state entropy production rate, probe variance, short-time perturbation diffusion, and reversible relaxation timescales. Our bounds exactly recover the standard fluctuation--dissipation theorem at equilibrium and apply directly to measurable causal susceptibilities, providing experimentally testable thermodynamic limits on FDT breakdown in driven steady states.

generative_media
#269

Polaron transport and Verwey transition in magnetite

AI for Science 2026-04-22 arXiv — AI for Science
3.6
I 3.5 Im 3.8 P 3.5

The enigmatic puzzle of the Verwey transition in magnetite Fe$_3$O$_4$ has been unresolved for almost a century. We present an ab initio-based model of polaron transport combining kinetic Monte Carlo and molecular dynamics calculations to directly describe the coupling of polarons with lattice vibrations. Contrary to the Ihle-Lorentz small-polaron model, we find no significant change in the band structure across the Verwey transition; however, trimeron hopping is observed. The proposed model yields a dc conductivity in agreement with experimental data across the Verwey transition.
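The kinetic Monte Carlo ingredient of such a transport model can be illustrated with a generic hopping loop. This is a minimal sketch of the standard KMC algorithm only; the `kmc_hop_trajectory` name, the rate catalogue, and the state labels are hypothetical stand-ins, not the paper's ab initio model:

```python
import math
import random

def kmc_hop_trajectory(rates, steps=1000, seed=0):
    """Generic kinetic Monte Carlo loop of the kind used to model polaron
    hopping: at each step, pick a hop with probability proportional to its
    rate, then advance time by an exponentially distributed waiting time
    whose mean is the inverse of the total escape rate.
    rates: dict mapping state -> list of (next_state, rate)."""
    rng = random.Random(seed)
    state = next(iter(rates))          # start in the first listed state
    t = 0.0
    for _ in range(steps):
        events = rates[state]
        total = sum(r for _, r in events)
        # Select one hop, weighted by its rate.
        u = rng.random() * total
        acc = 0.0
        for nxt, r in events:
            acc += r
            if u <= acc:
                state = nxt
                break
        # Exponential waiting time; 1 - random() avoids log(0).
        t += -math.log(1.0 - rng.random()) / total
    return state, t
```

With symmetric unit rates between two sites, the elapsed time after N hops averages N over the total rate, as expected for a Poisson hopping process.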

ai_science
#270
3.6
I 3.5 Im 3.8 P 3.5

The co-segregation of impurities in multicomponent alloys has been widely recognized as an effective strategy for tailoring material properties. However, quantitative predictions of co-segregation behavior remain a significant challenge for alloy design in systems containing multiple solute species. In this work, we develop an extended dual-solute (DS) segregation framework to quantitatively predict co-segregation behavior with solute-solute interactions, including both homoatomic and heteroatomic contributions. A machine-learning workflow is first established to predict pairwise segregation energies and construct the DS segregation energy spectra that intrinsically include solute-solute interactions. The resulting spectral information is then utilized to determine the upper and lower bounds of segregation for individual solutes. When applied to magnesium-based multicomponent systems constructed by alloying Mg with any two of the 11 candidate solute species, the extended DS segregation framework is successfully validated by hybrid molecular dynamics/Monte Carlo simulations and experimental results available in the existing literature. Furthermore, we introduce a design strategy to promote co-segregation by incorporating additional solute species that exhibit attractive interactions with existing solutes, thereby enabling enhanced segregation even in the presence of strong site competition. These results underscore the critical role of solute-solute interactions in governing co-segregation behavior and provide a predictive pathway for the design and optimization of multicomponent alloys.

ai_science
#271
3.6
I 3.5 Im 3.8 P 3.5

We employ the exact factorization of a multi-component wavefunction to analyze the dynamics of interacting photons, electrons and nuclei. We consider physical situations emerging in the regime of strong coupling between light and molecular electronic excitations, giving rise to so-called molecular polaritons. Nonadiabatic molecular dynamics techniques, routinely used in the field of chemical physics, have often been employed to simulate photophysical and photochemical phenomena in the presence of molecular polaritons. In this work, we analyze the foundations of these techniques in light of the exact factorization and assess their performance on illustrative model studies.

ai_science
#272
3.6
I 3.5 Im 3.8 P 3.5

The exact microscopic origin, symmetry, and thermal melting mechanism of the charge density wave (CDW) phase in TiSe$_{2}$ remain a subject of intense debate, particularly regarding the presence of chiral structural order and a multi-step phase transition. Here, we resolve the finite-temperature structural dynamics of monolayer TiSe$_{2}$ using large-scale molecular dynamics simulations driven by an accurate, first-principles-trained machine-learning interatomic potential. We demonstrate that the CDW melting deviates from a conventional second-order phase transition and instead undergoes a two-step melting process, characterised by an extended fluctuation regime between $T^{\ast}\approx200$ K and $T_{\mathrm{CDW}}\approx250$ K, with proliferation of topological defects and domain walls, and accompanied by a completely overdamped soft optical phonon. Furthermore, we reveal that anisotropic long-wavelength thermal fluctuations spontaneously stabilise an asymmetric $3Q$ chiral CDW order with $C2$ symmetry. These findings provide a unified microscopic framework for understanding complex fluctuation-driven phase transitions in 2D quantum materials, demonstrating that the intricate CDW physics of TiSe$_{2}$ can be largely captured without invoking excitonic correlations.

ai_science
#273

Aligning Stuttered-Speech Research with End-User Needs: Scoping Review, Survey, and Guidelines

Audio & Speech 2026-04-22 arXiv cs.CL (Computation & Language)
3.5
I 3.2 Im 3.8 P 3.5

Atypical speech is receiving greater attention in speech technology research, but much of this work unfolds with limited interdisciplinary dialogue. For stuttered speech in particular, it is widely recognised that current speech recognition systems fall short in practice, and current evaluation methods and research priorities are not systematically grounded in end-user experiences and needs. In this work, we analyse these gaps through 1) a scoping review of papers that deal with stuttered speech and 2) a survey of 70 stakeholders, including adults who stutter and speech-language pathologists. By analysing these two perspectives, we propose a taxonomy of stuttered-speech research, identify where current research directions diverge from the needs articulated by stakeholders, and conclude by outlining concrete guidelines and directions towards addressing the real needs of the stuttering community.

audio
#274

Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees

Infrastructure 2026-04-22 arXiv cs.LG (Machine Learning)
3.5
I 3.2 Im 3.8 P 3.5

Self-consistency boosts inference-time performance by sampling multiple reasoning traces in parallel and voting. However, in constrained domains like math and code, this strategy is compute-inefficient because it samples with replacement, repeatedly revisiting the same high-probability prefixes and duplicate completions. We propose Distinct Leaf Enumeration (DLE), a deterministic decoding method that treats truncated sampling as traversal of a pruned decoding tree and systematically enumerates distinct leaves instead of sampling with replacement. This strategy improves inference efficiency in two ways. Algorithmically, it increases coverage of the truncated search space under a fixed budget by exploring previously unvisited high-probability branches. Systemically, it reuses shared prefixes and reduces redundant token generation. Empirically, DLE explores higher-quality reasoning traces than stochastic self-consistency, yielding better performance on math, coding, and general reasoning tasks.
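The core idea of enumerating distinct leaves of a truncated decoding tree can be sketched as a best-first traversal: because negative log-probabilities are nonnegative edge costs, a priority queue pops complete sequences in order of decreasing probability, with no duplicates by construction. The `enumerate_leaves` helper and the toy per-position token distributions below are illustrative assumptions, not the authors' implementation:

```python
import heapq
import math

def enumerate_leaves(step_probs, top_k=2, max_leaves=4):
    """Deterministically enumerate distinct complete sequences (leaves) of a
    truncated decoding tree in decreasing-probability order.
    step_probs: list of dicts token -> prob, one per position (a toy
    stand-in for a conditional language model)."""
    # Heap entries: (cumulative -log prob, sequence prefix).
    heap = [(0.0, ())]
    leaves = []
    while heap and len(leaves) < max_leaves:
        neg_lp, prefix = heapq.heappop(heap)
        depth = len(prefix)
        if depth == len(step_probs):            # complete sequence: a leaf
            leaves.append((prefix, math.exp(-neg_lp)))
            continue
        # Expand only the top_k highest-probability branches (tree pruning).
        dist = sorted(step_probs[depth].items(), key=lambda kv: -kv[1])[:top_k]
        for tok, p in dist:
            heapq.heappush(heap, (neg_lp - math.log(p), prefix + (tok,)))
    return leaves
```

Each popped leaf is a distinct path through the pruned tree, so repeated sampling of the same high-probability completion never occurs, which is the compute saving the abstract describes.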

infra
#275

Centering Ecological Goals in Automated Identification of Individual Animals

Evaluations & Benchmarks 2026-04-22 arXiv cs.AI (Artificial Intelligence)
3.5
I 3.2 Im 3.8 P 3.5

Recognizing individual animals over time is central to many ecological and conservation questions, including estimating abundance, survival, movement, and social structure. Recent advances in automated identification from images and even acoustic data suggest that this process could be greatly accelerated, yet their promise has not translated well into ecological practice. We argue that the main barrier is not the performance of the automated methods themselves, but a mismatch between how those methods are typically developed and evaluated, and how ecological data is actually collected, processed, reviewed, and used. Future progress, therefore, will depend less on algorithmic gains alone than on recognizing that the usefulness of automated identification is grounded in ecological context: it depends on what question is being asked, what data are available, and what kinds of mistakes matter. Only by centering these questions can we move toward automated identification of individuals that is not only accurate but also ecologically useful, transparent, and trustworthy.

evals
#276

The Origin of Edge of Stability

Safety, Policy & Regulation 2026-04-22 arXiv stat.ML (Statistical ML)
3.5
I 3.2 Im 3.8 P 3.5

Full-batch gradient descent on neural networks drives the largest Hessian eigenvalue to the threshold $2/η$, where $η$ is the learning rate. This phenomenon, the Edge of Stability, has resisted a unified explanation: existing accounts establish self-regulation near the edge but do not explain why the trajectory is forced toward $2/η$ from arbitrary initialization. We introduce the edge coupling, a functional on consecutive iterate pairs whose coefficient is uniquely fixed by the gradient-descent update. Differencing its criticality condition yields a step recurrence with stability boundary $2/η$, and a second-order expansion yields a loss-change formula whose telescoping sum forces curvature toward $2/η$. The two formulas involve different Hessian averages, but the mean value theorem localizes each to the true Hessian at an interior point of the step segment, yielding exact forcing of the Hessian eigenvalue with no gap. Setting both gradients of the edge coupling to zero classifies fixed points and period-two orbits; near a fixed point, the problem reduces to a function of the half-amplitude alone, which determines which directions support period-two orbits and on which side of the critical learning rate they appear.
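For background, the $2/\eta$ threshold itself is visible in the quadratic case; the short derivation below is standard textbook material, not the paper's argument:

```latex
% Gradient descent on the quadratic f(x) = \tfrac{\lambda}{2} x^2 iterates
\[
  x_{t+1} \;=\; x_t - \eta f'(x_t) \;=\; (1 - \eta\lambda)\, x_t ,
\]
% so the iterates contract if and only if
\[
  |1 - \eta\lambda| < 1
  \quad\Longleftrightarrow\quad
  0 < \lambda < \frac{2}{\eta}.
\]
% The Edge of Stability is the empirical observation that training drives the
% largest Hessian eigenvalue \lambda_{\max} to exactly this boundary, 2/\eta;
% the paper's contribution is explaining why the trajectory is forced there.
```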

safety_policy
#277

Online Survival Analysis: A Bandit Approach under Cox PH Model

Research 2026-04-22 arXiv stat.ML (Statistical ML)
3.5
I 3.2 Im 3.8 P 3.5

Survival analysis is a widely used statistical framework for modeling time-to-event data under censoring. Classical methods, such as the Cox proportional hazards (Cox PH) model, offer a semiparametric approach to estimating the effects of covariates on the hazard function. Despite its importance, survival analysis has been largely unexplored in online settings, particularly within the bandit framework, where decisions must be made sequentially to optimize treatments as new data arrive over time. In this work, we take an initial step toward integrating survival analysis into a purely online learning setting under the Cox PH model, addressing key challenges including staggered entry, delayed feedback, and right censoring. We adapt three canonical bandit algorithms to balance exploration and exploitation, with theoretical guarantees of sublinear regret bounds. Extensive simulations and semi-real experiments using SEER cancer data demonstrate that our approach enables rapid and effective learning of near-optimal treatment policies.
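The Cox PH building block that such an online method must repeatedly evaluate is the partial log-likelihood: a sum over observed events of the covariate score minus the log of the risk-set normalizer. The sketch below (single covariate, no handling of tied event times, hypothetical `cox_partial_loglik` name) is illustrative only, not the authors' algorithm:

```python
import math

def cox_partial_loglik(times, events, x, beta):
    """Cox proportional-hazards partial log-likelihood for one covariate:
    sum over subjects i with an observed event of
        x_i * beta - log( sum_{j in risk set} exp(x_j * beta) ),
    where the risk set is every subject still under observation at t_i,
    i.e. {j : t_j >= t_i}. Censored subjects enter only through risk sets."""
    ll = 0.0
    n = len(times)
    for i in range(n):
        if not events[i]:                 # censored observation
            continue
        risk = sum(math.exp(x[j] * beta) for j in range(n)
                   if times[j] >= times[i])
        ll += x[i] * beta - math.log(risk)
    return ll
```

At beta = 0 every subject contributes equally to its risk set, so each event term reduces to -log(risk-set size), which gives a quick hand check.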

research
#278

Exploring High-Order Self-Similarity for Video Understanding

Robotics 2026-04-22 arXiv cs.CV (Computer Vision)
3.5
I 3.2 Im 3.8 P 3.5

Space-time self-similarity (STSS), which captures visual correspondences across frames, provides an effective way to represent temporal dynamics for video understanding. In this work, we explore higher-order STSS and demonstrate how STSSs at different orders reveal distinct aspects of these dynamics. We then introduce the Multi-Order Self-Similarity (MOSS) module, a lightweight neural module designed to learn and integrate multi-order STSS features. It can be applied to diverse video tasks to enhance motion modeling capabilities while consuming only marginal computational cost and memory usage. Extensive experiments on video action recognition, motion-centric video VQA, and real-world robotic tasks consistently demonstrate substantial improvements, validating the broad applicability of MOSS as a general temporal modeling module. The source code and checkpoints will be publicly available.
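The first-order version of space-time self-similarity is just cosine similarity between the feature at each (frame, position) and the features at all other frames and positions. The `space_time_self_similarity` helper below is a toy sketch of that idea under assumed tensor shapes; the paper's MOSS module additionally learns and fuses higher-order variants:

```python
import numpy as np

def space_time_self_similarity(feats, eps=1e-8):
    """First-order STSS: cosine similarity between every pair of
    (frame, position) feature vectors.
    feats: array of shape (T, N, C) with T frames, N spatial positions,
    C channels. Returns a (T, N, T, N) similarity tensor."""
    # L2-normalize channel vectors so dot products become cosine similarities.
    f = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + eps)
    return np.einsum("tnc,smc->tnsm", f, f)
```

The output is symmetric under swapping the two (frame, position) pairs, and each vector's similarity with itself is 1, which makes the tensor easy to sanity-check.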

robotics
#279
3.5
I 3.2 Im 3.8 P 3.5

Generative AI (GenAI) is reshaping software engineering by shifting development from manual coding toward agent-driven implementation. While vibe coding promises rapid prototyping, it often suffers from architectural drift, limited traceability, and reduced maintainability. Applying the design science research (DSR) methodology, this paper proposes Shift-Up, a framework that reinterprets established software engineering practices, like executable requirements (BDD), architectural modeling (C4), and architecture decision records (ADRs), as structural guardrails for GenAI-native development. Preliminary findings from our exploratory evaluation compare unstructured vibe coding, structured prompt engineering, and the Shift-Up approach in the development of a web application. These findings indicate that embedding machine-readable requirements and architectural artifacts stabilizes agent behavior, reduces implementation drift, and shifts human effort toward higher-level design and validation activities. The results suggest that traditional software engineering artifacts can serve as effective control mechanisms in AI-assisted development.

interpretability
#280

Polytropic stellar wind models with strongly localized heating

Research 2026-04-22 arXiv — Mechanistic Interpretability
3.5
I 3.2 Im 3.8 P 3.5

Polytropic models of stellar winds remain useful tools because they allow for a simple description of the energy balance of the expanding plasma without explicitly specifying potentially complex energy transport processes such as heat conduction or extended wave heating. Among recent applications to stellar winds and to the solar wind was a study of the consequences of strongly localized heating in the latter, possibly due to acoustic waves. Such 'nonuniform' heating can result from a time- and space-localized damping of wave modes and allows, as an extreme case, an adiabatic expansion of particular wind streams outside the heating region. The present study generalizes the modeling from the first analytical and numerical studies, which were limited to this extreme case, towards a more realistic non-adiabatic behaviour. The additional energy due to heating is demonstrated to lie in a plausible range in view of typical flare energies and to be low compared to the gravitational energy of the plasma in this region. The corresponding solutions may be of interest for stellar winds in general and for the solar wind in particular, especially in view of recent Parker Solar Probe observations that revealed strongly varying wind streams and the presence of acoustic waves near the Sun. Potential observational evidence for the solar wind is discussed.

research
#281

Symplectic connection third-order Hall effect in a room-temperature ferromagnet

Robotics 2026-04-22 arXiv — Mechanistic Interpretability
3.5
I 3.2 Im 3.8 P 3.5

Third-order nonlinear Hall effects (THE) have recently attracted considerable experimental interest as powerful probes of quantum geometric properties in emergent quantum materials, encompassing quadrupole moments of the quantum metric and Berry curvature. Here, we report a fundamentally new THE in the room-temperature van der Waals ferromagnet Fe3GaTe2 arising from the second-order Berry connection polarizability, which manifests a higher-order characterization of band geometry called the symplectic connection. Our observations show that the third-order transverse response in Fe3GaTe2 is odd under magnetization reversal, vanishes above the Curie temperature, and remains independent of the driving current direction. Scaling-law analysis combined with first-principles calculations establishes this response as the symplectic-connection-induced THE. This discovery opens the door to probing higher-order quantum geometric properties beyond the Berry curvature and quantum metric through nonlinear transport, unveiling the potential of exploring nonlinear Hall phenomena in broad classes of magnets without breaking inversion symmetry. Moreover, the room-temperature manipulation of the THE holds promise for device applications that harness the quantum-geometric connection structure.

robotics
#282

Hallucination Early Detection in Diffusion Models

Generative Media 2026-04-22 arXiv — Generative Media / Diffusion
3.5
I 3.2 Im 3.8 P 3.5

Text-to-Image generation has seen significant advancements in output realism with the advent of diffusion models. However, diffusion models encounter difficulties when tasked with generating multiple objects, frequently resulting in hallucinations where certain entities are omitted. While existing solutions typically focus on optimizing latent representations within diffusion models, the relevance of the initial generation seed is often underestimated. Although using various seeds in multiple iterations can improve results, this method also significantly increases time and energy costs. To address this challenge, we introduce HEaD+ (Hallucination Early Detection +), a novel approach designed to identify incorrect generations early in the diffusion process. The HEaD+ framework integrates cross-attention maps and textual information with a novel input, the Predicted Final Image. The objective is to assess whether to proceed with the current generation or restart it with a different seed, thereby exploring multiple generation seeds while conserving time. HEaD+ is trained on the newly created InsideGen dataset of 45,000 generated images, each generated from a prompt containing up to seven objects. Our findings demonstrate a 6-8% increase in the likelihood of achieving a complete generation (i.e., an image accurately representing all specified subjects) with four objects when applying HEaD+ alongside existing models. Additionally, HEaD+ reduces generation times by up to 32% when aiming for a complete image, enhancing the efficiency of generating complete and accurate object representations relative to leading models. Moreover, we propose an integrated localization module that predicts object centroid positions and verifies pairwise spatial relations (if requested by the user) at an intermediate timestep, gating generation together with object presence to further improve relation-consistent outcomes.
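The control flow described above reduces to an early-restart loop: run the sampler to an intermediate timestep, ask a detector whether the generation is on track, and move to the next seed if not. The sketch below is a minimal illustration of that loop; `run_step`, `looks_complete`, and the state tuple are hypothetical stand-ins for a denoising step and the trained HEaD+ detector, not the paper's interface:

```python
def generate_with_early_restart(run_step, looks_complete, seeds,
                                total_steps=50, check_step=10):
    """Early-detection sampling loop: for each candidate seed, run the
    sampler; at check_step, consult the detector and abandon the seed if a
    hallucination (e.g. a missing object) is predicted. Returns the first
    seed that survives the check along with its final state."""
    for seed in seeds:
        state = ("init", seed)
        aborted = False
        for t in range(total_steps):
            state = run_step(state, t)
            if t == check_step and not looks_complete(state):
                aborted = True      # predicted hallucination: drop this seed
                break
        if not aborted:
            return seed, state      # completed all denoising steps
    return None, None               # every candidate seed was rejected
```

The saving comes from the break at `check_step`: a rejected seed costs only a fraction of a full generation, which is where the reported time reduction originates.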

generative_media
#283
3.5
I 3.2 Im 3.8 P 3.5

Conical intersections are central to the description of photophysics and photochemistry. Nevertheless, in non-adiabatic molecular dynamics simulations, they are fundamentally challenging for single-reference electronic structure methods. Density functional theory (DFT) and its time-dependent extension (TDDFT) represent the most widely used theoretical approaches in physics, chemistry, and biology. However, the treatment of ground and excited states as separate problems leads to breakdowns in the topological structure of potential energy surfaces near conical intersections. In this work, we solve this long-standing issue by presenting Convex DFT (CVX-DFT), a framework that, by explicitly enforcing convexity of the variational problem within an appropriately defined subspace, guarantees a unique and continuous electronic solution across regions of degeneracies. We demonstrate that CVX-DFT yields smooth and physically meaningful intersection seams by comparison with reference methods, such as multireference wave function methods. In this way, we establish the method as a robust and computationally efficient DFT approach for treating electronically degenerate regions. These developments represent a critical step toward reliable non-adiabatic simulations beyond the limitations of conventional TDDFT.

ai_science
Items: 283
Multi-source: 89
Long-form (≥7.5): 4
Sources OK / attempted: 64 / 72
Top category: Evaluations & Benchmarks (52)