Wolf Digest — 2026-04-21

#1

AI2 releases BAR: modular post-training via MoE merge from independently-trained experts

Post-Training 2026-04-20 AI2 (Allen Institute)arXiv cs.LGarXiv AgentsarXiv RLarXiv Efficiency

8.4

I 8.0 Im 8.8 P 8.0

AI2 released BAR, short for Branched Adapters and Routers — a recipe for post-training language models one capability at a time by training domain experts independently on disjoint data slices, then merging the resulting experts into a single mixture-of-experts model for inference. The released arXiv paper, titled Train Separately, Merge Together, frames the method as an answer to a practical problem that has been gnawing at industrial post-training pipelines for more than a year. When a lab wants to upgrade a single capability — say, adding a new long-context SFT set, improving a math reasoning track, or patching a safety behavior — the conventional recipe requires rerunning the full post-training pipeline against the joint objective, because supervised fine-tuning and reinforcement-learning-from-human-feedback objectives interact in ways that make isolated updates regress prior capabilities. BAR decouples the problem: each capability is trained on its own branch using lightweight adapters, each branch is routed to its own expert slot, and the resulting merged MoE model exposes all capabilities at inference without ever having computed the cross-capability joint loss. AI2 reports that this allows independent expert upgrading in production — the practical win is that a team working on the math track does not have to coordinate with the team working on safety or agentic tool use to ship an update. The paper is open, and AI2 confirms the recipe is being applied to the Olmo model family. The importance lies less in a headline benchmark number and more in the operational claim: most large labs have accumulated dozens of post-training datasets, and their current merge strategy is a combination of data mixing, per-task LoRA, and careful rehearsal — none of which scale cleanly. If BAR's claims hold under replication, this is the kind of infrastructure paper that changes how open-weight model families are maintained over time. Early community reaction has centered on whether the MoE routing overhead offsets the avoided compute for joint retraining, and on how BAR's expert isolation interacts with constitutional-AI-style objectives that intentionally span many capability axes.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, Efficiency, RL, cs.LG — indicating the contribution spans multiple research subfields.'}
{'source': 'AI2 Blog', 'text': 'AI2 published an accompanying blog post framing the contribution as infrastructure for modular post-training.'}

moepost-trainingrlhfsftallenaiolmo

#2

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

Evaluations & Benchmarks 2026-04-20 HF +8 arXiv cs.AIarXiv AgentsarXiv EvalsarXiv cs.CLHF Daily Papers

Xirui Li, Ming Li, Derry Xu, Wei-Lin Chiang +3

8.1

I 6.9 Im 6.9 P 10.0

ClawEnvKit is a toolkit for automatically generating training environments for claw-shaped robotic agents — the classic prehensile-manipulator category used in warehouse picking, assembly, and home-task research. The contribution is procedural environment synthesis: instead of hand-authoring task scenes or relying on a small set of curated benchmark arrangements, the system generates a large distribution of claw-compatible manipulation scenes with configurable clutter, object affordances, and task specifications. Papers in this line have been accumulating across 2025 and 2026 — the open-ended environment synthesis thread is one of the more active subfields in robot learning — but ClawEnvKit is notable for pairing the generator with a unified evaluation harness targeted at a specific morphology class. The paper is surfaced across five sources, including a Hugging Face Daily Papers feature, which indicates the community sees it as a useful piece of infrastructure rather than just another benchmark. For applied robotics, the practical question is whether the generated environments actually transfer: procedural environment generation has historically produced benchmarks that are easier to overfit than to generalize from, because the generator's inductive biases leak into the training distribution. ClawEnvKit's value will be determined by whether independent labs can take a policy trained on its generated scenes and show real-world transfer without additional fine-tuning on hand-curated rollouts. The broader context is that VLAs — models like OpenVLA, RT-2, and the more recent XEmbodied — need enormous training volumes, and procedural generators are the only realistic path to that scale. Work like ClawEnvKit is how the open research community keeps pace with the proprietary data advantages of labs like Physical Intelligence and Figure.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, Evals, cs.AI, cs.CL — indicating the contribution spans multiple research subfields.'}
{'source': 'Hugging Face Daily Papers', 'text': 'Featured on HF Daily Papers (8 upvotes) — an early-day community popularity signal.'}

agentbenchmark

#3

Kimi K2.6: The new leading open-weights model

Frontier LLMs 2026-04-20 Artificial Analysis

8.1

I 8.5 Im 8.0 P 7.5

Moonshot AI released Kimi K2.6, and Artificial Analysis has evaluated it as the new leading open-weights language model on their independent intelligence composite. The intelligence composite is a rolling average across reasoning, mathematics, coding, and knowledge benchmarks that Artificial Analysis re-runs in-house, so the claim is not just a number copied from the model card. Kimi K2.6 now sits above the previously leading open-weight models — the Qwen3 family, GLM-5, and DeepSeek's latest — while also undercutting the closed frontier tier from OpenAI, Anthropic, and Google Research by a smaller margin than any open-weight model before it. The release is significant for three reasons. First, Moonshot has been publishing progressively stronger long-context variants — the original K2 shipped with one of the largest open-weight context windows of 2025, and K2.6 extends the pattern. Second, the pricing on the hosted API is aggressive compared to Western closed models, which matters for agentic workloads where token economics dominate. Third, and most importantly for the trajectory of the open-weight frontier, this is a Chinese lab leading on a composite intelligence benchmark — the gap between open and closed has been steadily closing, but the sustained leadership has usually flipped between Qwen, DeepSeek, and GLM. Kimi joining that rotation means four Chinese labs are now trading the open-weight crown on a monthly cadence. For practitioners making inference-stack decisions this week, the actionable question is whether K2.6's latency, tool-use quality, and structured-output reliability match the composite benchmark number — those do not always track. Community discussion will likely center on whether the weights are released under a genuinely permissive license, on memory footprint relative to Qwen3-235B, and on whether K2.6's safety profile departs from the Chinese-lab baseline on politically sensitive prompts.

kimimoonshotopen-weightchinabenchmarkllm

#4

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Robotics 2026-04-20 HF +43 arXiv cs.ROarXiv cs.CVarXiv AgentsarXiv cs.CLHF Daily Papers

Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li +46

7.9

I 6.3 Im 6.9 P 10.0

OneVL targets a specific failure mode in current vision-language-action pipelines used for autonomous driving: chain-of-thought reasoning improves trajectory quality but imposes an autoregressive latency tax that is prohibitive for real-time deployment. Latent CoT methods attempt to compress the reasoning trace into continuous representations that can be consumed in a single forward pass, but those approaches lose the auditability and grounding of explicit reasoning. OneVL proposes a one-step latent reasoning architecture that preserves a visually grounded explanation without running a sequential reasoning decoder at inference. The paper reports that one-step latent reasoning achieves parity with multi-step CoT on trajectory prediction benchmarks while eliminating the autoregressive cost, and retains an explanation surface that downstream verification can consume. The paper's significance is split between the research community and the operational robotics community. For researchers, it is an empirical contribution to the ongoing debate over whether explicit CoT is necessary for planning quality or whether the latent variant suffices when the representation is trained end-to-end — OneVL argues for the latter under the right training setup. For the robotics industry, the practical question is whether one-step latent reasoning is robust to distribution shift: the concern with compressing reasoning into a single forward step is that long-tail scenes — construction zones, unusual traffic patterns, rare pedestrian behaviors — may not admit a single-shot answer even when the latent representation is high-capacity. The paper is surfaced across four arXiv feeds and featured on Hugging Face Daily Papers with 43 upvotes on release day, which is a strong signal of community interest. Replication and deployment results from second labs will determine whether OneVL becomes the default shape of autonomous-driving VLAs or remains a strong but narrow result.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, cs.CL, cs.CV, cs.RO — indicating the contribution spans multiple research subfields.'}
{'source': 'Hugging Face Daily Papers', 'text': 'Featured on HF Daily Papers (43 upvotes) — an early-day community popularity signal.'}

vlaagentreasoningbenchmark

#5

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

Frontier LLMs 2026-04-20 arXiv cs.LGarXiv EvalsarXiv RL

Zhenwen Liang, Yujun Zhou, Sidi Lu, Xiangliang Zhang +2

7.8

I 7.6 Im 7.4 P 7.8

This paper identifies a specific pathology in reinforcement-learning-from-verifiable-rewards training — what the authors call saturation — where base models that are already near-perfect on the training problems produce gradients too small to meaningfully update the policy, but the training loop continues to run because the loss curve still shows marginal improvement. The paper shows that this produces models that score well on in-distribution problems but generalize worse than their pre-RL base on harder held-out sets. The mechanism: when rollouts are almost uniformly correct, the advantage estimate collapses and the resulting update is dominated by noise and by regularization drift away from the base. The authors propose a curriculum scheme that detects saturation and escalates problem difficulty, plus a modified advantage normalizer that avoids the collapse. The empirical contribution is a set of training curves showing that the naive RLVR setup plateaus and then regresses on held-out math, while the saturation-aware variant continues improving. This is an operationally important finding. The RLVR training stack has become the default post-training strategy for math and coding in the open-weight community — OlmoRL, GRPO, and the various GSPO-style variants all rely on verifier-driven rewards — and saturation has been an unspoken problem that most public recipes simply ignore by stopping training early. If the proposed diagnostic reliably detects saturation, labs can shift their compute from over-training saturated tracks to seeding the curriculum with new data, which is how lab-scale RLVR budgets get efficient. The paper appears across three arXiv feeds and is likely to be cited as motivation for the next generation of RLVR schedulers.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Evals, RL, cs.LG — indicating the contribution spans multiple research subfields.'}

llmreasoningbenchmarkpolicy

#6

Using large language models for embodied planning introduces systematic safety risks

Robotics 2026-04-20 arXiv cs.ROarXiv cs.AIarXiv cs.LG

Tao Zhang, Kaixian Qu, Zhibin Li, Jiajun Wu +3

7.7

I 7.2 Im 7.6 P 7.9

This paper introduces DESPITE, a 12,279-task benchmark for evaluating whether LLMs produce safe plans when used as planners for robotic systems. The benchmark spans both physical dangers — actions that would damage the robot or environment — and normative dangers — actions that would violate human expectations even if physically safe — and uses fully deterministic validation, which is unusual for planning benchmarks and makes the results reproducible across labs. The headline finding is systematic: across a range of frontier models, LLM planners produce plans that violate physical or normative safety constraints at rates that are non-trivially high, and the failure modes cluster in ways that suggest systematic blind spots rather than isolated errors. The importance is practical. The VLA and LLM-planner paradigm for robotic autonomy has been the dominant open-research direction for more than a year, with PI's π series, OpenVLA, and numerous academic efforts. Safety has been discussed as a caveat, but the field has lacked a standardized, deterministic benchmark that labs can run as a gate before deploying planning LLMs in hardware. DESPITE is positioned to be that benchmark. The paper's claim that safety failures are systematic, not stochastic, is the load-bearing finding — if it replicates, it means that scaling alone will not close the safety gap and that targeted safety training is required for the planner layer specifically, separate from the general-purpose model's safety training. Expect downstream papers to use DESPITE as the safety target for new planning LMs within the next few months.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.AI, cs.LG, cs.RO — indicating the contribution spans multiple research subfields.'}

reasoningbenchmarksafety

#7

ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification

Multimodal 2026-04-20 arXiv cs.CVarXiv cs.AIarXiv cs.LGarXiv Efficiency

Florian Kittler, Sheethal Bhat, Andreas Maier

7.7

I 6.3 Im 7.4 P 9.0

ProtoCLIP addresses robustness in zero-shot medical image classification — specifically chest X-ray classification — by adding a prototype-aligned latent refinement step on top of standard CLIP-style contrastive embeddings. The paper shows meaningful improvements on robustness benchmarks that probe covariate shift, and adapts the prototype alignment to the low-label regime typical of real medical imaging deployments. The method is lightweight and compatible with frozen CLIP backbones, which matters for hospital deployments where fine-tuning the foundation model is impractical. The broader context is that medical imaging has become one of the more reliable application areas for vision-language foundation models, but the gap between benchmark performance and real-world robustness has been persistent — models that excel on curated radiology benchmarks often fail under the real distribution of scanner hardware, patient demographics, and acquisition protocols. ProtoCLIP's prototype alignment is part of a broader thread of methods that try to close that gap without retraining, and its reported robustness improvements make it a plausible component in production medical imaging pipelines. The paper is cross-posted across four arXiv feeds, indicating the medical imaging and efficiency subfields both flagged it.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Efficiency, cs.AI, cs.CV, cs.LG — indicating the contribution spans multiple research subfields.'}

vlmdistillationalignment

#8

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Frontier LLMs 2026-04-20 arXiv cs.AIarXiv cs.LGarXiv Evals

Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton +4

7.7

I 8.5 Im 6.3 P 7.8

MathNet is a global multimodal benchmark for mathematical reasoning and retrieval — meaning it includes problems that require combining visual mathematical content (diagrams, plots, handwritten notation) with retrieval over a knowledge base of mathematical facts, theorems, and solved problems. The benchmark spans multiple languages and mathematical traditions, which is a meaningful gap in the existing math benchmark suite: MATH, GSM8K, and MMLU-Math are all predominantly English-language and biased toward US high-school curricula. Mathnet's multilingual coverage is a corrective, and the retrieval component makes it a more realistic proxy for how modern reasoning agents use math — they typically retrieve before reasoning rather than reasoning purely from parametric memory. The paper reports baseline performance for leading frontier models and finds the expected pattern — closed models lead, but the gap narrows on problems where retrieval can substitute for parametric knowledge — and also a notable gap between English-language performance and non-English-language performance even on problems whose mathematical content is language-invariant. This is an argument that tokenizer biases and training data biases persist in the reasoning trace even when the underlying mathematics does not. The benchmark will likely see fast uptake by labs publishing reasoning-focused models, and the retrieval axis in particular differentiates it from the existing MATH-style suite.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Evals, cs.AI, cs.LG — indicating the contribution spans multiple research subfields.'}

ragreasoningbenchmarkmultimodal

#9

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

Robotics 2026-04-20 arXiv cs.ROarXiv cs.CVarXiv Robotic AutonomyarXiv Efficiency

Kangan Qian, ChuChu Xie, Yang Zhong, Jingrui Pang +12

7.6

I 6.3 Im 7.1 P 9.0

XEmbodied is a vision-language-action foundation model for large-scale embodied environments that explicitly encodes geometric and physical cues — rather than relying on a generic vision encoder to infer them — when building annotations for VLA training. The motivation is that current cloud-scale VLA annotation pipelines use off-the-shelf vision-language models that lack geometric reasoning and domain semantics, producing annotations that leak errors into downstream policy learning. XEmbodied's architecture bakes in depth, surface normals, and object affordance channels alongside the language-grounded visual features. The result is a foundation model that can produce higher-quality automated annotations for VLA training, which is the bottleneck for scaling open-weight robotic policy learning. The paper is surfaced across four arXiv feeds including the dedicated Robotic Autonomy topical query, indicating the robotics community immediately flagged it. The practical implication is for labs that are building automated labeling pipelines for robotic foundation models: Physical Intelligence's π0.7 release last week demonstrated the power of scale in steerable VLAs, and the open community's ability to keep pace depends on whether procedurally annotated data can substitute for the human-annotated manipulation datasets that companies like PI and Figure have assembled. XEmbodied is an argument that a geometry-aware annotator model is a necessary substrate. Open questions include whether the geometric channels add real information beyond what a sufficiently large generic VLM would infer, and whether the annotation improvements translate to downstream policy quality or just to cleaner metadata.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Efficiency, Robotic Autonomy, cs.CV, cs.RO — indicating the contribution spans multiple research subfields.'}

vlmvlareasoningbenchmark

#10

FUSE: Ensembling Verifiers with Zero Labeled Data

Frontier LLMs 2026-04-20 arXiv cs.LGarXiv EvalsarXiv cs.CLarXiv stat.ML

Joonhyuk Lee, Virginia Ma, Sarah Zhao, Yash Nair +3

7.6

I 6.3 Im 7.1 P 9.0

FUSE is a method for ensembling verifiers with zero labeled data — an important result because the standard recipe for training a verifier assumes access to labels indicating which candidate solutions are correct. When labels are unavailable, labs typically fall back to a single LLM-as-a-judge, which is known to be biased, especially on problems where the generator and judge share training data or architecture. FUSE shows that ensembling multiple weak verifiers produces a signal that closely approximates the labeled-verifier quality, without any labels. The method is straightforward enough to deploy on top of existing RLVR pipelines. Why this matters: labeled verification is the bottleneck for RLVR at scale. Coding tasks have an obvious verifier — run the code — but most reasoning, writing, and multi-turn agentic tasks do not. The dominant fallback has been to distill labels from frontier closed models, which raises data-licensing and distribution-drift concerns. A method that produces usable verifier ensembles without labels would materially expand the set of tasks that RLVR can target, which in turn expands the set of capabilities that can be trained via RL rather than SFT. The paper is cross-listed across stat.ML, cs.CL, Evals, and cs.LG, which suggests multiple subcommunities recognize it. Replication in external labs, especially on agentic tasks where verifier quality is the dominant failure mode, will determine whether FUSE becomes a standard component of post-training stacks.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Evals, cs.CL, cs.LG, stat.ML — indicating the contribution spans multiple research subfields.'}

llmbenchmark

#11

Sessa: Selective State Space Attention

State space 2026-04-20 arXiv cs.AIarXiv cs.LGarXiv EvalsarXiv cs.CLarXiv SSM

Liubomyr Horbatko

7.6

I 6.3 Im 6.6 P 9.4

Sessa — Selective State Space Attention — proposes a hybrid architecture that combines selective state-space blocks (the Mamba lineage) with attention in a deliberately-chosen interleaving, rather than the usual quadratic-attention-everywhere pattern. The paper's contribution is in the specific choice of which layers use SSM blocks and which use attention, and in training innovations that let the hybrid converge stably at scale. Benchmarks show competitive quality with same-parameter transformers while retaining the linear-scaling inference profile of SSM blocks on long contexts. The broader context is that the SSM-versus-transformer debate has largely settled into a pragmatic consensus that hybrid architectures dominate pure SSMs on standard reasoning benchmarks, while pure transformers dominate hybrids on short-context quality. The remaining question is which interleaving pattern produces the best quality-per-flop, and Sessa is another data point in that search. Given that Olmo Hybrid (from AI2) was released in March and that commercial labs like Mistral, Together, and Cartesia have been shipping hybrid architectures throughout 2025 and 2026, the engineering community is consolidating around hybrids as the default for new foundation models. Sessa's specific recipe, if replicated, will add to the menu of known-good interleaving patterns.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Evals, SSM, cs.AI, cs.CL, cs.LG — indicating the contribution spans multiple research subfields.'}

mambatransformerbenchmark

#12

GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling

Frontier LLMs 2026-04-20 arXiv cs.LGarXiv cs.CLarXiv Efficiency

Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan, Michael Helcig +2

7.6

I 6.9 Im 7.1 P 8.1

GSQ — Gumbel-Softmax Quantization — is a new scalar quantization method for LLM weights that reportedly achieves high accuracy at aggressive bit-widths through Gumbel-Softmax sampling during the quantization search. The claim is that the method recovers near-FP16 quality at low-bit integer formats, competing with state-of-the-art mixed-precision schemes like AQLM and QuIP-based approaches, while being simpler to deploy because it only requires scalar codebooks rather than vector quantization tables. The paper reports benchmark recovery numbers on standard perplexity and downstream QA tests for representative open-weight models. The significance is in the operational simplicity: every serving stack already supports scalar quantization at the hardware level, whereas vector quantization requires custom kernels. If GSQ's accuracy claims replicate, labs can potentially recover the quality of vector-quantization schemes without the deployment complexity. This is meaningful for inference economics in the 100-billion-parameter-plus open-weight regime where memory dominates cost. The paper is surfaced across three arXiv feeds including the dedicated Efficiency topical query, and will likely see quick replication by the llama.cpp and vLLM communities.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Efficiency, cs.CL, cs.LG — indicating the contribution spans multiple research subfields.'}

llmquantization

#13

Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

Evaluations & Benchmarks 2026-04-20 arXiv cs.AIarXiv cs.LGarXiv EvalsarXiv Efficiency

Terry Leitch

7.5

I 6.6 Im 6.6 P 9.0

We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmarks for System Dynamics AI assistance: the \textbf{CLD Leaderboard} (53 tests, structured causal loop diagram extraction) and the \textbf{Discussion Leaderboard} (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77--89\% overall pass rates; the best local model reaches 77\% (Kimi~K2.5~GGUF~Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50--100\% on model building steps and 47--75\% on feedback explanation, but only 0--50\% on error fixing -- a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of \textit{model type effects} on performance: we compare reasoning vs.\ instruction-tuned architectures, GGUF (llama.cpp) vs.\ MLX (mlx\_lm) backends, and quantization levels (Q3 / Q4\_K\_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx\_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep ($t$, $p$, $k$) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B--123B parameter models on Apple~Silicon.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Efficiency, Evals, cs.AI, cs.LG — indicating the contribution spans multiple research subfields.'}

llmquantizationreasoningbenchmark

#14

WorldDB: A Vector Graph-of-Worlds Memory Engine with Ontology-Aware Write-Time Reconciliation

Agents & Tool Use 2026-04-20 arXiv cs.AIarXiv AgentsarXiv cs.CL

Harish Santhanalakshmi Ganesan

7.5

I 7.9 Im 6.3 P 7.9

WorldDB is a memory engine for LLM agents that organizes agent experience as a vector graph of worlds — roughly, a hybrid between a vector store and a scene-graph — with ontology-aware write-time reconciliation to resolve conflicts between new observations and stored state. The motivation is that current agent memory is either flat (vector stores with no structure) or rigid (symbolic knowledge graphs that cannot absorb natural-language observations gracefully). WorldDB tries to get both benefits: the fluidity of vector storage for natural-language state and the structural consistency of an ontology for reasoning. The practical target is long-running agentic systems — code agents that operate on large codebases, web agents that maintain state across sessions, research agents that accumulate partial understanding of a domain — where memory is currently a dominant failure mode. Frontier agents from labs like Cognition (Devin), Cursor (Cursor 3), and the various browser-agent systems all implement ad-hoc versions of this. A paper-grade, principled abstraction that becomes a standard library would materially reduce the engineering overhead of building capable long-running agents. The paper is cross-listed on Agents, CL, and AI, and deserves attention from anyone shipping agent products.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, cs.AI, cs.CL — indicating the contribution spans multiple research subfields.'}

ragagentreasoning

#15

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Generative Media 2026-04-20 arXiv cs.CVarXiv cs.LGarXiv EvalsarXiv RLarXiv Diffusion/GenMedia

Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu +4

7.5

I 6.3 Im 6.8 P 9.1

UDM-GRPO extends Group Relative Policy Optimization to uniform discrete diffusion models — a class of generative models that operate in a discrete token space and denoise via a uniform noise schedule, used increasingly for text generation and for image generation pipelines that tokenize at a high level. The contribution is both theoretical and practical: the paper derives a stable variant of GRPO suited to the discrete diffusion setting, avoiding the gradient collapse that standard GRPO exhibits when applied naively, and shows empirically that the stabilized variant outperforms DPO-based alternatives on generative quality benchmarks. The broader arc is that RL fine-tuning of diffusion models has been a fraught subfield — the interaction between noise schedules and policy gradients is subtle and has produced multiple competing recipes with inconsistent results. A stable recipe for discrete diffusion specifically unlocks a set of practical applications: RL fine-tuning of code-generation diffusion models, RL fine-tuning of discrete-token image generators, and RL fine-tuning of text-diffusion LMs. Cross-listed on cs.CV, Evals, RL, and the Diffusion/GenMedia topical query, indicating broad subcommunity interest.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Diffusion/GenMedia, Evals, RL, cs.CV, cs.LG — indicating the contribution spans multiple research subfields.'}

diffusionbenchmarkpolicy

#16

LLM Safety From Within: Detecting Harmful Content with Internal Representations

Evaluations & Benchmarks 2026-04-20 arXiv cs.AIarXiv MechInterparXiv Evals

Difan Jiao, Yilun Liu, Ye Yuan, Zhenwei Tang +3

7.5

I 6.6 Im 7.6 P 7.8

This paper proposes using internal model representations — the hidden states extracted during inference — as the basis for harmful-content detection, rather than post-hoc text classification on model outputs. The core claim: a linear probe on mid-layer hidden states predicts harmful compliance more accurately and earlier (before generation completes) than any post-hoc output classifier, including those built on the same frontier model. The method uses contrastive representation learning to identify the harmful-direction in activation space and is efficient enough to run alongside production inference. Why this matters: the current production safety stack consists of an input moderation model, the target model's own safety training, and an output moderation model. Each adds latency and introduces false positives, and none of them are principled — they are trained classifiers whose failure modes are opaque. An activation-level probe, if it replicates, would be faster, more sample-efficient, and more interpretable because the direction can be localized to specific features. The paper is cross-listed on MechInterp and Evals, placing it at the intersection of interpretability-as-safety-tooling — the thread that Anthropic's Transformer Circuits, Apollo Research, and the broader mechanistic interpretability community have been advancing for several years. Expect downstream work testing whether the probe transfers across model families and whether it can be evaded by adversarial inputs that preserve the harmful semantics while shifting the activation representation.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Evals, MechInterp, cs.AI — indicating the contribution spans multiple research subfields.'}

llmbenchmarksafety

#17

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

Safety, Policy & Regulation 2026-04-20 arXiv cs.AIarXiv EvalsarXiv cs.CL

Md Rysul Kabir, Zoran Tiganj

7.5

I 6.6 Im 7.6 P 7.8

This paper dissects harmful-compliance failures in frontier models by tracing both the behavioral outputs and the mechanistic routes through the network. The central finding is that harmful compliance is not monolithic: different jailbreak techniques, persuasion attempts, and roleplay framings activate mechanistically distinct pathways inside the same model, even when the surface behavior — producing the harmful output — is identical. The implication for safety training is that patching one pathway does not close the others, and that current safety evaluations based on surface behavior may miss entire families of attacks. The paper is cross-listed on Evals, AI, and CL, placing it in the emerging literature that uses interpretability tools as evaluation methodology. The operational implication is that safety evaluations should include mechanistic audits alongside behavioral redteaming — just as model-quality evaluations include both benchmark accuracy and held-out distribution testing. This kind of work is foundational for the case that alignment requires interpretability: if harmful compliance pathways are mechanistically diverse, then safety training that targets only the dominant pathway will leave open the long tail. Frontier labs have been moving in this direction, but published benchmark work with this specific framing is still rare, which is why the cross-source activity on this paper is meaningful.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Evals, cs.AI, cs.CL — indicating the contribution spans multiple research subfields.'}

llmbenchmarksafetypolicy

#18

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Evaluations & Benchmarks 2026-04-20 arXiv cs.AIarXiv cs.LGarXiv AgentsarXiv RL

Justin Bauer, Thomas Walshe, Derek Pham, Harit Vishwakarma +3

7.5

I 6.2 Im 6.6 P 9.1

Learning from Less measures RLVR — reinforcement learning from verifiable rewards — in the low-data, low-compute regime that characterizes most academic labs and most industrial labs that are not frontier-scale. Prior RLVR results have largely come from very-large-compute runs on very-large-data sets, and the question of whether the method is worth the trouble for mid-scale labs has been open. The paper shows that RLVR's returns diminish substantially in low-data regimes but do not vanish, and that the compute efficiency can be improved by specific tricks — curriculum selection from the verifier-judged hardest held-out problems, aggressive advantage clipping, and smaller batch sizes than the frontier recipes default to. This is practical guidance for the open research community. Most public RLVR recipes are scaled-down adaptations of frontier-lab recipes that were never designed for the mid-scale regime. A paper that characterizes what does and does not work at practical compute budgets is the kind of operational contribution that usually comes out of applied labs rather than research labs, and its appearance on four arXiv topical queries signals that the topic has momentum. Expect the reported configurations to appear in public RL fine-tuning recipes over the next month, particularly for math and code SFT-plus-RLVR pipelines.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, RL, cs.AI, cs.LG — indicating the contribution spans multiple research subfields.'}

llmreasoning

#19

Bounded Ratio Reinforcement Learning

Robotics 2026-04-20 arXiv cs.AIarXiv cs.LGarXiv RL

Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd +4

7.4

I 6.0 Im 7.9 P 7.9

Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under RL, cs.AI, cs.LG — indicating the contribution spans multiple research subfields.'}

llmhumanoidpolicy

#20

Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering

Research 2026-04-20 arXiv cs.AIarXiv cs.LGarXiv cs.CLarXiv Efficiency

Manan Gupta, Dhruv Kumar

7.4

I 6.0 Im 6.6 P 9.2

Large language models frequently commit unrecoverable reasoning errors mid-generation: once a wrong step is taken, subsequent tokens compound the mistake rather than correct it. We introduce $\textbf{Latent Phase-Shift Rollback}$ (LPSR): at each generation step, we monitor the residual stream at a critical layer lcrit, detect abrupt directional reversals (phase shifts) via a cosine-similarity $+$ entropy dual gate, and respond by rolling back the KV-cache and injecting a pre-computed steering vector. No fine-tuning, gradient computation, or additional forward passes are required. LPSR achieves $\mathbf{44.0\%}$ on MATH-500 with an 8B model versus $28.8\%$ for standard AR ($+15.2$ pp; McNemar $χ^2 = 66.96$, $p < 10^{-15}$). Critically, prompted self-correction, the most natural inference-time baseline, scores only $19.8\%$, below standard AR; LPSR exceeds it by $+24.2$ pp ($χ^2 = 89.4$, $p \approx 0$). LPSR also outperforms Best-of-16 ($+7.8$ pp) at $5.4\times$ lower token cost, and surpasses a standard 70B model ($35.2\%$) with $8.75\times$ fewer parameters at ${\sim}3\times$ the token budget. A 32-layer sweep reveals a novel \textbf{detection-correction dissociation}: error-detection AUC peaks at layer~14 ($0.718$) but task accuracy peaks at layer~16 ($44.0\%$ vs.\ $29.2\%$), demonstrating that optimal monitoring depth differs for detection and correction.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Efficiency, cs.AI, cs.CL, cs.LG — indicating the contribution spans multiple research subfields.'}

reasoning

#21

IceBreaker for Conversational Agents: Breaking the First-Message Barrier with Personalized Starters

Agents & Tool Use 2026-04-20 arXiv cs.AIarXiv AgentsarXiv cs.CLarXiv Efficiency

Hongwei Zheng, Weiqi Wu, Zhengjia Wang, Guanyu Jiang +5

7.4

I 5.6 Im 7.1 P 9.1

Conversational agents, such as ChatGPT and Doubao, have become essential daily assistants for billions of users. To further enhance engagement, these systems are evolving from passive responders to proactive companions. However, existing efforts focus on activation within ongoing dialogues, while overlooking a key real-world bottleneck. In the conversation initiation stage, users may have a vague need but no explicit query intent, creating a first-message barrier where the conversation holds before it begins. To overcome this, we introduce Conversation Starter Generation: generating personalized starters to guide users into conversation. However, unlike in-conversation stages where immediate context guides the response, initiation must operate in a cold-start moment without explicit user intent. To pioneer in this direction, we present IceBreaker that frames human ice-breaking as a two-step handshake: (i) evoke resonance via Resonance-Aware Interest Distillation from session summaries to capture trigger interests, and (ii) stimulate interaction via Interaction-Oriented Starter Generation, optimized with personalized preference alignment and a self-reinforced loop to maximize engagement. Online A/B tests on one of the world's largest conversational agent products show that IceBreaker improves user active days by +0.184% and click-through rate by +9.425%, and has been deployed in production.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, Efficiency, cs.AI, cs.CL — indicating the contribution spans multiple research subfields.'}

agentdistillationalignment

#22

Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation

Generative Media 2026-04-20 arXiv cs.CVarXiv cs.AIarXiv cs.LGarXiv AgentsarXiv Efficiency

Tianshi Cao, Jiawei Ren, Yuxuan Zhang, Jaewoo Seo +11

7.4

I 5.0 Im 7.4 P 9.2

Closed-loop simulation is a core component of autonomous vehicle (AV) development, enabling scalable testing, training, and safety validation before real-world deployment. Neural scene reconstruction converts driving logs into interactive 3D environments for simulation, but it does not produce complete 3D object assets required for agent manipulation and large-viewpoint novel-view synthesis. To address this challenge, we present Asset Harvester, an image-to-3D model and end-to-end pipeline that converts sparse, in-the-wild object observations from real driving logs into complete, simulation-ready assets. Rather than relying on a single model component, we developed a system-level design for real-world AV data that combines large-scale curation of object-centric training tuples, geometry-aware preprocessing across heterogeneous sensors, and a robust training recipe that couples sparse-view-conditioned multiview generation with 3D Gaussian lifting. Within this system, SparseViewDiT is explicitly designed to address limited-angle views and other real-world data challenges. Together with hybrid data curation, augmentation, and self-distillation, this system enables scalable conversion of sparse AV object observations into reusable 3D assets.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, Efficiency, cs.AI, cs.CV, cs.LG — indicating the contribution spans multiple research subfields.'}

agentdistillationsafety

#23

A multimodal and temporal foundation model for virtual patient representations at healthcare system scale

Frontier LLMs 2026-04-20 arXiv cs.AIarXiv cs.LGarXiv cs.CL

Andrew Zhang, Tong Ding, Sophia J. Wagner, Caiwei Tian +7

7.4

I 6.6 Im 6.8 P 8.1

Modern medicine generates vast multimodal data across siloed systems, yet no existing model integrates the full breadth and temporal depth of the clinical record into a unified patient representation. We introduce Apollo, a multimodal temporal foundation model trained and evaluated on over three decades of longitudinal hospital records from a major US hospital system, composed of 25 billion records from 7.2 million patients, representing 28 distinct medical modalities and 12 major medical specialties. Apollo learns a unified representation space integrating over 100 thousand unique medical events in our clinical vocabulary as well as images and clinical text. This "atlas of medical concepts" forms a computational substrate for modeling entire patient care journeys comprised of sequences of structured and unstructured events, which are compressed by Apollo into virtual patient representations. To assess the potential of these whole-patient representations, we created 322 prognosis and retrieval tasks from a held-out test set of 1.4 million patients. We demonstrate the generalized clinical forecasting potential of Apollo embeddings, including predicting new disease onset risk up to five years in advance (95 tasks), disease progression (78 tasks), treatment response (59 tasks), risk of treatment-related adverse events (17 tasks), and hospital operations endpoints (12 tasks). Using feature attribution techniques, we show that model predictions align with clinically-interpretable multimodal biomarkers. We evaluate semantic similarity search on 61 retrieval tasks, and moreover demonstrate the potential of Apollo as a multimodal medical search engine using text and image queries. Together, these modeling capabilities establish the foundation for computable medicine, where the full context of patient care becomes accessible to computational reasoning.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.AI, cs.CL, cs.LG — indicating the contribution spans multiple research subfields.'}

dporeasoningmultimodal

#24

Training and Agentic Inference Strategies for LLM-based Manim Animation Generation

Frontier LLMs 2026-04-20 arXiv cs.AIarXiv Agents

Ravidu Suien Rammuni Silva, Ahmad Lotfi, Isibor Kennedy Ihianle, Golnaz Shahtahmassebi +1

7.4

I 8.2 Im 6.8 P 6.6

Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is lacking in current research. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinforcement Learning (RL) based Group Relative Policy Optimisation (GRPO) using a unified reward signal that fuses code and visual assessment signals, and ManimAgent, an inference pipeline featuring Renderer-in-the-loop (RITL) and API documentation-augmented RITL (RITL-DOC) strategies. Using these techniques, this study presents the first unified training and inference study for text-to-code-to-video transformation with Manim. It evaluates 17 open-source sub-30B LLMs across nine combinations of training and inference strategies using ManimBench. Results show that SFT generally improves code quality, while GRPO enhances visual outputs and increases the models' responsiveness to extrinsic signals during self-correction at inference time. The Qwen 3 Coder 30B model with GRPO and RITL-DOC achieved the highest overall performance, with a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS) to reference videos, surpassing the baseline GPT-4.1 model by +3 percentage points in VS. Additionally, the analysis shows that the correlation between code and visual metrics strengthens with SFT and GRPO but weakens with inference-time enhancements, highlighting the complementary roles of training and agentic inference strategies in Manim animation generation.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, cs.AI — indicating the contribution spans multiple research subfields.'}

agentllmreasoningpolicyvideo

#25

Qwen3.6 Max Preview (Alibaba) enters Artificial Analysis leaderboard

Frontier LLMs 2026-04-19 Artificial Analysis

7.4

I 7.5 Im 7.5 P 7.0

Alibaba's Qwen3.6 Max Preview — successor to the public Qwen3 family — added to Artificial Analysis's model intelligence index. The preview tier positions Alibaba's hosted offering against closed frontier models on standard composite reasoning, math, coding, and knowledge tasks.

qwenalibabachinafrontierbenchmark

#26

Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning

Frontier LLMs 2026-04-20 arXiv cs.LGarXiv RLarXiv cs.CLarXiv stat.ML

Hen Davidov, Nachshon Cohen, Oren Kalinsky, Yaron Fairstein +3

7.3

I 6.0 Im 6.3 P 9.1

Large language models (LLMs) using chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods decide to withhold outputs before or after generation, dynamic mid-generation abstention considers early termination of unpromising reasoning traces at each token position. Prior work has explored empirical variants of this idea, but principled guidance for the abstention rule remains lacking. We present a formal analysis of dynamic abstention for LLMs, modeling abstention as an explicit action within a regularized reinforcement learning framework. An abstention reward parameter controls the trade-off between compute and information. We show that abstaining when the value function falls below this reward strictly outperforms natural baselines under general conditions. We further derive a principled and efficient method to approximate the value function. Empirical results on mathematical reasoning and toxicity avoidance tasks support our theory and demonstrate improved selective accuracy over existing methods.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under RL, cs.CL, cs.LG, stat.ML — indicating the contribution spans multiple research subfields.'}

llmreasoning

#27

IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem

Evaluations & Benchmarks 2026-04-20 arXiv cs.AIarXiv cs.LGarXiv Evals

Aniruddha Adiga, Jingyuan Chou, Anshul Chiranth, Bryan Lewis +6

7.3

I 6.9 Im 6.6 P 7.8

Epidemic forecasting has become an integral part of real-time infectious disease outbreak response. While collaborative ensembles composed of statistical and machine learning models have become the norm for real-time forecasting, standardized benchmark datasets for evaluating such methods are lacking. Further, there is limited understanding on performance of these methods for novel outbreaks with limited historical data. In this paper, we propose IDOBE, a curated collection of epidemiological time series focused on outbreak forecasting. IDOBE compiles from multiple data repositories spanning over a century of surveillance and across U.S. states and global locations. We perform derivative-based segmentation to generate over 10,000 outbreaks covering multiple outcomes such as cases and hospitalizations for 13 diseases. We consider a variety of information-theoretic and distributional measures to quantify the epidemiological diversity of the dataset. Finally, we perform multi-horizon short-term forecasting (1- to 4-week-ahead) through the progression of the outbreak using 11 baseline models and report on their performance. In addition to standard metrics such as NMSE and MAPE for point forecasts, we include probabilistic scoring rules such as Normalized Weighted Interval Score (NWIS) to quantify the performance. We find that MLP-based methods have the most robust performance, with statistical methods having a slight edge during the pre-peak phase. IDOBE dataset along with baselines are released publicly on https://github.com/NSSAC/IDOBE to enable standardized, reproducible benchmarking of outbreak forecasting methods.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Evals, cs.AI, cs.LG — indicating the contribution spans multiple research subfields.'}

benchmark

#28

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

Research 2026-04-20 HF +42 HF Daily Papers

7.3

I 6.5 Im 6.5 P 8.4

Featured on Hugging Face Daily Papers (Apr 20, 2026) with 42 upvotes. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

hf-daily

#29

ConforNets: Latents-Based Conformational Control in OpenFold3

AI for Science 2026-04-20 arXiv cs.LGarXiv EvalsarXiv AI-for-SciencearXiv Efficiency

Minji Lee, Colin Kalicki, Minkyu Jeon, Aymen Qabel +2

7.2

I 6.3 Im 6.0 P 8.9

Models from the AlphaFold (AF) family reliably predict one dominant conformation for most well-ordered proteins but struggle to capture biologically relevant alternate states. Several efforts have focused on eliciting greater conformational variability through ad hoc inference-time perturbations of AF models or their inputs. Despite their progress, these approaches remain inefficient and fail to consistently recover major conformational modes. Here, we investigate both the optimal location and manner-of-operation for perturbing latent representations in the AF3 architecture. We distill our findings in ConforNets: channel-wise affine transforms of the pre-Pairformer pair latents. Unlike previous methods, ConforNets globally modulate AF3 representations, making them reusable across proteins. On unsupervised generation of alternate states, ConforNets achieve state-of-the-art success rates on all existing multi-state benchmarks. On the novel supervised task of conformational transfer, ConforNets trained on one source protein can induce a conserved conformational change across a protein family. Collectively, these results introduce a mechanism for conformational control in AF3-based models.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under AI-for-Science, Efficiency, Evals, cs.LG — indicating the contribution spans multiple research subfields.'}

benchmark

#30

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

Agents & Tool Use 2026-04-20 arXiv AgentsarXiv cs.CLarXiv Post-training

Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang +3

7.2

I 5.3 Im 7.9 P 7.9

General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable context. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is becoming increasingly inadequate for capturing real LLM agent behavior. In this paper, we present StepPO as a position on step-level Agentic RL. We argue that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. We then propose step-level credit assignment as the natural optimization counterpart of this formulation, thereby aligning policy optimization and reward propagation with the granularity of agent decisions. Finally, we discuss the key systems designs required to realize step-level Agentic RL in practice and preliminary experiments provide initial evidence for the effectiveness of this perspective. We hope that the step-aligned, step-level paradigm embodied in StepPO offers the Agentic RL community a useful lens for understanding agent behavior and helps advance LLMs toward stronger general-agent capabilities.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, Post-training, cs.CL — indicating the contribution spans multiple research subfields.'}

rlhfagentllmreasoningalignmentpolicy

#31

MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Generative Media 2026-04-20 HF +27 arXiv cs.CVarXiv AgentsarXiv Diffusion/GenMediaHF Daily Papers

Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu

7.2

I 6.3 Im 6.0 P 8.7

Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present \textbf{MultiWorld}, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, Diffusion/GenMedia, cs.CV — indicating the contribution spans multiple research subfields.'}
{'source': 'Hugging Face Daily Papers', 'text': 'Featured on HF Daily Papers (27 upvotes) — an early-day community popularity signal.'}

agentvideo

#32

CSET: 'Operationalizing AI Guidance' reference guide for translating high-level AI principles into practical implementation

Safety, Policy & Regulation 2026-04-20 CSET

Kyle Crichton, Abhiram Reddy, Jessica Ji

7.2

I 6.5 Im 8.0 P 6.5

CSET publishes a practical reference guide bridging high-level AI governance principles to implementation. Draws on 1200+ resources and is aimed at organizations — including federal agencies — translating NIST/OSTP/EO-level guidance into adoptable controls and processes. Relevant to Pentagon/CDAO/DIU procurement scaffolding.

csetpolicygovernancenistdefense

#33

SynAgent: Generalizable Cooperative Humanoid Manipulation via Solo-to-Cooperative Agent Synergy

Robotics 2026-04-20 arXiv cs.CVarXiv AgentsarXiv Efficiency

Wei Yao, Haohan Ma, Hongwen Zhang, Yunlian Sun +5

7.1

I 6.3 Im 6.8 P 7.7

Controllable cooperative humanoid manipulation is a fundamental yet challenging problem for embodied intelligence, due to severe data scarcity, complexities in multi-agent coordination, and limited generalization across objects. In this paper, we present SynAgent, a unified framework that enables scalable and physically plausible cooperative manipulation by leveraging Solo-to-Cooperative Agent Synergy to transfer skills from single-agent human-object interaction to multi-agent human-object-human scenarios. To maintain semantic integrity during motion transfer, we introduce an interaction-preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization, which faithfully maintains spatial relationships among humans and objects. Building upon this refined data, we propose a single-agent pretraining and adaptation paradigm that bootstraps synergistic collaborative behaviors from abundant single-human data through decentralized training and multi-agent PPO. Finally, we develop a trajectory-conditioned generative policy using a conditional VAE, trained via multi-teacher distillation from motion imitation priors to achieve stable and controllable object-level trajectory execution. Extensive experiments demonstrate that SynAgent significantly outperforms existing baselines in both cooperative imitation and trajectory-conditioned control, while generalizing across diverse object geometries. Codes and data will be available after publication. Project Page: http://yw0208.github.io/synagent

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, Efficiency, cs.CV — indicating the contribution spans multiple research subfields.'}

ragagentdistillationhumanoidpolicy

#34

When Can LLMs Learn to Reason with Weak Supervision?

Reinforcement Learning 2026-04-20 HF +9 arXiv cs.AIarXiv cs.LGarXiv RLHF Daily Papers

Salman Rahman, Jingyan Shen, Anna Mordvina, Hamid Palangi +2

7.1

I 5.0 Im 6.9 P 8.9

Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under RL, cs.AI, cs.LG — indicating the contribution spans multiple research subfields.'}
{'source': 'Hugging Face Daily Papers', 'text': 'Featured on HF Daily Papers (9 upvotes) — an early-day community popularity signal.'}

llmreasoning

#35

EasyVideoR1: Easier RL for Video Understanding

Research 2026-04-20 HF +20 HF Daily Papers

7.1

I 6.5 Im 6.5 P 7.7

Featured on Hugging Face Daily Papers (Apr 20, 2026) with 20 upvotes. EasyVideoR1: Easier RL for Video Understanding

hf-daily

#36

Document-as-Image Representations Fall Short for Scientific Retrieval

Multimodal 2026-04-20 arXiv cs.AIarXiv EvalsarXiv cs.CL

Ghazal Khalighinejad, Raghuveer Thirukovalluru, Alexander H. Oh, Bhuwan Dhingra

7.0

I 6.3 Im 6.6 P 7.8

Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring such representations. In this work, we argue that this paradigm is not well-suited for text-rich multimodal scientific documents, where critical evidence is distributed across structured sources, including text, tables, and figures. To study this setting, we introduce ArXivDoc, a new benchmark constructed from the underlying LaTeX sources of scientific papers. Unlike PDF or image-based representations, LaTeX provides direct access to structured elements (e.g., sections, tables, figures, equations), enabling controlled query construction grounded in specific evidence types. We systematically compare text-only, image-based, and multimodal representations across both single-vector and multi-vector retrieval models. Our results show that: (1) document-as-image representations are consistently suboptimal, especially as document length increases; (2) text-based representations are most effective, even for figure-based queries, by leveraging captions and surrounding context; and (3) interleaved text+image representations outperform document-as-image approaches without requiring specialized training.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Evals, cs.AI, cs.CL — indicating the contribution spans multiple research subfields.'}

ragbenchmarkmultimodal

#37

An Integrated Deep-Learning Framework for Peptide-Protein Interaction Prediction and Target-Conditioned Peptide Generation with ConGA-PePPI and TC-PepGen

AI for Science 2026-04-20 arXiv cs.AIarXiv cs.LGarXiv AI-for-Science

Chupei Tang, Junxiao Kong, Moyu Tang, Di Wang +4

7.0

I 6.6 Im 6.3 P 7.8

Motivation: Peptide-protein interactions (PepPIs) are central to cellular regulation and peptide therapeutics, but experimental characterization remains too slow for large-scale screening. Existing methods usually emphasize either interaction prediction or peptide generation, leaving candidate prioritization, residue-level interpretation, and target-conditioned expansion insufficiently integrated. Results: We present an integrated framework for early-stage peptide screening that combines a partner-aware prediction and localization model (ConGA-PepPI) with a target-conditioned generative model (TC-PepGen). ConGA-PepPI uses asymmetric encoding, bidirectional cross-attention, and progressive transfer from pair prediction to binding-site localization, while TC-PepGen preserves target information throughout autoregressive decoding via layerwise conditioning. In five-fold cross-validation, ConGA-PepPI achieved 0.839 accuracy and 0.921 AUROC, with binding-site AUPR values of 0.601 on the protein side and 0.950 on the peptide side, and remained competitive on external benchmarks. Under a controlled length-conditioned benchmark, 40.39% of TC-PepGen peptides exceeded native templates in AlphaFold 3 ipTM, and unconstrained generation retained evidence of target-conditioned signal.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under AI-for-Science, cs.AI, cs.LG — indicating the contribution spans multiple research subfields.'}

benchmark

#38

PRISMA: Preference-Reinforced Self-Training Approach for Interpretable Emotionally Intelligent Negotiation Dialogues

Agents & Tool Use 2026-04-20 arXiv AgentsarXiv cs.CLarXiv Post-training

Prajwal Vijay Kajare, Priyanshu Priya, Bikash Santra, Asif Ekbal

7.0

I 5.3 Im 7.4 P 7.9

Emotion plays a pivotal role in shaping negotiation outcomes, influencing trust, cooperation, and long-term relationships. Developing negotiation dialog systems that can recognize and respond strategically to emotions is, therefore, essential to create more effective human-centered interactions. Beyond generating emotionally appropriate responses, interpretability - understanding how a system generates a particular emotion-aware response, is critical for fostering reliability and building rapport. Driven by these aspects, in this work, we introduce PRISMA, an interpretable emotionally intelligent negotiation dialogue system targeting two application domains, viz. job interviews and resource allocation. To enable interpretability, we propose an Emotion-aware Negotiation Strategy-informed Chain-of-Thought (ENS-CoT) reasoning mechanism, which mimics human negotiation by perceiving, understanding, using, and managing emotions. Leveraging ENS-CoT, we curate two new datasets: JobNego (for job interview negotiation) and ResNego (for resource allocation negotiation). We then leverage these datasets to develop PRISMA by augmenting self-training with Direct Preference Optimization (DPO), guiding agents toward more accurate, interpretable, and emotionally appropriate negotiation responses. Automatic and human evaluation on JobNego and ResNego datasets demonstrate that PRISMA substantially enhances interpretability and generates appropriate emotion-aware responses, while improving overall negotiation effectiveness.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, Post-training, cs.CL — indicating the contribution spans multiple research subfields.'}

dporagagentreasoning

#39

MetaCloak-JPEG: JPEG-Robust Adversarial Perturbation for Preventing Unauthorized DreamBooth-Based Deepfake Generation

Generative Media 2026-04-20 arXiv cs.CVarXiv EfficiencyarXiv Diffusion/GenMedia

Tanjim Rahaman Fardin, S M Zunaid Alam, Mahadi Hasan Fahim, Md Faysal Mahfuz

7.0

I 6.9 Im 6.0 P 7.7

The rapid progress of subject-driven text-to-image synthesis, and in particular DreamBooth, has enabled a consent-free deepfake pipeline: an adversary needs only 4-8 publicly available face images to fine-tune a personalized diffusion model and produce photorealistic harmful content. Current adversarial face-protection systems -- PhotoGuard, Anti-DreamBooth, and MetaCloak -- perturb user images to disrupt surrogate fine-tuning, but all share a structural blindness: none backpropagates gradients through the JPEG compression pipeline that every major social-media platform applies before adversary access. Because JPEG quantization relies on round(), whose derivative is zero almost everywhere, adversarial energy concentrates in high-frequency DCT bands that JPEG discards, eliminating 60-80% of the protective signal. We introduce MetaCloak-JPEG, which closes this gap by inserting a Differentiable JPEG (DiffJPEG) layer built on the Straight-Through Estimator (STE): the forward pass applies standard JPEG compression, while the backward pass replaces round() with the identity. DiffJPEG is embedded in a JPEG-aware EOT distribution (~70% of augmentations include DiffJPEG) and a curriculum quality-factor schedule (QF: 95 to 50) inside a bilevel meta-learning loop. Under an l-inf perturbation budget of eps=8/255, MetaCloak-JPEG attains 32.7 dB PSNR, a 91.3% JPEG survival rate, and outperforms PhotoGuard on all 9 evaluated JPEG quality factors (9/9 wins, mean denoising-loss gain +0.125) within a 4.1 GB training-memory budget.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Diffusion/GenMedia, Efficiency, cs.CV — indicating the contribution spans multiple research subfields.'}

diffusionquantization

#40

One-Step Diffusion with Inverse Residual Fields for Unsupervised Industrial Anomaly Detection

Generative Media 2026-04-20 arXiv cs.CVarXiv EfficiencyarXiv Diffusion/GenMedia

Boan Zhang, Wen Li, Guanhua Yu, Xiyang Liu +2

7.0

I 6.9 Im 6.0 P 7.7

Diffusion models have achieved outstanding performance in unsupervised industrial anomaly detection (uIAD) by learning a manifold of normal data under the common assumption that off-manifold anomalies are harder to generate, resulting in larger reconstruction errors in data space or lower probability densities in the tractable latent space. However, their iterative denoising and noising nature leads to slow inference. In this paper, we propose OSD-IRF, a novel one-step diffusion with inverse residual fields, to address this limitation for uIAD task. We first train a deep diffusion probabilistic model (DDPM) on normal data without any conditioning. Then, for a test sample, we predict its inverse residual fields (IRF) based on the noise estimated by the well-trained parametric noise function of the DDPM. Finally, uIAD is performed by evaluating the probability density of the IRF under a Gaussian distribution and comparing it with a threshold. Our key observation is that anomalies become distinguishable in this IRF space, a finding that has seldom been reported in prior works. Moreover, OSD-IRF requires only single step diffusion for uIAD, thanks to the property that IRF holds for any neighboring time step in the denoising process. Extensive experiments on three widely used uIAD benchmarks show that our model achieves SOTA or competitive performance across six metrics, along with roughly a 2X inference speedup without distillation.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Diffusion/GenMedia, Efficiency, cs.CV — indicating the contribution spans multiple research subfields.'}

diffusiondistillationbenchmark

#41

Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

Safety, Policy & Regulation 2026-04-20 arXiv cs.AIarXiv cs.CL

Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Piercosma Bisconti +4

7.0

I 6.6 Im 7.0 P 6.8

The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.AI, cs.CL — indicating the contribution spans multiple research subfields.'}

benchmarksafetyasr

#42

Randomly Initialized Networks Can Learn from Peer-to-Peer Consensus

Efficiency 2026-04-20 arXiv cs.AIarXiv cs.LGarXiv Efficiency

Esteban Rodríguez-Betancourt, Edgar Casasola-Murillo

7.0

I 6.0 Im 6.6 P 7.9

In self-supervised learning, self-distilled methods have shown impressive performance, learning representations useful for downstream tasks and even displaying emergent properties. However, state-of-the-art methods usually rely on ensembles of complex mechanisms, with many design choices that are empirically motivated and not well understood. In this work, we explore the role of self-distillation within learning dynamics. Specifically, we isolate the effect of self-distillation by training a group of randomly initialized networks, removing all other common components such as projectors, predictors, and even pretext tasks. Our findings show that even this minimal setup can lead to learned representations with non-trivial improvements over a random baseline on downstream tasks. We also demonstrate how this effect varies with different hyperparameters and present a short analysis of what is being learned by the models under this setup.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Efficiency, cs.AI, cs.LG — indicating the contribution spans multiple research subfields.'}

distillation

#43

MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation

Agents & Tool Use 2026-04-20 arXiv AgentsarXiv EvalsarXiv cs.CL

Xingchen Xiao, Heyan Huang, Runheng Liu, Jincheng Xie

7.0

I 6.3 Im 6.3 P 7.8

Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose \textbf{MASS-RAG}, a multi-agent synthesis approach to retrieval-augmented generation that structures evidence processing into multiple role-specialized agents. MASS-RAG applies distinct agents for evidence summarization, evidence extraction, and reasoning over retrieved documents, and combines their outputs through a dedicated synthesis stage to produce the final answer. This design exposes multiple intermediate evidence views, allowing the model to compare and integrate complementary information before answer generation. Experiments on four benchmarks show that MASS-RAG consistently improves performance over strong RAG baselines, particularly in settings where relevant evidence is distributed across retrieved contexts.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, Evals, cs.CL — indicating the contribution spans multiple research subfields.'}

ragagentllmreasoningbenchmark

#44

WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

Research 2026-04-20 HF +11 HF Daily Papers

7.0

I 6.5 Im 6.5 P 7.4

Featured on Hugging Face Daily Papers (Apr 20, 2026) with 11 upvotes. WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

hf-daily

#45

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Research 2026-04-20 HF +10 HF Daily Papers

7.0

I 6.5 Im 6.5 P 7.3

Featured on Hugging Face Daily Papers (Apr 20, 2026) with 10 upvotes. SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

hf-daily

#46

S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

Multimodal 2026-04-20 arXiv cs.CVarXiv EvalsarXiv Post-training

Nitish Shukla, Surgan Jandial, Arun Ross

6.9

I 6.3 Im 6.5 P 7.5

Vision-Language Models (VLMs) have demonstrated remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices (``Look at Image 3 and...''), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels requiring an increasing level of capabilities: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention heuristics, to generate preference pairs, our approach leverages prompt-driven complexity to create chosen/rejected pairs that are applicable across different models. Through extensive evaluations on LLaVA and Qwen-VL models, we show that our diverse multi-image reasoning data significantly enhances multi-image reasoning performance, yielding significant improvements over baseline methods across benchmarks. Importantly, our approach maintains strong single-image reasoning performance while simultaneously strengthening multi-image understanding capabilities, thus advancing the state of the art for holistic visual preference alignment.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Evals, Post-training, cs.CV — indicating the contribution spans multiple research subfields.'}

vlmdporagreasoningbenchmarkalignment

#47

Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

Agents & Tool Use 2026-04-20 arXiv cs.AIarXiv AgentsarXiv Evals

Kevin Murphy

6.9

I 7.0 Im 5.7 P 7.7

We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing context. (2) Hierarchical multi-trial aggregation: running $K$ independent trials and combining them using logit-space shrinkage with a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates. On 400 backtesting questions from the ForecastBench leaderboard, BLF outperforms all the top public methods, including Cassi, GPT-5, Grok~4.20, and Foresight-32B. Ablation studies show that the structured belief state is as impactful as web search access, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains. In addition, we develop a robust back-testing framework with a leakage rate below 1.5\%, and use rigorous statistical methodology to compare different methods while controlling for various sources of noise.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, Evals, cs.AI — indicating the contribution spans multiple research subfields.'}

agentllmbenchmark

#48

Faster by Design: Interactive Aerodynamics via Neural Surrogates Trained on Expert-Validated CFD

Evaluations & Benchmarks 2026-04-20 arXiv cs.AIarXiv cs.LGarXiv Evals

Nicholas Thumiger, Andrea Bartezzaghi, Mattia Rigotti, Cezary Skura +4

6.9

I 6.0 Im 6.3 P 7.8

Computational Fluid Dynamics (CFD) is central to race-car aerodynamic development, yet its cost -- tens of thousands of core-hours per high-fidelity evaluation -- severely limits the design space exploration feasible within realistic budgets. AI-based surrogate models promise to alleviate this bottleneck, but progress has been constrained by the limited complexity of public datasets, which are dominated by smoothed passenger-car shapes that fail to exercise surrogates on the thin, complex, highly loaded components governing motorsport performance. This work presents three primary contributions. First, we introduce a high-fidelity RANS dataset built on a parametric LMP2-class CAD model and spanning six operating conditions (map points) covering straight-line and cornering regimes, generated and validated by aerodynamics experts at Dallara to preserve features relevant to industrial motorsport. Second, we present the Gauge-Invariant Spectral Transformer (GIST), a graph-based neural operator whose spectral embeddings encode mesh connectivity to enhance predictions on tightly packed, complex geometries. GIST guarantees discretization invariance and scales linearly with mesh size, achieving state-of-the-art accuracy on both public benchmarks and the proposed race-car dataset. Third, we demonstrate that GIST achieves a level of predictive accuracy suitable for early-stage aerodynamic design, providing a first validation of the concept of interactive design-space exploration -- where engineers query a surrogate in place of the CFD solver -- within industrial motorsport workflows.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Evals, cs.AI, cs.LG — indicating the contribution spans multiple research subfields.'}

transformerbenchmark

#49

Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

Research 2026-04-20 HF +3 HF Daily Papers

6.9

I 6.5 Im 6.5 P 7.1

Featured on Hugging Face Daily Papers (Apr 20, 2026) with 3 upvotes. Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

hf-daily

#50

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Research 2026-04-20 HF +3 HF Daily Papers

6.9

I 6.5 Im 6.5 P 7.1

Featured on Hugging Face Daily Papers (Apr 20, 2026) with 3 upvotes. OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

hf-daily

#51

Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding

Research 2026-04-20 HF +3 HF Daily Papers

6.9

I 6.5 Im 6.5 P 7.1

Featured on Hugging Face Daily Papers (Apr 20, 2026) with 3 upvotes. Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding

hf-daily

#52

Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

Research 2026-04-20 HF +2 HF Daily Papers

6.9

I 6.5 Im 6.5 P 7.1

Featured on Hugging Face Daily Papers (Apr 20, 2026) with 2 upvotes. Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

hf-daily

#53

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

Research 2026-04-20 HF +2 HF Daily Papers

6.9

I 6.5 Im 6.5 P 7.1

Featured on Hugging Face Daily Papers (Apr 20, 2026) with 2 upvotes. Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

hf-daily

#54

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

Research 2026-04-20 HF +2 HF Daily Papers

6.9

I 6.5 Im 6.5 P 7.1

Featured on Hugging Face Daily Papers (Apr 20, 2026) with 2 upvotes. MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

hf-daily

#55

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

Multimodal 2026-04-20 arXiv cs.CVarXiv cs.AIarXiv cs.LG

A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, Alexei A. Efros

6.8

I 5.0 Im 7.1 P 7.8

The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.AI, cs.CV, cs.LG — indicating the contribution spans multiple research subfields.'}

ragalignment

#56

ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship

Agents & Tool Use 2026-04-20 arXiv AgentsarXiv cs.CL

Zhaopei Huang, Yanfeng Jia, Jiayi Zhao, Xinjie Zhang +2

6.8

I 7.3 Im 5.7 P 6.8

Developing compassionate interactive systems requires agents to not only understand user emotions but also provide diverse, substantive support. While recent works explore empathetic dialogue generation, they remain limited in response form and content, struggling to satisfy diverse needs across users and contexts. To address this, we explore empowering agents with external tools to execute diverse actions. Grounded in the psychological concept of "social support", this paradigm delivers substantive, human-like companionship. Specifically, we first design a dozen user-centric tools simulating various multimedia applications, which can cover different types of social support behaviors in human-agent interaction scenarios. We then construct ComPASS-Bench, the first personalized social support benchmark for LLM-based agents, via multi-step automated synthesis and manual refinement. Based on ComPASS-Bench, we further synthesize tool use records to fine-tune the Qwen3-8B model, yielding a task-specific ComPASS-Qwen. Comprehensive evaluations across two settings reveal that while the evaluated LLMs can generate valid tool-calling requests with high success rates, significant gaps remain in final response quality. Moreover, tool-augmented responses achieve better overall performance than directly producing conversational empathy. Notably, our trained ComPASS-Qwen demonstrates substantial improvements over its base model, achieving comparable performance to several large-scale models. Our code and data are available at https://github.com/hzp3517/ComPASS.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, cs.CL — indicating the contribution spans multiple research subfields.'}

agentllmbenchmark

#57

OpenGame: Open Agentic Coding for Games

Agents & Tool Use 2026-04-20 HF +33 arXiv AgentsHF Daily Papers

Yilei Jiang, Jinyuan Hu, Qianyin Xiao, Yaozhi Zheng +7

6.8

I 6.9 Im 6.5 P 6.4

Game development sits at the intersection of creative design and intricate software engineering, demanding the joint orchestration of game engines, real-time loops, and tightly coupled state across many files. While Large Language Models (LLMs) and code agents now solve isolated programming tasks with ease, they consistently stumble when asked to produce a fully playable game from a high-level design, collapsing under cross-file inconsistencies, broken scene wiring, and logical incoherence. We bridge this gap with OpenGame, the first open-source agentic framework explicitly designed for end-to-end web game creation. At its core lies Game Skill, a reusable, evolving capability composed of a Template Skill that grows a library of project skeletons from experience and a Debug Skill that maintains a living protocol of verified fixes - together enabling the agent to scaffold stable architectures and systematically repair integration errors rather than patch isolated syntax bugs. Powering this framework is GameCoder-27B, a code LLM specialized for game engine mastery through a three-stage pipeline of continual pre-training, supervised fine-tuning, and execution-grounded reinforcement learning. Since verifying interactive playability is fundamentally harder than checking static code, we further introduce OpenGame-Bench, an evaluation pipeline that scores agentic game generation along Build Health, Visual Usability, and Intent Alignment via headless browser execution and VLM judging. Across 150 diverse game prompts, OpenGame establishes a new state-of-the-art. We hope OpenGame pushes code agents beyond discrete software engineering problems and toward building complex, interactive real-world applications. Our framework will be fully open-sourced.

How it was discussed across sources

{'source': 'Hugging Face Daily Papers', 'text': 'Featured on HF Daily Papers (33 upvotes) — an early-day community popularity signal.'}

vlmagentllmalignment

#58

Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions

Frontier LLMs 2026-04-20 arXiv cs.CVarXiv cs.AIarXiv Agents

Kecheng Zhang, Zongxin Yang, Mingfei Han, Haihong Hao +5

6.7

I 6.0 Im 6.0 P 7.7

Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce \textbf{\model{}}, an instantiation of this framework with two core components. First, the \emph{Active Thinking Decision Maker (ATDM)} is a transparent reasoning controller that externalizes its decision process using observable progress ($\boldsymbolρ$) and confidence ($\boldsymbol{c}$) metrics. This allows it to precisely time its response $t_r$ to match the first-sufficient-evidence timestamp $t^\star$ while streaming its reasoning to the user. Second, the \emph{Hierarchical Progressive Semantic Integration (HPSI)} module acts as an efficient memory system. It employs a set of learnable, multi-level aggregation tokens that are propagated across clips to build a rich, global cognitive state without exceeding token budgets. %Our approach sets a new standard on key online video understanding benchmarks, achieving strong performance of \textbf{71.6\%} on StreamingBench and \textbf{46.9\%} on OVOBench, demonstrating a robust solution for evidence-aligned and transparent online video analysis. Extensive experiments demonstrate the effectiveness of ATDM and HPSI, e.g., Thinking-QwenVL improves the accuracy of the previous state-of-the-art from 67.63\% to 71.60\% on the StreamingBench benchmark.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, cs.AI, cs.CV — indicating the contribution spans multiple research subfields.'}

agentllmreasoningbenchmarkvideo

#59

Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations

AI Coding 2026-04-20 arXiv cs.AIarXiv cs.CL

Eric Rudolph, Philipp Steigerwald, Jens Albrecht

6.7

I 5.9 Im 6.8 P 6.8

This paper studies how empirical dialogue-flow statistics can be incorporated into Next Dialogue Act Prediction (NDAP). A KL regularization term is proposed that aligns predicted act distributions with corpus-derived transition patterns. Evaluated on a 60-class German counselling taxonomy using 5-fold cross-validation, this improves macro-F1 by 9--42% relative depending on encoder and substantially improves dialogue-flow alignment. Cross-dataset validation on HOPE suggests that improvements transfer across languages and counselling domains. In systematic ablations across pretrained encoders and architectures, the findings indicate that transition regularization provides consistent gains and disproportionately benefits weaker baseline models. The results suggest that lightweight discourse-flow priors complement pretrained encoders, especially in fine-grained, data-sparse dialogue tasks.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.AI, cs.CL — indicating the contribution spans multiple research subfields.'}

alignment

#60

Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

Post-Training 2026-04-20 arXiv cs.CLarXiv Post-training

Hao Meng, Siyuan Zheng, Shuran Zhou, Qiangqiang Wang +1

6.7

I 6.0 Im 6.8 P 6.8

Large Language Models (LLMs) show promise in lyric-to-melody generation, but models trained with Supervised Fine-Tuning (SFT) often produce musically implausible melodies with issues like poor rhythm and unsuitable vocal ranges, a phenomenon we term "constraint violation". To address this, we propose a novel alignment framework that instills musical knowledge without human annotation. We define rule-based musical constraints to automatically generate a preference dataset from an SFT model's outputs. The model is then aligned through a sequential process, first using Direct Preference Optimization (DPO) on paired preference data, followed by Kahneman-Tversky Optimization (KTO) on unpaired negative samples. Experimental results demonstrate that our aligned model substantially reduces rule violations and outperforms strong baselines in both objective and subjective evaluations, generating melodies with substantially improved musicality and coherence. An interactive demo with audio comparisons is available at https://arain233.github.io/AligningMelody-demo.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Post-training, cs.CL — indicating the contribution spans multiple research subfields.'}

dpollmalignment

#61

Safe Control using Learned Safety Filters and Adaptive Conformal Inference

Safety, Policy & Regulation 2026-04-20 arXiv cs.ROarXiv cs.LG

Sacha Huriot, Ihab Tabbara, Hussein Sibai

6.6

I 6.0 Im 6.5 P 6.8

Safety filters have been shown to be effective tools to ensure the safety of control systems with unsafe nominal policies. To address scalability challenges in traditional synthesis methods, learning-based approaches have been proposed for designing safety filters for systems with high-dimensional state and control spaces. However, the inevitable errors in the decisions of these models raise concerns about their reliability and the safety guarantees they offer. This paper presents Adaptive Conformal Filtering (ACoFi), a method that combines learned Hamilton-Jacobi reachability-based safety filters with adaptive conformal inference. Under ACoFi, the filter dynamically adjusts its switching criteria based on the observed errors in its predictions of the safety of actions. The range of possible safety values of the nominal policy's output is used to quantify uncertainty in safety assessment. The filter switches from the nominal policy to the learned safe one when that range suggests it might be unsafe. We show that ACoFi guarantees that the rate of incorrectly quantifying uncertainty in the predicted safety of the nominal policy is asymptotically upper bounded by a user-defined parameter. This gives a soft safety guarantee rather than a hard safety guarantee. We evaluate ACoFi in a Dubins car simulation and a Safety Gymnasium environment, empirically demonstrating that it significantly outperforms the baseline method that uses a fixed switching threshold by achieving higher learned safety values and fewer safety violations, especially in out-of-distribution scenarios.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.LG, cs.RO — indicating the contribution spans multiple research subfields.'}

safetypolicy

#62

Six Llamas: Comparative Religious Ethics Through LoRA-Adapted Language Models

Safety, Policy & Regulation 2026-04-20 arXiv cs.AIarXiv MechInterp

Chad Coleman, W. Russell Neuman, Manan Shah, Ali Dasdan +4

6.6

I 5.6 Im 6.8 P 6.8

We present Six Llamas, a comparative study examining whether large language models fine-tuned on distinct religious corpora encode systematically different patterns of ethical reasoning. Six variants of Meta-Llama-3.1-8B are constructed: one unmodified control and five LoRA-adapted models trained exclusively on the sacred and theological texts of Christianity, Islam, Judaism, Hinduism, or Buddhism. All six models are probed with an identical battery of 17 standardized ethical prompts spanning moral dilemmas, game-theoretic scenarios, public policy questions, and moral-psychological self-assessments. To assess robustness and reproducibility, we implement a multi-temperature sampling design spanning ten temperature settings. We compute response consistency metrics, pairwise inter-model agreement rates, temperature sensitivity coefficients across four prompt domains, and run-to-run stability analyses. Findings show that LoRA-adapted models produce ethical reasoning patterns that are (a) systematically differentiated from the base model, (b) consistent with the moral logics of their training traditions, (c) structured along interpretable dimensions in moral-philosophical space, (d) core ethical positions remain stable across temperature variations for high-consensus dilemmas. The Trolley Problem achieves 100% consistency across all models and temperatures, while (e) tradition-specific divergence intensifies at higher temperatures in morally contested domains, and (f) the base model exhibits the highest overall response consistency (mean 88.3%), suggesting LoRA adaptation introduces both tradition-specific signal and increased sampling sensitivity. The study offers a proof-of-concept for the condensate comparative method using differentially trained language models as instruments for cultural and ethical analysis and identifies specific criteria for falsification and planned extensions.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under MechInterp, cs.AI — indicating the contribution spans multiple research subfields.'}

reasoningpolicy

#63

Tensor Processing with Homodyne Photonic Integrated Circuits exceeds 1,000 TOPS

Frontier LLMs 2026-04-20 arXiv EvalsarXiv Efficiency

Lian Zhou, Kaiwen Xue, Yun-Jhu Lee, Chun-Ho Lee +9

6.6

I 6.6 Im 6.4 P 6.4

High-performance computing underpins modern artificial intelligence (AI), enabling foundation models, real-time inference and perception in autonomous systems, and data-intensive scientific simulations. Recent advances in quantization techniques utilizing low-precision computation without degrading model accuracy, creates new opportunities for analog photonic computing characterized by ultra-high clock rates and low energy consumption. Here we propose and demonstrate a coherent homodyne integrated circuit capable of general matrix multiplication(GEMM) with aggregate throughput that exceeds 1,000 TOPS (tera-operations per second), enabled by massive on-chip optical fanout and parallelism. By leveraging time multiplexing, the required modulator count is reduced from O($N^2$) to O(N), allowing dense integration of record-scale 256 $\times$ 256 homodyne units (each <0.0064 $mm^2$) within a single reticle. We employ wafer-scale fabricated 64 thin-film lithium niobate (TFLN) transmitters (each over 40-GHz bandwidth with propagation loss of 0.2 dB/cm) to encode data and chip-to-chip coupled to Si/SiN computing circuits (64 channels). Our system achieves up to 7-bit computational accuracy across 8 $\times$ 8 parallel channels at record computing clockrate 120 Gbaud/s, and 6-bit statistical accuracy across 256 $\times$ 100 channels at 20-128 Gbaud/s, representing a total throughput of 1,000-6,000 TOPS. Massive parallelism amortizes the optoelectronic (OE) conversion to allow 330-TOPS/W efficiency using foundry-available packaging technology. The system throughput is benchmarked with Qwen2.5-0.5 billion parameter models that generate accurate tokens. High throughput and energy efficiency establish a near-term pathway toward light-based accelerators for large-scale training and low-latency inference from datacenters to edges, accelerating new models toward artificial general intelligence.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Efficiency, Evals — indicating the contribution spans multiple research subfields.'}

ragquantizationbenchmarkchip

#64

Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models

Multimodal 2026-04-20 arXiv cs.CVarXiv cs.AI

Yakoub Bazi, Mohamad M. Al Rahhal, Mansour Zuair, Faroun Mohamed

6.6

I 7.0 Im 5.9 P 6.4

Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3-VL, which follows a structured vision-language pipeline with multi-depth visual conditioning and a full-attention decoder, with Qwen3.5, a native multimodal model that combines a single-stage alignment with a hybrid decoder backbone. Experimental results on the official CDVQA test splits show that recent VLMs improve over earlier specialized baselines. They further show that performance does not scale monotonically with model size, and that native multimodal models are more effective than structured vision-language pipelines for this task. These findings indicate that tightly integrated multimodal backbones contribute more to performance than scale or explicit multi-depth visual conditioning for language-driven semantic change reasoning in RS imagery.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.AI, cs.CV — indicating the contribution spans multiple research subfields.'}

vlmreasoningbenchmarkalignmentmultimodal

#65

Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling

Frontier LLMs 2026-04-20 arXiv cs.LGarXiv MechInterp

Yidi Yuan

6.5

I 5.6 Im 6.4 P 7.0

Semantic Tube Prediction (STP) leverages representation geometric to regularize LLM hidden-state trajectories toward locally linear geodesics during fine-tuning, thereby greatly improving data efficiency. The original STP recipe samples random token sub-spans, which is compatible with the base large language model (LLM) training architecture. Inspired by STP, we are interested to investigate whether the sampling position can further enhance the semantic structure of multi-step reasoning, and hence affect its geometric impact. We applied STP at consecutive semantic reasoning step boundaries and achieved 168x more accurate multi-step latent prediction than frozen baselines on ProcessBench (3,400 samples), compared to only 4x for the random-token STP. Probing the latent manifold with a learned non-linear predictor reveals that STP-shaped trajectories are smooth curves, not straight lines: a 3-layer MLP reduces prediction error by a further 3-12x over linear extrapolation on step-boundary models. Removing the language modeling loss yields trajectories that are 2x more MLP-predictable than the combined loss, revealing a tradeoff between generation quality and geometric purity. Our results identify sampling position as the critical variable in geometric regularization and establish multi-step latent prediction MSE as a new evaluation metric for this class of methods.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under MechInterp, cs.LG — indicating the contribution spans multiple research subfields.'}

ragllmreasoning

#66

A Note on TurboQuant and the Earlier DRIVE/EDEN Line of Work

Frontier LLMs 2026-04-20 arXiv cs.LGarXiv Efficiency

Ran Ben-Basat, Yaniv Ben-Itzhak, Gal Mendelson, Michael Mitzenmacher +2

6.4

I 6.2 Im 5.7 P 6.8

This note clarifies the relationship between the recent TurboQuant work and the earlier DRIVE (NeurIPS 2021) and EDEN (ICML 2022) schemes. DRIVE is a 1-bit quantizer that EDEN extended to any $b>0$ bits per coordinate; we refer to them collectively as EDEN. First, TurboQuant$_{\text{mse}}$ is a special case of EDEN obtained by fixing EDEN's scalar scale parameter to $S=1$. EDEN supports both biased and unbiased quantization, each optimized by a different $S$ (chosen via methods described in the EDEN works). The fixed choice $S=1$ used by TurboQuant is generally suboptimal, although the optimal $S$ for biased EDEN converges to $1$ as the dimension grows; accordingly TurboQuant$_{\text{mse}}$ approaches EDEN's behavior for large $d$. Second, TurboQuant$_{\text{prod}}$ combines a biased $(b-1)$-bit EDEN step with an unbiased 1-bit QJL quantization of the residual. It is suboptimal in three ways: (1) its $(b-1)$-bit step uses the suboptimal $S=1$; (2) its 1-bit unbiased residual quantization has worse MSE than (unbiased) 1-bit EDEN; (3) chaining a biased $(b-1)$-bit step with a 1-bit unbiased residual step is inferior to unbiasedly quantizing the input directly with $b$-bit EDEN. Third, some of the analysis in the TurboQuant work mirrors that of the EDEN works: both exploit the connection between random rotations and the shifted Beta distribution, use the Lloyd-Max algorithm, and note that Randomized Hadamard Transforms can replace uniform random rotations. Experiments support these claims: biased EDEN (with optimized $S$) is more accurate than TurboQuant$_{\text{mse}}$, and unbiased EDEN is markedly more accurate than TurboQuant$_{\text{prod}}$, often by more than a bit (e.g., 2-bit EDEN beats 3-bit TurboQuant$_{\text{prod}}$). We also repeat all accuracy experiments from the TurboQuant paper, showing that EDEN outperforms it in every setup we have tried.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Efficiency, cs.LG — indicating the contribution spans multiple research subfields.'}

quantization

#67

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

Multimodal 2026-04-20 arXiv cs.CL

HaeJun Yoo, Yongseop Shin, Insung Lee, Myoung-Wan Koo +1

6.4

I 6.9 Im 5.8 P 5.8

Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment of practical retrieval robustness. We present Omni-Embed-Audio (OEA), a retrieval-oriented encoder leveraging multimodal LLMs with native audio understanding. To systematically evaluate robustness beyond caption-style queries, we introduce User-Intent Queries (UIQs) - five formulations reflecting natural search behaviors: questions, commands, keyword tags, paraphrases, and exclusion-based negative queries. For negative queries, we develop a hard negative mining pipeline and propose discrimination metrics (HNSR, TFR) assessing models' ability to suppress acoustically similar distractors. Experiments on AudioCaps, Clotho, and MECAT show that OEA achieves comparable text-to-audio retrieval performance to state-of-the-art M2D-CLAP, while demonstrating clear advantages in two critical areas: (1) dominant text-to-text retrieval (+22% relative improvement), and (2) substantially superior hard negative discrimination (+4.3%p HNSR@10, +34.7% relative TFR@10), revealing that LLM backbones provide superior semantic understanding of complex queries.

ragllmbenchmarkmultimodal

#68

MUA: Mobile Ultra-detailed Animatable Avatars

Multimodal 2026-04-20 arXiv cs.CVarXiv Efficiency

Heming Zhu, Guoxing Sun, Marc Habermann

6.4

I 6.9 Im 5.4 P 6.4

Building photorealistic, animatable full-body digital humans remains a longstanding challenge in computer graphics and vision. Recent advances in animatable avatar modeling have largely progressed along two directions: improving the fidelity of dynamic geometry and appearance, or reducing computational complexity to enable deployment on resource-constrained platforms, e.g., VR headsets. However, existing approaches fail to achieve both goals simultaneously: Ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance details, and noticeable artifacts. To bridge this gap, we propose a novel animatable avatar representation, termed Wavelet-guided Multi-level Spatial Factorized Blendshapes, and a corresponding distillation pipeline that transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation. By coupling multi-level wavelet spectral decomposition with low-rank structural factorization in texture space, our method achieves up to 2000X lower computational cost and a 10X smaller model size than the original high-quality teacher avatar model, while preserving visually plausible dynamics and appearance details closely resemble those of the teacher model. Extensive comparisons with state-of-the-art methods show that our approach significantly outperforms existing avatar approaches designed for mobile settings and achieves comparable or superior rendering quality to most approaches that can only run on servers. Importantly, our representation substantially improves the practicality of high-fidelity avatars for immersive applications, achieving over 180 FPS on a desktop PC and real-time native on-device performance at 24 FPS on a standalone Meta Quest 3.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Efficiency, cs.CV — indicating the contribution spans multiple research subfields.'}

distillationgpu

#69

ReCap: Lightweight Referential Grounding for Coherent Story Visualization

Generative Media 2026-04-20 arXiv cs.CVarXiv Evals

Aditya Arora, Akshita Gupta, Pau Rodriguez, Marcus Rohrbach

6.3

I 6.9 Im 5.4 P 6.2

Story Visualization aims to generate a sequence of images that faithfully depicts a textual narrative that preserve character identity, spatial configuration, and stylistic coherence as the narratives unfold. Maintaining such cross-frame consistency has traditionally relied on explicit memory banks, architectural expansion, or auxiliary language models, resulting in substantial parameter growth and inference overhead. We introduce ReCap, a lightweight consistency framework that improves character stability and visual fidelity without modifying the base diffusion backbone. ReCap's CORE (COnditional frame REferencing) module treats anaphors, in our case pronouns, as visual anchors, activating only when characters are referred to by a pronoun and conditioning on the preceding frame to propagate visual identity. This selective design avoids unconditional cross-frame conditioning and introduces only 149K additional parameters, a fraction of the cost of memory-bank and LLM-augmented approaches. To further stabilize identity, we incorporate SemDrift (Guided Semantic Drift Correction) applied only during training. When text is vague or referential, the denoiser lacks a visual anchor for identity-defining attributes, causing character appearance to drift across frames, SemDrift corrects this by aligning denoiser representations with pretrained DINOv3 visual embeddings, enforcing semantic identity stability at zero inference cost. ReCap outperforms previous state-of-the-art, StoryGPT-V, on the two main benchmarks for story visualization by 2.63% Character-Accuracy on FlintstonesSV and by 5.65% on PororoSV, establishing a new state-of-the-art character consistency on both benchmarks. Furthermore, we extend story visualization to human-centric narratives derived from real films, demonstrating the capability of ReCap beyond stylized cartoon domains.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Evals, cs.CV — indicating the contribution spans multiple research subfields.'}

diffusionllmbenchmark

#70

AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment

Agents & Tool Use 2026-04-20 arXiv cs.AIarXiv cs.CL

Yixuan Wang, Yue Huang, Hong Qian, Yunzhao Wei +6

6.3

I 5.9 Im 5.7 P 6.8

Creativity has become a core competence in the era of LLMs and human-AI collaboration, underpinning innovation in real-world problem solving. Crucially, the systematic improvement of creativity necessitates scientifically valid assessment instruments. Psychometric research recognizes context-based assessment as an effective way to measure creative thinking. However, high-quality expert-designed contexts remain scarce. Existing LLM-based generators often struggle with insufficient assessment cues, weak narrative coherence, limited stylistic diversity, and poor support for creative thinking. To address these challenges, we propose AlphaContext, an evolutionary tree-based psychometric context generator for creativity assessment. First, the HyperTree Outline Planner formalizes expert-designed outlining as a rule-guided hypertree and performs top-down hierarchical planning. The MCTS-based Context Generator fills the outline via MCTS to balance global structure and local quality. Then, the Evolutionary Context Optimizer evolves contexts with MAP-Elites by repeatedly updating niche elites to jointly improve diversity and quality. Finally, the Assessment-Guided Evolution Refiner simulates virtual participants with diverse styles and recycles weak contexts for further evolution. Experiments show that AlphaContext yields an average improvement of 8% over competitive methods across 6 quality metrics.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.AI, cs.CL — indicating the contribution spans multiple research subfields.'}

ragllm

#71

Barrier-enforced multi-objective optimization for direct point and sharp interval forecasting

Frontier LLMs 2026-04-20 arXiv cs.LG

Worachit Amnuaypongsa, Yotsapat Suparanonrat, Pana Wanitchollakit, Jitkomut Songsiri

6.3

I 6.3 Im 6.3 P 5.8

This paper proposes a multi-step probabilistic forecasting framework using a single neural-network based model to generate simultaneous point and interval forecasts. Our approach ensures non-crossing prediction intervals (PIs) through a model structure design that strictly satisfy a target coverage probability (PICP) while maximizing sharpness. Unlike existing methods that rely on manual weight tuning for scalarized loss functions, we treat point and PI forecasting as a multi-objective optimization problem, utilizing multi-gradient descent to adaptively select optimal weights. Key innovations include a new PI loss function based on an extended log-barrier with an adaptive hyperparameter to guarantee the coverage, a hybrid architecture featuring a shared temporal model with horizon-specific submodels, and a training strategy. The proposed loss is scale-independent and universally applicable; combined with our training algorithm, the framework eliminates trial-and-error hyperparameter tuning for balancing multiple objectives. Validated by an intra-day solar irradiance forecasting application, results demonstrate that our proposed loss consistently outperforms those in current literature by achieving target coverage with the narrowest PI widths. Furthermore, when compared against LSTM encoder-decoder and Transformer architectures--including those augmented with Chronos foundation models--our method remains highly competitive and can be seamlessly adapted to any deep learning structure.

transformerrag

#72

Multi-Scale Reversible Chaos Game Representation: A Unified Framework for Sequence Classification

AI for Science 2026-04-20 arXiv cs.LG

Sarwan Ali, Taslim Murad

6.3

I 5.3 Im 7.1 P 5.8

Biological classification with interpretability remains a challenging task. For this, we introduce a novel encoding framework, Multi-Scale Reversible Chaos Game Representation (MS-RCGR), that transforms biological sequences into multi-resolution geometric representations with guaranteed reversibility. Unlike traditional sequence encoding methods, MS-RCGR employs rational arithmetic and hierarchical k-mer decomposition to generate scale-invariant features that preserve complete sequence information while enabling diverse analytical approaches. Our framework bridges three distinct paradigms for sequence analysis: (1) traditional machine learning using extracted geometric features, (2) computer vision models operating on CGR-generated images, and (3) hybrid approaches combining protein language model embeddings with CGR features. Through comprehensive experiments on synthetic DNA and protein datasets encompassing seven distinct sequence classes, we demonstrate that MS-RCGR features consistently enhance classification performance across all paradigms. Notably, our hybrid approach combining pre-trained language model embeddings (ESM2, ProtT5) with MS-RCGR features achieves superior performance compared to either method alone. The reversibility property of our encoding ensures no information loss during transformation, while multi-scale analysis captures patterns ranging from individual nucleotides to complex motif structures. Our results indicate that MS-RCGR provides a flexible, interpretable, and high-performing foundation for biological sequence analysis.

#73

🔬 Training Transformers to solve 95% failure rate of Cancer Trials — Ron Alfa & Daniel Bear, Noetik

Frontier LLMs 2026-04-20 Latent Space NewsletterLatent Space Podcast

Brandon Anderson

6.3

I 5.6 Im 6.0 P 6.8

95% of cancer treatments fail to pass clinical trials, but it may be a matching problem — that Noetik is solving with autoregressive transformers like TARIO-2!

transformer

#74

AutoPPA: Automated Circuit PPA Optimization via Contrastive Code-based Rule Library Learning

Frontier LLMs 2026-04-20 arXiv cs.LG

Chongxiao Li, Pengwei Jin, Di Huang, Guangrun Sun +13

6.2

I 6.3 Im 6.0 P 5.8

Performance, power, and area (PPA) optimization is a fundamental task in RTL design, requiring a precise understanding of circuit functionality and the relationship between circuit structures and PPA metrics. Recent studies attempt to automate this process using LLMs, but neither feedback-based nor knowledge-based methods are efficient enough, as they either design without any prior knowledge or rely heavily on human-summarized optimization rules. In this paper, we propose AutoPPA, a fully automated PPA optimization framework. The key idea is to automatically generate optimization rules that enhance the search for optimal solutions. To do this, AutoPPA employs an Explore-Evaluate-Induce ($E^2I$) workflow that contrasts and abstracts rules from diverse generated code pairs rather than manually defined prior knowledge, yielding better optimization patterns. To make the abstracted rules more generalizable, AutoPPA employs an adaptive multi-step search framework that adopts the most effective rules for a given circuit. Experiments show that AutoPPA outperforms both the manual optimization and the state-of-the-art methods SymRTLO and RTLRewriter.

llm

#75

Spectral bandits for smooth graph functions

Safety, Policy & Regulation 2026-04-20 arXiv cs.LGarXiv stat.ML

Michal Valko, Rémi Munos, Branislav Kveton, Tomáš Kocák

6.2

I 5.0 Im 6.5 P 6.6

Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node and its expected rating is similar to its neighbors. The goal is to recommend items that have high expected ratings. We aim for the algorithms where the cumulative regret with respect to the optimal policy would not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of nodes evaluations.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.LG, stat.ML — indicating the contribution spans multiple research subfields.'}

policy

#76

NEMO: Neural Electro-Mechano-Optic Sensors for Multiplexed Neural Interfaces

Interpretability 2026-04-20 arXiv MechInterp

Andrew Cochran, Harshvardhan Gupta, Vishal Jain, Maysamreza Chamanzar +1

6.2

I 6.3 Im 5.8 P 5.8

We introduce a novel electro-optomechanic neural sensor for realizing ultra-compact neural recording probes that can detect and relay electrophysiology signals from within neural tissue. This technology addresses outstanding challenges faced by existing neural recording technologies, including the resolution trade-off with signal-to-noise-ratio (SNR) due to the high impedances of small electrodes, and lingering stimulation artifacts. The sensor employs a highly miniaturized NEMS (nano-electromechanical systems) electrostatic transducer that modulates a silicon photonic microdisk resonator to convert electrical signals to an optical signal modulation. We have been able to achieve a limit of detection down to 110 microvolts, making the sensor sensitive enough to detect neural signals. This sensitive electro-optomechanic sensor directly detects electrophysiology signals and converts them to optomechanic modulation for effective transmission to outside the brain, which provides the unique potential for massive multiplexing of neural recordings. This design eliminates the need for bulky backend headstages that limit neural recording on awake free-roaming subjects. The ability of the device to record electrophysiological signals has been demonstrated using benchtop characterization and ex-vivo recordings from live neural tissue.

#77

BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources

Evaluations & Benchmarks 2026-04-20 arXiv cs.CL

Raghvendra Kumar, Devankar Raj, Sriparna Saha

6.2

I 6.3 Im 6.0 P 5.8

India's linguistic landscape, spanning 22 scheduled languages and hundreds of marginalized dialects, has driven rapid growth in NLP datasets, benchmarks, and pretrained models. However, no dedicated survey consolidates resources developed specifically for Indian languages. Existing reviews either focus on a few high-resource languages or subsume Indian languages within broader multilingual settings, limiting coverage of low-resource and culturally diverse varieties. To address this gap, we present the first unified survey of Indian NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models, tools, and systems across text, speech, multimodal, and culturally grounded tasks. We organize resources by linguistic phenomena, domains, and modalities; analyze trends in annotation, evaluation, and model design; and identify persistent challenges such as data sparsity, uneven language coverage, script diversity, and limited cultural and domain generalization. This survey offers a consolidated foundation for equitable, culturally grounded, and scalable NLP research in the Indian linguistic ecosystem.

ragbenchmarkmultimodal

#78

River-LLM: Large Language Model Seamless Exit Based on KV Share

Frontier LLMs 2026-04-20 arXiv cs.CLarXiv Efficiency

Yingtao Shen, An Zou

6.2

I 5.3 Im 6.0 P 6.8

Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions, such as recomputation or masking, either introduce significant latency overhead or incur severe precision loss, failing to bridge the gap between theoretical layer reduction and practical wall-clock speedup. In this paper, we propose River-LLM, a training-free framework that enables seamless token-level Early Exit. River-LLM introduces a lightweight KV-Shared Exit River that allows the backbone's missing KV cache to be naturally generated and preserved during the exit process, eliminating the need for costly recovery operations. Furthermore, we utilize state transition similarity within decoder blocks to predict cumulative KV errors and guide precise exit decisions. Extensive experiments on mathematical reasoning and code generation tasks demonstrate that River-LLM achieves 1.71 to 2.16 times of practical speedup while maintaining high generation quality.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Efficiency, cs.CL — indicating the contribution spans multiple research subfields.'}

kv cachellmreasoning

#79

ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation

Frontier LLMs 2026-04-20 arXiv cs.CL

Qingying Niu, Yuhao Wang, Ruiyang Ren, Bohui Fang +1

6.2

I 6.3 Im 6.0 P 5.8

Retrieval-augmented generation (RAG) remains unreliable in long-form settings, where retrieved evidence is noisy or contradictory, making it difficult for RAG pipelines to maintain factual consistency. Existing approaches focus on retrieval expansion or verification during generation, leaving conflict resolution entangled with generation. To address this limitation, we propose ArbGraph, a framework for pre-generation evidence arbitration in long-form RAG that explicitly resolves factual conflicts. ArbGraph decomposes retrieved documents into atomic claims and organizes them into a conflict-aware evidence graph with explicit support and contradiction relations. On top of this graph, we introduce an intensity-driven iterative arbitration mechanism that propagates credibility signals through evidence interactions, enabling the system to suppress unreliable and inconsistent claims before final generation. In this way, ArbGraph separates evidence validation from text generation and provides a coherent evidence foundation for downstream long-form generation. We evaluate ArbGraph on two widely used long-form RAG benchmarks, LongFact and RAGChecker, using multiple large language model backbones. Experimental results show that ArbGraph consistently improves factual recall and information density while reducing hallucinations and sensitivity to retrieval noise. Additional analyses show that these gains are evident under conflicting or ambiguous evidence, highlighting the effectiveness of evidence-level conflict resolution for improving the reliability of long-form RAG. The implementation is publicly available at https://github.com/1212Judy/ArbGraph.

ragbenchmark

#80

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

Reinforcement Learning 2026-04-20 arXiv cs.AIarXiv Evals

Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin +3

6.2

I 6.3 Im 5.4 P 6.4

Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial latent space. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER, a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration. Extensive experiments across mathematical and general reasoning benchmarks demonstrate that OGER significantly outperforms competitive baselines, achieving substantial gains in mathematical reasoning while maintaining robust generalization to out-of-domain tasks. We provide a comprehensive analysis of training dynamics and conduct detailed ablation studies to validate the effectiveness of our entropy-aware reward modulation. Our code is available at https://github.com/ecoli-hit/OGER.git.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Evals, cs.AI — indicating the contribution spans multiple research subfields.'}

ragllmreasoningbenchmark

#81

ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting

Multimodal 2026-04-20 arXiv cs.CVarXiv cs.CL

Clayton Fields, Casey Kennington

6.1

I 5.3 Im 6.0 P 6.6

Vision-language modeling is rapidly increasing in popularity with an ever expanding list of available models. In most cases, these vision-language models have parameters in the tens of billions, which is necessary for some needs, but in many cases smaller models are necessary (e.g., on edge devices or independent robotic platforms). Unfortunately, there is little research in producing light-weight models or in training them with small datasets. Inspired by the language learning progression and data sparsity in child development, in this paper, we address both of these goals in a systematic fashion. We show that two-tower encoder models are superior to one-tower encoders in low-resource settings for discriminative English tasks. We show also that incorporating traditional convolutional networks into the two-tower transformer architecture can help produce parameter efficient vision-language models. Finally, we show that the cross-modal fusion module of two-tower encoders can vary significantly in shape and size while producing the same results. In addition, we present ESsEN, a compact vision-language model that can be trained end-to-end with relatively few resources that performs as well on several tasks with only a fraction of the parameters compared to other models. The experimental results and the tools we present here make vision-language modeling more accessible to a wider variety of researchers.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.CL, cs.CV — indicating the contribution spans multiple research subfields.'}

transformer

#82

Learning the Riccati solution operator for time-varying LQR via Deep Operator Networks

Evaluations & Benchmarks 2026-04-20 arXiv cs.AIarXiv cs.LG

Jun Chen, Umberto Biccari, Junmin Wang

6.1

I 5.0 Im 6.0 P 6.8

We propose a computational framework for replacing the repeated numerical solution of differential Riccati equations in finite-horizon Linear Quadratic Regulator (LQR) problems by a learned operator surrogate. Instead of solving a nonlinear matrix-valued differential equation for each new system instance, we construct offline an approximation of the associated solution operator mapping time-dependent system parameters to the Riccati trajectory. The resulting model enables fast online evaluation of approximate optimal feedbacks across a wide class of systems, thereby shifting the computational burden from repeated numerical integration to a one-time learning stage. From a theoretical perspective, we establish control-theoretic guarantees for this operator-based approximation. In particular, we derive bounds quantifying how operator approximation errors propagate to feedback performance, trajectory accuracy, and cost suboptimality, and we prove that exponential stability of the closed-loop system is preserved under sufficiently accurate operator approximation. These results provide a framework to assess the reliability of data-driven approximations in optimal control. On the computational side, we design tailored DeepONet architectures for matrix-valued, time-dependent problems and introduce a progressive learning strategy to address scalability with respect to the system dimension. Numerical experiments on both time-invariant and time-varying LQR problems demonstrate that the proposed approach achieves high accuracy and strong generalization across a wide range of system configurations, while delivering substantial computational speedups compared to classical solvers. The method offers an effective and scalable alternative for parametric and real-time optimal control applications.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.AI, cs.LG — indicating the contribution spans multiple research subfields.'}

#83

LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation

Agents & Tool Use 2026-04-20 arXiv cs.AIarXiv cs.CL

Samar M. Magdy, Fakhraddin Alwajih, Abdellah El Mekki, Wesam El-Sayed +1

6.1

I 5.3 Im 5.7 P 6.8

Existing MT evaluation frameworks, including automatic metrics and human evaluation schemes such as Multidimensional Quality Metrics (MQM), are largely language-agnostic. However, they often fail to capture dialect- and culture-specific errors in diglossic languages (e.g., Arabic), where translation failures stem from mismatches in language variety, content coverage, and pragmatic appropriateness rather than surface form alone.We introduce LQM: Linguistically Motivated Multidimensional Quality Metrics for MT. LQM is a hierarchical error taxonomy for diagnosing MT errors through six linguistically grounded levels: sociolinguistics, pragmatics, semantics, morphosyntax, orthography, and graphetics (Figure 1). We construct a bidirectional parallel corpus of 3,850 sentences (550 per variety) spanning seven Arabic dialects (Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni), derived from conversational, culturally rich content. We evaluate six LLMs in a zero-shot setting and conduct expert span-level human annotation using LQM, producing 6,113 labeled error spans across 3,495 unique erroneous sentences, along with severity-weighted quality scores. We complement this analysis with an automatic metric (spBLEU). Though validated here on Arabic, LQM is a language-agnostic framework designed to be easily applied to or adapted for other languages. LQM annotated errors data, prompts, and annotation guidelines are publicly available at https://github.com/UBC-NLP/LQM_MT.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.AI, cs.CL — indicating the contribution spans multiple research subfields.'}

ragllm

#84

Physics-Informed Neural Networks for Biological $2\mathrm{D}{+}t$ Reaction-Diffusion Systems

Frontier LLMs 2026-04-20 arXiv cs.LGarXiv Diffusion/GenMedia

William Lavery, Jodie A. Cochrane, Christian Olesen, Dagim S. Tadele +2

6.1

I 5.3 Im 5.7 P 6.8

Physics-informed neural networks (PINNs) provide a powerful framework for learning governing equations of dynamical systems from data. Biologically-informed neural networks (BINNs) are a variant of PINNs that preserve the known differential operator structure (e.g., reaction-diffusion) while learning constitutive terms via trainable neural subnetworks, enforced through soft residual penalties. Existing BINN studies are limited to $1\mathrm{D}{+}t$ reaction-diffusion systems and focus on forward prediction, using the governing partial differential equation as a regulariser rather than an explicit identification target. Here, we extend BINNs to $2\mathrm{D}{+}t$ systems within a PINN framework that combines data preprocessing, BINN-based equation learning, and symbolic regression post-processing for closed-form equation discovery. We demonstrate the framework's real-world applicability by learning the governing equations of lung cancer cell population dynamics from time-lapse microscopy data, recovering $2\mathrm{D}{+}t$ reaction-diffusion models from experimental observations. The proposed framework is readily applicable to other spatio-temporal systems, providing a practical and interpretable tool for fast analytic equation discovery from data.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Diffusion/GenMedia, cs.LG — indicating the contribution spans multiple research subfields.'}

diffusion

#85

NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

AI Coding 2026-04-20 arXiv cs.LG

Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin

6.1

I 6.0 Im 5.8 P 5.8

Discrete diffusion language models (dLLMs) have recently emerged as a promising alternative to traditional autoregressive approaches, offering the flexibility to generate tokens in arbitrary orders and the potential of parallel decoding. However, existing heuristic sampling strategies remain inefficient: they choose only a small part of tokens to sample at each step, leaving substantial room for improvement. In this work, we study the problem of token sampling order optimization and demonstrate its significant potential for acceleration. Specifically, we find that fully leveraging correct predictions at each step can reduce the number of sampling iterations by an order of magnitude without compromising accuracy. Based on this, we propose Neural Indicator Sampling (NI Sampling), a general sampling order optimization framework that utilize a neural indicator to decide which tokens should be sampled at each step. We further propose a novel trajectory-preserving objective to train the indicator. Experiments on LLaDA and Dream models across multiple benchmarks show that our method achieves up to 14.3$\times$ acceleration over full-step sampling with negligible performance drop, and consistently outperforms confidence threshold sampling in the accuracy-step trade-off. Code is available at https://github.com/imagination-research/NI-Sampling.

diffusionragllmbenchmark

#86

MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline

Agents & Tool Use 2026-04-20 arXiv cs.CVarXiv Agents

Jiyao Liu, Jianghan Shen, Sida Song, Tianbin Li +18

6.1

I 6.3 Im 5.1 P 6.4

Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality assessment, and (2) Fine-grained Evidence Verification for rigorous validation of evidence precision, grounded in 5,130+ atomic claims. Evaluation of 17 LLMs and deep research agents reveals critical gaps in evidence integration and guideline generation, underscoring the substantial distance between current capabilities and expert-level clinical guideline development. Project: https://github.com/uni-medical/MedProbeBench

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, cs.CV — indicating the contribution spans multiple research subfields.'}

ragagentllmbenchmark

#87

SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection

Multimodal 2026-04-20 arXiv cs.CVarXiv Efficiency

Hao Vo, Khoa Vo, Thinh Phan, Ngo Xuan Cuong +4

6.0

I 5.3 Im 5.9 P 6.4

Camera-only 3D object detection has emerged as a cost-effective and scalable alternative to LiDAR for autonomous driving, yet existing methods primarily prioritize overall performance while overlooking the severe long-tail imbalance inherent in real-world datasets. In practice, many rare but safety-critical categories such as children, strollers, or emergency vehicles are heavily underrepresented, leading to biased learning and degraded performance. This challenge is further exacerbated by pronounced inter-class ambiguity (e.g., visually similar subclasses) and substantial intra-class diversity (e.g., objects varying widely in appearance, scale, pose, or context), which together hinder reliable long-tail recognition. In this work, we introduce SemLT3D, a Semantic-Guided Expert Distillation framework designed to enrich the representation space for underrepresented classes through semantic priors. SemLT3D consists of: (1) a language-guided mixture-of-experts module that routes 3D queries to specialized experts according to their semantic affinity, enabling the model to better disentangle confusing classes and specialize on tail distributions; and (2) a semantic projection distillation pipeline that aligns 3D queries with CLIP-informed 2D semantics, producing more coherent and discriminative features across diverse visual manifestations. Although motivated by long-tail imbalance, the semantically structured learning in SemLT3D also improves robustness under broader appearance variations and challenging corner cases, offering a principled step toward more reliable camera-only 3D perception.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Efficiency, cs.CV — indicating the contribution spans multiple research subfields.'}

distillationsafety

#88

Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective

Frontier LLMs 2026-04-20 arXiv cs.LG

Sijie Mai, Shiqin Han

6.0

I 6.0 Im 5.5 P 5.8

Multimodal affective computing aims to predict humans' sentiment, emotion, intention, and opinion using language, acoustic, and visual modalities. However, current models often learn spurious correlations that harm generalization under distribution shifts or noisy modalities. To address this, we propose a causal modality-invariant representation (CmIR) learning framework for robust multimodal learning. At its core, we introduce a theoretically grounded disentanglement method that separates each modality into `causal invariant representation' and `environment-specific spurious representation' from a causal inference perspective. CmIR ensures that the learned invariant representations retain stable predictive relationships with labels across different environments while preserving sufficient information from the raw inputs via invariance constraint, mutual information constraint, and reconstruction constraint. Experiments across multiple multimodal benchmarks demonstrate that CmIR achieves state-of-the-art performance. CmIR particularly excels on out-of-distribution data and noisy data, confirming its robustness and generalizability.

benchmarkmultimodal

#89

Looking for Lights from the Darkness: Signals from MeV-scale Solar Axion-like Particles

Interpretability 2026-04-20 arXiv MechInterp

Yu-Cheng Qiu, Yongchao Zhang

6.0

I 6.0 Im 5.5 P 5.8

The axion-like particles $a$ can be produced in the Sun via the process of $p + D \to {}^3{\rm He} +a$, with mass up to 5.5 MeV. The photons in the subsequent decay $a \to γγ$ can deviate significantly from the Sun, or even from roughly the opposite direction of the Sun. The nontrivial angular and spectral distributions of such photons enable us new methods to detect the {\it lights from the darkness}. In this letter, we consider both the space detection and terrestrial experiments at the South Pole. As a result of the two-body decay and the geometric effects, there exists a critical height for the terrestrial experiments, below which there is no photon for some regions of the parameter space. With the sensitivities of $10^{-16}$ ($10^{-17}$) erg cm$^{-2}$ s$^{-1}$ for the MeV-scale photons in future space and terrestrial experiments, the coupling $g_{aγ}$ of $a$ to photons can be probed up to $3\times10^{-12}$ ($1\times10^{-12}$) GeV$^{-1}$, well surpassing the current supernova limits.

#90

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing

Efficiency 2026-04-20 arXiv Efficiency

Mao Lin, Xi Wang, Guilherme Cox, Dong Li +1

6.0

I 6.9 Im 5.1 P 5.4

As modern LLMs support thousands to millions of tokens, KV caches grow to hundreds of gigabytes, stressing memory capacity and bandwidth. Existing solutions, such as KV cache pruning and offloading, alleviate these but underutilize hardware by relying solely on either GPU or CPU for attention computing, and considering yet limited CPU local memory for KV cache storage. We propose HybridGen, an efficient hybrid attention framework for long-context LLM inference. HybridGen enables CPU-GPU collaborative attention on systems with expanded tiered memory (e.g., CXL memory), addressing three key challenges: (1) multi-dimensional attention dependencies, (2) intensifying CPU-GPU load imbalance with longer sequences, and (3) NUMA penalty of tiered memories. HybridGen tackles these by introducing attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping. Experiments with three LLM models with eleven different sizes on three GPU platforms with a CXL-expanded memory show that HybridGen outperforms six state-of-the-art KV cache management methods by 1.41x--3.2x on average while maintaining superior accuracy.

kv cacheragllmgpu

#91

Dissecting AI Trading: Behavioral Finance and Market Bubbles

Agents & Tool Use 2026-04-20 arXiv cs.AIarXiv Agents

Shumiao Ouyang, Pengfei Sui

5.9

I 5.0 Im 5.7 P 6.6

We study how AI agents form expectations and trade in experimental asset markets. Using a simulated open-call auction populated by autonomous Large Language Model (LLM) agents, we document three main findings. First, AI agents exhibit classic behavioral patterns: a pronounced disposition effect and recency-weighted extrapolative beliefs. Second, these individual-level patterns aggregate into equilibrium dynamics that replicate classic experimental findings (Smith et al., 1988), including the predictive power of excess demand for future prices and the positive relationship between disagreement and trading volume. Third, by analyzing the agents' reasoning text through a twenty-mechanism scoring framework, we show that targeted prompt interventions causally amplify or suppress specific behavioral mechanisms, significantly altering the magnitude of market bubbles.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under Agents, cs.AI — indicating the contribution spans multiple research subfields.'}

agentllmreasoning

#92

Revisiting Active Sequential Prediction-Powered Mean Estimation

Frontier LLMs 2026-04-20 arXiv cs.LGarXiv stat.ML

Maria-Eleni Sfyraki, Jun-Kun Wang

5.9

I 5.0 Im 5.7 P 6.6

In this work, we revisit the problem of active sequential prediction-powered mean estimation, where at each round one must decide the query probability of the ground-truth label upon observing the covariates of a sample. Furthermore, if the label is not queried, the prediction from a machine learning model is used instead. Prior work proposed an elegant scheme that determines the query probability by combining an uncertainty-based suggestion with a constant probability that encodes a soft constraint on the query probability. We explored different values of the mixing parameter and observed an intriguing empirical pattern: the smallest confidence width tends to occur when the weight on the constant probability is close to one, thereby reducing the influence of the uncertainty-based component. Motivated by this observation, we develop a non-asymptotic analysis of the estimator and establish a data-dependent bound on its confidence interval. Our analysis further suggests that when a no-regret learning approach is used to determine the query probability and control this bound, the query probability converges to the constraint of the max value of the query probability when it is chosen obliviously to the current covariates. We also conduct simulations that corroborate these theoretical findings.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.LG, stat.ML — indicating the contribution spans multiple research subfields.'}

#93

Random Matrix Theory of Early-Stopped Gradient Flow: A Transient BBP Scenario

Frontier LLMs 2026-04-20 arXiv cs.LGarXiv stat.ML

Florentin Coeurdoux, Grégoire Ferré, Jean-Philippe Bouchaud

5.9

I 5.0 Im 5.7 P 6.6

Empirical studies of trained models often report a transient regime in which signal is detectable in a finite gradient descent time window before overfitting dominates. We provide an analytically tractable random-matrix model that reproduces this phenomenon for gradient flow in a linear teacher--student setting. In this framework, learning occurs when an isolated eigenvalue separates from a noisy bulk, before eventually disappearing in the overfitting regime. The key ingredient is anisotropy in the input covariance, which induces fast and slow directions in the learning dynamics. In a two-block covariance model, we derive the full time-dependent bulk spectrum of the symmetrized weight matrix through a $2\times 2$ Dyson equation, and we obtain an explicit outlier condition for a rank-one teacher via a rank-two determinant formula. This yields a transient Baik-Ben Arous-Péché (BBP) transition: depending on signal strength and covariance anisotropy, the teacher spike may never emerge, emerge and persist, or emerge only during an intermediate time interval before being reabsorbed into the bulk. We map the corresponding phase diagrams and validate the theory against finite-size simulations. Our results provide a minimal solvable mechanism for early stopping as a transient spectral effect driven by anisotropy and noise.

How it was discussed across sources

{'source': 'arXiv cross-listings', 'text': 'Listed under cs.LG, stat.ML — indicating the contribution spans multiple research subfields.'}

#94

Dual Alignment Between Language Model Layers and Human Sentence Processing

Frontier LLMs 2026-04-20 arXiv cs.CL

Tatsuki Kuribayashi, Alex Warstadt, Yohei Oseki, Ethan Gotlieb Wilcox

5.9

I 5.0 Im 6.3 P 5.8

A recent study (Kuribayashi et al., 2025) has shown that human sentence processing behavior, typically measured on syntactically unchallenging constructions, can be effectively modeled using surprisal from early layers of large language models (LLMs). This raises the question of whether such advantages of internal layers extend to more syntactically challenging constructions, where surprisal has been reported to underestimate human cognitive effort. In this paper, we begin by exploring internal layers that better estimate human cognitive effort observed in syntactic ambiguity processing in English. Our experiments show that, in contrast to naturalistic reading, later layers better estimate such a cognitive effort, but still underestimate the human data. This dual alignment sheds light on different modes of sentence processing in humans and LMs: naturalistic reading employs a somewhat weak prediction akin to earlier layers of LMs, while syntactically challenging processing requires more fully-contextualized representations, better modeled by later layers of LMs. Motivated by these findings, we also explore several probability-update measures using shallow and deep layers of LMs, showing a complementary advantage to single-layer's surprisal in reading time modeling.

llmalignment

#95

Physics-Informed Neural Networks: A Didactic Derivation of the Complete Training Cycle

Frontier LLMs 2026-04-20 arXiv cs.LG

Abdeladhim Tahimi

5.8

I 5.3 Im 5.8 P 5.8

This paper is a step-by-step, self-contained guide to the complete training cycle of a Physics-Informed Neural Network (PINN) -- a topic that existing tutorials and guides typically delegate to automatic differentiation libraries without exposing the underlying algebra. Using a first-order initial value problem with a known analytical solution as a running example, we walk through every stage of the process: forward propagation of both the network output and its temporal derivative, evaluation of a composite loss function built from the ODE residual and the initial condition, backpropagation of gradients -- with particular attention to the product rule that arises in hidden layers -- and a gradient descent parameter update. Every calculation is presented with explicit, verifiable numerical values using a 1-3-3-1 multilayer perceptron with two hidden layers and 22 trainable parameters. From these concrete examples, we derive general recursive formulas -- expressed as sensitivity propagation relations -- that extend the gradient computation to networks of arbitrary depth, and we connect these formulas to the automatic differentiation engines used in practice. The trained network is then validated against the exact solution, achieving a relative $L^2$ error of $4.290 \times 10^{-4}$ using only the physics-informed loss, without any data from the true solution. A companion Jupyter/PyTorch notebook reproduces every manual calculation and the full training pipeline, providing mutual validation between hand-derived and machine-computed gradients.

#96

Forecasting Ionospheric Irregularities on GNSS Lines of Sight Using Dynamic Graphs with Ephemeris Conditioning

Frontier LLMs 2026-04-20 arXiv cs.LG

Mert Can Turkmen, Eng Leong Tan, Yee Hui Lee

5.8

I 5.6 Im 5.5 P 5.8

Most data-driven ionospheric forecasting models operate on gridded products, which do not preserve the time-varying sampling structure of satellite-based sensing. We instead model the ionosphere as a dynamic graph over ionospheric pierce points (IPPs), with connectivity that evolves as satellite positions change. Because satellite trajectories are predictable, the graph topology over the forecast horizon can be constructed in advance. We exploit this property to condition forecasts on the future graph structure, which we term ephemeris conditioning. This enables prediction on lines of sight that appear only in the forecast horizon. We evaluate our framework on multi-GNSS (Global Navigation Satellite System) data from a co-located receiver pair in Singapore spanning January 2023 through April 2025. The task is to forecast Rate of TEC Index (ROTI)-defined irregularities at 5-minute cadence up to 2 hours ahead as binary probabilistic classification per node. The resulting model, IonoDGNN, achieves a Brier Skill Score (BSS) of 0.49 and a precision-recall area under the curve (PR-AUC) of 0.75, improving over persistence by 35\% in BSS and 52\% in PR-AUC, with larger gains at longer lead times. Ablations confirm that graph structure and ephemeris conditioning each contribute meaningfully, with conditioning proving essential for satellites that rise during the forecast horizon (receiver operating characteristic AUC: 0.95 vs.\ 0.52 without). Under simulated coverage dropout, the model retains predictive skill on affected nodes through spatial message passing from observed neighbors. These results suggest that dynamic graph forecasting on evolving lines of sight is a viable alternative to grid-based representations for ionospheric irregularity forecasting. The model and evaluation code will be released upon publication.

rag

#97

Parkinson's Disease Detection via Self-Supervised Dual-Channel Cross-Attention on Bilateral Wrist-Worn IMU Signals

Frontier LLMs 2026-04-20 arXiv cs.LG

Meheru Zannat

5.8

I 5.6 Im 5.5 P 5.8

Parkinson's disease (PD) is a chronic neurodegenerative disease. It shows multiple motor symptoms such as tremor, bradykinesia, postural instability, freezing of gait (FoG). PD is currently diagnosed clinically through physical exam by health-care professionals, which can be time consuming and highly subjective. Wearable IMU sensors has become a promising gateway for passive monitoring of PD patients. We propose a self-supervised cross-attention encoder that processes bilateral wrist-worn IMU signals from a public dataset called PADS, consisting of three groups, PD (Parkinson Disease), HC (Healthy Control) and DD (Differential Diagnosis) of a total of 469 subjects. We have achieved a mean accuracy of 93.12% for HC vs. PD classification and 87.04% for PD vs. DD classification. The results emphasize the clinical challenge of distinguishing Parkinson's from other neurodegenerative diseases. Self-supervised representation learning using contrastive infoNCE loss gained an accuracy of 93.56% for HC vs. PD and 92.50% for PD vs. DD using only 20% of labelled data. This demonstrates the effectiveness of our method in transfer learning for clinical use with minimal labels. The real-time applicability was tested by deploying the optimized model with a mean inference time of 48.32 ms per window on a Raspberry Pi CPU.

#98

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

Multimodal 2026-04-20 arXiv cs.CV

Rui Qian, Chuanhang Deng, Qiang Huang, Jian Xiong +5

5.7

I 6.3 Im 5.6 P 5.0

Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token $\texttt{ }$, whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model's ability to explicitly disentangle what to segment from where to segment. We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language grounded query banks. Instead of compressing all semantic reasoning and spatial localization into a single embedding, AnchorSeg constructs an ordered sequence of query banks: latent reasoning tokens that capture intermediate semantic states, and a segmentation anchor token that provides explicit spatial grounding. We model spatial conditioning as a factorized distribution over image tokens, where the anchor query determines localization signals while contextual queries provide semantic modulation. To bridge token-level predictions and pixel-level supervision, we propose Token--Mask Cycle Consistency (TMCC), a bidirectional training objective that enforces alignment across resolutions. By explicitly decoupling spatial grounding from semantic reasoning through structured language grounded query banks, AnchorSeg achieves state-of-the-art results on ReasonSeg test set (67.7\% gIoU and 68.1\% cIoU). All code and models are publicly available at https://github.com/rui-qian/AnchorSeg.

reasoningalignment

#99

EAST: Early Action Prediction Sampling Strategy with Token Masking

Multimodal 2026-04-20 arXiv cs.CV

Iva Sović, Ivan Martinović, Marin Oršić

5.7

I 6.9 Im 4.8 P 5.0

Early action prediction seeks to anticipate an action before it fully unfolds, but limited visual evidence makes this task especially challenging. We introduce EAST, a simple and efficient framework that enables a model to reason about incomplete observations. In our empirical study, we identify key components when training early action prediction models. Our key contribution is a randomized training strategy that samples a time step separating observed and unobserved video frames, enabling a single model to generalize seamlessly across all test-time observation ratios. We further show that joint learning on both observed and future (oracle) representations significantly boosts performance, even allowing an encoder-only model to excel. To improve scalability, we propose a token masking procedure that cuts memory usage in half and accelerates training by 2x with negligible accuracy loss. Combined with a forecasting decoder, EAST sets a new state of the art on NTU60, SSv2, and UCF101, surpassing previous best work by 10.1, 7.7, and 3.9 percentage points, respectively.

video

#100

LBFTI: Layer-Based Facial Template Inversion for Identity-Preserving Fine-Grained Face Reconstruction

Multimodal 2026-04-20 arXiv cs.CV

Zixuan Shen, Zhihua Xia, Kaikai Gan, Peipeng Yu

5.7

I 6.9 Im 4.8 P 5.0

In face recognition systems, facial templates are widely adopted for identity authentication due to their compliance with the data minimization principle. However, facial template inversion technologies have posed a severe privacy leakage risk by enabling face reconstruction from templates. This paper proposes a Layer-Based Facial Template Inversion (LBFTI) method to reconstruct identity-preserving fine-grained face images. Our scheme decomposes face images into three layers: foreground layers (including eyebrows, eyes, nose, and mouth), midground layers (skin), and background layers (other parts). LBFTI leverages dedicated generators to produce these layers, adopting a rigorous three-stage training strategy: (1) independent refined generation of foreground and midground layers, (2) fusion of foreground and midground layers with template secondary injection to produce complete panoramic face images with background layers, and (3) joint fine-tuning of all modules to optimize inter-layer coordination and identity consistency. Experiments demonstrate that our LBFTI not only outperforms state-of-the-art methods in machine authentication performance, with a 25.3% improvement in TAR, but also achieves better similarity in human perception, as validated by both quantitative metrics and a questionnaire survey.

rag

#101

A Generalized Synthetic Control Method for Baseline Estimation in Demand Response Services

Evaluations & Benchmarks 2026-04-20 arXiv cs.AI

Jonas Sievers, Mardavij Roozbehani

5.7

I 6.0 Im 5.1 P 5.4

Baseline estimation is critical to Demand Response (DR) settlement in electricity markets, yet existing machine learning methods remain limited in predictive performance, while methodologies from causal inference and counterfactual prediction are still underutilized in this domain. We introduce a Generalized Synthetic Control Method that builds on the classical Synthetic Control Method (SCM) from econometrics. While SCM provides a powerful framework for counterfactual estimation, classical SCM remains a static estimator: it fits the treated unit as a combination of contemporaneous donor units and therefore ignores predictable temporal structure in the residual error. We develop a generalized SCM framework that transforms baseline estimation into a dynamic counterfactual prediction problem by augmenting the donor representation with exogenous features, lagged treated load, and selected lagged donor signals. This enriched representation allows the estimator to capture autoregressive dependence, delayed donor-response patterns, and error-correction effects beyond the scope of standard SCM. The framework further accommodates nonlinear predictors when linear weighting is inadequate, with the greatest benefit arising in limited-data settings. Experiments on the Ausgrid smart-meter dataset show consistent improvements over classical SCM and strong benchmark methods, with the dominant performance gains driven by dynamic augmentation.

benchmark

#102

Duality for the Adversarial Total Variation

Frontier LLMs 2026-04-20 arXiv cs.LG

Leon Bungert, Lucas Schmitt

5.7

I 5.0 Im 5.8 P 5.8

Adversarial training of binary classifiers can be reformulated as regularized risk minimization involving a nonlocal total variation. Building on this perspective, we establish a characterization of the subdifferential of this total variation using duality techniques. To achieve this, we derive a dual representation of the nonlocal total variation and a related integration of parts formula, involving a nonlocal gradient and divergence. We provide such duality statements both in the space of continuous functions vanishing at infinity on proper metric spaces and for the space of essentially bounded functions on Euclidean domains. Furthermore, under some additional conditions we provide characterizations of the subdifferential in these settings.

#103

Scalable Physics-Informed Neural Differential Equations and Data-Driven Algorithms for HVAC Systems

Frontier LLMs 2026-04-20 arXiv cs.LG

Hanfeng Zhai, Hongtao Qiao, Hassan Mansour, Christopher Laughman

5.7

I 5.0 Im 5.8 P 5.8

We present a scalable, data-driven simulation framework for large-scale heating, ventilation, and air conditioning (HVAC) systems that couples physics-informed neural ordinary differential equations (PINODEs) with differential-algebraic equation (DAE) solvers. At the component level, we learn heat-exchanger dynamics using an implicit PINODE formulation that predicts conserved quantities (refrigerant mass $M_r$ and internal energy $E_\text{hx}$) as outputs, enabling physics-informed training via automatic differentiation of mass/energy balances. Stable long-horizon prediction is achieved through gradient-stabilized latent evolution with gated architectures and layer normalization. At the system level, we integrate learned components with DAE solvers (IDA and DASSL) that explicitly enforce junction constraints (pressure equilibrium and mass-flow consistency), and we use Bayesian optimization to tune solver parameters for accuracy--efficiency trade-offs. To reduce residual system-level bias, we introduce a lightweight corrector network trained on short trajectory segments. Across dual-compressor and scaled network studies, the proposed approach attains multi-fold speedups over high-fidelity simulation while keeping errors low (MAPE below a few percent) and scales to systems with up to 32 compressor--condenser pairs.

#104

Balance-Guided Sparse Identification of Multiscale Nonlinear PDEs with Small-coefficient Terms

Efficiency 2026-04-20 arXiv cs.LG

Zhenhua Dang, Lei Zhang, Long Wang, Guowei He

5.7

I 5.3 Im 5.5 P 5.8

Data-driven discovery of governing equations has advanced significantly in recent years; however, existing methods often struggle in multiscale systems where dynamically significant terms may have small coefficients. Therefore, we propose Balance-Guided SINDy (BG-SINDy) inspired by the principle of dominant balance, which reformulates $\ell_0$-constrained sparse regression as a term-level $\ell_{2,0}$-regularized problem and solves it using a progressive pruning strategy. Terms are ranked according to their relative contributions to the governing equation balance rather than their absolute coefficient magnitudes. Based on this criterion, BG-SINDy alternates between least-squares regression and elimination of negligible terms, thereby preserving dynamically significant terms even when their coefficients are small. Numerical experiments on the Korteweg--de Vries equation with a small dispersion coefficient, a modified Burgers equation with vanishing hyperviscosity, a modified Kuramoto--Sivashinsky equation with multiple small-coefficient terms, and a two-dimensional reaction--diffusion system demonstrate the validity of BG-SINDy in discovering small-coefficient terms. The proposed method thus provides an efficient approach for discovering governing equations that contain small-coefficient terms.

diffusion

#105

Bridge-Centered Metapath Classification Using R-GCN-VGAE for Disaster-Resilient Maintenance Decisions

Frontier LLMs 2026-04-20 arXiv cs.LG

Takato Yasuno

5.7

I 5.3 Im 5.5 P 5.8

Daily infrastructure management in preparation for disasters is critical for urban resilience. When bridges remain resilient against disaster-induced external forces, access to hospitals, shops, and residences via metapaths can be sustained, maintaining essential urban functions. However, prioritizing bridge maintenance under limited budgets requires quantifying the multi-dimensional roles that bridges play in disaster scenarios -- a challenge that existing single-indicator approaches fail to address. We focus on metapaths from national highways through bridges to buildings (hospitals, shops, residences), constructing a heterogeneous graph with road, bridge, and building layers. A Relation-centric Graph Convolutional Network Variational Autoencoder (R-GCN-VGAE) learns metapath-based feature representations, enabling classification of bridges into disaster-preparedness categories: Supply Chain (commercial logistics), Medical Access (emergency healthcare), and Residential Protection (preventing isolation). Using OSMnx and open data, we validate our methodology on three diverse cities in Ibaraki Prefecture, Japan: Mito (697 bridges), Chikusei (258 bridges), and Moriya (148 bridges), totaling 1,103 bridges. The heterogeneous graph construction from open data enables redefining bridge roles for disaster scenarios, supporting maintenance budget decision-making. We contributed that (1) Open-data methodology for constructing urban heterogeneous graphs. (2) Redefinition of bridge roles for disaster scenarios via metapath-based classification. (3) Establishment of maintenance budget decision support methodology. (4) k-NN tuning strategy validated across diverse city scales. (5) Empirical demonstration of UMAP superiority over t-SNE/PCA for multi-role bridge visualization.

#106

Linear-wave bound on electromagnetic energy equipartition at sub-electron scales in non-relativistic plasmas

Interpretability 2026-04-20 arXiv MechInterp

Vivek Shrivastav, Mani K Chettri, Britan Singh, Hemam D. Singh +1

5.7

I 5.3 Im 5.5 P 5.8

Recent Magnetospheric Multiscale (MMS) observations report approximate equality between electric and magnetic field energy spectral densities, $\varepsilon_{0} P[δE]/2 \approx P[δB]/(2μ_{0})$, at sub-electron scales in reconnection-driven magnetotail turbulence, interpreted as relaxation toward thermodynamic equilibrium. We derive the electric-to-magnetic energy ratio from the linear polarization of kinetic Alfvén waves and whistler-mode waves in the two-fluid framework and show that it saturates at $\mathcal{R}_{\infty}=(V_{A}/c)^{2}(m_{i}/m_{e})(β_{e}/2)$ deep in the sub-electron regime. Setting $\mathcal{R}_{\infty}=1$ yields the universal threshold $V_{A}/c \gtrsim \sqrt{2/[(m_{i}/m_{e})β_{e}]}$, which no non-relativistic space plasma satisfies. For typical magnetotail parameters, $\mathcal{R}_{\infty}\approx 2\times 10^{-3}$, approximately 500 times below the observed value, a discrepancy rooted in the non-relativistic ordering $(V_{A}/c)^{2}\ll 1$. Noise-floor estimates show that Search Coil Magnetometer and Electric Double Probe sensitivity convergence produces a spurious apparent equipartition throughout this regime. The observed equality likely reflects nonlinear dynamics, incoherent superposition of electromagnetic and electrostatic fluctuations, or instrumental noise contamination.

#107

High-power attosecond X-ray free-electron lasers: physics and design strategy

Interpretability 2026-04-20 arXiv MechInterp

Chenzhi Xu, Jiawei Yan, Ye Chen, Winfried Decking +4

5.7

I 5.3 Im 5.5 P 5.8

Attosecond pulses from X-ray free-electron laser (XFEL) have opened new opportunities for probing ultrafast electronic dynamics on the Angstrom--attosecond spatiotemporal scale. Most attosecond XFEL concepts rely on generating an ultrashort high-current spike through either external laser modulation or accelerator-based beam manipulation. Despite their different implementations, these approaches share the same essential physics, namely that the XFEL amplification is confined to a short effective lasing window within the electron beam. However, existing studies are often scheme-specific and do not yet provide a unified quantitative picture of how fundamental electron-beam properties constrain high-power attosecond performance. In this work, we investigate the general physics and scheme-independent requirements for generating high-power attosecond X-ray pulses from a short current spike. From the perspective of post-saturation superradiant evolution, we show that the effective lasing length of the electron beam governs both the attainable peak power and the pulse duration. We further examine the distinct roles of slice energy spread, slice emittance, energy chirp, undulator tapering, and transverse beam tilt. Our results reveal the trade-off between peak power, pulse shortening, and single-spike probability, and provide facility-independent guidelines for optimizing electron-beam phase-space manipulation toward terawatt-class attosecond XFEL operation.

#108

Fourth-order galaxy-galaxy-lensing: Theoretical framework and direct estimation

Interpretability 2026-04-20 arXiv MechInterp

Jonathan Oel, Lucas Porth, Peter Schneider, Elena Silvestre-Rosello

5.7

I 5.3 Im 5.5 P 5.8

Traditional galaxy-galaxy lensing is a well-established method of probing the statistical properties of the Universe's matter and galaxy distribution. However, this measure does not carry all the statistical information, provided the matter and galaxy distribution contain non-Gaussian features. In order to study these non-Gaussianities, it is necessary to consider higher-order statistical measures. The aim of this work is to extend the analytical basis describing the statistical correlations between galaxies and shear to the fourth order, with special emphasis on the associated aperture statistics. In order to include fourth-order statistics in future analysis of the relation between mass and galaxies, we further investigate whether we can expect to detect these statistics from observations of stage IV surveys. We define the four-point correlation function (4PCF) between the shear and the positions of triplets of foreground galaxies and derive its relation to the respective trispectrum. We convert the 4PCF to aperture statistics and derive the analytical form of the respective filter function, which we then implement in a numerical integration pipeline. Furthermore, we develop a direct estimator that allows us to measure galaxy-mass aperture moments of arbitrary order on pixelized data using a Fast-Fourier-Transform (FFT) algorithm. We show that the corresponding aperture measure $\langle\mathcal{N}^3 M_\mathrm{ap}\rangle$ can be calculated with sub-percent accuracy on relevant aperture scales, $θ$, by means of numerical integration. Furthermore, we apply the FFT-based direct estimator to a mock catalog with a realistic stage IV survey setup on a sky area of $2000~\mathrm{deg}^2$, and detect the connected part of the aperture statistics $\langle\mathcal{N}^3 M_\mathrm{ap}\rangle(θ)$ with a signal-to-noise ratio of roughly nine on small aperture scales.

#109

[AINews] Moonshot Kimi K2.6: the world's leading Open Model refreshes to catch up to Opus 4.6 (ahead of DeepSeek v4?)

Frontier LLMs 2026-04-21 Latent Space Newsletter

Latent.Space

5.7

I 6.0 Im 5.1 P 5.4

Yay Kimi!!!

#110

TypeScript Repository Indexing for Code Agent Retrieval

Agents & Tool Use 2026-04-20 arXiv Agents

Junsong Pu, Yichen Li, Zhuangbin Chen

5.6

I 5.6 Im 5.4 P 5.4

Graph-based code indexing can improve context retrieval for LLM-based code agents by preserving call chains and dependency relationships that keyword search and similarity retrieval often miss. ABCoder is an open-source framework that parses codebases into a function-level code index called UniAST, but its existing parsers combine lightweight AST parsers for syntactic analysis with language servers for semantic resolution, but because LSP-based resolution requires a JSON-RPC call for each symbol lookup, these per-symbol calls become a bottleneck on large TypeScript repositories. We present abcoder-ts-parser, a TypeScript parser built on the TypeScript Compiler API that works directly with the compiler's AST, semantic information, and module resolution logic. We evaluate the parser on three open-source TypeScript projects with up to 1.2 million lines of code and find that it produces reliable indexes significantly more efficiently than the existing architecture. For a live demonstration, watch: https://youtu.be/ryssr7ouvdE

agentllm

#111

Towards Robust Text-to-Image Person Retrieval: Multi-View Reformulation for Semantic Compensation

Multimodal 2026-04-20 arXiv cs.CV

Chao Yuan, Yujian Zhao, Haoxuan Xu, Guanglin Niu

5.6

I 6.0 Im 5.6 P 5.0

In text-to-image person retrieval tasks, the diversity of natural language expressions and the implicitness of visual semantics often lead to the problem of Expression Drift, where semantically equivalent texts exhibit significant feature discrepancies in the embedding space due to phrasing variations, thereby degrading the robustness of image-text alignment. This paper proposes a semantic compensation framework (MVR) driven by Large Language Models (LLMs), which enhances cross-modal representation consistency through multi-view semantic reformulation and feature compensation. The core methodology comprises three components: Multi-View Reformulation (MVR): A dual-branch prompting strategy combines key feature guidance (extracting visually critical components via feature similarity) and diversity-aware rewriting to generate semantically equivalent yet distributionally diverse textual variants; Textual Feature Robustness Enhancement: A training-free latent space compensation mechanism suppresses noise interference through multi-view feature mean-pooling and residual connections, effectively capturing "Semantic Echoes"; Visual Semantic Compensation: VLM generates multi-perspective image descriptions, which are further enhanced through shared text reformulation to address visual semantic gaps. Experiments demonstrate that our method can improve the accuracy of the original model well without training and performs SOTA on three text-to-image person retrieval datasets.

vlmllmalignment

#112

Symbolic Synthesis for LTLf+ Obligations

Safety, Policy & Regulation 2026-04-20 arXiv cs.AI

Giuseppe De Giacomo, Christian Hagemeier, Daniel Hausmann, Nir Piterman

5.6

I 5.0 Im 5.9 P 5.4

We study synthesis for obligation properties expressed in LTLfp, the extension of LTLf to infinite traces. Obligation properties are positive Boolean combinations of safety and guarantee (co-safety) properties and form the second level of the temporal hierarchy of Manna and Pnueli. Although obligation properties are expressed over infinite traces, they retain most of the simplicity of LTLf. In particular, we show that they admit a translation into symbolically represented deterministic weak automata (DWA) obtained directly from the symbolic deterministic finite automata (DFA) for the underlying LTLf properties on trace prefixes. DWA inherit many of the attractive algorithmic features of DFA, including Boolean closure and polynomial-time minimization. Moreover, we show that synthesis for LTLfp obligation properties is theoretically highly efficient - solvable in linear time once the DWA is constructed. We investigate several symbolic algorithms for solving DWA games that arise in the synthesis of obligation properties and evaluate their effectiveness experimentally. Overall, the results indicate that synthesis for LTLfp obligation properties can be performed with virtually the same effectiveness as LTLf synthesis.

safety

#113

Wasserstein Distributionally Robust Risk-Sensitive Estimation via Conditional Value-at-Risk

Frontier LLMs 2026-04-20 arXiv cs.LG

Feras Al Taha, Eilyan Bitar

5.6

I 5.0 Im 5.5 P 5.8

We propose a distributionally robust approach to risk-sensitive estimation of an unknown signal x from an observed signal y. The unknown signal and observation are modeled as random vectors whose joint probability distribution is unknown, but assumed to belong to a given type-2 Wasserstein ball of distributions, termed the ambiguity set. The performance of an estimator is measured according to the conditional value-at-risk (CVaR) of the squared estimation error. Within this framework, we study the problem of computing affine estimators that minimize the worst-case CVaR over all distributions in the given ambiguity set. As our main result, we show that, when the nominal distribution at the center of the Wasserstein ball is finitely supported, such estimators can be exactly computed by solving a tractable semidefinite program. We evaluate the proposed estimators on a wholesale electricity price forecasting task using real market data and show that they deliver lower out-of-sample CVaR of squared error compared to existing methods.

#114

On the curlometer measurement of field-aligned and perpendicular currents in low Earth orbit: Swarm observations and whole geospace simulations

Interpretability 2026-04-20 arXiv MechInterp

R Gajewski, RT Desai, B Hnat, D Lin +12

5.6

I 5.0 Im 5.5 P 5.8

Measuring field-aligned currents (FACs) using magnetic field observations provides a powerful means to probe the multi-scale interactions between the magnetosphere, ionosphere and thermosphere. In this study, we apply the curlometer technique to Swarm spacecraft observations and to simulations of the coupled magnetosphere-ionosphere system. We begin by correlating current density curlometer estimates derived from Swarm tetrahedra with varying spatial scales and barycentre locations. This confirms an apparent departure from stationarity for FACs at spatio-temporal scales below 100 km where measurements appear highly uncorrelated. We then analyse simulated magnetic perturbations, where true four-point measurements are available. This shows how, even at meso-scales of hundreds of kilometres, time-shifted FAC estimates can diverge significantly from this ground truth. In both observational and simulated data we find poor tetrahedral configurations can produce spurious perpendicular currents due to numerical instability in the inversion process. This can be mitigated using appropriate quality metrics and high-quality FAC reconstructions still achieved with a tetrahedral face well-aligned to the local magnetic field. These results highlight the dynamic nature of FACs at large as well as small scales, and underscore the substantial advantages of true four-point observations for their accurate analysis.

#115

Coherent terahertz field tomographic imaging in warm Rydberg vapors

Interpretability 2026-04-20 arXiv MechInterp

Jan Nowosielski, Marcin Jastrzębski, Wojciech Wasilewski, Mateusz Mazelanik +1

5.6

I 5.0 Im 5.5 P 5.8

Rydberg atom-based sensors have emerged as highly sensitive tools for terahertz (THz) metrology, yet most current imaging techniques discard crucial phase information. In this Letter, we present a coherent THz-to-optical conversion scheme in warm Rb vapor that enables complex-amplitude field imaging. By manipulating the phase-matching conditions via an adjustable interference pattern of optical probe beams, we demonstrate the ability to perform tomographic reconstruction of the THz field distribution. We experimentally validate the spatial resolution and phase-sensitivity of the system by resolving sub-centimeter features and identifying incident angles of arrival. Our results establish a robust framework for phase-resolved THz imaging and holography using atomic vapors at room temperature.

#116

Optomechanical Detection of Individual Gas Collisions

Interpretability 2026-04-20 arXiv MechInterp

Yu-Han Tseng, Clarke A. Hardy, T. W. Penny, Cecily Lowe +3

5.6

I 5.0 Im 5.5 P 5.8

We experimentally demonstrate the detection of momentum transfers from individual collisions of Kr, Xe, and SF$_6$ with an optically levitated nanoparticle, finding good agreement with theoretical expectations. The observed event rates accurately measure the gas partial pressures, while the spectral shape provides a sensitive probe of the surface properties of the nanoparticle, including its temperature. The reconstruction of impulse signals as small as 200 keV/$c$ further establishes that levitated optomechanical sensors can reach the sensitivity required for precision measurements of fundamental particle interactions, and demonstrates a proof-of-principle for a primary pressure sensor based on the detection of individual gas particle collisions.

#117

AtomTwin.jl: a physics-native digital twin framework for neutral-atom quantum processors

Evaluations & Benchmarks 2026-04-20 arXiv Evals

Shannon Whitlock

5.6

I 6.6 Im 4.8 P 5.0

AtomTwin.jl is an open-source Julia package for developing and simulating quantum protocols, hardware configurations and building digital twins for neutral-atom quantum processors and related atomic quantum devices. AtomTwin operates between mathematical models and physical devices; modeling atoms, optical tweezers, laser fields, atomic motion, interactions, and noise processes natively from physical geometry and parameters, without requiring users to define Hamiltonians manually. The package provides hardware-level instruction sequences, high-performance solvers for coupled quantum and classical dynamics, and a ready-to-use model for ytterbium-171 atoms in an extensible framework designed to accommodate a greater variety of atomic species and hardware components in the future. This paper describes the software architecture, performance benchmarks against existing toolboxes, and a demonstrated end-to-end application: preparation of a logical Bell state in the $[[4,2,2]]$ error-detecting code with four $^{171}$Yb atoms in moveable tweezers.

benchmark

#118

Reading today's open-closed performance gap

Frontier LLMs 2026-04-20 Interconnects

Nathan Lambert

5.6

I 5.0 Im 5.5 P 5.8

The complex factors that determine the single evaluation number so many focus on. Plus, how this changes in the future.

#119

Understanding the Prompt Sensitivity

Frontier LLMs 2026-04-20 arXiv cs.CL

Yang Liu, Chenhui Chu

5.6

I 5.0 Im 5.5 P 5.8

Prompt sensitivity, which refers to how strongly the output of a large language model (LLM) depends on the exact wording of its input prompt, raises concerns among users about the LLM's stability and reliability. In this work, we consider LLMs as multivariate functions and perform a first-order Taylor expansion, thereby analyzing the relationship between meaning-preserving prompts, their gradients, and the log probabilities of the model's next token. We derive an upper bound on the difference between log probabilities using the Cauchy-Schwarz inequality. We show that LLMs do not internally cluster similar inputs like smaller neural networks do, but instead disperse them. This dispersing behavior leads to an excessively high upper bound on the difference of log probabilities between two meaning-preserving prompts, making it difficult to effectively reduce to 0. In our analysis, we also show which types of meaning-preserving prompt variants are more likely to introduce prompt sensitivity risks in LLMs. In addition, we demonstrate that the upper bound is strongly correlated with an existing prompt sensitivity metric, PromptSensiScore. Moreover, by analyzing the logit variance, we find that prompt templates typically exert a greater influence on logits than the questions themselves. Overall, our results provide a general interpretation for why current LLMs can be highly sensitive to prompts with the same meaning, offering crucial evidence for understanding the prompt sensitivity of LLMs. Code for experiments is available at https://github.com/ku-nlp/Understanding_the_Prompt_Sensitivity.

llm

#120

Space Force cancels contract with RTX for next-gen ground segment for GPS satellites

Government & Defense 2026-04-20 DefenseScoop

measley

5.5

I 5.6 Im 5.1 P 5.4

As of January 2026, the OCX program had reached a staggering cost of around $6.27 billion, according to the Space Force. The post Space Force cancels contract with RTX for next-gen ground segment for GPS satellites appeared first on DefenseScoop .

defense

#121

T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

Multimodal 2026-04-20 arXiv cs.CV

Savya Khosla, Sethuraman T, Aryan Chadha, Alex Schwing +1

5.5

I 5.9 Im 5.3 P 5.0

Despite recent progress, vision-language encoders struggle with two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks like open-vocabulary semantic segmentation; and (2) high token counts for fine-grained visual representations, which limits scalability to long videos. This work addresses both limitations. We propose T-REN (Text-aligned Region Encoder Network), an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (or region tokens). T-REN achieves this through a lightweight network added on top of a frozen vision backbone, trained to pool patch-level representations within each semantic region into region tokens and align them with region-level text annotations. With only 3.7% additional parameters compared to the vision-language backbone, this design yields substantially stronger dense cross-modal understanding while reducing the token count by orders of magnitude. Specifically, T-REN delivers +5.9 mIoU on ADE20K open-vocabulary segmentation, +18.4% recall on COCO object-level text-image retrieval, +15.6% recall on Ego4D video object localization, and +17.6% mIoU on VSPW video scene parsing, all while reducing token counts by more than 24x for images and 187x for videos compared to the patch-based vision-language backbone. The code and model are available at https://github.com/savya08/T-REN.

alignmentvideo

#122

DSA-CycleGAN: A Domain Shift Aware CycleGAN for Robust Multi-Stain Glomeruli Segmentation

Multimodal 2026-04-20 arXiv cs.CV

Zeeshan Nisar, Friedrich Feuerhake, Thomas Lampert

5.5

I 6.3 Im 4.8 P 5.0

A key challenge in segmentation in digital histopathology is inter- and intra-stain variations as it reduces model performance. Labelling each stain is expensive and time-consuming so methods using stain transfer via CycleGAN, have been developed for training multi-stain segmentation models using labels from a single stain. Nevertheless, CycleGAN tends to introduce noise during translation because of the one-to-many nature of some stain pairs, which conflicts with its cycle consistency loss. To address this, we propose the Domain Shift Aware CycleGAN, which reduces the presence of such noise. Furthermore, we evaluate several advances from the field of machine learning aimed at resolving similar problems and compare their effectiveness against DSA-CycleGAN in the context of multi-stain glomeruli segmentation. Experiments demonstrate that DSA-CycleGAN not only improves segmentation performance in glomeruli segmentation but also outperforms other methods in reducing noise. This is particularly evident when translating between biologically distinct stains. The code is publicly available at https://github.com/zeeshannisar/DSA-CycleGAN.

#123

QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

Agents & Tool Use 2026-04-20 arXiv Agents

Terence Lim, Kumar Muthuraman, Michael Sury

5.4

I 5.3 Im 5.1 P 5.4

We introduce a multi-agent framework intended to emulate parts of a quantitative research team and support equity factor research on large financial panel datasets. QRAFTI integrates a research toolkit for panel data with MCP servers that expose data access, factor construction, and custom coding operations as callable tools. It can help replicate established factors, formulate and test new signals, and generate standardized research reports accompanied by narrative analysis and computational traces. On multi-step empirical tasks, using chained tool calls and reflection-based planning may offer better performance and explainability than dynamic code generation alone.

mcpagent

#124

MQ-9B SeaGuardian® offers Navy-Marine Corps a decisive edge in today’s battlespace

Government & Defense 2026-04-20 Breaking Defense

Jana Vislocky

5.4

I 5.3 Im 5.1 P 5.4

[Sponsored] The MQ-9B is uniquely suited to meet the challenges of modern conflict.

#125

Caudle: Navy plans to announce F/A-XX prime contractor in August

Government & Defense 2026-04-20 DefenseScoop

measley

5.4

I 5.3 Im 5.1 P 5.4

The F/A-XX program has been in limbo for about a year due to concerns over defense industrial base capacity. The post Caudle: Navy plans to announce F/A-XX prime contractor in August appeared first on DefenseScoop .

defense

#126

The implicated scientist: on the role of AI researchers in the development of weapons systems

Safety, Policy & Regulation 2026-04-20 arXiv cs.AI

Alexandra Volokhova, Alex Hernandez-Garcia

5.3

I 5.0 Im 5.1 P 5.4

Artificial intelligence (AI) technologies are increasingly used in modern weapons systems. Notably, these systems have recently been involved in mass killings and destruction at scale. Furthermore, there is currently a strong interest and competition among powerful players to accelerate the proliferation of weapons with automated or AI-based components, a phenomenon known as AI arms race. This competition poses a risk of causing even more deaths and devastation in the future, as well as increased power and wealth inequality. In this work, we aim to shed light on the role of AI researchers as implicated subjects in the harms caused by weapons enabled by AI technologies. We investigate and discuss the specifics of this implication and explore ways to transfigure this position of implication into one of differentiated, long-distance solidarity with the victims of technologically fortified injustices.

#127

How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas

Multimodal 2026-04-21 Hugging Face Blog

5.3

I 5.0 Im 5.1 P 5.4

agent

#128

Can we AI our way to a more sustainable world?

AI for Science 2026-04-20 Microsoft Research Blog

Doug Burger, Amy Luers, Ishai Menache

5.3

I 5.0 Im 5.1 P 5.4

Doug Burger, sustainability expert Amy Luers, and optimization researcher Ishai Menache examine the global emissions implications of datacenter operations, efficiency gains, and AI's potential across electrification, materials, and food systems. The post Can we AI our way to a more sustainable world? appeared first on Microsoft Research .

#129

A new F/A-XX timeline and many, many MUSVs at Sea Air Space Day 1

Government & Defense 2026-04-21 Breaking Defense

Breaking Defense Video

5.3

I 5.0 Im 5.1 P 5.4

Aaron Mehta and Diana Stancy share highlights from the first day of the Navy League’s annual conference.

#130

Space Force kills OCX GPS ground control system, citing ‘insurmountable’ challenges

Government & Defense 2026-04-20 Breaking Defense

Michael Marrow

5.3

I 5.0 Im 5.1 P 5.4

The Pentagon will instead continue with a current ground control system managed by Lockheed Martin.

pentagon

#131

The sights of Sea Air Space Day 1

Government & Defense 2026-04-20 Breaking Defense

Breaking Defense Staff

5.3

I 5.0 Im 5.1 P 5.4

A selection of photos from the show floor on the first day of the Navy League’s biggest conference.

#132

Army to rely on FMS, reinvesting to ensure helicopter manufacturers stay ‘healthy’

Government & Defense 2026-04-20 Breaking Defense

Carley Welch

5.3

I 5.0 Im 5.1 P 5.4

After the recent FY27 budget request drastically cut helicopter procurement funding, senior leaders and industry ensure foreign sales and reinvestment programs will keep the helo manufacturers afloat.

#133

With eyes on future NASA moon base, Space Force launches cislunar acquisition task force

Government & Defense 2026-04-20 Breaking Defense

Theresa Hitchens

5.3

I 5.0 Im 5.1 P 5.4

Meanwhile, the Air Force Research Laboratory is gearing up to launch its experimental cislunar monitoring satellite, called Oracle Prime, next year.

#134

Saildrone unveils new medium unmanned surface vessel for anti-sub warfare, ISR

Government & Defense 2026-04-20 Breaking Defense

Diana Stancy

5.3

I 5.0 Im 5.1 P 5.4

The MUSV comes in two variants: the Spectre Silent Endurance, and the Spectre Stealth Strike.

#135

F/A-XX fighter downselect coming in August: CNO

Government & Defense 2026-04-20 Breaking Defense

Valerie Insinna

5.3

I 5.0 Im 5.1 P 5.4

The competition for the Navy’s sixth-generation fighter contract has narrowed to Northrop Grumman and Boeing.

#136

Iran’s targeting of our THAAD/TPY-2 radars is a ‘big freaking deal’

Government & Defense 2026-04-20 Breaking Defense

Casey Laughman

5.3

I 5.0 Im 5.1 P 5.4

“It’s not about the cost per round. It’s about achieving operational success,” says Tom Karako from CSIS.

#137

Marine Corps prototyping AI tools for aviation supply, predictive maintenance

Government & Defense 2026-04-20 DefenseScoop

dlawrence

5.3

I 5.0 Im 5.1 P 5.4

“Let’s change it before it needs to be in the air, declare an emergency, land in some place we don’t want it to land, etc.,” Lt. Gen. William Swan, deputy commandant for aviation, said. The post Marine Corps prototyping AI tools for aviation supply, predictive maintenance appeared first on DefenseScoop .

defense

#138

Navy considers new Warfighting Development Center for robotic and autonomous systems

Government & Defense 2026-04-20 DefenseScoop

Brandi Vincent

5.3

I 5.0 Im 5.1 P 5.4

Chief of Naval Operations Adm. Daryl Caudle supplied modernization updates at the Navy League’s Sea Air Space convention. The post Navy considers new Warfighting Development Center for robotic and autonomous systems appeared first on DefenseScoop .

defense

#139

How cyber threats are reshaping the defense battlefield in 2026

Government & Defense 2026-04-20 DefenseScoop

swhitehorne

5.3

I 5.0 Im 5.1 P 5.4

A new report examines how AI, identity-based attacks and cloud exploitation are transforming the cyber threat landscape for defense organizations. The post How cyber threats are reshaping the defense battlefield in 2026 appeared first on DefenseScoop .

defense

#140

Sub that sank Iranian warship reflects Navy’s drive to adapt, CNO says

Safety, Policy & Regulation 2026-04-21 Defense One

Bradley Peniston

5.3

I 5.0 Im 5.1 P 5.4

“That was a glimpse of the future force,” Adm. Caudle said at Sea-Air-Space.

#141

A-10s escape retirement once again amid continued use in Iran war

Safety, Policy & Regulation 2026-04-21 Defense One

Thomas Novelly

5.3

I 5.0 Im 5.1 P 5.4

The Air Force will extend three squadrons to keep the Warthog flying through 2030.

#142

Why the US can’t copy Ukraine’s robot navy

Robotics 2026-04-21 Defense One

Patrick Tucker

5.3

I 5.0 Im 5.1 P 5.4

Command and control will remain a human endeavor—even as the Pacific fills with robo-boats.

#143

How to reopen the Strait of Hormuz

Government & Defense 2026-04-20 Defense One

Luke Coffey

5.3

I 5.0 Im 5.1 P 5.4

It will take existing, though battered, diplomatic and military frameworks plus some creative thinking.

#144

Generalized parton distributions of valence, sea, and gluon components of the proton

Efficiency 2026-04-20 arXiv Efficiency

Yiping Liu, Siqi Xu, Chandan Mondal, Xingbo Zhao +1

5.3

I 5.0 Im 5.1 P 5.4

We compute the generalized parton distributions (GPDs) of valence quarks, sea quarks, and gluons in the proton using light-front wave functions obtained within the basis light-front quantization (BLFQ) framework, providing a realistic description of the nucleon at a low resolution scale. The wave functions are derived from a light-front QCD Hamiltonian without an explicit confining potential and include the three-quark, three-quark-gluon, and three-quark-quark-antiquark Fock sectors. For the first time within BLFQ, we evaluate quark GPDs at nonzero skewness in both the DGLAP and ERBL regions, while gluon GPDs are computed in the DGLAP region. The resulting GPDs exhibit qualitative features similar to, but smaller than the GUMP1.0 global extraction of GPDs based on experimental and lattice QCD data at next-to-leading order accuracy. We further compute the associated Compton form factors and obtain results consistent with the global analysis.

quantization

#145

Advancing Vision Transformer with Enhanced Spatial Priors

Multimodal 2026-04-20 arXiv cs.CV

Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu +1

5.2

I 5.6 Im 4.8 P 5.0

In recent years, the Vision Transformer (ViT) has garnered significant attention within the computer vision community. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and suffers from quadratic computational complexity, limiting its applicability. To address these issues, we have proposed RMT, a robust vision backbone with explicit spatial priors for general purposes. RMT utilizes Manhattan distance decay to introduce spatial information and employs a horizontal and vertical decomposition attention method to model global information. Building on the strengths of RMT, Euclidean enhanced Vision Transformer (EVT) is an expanded version that incorporates several key improvements. Firstly, EVT uses a more reasonable Euclidean distance decay to enhance the modeling of spatial information, allowing for a more accurate representation of spatial relationships compared to the Manhattan distance used in RMT. Secondly, EVT abandons the decomposed attention mechanism featured in RMT and instead adopts a simpler spatially-independent grouping approach, providing the model with greater flexibility in controlling the number of tokens within each group. By addressing these modifications, EVT offers a more sophisticated and adaptable approach to incorporating spatial priors into the Self-Attention mechanism, thus overcoming some of the limitations associated with RMT and further enhancing its applicability in various computer vision tasks. Extensive experiments on Image Classification, Object Detection, Instance Segmentation, and Semantic Segmentation demonstrate that EVT exhibits exceptional performance. Without additional training data, EVT achieves 86.6% top1-acc on ImageNet-1k.

transformer

#146

Bayesian experimental design: grouped geometric pooled posterior via ensemble Kalman methods

Safety, Policy & Regulation 2026-04-20 arXiv stat.ML

Huchen Yang, Xinghao Dong, Jinlong Wu

5.2

I 5.0 Im 5.3 P 5.0

Bayesian experimental design (BED) for complex physical systems is often limited by the nested inference required to estimate the expected information gain (EIG) or its gradients. Each outer sample induces a different posterior, creating a large and heterogeneous set of inference targets. Existing methods have to sacrifice either accuracy or efficiency: they either perform per-outer-sample posterior inference, which yields higher fidelity but at prohibitive computational cost, or amortize the inner inference across all outer samples for computational reuse, at the risk of degraded accuracy under posterior heterogeneity. To improve accuracy and maintain cost at the amortized level, we propose a grouped geometric pooled posterior framework that partitions outer samples into groups and constructs a pooled proposal for each group. While such grouping strategy would normally require generating separate proposal samples for different groups, our tailored ensemble Kalman inversion (EKI) formulation generates these samples without extra forward-model evaluation cost. We also introduce a conservative diagnostic to assess importance-sampling quality to guide grouping. This grouping strategy improves within-group proposal-target alignment, yielding more accurate and stable estimators while keeping the cost comparable to amortized approaches. We evaluate the performance of our method on both Gaussian-linear and high-dimensional network-based model discrepancy calibration problems.

alignment

#147

AI Resistance: some recent anti-AI stuff that’s worth discussing

Industry 2026-04-20 HN AI

speckx

5.1

I 5.6 Im 4.5 P 5.0

Article URL: https://stephvee.ca/blog/artificial%20intelligence/ai-resistance-is-growing/ Comments URL: https://news.ycombinator.com/item?id=47839951 Points: 348 # Comments: 340

#148

Deezer says 44% of songs uploaded to its platform daily are AI-generated

Industry 2026-04-20 HN AI

FiddlerClamp

5.1

I 5.6 Im 4.5 P 5.0

Article URL: https://techcrunch.com/2026/04/20/deezer-says-44-of-songs-uploaded-to-its-platform-daily-are-ai-generated/ Comments URL: https://news.ycombinator.com/item?id=47835928 Points: 335 # Comments: 334

#149

Adaptive Kernel Selection for Kernelized Diffusion Maps

Generative Media 2026-04-20 arXiv stat.ML

Othmane Aboussaad, Adam Miraoui, Boumediene Hamzi, Houman Owhadi

5.1

I 5.0 Im 5.0 P 5.0

Selecting an appropriate kernel is a central challenge in kernel-based spectral methods. In \emph{Kernelized Diffusion Maps} (KDM), the kernel determines the accuracy of the RKHS estimator of a diffusion-type operator and hence the quality and stability of the recovered eigenfunctions. We introduce two complementary approaches to adaptive kernel selection for KDM. First, we develop a variational outer loop that learns continuous kernel parameters, including bandwidths and mixture weights, by differentiating through the Cholesky-reduced KDM eigenproblem with an objective combining eigenvalue maximization, subspace orthonormality, and RKHS regularization. Second, we propose an unsupervised cross-validation pipeline that selects kernel families and bandwidths using an eigenvalue-sum criterion together with random Fourier features for scalability. Both methods share a common theoretical foundation: we prove Lipschitz dependence of KDM operators on kernel weights, continuity of spectral projectors under a gap condition, a residual-control theorem certifying proximity to the target eigenspace, and exponential consistency of the cross-validation selector over a finite kernel dictionary.

diffusion

#150

Anthropic takes $5B from Amazon and pledges $100B in cloud spending in return

Industry 2026-04-20 TechCrunch AI

Julie Bort

4.9

I 5.6 Im 4.5 P 4.6

Amazon has made another circular AI deal: It's investing another $5 billion in Anthropic. Anthropic has agreed to spend $100 billion on AWS in return.

#151

Neutrally Evolving Interlocking Complexity in the Quandary Den

AI for Science 2026-04-20 arXiv cs.NE

Andrew Walsh

4.9

I 5.0 Im 4.5 P 5.0

Molecular biology features numerous complexes of proteins that coordinate in an interlocking fashion to fulfill different functions. Adaptive evolution explains some of this complexity, but needn't be the default when neutral explanations suffice. A new artificial life model ``organism,'' the Quandary Den, is introduced to explore different neutral evolution scenarios where complexity increases in the absence of greater informational needs. Two interlocking complexity scenarios emerge. Subfunctionalization leads to functionality diffusing through the complex. Masking allows intracomplex interference to accumulate genetically, requiring that it be blocked at the level of expression.

#152

House Democrats want OPM, OMB to halt plans to collect federal worker health data

Government & Defense 2026-04-20 FedScoop

malder

4.9

I 5.0 Im 4.5 P 5.0

A coalition of lawmakers said the Trump administration’s plans to require insurers to hand over federal worker data could put those employees in jeopardy. The post House Democrats want OPM, OMB to halt plans to collect federal worker health data appeared first on FedScoop .

#153

Modernization without compromise: Building trust into every federal service moment

Government & Defense 2026-04-20 FedScoop

swhitehorne

4.9

I 5.0 Im 4.5 P 5.0

Delivering fast, digital services is no longer enough. Agencies must also ensure accuracy, security and accessibility. The post Modernization without compromise: Building trust into every federal service moment appeared first on FedScoop .

#154

After watchdog slams understaffing, AI to vet Pentagon-backed professors’ China ties

Government & Defense 2026-04-20 C4ISRNET

Aliya Sternstein

4.9

I 5.0 Im 4.5 P 5.0

AI’s confusion over the nature of DOD research partnerships may mask real espionage if humans are not the final judge of foreign influence, experts warn.

pentagonchina

#155

When Identity Means Everything

Safety, Policy & Regulation 2026-04-20 War on the Rocks

WOTR Staff

4.9

I 5.0 Im 4.5 P 5.0

Welcome to The Ukraine Compass, a weekly digest of Ukrainian commentary and analysis from across the political spectrum only for War on the Rocks members. Each Monday, we bring you a curated selection of articles from Ukrainian media offering insight into how Ukrainians themselves debate the issues shaping their country.American coverage often narrows the view to the battlefield — these pieces widen it, revealing the texture of daily life, politics, and public argument in a nation at war. The perspectives gathered here are varied, candid, and often surprising, together forming a more complete picture of Ukraine as it really is.Frontline and StrategyЦензор.НЕТ — The post When Identity Means Everything appeared first on War on the Rocks .

rag

#156

Conformal Robust Set Estimation

Research 2026-04-20 arXiv stat.ML

Alejandro Cholaquidis, Emilien Joly, Leonardo Moreno

4.9

I 5.0 Im 4.5 P 5.0

Conformal prediction provides finite-sample, distribution-free coverage under exchangeability, but standard constructions may lack robustness in the presence of outliers or heavy tails. We propose a robust conformal method based on a non-conformity score defined as the half-mass radius around a point, equivalently the distance to its $(\lfloor n/2\rfloor+1)$-nearest neighbour. We show that the resulting conformal regions are marginally valid for any sample size and converge in probability to a robust population central set defined through a distance-to-a-measure functional. Under mild regularity conditions, we establish exponential concentration and tail bounds that quantify the deviation between the empirical conformal region and its population counterpart. These results provide a probabilistic justification for using robust geometric scores in conformal prediction, even for heavy-tailed or multi-modal distributions.

rag

#157

Google rolls out Gemini in Chrome in 7 new countries

Frontier LLMs 2026-04-20 TechCrunch AI

Ivan Mehta

4.8

I 5.0 Im 4.5 P 4.6

Google is rolling out Gemini in Chrome in Australia, Indonesia, Japan, the Philippines, Singapore, South Korea, and Vietnam. The company is rolling this feature out to both desktop and iOS in all of these countries except Japan.

#158

It’s not just one thing — it’s another thing

Industry 2026-04-20 TechCrunch AI

Amanda Silberling

4.8

I 5.0 Im 4.5 P 4.6

This sentence construction ("It's not just this — it's that") has become so common in AI-generated writing that it's no longer just a clue that a piece of writing may be synthetic — it's almost a guarantee.

#159

NSA spies are reportedly using Anthropic’s Mythos, despite Pentagon feud

Government & Defense 2026-04-20 TechCrunch AI

Rebecca Bellan

4.8

I 5.0 Im 4.5 P 4.6

NSA is said to be using Anthropic's restricted Mythos AI model.

pentagon

#160

CEO and CFO suddenly depart AI nuclear power upstart Fermi

Industry 2026-04-20 TechCrunch AI

Kirsten Korosec

4.8

I 5.0 Im 4.5 P 4.6

The startup, co-founded by former U.S. Energy Secretary Rick Perry, has faced headwinds with its AI campus in Texas.