
Wolf Digest — 2026-04-24

Coverage window: 2026-04-23 03:21 ET to 2026-04-24 03:02 ET
Friday, April 24, 2026
13m 27s · top-4 narrated briefing
Must-read · top 3
#1 · Frontier LLMs
GPT-5.5 launch: frontier model + Codex superapp (OpenAI)
OpenAI released GPT-5.5 on April 23, positioned as "a new class of intelligence for real work" and rolled out across ChatGPT and the Codex app, with API access held back pending additional safeguards. In the bake-off against Anthropic's Opus 4.7 from the prior…
Score 8.8
#2 · Frontier LLMs
DeepSeek V4-Pro and V4-Flash — largest open-weights model ever, aggressive pricing
DeepSeek ended its five-month silence since V3.2 Speciale with two preview models in the V4 series, both MIT-licensed, both 1-million-token context, both mixture-of-experts. V4-Pro is 1.6 trillion total parameters with 49 billion active, making it the largest …
Score 7.8
#3 · Agents & Tool Use
What We Learned Building Cloud Agents
Cognition published a long follow-up to its earlier posts on multi-agent patterns, this time dissecting the infrastructure demands of cloud-based software engineering agents. The core claim is that the natural starting point for a cloud agent — take a CLI agen…
Score 7.6
#1

GPT-5.5 launch: frontier model + Codex superapp (OpenAI)

Frontier LLMs 2026-04-24 Latent Space (swyx & Alessio) · NVIDIA AI Blog · OpenAI Research · Simon Willison's Weblog · One Useful Thing (Ethan Mollick)
8.8
I 8.0 Im 7.0 P 10

OpenAI released GPT-5.5 on April 23, positioned as "a new class of intelligence for real work" and rolled out across ChatGPT and the Codex app, with API access held back pending additional safeguards. In the bake-off against Anthropic's Opus 4.7 from the prior week, Artificial Analysis crowns GPT-5.5 the top independently validated model in the world on its Intelligence Index, with GPT-5.5 at medium reasoning scoring the same as Opus 4.7 at max reasoning at roughly one quarter of the cost to run the index end to end, about $1,200 versus $4,800. Gemini 3.1 Pro Preview scores the same at around $900. Coverage across OpenAI's own system card, Simon Willison's preview notes, Ethan Mollick's early-access write-up, the NVIDIA blog, Latent Space's AI News, and TechCrunch converged on a profile of stronger long-horizon execution, noticeably better agentic coding, broader computer use, and improved token efficiency. Ethan Mollick's illustrative test, a procedurally generated 3D simulation of a harbor town evolving from 3000 BCE to 3000 CE, was rendered as a genuinely evolving town only by GPT-5.5 Pro, which completed in 20 minutes what GPT-5.4 Pro took 33 minutes to do.

The release is also a superapp moment for Codex. OpenAI bundled GPT-5.5 with a major Codex refresh that folds in the capabilities of its now-defunct Prism acquisition, adds built-in browser control, and ships with schedules, triggers, plugins, skills, and a unified project / thread / file workspace. Over 10,000 NVIDIA employees across engineering, product, legal, marketing, finance, sales, HR, and operations are already using GPT-5.5-powered Codex inside a Jensen-led company-wide rollout. NVIDIA's disclosure notes that Codex is served on GB200 NVL72 rack-scale systems, which the post frames as delivering roughly 35 times lower cost per million tokens and 50 times higher token output per second per megawatt versus prior-generation hardware, the economics that make enterprise-scale frontier inference viable. The practical signal is that ongoing engineering work with async gaps (pull-request opens, continuous-integration waits, review rounds) now fits inside one tool.

Pricing lands at about $5 per million input tokens and $30 per million output tokens for GPT-5.5 Pro, which is higher than Opus 4.7, but Artificial Analysis's intelligence-per-dollar curves make the combined offering competitive with the Claude stack on most real workloads. Simon Willison notes one remaining friction, the lack of official API access at launch, though the semi-official "/backend-api/codex/responses" endpoint that some agent harnesses such as Pi and Opencode use is now tacitly supported by OpenAI, which has hired the OpenClaw creator and publicly welcomes third parties integrating with ChatGPT subscriptions. This is a direct contrast with Anthropic, which blocked OpenClaw from doing the same with Claude subscriptions earlier in the year. The pattern coming out of the week is that raw one-dimensional intelligence numbers are giving way to intelligence-per-dollar axes as the real comparison space, with OpenAI, Anthropic, and Google all now jockeying for position on that 2D Pareto frontier.
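The intelligence-per-dollar framing amounts to computing a 2D Pareto frontier over (cost, score) points. A minimal sketch, using the approximate end-to-end eval costs quoted above; the shared score of 70 is a hypothetical stand-in for an Intelligence Index value, not a published number:

```python
# Pareto frontier over (cost, score) points: a model is Pareto-optimal
# if no other model is both cheaper and at least as smart.
def pareto_frontier(models):
    """models: dict name -> (cost, score). Returns Pareto-optimal names, sorted."""
    frontier = []
    for name, (cost, score) in models.items():
        dominated = any(
            (c < cost and s >= score) or (c <= cost and s > score)
            for other, (c, s) in models.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

# Illustrative figures from the digest: cost = end-to-end eval cost (USD),
# score = a stand-in intelligence value (hypothetical).
models = {
    "GPT-5.5 (medium)":    (1200, 70),
    "Opus 4.7 (max)":      (4800, 70),
    "Gemini 3.1 Pro Prev": (900, 70),
}
print(pareto_frontier(models))  # ['Gemini 3.1 Pro Prev']: at equal scores, only the cheapest survives
```

With genuinely different scores per model, several points can sit on the frontier at once, which is the "jockeying" the digest describes.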

How it was discussed
  • OpenAI's own announcement frames GPT-5.5 as optimized for coding, research, and data analysis across tools.
  • NVIDIA emphasizes that 10,000+ NVIDIA employees are already using GPT-5.5-powered Codex on GB200 NVL72, citing 35× lower cost per million tokens vs prior hardware.
  • Simon Willison highlights the missing API access at launch and the OpenClaw/Codex subscription-backdoor dynamic contrasting with Anthropic's recent block.
  • Ethan Mollick's procedural-town test found GPT-5.5 Pro was the only model that actually modeled an evolving town rather than replacing buildings in place.
  • Latent Space/swyx notes Artificial Analysis's finding: GPT-5.5 (medium) matches Opus 4.7 (max) at ~¼ the cost, but Gemini 3.1 Pro Preview still undercuts both.
  • TechCrunch frames it as OpenAI's 'super app' moment, with Codex folding in the defunct Prism's browser-control stack.
#2

DeepSeek V4-Pro and V4-Flash — largest open-weights model ever, aggressive pricing
7.8
I 8.0 Im 6.5 P 7.8

DeepSeek ended its five-month silence since V3.2 Speciale with two preview models in the V4 series, both MIT-licensed, both 1-million-token context, both mixture-of-experts. V4-Pro is 1.6 trillion total parameters with 49 billion active, making it the largest open-weights model ever released — larger than Kimi K2.6 at 1.1 trillion, GLM-5.1 at 754 billion, and more than twice the size of DeepSeek V3.2 at 685 billion. V4-Flash is a 284-billion-parameter MoE with 13 billion active, sized to fit reasonably quantized onto high-end consumer workstations. Simon Willison flagged the possibility of running Flash on a 128GB M5 MacBook Pro with light quantization, and even Pro on that hardware if the inference engine can stream just the active experts from disk per token. The checkpoints are 865 gigabytes and 160 gigabytes respectively on Hugging Face.
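Willison's "stream only the active experts" idea is easy to sanity-check with back-of-envelope arithmetic. A sketch using the parameter counts above; the ~4.3 bits per parameter is an assumption about quantization level, chosen only because it roughly reproduces the listed checkpoint sizes:

```python
def weights_gb(params_billions, bytes_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# V4-Pro: 1.6T total / 49B active; V4-Flash: 284B total / 13B active.
# ~4.3 bits/param (a Q4-style quantization level) is an assumption.
B = 4.3 / 8
print(round(weights_gb(1600, B)))  # 860: full V4-Pro, near the 865 GB checkpoint
print(round(weights_gb(49, B)))    # 26: only the per-token active experts
print(round(weights_gb(284, B)))   # 153: full V4-Flash, near the 160 GB listed
```

The ~26 GB active-expert footprint is what makes disk-streaming Pro on a 128GB machine conceivable, though which experts fire changes every token, so sustained disk bandwidth, not capacity, would be the real constraint.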

The headline is price. V4-Flash is listed at $0.14 per million input tokens and $0.28 per million output, which places it as the cheapest small model in the frontier class, below GPT-5.4 Nano at $0.20 input and $1.25 output, Gemini 3.1 Flash-Lite at $0.25 and $1.50, and Claude Haiku 4.5 at $1 and $5. V4-Pro comes in at $1.74 input and $3.48 output, roughly one third the price of Gemini 3.1 Pro at $2 and $12, one fifth the price of GPT-5.4 at $2.50 and $15, and an order of magnitude cheaper on output tokens than Claude Opus 4.7 at $5 and $25 or GPT-5.5 at $5 and $30. Simon Willison's pelican-on-a-bicycle benchmark suggests V4-Pro is in the same ballpark as those priced competitors for the structural-reasoning-under-SVG-constraints task he uses as a qualitative probe.
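The per-token prices above translate into blended workload costs. A sketch comparing blended $/Mtok under an assumed 3:1 input:output token mix; the ratio is an assumption, the prices are the ones quoted in the digest:

```python
def blended_cost(in_price, out_price, in_ratio=0.75):
    """Blended $/M tokens for a workload with the given input-token fraction."""
    return in_price * in_ratio + out_price * (1 - in_ratio)

prices = {  # (input $/Mtok, output $/Mtok), as listed above
    "DeepSeek V4-Flash": (0.14, 0.28),
    "GPT-5.4 Nano":      (0.20, 1.25),
    "DeepSeek V4-Pro":   (1.74, 3.48),
    "Gemini 3.1 Pro":    (2.00, 12.00),
    "Claude Opus 4.7":   (5.00, 25.00),
    "GPT-5.5":           (5.00, 30.00),
}
for name, (i, o) in sorted(prices.items(), key=lambda kv: blended_cost(*kv[1])):
    print(f"{name:18s} ${blended_cost(i, o):6.2f}/Mtok blended")
```

At this mix, V4-Pro's blended cost lands around $2.2/Mtok against roughly $10 to $11 for Opus 4.7 and GPT-5.5, consistent with the "one fifth to one tenth" framing in the text for output-heavy workloads.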

DeepSeek frames both models as V4 previews, with a fuller release and associated technical report still pending. The architectural details disclosed so far are continuity with V3's multi-head latent attention and fine-grained expert routing, scaled roughly 2.3 times in total parameters for Pro. OpenRouter already exposes both endpoints, and DeepSeek's own API is live at the published prices. The significance for open weights is that the frontier-parity tier, the bracket where a model can plausibly stand in for Opus 4.7 or GPT-5.5 on broad tasks, now contains an MIT-licensed option that runs at roughly one fifth to one tenth the cost per million tokens. For the Chinese-lab cohort, it also re-establishes DeepSeek's cadence after a period where Qwen and Kimi had been doing most of the frontier-open-weights announcements. The practical question for the coming weeks is how Pro's 1.6 trillion parameters hold up on the harder agentic-coding and long-horizon benchmarks versus the closed frontier stack, and whether the 49-billion active-parameter count keeps inference cost viable at the listed prices under real load.

#3

What We Learned Building Cloud Agents

Agents & Tool Use 2026-04-23 Cognition AI (Devin)
7.6
I 7.5 Im 7.5 P 6.0

Cognition published a long follow-up to its earlier posts on multi-agent patterns, this time dissecting the infrastructure demands of cloud-based software engineering agents. The core claim is that the natural starting point for a cloud agent — take a CLI agent, containerize it, give it repo and toolchain access — looks achievable but hits three hard walls when it meets real engineering workflows. Cognition frames the post as a counter to recent signals, including Stripe's public description of its homegrown cloud agent, that building one in-house is the right path for large organizations.

The first wall is isolation. Containerized agents share a kernel, and agents generate their own code, run arbitrary commands, and probe the environment unpredictably. A kernel-level escape lets one compromised session reach every other container's filesystems, credentials, and network connections. Cognition's conclusion is the same as the broader infra community's for any untrusted workload: VM-level isolation. Their implementation is a microVM per agent session, with over a year of hypervisor engineering behind it. A side benefit is that agents in dedicated VMs can drive a full browser, desktop applications, and arbitrary tool stacks the way a developer on a workstation does, which turns out to matter for agentic work that wanders outside the terminal.

The second wall is state across async gaps. Real engineering is stop-and-go: open a pull request, wait on continuous integration, respond to code review, rerun tests, push a follow-up commit. Between each step there are minutes, hours, or days where the agent must preserve its full working state. Containers cannot reliably snapshot an individual container's memory, process trees, and filesystem, shut down compute, and restore exactly later. A container either burns compute to stay alive or loses the session on reschedule or timeout. Cognition's solution is full machine-state snapshotting at the hypervisor level, so a Devin session idles at zero compute cost and resumes bit-for-bit when a CI result or review comment arrives. Making this reliable across thousands of concurrent heterogeneous sessions, each with its own repos and runtimes, is described as the longest-running single piece of infrastructure they have built.
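The snapshot-at-async-gaps pattern can be sketched as a tiny session state machine. Every name here is hypothetical; Cognition has not published an API, and the real machine-state snapshot happens at the hypervisor, not in application code:

```python
from enum import Enum, auto

class State(Enum):
    RUNNING = auto()
    SNAPSHOTTED = auto()   # VM state on disk, zero compute cost

class AgentSession:
    """Toy model of an idle-at-zero-cost agent session (hypothetical API)."""
    def __init__(self, session_id):
        self.session_id = session_id
        self.state = State.RUNNING
        self.pending_wakeups = []

    def await_external(self, event):
        """Hitting an async gap (a CI run, a review) snapshots the whole VM."""
        self.pending_wakeups.append(event)
        self.state = State.SNAPSHOTTED   # hypervisor would dump memory + disk here
        return self.state

    def deliver(self, event):
        """A webhook (CI result, review comment) restores the VM bit-for-bit."""
        if event in self.pending_wakeups:
            self.pending_wakeups.remove(event)
            self.state = State.RUNNING   # hypervisor restore would happen here
        return self.state

s = AgentSession("devin-1234")
s.await_external("ci:build-789")   # session idles at zero compute cost
s.deliver("ci:build-789")          # resumed exactly where it left off
```

The hard part the post describes is not this state machine but making the snapshot/restore bit-for-bit reliable across thousands of concurrent heterogeneous VMs.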

The third wall is orchestration, governance, and integrations. Each is a multi-quarter infra project on its own. Orchestration demands per-session provisioning, correct routing, warm VM pools tuned to demand, and environments kept current as codebases change daily. Governance demands inheriting the dispatching engineer's permissions across every system the agent touches, with tamper-evident audit logs. Integrations demand connecting to CI, monitoring, package registries, documentation, and source control, each with its own authentication model. Stripe's internal MCP server reportedly carries over 400 tools, and that is the scale of ongoing investment this layer requires. The article closes with the pattern Cognition says they consistently see from teams attempting in-house builds: the combined surface area becomes untenable rather than any single component being blocking. The implicit pitch is that two years of hypervisor engineering, snapshot machinery, and integration work is why Devin exists as a product. The technical content is concrete enough that it reads as an architecture reference even if read independently of that pitch.

#4
7.5
I 7.0 Im 6.8 P 6.5

UniT, the top-voted paper on Hugging Face Daily Papers for April 24, tackles one of the core bottlenecks in humanoid foundation models: the kinematic mismatch that makes massive egocentric human video data hard to use for training humanoid policies. Humans and humanoid robots move in different coordinate frames with different joint counts and different contact dynamics, so cross-embodiment transfer has historically required either careful data curation, per-robot retargeting, or explicit physics priors. UniT proposes a unified physical language that represents both human and humanoid behavior in a shared token space, letting a single model learn policy, world modeling, and cross-embodiment retargeting end-to-end from a mix of human demonstrations and humanoid execution traces.

The architecture combines a tokenizer that maps joint-angle trajectories, end-effector poses, and egocentric video frames into a common discrete vocabulary, a transformer-based world model trained to predict next-token physical state, and a policy head conditioned on task language that emits actions in the humanoid's configuration space. Training mixes OpenX-style humanoid telemetry, egocentric human video datasets such as Ego4D and Epic-Kitchens, and a smaller curated set of paired human-humanoid demonstrations. The authors report that the shared tokenization recovers the majority of the benefit from paired data even when paired examples are scarce, which is the regime that matters for humanoid scaling.
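The shared "physical language" rests on mapping continuous signals into one discrete vocabulary. A deliberately minimal stand-in, uniform binning of joint angles, which illustrates the idea but is nothing like UniT's learned tokenizer:

```python
import math

def tokenize_angles(angles, vocab_size=256, lo=-math.pi, hi=math.pi):
    """Map continuous joint angles (radians) to discrete token ids by
    uniform binning, a toy stand-in for a learned VQ-style tokenizer."""
    span = hi - lo
    ids = []
    for a in angles:
        a = min(max(a, lo), hi)                                  # clamp to range
        ids.append(min(int((a - lo) / span * vocab_size), vocab_size - 1))
    return ids

def detokenize(ids, vocab_size=256, lo=-math.pi, hi=math.pi):
    """Decode token ids back to bin-center angles."""
    span = hi - lo
    return [lo + (i + 0.5) / vocab_size * span for i in ids]

traj = [0.0, 0.5, -1.2]
ids = tokenize_angles(traj)
recon = detokenize(ids)
# quantization error is at most half a bin width: pi/256, about 0.0123 rad
```

The point of a shared vocabulary is that human retargeted poses and humanoid joint states land in the same token space, so one transformer can model both; the learned version trades the uniform bins for codebook vectors trained jointly with the world model.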

Benchmark numbers span manipulation and locomotion. On a cross-embodiment manipulation suite the authors construct around kitchen and workshop tasks, UniT improves on the best human-to-humanoid baseline by several points on task success while cutting the retargeting-data requirement roughly in half. On locomotion and whole-body control, the world-model component can roll out plausible multi-second trajectories from a single observation, which the paper uses for both reward-free policy bootstrapping and for curriculum generation. Ablations isolate the contribution of the shared tokenizer from the sheer scale effect and show that a naive multimodal transformer without the physical-language tokenization underperforms by a noticeable margin.

The positioning relative to prior VLA stacks, RT-2 and OpenVLA for manipulation, and recent humanoid-specific work from Unitree and Agility is the main discussion point. UniT argues that a single token space for both modeling and acting is the right unification axis rather than bolting humanoid heads onto vision-language backbones. The open question the authors flag is how far the physical language generalizes beyond the two embodiments trained on; the paper includes a small zero-shot test on a different humanoid platform with promising but not conclusive numbers. If the token space survives more embodiments, it becomes a plausible ingredient in the next wave of humanoid foundation models, where the binding constraint has shifted from compute to robot-data throughput.

benchmark · vla · robot · humanoid · video-gen
#5

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Research 2026-04-23 HF ↑21 Hugging Face Daily Papers
6.4
I 5.2 Im 5.5 P 7.0

Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions (identical scenes, identical action sequences, and a unified control interface) needed to make those metrics comparable across models with heterogeneous inputs.

benchmark · vlm · video-gen
#6

An update on recent Claude Code quality reports

Frontier LLMs 2026-04-24 Simon Willison's Weblog
6.4
I 6.5 Im 6.0 P 5.5

It turns out the high volume of complaints that Claude Code was producing worse-quality results over the past two months was grounded in real problems. The models themselves were not to blame; rather, three separate issues in the Claude Code harness caused complex but material problems that directly affected users. Anthropic's postmortem describes these in detail.

#7

AEL: Agent Evolving Learning for Open-Ended Environments

Interpretability 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Wujiang Xu, Jiaojiao Han, Minghao Guo, Kai Mei, Xi Zhu…
6.0
I 6.7 Im 5.7 P 4.5

LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not what to remember but how to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce Agent Evolving Learning (AEL), a two-timescale framework that addresses this obstacle.

benchmark · agents
#8

National Australia Bank accelerates legacy migrations with Cursor

AI Coding 2026-04-23 Cursor Blog (Anysphere)
6.0
I 6.2 Im 5.8 P 5.0

National Australia Bank (NAB) standardized 6,000 developers on Cursor after evaluating Amazon Q and GitHub Copilot, with plans to scale toward 10,000+. Legacy modernizations — monolith-to-microservices refactors, migrations away from Assembly mainframes — are running 3× faster than expected. One merchant services team built a hardware-agnostic payment app in 3 weeks instead of the original 4-month scope. NAB cited model flexibility (engineers choose per-task), repository-wide context, and auto-rules that bake compliance requirements into agent behavior as the reasons they moved off Q/Copilot.

#9
5.8
I 5.5 Im 5.5 P 5.0

Anthropic and NEC announce strategic collaboration to deploy Claude to approximately 30,000 NEC Group employees worldwide. NEC becomes Anthropic's first Japan-based global partner and will jointly develop secure, domain-specific AI products for Japanese finance, manufacturing, local government, and cybersecurity customers. Claude Opus 4.7 and Claude Code will be integrated into NEC BluStellar Scenario. NEC will establish a Center of Excellence and extend Client Zero deployment of Claude Cowork internally.

#10
5.8
I 5.2 Im 5.0 P 5.7

Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths.

benchmark · diffusion · finetune · pretrain
#11

Quotient-Space Diffusion Models

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML)
Yixian Xu, Yusong Wang, Shengjie Luo, Kaiyuan Gao, Tianyu He…
5.7
I 5.0 Im 5.0 P 5.5

Diffusion-based generative models have transformed generative AI and enabled new capabilities in the science domain, for example generating 3D structures of molecules. Due to the intrinsic problem structure of certain tasks, there is often a symmetry in the system that identifies objects convertible by a group action as equivalent; the target distribution is therefore essentially defined on the quotient space with respect to the group.

diffusion · protein · molecular
#12
5.7
I 5.5 Im 5.5 P 5.0

Good morning. This week’s Stratechery Interview is with Google Cloud CEO Thomas Kurian. Kurian joined Google to lead the company’s cloud division in 2018; prior to that he was President of Product Development at Oracle, where he worked for 22 years. I previously spoke to Kurian in March 2021, April 2024, and April 2025. The occasion for these interviews, at least for the last three years, is Kurian’s annual keynote at Google Cloud Next.

#13
5.6
I 5.5 Im 5.5 P 4.5

Ai2's OlmoEarth Studio now exposes custom embedding exports from its open-source Earth-observation foundation models. Users choose area of interest, time range, encoder variant, resolution, and imagery sources (Sentinel-2 etc.) and receive Cloud-Optimized GeoTIFFs of compact embedding vectors suitable for similarity search, segmentation, and unsupervised exploration. Weights and code are public. Ai2 also shows a PCA+k-means clustering of 1.1M Sentinel-2 samples demonstrating that locations with similar surface characteristics land close in embedding space. Supervised fine-tuning is also supported for higher-performance downstream use.
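The PCA + k-means demonstration is straightforward to reproduce on any embedding matrix. A numpy-only sketch; the 1.1M-sample Sentinel-2 scale and the GeoTIFF export format are per the post, while the random stand-in vectors here are purely illustrative:

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

# stand-in for exported embedding vectors (e.g. one per GeoTIFF pixel):
# two synthetic "surface types" as well-separated Gaussian blobs
rng = np.random.default_rng(42)
emb = np.vstack([rng.normal(0, 1, (200, 64)), rng.normal(6, 1, (200, 64))])
Z = pca(emb, 2)            # 64-D embeddings -> 2-D for visualization
labels = kmeans(Z, 2)      # similar surface types land in the same cluster
```

On real exports the same two steps recover the post's observation that locations with similar surface characteristics land close together in embedding space.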

#14

Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation

Reinforcement Learning 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.RO (Robotics) · arXiv cs.LG (Machine Learning)
Yi-Ling Liu, Melvin Laux, Mariela De Lucas Alvarez, Frank Kirchner, Rebecca Adam
5.5
I 4.0 Im 5.9 P 5.5

Autonomous underwater vehicles are required to perform multiple tasks adaptively and in an explainable manner under dynamic, uncertain conditions and limited sensing, challenges that classical controllers struggle to address. This demands robust, generalizable, and inherently interpretable control policies for reliable long-term monitoring. Reinforcement learning, particularly multi-task RL, overcomes these limitations by leveraging shared representations to enable efficient adaptation across tasks and environments.

rl · agents · pretrain
#15

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Interpretability 2026-04-23 arXiv cs.CL (Computation & Language)
Srija Anand, Ashwin Sankar, Ishvinder Sethi, Aaditya Pareek, Kartik Rajput…
5.5
I 6.0 Im 5.9 P 3.5

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1,900 native raters.
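Turning 120K pairwise comparisons into a system ranking is conventionally done with a Bradley-Terry model. A minimal fit via the standard minorization-maximization update, on toy data; the paper's exact aggregation may differ:

```python
def bradley_terry(wins, n_iter=200):
    """wins[(a, b)] = number of times system a beat system b.
    Returns strength scores normalized to sum to 1 (MM algorithm)."""
    systems = sorted({s for pair in wins for s in pair})
    p = {s: 1.0 for s in systems}
    for _ in range(n_iter):
        new = {}
        for s in systems:
            num = sum(w for (a, b), w in wins.items() if a == s)   # total wins of s
            den = sum(                                             # games vs t, weighted
                (wins.get((s, t), 0) + wins.get((t, s), 0)) / (p[s] + p[t])
                for t in systems if t != s
            )
            new[s] = num / den if den else p[s]
        total = sum(new.values())
        p = {s: v / total for s, v in new.items()}                 # renormalize
    return p

# toy data: A beats B 8/10, B beats C 7/10
scores = bradley_terry({("A", "B"): 8, ("B", "A"): 2, ("B", "C"): 7, ("C", "B"): 3})
# recovers the expected ordering A > B > C despite B playing more games
```

The same fit extends naturally to the paper's per-dimension setting by running it separately for each perceptual axis.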

#16
5.5
I 5.0 Im 5.3 P 5.0

Today, we check in a year after the first Unsupervised Learning x Latent Space Crossover special to discuss everything that has changed (there is a lot) in the world of AI. This episode was recorded just after AIE Europe, but before the Cursor-xAI deal. Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what’s real today, what will be real in the future, and what it means for businesses and the world, helping builders, researchers and founders deconstruct and understand the biggest breakthroughs.

#17

Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics

Reinforcement Learning 2026-04-23 arXiv cs.RO (Robotics) · arXiv cs.LG (Machine Learning)
Heng Yang
5.4
I 6.2 Im 4.0 P 4.5

We propose a sampling-based framework for finite-horizon trajectory and policy optimization under differentiable dynamics by casting controller design as inference. Specifically, we minimize a KL-regularized expected trajectory cost, which yields an optimal "Boltzmann-tilted" distribution over controller parameters that concentrates on low-cost solutions as temperature decreases.
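The "Boltzmann-tilted" target is a distribution proportional to exp(-cost(theta)/T), which concentrates on low-cost parameters as T shrinks. A minimal importance-weighted resampling sketch on a toy quadratic cost; the paper's sequential Monte Carlo machinery with differentiable dynamics is far richer:

```python
import math, random

def tempered_resample(thetas, cost, temp):
    """Resample controller parameters in proportion to exp(-cost/T):
    the Boltzmann tilt shifts mass onto low-cost parameters."""
    costs = [cost(t) for t in thetas]
    m = min(costs)                                   # stabilize the exponent
    w = [math.exp(-(c - m) / temp) for c in costs]
    z = sum(w)
    return random.choices(thetas, weights=[x / z for x in w], k=len(thetas))

random.seed(0)
cost = lambda th: (th - 2.0) ** 2                    # toy cost, optimum at theta = 2
pop = [random.uniform(-5, 5) for _ in range(500)]
for temp in (5.0, 1.0, 0.2, 0.05):                   # decreasing-temperature schedule
    pop = tempered_resample(pop, cost, temp)
mean = sum(pop) / len(pop)                           # concentrates near 2.0
```

The annealing schedule is what "tempered" refers to: early high temperatures keep the population diverse, later low temperatures drive it toward the low-cost region.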

benchmark · multimodal
#18
5.4
I 5.5 Im 5.2 P 4.0

While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic multi-page websites remain highly challenging. Existing works are often limited to single-page static websites, while agentic frameworks typically rely on multi-turn execution with proprietary models, leading to substantial token costs, high latency, and brittle integration. Training a small LLM end-to-end with reinforcement learning (RL) is a promising alternative, yet it faces a critical bottleneck in designing reliable and computationally feasible rewards for website generation.

rl · multimodal · agents · codegen · coding
#19
5.4
I 6.2 Im 4.7 P 3.7

Electroencephalography (EEG) foundation models have shown strong potential for learning generalizable representations from large-scale neural data, yet their clinical deployment is hindered by distribution shifts across clinical settings, devices, and populations. Test-time adaptation (TTA) offers a promising solution by enabling models to adapt to unlabeled target data during inference without access to source data, a valuable property in healthcare settings constrained by privacy regulations and limited labeled data. However, its effectiveness for EEG remains largely underexplored.

benchmark · pretrain
#20

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Naheed Rayhan, Sohely Jahan
5.3
I 5.5 Im 5.7 P 3.5

Large language models (LLMs) are increasingly integrated into sensitive workflows, raising the stakes for adversarial robustness and safety. This paper introduces Transient Turn Injection (TTI), a new multi-turn attack technique that systematically exploits stateless moderation by distributing adversarial intent across isolated interactions. TTI leverages automated attacker agents powered by large language models to iteratively test and evade policy enforcement in both commercial and open-source LLMs, marking a departure from conventional jailbreak approaches that typically depend on maintaining persistent conversational context.

agents
#21

Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Guangkai Xu, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun…
5.3
I 6.2 Im 5.2 P 3.5

Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy.

benchmark
#22

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Multimodal 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision)
Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson…
5.2
I 4.7 Im 4.0 P 5.5

Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations.

benchmark · dpo · vlm · finetune
#23

GiVA: Gradient-Informed Bases for Vector-Based Adaptation

Evaluations & Benchmarks 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Neeraj Gangwar, Rishabh Deshmukh, Michael Shavlovsky, Hancao Li, Vivek Mittal…
5.2
I 5.7 Im 4.0 P 4.5

As model sizes continue to grow, parameter-efficient fine-tuning has emerged as a powerful alternative to full fine-tuning. While LoRA is widely adopted among these methods, recent research has explored vector-based adaptation methods due to their extreme parameter efficiency. However, these methods typically require substantially higher ranks than LoRA to match its performance, leading to increased training costs. This work introduces GiVA, a gradient-based initialization strategy for vector-based adaptation. It achieves training times comparable to LoRA and maintains the extreme parameter efficiency of vector-based adaptation.

benchmark · finetune
#24

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

Evaluations & Benchmarks 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning)
Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di…
5.2
I 5.0 Im 4.0 P 5.5

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery.

benchmark
#25

Replay-buffer engineering for noise-robust quantum circuit optimization

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning)
Akash Kundu, Sebastian Feld
5.2
I 5.2 Im 4.7 P 4.5

Deep reinforcement learning (RL) for quantum circuit optimization faces three fundamental bottlenecks: replay buffers that ignore the reliability of temporal-difference (TD) targets, curriculum-based architecture search that triggers a full quantum-classical evaluation at every environment step, and the routine discard of noiseless trajectories when retraining under hardware noise. We address all three by treating the replay buffer as a primary algorithmic lever for quantum optimization.
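"Replay buffers that ignore TD-target reliability" contrasts with prioritized replay, where sampling probability follows a per-transition weight. A generic reliability-weighted buffer sketch; this is the standard prioritized-replay idea, not the paper's specific scheme, and the reliability values are illustrative:

```python
import random

class ReliabilityBuffer:
    """Replay buffer that samples transitions in proportion to a
    reliability weight (e.g. inverse TD-target variance): a generic
    prioritized-replay sketch, not the paper's exact mechanism."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items, self.weights = [], []

    def add(self, transition, reliability):
        if len(self.items) == self.capacity:        # evict the oldest entry
            self.items.pop(0)
            self.weights.pop(0)
        self.items.append(transition)
        self.weights.append(max(reliability, 1e-8))

    def sample(self, k):
        return random.choices(self.items, weights=self.weights, k=k)

random.seed(1)
buf = ReliabilityBuffer(capacity=100)
buf.add("noisy-trajectory", reliability=0.1)
buf.add("noiseless-trajectory", reliability=5.0)   # kept, not discarded
batch = buf.sample(1000)
# the high-reliability transition dominates the sampled batch
```

The third bottleneck in the abstract, discarding noiseless trajectories when retraining under noise, corresponds here to simply keeping them in the buffer with a high reliability weight.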

benchmark · rl · molecular · pretrain
#26

Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Agents & Tool Use 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan, Peng Kuang…
5.2
I 4.0 Im 5.7 P 4.5

Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems.

benchmark · agents · codegen · coding
#27

Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv cs.NE (Neural & Evolutionary Computing)
Eylon E. Krause
5.2
I 5.5 Im 3.5 P 5.5

The choice of activation function plays a crucial role in the optimization and performance of deep neural networks. While the Rectified Linear Unit (ReLU) remains the dominant choice due to its simplicity and effectiveness, its lack of smoothness may hinder gradient-based optimization in deep architectures. In this work we propose a family of $C^{2N}$-smooth activation functions whose gate follows a log-logistic CDF, achieving ReLU-like performance with purely rational arithmetic.

transformer
#28

Fixation Sequences as Time Series: A Topological Approach to Dyslexia Detection

Interpretability 2026-04-23 arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning)
Marius Huber, David R. Reich, Lena A. Jäger
5.2
I 5.0 Im 4.7 P 4.5

Persistent homology, a method from topological data analysis, extracts robust, multi-scale features from data. It produces stable representations of time series by applying varying thresholds to their values (a process known as a filtration). We develop novel filtrations for time series and introduce topological methods for the analysis of eye-tracking data, by interpreting fixation sequences as time series, and constructing "hybrid models" that combine topological features with traditional statistical features.
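For intuition, the 0-dimensional sublevel-set persistence of a time series can be computed with a short union-find sweep. This is a generic sketch of the standard elder-rule algorithm, not the paper's novel filtrations:

```python
def sublevel_persistence(values):
    # 0-dimensional sublevel-set filtration of a 1D time series: sweep
    # a threshold upward; each local minimum births a component, and
    # when two components merge, the younger one dies (elder rule).
    # Returns (birth, death) pairs; zero-persistence pairs are skipped.
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    parent = [None] * n          # None: point not yet below threshold
    birth = {}
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                if birth[ri] > birth[rj]:
                    ri, rj = rj, ri          # rj is the younger root
                if birth[rj] < values[i]:
                    pairs.append((birth[rj], values[i]))
                parent[rj] = ri
    pairs.append((min(values), float("inf")))  # global min never dies
    return sorted(pairs)
```

The resulting (birth, death) pairs are exactly the stable, threshold-based features that such methods feed into downstream classifiers.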

#29

Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning

Research 2026-04-23 arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning)
Hanjun Cho, Gahyun Yoo, Hanseong Kim, Jay-Yoon Lee
5.2
I 5.5 Im 4.2 P 4.5

Numerical reasoning over expert-domain tables often exhibits high in-domain accuracy but limited robustness to domain shift. Models trained with supervised fine-tuning (SFT) on specific datasets tend to rely on header-operation shortcuts rather than structural reasoning. We introduce TaNOS, a continual pre-training framework comprising three components: (i) header anonymization to reduce lexical memorization, (ii) operation sketches that provide minimal structural cues, and (iii) self-supervised pretraining that constructs correctness-guaranteed program-question pairs from given tables in a program-first manner.
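Component (i) can be illustrated in a few lines; the function name and interface below are hypothetical, not TaNOS's actual code:

```python
def anonymize_headers(header, rows):
    # Sketch of header anonymization: replacing column names with
    # positional placeholders removes the lexical cue behind
    # header-operation shortcuts, so a model must reason over table
    # structure rather than memorized header words.
    mapping = {h: f"col{i}" for i, h in enumerate(header)}
    return [mapping[h] for h in header], rows, mapping
```

The returned mapping lets the pipeline translate predicted programs back to the original column names after reasoning.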

finetune · pretrain
#30

Towards Universal Tabular Embeddings: A Benchmark Across Data Tasks

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Liane Vogel, Kavitha Srinivas, Niharika D'Souza, Sola Shirai, Oktie Hassanzadeh…
5.2
I 6.2 Im 4.7 P 3.5

Tabular foundation models aim to learn universal representations of tabular data that transfer across tasks and domains, enabling applications such as table retrieval, semantic search and table-based prediction. Despite the growing number of such models, it remains unclear which approach works best in practice, as existing methods are often evaluated under task-specific settings that make direct comparison difficult. To address this, we introduce TEmBed, the Tabular Embedding Test Bed, a comprehensive benchmark for systematically evaluating tabular embeddings across four representation levels: cell, row, column, and table.

benchmark
#31

There Will Be a Scientific Theory of Deep Learning

Research 2026-04-23 arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML) · arXiv — Mechanistic Interpretability
Jamie Simon, Daniel Kunin, Alexander Atanasov, Enric Boix-Adserà, Blake Bordelon…
5.2
I 4.0 Im 4.7 P 5.5

In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks.

mech-interp
#33
5.2
I 5.0 Im 5.4 P 3.7

Learning robust representations of authorial style is crucial for authorship attribution and AI-generated text detection. However, existing methods often struggle with content-style entanglement, where models learn spurious correlations between authors' writing styles and topics, leading to poor generalization across domains. To address this challenge, we propose Explainable Authorship Variational Autoencoder (EAVAE), a novel framework that explicitly disentangles style from content through architectural separation-by-design.

pretrain
#34
5.2
I 5.0 Im 5.0 P 4.3

Today, we check in a year after the first Unsupervised Learning x Latent Space Crossover special to discuss everything that has changed (there is a lot) in the world of AI. This episode was recorded just after AIE Europe, but before the Cursor-xAI deal. Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what's real today, what will be real in the future, and what it means for businesses and the world, helping builders, researchers and founders deconstruct and understand the biggest breakthroughs.

#35

TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Haolin Zhang, William Reber, Yuxuan Zhang, Guofei Gu, Jeff Huang
5.1
I 5.0 Im 5.7 P 3.5

Modern phishing campaigns increasingly evade snapshot-based URL classifiers using interaction gates (e.g., checkbox/slider challenges), delayed content rendering, and logo-less credential harvesters. This shifts URL triage from static classification toward an interactive forensics task: an analyst must actively navigate the page while isolating themselves from potential runtime exploits. We present TraceScope, a decoupled triage pipeline that operationalizes this workflow at scale.

agents
#36

MISTY: High-Throughput Motion Planning via Mixer-based Single-step Drifting

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Yining Xing, Zehong Ke, Yiqian Tu, Zhiyuan Liu, Wenhao Yu…
5.1
I 6.2 Im 4.3 P 3.5

Multi-modal trajectory generation is essential for safe autonomous driving, yet existing diffusion-based planners suffer from high inference latency due to iterative neural function evaluations. This paper presents MISTY (Mixer-based Inference for Single-step Trajectory-drifting Yield), a high-throughput generative motion planner that achieves state-of-the-art closed-loop performance with pure single-step inference. MISTY integrates a vectorized Sub-Graph encoder to capture environment context, a Variational Autoencoder to structure expert trajectories into a compact 32-dimensional latent manifold, and an ultra-lightweight MLP-Mixer decoder to eliminate quadratic attention complexity.

benchmark · diffusion
#37

A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents

Evaluations & Benchmarks 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Praval Sharma
5.0
I 5.0 Im 4.0 P 4.5

Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities.

multimodal
#38

Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv — Mechanistic Interpretability
Minji Jung, Minjae Lee, Yejin Kim, Sarang Choi, Minsuk Kahng
5.0
I 5.2 Im 4.0 P 4.5

LLM leaderboards are widely used to compare models and guide deployment decisions. However, leaderboard rankings are shaped by evaluation priorities set by benchmark designers, rather than by the diverse goals and constraints of actual users and organizations. A single aggregate score often obscures how models behave across different prompt types and compositions. In this work, we conduct an in-depth analysis of the dataset used in the LMArena (formerly Chatbot Arena) benchmark and investigate this evaluation challenge by designing an interactive visualization interface as a design probe.
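The paper's point about aggregate scores can be made concrete with a toy re-ranking under user-defined category weights (a generic illustration, not the authors' interface):

```python
def rerank(scores, weights):
    # User-defined leaderboard aggregation: `scores` maps model ->
    # {category: score}; `weights` encodes one user's priorities.
    # A single fixed aggregate hides exactly the trade-offs that
    # different weightings expose.
    total = sum(weights.values())
    agg = {m: sum(weights[c] * s[c] for c in weights) / total
           for m, s in scores.items()}
    return sorted(agg, key=agg.get, reverse=True)
```

Two users with different priorities can legitimately get opposite rankings from the same underlying per-category data, which is the behavior a single leaderboard number cannot express.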

benchmark
#39

StructMem: Structured Memory for Long-Horizon Behavior in LLMs

Agents & Tool Use 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning)
Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao…
5.0
I 4.0 Im 4.0 P 5.5

Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections.
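As a rough sketch of what event-level bindings with induced cross-event connections might look like (the schema is entirely hypothetical, not StructMem's algorithm):

```python
from dataclasses import dataclass, field

@dataclass(eq=False)   # identity-based equality avoids recursive compares
class Event:
    # An event binds participants and time together, rather than
    # storing an isolated fact.
    text: str
    entities: frozenset
    time: int
    links: list = field(default_factory=list)

class EventMemory:
    # Cross-event connections are induced cheaply via shared entities,
    # avoiding expensive full graph construction up front.
    def __init__(self):
        self.events = []

    def add(self, text, entities, time):
        ev = Event(text, frozenset(entities), time)
        for old in self.events:
            if old.entities & ev.entities:    # shared participant
                old.links.append(ev)
                ev.links.append(old)
        self.events.append(ev)
        return ev
```

Multi-hop questions then become link traversals over events, while unrelated events stay unconnected.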

agents
#40

GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Qizhuo Xie, Yunhui Liu, Yu Xing, Qianzi Hou, Xudong Jin…
5.0
I 5.5 Im 3.5 P 4.5

Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities.

quantization
#41

Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

Frontier LLMs 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Hao-Yuan Chen
5.0
I 5.5 Im 3.5 P 4.5

Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results.
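The generate-critique-refine loop is simple to sketch; the callables below stand in for LLM calls, and the interface is an assumption rather than the paper's API:

```python
def vps_solve(problem, student, supervisor, R=3):
    # Sketch of a verbal-process-supervision loop: `student` and
    # `supervisor` stand in for LLM calls. The supervisor returns
    # (accept, critique); we refine for at most R rounds, so the
    # scheme is training-free and bounded by the round budget.
    answer = student(problem, critique=None)
    for _ in range(R):
        accept, critique = supervisor(problem, answer)
        if accept:
            break
        answer = student(problem, critique=critique)
    return answer
```

The granularity of the supervisor's natural-language critique, rather than chain depth or sample count, is the axis being scaled.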

#42

On the Role of Preprocessing and Memristor Dynamics in Reservoir Computing for Image Classification

Recurrent & Linear Attention 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv cs.NE (Neural & Evolutionary Computing)
Rishona Daniels, Duna Wattad, Ronny Ronen, David Saad, Shahar Kvatinsky
5.0
I 4.0 Im 4.0 P 5.5

Reservoir computing (RC) is an emerging recurrent neural network architecture that has attracted growing attention for its low training cost and modest hardware requirements. Memristor-based circuits are particularly promising for RC, as their intrinsic dynamics can reduce network size and parameter overhead in tasks such as time-series prediction and image recognition. Although RC has been demonstrated with several memristive devices, a comprehensive evaluation of device-level requirements remains limited.

quantization
#43

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Interpretability 2026-04-23 arXiv cs.CL (Computation & Language)
Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek…
5.0
I 5.0 Im 5.2 P 3.5

Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors.
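The WER criticism is easy to demonstrate: word-level edit distance charges a meaning-flipping substitution and a harmless misspelling identically.

```python
def wer(ref, hyp):
    # Word error rate: word-level Levenshtein distance divided by the
    # number of reference words. Purely lexical, with no notion of
    # semantic damage.
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,              # deletion
                       d[j - 1] + 1,          # insertion
                       prev + (rw != hw))     # substitution / match
            prev = cur
    return d[len(h)] / len(r)
```

Against the reference "do not turn left", the hypotheses "do not turn right" and "do nott turn left" both score a WER of 0.25, which is exactly the gap semantic and LLM-based metrics aim to close.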

#44

MathDuels: Evaluating LLMs as Problem Posers and Solvers

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language)
Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik
5.0
I 6.2 Im 4.0 P 3.5

As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions.

benchmark
#45

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

Agents & Tool Use 2026-04-23 arXiv cs.CL (Computation & Language)
Xinyu Zhang, Boxuan Zhang, Yuchen Wan, Lingling Zhang, YiXing Yao…
5.0
I 5.2 Im 5.2 P 3.5

While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard.

benchmark · agents
#46

Locating acts of mechanistic reasoning in student team conversations with mechanistic machine learning

Research 2026-04-23 arXiv cs.LG (Machine Learning) · arXiv — Mechanistic Interpretability
Kaitlin Gili, Mainak Nistala, Kristen Wendell, Michael C. Hughes
5.0
I 4.5 Im 4.7 P 4.5

STEM education researchers are often interested in identifying moments of students' mechanistic reasoning for deeper analysis, but have limited capacity to search through many team conversation transcripts to find segments with a high concentration of such reasoning. We offer a solution in the form of an interpretable machine learning model that outputs time-varying probabilities that individual students are engaging in acts of mechanistic reasoning, leveraging evidence from their own utterances as well as contributions from the rest of the group.

mech-interp
#47

Interpretable facial dynamics as behavioral and perceptual traces of deepfakes

Multimodal 2026-04-23 arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision)
Timothy Joseph Murphy, Jennifer Cook, Hélio Clemente José Cuve
5.0
I 4.0 Im 5.2 P 4.5

Deepfake detection research has largely converged on deep learning approaches that, despite strong benchmark performance, offer limited insight into what distinguishes real from manipulated facial behavior. This study presents an interpretable alternative grounded in bio-behavioral features of facial dynamics and evaluates how computational detection strategies relate to human perceptual judgments. We identify core low-dimensional patterns of facial movement, from which temporal features characterizing spatiotemporal structure were derived.

benchmark
#48

Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms

Efficiency 2026-04-23 arXiv cs.LG (Machine Learning)
Jiyan Song, Wenyang Wang, Chengcheng Yan, Zhiquan Han, Feifei Zhao
5.0
I 5.2 Im 5.2 P 3.5

In the treatment of complex diseases, treatment regimens using a single drug often yield limited efficacy and can lead to drug resistance. In contrast, combination drug therapies can significantly improve therapeutic outcomes through synergistic effects. However, experimentally validating all possible drug combinations is prohibitively expensive, underscoring the critical need for efficient computational prediction methods. Although existing approaches based on deep learning and graph neural networks (GNNs) have made considerable progress, challenges remain in reducing structural bias, improving generalization capability, and enhancing model interpretability.

benchmark · molecular
#49

UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Yanran Zhang, Wenzhao Zheng, Yifei Li, Bingyao Yu, Yu Zheng…
5.0
I 5.5 Im 5.0 P 3.5

In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges.

multimodal · finetune
#50

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao…
5.0
I 5.2 Im 5.5 P 3.5

Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs.

benchmark · vlm · video-gen
#51

Sapiens2

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Rawal Khirodkar, He Wen, Julieta Martinez, Yuan Dong, Su Zhaoen…
5.0
I 6.0 Im 4.7 P 3.5

We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives.

transformer · pretrain
#52
5.0
I 5.0 Im 4.0 P 4.7

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery.

benchmark
#53

Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

Multimodal 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision)
Bowen Liu, Li Yang, Shanshan Song, Mingyu Tang, Zhifang Gao…
4.9
I 5.5 Im 3.5 P 4.5

Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames.

#54

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Jiali Wei, Ming Fan, Guoheng Sun, Xicheng Zhang, Haijun Wang…
4.9
I 4.0 Im 4.7 P 4.5

The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline.

finetune
#55

Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales

Post-Training 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Olufunke O. Sarumi, Charles Welch, Daniel Braun
4.9
I 4.0 Im 4.7 P 4.5

Beyond exploring disaggregated labels for modeling perspectives, annotator rationales provide fine-grained signals of individual perspectives. In this work, we propose a framework for jointly modeling annotator-specific label prediction and corresponding explanations, fine-tuned on the annotators' provided rationales. Using a dataset with disaggregated natural language inference (NLI) annotations and annotator-provided explanations, we condition predictions on both annotator identity and demographic metadata through a representation-level User Passport mechanism.

finetune
#56

Misinformation Span Detection in Videos via Audio Transcripts

Interpretability 2026-04-23 arXiv cs.CL (Computation & Language)
Breno Matos, Rennan C. Lima, Savvas Zannettou, Fabricio Benevenuto, Rodrygo L. T. Santos
4.9
I 5.0 Im 4.7 P 3.5

Online misinformation has become one of the most challenging issues of recent years, yielding severe consequences, including political polarization, attacks on democracy, and public health risks. Misinformation manifests in any platform with a large user base, including online social networks and messaging apps. It permeates all media and content forms, including images, text, audio, and video. Distinctly, video-based misinformation represents a multifaceted challenge for fact-checkers, given the ease with which individuals can record and upload videos on various video-sharing platforms.

#57

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language) · arXiv — Mechanistic Interpretability
Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar, Liam Dorn, Ahmed Haj Ahmed…
4.9
I 4.7 Im 4.0 P 4.5

Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition.

benchmark
#58

AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use

Agents & Tool Use 2026-04-23 arXiv cs.CL (Computation & Language)
Yuanjie Lyu, Chengyu Wang, Haonan Zheng, Yuanhao Yue, Junbing Yan…
4.9
I 5.2 Im 4.5 P 3.5

Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks.

benchmark · rl · agents · tool-use
#59
4.9
I 5.0 Im 4.7 P 3.5

Since software performance requirements are documented in natural language, quantifying them into mathematical forms is essential for software engineering. Yet, the vagueness of performance requirements and the uncertainty of human cognition make their interpretation highly ambiguous, rendering automated quantification an unaddressed and challenging problem. In this paper, we formalize the problem and propose IRAP, an approach that quantifies performance requirements into mathematical functions via interactive retrieval-augmented preference elicitation.

#60

PrismaDV: Automated Task-Aware Data Unit Test Generation

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Hao Chen, Arnab Phani, Sebastian Schelter
4.9
I 5.7 Im 4.0 P 3.5

Data is a central resource for modern enterprises, and data validation is essential for ensuring the reliability of downstream applications. However, existing automated data unit testing frameworks are largely task-agnostic: they validate datasets without considering the semantics and requirements of the code that consumes the data. We present PrismaDV, a compound AI system that analyzes downstream task code together with dataset profiles to identify data access patterns, infer implicit data assumptions, and generate task-aware executable data unit tests.

benchmark
#61

Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

Multimodal 2026-04-23 arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision)
Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He
4.9
I 4.0 Im 4.7 P 4.5

Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection.

benchmark · vlm · pretrain
#62

Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Hao-Yu Hsu, Tianhang Cheng, Jing Wen, Alexander G. Schwing, Shenlong Wang
4.9
I 5.5 Im 4.7 P 3.5

Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics.

#63

StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

Research 2026-04-23 arXiv cs.CV (Computer Vision)
Kwan Yun, Changmin Lee, Ayeong Jeong, Youngseo Kim, Seungmi Lee…
4.9
I 5.2 Im 5.0 P 3.5

Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths.

benchmark · diffusion · finetune · pretrain
#64

Encoder-Free Human Motion Understanding via Structured Motion Descriptions

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Yao Zhang, Zhuchenyang Liu, Thomas Ploetz, Yu Xiao
4.9
I 5.0 Im 5.4 P 3.5

The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, remaining constrained by cross-modal representation and alignment.

pretrain
#65

Addressing Image Authenticity When Cameras Use Generative AI

Generative Media 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision)
Umar Masud, Abhijith Punnappurath, Luxi Zhao, David B. Lindell, Michael S. Brown
4.8
I 4.0 Im 4.7 P 4.5

The ability of generative AI (GenAI) methods to photorealistically alter camera images has raised awareness about the authenticity of images shared online. Interestingly, images captured directly by our cameras are considered authentic and faithful. However, with the increasing integration of deep-learning modules into cameras' capture-time hardware -- namely, the image signal processor (ISP) -- there is now a potential for hallucinated content in images directly output by our cameras.

#66

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

Agents & Tool Use 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Anuj Sadani, Deepak Kumar
4.8
I 5.2 Im 4.5 P 3.5

The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the "MCP Tax" or "Tools Tax") that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost.
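A minimal sketch of dynamic tool gating with lazy schema loading, under the assumption of a cheap lexical relevance score; the paper's actual gating mechanism and data shapes may differ:

```python
def gate_tools(query, registry, k=3):
    # Dynamic tool gating sketch (all names hypothetical): rather than
    # eagerly injecting every tool schema each turn, score tools
    # against the query with a cheap lexical overlap and load full
    # schemas only for the top-k; the rest stay as one-line stubs that
    # can be expanded on demand.
    qwords = set(query.lower().split())

    def score(tool):
        return len(qwords & set(tool["description"].lower().split()))

    ranked = sorted(registry, key=score, reverse=True)
    loaded = [t["schema"] for t in ranked[:k]]
    stubs = [t["name"] for t in ranked[k:]]
    return loaded, stubs
```

Only the loaded schemas count against the per-turn context budget; the stubs cost a few tokens each instead of a full JSON schema.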

benchmark · agents · long-context
#67

Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach

Efficiency 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning)
Eli Gildish, Michael Grebshtein, Igor Makienko
4.8
I 5.0 Im 3.5 P 4.5

Denoising of periodic signals and accurate waveform estimation are core tasks across many signal processing domains, including speech, music, medical diagnostics, radio, and sonar. Although deep learning methods have recently shown performance improvements over classical approaches, they require substantial computational resources and are usually trained separately for each signal observation. This study proposes a computationally efficient method based on dilated convolutional neural networks (DCNNs) and re-sampling, termed R-DCNN, designed for operation under strict power and resource constraints. The approach targets signals with varying fundamental frequencies and requires only a single observation for training.

#68

CoFEE: Reasoning Control for LLM-Based Feature Discovery

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning)
Maximilian Westermann, Ben Griffin, Aaron Ontoyin Yin, Zakari Salifu, Yagiz Ihlamur…
4.8
I 4.5 Im 4.0 P 4.5

Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. Ever-improving large language models (LLMs) are well suited to this task, as they can process large amounts of information, but unconstrained feature generation can lead to weak features; our approach provides a structured method for addressing this challenge. In this work, we study reasoning control in LLMs by inducing cognitive behaviors that improve feature discovery.

#69

A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Milan De Koning, Ali Asgari, Pouria Derakhshanfar, Annibale Panichella
4.8
I 5.0 Im 4.7 P 3.5

LLM-based automated program repair (APR) techniques have shown promising results in reducing debugging costs. However, prior results can be affected by data leakage: large language models (LLMs) may memorize bug fixes when evaluation benchmarks overlap with their pretraining data, leading to inflated performance estimates. In this paper, we investigate whether we can better reveal data leakage by combining metamorphic testing (MT) with negative log-likelihood (NLL), which has been used in prior work as a proxy for memorization.

benchmarkpretrain
#70

Measuring Opinion Bias and Sycophancy via LLM-based Coercion

Agents & Tool Use 2026-04-23 arXiv cs.CL (Computation & Language)arXiv — Mechanistic Interpretability
Rodrigo Nogueira, Giovana Kerche Bonás, Thales Sales Almeida, Andrea Roque, Ramon Pires…
4.8
I 4.5 Im 4.0 P 4.5

Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users' decisions. Eliciting a model's positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side.

agents
#71

Task-Driven Co-Design of Heterogeneous Multi-Robot Systems

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Maximilian Stralz, Meshal Alharbi, Yujun Huang, Gioele Zardini
4.8
I 4.5 Im 5.2 P 3.5

Designing multi-agent robotic systems requires reasoning across tightly coupled decisions spanning heterogeneous domains, including robot design, fleet composition, and planning. Much effort has been devoted to isolated improvements in these domains, whereas system-level co-design considering trade-offs and task requirements remains underexplored. In this work, we present a formal and compositional framework for the task-driven co-design of heterogeneous multi-robot systems.

robotagents
#72

Beyond Expected Information Gain: Stable Bayesian Optimal Experimental Design with Integral Probability Metrics and Plug-and-Play Extensions

Research 2026-04-23 arXiv cs.LG (Machine Learning)arXiv stat.ML (Statistical ML)
Di Wu, Ling Liang, Haizhao Yang
4.8
I 4.5 Im 4.0 P 4.5

Bayesian Optimal Experimental Design (BOED) provides a rigorous framework for decision-making tasks in which data acquisition is often the critical bottleneck, especially in resource-constrained settings. Traditionally, BOED typically selects designs by maximizing expected information gain (EIG), commonly defined through the Kullback-Leibler (KL) divergence. However, classical evaluation of EIG often involves challenging nested expectations, and even advanced variational methods leave the underlying log-density-ratio objective unchanged. As a result, support mismatch, tail underestimation, and rare-event sensitivity remain intrinsic concerns for KL-based BOED.

#73

A-IC3: Learning-Guided Adaptive Inductive Generalization for Hardware Model Checking

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Xiaofeng Zhou, Guangyu Hu, Hongce Zhang, Wei Zhang
4.8
I 5.0 Im 4.5 P 3.5

The IC3 algorithm represents the state-of-the-art (SOTA) hardware model checking technique, owing to its robust performance and scalability. A significant body of research has focused on enhancing the solving efficiency of the IC3 algorithm, with particular attention to the inductive generalization process: a critical phase wherein the algorithm seeks to generalize a counterexample to inductiveness (CTI), which typically is a state leading to a bad state, into a broader set of states.

benchmarkagents
#74

Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar
4.8
I 5.0 Im 5.0 P 3.5

Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views.

diffusiontransformer
#75

Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Yingkai Yang, Chaoqi Chen, Hui Huang
4.8
I 5.7 Im 4.0 P 3.5

Test-Time Adaptation (TTA) aims to mitigate distributional shifts between training and test domains during inference time. However, existing TTA methods fall short in the realistic scenario where models face both continually changing domains and the simultaneous emergence of unknown semantic classes, a challenging setting we term Open-set Continual Test-Time Adaptation (OCTTA). The coupling of domain and semantic shifts often collapses the feature space, severely degrading both classification and out-of-distribution detection.

benchmark
#76

Seeing Fast and Slow: Learning the Flow of Time in Videos

Research 2026-04-23 HF ↑10 Hugging Face Daily Papers
4.8
I 4.0 Im 3.8 P 5.2

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos.

multimodalvideo-gen
#77

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

Research 2026-04-22 HF ↑5 Hugging Face Daily Papers
4.8
I 4.0 Im 4.5 P 4.3

Long-horizon interactive environments are a testbed for evaluating agents' skill-usage abilities. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision-making under delayed rewards and partial observability; games exhibit all of these properties. Large Language Models (LLMs) offer a promising alternative as game-playing agents, but they often struggle with consistent long-horizon decision-making because they lack a mechanism to discover, retain, and reuse structured skills across episodes.

benchmarkagents
#78

Large Language Models (LLMs) have demonstrated remarkable fluency and versatility across a wide range of NLP tasks, yet they remain prone to factual inaccuracies and hallucinations. This limitation poses significant risks in high-stakes domains such as healthcare, law, and scientific communication, where trust and verifiability are paramount. In this paper, we introduce DAVinCI - a Dual Attribution and Verification framework designed to enhance the factual reliability and interpretability of LLM outputs.

#79

Promoting Simple Agents: Ensemble Methods for Event-Log Prediction

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)
Benedikt Bollig, Matthias Függer, Thomas Nowak, Paul Zeinaty
4.7
I 4.0 Im 4.0 P 4.5

We compare lightweight automata-based models (n-grams) with neural architectures (LSTM, Transformer) for next-activity prediction in streaming event logs. Experiments on synthetic patterns and five real-world process mining datasets show that n-grams with appropriate context windows achieve comparable accuracy to neural models while requiring substantially fewer resources. Unlike windowed neural architectures, which show unstable performance patterns, n-grams provide stable and consistent accuracy. While we demonstrate that classical ensemble methods like voting improve n-gram performance, they require running many agents in parallel during inference, increasing memory consumption and latency.

transformeragents
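A minimal version of the n-gram baseline might look like this (the traces are invented, not drawn from the paper's process mining datasets):

```python
from collections import Counter, defaultdict

# Minimal n-gram next-activity predictor for event logs: count which activity
# follows each length-(n-1) context, then predict the most frequent follower.
class NGramPredictor:
    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)

    def fit(self, traces):
        for trace in traces:
            padded = ["<s>"] * (self.n - 1) + list(trace)
            for i in range(len(trace)):
                ctx = tuple(padded[i:i + self.n - 1])
                self.counts[ctx][padded[i + self.n - 1]] += 1

    def predict(self, prefix):
        ctx = tuple((["<s>"] * (self.n - 1) + list(prefix))[-(self.n - 1):])
        if not self.counts[ctx]:
            return None  # unseen context
        return self.counts[ctx].most_common(1)[0][0]

model = NGramPredictor(n=2)
model.fit([["register", "check", "approve"],
           ["register", "check", "reject"],
           ["register", "check", "approve"]])
print(model.predict(["register", "check"]))  # approve
```

The entire "model" is a nested count table, which is why its memory and latency footprint is so much smaller than an LSTM or Transformer trained on the same log.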
#80

DryRUN: On the Role of Public Tests in LLM-Driven Code Generation

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Kaushitha Silva, Srinath Perera
4.7
I 5.0 Im 4.5 P 3.5

Multi-agent frameworks are widely used in autonomous code generation and have applications in complex algorithmic problem-solving. Recent work has addressed the challenge of generating functionally correct code by incorporating simulation-driven planning and debugging, where language models trace execution steps to verify logic. However, these approaches depend on human-provided public test cases to ground the debugging and simulation loop. Manually authoring comprehensive input-output examples is a labor-intensive bottleneck in the software development lifecycle.

benchmarkagentscodegen
#81

Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)
Chris Schneider, Philipp Schoenegger, Ben Bariach
4.7
I 4.0 Im 4.0 P 4.5

Current model training approaches incorporate user information directly into shared weights, making individual data removal computationally infeasible without retraining. This paper presents a three-layer architecture that decouples personal data from shared weights by combining a static base model, composable domain-expert LoRA adapters that shape behavior without imparting user data, and per-user proxy artefacts whose deletion constitutes deterministic unlearning.

#82

Mapping the Political Discourse in the Brazilian Chamber of Deputies: A Multi-Faceted Computational Approach

Post-Training 2026-04-23 arXiv cs.CL (Computation & Language)
Flávio Soriano, Victoria F. Mello, Pedro B. Rigueira, Gisele L. Pappa, Wagner Meira…
4.7
I 4.5 Im 4.7 P 3.5

Analyses of legislative behavior often rely on voting records, overlooking the rich semantic and rhetorical content of political speech. In this paper, we ask three complementary questions about parliamentary discourse: how things are said, what is being said, and who is speaking in discursively similar ways. To answer these questions, we introduce a scalable and generalizable computational framework that combines diachronic stylometric analysis, contextual topic modeling, and semantic clustering of deputies' speeches.

#83

EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language)
Praval Sharma, Ashok Samal, Leen-Kiat Soh, Deepti Joshi
4.7
I 5.0 Im 4.0 P 3.5

Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified datasets in open-domain settings. To address these limitations, we create EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset.

benchmark
#84

Language as a Latent Variable for Reasoning Optimization

Post-Training 2026-04-23 arXiv cs.CL (Computation & Language)
Linjuan Wu, Haoran Wei, Jialong Tang, Shuang Luo, Baosong Yang…
4.7
I 5.0 Im 4.0 P 3.5

As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions.

benchmarkcot
#85

Finding Meaning in Embeddings: Concept Separation Curves

Interpretability 2026-04-23 arXiv cs.CL (Computation & Language)
Paul Keuren, Marc Ponsen, Robert Ayoub Bagheri
4.7
I 4.0 Im 5.2 P 3.5

Sentence embedding techniques aim to encode key concepts of a sentence's meaning in a vector space. However, the majority of evaluation approaches for sentence embedding quality rely on the use of additional classifiers or downstream tasks. These additional components make it unclear whether good results stem from the embedding itself or from the classifier's behaviour. In this paper, we propose a novel method for evaluating the effectiveness of sentence embedding methods in capturing sentence-level concepts. Our approach is classifier-independent, allowing for an objective assessment of the model's performance.

#86

Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

Generative Media 2026-04-23 arXiv cs.CL (Computation & Language)
Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand, Mitesh M. Khapra
4.7
I 5.2 Im 4.0 P 3.5

Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity.

benchmarkvlmt2i
#87

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Nicolae Filat, Ahmed Hussain, Konstantinos Kalogiannis, Elena Burceanu
4.7
I 5.2 Im 4.0 P 3.5

Streaming Continual Learning (CL) typically converts a continuous stream into a sequence of discrete tasks through temporal partitioning. We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark conclusions.

benchmark
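The paper's point can be illustrated with a toy stream: two equally valid temporal splits of the same label sequence induce different task structures, and therefore different benchmark regimes (the labels and cut points below are invented):

```python
# Two valid temporal partitions of one label stream yield different "tasks",
# illustrating why taskification is not a neutral preprocessing choice.
stream = ["a", "a", "b", "b", "c", "c"]  # class labels arriving over time

def taskify(stream, boundaries):
    """Split a stream at the given time indices into discrete tasks."""
    cuts = [0] + boundaries + [len(stream)]
    return [stream[i:j] for i, j in zip(cuts, cuts[1:])]

split_1 = taskify(stream, [2, 4])  # clean class-incremental tasks
split_2 = taskify(stream, [3])     # tasks with blurred class boundaries
print(split_1)  # [['a', 'a'], ['b', 'b'], ['c', 'c']]
print(split_2)  # [['a', 'a', 'b'], ['b', 'c', 'c']]
```

A method evaluated on `split_1` faces a class-incremental regime; the same stream under `split_2` becomes a domain-overlap regime, so conclusions can flip with the split.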
#88

Low-Rank Adaptation Redux for Large Models

Efficiency 2026-04-23 arXiv cs.LG (Machine Learning)
Bingcong Li, Yilang Zhang, Georgios B. Giannakis
4.7
I 5.0 Im 4.2 P 3.5

Low-rank adaptation (LoRA) has emerged as the de facto standard for parameter-efficient fine-tuning (PEFT) of foundation models, enabling the adaptation of billion-parameter networks with minimal computational and memory overhead. Despite its empirical success and rapid proliferation of variants, it remains elusive which architectural choices, optimization techniques, and deployment constraints should guide practical method selection.

finetune
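As background, the standard LoRA parameterization the paper surveys can be sketched in a few lines (the shapes and scaling convention follow the usual formulation; the dimensions and data are arbitrary):

```python
import numpy as np

# Minimal LoRA update: W' = W + (alpha / r) * B @ A, training only A and B.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 4, 2, 8

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus low-rank correction; since B = 0 at init, the adapter
    # starts as an exact no-op and the model output is unchanged.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # no-op at initialization
print("trainable params:", A.size + B.size, "vs full:", W.size)
```

The trainable parameter count scales with `r * (d_in + d_out)` rather than `d_in * d_out`, which is the whole efficiency argument for PEFT.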
#89

Revealing Geography-Driven Signals in Zone-Level Claim Frequency Models: An Empirical Study using Environmental and Visual Predictors

Research 2026-04-23 arXiv cs.LG (Machine Learning)arXiv stat.ML (Statistical ML)
Sherly Alfonso-Sánchez, Cristián Bravo, Kristina G. Stankova
4.7
I 4.0 Im 4.2 P 4.5

Geographic context is often considered relevant to motor insurance risk, yet public actuarial datasets provide limited location identifiers, constraining how this information can be incorporated and evaluated in claim-frequency models. This study examines how geographic information from alternative data sources can be incorporated into actuarial models for Motor Third Party Liability (MTPL) claim prediction under such constraints. Using the BeMTPL97 dataset, we adopt a zone-level modeling framework and evaluate predictive performance on unseen postcodes.

transformerpretrain
#90

GFlowState: Visualizing the Training of Generative Flow Networks Beyond the Reward

Reinforcement Learning 2026-04-23 arXiv cs.LG (Machine Learning)
Florian Holeczek, Andreas Hinterreiter, Alex Hernandez-Garcia, Marc Streit, Christina Humer
4.7
I 4.5 Im 4.7 P 3.5

We present GFlowState, a visual analytics system designed to illuminate the training process of Generative Flow Networks (GFlowNets or GFNs). GFlowNets are a probabilistic framework for generating samples proportionally to a reward function. While GFlowNets have proved to be powerful tools in applications such as molecule and material discovery, their training dynamics remain difficult to interpret. Standard machine learning tools allow metric tracking but do not reveal how models explore the sample space, construct sample trajectories, or shift sampling probabilities during training.

#91

Transferable SCF-Acceleration through Solver-Aligned Initialization Learning

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Eike S. Eberhard, Viktor Kotsev, Timm Güthle, Stephan Günnemann
4.7
I 5.5 Im 3.5 P 3.5

Machine learning methods that predict initial guesses from molecular geometry can reduce the cost of self-consistent field (SCF) calculations, but matrix-prediction models fail when extrapolating to larger molecules, degrading rather than accelerating convergence [Liu et al. 2025]. We show that this failure is a supervision problem, not an extrapolation problem: models trained on ground-state targets fit those targets well out of distribution, yet produce initial guesses that slow convergence. Solver-Aligned Initialization Learning (SAIL) resolves this for both Hamiltonian and density matrix models by differentiating through the SCF solver end-to-end.

molecular
#92

A Kernel Nonconformity Score for Multivariate Conformal Prediction

Research 2026-04-23 arXiv cs.LG (Machine Learning)arXiv stat.ML (Statistical ML)
Louis Meyer, Wenkai Xu
4.7
I 4.5 Im 3.5 P 4.5

Multivariate conformal prediction requires nonconformity scores that compress residual vectors into scalars while preserving certain implicit geometric structure of the residual distribution. We introduce a Multivariate Kernel Score (MKS) that produces prediction regions that explicitly adapt to this geometry. We show that the proposed score resembles the Gaussian process posterior variance, unifying Bayesian uncertainty quantification with frequentist-type coverage guarantees. Moreover, the MKS can be decomposed into an anisotropic Maximum Mean Discrepancy (MMD) that interpolates between kernel density estimation and covariance-weighted distance.
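For context, the split-conformal recipe any such score plugs into looks like this (using a plain Euclidean norm as a stand-in for the paper's kernel score, on synthetic residuals):

```python
import numpy as np

# Split conformal prediction with a generic scalar nonconformity score.
# The norm below is a simple stand-in for the paper's MKS; data are synthetic.
rng = np.random.default_rng(1)
residuals = rng.normal(size=(200, 3))       # calibration residual vectors
scores = np.linalg.norm(residuals, axis=1)  # compress vectors to scalars

alpha = 0.1
n = len(scores)
# Finite-sample-corrected quantile: the ceil((n+1)(1-alpha))-th order statistic.
k = int(np.ceil((n + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]

# A new point is covered iff its score is <= q; this gives marginal
# coverage >= 1 - alpha regardless of the residual distribution.
new_residual = rng.normal(size=3)
print(np.linalg.norm(new_residual) <= q)
```

The score function is the only moving part: swapping the norm for a kernel-based score changes the shape of the prediction regions without touching the coverage machinery.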

#93

Local Neighborhood Instability in Parametric Projections: Quantitative and Visual Analysis

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)arXiv — Mechanistic Interpretability
Frederik L. Dennig, Daniel A. Keim
4.7
I 4.0 Im 4.0 P 4.5

Parametric projections let analysts embed new points in real time, but input variations from measurement noise or data drift can produce unpredictable shifts in the 2D layout. Whether and where a projection is locally stable remains largely unexamined. In this paper, we present a stability evaluation framework that probes parametric projections with Gaussian perturbations around selected anchor points and assesses how neighborhoods deform in the 2D embedding.
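The probing procedure can be sketched with a stand-in linear projection (the projection matrix, noise scale, and anchor below are all arbitrary, not the paper's setup):

```python
import numpy as np

# Probe a parametric projection's local stability: perturb an anchor point
# with Gaussian noise and measure how far its 2D embedding moves.
rng = np.random.default_rng(0)
P = rng.normal(size=(2, 10))  # stand-in for a trained parametric projection

anchor = rng.normal(size=10)
noise = rng.normal(scale=0.01, size=(100, 10))  # 100 Gaussian probes

base = P @ anchor                    # anchor's 2D embedding
perturbed = (anchor + noise) @ P.T   # embeddings of the perturbed copies
shifts = np.linalg.norm(perturbed - base, axis=1)

# A large mean shift relative to the input noise scale flags local instability.
print(round(float(shifts.mean()), 4))
```

Repeating this at many anchors yields a stability map over the embedding, which is the quantity the framework visualizes.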

#94
4.7
I 4.0 Im 4.5 P 4.0

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence.

benchmarkvlaagentscoding
#95

Will fusion power get cheap? Don’t count on it.

Safety & policy 2026-04-23 MIT Technology Review — AI
4.7
I 4.0 Im 4.5 P 4.3

Fusion power could provide a steady, zero-emissions source of electricity in the future—if companies can get plants built and running. But a new study suggests that even if that future arrives, it might not come cheap. Technologies tend to get less expensive over time. Lithium-ion batteries are now about 90% cheaper than they were in 2013. But historically, different technologies tend to go through this curve at different rates. And the cost of fusion might not sink as quickly as the prices of batteries or solar.
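The learning-curve arithmetic behind that claim is Wright's-law compounding: cost falls by a fixed fraction with every doubling of cumulative production (the learning rates below are illustrative, not the study's estimates):

```python
# Wright's-law learning curve: unit cost falls by a fixed fraction (the
# learning rate) per doubling of cumulative production.
def cost_after_doublings(initial_cost, learning_rate, doublings):
    return initial_cost * (1 - learning_rate) ** doublings

# A 20% learning rate cuts cost ~89% over 10 doublings; 5% cuts only ~40%.
fast = cost_after_doublings(100.0, 0.20, 10)
slow = cost_after_doublings(100.0, 0.05, 10)
print(round(fast, 1), round(slow, 1))  # 10.7 59.9
```

The gap between those two trajectories is the whole argument: if fusion sits on the slow curve, decades of deployment still leave it expensive.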

#96

Pentagon uses GenAI.mil to create 100K agents

Government & Defense 2026-04-23 DefenseScoop
4.7
I 4.0 Im 4.5 P 4.3

Defense officials recently used the Pentagon’s enterprise-wide generative artificial intelligence platform to create 100,000 agents amid a broader push by department leadership to speed up AI adoption, according to a senior member of the research and engineering directorate. The Pentagon first introduced its GenAI.mil platform for its workforce in December, with the aim of providing commercial tools to millions of personnel across the DOD. Defense Secretary Pete Hegseth and CTO Emil Michael have both championed the capability and encouraged its widespread use.

#98

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

Agents & Tool Use 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Bartosz Balis, Michal Orzechowski, Piotr Kica, Michal Dygas, Michal Kuszewski
4.6
I 4.0 Im 5.2 P 3.5

Scientific workflow systems automate execution -- scheduling, fault tolerance, resource management -- but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: an LLM interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow DAGs (deterministic layer); and domain experts author "Skills": markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer).

agentscoding
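The deterministic layer of that three-way split can be sketched as a registry of generators mapping validated intents to DAGs (the intent fields, generator name, and commands are all hypothetical, invented for illustration):

```python
# Hypothetical sketch of the deterministic layer: a validated intent dict is
# mapped to a reproducible workflow DAG by a registered generator, keeping the
# LLM out of the execution path.
GENERATORS = {}

def generator(intent_type):
    def register(fn):
        GENERATORS[intent_type] = fn
        return fn
    return register

@generator("alignment")
def make_alignment_dag(intent):
    ref = intent["reference"]
    return {
        "fetch":  {"deps": [],        "cmd": f"download {ref}"},
        "align":  {"deps": ["fetch"], "cmd": f"align --ref {ref}"},
        "report": {"deps": ["align"], "cmd": "summarize"},
    }

intent = {"type": "alignment", "reference": "GRCh38"}  # from the LLM layer
dag = GENERATORS[intent["type"]](intent)
print(sorted(dag))  # ['align', 'fetch', 'report']
```

Because the generator is deterministic, the same intent always yields the same DAG, which is what makes the resulting workflows reproducible.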
#99

Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Eghbal A. Hosseini, Brian Cheung, Evelina Fedorenko, Alex H. Williams
4.6
I 4.5 Im 4.7 P 3.5

Neural networks exhibit a remarkable degree of representational convergence across diverse architectures, training objectives, and even data modalities. This convergence is predictive of alignment with brain representation. A recent hypothesis suggests this arises from learning the underlying structure in the environment in similar ways. However, it is unclear how individual stimuli elicit convergent representations across networks. An image can be perceived in multiple ways and expressed differently using words. Here, we introduce a methodology based on the Generalized Procrustes Algorithm to measure intra-modal representational convergence at the single-stimulus level.

#100

SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth, Domain Adaptation and Super-Resolution in Aerial Imagery

Generative Media 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.CV (Computer Vision)
Safouane El Ghazouali, Nicola Venturi, Michael Rueegsegger, Umberto Michelucci
4.6
I 4.0 Im 4.0 P 4.5

Recent advances in deep learning for remote sensing rely heavily on large annotated datasets, yet acquiring high-quality ground truth for geometric, radiometric, and multi-domain tasks remains costly and often infeasible. In particular, the lack of accurate depth annotations, controlled illumination variations, and multi-scale paired imagery limits progress in monocular depth estimation, domain adaptation, and super-resolution for aerial scenes. We present SyMTRS, a large-scale synthetic dataset generated using a high-fidelity urban simulation pipeline.

benchmark
#101

Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Yvon K. Awuklu, Meghyn Bienvenu, Katsumi Inoue, Vianney Jouhet, Fleur Mougin
4.6
I 4.0 Im 5.2 P 3.5

In this paper, we develop a novel logic-based approach to detecting high-level temporally extended events from timestamped data and background knowledge. Our framework employs logical rules to capture existence and termination conditions for simple temporal events and to combine these into meta-events. In the medical domain, for example, disease episodes and therapies are inferred from timestamped clinical observations, such as diagnoses and drug administrations stored in patient records, and can be further combined into higher-level disease events.

#102
4.6
I 4.0 Im 5.2 P 3.5

The increasing integration of artificial intelligence (AI) in higher education has raised important questions regarding students' transparency in reporting AI-assisted work. This study investigates the psychological mechanisms underlying university students' willingness to disclose AI use by applying the Cognition--Affect--Conation (CAC) framework. A sequential explanatory mixed-methods design was employed. In the quantitative phase, survey data were collected from 546 university students and analysed using structural equation modelling to examine the relationships among cognitive perceptions, affective responses, and disclosure intention.

#103

Causal Disentanglement for Full-Reference Image Quality Assessment

Generative Media 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.CV (Computer Vision)
Zhen Zhang, Jielei Chu, Tian Zhang, Weide Liu, Fengmao Lv…
4.6
I 4.0 Im 4.0 P 4.5

Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images.

benchmark
#104

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Songen Gu, Yuhang Zheng, Weize Li, Yupeng Zheng, Yating Feng…
4.6
I 4.7 Im 4.3 P 3.5

Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when training with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning.

benchmarkdiffusionrobotmanipulation
#105

Ufil: A Unified Framework for Infrastructure-based Localization

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Simon Schäfer, Lucas Hegerath, Marius Molz, Massimo Marcon, Bassam Alrifaee
4.6
I 4.5 Im 4.7 P 3.5

Infrastructure-based localization enhances road safety and traffic management by providing state estimates of road users. Development is hindered by fragmented, application-specific stacks that tightly couple perception, tracking, and middleware. We introduce Ufil, a Unified Framework for Infrastructure-Based Localization with a standardized object model and reusable multi-object tracking components. Ufil offers interfaces and reference implementations for prediction, detection, association, state update, and track management, allowing researchers to improve components without reimplementing the pipeline. Ufil is open-source C++/ROS 2 software with documentation and executable examples.

#106

From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Yiming Zhong, Yaoyu He, Zemin Yang, Pengfei Tian, Yifan Huang…
4.6
I 4.0 Im 5.0 P 3.5

Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization.

diffusionvlarobot
#107

From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Katharina Prasse, Steffen Jung, Isaac Bravo, Stefanie Walter, Patrick Knab…
4.6
I 5.5 Im 4.0 P 3.5

Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation.

benchmarkvlmcot
#108

Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Minghao Yin, Wenbo Hu, Jiale Xu, Ying Shan, Kai Han
4.6
I 5.0 Im 4.5 P 3.5

Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask.

diffusiontransformerpretrain
#109

Extract PDF text in your browser with LiteParse for the web

Frontier LLMs 2026-04-23 Simon Willison's Weblog
4.5
I 4.5 Im 3.5 P 5.0

LlamaIndex have a most excellent open source project called LiteParse, which provides a Node.js CLI tool for extracting text from PDFs. I got a version of LiteParse working entirely in the browser, using most of the same libraries that LiteParse uses to run in Node.js. Refreshingly, LiteParse doesn't use AI models to do what it does: it's good old-fashioned PDF parsing, falling back to Tesseract OCR (or other pluggable OCR engines) for PDFs that contain images of text rather than the text itself.

#110

llm-openai-via-codex 0.1a0

Frontier LLMs 2026-04-23 Simon Willison's Weblog
4.5
I 4.5 Im 3.5 P 5.0

Release: llm-openai-via-codex 0.1a0. Hijacks your Codex CLI credentials to make API calls with LLM, as described in my post about GPT-5.5. Tags: openai, llm, codex-cli

#111

Millisecond Converter

Frontier LLMs 2026-04-24 Simon Willison's Weblog
4.5
I 4.5 Im 3.5 P 5.0

Tool: Millisecond Converter. LLM reports prompt durations in milliseconds and I got fed up of having to think about how to convert those to seconds and minutes. Tags: tools
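A minimal sketch of the kind of conversion such a tool performs (the formatting choices here are mine, not necessarily the tool's):

```python
# Convert millisecond durations into a human-readable minutes/seconds string.
def humanize_ms(ms):
    minutes, rem = divmod(ms, 60_000)
    seconds = rem / 1000
    return f"{minutes}m {seconds:.1f}s" if minutes else f"{seconds:.1f}s"

print(humanize_ms(807_000))  # 13m 27.0s
print(humanize_ms(4_250))    # 4.2s
```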

#112

Seeing Fast and Slow: Learning the Flow of Time in Videos

Generative Media 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.CV (Computer Vision)
Yen-Siang Wu, Rundong Luo, Jingsen Zhu, Tao Tu, Ali Farhadi…
4.5
I 4.0 Im 3.8 P 4.5

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos.

multimodalvideo-gen
#113

A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)
Max Defez, Filippo Quarenghi, Mathieu Vrac, Stephan Mandt, Tom Beucler
4.5
I 4.0 Im 3.8 P 4.5

Deep-learning video super-resolution has progressed rapidly, but climate applications typically super-resolve (increase resolution) either space or time, and joint spatiotemporal models are often designed for a single pair of super-resolution (SR) factors (upscaling spatial and temporal ratio between the low-resolution sequence and the high-resolution sequence), limiting transfer across spatial resolutions and temporal cadences (frame rates).

diffusionlong-context
#114

Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Natan Levy, Gadi Perl
4.5
I 4.0 Im 4.7 P 3.5

Artificial intelligence now decides who receives a loan, who is flagged for criminal investigation, and whether an autonomous vehicle brakes in time. Governments have responded: the EU AI Act, the NIST Risk Management Framework, and the Council of Europe Convention all demand that high-risk systems demonstrate safety before deployment. Yet beneath this regulatory consensus lies a critical vacuum: none specifies what "acceptable risk" means in quantitative terms, and none provides a technical method for verifying that a deployed system actually meets such a threshold.

#115

Alignment has a Fantasia Problem

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Nathanael Jo, Zoe De Simone, Mitchell Gordon, Ashia Wilson
4.5
I 4.0 Im 4.7 P 3.5

Modern AI assistants are trained to follow instructions, implicitly assuming that users can clearly articulate their goals and the kind of assistance they need. Decades of behavioral research, however, show that people often engage with AI systems before their goals are fully formed. When AI systems treat prompts as complete expressions of intent, they can appear to be useful or convenient, but not necessarily aligned with the users' needs. We call these failures Fantasia interactions.

#116

Probably Approximately Consensus: On the Learning Theory of Finding Common Ground

Efficiency 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)
Carter Blair, Ben Armstrong, Shiri Alouf-Heffetz, Nimrod Talmon, Davide Grossi
4.5
I 4.0 Im 3.5 P 4.5

A primary goal of online deliberation platforms is to identify ideas that are broadly agreeable to a community of users through their expressed preferences. Yet, consensus elicitation should ideally extend beyond the specific statements provided by users and should incorporate the relative salience of particular topics. We address this issue by modelling consensus as an interval in a one-dimensional opinion space derived from potentially high-dimensional data via embedding and dimensionality reduction.
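One toy reading of "consensus as an interval" is to find the interval of fixed width that contains the most user opinions in the 1-D space. This sliding-scan sketch is illustrative only and omits the paper's salience-aware formulation:

```python
def densest_interval(opinions, width):
    """Return (start, count) for the width-long interval in a 1-D
    opinion space that covers the most opinions. A simplified model
    of interval consensus, not the paper's estimator."""
    pts = sorted(opinions)
    best_count, best_start = 0, pts[0] if pts else 0.0
    for i, lo in enumerate(pts):
        # count opinions falling inside [lo, lo + width]
        count = sum(1 for p in pts[i:] if p <= lo + width)
        if count > best_count:
            best_count, best_start = count, lo
    return best_start, best_count
```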

#117

Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs

Post-Training 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.CL (Computation & Language)
Joseba Fernandez de Landa, Carla Perez-Almendros, Jose Camacho-Collados
4.5
I 4.0 Im 3.5 P 4.5

LLMs have been showing limitations when it comes to cultural coverage and competence, and in some cases show regional biases such as amplifying Western and Anglocentric viewpoints. While there have been works analysing the cultural capabilities of LLMs, there has not been specific work on highlighting LLM regional preferences when it comes to cultural-related questions. In this work, we propose a new dataset based on a comprehensive taxonomy of Culture-Related Open Questions (CROQ).

finetune
#118

Fairness under uncertainty in sequential decisions

Reinforcement Learning 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)
Michelle Seng Ah Lee, Kirtan Padh, David Watson, Niki Kilbertus, Jatinder Singh
4.5
I 4.0 Im 3.5 P 4.5

Fair machine learning (ML) methods help identify and mitigate the risk that algorithms encode or automate social injustices. Algorithmic approaches alone cannot resolve structural inequalities, but they can support socio-technical decision systems by surfacing discriminatory biases, clarifying trade-offs, and enabling governance. Although fairness is well studied in supervised learning, many real ML applications are online and sequential, with prior decisions informing future ones.

rl
#119

Hybrid Deep Learning Approach for Coupled Demand Forecasting and Supply Chain Optimization

Efficiency 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)
Nusrat Yasmin Nadia, Md Habibul Arif, Habibor Rahman Rabby, Md Iftekhar Monzur Tanvir, Md. Jakir Hossen…
4.5
I 4.0 Im 3.5 P 4.5

Supply chain resilience and efficiency are vital in industries characterized by volatile demand and uncertain supply, such as textiles and personal protective equipment (PPE). Traditional forecasting and optimization approaches often operate in isolation, limiting their real-world effectiveness. This paper proposes a Hybrid AI Framework for Demand-Supply Forecasting and Optimization (HAF-DS), which integrates a Long Short-Term Memory (LSTM)-based demand forecasting module with a mixed integer linear programming (MILP) optimization layer. The LSTM captures temporal and contextual demand dependencies, while the optimization layer prescribes cost-efficient replenishment and allocation decisions.

#120

Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language)
Yuto Nishida, Naoki Shikoda, Yosuke Kishinami, Ryo Fujii, Makoto Morishita…
4.5
I 4.5 Im 4.0 P 3.5

Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms.

#121

Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

Post-Training 2026-04-23 arXiv cs.CL (Computation & Language)
Jiseon Kim, Jea Kwon, Luiz Felipe Vecchietti, Wenchao Dong, Jaehong Kim…
4.5
I 4.0 Im 4.7 P 3.5

Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower's Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making.

#122

SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)
Hans Ole Hatzel, Ekaterina Artemova, Haimo Paul Stiemer, Evelyn Gius, Chris Biemann
4.5
I 4.5 Im 4.2 P 3.5

We present the shared task on narrative similarity and narrative representation learning - NSNRL (pronounced "nass-na-rel"). The task operationalizes narrative similarity as a binary classification problem: determining which of two stories is more similar to an anchor story. We introduce a novel definition of narrative similarity, compatible with both narrative theory and intuitive judgment. Based on the similarity judgments collected under this concept, we also evaluate narrative embedding representations.

finetunepretrain
#123

Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers

Interpretability 2026-04-23 arXiv cs.CL (Computation & Language)
Bernard Muller, Antonio Armando Ortiz Barrañón, LaVonne Roberts
4.5
I 4.0 Im 4.7 P 3.5

We previously introduced a training-free method for dysarthria severity assessment based on d-prime separability of phonological feature subspaces in frozen self-supervised speech representations, validated on 890 speakers across 5 languages with HuBERT-base. Here, we scale the analysis to 3,374 speakers from 25 datasets spanning 12 languages and 5 aetiologies (Parkinson's disease, cerebral palsy, ALS, Down syndrome, and stroke), plus healthy controls, using 6 SSL backbones. We report three findings.
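The d-prime separability the method relies on is the standard sensitivity index: mean separation in pooled-standard-deviation units. A minimal version (how the paper aggregates it over phonological feature subspaces is not reproduced here):

```python
import math

def d_prime(xs, ys):
    """Sensitivity index d' between two score samples:
    |mean difference| divided by the pooled standard deviation."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / len(xs)
    vy = sum((y - my) ** 2 for y in ys) / len(ys)
    return abs(mx - my) / math.sqrt(0.5 * (vx + vy))
```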

#124

Multilinguality at the Edge: Developing Language Models for the Global South

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)
Lester James V. Miranda, Songbo Hu, Roi Reichart, Anna Korhonen
4.5
I 5.0 Im 3.5 P 3.5

Where and how language models (LMs) are deployed determines who can benefit from them. However, there are several challenges that prevent effective deployment of LMs in non-English-speaking and hardware constrained communities in the Global South. We call this challenge the last mile: the intersection of multilinguality and edge deployment, where the goals are aligned but the technical requirements often compete.

#125

From Tokens to Concepts: Leveraging SAE for SPLADE

Research 2026-04-23 arXiv cs.CL (Computation & Language)arXiv — Mechanistic Interpretability
Yuxuan Zong, Mathias Vast, Basile Van Cooten, Laure Soulier, Benjamin Piwowarski
4.5
I 4.0 Im 3.5 P 4.5

Learned Sparse IR models, such as SPLADE, offer an excellent efficiency-effectiveness tradeoff. However, they rely on the underlying backbone vocabulary, which might hinder performance (polysemicity and synonymy) and pose a challenge for multi-lingual and multi-modal usages. To solve this limitation, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these 2 concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models.

#126

Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)arXiv cs.LG (Machine Learning)
Fariz Ikhwantri, Dusica Marijan
4.5
I 4.0 Im 3.5 P 4.5

Automating the detection of regulatory compliance remains a challenging task due to the complexity and variability of legal texts. Models trained on one regulation often fail to generalise to others. This limitation underscores the need for principled methods to improve cross-domain transfer. We study data selection as a strategy to mitigate negative transfer in compliance detection framed as a natural language inference (NLI) task. Specifically, we evaluate four approaches for selecting augmentation data from a larger source domain: random sampling, Moore-Lewis's cross-entropy difference, importance weighting, and embedding-based retrieval.

#127

AI-Gram: When Visual Agents Interact in a Social Network

Agents & Tool Use 2026-04-23 arXiv cs.CL (Computation & Language)
Andrew Shin
4.5
I 4.5 Im 4.0 P 3.5

We present AI-Gram, a live platform enabling image-based interactions, to study social dynamics in a fully autonomous multi-agent visual network where all participants are LLM-driven agents. Using the platform, we conduct experiments on how agents communicate and adapt through visual media, and observe the spontaneous emergence of visual reply chains, indicating rich communicative structure. At the same time, agents exhibit aesthetic sovereignty (resisting stylistic convergence toward social partners), anchoring under adversarial influence, and a decoupling between visual similarity and social ties.

agents
#128

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Agents & Tool Use 2026-04-23 arXiv cs.CL (Computation & Language)
Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou…
4.5
I 4.0 Im 4.5 P 3.5

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence.

benchmarkvlaagentscoding
#129

Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Yaxuan Li, Zhongyi Zhou, Yefei Chen, Yanjiang Guo, Jiaming Liu…
4.5
I 4.0 Im 4.7 P 3.5

Post-training is essential for turning pretrained generalist robot policies into reliable task-specific controllers, but existing human-in-the-loop pipelines remain tied to physical execution: each correction requires robot time, scene setup, resets, and operator supervision in the real world. Meanwhile, action-conditioned world models have been studied mainly for imagination, synthetic data generation, and policy evaluation. We propose Human-in-the-World-Model (Hi-WM), a post-training framework that uses a learned world model as a reusable corrective substrate for failure-targeted policy improvement.

robotmanipulationpretrain
#130

SLAM as a Stochastic Control Problem with Partial Information: Optimal Solutions and Rigorous Approximations

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Ilir Gusija, Fady Alajaji, Serdar Yüksel
4.5
I 4.5 Im 4.2 P 3.5

Simultaneous localization and mapping (SLAM) is a foundational state estimation problem in robotics in which a robot accurately constructs a map of its environment while also localizing itself within this construction. We study the active SLAM problem through the lens of optimal stochastic control, thereby recasting it as a decision-making problem under partial information. After reviewing several commonly studied models, we present a general stochastic control formulation of active SLAM together with a rigorous treatment of motion, sensing, and map representation.

robot
#131

A Replicable Robotics Awareness Method Using LLM-Enabled Robotics Interaction: Evidence from a Corporate Challenge

Robotics 2026-04-23 arXiv cs.RO (Robotics)
S. A. Prieto, M. A. Gopee, Y. Ben Arab, B. García de Soto, J. Esteba…
4.5
I 4.0 Im 4.7 P 3.5

Large language models are increasingly being explored as interfaces between humans and robotic systems, yet there remains limited evidence on how such technologies can be used not only for interaction, but also as a structured means of introducing robotics to non-specialist users in real organizational settings. This paper introduces and evaluates a challenge-based method for robotics awareness, implemented through an LLM-enabled humanoid robot activity conducted with employees of AD Ports Group in the United Arab Emirates.

robothumanoid
#132

The Sample Complexity of Multicalibration

Research 2026-04-23 arXiv cs.LG (Machine Learning)arXiv stat.ML (Statistical ML)
Natalie Collina, Jiuyao Lu, Georgy Noarov, Aaron Roth
4.5
I 4.0 Im 3.5 P 4.5

We study the minimax sample complexity of multicalibration in the batch setting. A learner observes $n$ i.i.d. samples from an unknown distribution and must output a (possibly randomized) predictor whose population multicalibration error, measured by Expected Calibration Error (ECE), is at most $\varepsilon$ with respect to a given family of groups. For every fixed $\kappa > 0$, in the regime $|G| \le \varepsilon^{-\kappa}$, we prove that $\widetilde{\Theta}(\varepsilon^{-3})$ samples are necessary and sufficient, up to polylogarithmic factors.
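For intuition, a finite-sample group-wise ECE can be computed directly. This simplified estimator (binning by exact prediction value, taking the worst group) is illustrative and not the paper's construction:

```python
def group_ece(preds, labels, groups):
    """Worst-group Expected Calibration Error: within each group,
    average |prediction - mean label| over prediction-value bins,
    then take the maximum over groups."""
    worst = 0.0
    for g in groups:
        by_value = {}
        for i in g:  # bin sample indices by their predicted value
            by_value.setdefault(preds[i], []).append(labels[i])
        ece = sum(len(ys) * abs(v - sum(ys) / len(ys))
                  for v, ys in by_value.items())
        worst = max(worst, ece / len(g))
    return worst
```

Multicalibration asks that this quantity stay below $\varepsilon$ simultaneously for every group in the family.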

#133

Evaluating Post-hoc Explanations of the Transformer-based Genome Language Model DNABERT-2

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Isabel Kurth, Paulo Yanez Sarmiento, Bernhard Y. Renard
4.5
I 5.0 Im 3.5 P 3.5

Explaining deep neural network predictions on genome sequences enables biological insight and hypothesis generation, often of greater interest than predictive performance alone. While explanations of convolutional neural networks (CNNs) have been shown to capture relevant patterns in genome sequences, it is unclear whether this transfers to more expressive Transformer-based genome language models (gLMs). To answer this question, we adapt AttnLRP, an extension of layer-wise relevance propagation to the attention mechanism, and apply it to the state-of-the-art gLM DNABERT-2. Thereby, we propose strategies to transfer explanations from token and nucleotide level.

transformer
#134

A-THENA: Early Intrusion Detection for IoT with Time-Aware Hybrid Encoding and Network-Specific Augmentation

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Ioannis Panopoulos, Maria Lamprini A. Bartsioka, Sokratis Nikolaidis, Stylianos I. Venieris, Dimitra I. Kaklamani…
4.5
I 4.7 Im 4.0 P 3.5

The proliferation of Internet of Things (IoT) devices has significantly expanded attack surfaces, making IoT ecosystems particularly susceptible to sophisticated cyber threats. To address this challenge, this work introduces A-THENA, a lightweight early intrusion detection system (EIDS) that significantly extends preliminary findings on time-aware encodings. A-THENA employs an advanced Transformer-based architecture augmented with a generalized Time-Aware Hybrid Encoding (THE), integrating packet timestamps to effectively capture temporal dynamics essential for accurate and early threat detection. The proposed system further employs a Network-Specific Augmentation (NA) pipeline, which enhances model robustness and generalization.

benchmarktransformercoding
#135

Verifying Machine Learning Interpretability Requirements through Provenance

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Lynn Vonderhaar, Juan Couder, Daryela Cisneros, Omar Ochoa
4.5
I 4.0 Im 4.7 P 3.5

Machine Learning (ML) Engineering is a growing field that necessitates an increase in the rigor of ML development. It draws many ideas from software engineering and more specifically, from requirements engineering. Existing literature on ML Engineering defines quality models and Non-Functional Requirements (NFRs) specific to ML, in particular interpretability being one such NFR. However, a major challenge occurs in verifying ML NFRs, including interpretability.

#136

Dynamical Priors as a Training Objective in Reinforcement Learning

Reinforcement Learning 2026-04-23 arXiv cs.LG (Machine Learning)
Sukesh Subaharan
4.5
I 4.5 Im 4.0 P 3.5

Standard reinforcement learning (RL) optimizes policies for reward but imposes few constraints on how decisions evolve over time. As a result, policies may achieve high performance while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or degenerate inactivity. We introduce Dynamical Prior Reinforcement Learning (DP-RL), a training framework that augments policy gradient learning with an auxiliary loss derived from external state dynamics that implement evidence accumulation and hysteresis. Without modifying the reward, environment, or policy architecture, this prior shapes the temporal evolution of action probabilities during learning.

rlagents
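A generic auxiliary smoothness term over successive action distributions conveys the flavor of such a prior; the actual DP-RL loss is derived from evidence-accumulation and hysteresis dynamics, which this sketch does not reproduce:

```python
def temporal_coherence_penalty(prob_seq):
    """Auxiliary loss penalizing abrupt shifts in a policy's action
    probabilities across consecutive timesteps: mean squared change
    between adjacent distributions. A generic smoothness stand-in."""
    total = 0.0
    for prev, cur in zip(prob_seq, prob_seq[1:]):
        total += sum((p - q) ** 2 for p, q in zip(prev, cur))
    return total / max(1, len(prob_seq) - 1)
```

Added to the policy-gradient objective with a small weight, a term like this discourages oscillation without touching the reward, environment, or architecture.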
#137

A single algorithm for both restless and rested rotting bandits

Research 2026-04-23 arXiv stat.ML (Statistical ML)
Julien Seznec, Pierre Ménard, Alessandro Lazaric, Michal Valko
4.5
I 5.5 Im 3.5 P 3.5

In many application domains (e.g., recommender systems, intelligent tutoring systems), the rewards associated with the actions tend to decrease over time. This decay is either caused by the actions executed in the past (e.g., a user may get bored when songs of the same genre are recommended over and over) or by an external factor (e.g., content becomes outdated). These two situations can be modeled as specific instances of the rested and restless bandit settings, where arms are rotting (i.e., their value decreases over time).

#138

Grounding Video Reasoning in Physical Signals

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Alibay Osmanli, Zixu Cheng, Shaogang Gong
4.5
I 5.2 Im 4.0 P 3.5

Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what-when-where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU.

benchmark
#139

TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Yongqi Li…
4.5
I 4.0 Im 5.2 P 3.5

Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment.

benchmarkmultimodal
#140
4.5
I 4.0 Im 4.7 P 3.5

Wilson coefficients in dimension-six effective field theory are constrained in a combined fit to several ATLAS measurements. These inputs probe Higgs-boson processes across multiple production and decay modes, di-Higgs signatures in the $b\bar{b}\gamma\gamma$ and $b\bar{b}\tau\tau$ final states, $WW$ and $WZ$ diboson signatures, electroweak $Zjj$ final states, high-mass Drell-Yan interactions, and top-antitop events in both resolved and boosted topologies. Precision electroweak observables from LEP, SLD, and ATLAS are also included.

#141

Hybrid Policy Distillation for LLMs

Research 2026-04-22 HF ↑8 Hugging Face Daily Papers
4.5
I 4.0 Im 3.5 P 4.8

Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling.

distillation
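The forward/reverse-KL blend at the heart of HPD can be illustrated on a single token's distribution; the fixed convex weighting below is an assumption for illustration, not the paper's scheme:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hybrid_kd_loss(teacher, student, alpha=0.5):
    """Blend forward KL(teacher || student), which is mode-covering,
    with reverse KL(student || teacher), which is mode-seeking."""
    return alpha * kl(teacher, student) + (1 - alpha) * kl(student, teacher)
```

At `alpha=1` this recovers standard forward-KL distillation; at `alpha=0` it is purely mode-seeking.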
#142

The Download: introducing the Nature issue

Safety & policy 2026-04-23 MIT Technology Review — AI
4.5
I 4.0 Im 4.0 P 4.3

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology. Introducing: the Nature issue. When we talk about “nature,” we usually mean something untouched by humans. But little of that world exists today. From microplastics in rainforest wildlife to artificial light in the Arctic Ocean, human influence now reaches every corner of Earth. In this context, what even is nature?

#143

Quoting Maggie Appleton

Frontier LLMs 2026-04-23 Simon Willison's Weblog
4.5
I 4.0 Im 4.0 P 4.3

[...] if you ever needed another reason to learn in public by digital gardening or podcasting or streaming or whathaveyou, add on that people will assume you’re more competent than you are. This will get you invites to very cool exclusive events filled with high-achieving, interesting people, even though you have no right to be there. A+ side benefit. — Maggie Appleton, Gathering Structures (via). Tags: blogging, maggie-appleton

#144
4.5
I 4.0 Im 4.0 P 4.3

It’s no secret that Operation Epic Fury and the associated war in the Middle East have sparked major disruptions in the complex, global logistics network that U.S. Transportation Command relies on to move, equip and support the joint force. But according to its commander Air Force Gen. Randall Reed, those disturbances are also enabling Transcom and its military partners to integrate and refine their joint logistics operations, and expand deployments of real-time data and AI-enabled visualization assets.

#147

EXCLUSIVE: Lockheed exits Navy trainer aircraft competition

Government & Defense 2026-04-23 Breaking Defense
4.5
I 4.0 Im 4.0 P 4.3

After the surprise move, the field of competitors for the Navy’s Undergraduate Jet Training System has now narrowed to SNC, Boeing, and Textron Aviation Defense in partnership with Leonardo.

#149

Why Do Many Western Defense Tech Firms Struggle in Ukraine?

Government & Defense 2026-04-23 War on the Rocks
4.5
I 5.0 Im 4.0 P 3.5

Michael Kofman joined Ryan at a live event earlier this year to discuss the performance of American defense technology in Ukraine and why it often falls short. They examine the challenges of fielding and iterating systems in combat, from poor implementation and weak feedback loops to deeper mismatches between design and battlefield reality. They also explore what it takes to succeed in this environment and what it means for future conflicts. Thanks to Leonid Capital Partners for hosting the event at which this podcast was recorded.

#150
4.5
I 5.0 Im 4.0 P 3.5

China’s greatest technological ambition and its greatest political obsession are quietly destroying each other. The same censorship apparatus the Party built to control its people is now corrupting the AI systems its leaders depend on. The United States, by leaning into an open marketplace of information and ideas, will gain advantage as it takes a different path. AI is increasingly training newer, faster AI models . This typically involves scraping the internet for content and then loading it into datasets for new programs.

#155

Bridging the Training-Deployment Gap: Gated Encoding and Multi-Scale Refinement for Efficient Quantization-Aware Image Enhancement

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.CV (Computer Vision)
Dat To-Thanh, Nghia Nguyen-Trong, Hoang Vo, Hieu Bui-Minh, Tinh-Anh Nguyen-Nhu
4.4
I 4.0 Im 3.5 P 4.5

Image enhancement models for mobile devices often struggle to balance high output quality with the fast processing speeds required by mobile hardware. While recent deep learning models can enhance low-quality mobile photos into high-quality images, their performance is often degraded when converted to lower-precision formats for actual use on mobile phones. To address this training-deployment mismatch, we propose an efficient image enhancement model designed specifically for mobile deployment. Our approach uses a hierarchical network architecture with gated encoder blocks and multiscale refinement to preserve fine-grained visual features.

codingquantization
#156

Efficient Logic Gate Networks for Video Copy Detection

Multimodal 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.CV (Computer Vision)
Katarzyna Fojcik
4.4
I 4.0 Im 3.5 P 4.5

Video copy detection requires robust similarity estimation under diverse visual distortions while operating at very large scale. Although deep neural networks achieve strong performance, their computational cost and descriptor size limit practical deployment in high-throughput systems. In this work, we propose a video copy detection framework based on differentiable Logic Gate Networks (LGNs), which replace conventional floating-point feature extractors with compact, logic-based representations. Our approach combines aggressive frame miniaturization, binary preprocessing, and a trainable LGN embedding model that learns both logical operations and interconnections.

#157

Neuromorphic Computing Based on Parametrically-Driven Oscillators and Frequency Combs

Research 2026-04-23 arXiv cs.NE (Neural & Evolutionary Computing)
Mahadev Sunil Kumar, Adarsh Ganesan
4.4
I 4.7 Im 4.0 P 3.5

Parametrically driven oscillators provide a natural platform for neuromorphic computation, where nonlinear mode coupling and intrinsic dynamics enable both memory and high-dimensional transformation. Here, we investigate a two-mode system exhibiting 2:1 parametric resonance and demonstrate its operation as a reservoir computer across distinct dynamical regimes, including sub-threshold, parametric resonance, and frequency-comb states. By encoding input signals into the drive amplitude and sampling the resulting temporal and spectral responses, we perform one step-ahead prediction of benchmark chaotic systems, including Mackey-Glass, Rossler, and Lorenz dynamics.

benchmarkcoding
#158

Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin
4.4
I 4.0 Im 4.7 P 3.5

Humans and modern vision models can reach similar classification accuracy while making systematically different kinds of mistakes - differing not in how often they err, but in who gets mistaken for whom, and in which direction. We show that these directional confusions reveal distinct inductive biases that are invisible to accuracy alone.

mech-interp
#159

DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Tahar Chettaoui, Eduarda Caldeira, Guray Ozgur, Raghavendra Ramachandra, Fadi Boutros…
4.4
I 5.0 Im 3.8 P 3.5

Advancing face morphing attack techniques is crucial to anticipate evolving threats and develop robust defensive mechanisms for identity verification systems. This work introduces DCMorph, a dual-stream diffusion-based morphing framework that simultaneously operates at both identity conditioning and latent space levels.

diffusion
#160

russellromney/honker

Frontier LLMs 2026-04-24 Simon Willison's Weblog
4.3
I 4.0 Im 3.5 P 4.5

russellromney/honker: "Postgres NOTIFY/LISTEN semantics" for SQLite, implemented as a Rust SQLite extension and various language bindings to help make use of it. The design of this looks very solid. It lets you write Python code for queues that looks like this:

import honker

db = honker.open("app.db")
emails = db.queue("emails")
emails.enqueue({"to": "[email protected]"})

# Consume (in a worker process)
async for job in emails.claim("worker-1"):
    send(job.

#161

Serving the For You feed

Frontier LLMs 2026-04-24 Simon Willison's Weblog
4.3
I 4.0 Im 3.5 P 4.5

Serving the For You feed One of Bluesky's most interesting features is that anyone can run their own custom "feed" implementation and make it available to other users - effectively enabling custom algorithms that can use any mechanism they like to recommend posts. spacecowboy runs the For You Feed, used by around 72,000 people. This guest post on the AT Protocol blog explains how it works. The architecture is fascinating.

#162

Vista4D: Video Reshooting with 4D Point Clouds

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Kuan Heng Lin, Zhizheng Liu, Pablo Salamanca, Yash Kant, Ryan Burgert…
4.3
I 5.0 Im 3.5 P 3.5

We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and failing to maintain precise camera control for challenging new trajectories.

#163

DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Xu Wang, Zhiru Wang, Shiyun Xie, Chengwei Pan, Yisong Chen
4.3
I 5.0 Im 3.5 P 3.5

While 3D Gaussian Splatting (3DGS) achieves real-time photorealistic rendering, its performance degrades significantly when training images contain transient objects that violate multi-view consistency. Existing methods face a circular dependency: accurate transient detection requires a well-reconstructed static scene, while clean reconstruction itself depends on reliable transient masks. We address this challenge with DualSplat, a Failure-to-Prior framework that converts first-pass reconstruction failures into explicit priors for a second reconstruction stage. We observe that transients, which appear in only a subset of views, often manifest as incomplete fragments during conservative initial training.

#164

Context Unrolling in Omni Models

Research 2026-04-23 HF ↑2 Hugging Face Daily Papers
4.3
I 4.0 Im 4.0 P 3.8

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity.

benchmark · multimodal
#168

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

Agents & Tool Use 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Chee Wei Tan, Yuchen Wang, Shangxin Guo
4.2
I 4.0 Im 4.0 P 3.5

This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineering environment that enables users to create, customize, and deploy LLM-powered game agents while actively engaging with AI-driven strategies. The LLM-based chatbot, integrated within Nemobot, demonstrates its capabilities across four distinct classes of games. For dictionary-based games, it compresses state-action mappings into efficient, generalized models for rapid adaptability.

rl · agents · finetune
#169

Agentic AI-assisted coding offers a unique opportunity to instill epistemic grounding during software development

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Magnus Palmblad, Jared M. Ragland, Benjamin A. Neely
4.2
I 4.0 Im 4.0 P 3.5

The capabilities of AI-assisted coding are progressing at breakneck speed. Chat-based vibe coding has evolved into fully fledged AI-assisted, agentic software development using agent scaffolds where the human developer creates a plan that agentic AIs implement. One current trend is utilizing documents beyond this plan document, such as project and method-scoped documents. Here we propose GROUNDING.md, a community-governed, field-scoped epistemic grounding document, using mass spectrometry-based proteomics as an example.

agents · coding
#170

Using ASP(Q) to Handle Inconsistent Prioritized Data

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Meghyn Bienvenu, Camille Bourgaux, Robin Jean, Giuseppe Mazzotta
4.2
I 4.0 Im 4.0 P 3.5

We explore the use of answer set programming (ASP) and its extension with quantifiers, ASP(Q), for inconsistency-tolerant querying of prioritized data, where a priority relation between conflicting facts is exploited to define three notions of optimal repairs (Pareto-, globally- and completion-optimal). We consider the variants of three well-known semantics (AR, brave and IAR) that use these optimal repairs, and for which query answering is in the first or second level of the polynomial hierarchy for a large class of logical theories.

coding
#171

Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language)
Yilong Chen, Yanxi Xie, Zitian Gao, He Xin, Yihao Xiao…
4.2
I 4.0 Im 4.0 P 3.5

Large token-indexed lookup tables provide a compute-decoupled scaling path, but their practical gains are often limited by poor parameter efficiency and rapid memory growth. We attribute these limitations to Zipfian under-training of the long tail, heterogeneous demand across layers, and "slot collapse" that produces redundant embeddings. To address this, we propose X-GRAM, a frequency-aware dynamic token-injection framework. X-GRAM employs hybrid hashing and alias mixing to compress the tail while preserving head capacity, and refines retrieved vectors via normalized SwiGLU ShortConv to extract diverse local n-gram features.
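The tail-compression idea — dedicated embedding rows for frequent "head" tokens, a small shared hashed table for the Zipfian long tail — can be sketched generically. This is the standard hashed-embedding trick, not X-GRAM's actual hybrid hashing and alias mixing; the class and parameter names are hypothetical.

```python
import zlib

class HashedTailEmbedding:
    """Head tokens get dedicated rows; long-tail tokens share a small
    hashed table, trading collisions for memory. (Generic sketch, not
    the X-GRAM algorithm itself.)"""

    def __init__(self, head_vocab, n_buckets, dim):
        self.head = {tok: [0.0] * dim for tok in head_vocab}   # dedicated rows
        self.buckets = [[0.0] * dim for _ in range(n_buckets)]  # shared tail rows
        self.n_buckets = n_buckets

    def row(self, token):
        if token in self.head:
            return self.head[token]
        # deterministic hash so a tail token always maps to the same bucket
        return self.buckets[zlib.crc32(token.encode()) % self.n_buckets]

emb = HashedTailEmbedding(head_vocab=["the", "of"], n_buckets=4, dim=8)
```

Memory scales with `len(head_vocab) + n_buckets` rather than the full vocabulary, which is the point of the compute-decoupled scaling path the abstract mentions.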

#172

From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language)
Minh Duc Bui, Xenia Heilmann, Mattia Cerrato, Manuel Mager, Katharina von der Wense
4.2
I 4.0 Im 4.0 P 3.5

Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection.

benchmark · codegen
#173

UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language)
Darya Hryhoryeva, Amaia Zurinaga, Hamidreza Jamalabadi, Iryna Gurevych
4.2
I 4.0 Im 4.0 P 3.5

This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings.

#174

Job Skill Extraction via LLM-Centric Multi-Module Framework

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)
Guojing Li, Zichuan Fu, Junyi Li, Faxue Liu, Wenxia Zhou…
4.2
I 4.0 Im 3.5 P 3.5

Span-level skill extraction from job advertisements underpins candidate-job matching and labor-market analytics, yet generative large language models (LLMs) often yield malformed spans, boundary drift, and hallucinations, especially with long-tail terms and cross-domain shift. We present SRICL, an LLM-centric framework that combines semantic retrieval (SR), in-context learning (ICL), and supervised fine-tuning (SFT) with a deterministic verifier. SR pulls in-domain annotated sentences and definitions from ESCO to form format-constrained prompts that stabilize boundaries and handle coordination. SFT aligns output behavior, while the verifier enforces pairing, non-overlap, and BIO legality with minimal retries.
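The deterministic verifier's "BIO legality" condition can be sketched as a small predicate: in a BIO tag sequence, every `I-X` must directly follow a `B-X` or `I-X` of the same type. This is the generic BIO scheme only; the paper's full constraint set (pairing, non-overlap, retry policy) is not reproduced here.

```python
def bio_legal(tags):
    """True iff a BIO tag sequence is well-formed: every I-X must
    directly follow a B-X or I-X of the same entity type X."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            if prev not in ("B-" + tag[2:], tag):
                return False              # orphaned or type-switching I- tag
        elif tag != "O" and not tag.startswith("B-"):
            return False                  # tag outside the BIO scheme
        prev = tag
    return True
```

A verifier like this can reject a malformed LLM output and trigger one of the "minimal retries" the abstract describes.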

finetune
#175

How English Print Media Frames Human-Elephant Conflicts in India

Research 2026-04-23 arXiv cs.CL (Computation & Language)
Bonala Sai Punith, Salveru Jayati, Garima Shakya, Shubham Kumar Nigam
4.2
I 4.0 Im 3.5 P 3.5

Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025.

transformer
#176

Reasoning Primitives in Hybrid and Non-Hybrid LLMs

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)
Shivam Rawat, Lucie Flek, Florian Mai, Nicholas Kluge Corrêa
4.2
I 4.0 Im 3.5 P 3.5

Reasoning in large language models is often treated as a monolithic capability, but its observed gains may arise from more basic operations. We study reasoning through two such primitives, recall and state-tracking, and ask whether hybrid architectures that combine attention-based retrieval with recurrent state updates are better suited than attention-only models for tasks that jointly require both. Using matched Olmo3 transformer and hybrid models in instruction-tuned and reasoning-augmented variants, we evaluate these models on a set of controlled tasks involving a mixture of state-tracking and recall primitives, such as state-based recall.

transformer
#177

Decoupled DiLoCo for Resilient Distributed Pre-training

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)
Arthur Douillard, Keith Rush, Yani Donchev, Zachary Charles, Nova Fallen…
4.2
I 4.5 Im 3.5 P 3.5

Modern large-scale language model pre-training relies heavily on the single program multiple data (SPMD) paradigm, which requires tight coupling across accelerators. Due to this coupling, transient slowdowns, hardware failures, and synchronization overhead stall the entire computation, wasting significant compute time at scale. While recent distributed methods like DiLoCo reduced communication bandwidth, they remained fundamentally synchronous and vulnerable to these system stalls. To address this, we introduce Decoupled DiLoCo, an evolution of the DiLoCo framework designed to break the lock-step synchronization barrier and go beyond SPMD to maximize training goodput.
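The DiLoCo pattern the paper builds on — workers take many local optimizer steps, then a server treats the averaged parameter delta as a "pseudo-gradient" and applies an outer momentum update — can be sketched on a toy scalar quadratic. Plain SGD stands in for the inner optimizer, and Decoupled DiLoCo's asynchrony is not modeled; all hyperparameters are illustrative.

```python
def local_sgd(w, target, steps=10, lr=0.1):
    # inner optimizer: plain SGD on this worker's loss (w - target)^2
    for _ in range(steps):
        w -= lr * 2.0 * (w - target)
    return w

w_global, momentum = 0.0, 0.0
targets = [2.9, 3.1]              # each worker's shard pulls toward a different optimum
for _ in range(10):               # outer (communication) rounds
    deltas = [local_sgd(w_global, t) - w_global for t in targets]
    pseudo_grad = sum(deltas) / len(deltas)   # averaged parameter delta
    momentum = 0.6 * momentum + pseudo_grad   # outer momentum step
    w_global += 0.5 * momentum                # outer learning rate
```

Communication happens once per outer round instead of once per step, which is the bandwidth saving; the synchronization barrier at each outer round is exactly what Decoupled DiLoCo aims to break.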

#178

Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation

Research 2026-04-23 arXiv cs.CL (Computation & Language)
Michele Miranda, Xinlan Yan, Nishant Mishra, Rachel Murphy, Ameen Abu-Hanna…
4.2
I 4.0 Im 4.0 P 3.5

Protecting patient privacy in clinical narratives is essential for enabling secondary use of healthcare data under regulations such as GDPR and HIPAA. While manual de-identification remains the gold standard, it is costly and slow, motivating the need for automated methods that combine privacy guarantees with high utility. Most automated text de-identification pipelines employ named entity recognition (NER) to identify protected entities for redaction.

#179

MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)
Maziar Kianimoghadam Jouneghani
4.2
I 4.0 Im 3.5 P 3.5

We present a systematic study of multilingual polarization detection across 22 languages for SemEval-2026 Task 9 (Subtask 1), contrasting multilingual generalists with language-specific specialists and hybrid ensembles. While a standard generalist like XLM-RoBERTa suffices when its tokenizer aligns with the target text, it may struggle with distinct scripts (e.g., Khmer, Odia) where monolingual specialists yield significant gains. Rather than enforcing a single universal architecture, we adopt a language-adaptive framework that switches between multilingual generalists, language-specific specialists, and hybrid ensembles based on development performance.

#180

mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code

Research 2026-04-23 arXiv cs.CL (Computation & Language)
Adam Skurla, Dominik Macko, Jakub Simko
4.2
I 4.0 Im 3.5 P 3.5

Multi-domain detection of machine-generated code snippets in various programming languages is a challenging task. SemEval-2026 Task 13 approaches this challenge from various angles, as a binary detection problem as well as attribution of the source. Specifically, its subtasks also cover generator LLM family detection, hybrid code co-generated by humans and machines, and adversarially modified code that hides its origin.

#181

A Case Study in Recovery of Drones using Discrete-Event Systems

Research 2026-04-23 arXiv cs.RO (Robotics)
Liam P. Burns, Dayse M. Cavalcanti, Felipe G. Cabral, Max H. de Queiroz, Melissa Greeff…
4.2
I 4.5 Im 3.5 P 3.5

Discrete-event systems and supervisory control theory provide a rigorous framework for specifying correct-by-construction behavior. However, their practical application to swarm robotics remains largely underexplored. In this paper, we investigate a topological recovery method based on discrete-event systems within a swarm robotics context. We propose a hybrid architecture that combines a high-level discrete-event-systems supervisor with a low-level continuous controller, allowing lost drones to safely recover from fault or attack events and re-enter a controlled region. The method is demonstrated using ten simulated UAVs in the py-bullet-drones framework.

robot
#182

Fine-Tuning Regimes Define Distinct Continual Learning Problems

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Paul-Tiberiu Iordache, Elena Burceanu
4.2
I 4.0 Im 4.0 P 3.5

Continual learning (CL) studies how models acquire tasks sequentially while retaining previously learned knowledge. Despite substantial progress in benchmarking CL methods, comparative evaluations typically keep the fine-tuning regime fixed. In this paper, we argue that the fine-tuning regime, defined by the trainable parameter subspace, is itself a key evaluation variable. We formalize adaptation regimes as projected optimization over fixed trainable subspaces, showing that changing the trainable depth alters the effective update signal through which both current task fitting and knowledge preservation operate.

benchmark · finetune
#183

On the algebra of Koopman eigenfunctions and on some of their infinities

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Zahra Monfared, Saksham Malhotra, Sekiya Hajime, Ioannis Kevrekidis, Felix Dietrich
4.2
I 4.0 Im 3.5 P 3.5

For continuous-time dynamical systems with reversible trajectories, the nowhere-vanishing eigenfunctions of the Koopman operator of the system form a multiplicative group. Here, we exploit this property to accelerate the systematic numerical computation of the eigenspaces of the operator. Given a small set of (so-called "principal") eigenfunctions that are approximated conventionally, we can obtain a much larger set by constructing polynomials of the principal eigenfunctions. This enriches the set, and thus allows us to more accurately represent application-specific observables. Often, eigenfunctions exhibit localized singularities (e.g.
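The multiplicative-group property follows from the Koopman operator being a composition operator, hence multiplicative over pointwise products. A one-line sketch (standard Koopman notation, with $\Phi^t$ the flow map):

```latex
\mathcal{K}^t\!\left(\varphi_1^{a}\varphi_2^{b}\right)
  = \left(\varphi_1^{a}\varphi_2^{b}\right)\circ\Phi^t
  = \left(\varphi_1\circ\Phi^t\right)^{a}\left(\varphi_2\circ\Phi^t\right)^{b}
  = e^{(a\lambda_1 + b\lambda_2)t}\,\varphi_1^{a}\varphi_2^{b}.
```

So from principal eigenfunctions $\{\varphi_i\}$ with eigenvalues $\{\lambda_i\}$, every product $\varphi_1^{a_1}\cdots\varphi_n^{a_n}$ is again an eigenfunction, with eigenvalue $e^{(\sum_i a_i\lambda_i)t}$; the nowhere-vanishing condition is what makes negative and fractional powers well-defined.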

#184

An effective variant of the Hartigan $k$-means algorithm

Research 2026-04-23 arXiv cs.LG (Machine Learning)
François Clément, Stefan Steinerberger
4.2
I 4.0 Im 3.5 P 3.5

The k-means problem is perhaps the classical clustering problem and is often synonymous with Lloyd's algorithm (1957). It has become clear that Hartigan's algorithm (1975) gives better results in almost all cases; Telgarsky and Vattani note a typical improvement of 5%–10%. We point out that a very minor variation of Hartigan's method leads to another 2%–5% improvement; the improvement tends to become larger when either the dimension or k increases.
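Hartigan's method differs from Lloyd's in that it moves one point at a time using the exact change in total within-cluster cost — with the n/(n−1) and n/(n+1) correction factors for removing and adding a point — and updates centroids immediately. A minimal 1-D sketch of the classic Hartigan pass, on illustrative data (the paper's variant is a further tweak not shown here):

```python
def hartigan(points, assign, k, sweeps=10):
    clusters = [[i for i, a in enumerate(assign) if a == c] for c in range(k)]

    def centroid(c):
        return sum(points[i] for i in clusters[c]) / len(clusters[c])

    for _ in range(sweeps):
        moved = False
        for i, x in enumerate(points):
            a = assign[i]
            if len(clusters[a]) == 1:
                continue                      # never empty a cluster
            na = len(clusters[a])
            # exact cost drop from removing x (centroid includes x)
            gain_leave = na / (na - 1) * (x - centroid(a)) ** 2
            best_c, best_cost = a, gain_leave
            for b in range(k):
                if b == a:
                    continue
                nb = len(clusters[b])
                # exact cost rise from adding x (centroid excludes x)
                cost_join = nb / (nb + 1) * (x - centroid(b)) ** 2
                if cost_join < best_cost:
                    best_c, best_cost = b, cost_join
            if best_c != a:                   # net cost change is negative: move
                clusters[a].remove(i)
                clusters[best_c].append(i)
                assign[i] = best_c
                moved = True
        if not moved:
            break
    return assign

pts = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
labels = hartigan(pts, [0, 1, 0, 1, 0, 1], k=2)
```

Because each move uses the exact cost change (not the Lloyd approximation of assigning to the nearest current centroid), Hartigan can escape configurations where Lloyd's algorithm is already stuck.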

#185

Compliance Moral Hazard and the Backfiring Mandate

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Jian Ni, Lecheng Zheng, John R Birge
4.2
I 4.0 Im 4.0 P 3.5

Competing firms that serve shared customer populations face a fundamental information aggregation problem: each firm holds fragmented signals about risky customers, but individual incentives impede efficient collective detection. We develop a mechanism design framework for decentralized risk analytics, grounded in anti-money laundering in banking networks. Three strategic frictions distinguish our setting: compliance moral hazard, adversarial adaptation, and information destruction through intervention.

benchmark
#186

Transferable Physics-Informed Representations via Closed-Form Head Adaptation

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Jian Cheng Wong, Isaac Yin Chung Lai, Pao-Hsiung Chiu, Chin Chun Ooi, Abhishek Gupta…
4.2
I 4.0 Im 3.5 P 3.5

Physics-informed neural networks (PINNs) have garnered significant interest for their potential in solving partial differential equations (PDEs) that govern a wide range of physical phenomena. By incorporating physical laws into the learning process, PINN models have demonstrated the ability to learn physical outcomes reasonably well. However, current PINN approaches struggle to predict or solve new PDEs effectively when there is a lack of training examples, indicating they do not generalize well to unseen problem instances.

#187

Neural surrogates for crystal growth dynamics with variable supersaturation: explicit vs. implicit conditioning

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Matteo Rigoni, Daniele Lanzoni, Francesco Montalenti, Roberto Bergamaschini
4.2
I 4.0 Im 3.5 P 3.5

Simulations of crystal growth are performed by using Convolutional Recurrent Neural Network surrogate models, trained on a dataset of time sequences computed by numerical integration of Allen-Cahn dynamics including faceting via kinetic anisotropy. Two network architectures are developed to take into account the effects of a variable supersaturation value. The first infers it implicitly by processing an input mini-sequence of a few evolution frames and then returns a consistent continuation of the evolution.

#188

Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask

Efficiency 2026-04-23 arXiv cs.LG (Machine Learning)
Ashley N. Abraham, Andrew Strelzoff, Haley R. Dozier, Althea C. Henslee, Mark A. Chappell
4.2
I 4.0 Im 3.5 P 3.5

Large-scale Nearest Neighbor (NN) search, though widely utilized in the similarity search field, remains challenged by the computational limitations inherent in processing large-scale data. In an effort to decrease the computational expense needed, Approximate Nearest Neighbor (ANN) search is often used in applications that do not require exact similarity search but can instead rely on an approximation. Product Quantization (PQ) is a memory-efficient ANN method effective for clustering datasets of all sizes. Clustering large-scale, high-dimensional data requires heavy computational expense, in both memory cost and execution time.
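The PQ step itself (independent of the Dask parallelization that is this paper's contribution) can be sketched as follows: split each vector into M sub-vectors and replace each with the index of its nearest codeword in a per-subspace codebook. Codebooks here are random for brevity, whereas real PQ learns them with per-subspace k-means; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

D, M, K = 8, 4, 16            # vector dim, subspaces, codewords per subspace
d = D // M
codebooks = rng.normal(size=(M, K, d))   # stand-in for k-means-trained codebooks

def pq_encode(x):
    codes = []
    for m in range(M):
        sub = x[m * d:(m + 1) * d]
        dists = ((codebooks[m] - sub) ** 2).sum(axis=1)  # distance to each codeword
        codes.append(int(dists.argmin()))
    return codes

def pq_decode(codes):
    # approximate reconstruction: concatenate the chosen codewords
    return np.concatenate([codebooks[m][c] for m, c in enumerate(codes)])

x = rng.normal(size=D)
codes = pq_encode(x)          # M small integers instead of D floats
x_hat = pq_decode(codes)
```

Each vector is stored as M log2(K)-bit codes (here 4 bytes would overstate it: 4 codes of 4 bits), which is the memory efficiency that makes PQ attractive at scale and why the paper's concern shifts to parallelizing the training and indexing cost.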

quantization
#189

Geometric Characterisation and Structured Trajectory Surrogates for Clinical Dataset Condensation

Efficiency 2026-04-23 arXiv cs.LG (Machine Learning)
Pafue Christy Nganjimi, Andrew Soltan, Danielle Belgrave, Lei Clifton, David Clifton…
4.2
I 4.0 Im 3.5 P 3.5

Dataset condensation constructs compact synthetic datasets that retain the training utility of large real-world datasets, enabling efficient model development and potentially supporting downstream research in governed domains such as healthcare. Trajectory matching (TM) is a widely used condensation approach that supervises synthetic data using changes in model parameters observed during training on real data, yet the structure of this supervision signal remains poorly understood.

#190

A temporal deep learning framework for calibration of low-cost air quality sensors

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Arindam Sengupta, Tony Bush, Ben Marner, Jose Miguel Pérez, Soledad Le Clainche
4.2
I 4.0 Im 3.5 P 3.5

Low-cost air quality sensors (LCS) provide a practical alternative to expensive regulatory-grade instruments, making dense urban monitoring networks possible. Yet their adoption is limited by calibration challenges, including sensor drift, environmental cross-sensitivity, and variability in performance from device to device. This work presents a deep learning framework for calibrating LCS measurements of PM$_{2.5}$, PM$_{10}$, and NO$_2$ using a Long Short-Term Memory (LSTM) network, trained on co-located reference data from the OxAria network in Oxford, UK.

coding
#191

Conditional anomaly detection with soft harmonic functions

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Michal Valko, Branislav Kveton, Hamed Valizadegan, Gregory F. Cooper, Milos Hauskrecht
4.2
I 4.0 Im 3.5 P 3.5

In this paper, we consider the problem of conditional anomaly detection that aims to identify data instances with an unusual response or a class label. We develop a new non-parametric approach for conditional anomaly detection based on the soft harmonic solution, with which we estimate the confidence of the label to detect anomalous mislabeling. We further regularize the solution to avoid the detection of isolated examples and examples on the boundary of the distribution support.

#192

Novelty-Based Generation of Continuous Landscapes with Diverse Local Optima Networks

Research 2026-04-23 arXiv cs.NE (Neural & Evolutionary Computing)
Kippei Mizuta, Shoichiro Tanaka, Shuhei Tanaka, Toshiharu Hatanaka
4.2
I 4.0 Im 4.2 P 3.5

Local Optima Networks (LONs) represent the global structure of search spaces as graphs, but their construction requires iterative execution of a search algorithm to find local optima and approximate transitions between Basins of Attraction (BoAs). In continuous optimization, this high computational cost prevents systematic investigation of the relationship between LON features and evolutionary algorithm performance. To address this issue, we propose an alternative definition of BoAs for Max-Set of Gaussians (MSG) landscapes with explicitly tunable multimodality. This bypasses search-based BoA identification, enabling low-cost LON construction.

multimodal
#193

On a class of constrained particle filters for continuous-discrete state space models

Research 2026-04-23 arXiv — State Space Models
Utku Erdogan, Gabriel J. Lord, Joaquin Miguez
4.2
I 4.0 Im 3.5 P 3.5

Particle filters (PFs) are recursive Monte Carlo algorithms for Bayesian tracking and prediction in state space models. This paper addresses continuous-discrete filtering problems, where the hidden state evolves as an Itô stochastic differential equation (SDE) and observations arrive at discrete times. We propose a novel class of constrained PFs that enforce compact support on the state at each observation instant, thereby limiting exploration to plausible regions of the state space. Unlike earlier approaches that truncate the likelihood, the proposed method constrains the dynamics directly, yielding improved numerical stability.
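A bootstrap particle filter with a crude compact-support constraint can be sketched as follows. Euler-Maruyama discretizes a toy SDE dx = −x dt + dW between observations, and particles are simply projected onto [−B, B] at each observation time — a stand-in for the paper's method, which constrains the dynamics directly rather than clipping; all numbers are illustrative.

```python
import random, math

random.seed(0)

B, dt, n_steps, n_particles = 2.0, 0.01, 20, 500

def propagate(x):
    for _ in range(n_steps):                  # Euler-Maruyama between observations
        x += -x * dt + math.sqrt(dt) * random.gauss(0, 1)
    return min(B, max(-B, x))                 # project onto the compact support

def step(particles, y, obs_sd=0.5):
    particles = [propagate(p) for p in particles]
    # importance weights from the Gaussian observation likelihood
    weights = [math.exp(-0.5 * ((y - p) / obs_sd) ** 2) for p in particles]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(particles, probs, k=len(particles))  # multinomial resampling

particles = [random.uniform(-B, B) for _ in range(n_particles)]
for y in [0.5, 0.4, 0.6]:                     # toy observation sequence
    particles = step(particles, y)
est = sum(particles) / len(particles)         # filtered state estimate
```

The constraint keeps every particle in the plausible region, which is what prevents the weight degeneracy that unconstrained proposals can suffer when the dynamics wander far from the observations.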

ssm
#194

High-energy photon hologram of a photon gas

Research 2026-04-23 arXiv — Mechanistic Interpretability
P. O. Kazinski, A. A. Sokolov
4.2
I 4.0 Im 3.5 P 3.5

The photon hologram of a one-particle density matrix of a photon gas is derived including the case where the energy of a probe photon is above the electron-positron pair creation threshold. The explicit expressions for the holograms of a photon gas with one-particle density matrix in the form of a single Gaussian and of coherent and incoherent lattices of Gaussians are obtained. The conditions for resonant cones of coherent scattering by coherent and incoherent lattices are found. These conditions turn out to be different.

#195

Optical nonlinear anomalous Hall effect reveals the hidden spin order in antiferromagnets

Research 2026-04-23 arXiv — Mechanistic Interpretability
A. Schmid, D. Siebenkotten, D. Dai, J. Godinho, T. Ostatnický…
4.2
I 4.0 Im 3.5 P 3.5

Reading antiferromagnetic order remains a central obstacle for antiferromagnetic memory and logic because zero net magnetisation precludes conventional magnetic readout. Domain imaging typically relies on x-ray magnetic linear dichroism (XMLD) microscopy at synchrotron sources, but XMLD is even under time reversal and cannot distinguish 180°-reversed magnetic states. Here we report the first experimental observation of the optical nonlinear anomalous Hall effect, predicted for antiferromagnets with combined parity - time-reversal ($PT$) symmetry.

#196

Electronic and Vibrational Properties of On-Surface Synthesized Gulf-Edged Chiral Graphene Nanoribbons

Research 2026-04-23 arXiv — Mechanistic Interpretability
Xuanchen Li, Amogh Kinikar, Vikas Sharma, Andres Ortega Guerrero, George F. S. Whitehead…
4.2
I 4.5 Im 3.5 P 3.5

On-surface synthesis enables the fabrication of graphene nanoribbons (GNRs) with atomic precision, allowing their electronic, optical, and magnetic properties to be tuned by engineering edge structure and width. Progress on the synthesis of chiral GNRs has nevertheless remained limited, largely because existing precursor designs rely on laterally fused acene units and cannot access edge topologies beyond armchair and zigzag. Here, we introduce a new on-surface synthesis motif that yields a gulf-edged chiral GNR.

#197

Correlation between active regions' spectra at high radio frequencies and solar flare occurrences

Research 2026-04-23 arXiv — Mechanistic Interpretability
Sara Mulas, Alberto Pellizzoni, Marco Marongiu, Adriana Marcucci, Simona Righini…
4.2
I 4.0 Im 3.5 P 3.5

High radio frequency observations with the Italian network of large single-dish radio telescopes resulted in ~450 solar images between 2018 and 2023 in the K-band frequency range (18-26 GHz). Solar radio mapping at these frequencies allows probing of the Active Regions' (ARs) chromospheric magnetic field close to the Transition Region, where strong flares and coronal mass ejection events occur.

#198

Performance characterisation of the Hamamatsu R760 photomultiplier tube for the PLUME detector

Research 2026-04-23 arXiv — Mechanistic Interpretability
A. Bellavista, A. Carbone, V. Chaumat, F. Ferrari, T. Nguyen-Trung…
4.2
I 4.0 Im 3.5 P 3.5

The Probe for Luminosity Measurement detector is a novel luminometer designed to monitor the luminosity and beam conditions of the Large Hadron Collider at the interaction point of the LHCb experiment, starting from Run 3. The detector is based on a hodoscope composed of 48 Hamamatsu R760 photomultiplier tubes, which detect the Cherenkov light produced by charged particles originating from the interaction region. The accurate and stable operation of these sensors is essential to ensure reliable luminosity measurements throughout the full data-taking period.

#199

Impact of Primordial Black Hole population on 21 cm observables at high redshift

Research 2026-04-23 arXiv — Mechanistic Interpretability
Atrideb Chatterjee, Barun Maity, Koushiki
4.2
I 4.0 Im 3.5 P 3.5

The 21-cm signal, one of the most promising probes of the high-redshift Universe, has traditionally been modelled without accounting for the effects of active galactic nuclei (AGN) in the pre-JWST era, primarily due to the lack of observational evidence for AGNs at z > 6. However, following the discovery of several AGNs at redshifts as high as z ~ 10 by JWST, it has become imperative to incorporate the impact of these early AGNs when predicting the 21-cm signal.

#200

Constraining dark matter self-interaction from kinetic heating in neutron stars

Research 2026-04-23 arXiv — Mechanistic Interpretability
Sambo Sarkar
4.2
I 4.0 Im 3.5 P 3.5

Dark matter search strategies have started advancing towards the neutrino fog. In this regard, compact objects such as neutron stars have already demonstrated their ability to probe such low DM-nucleon cross-sections from dark-matter-induced effects. In the optically thin limit, the effect of dark matter self-interaction becomes relevant and may assist the capture and thermalization of dark matter inside stars, imparting observable changes on neutron star temperatures.

#201

Exploring the statistical anisotropy of primordial curvature perturbations with pulsar timing arrays

Research 2026-04-23 arXiv — Mechanistic Interpretability
Fengting Xie, Zhi-Chao Zhao, Qing-Hua Zhu, Xin Li
4.2
I 4.0 Im 3.5 P 3.5

The recent detection of a stochastic gravitational wave background by pulsar timing arrays has opened a new window into understanding supermassive black hole binaries and into probing the early universe. Recently, pulsar timing array (PTA) collaborations have been further paving the way to probe anisotropies in the stochastic gravitational wave background. This study investigates dipole-type statistical anisotropy in the primordial power spectrum within a phenomenological framework.

#202

Dilepton Production as a Probe of Pion Condensation in Hot and Dense QCD Matter

Research 2026-04-23 arXiv — Mechanistic Interpretability
Aritra Bandyopadhyay, Chowdhury Aminul Islam, Krzysztof Redlich, Chihiro Sasaki
4.2
I 4.0 Im 3.5 P 3.5

We investigate dilepton production from an isospin-asymmetric hot and dense medium in order to explore the role of isospin imbalance in electromagnetic spectral properties. We focus in particular on modifications of the dilepton production rate associated with the onset of pion condensation, which can occur in the presence of a finite isospin chemical potential. We employ the Nambu--Jona-Lasinio model with isoscalar--vector interaction. We examine the phase structure in the $T-μ_I$ plane and estimate the vector current correlator--resummed dilepton rate for an effective quark chemical potential.

#203

Multi-wavelength study of EP250416a / GRB 250416C: An Optically Dark Long GRB with a Late Jet Break

Research 2026-04-23 arXiv — Mechanistic Interpretability
Guoying Zhao, Duo-Le Cao, Rong-Feng Shen, Hui Sun, Chi-Chuan Jin…
4.2
I 4.0 Im 3.5 P 3.5

We present a multi-wavelength study of the $γ$/X-ray transient EP250416a (also designated GRB 250416C), triggered by the Einstein Probe (EP) Wide-field X-ray Telescope and also by SVOM and Konus-Wind. Observations spanning the gamma-ray, X-ray, and optical bands facilitated detailed analysis of the burst's prompt emission, afterglow evolution, and physical origin. EP250416a exhibits a burst duration of 30 s in X-rays and 17.7 s in gamma-rays, and joint spectral fitting of 0.5-5000 keV data gives $E\rm_{peak}=342_{-232}^{+90}$ keV.

#204

XRISM High-Resolution X-ray Spectroscopy of Cygnus X-1 -- Orbital and Short-Term Variability of Iron Absorption

Research 2026-04-23 arXiv — Mechanistic Interpretability
Kaito Ninoyu, Shinya Yamada, Natalie Hell, Elisa Costantini, Oluwashina Adegoke…
4.2
I 4.0 Im 3.5 P 3.5

We present the first high-resolution spectroscopy of the black hole high-mass X-ray binary Cygnus X-1 with XRISM, including orbital-phase-resolved analyses and tentative evidence of short-term variability in the Fe-K band on second timescales. Using data from the Performance Verification phase in April 2024, we analyzed spectral variability across orbital phases with the Resolve microcalorimeter and the Xtend CCD imager. The unprecedented resolution of Resolve reveals variability in highly ionized Fe absorption lines.

#205

Quantum jump correlations in long-range dissipative spin systems

Research 2026-04-23 arXiv — Mechanistic Interpretability
Giulia Salatino, Anna Delmonte, Zejian Li, Rosario Fazio, Alberto Biella
4.2
I 4.0 Im 3.5 P 3.5

We characterize nonequilibrium phases in long-range dissipative spin systems through the statistical properties of quantum jump trajectories. While the average dynamics governed by the Lindblad master equation provides access to steady-state expectation values of order parameters, the quantum trajectory framework reveals features encoded in the spatial and temporal correlations of detection events. Focusing on a model exhibiting a paramagnetic-to-ferromagnetic phase transition, we investigate the full counting statistics of quantum jumps using a tilted Lindbladian approach.
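For reference, the tilted-Lindbladian route to full counting statistics takes a standard form, sketched here for a single counting field $s$ conjugate to the total number of jumps $N(t)$; the paper's specific generator and counting choices may differ:

```latex
% Moment generating function of the jump count N(t)
Z(s,t) \;=\; \mathbb{E}\!\left[e^{sN(t)}\right] \;=\; \mathrm{Tr}\!\left[e^{t\mathcal{L}_s}\rho_0\right],
\qquad
\mathcal{L}_s\rho \;=\; -i[H,\rho] \;+\; \sum_k \Big(e^{s}\,J_k\rho J_k^{\dagger} \;-\; \tfrac{1}{2}\{J_k^{\dagger}J_k,\rho\}\Big)
% Cumulants of N(t) follow from derivatives of \log Z(s,t) at s = 0.
```

At $s=0$ this reduces to the ordinary Lindblad master equation; the tilt $e^{s}$ on the jump term is what biases trajectories by their detection-event count.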

#206

A Multi-Stage Warm-Start Deep Learning Framework for Unit Commitment

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Muhy Eddin Za'ter, Anna Van Boven, Bri-Mathias Hodge, Kyri Baker
4.1
I 4.0 Im 3.5 P 3.5

Maintaining instantaneous balance between electricity supply and demand is critical for grid reliability and stability. System operators achieve this by solving the Unit Commitment (UC) problem, a high-dimensional, large-scale Mixed-Integer Linear Programming (MILP) problem that is strictly governed by the grid's physical constraints. As grids integrate variable renewable sources and new technologies such as long-duration storage, UC must be solved optimally over multi-day horizons and potentially with greater frequency.

transformer
#207

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Guangxiang Zhao, Qilong Shi, Xusen Xiao, Xiangzheng Zhang, Tong Yang…
4.1
I 4.0 Im 3.5 P 3.5

Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliberation and trial-and-error exploration, and to retrieve these skills at inference time to guide future reasoning. Unlike the prevailing \emph{reasoning from scratch} paradigm, our approach first recalls relevant skills for each query, helping the model avoid redundant detours and focus on effective solution paths.

codingcot
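As a rough sketch of the retrieve-then-reason loop (the skill texts, tags, and bag-of-words similarity below are illustrative stand-ins for whatever skill store and embedding retriever the paper actually uses):

```python
from collections import Counter
import math

# Hypothetical skill store: short reusable reasoning skills distilled offline,
# each indexed by a tag string used for retrieval.
SKILLS = {
    "When a problem mixes units, convert everything to one unit first.":
        "unit conversion arithmetic word problem",
    "For divisibility questions, factor the number before testing candidates.":
        "number theory divisible divisibility factorization",
    "Drop an altitude to split an unknown triangle into right triangles.":
        "geometry triangle altitude right angle",
}

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve_skills(query, k=1):
    q = bow(query)
    return sorted(SKILLS, key=lambda s: cosine(q, bow(SKILLS[s])), reverse=True)[:k]

# Recalled skills are prepended to the prompt instead of being re-derived
# from scratch at inference time.
query = "Is 1001 divisible by 13? Show your reasoning."
prompt = "Relevant skills:\n" + "\n".join(retrieve_skills(query)) + "\n\nProblem: " + query
```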
#208

To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Nevena Lazić, Liam Fowl, András György, Csaba Szepesvári
4.1
I 4.0 Im 3.5 P 3.5

We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reasoning problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that were not observed during training, and it was shown that one reason behind this is the difficulty of copying (or generating) unseen tokens. We show both theoretically and empirically that a particular representational collapse also has a crucial role: the unembeddings (last-layer weights) of unseen tokens collapse to nearly the same vector during training.

transformermech-interp
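The collapse diagnostic is easy to reproduce on synthetic data: if unseen-token unembeddings all drift toward one shared direction, their mean pairwise cosine similarity approaches 1, while diverse rows stay near 0. A minimal numpy sketch of the measurement (the matrices here are synthetic stand-ins, not weights from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# "Seen" tokens get diverse random vectors; "unseen" tokens collapse toward
# one shared direction plus small noise -- the reported pathology.
seen = rng.normal(size=(20, d))
shared = rng.normal(size=d)
unseen = shared + 0.01 * rng.normal(size=(20, d))

def mean_pairwise_cosine(W):
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    S = W @ W.T
    n = len(W)
    return (S.sum() - n) / (n * (n - 1))   # mean of off-diagonal entries

print(mean_pairwise_cosine(seen))    # near 0: directions are diverse
print(mean_pairwise_cosine(unseen))  # near 1: directions have collapsed
```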
#209

Long-Horizon Manipulation via Trace-Conditioned VLA Planning

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Isabella Liu, An-Chieh Cheng, Rui Yan, Geng Chen, Ri-Zhao Qiu…
4.1
I 4.0 Im 3.5 P 3.5

Long-horizon manipulation remains challenging for vision-language-action (VLA) policies: real tasks are multi-step, progress-dependent, and brittle to compounding execution errors. We present LoHo-Manip, a modular framework that scales short-horizon VLA execution to long-horizon instruction following via a dedicated task-management VLM.

vlmvlarobotmanipulation
#210

A Compact Peristaltic Pump Based on Magneto-Elastic Hysteresis with Single Pneumatic Control

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Minjo Park, Metin Sitti
4.1
I 4.0 Im 3.5 P 3.5

Pumping fluids is fundamental to a wide range of industrial, environmental, and biomedical applications. Among various pumping mechanisms, peristaltic pumps enable efficient and safe fluid transport by deforming an elastic tube without direct contact with the working fluid. Although previous studies have introduced mechanical, pneumatic, or magnetic actuations to drive membrane deformation, these approaches often lead to complex pump architectures and control schemes. In this study, we present a soft membrane pump that achieves peristaltic motion through a single pneumatic input combined with an embedded passive magnet.

#211

Effects of Swarm Size Variability on Operator Workload

Robotics 2026-04-23 arXiv cs.RO (Robotics)
William Hunt, Aleksandra Landowska, Horia A. Maior, Sarvapali D. Ramchurn, Mohammad Soorati
4.1
I 4.0 Im 3.5 P 3.5

Real-world deployments of human--swarm teams depend on balancing operator workload to leverage human strengths without inducing overload. A key challenge is that swarm size is often dynamic: robots may join or leave the mission due to failures or redeployment, causing abrupt workload fluctuations. Understanding how such changes affect human workload and performance is critical for robust human--swarm interaction design. This paper investigates how the magnitude and direction of changes in swarm size influence operator workload.

robot
#212

A Bayesian Reasoning Framework for Robotic Systems in Autonomous Casualty Triage

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Szymon Rusiecki, Cecilia Morales, Pia Störy, Kimberly Elenberg, Leonard Weiss…
4.1
I 4.0 Im 3.5 P 3.5

Autonomous robots deployed in mass casualty incidents (MCI) face the challenge of making critical decisions based on incomplete and noisy perceptual data. We present an autonomous robotic system for casualty assessment that fuses outputs from multiple vision-based algorithms, estimating signs of severe hemorrhage, visible trauma, or physical alertness, into a coherent triage assessment. At the core of our system is a Bayesian network, constructed from expert-defined rules, which enables probabilistic reasoning about a casualty's condition even with missing or conflicting sensory inputs.

robot
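The fusion step can be pictured as a naive-Bayes update over noisy detector outputs; everything below (detector names, rates, prior) is a made-up miniature, not the paper's expert-defined network:

```python
# Toy Bayesian fusion: three noisy detectors report evidence about a binary
# state "severe hemorrhage". Each detector has a known true-positive and
# false-positive rate; a missing reading is skipped, which is how this style
# of update naturally handles absent evidence.
PRIOR = 0.1                 # assumed base rate of severe hemorrhage
DETECTORS = {               # name: (P(alert | bleed), P(alert | no bleed))
    "color_analysis": (0.85, 0.10),
    "wound_detector": (0.75, 0.15),
    "motion_alertness": (0.60, 0.30),
}

def posterior(readings):
    odds = PRIOR / (1 - PRIOR)
    for name, alert in readings.items():
        tpr, fpr = DETECTORS[name]
        if alert is None:                # sensor gave no reading: skip it
            continue
        if alert:
            odds *= tpr / fpr            # likelihood ratio for an alert
        else:
            odds *= (1 - tpr) / (1 - fpr)
    return odds / (1 + odds)

# Two detectors fire, one is missing: belief still updates coherently.
p = posterior({"color_analysis": True, "wound_detector": True, "motion_alertness": None})
```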
#213

X2-N: A Transformable Wheel-legged Humanoid Robot with Dual-mode Locomotion and Manipulation

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Yan Ning, Xingzhou Chen, Delong Li, Hao Zhang, Hanfu Gai…
4.1
I 4.0 Im 3.5 P 3.5

Wheel-legged robots combine the efficiency of wheeled locomotion with the versatility of legged systems, enabling rapid traversal over both continuous and discrete terrains. However, conventional designs typically employ fixed wheels as feet and limited degrees of freedom (DoFs) at the hips, resulting in reduced stability and mobility during legged locomotion compared to humanoids with flat feet. In addition, most existing platforms lack a full upper body with arms, which limits their ability to perform dexterous manipulation tasks.

rlrobotmanipulationhumanoid
#214

A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Kuan Xu, Ruimeng Liu, Yizhuo Yang, Denan Liang, Tongxing Jin…
4.1
I 4.0 Im 3.5 P 3.5

Bridging the gap between embodied intelligence and embedded deployment remains a key challenge in intelligent robotic systems, where perception, reasoning, and planning must operate under strict constraints on computation, memory, energy, and real-time execution. In vision-language navigation (VLN), existing approaches often face a fundamental trade-off between strong reasoning capabilities and efficient deployment on real-world platforms. In this paper, we present a deployable embodied VLN system that achieves both high efficiency and robust high-level reasoning on real-world robotic platforms.

vlmrobot
#215

Causality-Encoded Diffusion Models for Interventional Sampling and Edge Inference

Research 2026-04-23 arXiv stat.ML (Statistical ML)
Li Chen, Xiaotong Shen, Wei Pan
4.1
I 4.0 Im 3.8 P 3.5

Standard diffusion models are flexible estimators of complex distributions, but they do not encode causal structures and therefore do not by themselves support causal analysis. We propose a causality-encoded diffusion framework that incorporates a known directed acyclic graph by training conditional diffusion models consistent with the graph factorisation. The resulting sampler approximately recovers the observational distribution and enables interventional sampling by fixing intervened variables while propagating effects through the graph during reverse diffusion.

diffusion
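The do-semantics of interventional sampling can be shown on a toy SCM; this sketch substitutes a linear-Gaussian model for the paper's conditional diffusion models, but the propagation logic (fix the intervened node, resample only its descendants' mechanisms) is the same idea:

```python
import random

# Known DAG X -> Y -> Z with toy linear mechanisms. do(Y=y) overrides Y's
# mechanism while leaving X's and Z's mechanisms intact, so effects propagate
# only to descendants of Y.
def mean_z(do=None, n=10000, seed=0):
    rng = random.Random(seed)
    zs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        if do is not None and "Y" in do:
            y = do["Y"]                       # intervened node held fixed
        else:
            y = 2 * x + rng.gauss(0, 0.1)     # observational mechanism
        z = 3 * y + rng.gauss(0, 0.1)
        zs.append(z)
    return sum(zs) / n

obs = mean_z()                 # E[Z] is about 0 observationally
intv = mean_z(do={"Y": 1.0})   # E[Z | do(Y=1)] is about 3 * 1 = 3
```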
#216

Context Unrolling in Omni Models

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Ceyuan Yang, Zhijie Lin, Yang Zhao, Fei Xiao, Hao He…
4.1
I 4.0 Im 4.0 P 3.5

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity.

benchmarkmultimodal
#221

GSA announces a fresh cohort of Presidential Innovation Fellows

Government & Defense 2026-04-23 FedScoop — AI
4.1
I 4.0 Im 4.0 P 3.5

The General Services Administration announced 17 new Presidential Innovation Fellows on Thursday, refreshing the technologist-focused program. A release shared with FedScoop ahead of the announcement described the 2026 cohort as “experts from top tech companies, startups, and organizations around the country.” Per that announcement, the fellows will serve their yearlong tours of duty at 10 federal agencies. The PIF program is located under GSA’s Technology Transformation Services and has been around since 2012.

#222

A Formal Defense Pact in the Indo-Pacific Is the Wrong Answer

Government & Defense 2026-04-23 War on the Rocks
4.1
I 4.0 Im 4.0 P 3.5

The debate over how best to deter China in the western Pacific has reached a new level of ambition. Ely Ratner, a former senior defense official in the Biden administration, proposed a “Pacific Defense Pact” — a legally binding multilateral treaty among the United States, Japan, Australia, and the Philippines. This reflects serious concerns over China’s rise and its potential future use of force along the first island chain. The underlying diagnosis is sound: Existing U.S.

#224

Even More Guarantees for Variational Inference in the Presence of Symmetries

Research 2026-04-23 arXiv stat.ML (Statistical ML)
Lena Zellinger, Antonio Vergari
4.0
I 4.0 Im 3.5 P 3.5

When approximating an intractable density via variational inference (VI), the variational family is typically chosen as a simple parametric family that very likely does not contain the target. This raises the question: Under which conditions can we recover characteristics of the target despite misspecification? In this work, we extend previous results on robust VI with location-scale families under target symmetries. We derive sufficient conditions guaranteeing exact recovery of the mean when using the forward Kullback-Leibler divergence and $α$-divergences.
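One ingredient behind such mean-recovery guarantees shows up in a one-line stationarity calculation, sketched here for a Gaussian variational family (the paper treats general location-scale families and $α$-divergences):

```latex
\mathrm{KL}(p\,\|\,q_{\mu,\sigma})
 = -\,\mathbb{E}_{p}\big[\log q_{\mu,\sigma}(X)\big] + \mathrm{const},
\qquad
\frac{\partial}{\partial\mu}\,\mathrm{KL}(p\,\|\,q_{\mu,\sigma})
 = -\,\mathbb{E}_{p}\!\left[\frac{X-\mu}{\sigma^{2}}\right] = 0
\;\Longrightarrow\; \mu^{*} = \mathbb{E}_{p}[X]
```

That is, the forward KL matches the target mean whenever it exists, however badly the shape of the target is misspecified; the harder part, which the paper addresses, is extending such guarantees beyond this simple case.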

#225

Multiscale Super Resolution without Image Priors

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Daniel Fu, Gabby Litterio, Pedro Felzenszwalb, Rashid Zia
4.0
I 4.0 Im 3.5 P 3.5

We address the ambiguities in the super-resolution problem under translation. We demonstrate that combinations of low-resolution images at different scales can be used to make the super-resolution problem well posed. Such differences in scale can be achieved using sensors with different pixel sizes (as demonstrated here) or by varying the effective pixel size through changes in optical magnification (e.g., using a zoom lens).

#226

It's a big one

Frontier LLMs 2026-04-24 Simon Willison's Weblog
3.9
I 3.5 Im 3.0 P 4.0

This week's edition of my email newsletter (aka content from this blog delivered to your inbox) features 4 pelicans riding bicycles, 1 possum on an e-scooter, up to 5 raccoons with ham radios hiding in crowds, 5 blog posts, 8 links, 3 quotes and a new chapter of my Agentic Engineering Patterns guide. Tags: newsletter

#228
2.5
I 2.0 Im 1.5 P 3.0

GeForce NOW is doubling down on what matters most: gamers. This week’s upgrades bring smarter libraries, making it easier than ever for gamers to turn a PC collection into a cloud-powered flex. It starts with giving existing libraries time to shine. Gamers can bring the games they love to the cloud, stream them with high performance and see the value of a GeForce NOW membership grow with new games, rewards and features. First up, finding something to play gets an upgrade.

Items
228
Multi-source
50
Long-form (≥7.5)
4
Sources OK / attempted
57 / 77
Top category
Research (88)