
Wolf Digest — 2026-04-24

Coverage window: 2026-04-23 03:21 ET to 2026-04-24 03:02 ET
Friday, April 24, 2026
13m 27s · top-4 narrated briefing
Must-read · top 3
#1 · Frontier LLMs
GPT-5.5 launch: frontier model + Codex superapp (OpenAI)
OpenAI released GPT-5.5 on April 23, positioned as "a new class of intelligence for real work" and rolled out across ChatGPT and the Codex app, with API access held back pending additional safeguards. In the bake-off against Anthropic's Opus 4.7 from the prior…
Score 8.8
#2 · Frontier LLMs
DeepSeek V4-Pro and V4-Flash — largest open-weights model ever, aggressive pricing
DeepSeek ended its five-month silence since V3.2 Speciale with two preview models in the V4 series, both MIT-licensed, both 1-million-token context, both mixture-of-experts. V4-Pro is 1.6 trillion total parameters with 49 billion active, making it the largest …
Score 7.8
#3 · Agents & Tool Use
What We Learned Building Cloud Agents
Cognition published a long follow-up to its earlier posts on multi-agent patterns, this time dissecting the infrastructure demands of cloud-based software engineering agents. The core claim is that the natural starting point for a cloud agent — take a CLI agen…
Score 7.6
#1

GPT-5.5 launch: frontier model + Codex superapp (OpenAI)

Frontier LLMs 2026-04-24 Latent Space (swyx & Alessio) · NVIDIA AI Blog · OpenAI Research · Simon Willison's Weblog · One Useful Thing (Ethan Mollick)
8.8
I 8.0 Im 7.0 P 10

OpenAI released GPT-5.5 on April 23, positioned as "a new class of intelligence for real work" and rolled out across ChatGPT and the Codex app, with API access held back pending additional safeguards. In the bake-off against Anthropic's Opus 4.7 from the prior week, Artificial Analysis crowns GPT-5.5 the top independently validated model in the world on its Intelligence Index, with GPT-5.5 at medium reasoning scoring the same as Opus 4.7 at max reasoning at roughly one quarter of the cost to run the index end to end, about $1,200 versus $4,800. Gemini 3.1 Pro Preview scores the same at around $900. Coverage across OpenAI's own system card, Simon Willison's preview notes, Ethan Mollick's early-access write-up, the NVIDIA blog, Latent Space's AI News, and TechCrunch converged on a profile of stronger long-horizon execution, noticeably better agentic coding, broader computer use, and improved token efficiency. Ethan Mollick's illustrative test, a procedurally generated 3D simulation of a harbor town evolving from 3000 BCE to 3000 CE, was rendered as a genuinely evolving town only by GPT-5.5 Pro, which completed in 20 minutes what GPT-5.4 Pro took 33 minutes to do.

The release is also a superapp moment for Codex. OpenAI bundled GPT-5.5 with a major Codex refresh that folds in the capabilities of its now-defunct Prism acquisition, adds built-in browser control, and ships with schedules, triggers, plugins, skills, and a unified project / thread / file workspace. Over 10,000 NVIDIA employees across engineering, product, legal, marketing, finance, sales, HR, and operations are already using GPT-5.5-powered Codex inside a Jensen-led company-wide rollout. NVIDIA's disclosure notes that Codex is served on GB200 NVL72 rack-scale systems, which the post frames as delivering roughly 35 times lower cost per million tokens and 50 times higher token output per second per megawatt versus prior-generation hardware, the economics that make enterprise-scale frontier inference viable. The practical signal is that ongoing engineering work with async gaps (pull-request opens, continuous-integration waits, review rounds) now fits inside one tool.

Pricing lands at about $5 per million input tokens and $30 per million output tokens for GPT-5.5 Pro, which is higher than Opus 4.7, but Artificial Analysis's intelligence-per-dollar curves make the combined offering competitive with the Claude stack on most real workloads. Simon Willison notes one remaining friction, the lack of official API access at launch, though the semi-official "/backend-api/codex/responses" endpoint that some agent harnesses such as Pi and Opencode use is now tacitly supported by OpenAI, which has hired the OpenClaw creator and publicly welcomes third parties integrating with ChatGPT subscriptions. This is a direct contrast with Anthropic, which blocked OpenClaw from doing the same with Claude subscriptions earlier in the year. The pattern coming out of the week is that raw one-dimensional intelligence numbers are giving way to intelligence-per-dollar axes as the real comparison space, with OpenAI, Anthropic, and Google all now jockeying for position on that 2D Pareto frontier.
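The intelligence-per-dollar framing amounts to computing a 2D Pareto frontier over (cost, score) points. A minimal sketch, using the approximate end-to-end eval costs quoted above; the shared score of 70 is a hypothetical stand-in for an Intelligence Index value, not a published number:

```python
# Pareto frontier over (cost, score) points: a model is Pareto-optimal
# if no other model is both cheaper and at least as smart.
def pareto_frontier(models):
    """models: dict name -> (cost, score). Returns Pareto-optimal names, sorted."""
    frontier = []
    for name, (cost, score) in models.items():
        dominated = any(
            (c < cost and s >= score) or (c <= cost and s > score)
            for other, (c, s) in models.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

# Illustrative figures from the digest: cost = end-to-end eval cost (USD),
# score = a stand-in intelligence value (hypothetical).
models = {
    "GPT-5.5 (medium)":    (1200, 70),
    "Opus 4.7 (max)":      (4800, 70),
    "Gemini 3.1 Pro Prev": (900, 70),
}
print(pareto_frontier(models))  # ['Gemini 3.1 Pro Prev']: at equal scores, only the cheapest survives
```

With genuinely different scores per model, several points can sit on the frontier at once, which is the "jockeying" the digest describes.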

How it was discussed
  • OpenAI's own announcement frames GPT-5.5 as optimized for coding, research, and data analysis across tools.
  • NVIDIA emphasizes that 10,000+ NVIDIA employees are already using GPT-5.5-powered Codex on GB200 NVL72, citing 35× lower cost per million tokens vs prior hardware.
  • Simon Willison highlights the missing API access at launch and the OpenClaw/Codex subscription-backdoor dynamic contrasting with Anthropic's recent block.
  • Ethan Mollick's procedural-town test found GPT-5.5 Pro was the only model that actually modeled an evolving town rather than replacing buildings in place.
  • Latent Space/swyx notes Artificial Analysis's finding: GPT-5.5 (medium) matches Opus 4.7 (max) at ~¼ the cost, but Gemini 3.1 Pro Preview still undercuts both.
  • TechCrunch frames it as OpenAI's 'super app' moment, with Codex folding in the defunct Prism's browser-control stack.
#2

DeepSeek V4-Pro and V4-Flash — largest open-weights model ever, aggressive pricing
7.8
I 8.0 Im 6.5 P 7.8

DeepSeek ended its five-month silence since V3.2 Speciale with two preview models in the V4 series, both MIT-licensed, both 1-million-token context, both mixture-of-experts. V4-Pro is 1.6 trillion total parameters with 49 billion active, making it the largest open-weights model ever released — larger than Kimi K2.6 at 1.1 trillion, GLM-5.1 at 754 billion, and more than twice the size of DeepSeek V3.2 at 685 billion. V4-Flash is a 284-billion-parameter MoE with 13 billion active, sized to fit reasonably quantized onto high-end consumer workstations. Simon Willison flagged the possibility of running Flash on a 128GB M5 MacBook Pro with light quantization, and even Pro on that hardware if the inference engine can stream just the active experts from disk per token. The checkpoints are 865 gigabytes and 160 gigabytes respectively on Hugging Face.
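Willison's "stream only the active experts" idea is easy to sanity-check with back-of-envelope arithmetic. A sketch using the parameter counts above; the ~4.3 bits per parameter is an assumption about quantization level, chosen only because it roughly reproduces the listed checkpoint sizes:

```python
def weights_gb(params_billions, bytes_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# V4-Pro: 1.6T total / 49B active; V4-Flash: 284B total / 13B active.
# ~4.3 bits/param (a Q4-style quantization level) is an assumption.
B = 4.3 / 8
print(round(weights_gb(1600, B)))  # 860: full V4-Pro, near the 865 GB checkpoint
print(round(weights_gb(49, B)))    # 26: only the per-token active experts
print(round(weights_gb(284, B)))   # 153: full V4-Flash, near the 160 GB listed
```

The ~26 GB active-expert footprint is what makes disk-streaming Pro on a 128GB machine conceivable, though which experts fire changes every token, so sustained disk bandwidth, not capacity, would be the real constraint.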

The headline is price. V4-Flash is listed at $0.14 per million input tokens and $0.28 per million output, which places it as the cheapest small model in the frontier class, below GPT-5.4 Nano at $0.20 input and $1.25 output, Gemini 3.1 Flash-Lite at $0.25 and $1.50, and Claude Haiku 4.5 at $1 and $5. V4-Pro comes in at $1.74 input and $3.48 output, roughly one third the price of Gemini 3.1 Pro at $2 and $12, one fifth the price of GPT-5.4 at $2.50 and $15, and an order of magnitude cheaper on output tokens than Claude Opus 4.7 at $5 and $25 or GPT-5.5 at $5 and $30. Simon Willison's pelican-on-a-bicycle benchmark suggests V4-Pro is in the same ballpark as those priced competitors for the structural-reasoning-under-SVG-constraints task he uses as a qualitative probe.
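The per-token prices above translate into blended workload costs. A sketch comparing blended $/Mtok under an assumed 3:1 input:output token mix; the ratio is an assumption, the prices are the ones quoted in the digest:

```python
def blended_cost(in_price, out_price, in_ratio=0.75):
    """Blended $/M tokens for a workload with the given input-token fraction."""
    return in_price * in_ratio + out_price * (1 - in_ratio)

prices = {  # (input $/Mtok, output $/Mtok), as listed above
    "DeepSeek V4-Flash": (0.14, 0.28),
    "GPT-5.4 Nano":      (0.20, 1.25),
    "DeepSeek V4-Pro":   (1.74, 3.48),
    "Gemini 3.1 Pro":    (2.00, 12.00),
    "Claude Opus 4.7":   (5.00, 25.00),
    "GPT-5.5":           (5.00, 30.00),
}
for name, (i, o) in sorted(prices.items(), key=lambda kv: blended_cost(*kv[1])):
    print(f"{name:18s} ${blended_cost(i, o):6.2f}/Mtok blended")
```

At this mix, V4-Pro's blended cost lands around $2.2/Mtok against roughly $10 to $11 for Opus 4.7 and GPT-5.5, consistent with the "one fifth to one tenth" framing in the text for output-heavy workloads.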

DeepSeek frames both models as V4 previews, with a fuller release and associated technical report still pending. The architectural details disclosed so far are continuity with V3's multi-head latent attention and fine-grained expert routing, scaled roughly 2.3 times in total parameters for Pro. OpenRouter already exposes both endpoints, and DeepSeek's own API is live at the published prices. The significance for open weights is that the frontier-parity tier, the bracket where a model can plausibly stand in for Opus 4.7 or GPT-5.5 on broad tasks, now contains an MIT-licensed option that runs at roughly one fifth to one tenth the cost per million tokens. For the Chinese-lab cohort, it also re-establishes DeepSeek's cadence after a period where Qwen and Kimi had been doing most of the frontier-open-weights announcements. The practical question for the coming weeks is how Pro's 1.6 trillion parameters hold up on the harder agentic-coding and long-horizon benchmarks versus the closed frontier stack, and whether the 49-billion active-parameter count keeps inference cost viable at the listed prices under real load.

#3

What We Learned Building Cloud Agents

Agents & Tool Use 2026-04-23 Cognition AI (Devin)
7.6
I 7.5 Im 7.5 P 6.0

Cognition published a long follow-up to its earlier posts on multi-agent patterns, this time dissecting the infrastructure demands of cloud-based software engineering agents. The core claim is that the natural starting point for a cloud agent — take a CLI agent, containerize it, give it repo and toolchain access — looks achievable but hits three hard walls when it meets real engineering workflows. Cognition frames the post as a counter to recent signals, including Stripe's public description of its homegrown cloud agent, that building one in-house is the right path for large organizations.

The first wall is isolation. Containerized agents share a kernel, and agents generate their own code, run arbitrary commands, and probe the environment unpredictably. A kernel-level escape lets one compromised session reach every other container's filesystems, credentials, and network connections. Cognition's conclusion is the same as the broader infra community's for any untrusted workload: VM-level isolation. Their implementation is a microVM per agent session, with over a year of hypervisor engineering behind it. A side benefit is that agents in dedicated VMs can drive a full browser, desktop applications, and arbitrary tool stacks the way a developer on a workstation does, which turns out to matter for agentic work that wanders outside the terminal.

The second wall is state across async gaps. Real engineering is stop-and-go: open a pull request, wait on continuous integration, respond to code review, rerun tests, push a follow-up commit. Between each step there are minutes, hours, or days where the agent must preserve its full working state. Containers cannot reliably snapshot an individual container's memory, process trees, and filesystem, shut down compute, and restore exactly later. A container either burns compute to stay alive or loses the session on reschedule or timeout. Cognition's solution is full machine-state snapshotting at the hypervisor level, so a Devin session idles at zero compute cost and resumes bit-for-bit when a CI result or review comment arrives. Making this reliable across thousands of concurrent heterogeneous sessions, each with its own repos and runtimes, is described as the longest-running single piece of infrastructure they have built.
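The snapshot-at-async-gaps pattern can be sketched as a tiny session state machine. Every name here is hypothetical; Cognition has not published an API, and the real machine-state snapshot happens at the hypervisor, not in application code:

```python
from enum import Enum, auto

class State(Enum):
    RUNNING = auto()
    SNAPSHOTTED = auto()   # VM state on disk, zero compute cost

class AgentSession:
    """Toy model of an idle-at-zero-cost agent session (hypothetical API)."""
    def __init__(self, session_id):
        self.session_id = session_id
        self.state = State.RUNNING
        self.pending_wakeups = []

    def await_external(self, event):
        """Hitting an async gap (a CI run, a review) snapshots the whole VM."""
        self.pending_wakeups.append(event)
        self.state = State.SNAPSHOTTED   # hypervisor would dump memory + disk here
        return self.state

    def deliver(self, event):
        """A webhook (CI result, review comment) restores the VM bit-for-bit."""
        if event in self.pending_wakeups:
            self.pending_wakeups.remove(event)
            self.state = State.RUNNING   # hypervisor restore would happen here
        return self.state

s = AgentSession("devin-1234")
s.await_external("ci:build-789")   # session idles at zero compute cost
s.deliver("ci:build-789")          # resumed exactly where it left off
```

The hard part the post describes is not this state machine but making the snapshot/restore bit-for-bit reliable across thousands of concurrent heterogeneous VMs.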

The third wall is orchestration, governance, and integrations. Each is a multi-quarter infra project on its own. Orchestration demands per-session provisioning, correct routing, warm VM pools tuned to demand, and environments kept current as codebases change daily. Governance demands inheriting the dispatching engineer's permissions across every system the agent touches, with tamper-evident audit logs. Integrations demand connecting to CI, monitoring, package registries, documentation, and source control, each with its own authentication model. Stripe's internal MCP server reportedly carries over 400 tools, and that is the scale of ongoing investment this layer requires. The article closes with the pattern Cognition says they consistently see from teams attempting in-house builds: the combined surface area becomes untenable rather than any single component being blocking. The implicit pitch is that two years of hypervisor engineering, snapshot machinery, and integration work is why Devin exists as a product. The technical content is concrete enough that it reads as an architecture reference even if read independently of that pitch.

#4
7.5
I 7.0 Im 6.8 P 6.5

UniT, the top-voted paper on Hugging Face Daily Papers for April 24, tackles one of the core bottlenecks in humanoid foundation models: the kinematic mismatch that makes massive egocentric human video data hard to use for training humanoid policies. Humans and humanoid robots move in different coordinate frames with different joint counts and different contact dynamics, so cross-embodiment transfer has historically required either careful data curation, per-robot retargeting, or explicit physics priors. UniT proposes a unified physical language that represents both human and humanoid behavior in a shared token space, letting a single model learn policy, world modeling, and cross-embodiment retargeting end-to-end from a mix of human demonstrations and humanoid execution traces.

The architecture combines a tokenizer that maps joint-angle trajectories, end-effector poses, and egocentric video frames into a common discrete vocabulary, a transformer-based world model trained to predict next-token physical state, and a policy head conditioned on task language that emits actions in the humanoid's configuration space. Training mixes OpenX-style humanoid telemetry, egocentric human video datasets such as Ego4D and Epic-Kitchens, and a smaller curated set of paired human-humanoid demonstrations. The authors report that the shared tokenization recovers the majority of the benefit from paired data even when paired examples are scarce, which is the regime that matters for humanoid scaling.
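The shared "physical language" rests on mapping continuous signals into one discrete vocabulary. A deliberately minimal stand-in, uniform binning of joint angles, which illustrates the idea but is nothing like UniT's learned tokenizer:

```python
import math

def tokenize_angles(angles, vocab_size=256, lo=-math.pi, hi=math.pi):
    """Map continuous joint angles (radians) to discrete token ids by
    uniform binning, a toy stand-in for a learned VQ-style tokenizer."""
    span = hi - lo
    ids = []
    for a in angles:
        a = min(max(a, lo), hi)                                  # clamp to range
        ids.append(min(int((a - lo) / span * vocab_size), vocab_size - 1))
    return ids

def detokenize(ids, vocab_size=256, lo=-math.pi, hi=math.pi):
    """Decode token ids back to bin-center angles."""
    span = hi - lo
    return [lo + (i + 0.5) / vocab_size * span for i in ids]

traj = [0.0, 0.5, -1.2]
ids = tokenize_angles(traj)
recon = detokenize(ids)
# quantization error is at most half a bin width: pi/256, about 0.0123 rad
```

The point of a shared vocabulary is that human retargeted poses and humanoid joint states land in the same token space, so one transformer can model both; the learned version trades the uniform bins for codebook vectors trained jointly with the world model.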

Benchmark numbers span manipulation and locomotion. On a cross-embodiment manipulation suite the authors construct around kitchen and workshop tasks, UniT improves on the best human-to-humanoid baseline by several points on task success while cutting the retargeting-data requirement roughly in half. On locomotion and whole-body control, the world-model component can roll out plausible multi-second trajectories from a single observation, which the paper uses for both reward-free policy bootstrapping and for curriculum generation. Ablations isolate the contribution of the shared tokenizer from the sheer scale effect and show that a naive multimodal transformer without the physical-language tokenization underperforms by a noticeable margin.

The positioning relative to prior VLA stacks, RT-2 and OpenVLA for manipulation, and recent humanoid-specific work from Unitree and Agility is the main discussion point. UniT argues that a single token space for both modeling and acting is the right unification axis rather than bolting humanoid heads onto vision-language backbones. The open question the authors flag is how far the physical language generalizes beyond the two embodiments trained on; the paper includes a small zero-shot test on a different humanoid platform with promising but not conclusive numbers. If the token space survives more embodiments, it becomes a plausible ingredient in the next wave of humanoid foundation models, where the binding constraint has shifted from compute to robot-data throughput.

benchmark · vla · robot · humanoid · video-gen
#5

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Research 2026-04-23 HF ↑21 Hugging Face Daily Papers
6.4
I 5.2 Im 5.5 P 7.0

Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions (identical scenes, identical action sequences, and a unified control interface) needed to make those metrics comparable across models with heterogeneous inputs.

benchmark · vlm · video-gen
#6

An update on recent Claude Code quality reports

Frontier LLMs 2026-04-24 Simon Willison's Weblog
6.4
I 6.5 Im 6.0 P 5.5

It turns out the high volume of complaints that Claude Code was producing worse-quality results over the past two months was grounded in real problems. The models themselves were not to blame; rather, three separate issues in the Claude Code harness caused complex but material problems that directly affected users. Anthropic's postmortem describes these in detail.

#7

AEL: Agent Evolving Learning for Open-Ended Environments

Interpretability 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Wujiang Xu, Jiaojiao Han, Minghao Guo, Kai Mei, Xi Zhu…
6.0
I 6.7 Im 5.7 P 4.5

LLM agents increasingly operate in open-ended environments spanning hundreds of sequential episodes, yet they remain largely stateless: each task is solved from scratch without converting past experience into better future behavior. The central obstacle is not what to remember but how to use what has been remembered, including which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change. We introduce Agent Evolving Learning (AEL), a two-timescale framework that addresses this obstacle.

benchmark · agents
#8

National Australia Bank accelerates legacy migrations with Cursor

AI Coding 2026-04-23 Cursor Blog (Anysphere)
6.0
I 6.2 Im 5.8 P 5.0

National Australia Bank (NAB) standardized 6,000 developers on Cursor after evaluating Amazon Q and GitHub Copilot, with plans to scale toward 10,000+. Legacy modernizations — monolith-to-microservices refactors, migrations away from Assembly mainframes — are running 3× faster than expected. One merchant services team built a hardware-agnostic payment app in 3 weeks instead of the original 4-month scope. NAB cited model flexibility (engineers choose per-task), repository-wide context, and auto-rules that bake compliance requirements into agent behavior as the reasons they moved off Q/Copilot.

#9
5.8
I 5.5 Im 5.5 P 5.0

Anthropic and NEC announce strategic collaboration to deploy Claude to approximately 30,000 NEC Group employees worldwide. NEC becomes Anthropic's first Japan-based global partner and will jointly develop secure, domain-specific AI products for Japanese finance, manufacturing, local government, and cybersecurity customers. Claude Opus 4.7 and Claude Code will be integrated into NEC BluStellar Scenario. NEC will establish a Center of Excellence and extend Client Zero deployment of Claude Cowork internally.

#10
5.8
I 5.2 Im 5.0 P 5.7

Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths.

benchmark · diffusion · finetune · pretrain
#11

Quotient-Space Diffusion Models

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML)
Yixian Xu, Yusong Wang, Shengjie Luo, Kaiyuan Gao, Tianyu He…
5.7
I 5.0 Im 5.0 P 5.5

Diffusion-based generative models have transformed generative AI and enabled new capabilities in the science domain, for example generating 3D structures of molecules. Due to the intrinsic problem structure of certain tasks, there is often a symmetry in the system that identifies objects convertible by a group action as equivalent; the target distribution is therefore essentially defined on the quotient space with respect to the group.

diffusion · protein · molecular
#12
5.7
I 5.5 Im 5.5 P 5.0

Good morning. This week’s Stratechery Interview is with Google Cloud CEO Thomas Kurian. Kurian joined Google to lead the company’s cloud division in 2018; prior to that he was President of Product Development at Oracle, where he worked for 22 years. I previously spoke to Kurian in March 2021, April 2024, and April 2025. The occasion for these interviews, at least for the last three years, is Kurian’s annual keynote at Google Cloud Next.

#13
5.6
I 5.5 Im 5.5 P 4.5

Ai2's OlmoEarth Studio now exposes custom embedding exports from its open-source Earth-observation foundation models. Users choose area of interest, time range, encoder variant, resolution, and imagery sources (Sentinel-2 etc.) and receive Cloud-Optimized GeoTIFFs of compact embedding vectors suitable for similarity search, segmentation, and unsupervised exploration. Weights and code are public. Ai2 also shows a PCA+k-means clustering of 1.1M Sentinel-2 samples demonstrating that locations with similar surface characteristics land close in embedding space. Supervised fine-tuning is also supported for higher-performance downstream use.
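The PCA + k-means demonstration is straightforward to reproduce on any embedding matrix. A numpy-only sketch; the 1.1M-sample Sentinel-2 scale and the GeoTIFF export format are per the post, while the random stand-in vectors here are purely illustrative:

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

# stand-in for exported embedding vectors (e.g. one per GeoTIFF pixel):
# two synthetic "surface types" as well-separated Gaussian blobs
rng = np.random.default_rng(42)
emb = np.vstack([rng.normal(0, 1, (200, 64)), rng.normal(6, 1, (200, 64))])
Z = pca(emb, 2)            # 64-D embeddings -> 2-D for visualization
labels = kmeans(Z, 2)      # similar surface types land in the same cluster
```

On real exports the same two steps recover the post's observation that locations with similar surface characteristics land close together in embedding space.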

#14

Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation

Reinforcement Learning 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.RO (Robotics) · arXiv cs.LG (Machine Learning)
Yi-Ling Liu, Melvin Laux, Mariela De Lucas Alvarez, Frank Kirchner, Rebecca Adam
5.5
I 4.0 Im 5.9 P 5.5

Autonomous underwater vehicles are required to perform multiple tasks adaptively and in an explainable manner under dynamic, uncertain conditions and limited sensing, challenges that classical controllers struggle to address. This demands robust, generalizable, and inherently interpretable control policies for reliable long-term monitoring. Reinforcement learning, particularly multi-task RL, overcomes these limitations by leveraging shared representations to enable efficient adaptation across tasks and environments.

rl · agents · pretrain
#15

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Interpretability 2026-04-23 arXiv cs.CL (Computation & Language)
Srija Anand, Ashwin Sankar, Ishvinder Sethi, Aaditya Pareek, Kartik Rajput…
5.5
I 6.0 Im 5.9 P 3.5

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1,900 native raters.
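Turning 120K pairwise comparisons into a system ranking is conventionally done with a Bradley-Terry model. A minimal fit via the standard minorization-maximization update, on toy data; the paper's exact aggregation may differ:

```python
def bradley_terry(wins, n_iter=200):
    """wins[(a, b)] = number of times system a beat system b.
    Returns strength scores normalized to sum to 1 (MM algorithm)."""
    systems = sorted({s for pair in wins for s in pair})
    p = {s: 1.0 for s in systems}
    for _ in range(n_iter):
        new = {}
        for s in systems:
            num = sum(w for (a, b), w in wins.items() if a == s)   # total wins of s
            den = sum(                                             # games vs t, weighted
                (wins.get((s, t), 0) + wins.get((t, s), 0)) / (p[s] + p[t])
                for t in systems if t != s
            )
            new[s] = num / den if den else p[s]
        total = sum(new.values())
        p = {s: v / total for s, v in new.items()}                 # renormalize
    return p

# toy data: A beats B 8/10, B beats C 7/10
scores = bradley_terry({("A", "B"): 8, ("B", "A"): 2, ("B", "C"): 7, ("C", "B"): 3})
# recovers the expected ordering A > B > C despite B playing more games
```

The same fit extends naturally to the paper's per-dimension setting by running it separately for each perceptual axis.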

#16
5.5
I 5.0 Im 5.3 P 5.0

Today, we check in a year after the first Unsupervised Learning x Latent Space Crossover special to discuss everything that has changed (there is a lot) in the world of AI. This episode was recorded just after AIE Europe, but before the Cursor-xAI deal. Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what’s real today, what will be real in the future, and what it means for businesses and the world, helping builders, researchers and founders deconstruct and understand the biggest breakthroughs.

#17

Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics

Reinforcement Learning 2026-04-23 arXiv cs.RO (Robotics) · arXiv cs.LG (Machine Learning)
Heng Yang
5.4
I 6.2 Im 4.0 P 4.5

We propose a sampling-based framework for finite-horizon trajectory and policy optimization under differentiable dynamics by casting controller design as inference. Specifically, we minimize a KL-regularized expected trajectory cost, which yields an optimal "Boltzmann-tilted" distribution over controller parameters that concentrates on low-cost solutions as temperature decreases.
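The "Boltzmann-tilted" target is a distribution proportional to exp(-cost(theta)/T), which concentrates on low-cost parameters as T shrinks. A minimal importance-weighted resampling sketch on a toy quadratic cost; the paper's sequential Monte Carlo machinery with differentiable dynamics is far richer:

```python
import math, random

def tempered_resample(thetas, cost, temp):
    """Resample controller parameters in proportion to exp(-cost/T):
    the Boltzmann tilt shifts mass onto low-cost parameters."""
    costs = [cost(t) for t in thetas]
    m = min(costs)                                   # stabilize the exponent
    w = [math.exp(-(c - m) / temp) for c in costs]
    z = sum(w)
    return random.choices(thetas, weights=[x / z for x in w], k=len(thetas))

random.seed(0)
cost = lambda th: (th - 2.0) ** 2                    # toy cost, optimum at theta = 2
pop = [random.uniform(-5, 5) for _ in range(500)]
for temp in (5.0, 1.0, 0.2, 0.05):                   # decreasing-temperature schedule
    pop = tempered_resample(pop, cost, temp)
mean = sum(pop) / len(pop)                           # concentrates near 2.0
```

The annealing schedule is what "tempered" refers to: early high temperatures keep the population diverse, later low temperatures drive it toward the low-cost region.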

benchmark · multimodal
#18
5.4
I 5.5 Im 5.2 P 4.0

While Large Language Models (LLMs) excel at function-level code generation, project-level tasks such as generating functional and visually aesthetic multi-page websites remain highly challenging. Existing works are often limited to single-page static websites, while agentic frameworks typically rely on multi-turn execution with proprietary models, leading to substantial token costs, high latency, and brittle integration. Training a small LLM end-to-end with reinforcement learning (RL) is a promising alternative, yet it faces a critical bottleneck in designing reliable and computationally feasible rewards for website generation.

rl · multimodal · agents · codegen · coding
#19
5.4
I 6.2 Im 4.7 P 3.7

Electroencephalography (EEG) foundation models have shown strong potential for learning generalizable representations from large-scale neural data, yet their clinical deployment is hindered by distribution shifts across clinical settings, devices, and populations. Test-time adaptation (TTA) offers a promising solution by enabling models to adapt to unlabeled target data during inference without access to source data, a valuable property in healthcare settings constrained by privacy regulations and limited labeled data. However, its effectiveness for EEG remains largely underexplored.

benchmark · pretrain
#20

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Naheed Rayhan, Sohely Jahan
5.3
I 5.5 Im 5.7 P 3.5

Large language models (LLMs) are increasingly integrated into sensitive workflows, raising the stakes for adversarial robustness and safety. This paper introduces Transient Turn Injection (TTI), a new multi-turn attack technique that systematically exploits stateless moderation by distributing adversarial intent across isolated interactions. TTI leverages automated attacker agents powered by large language models to iteratively test and evade policy enforcement in both commercial and open-source LLMs, marking a departure from conventional jailbreak approaches that typically depend on maintaining persistent conversational context.

agents
#21

Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Guangkai Xu, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun…
5.3
I 6.2 Im 5.2 P 3.5

Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy.

benchmark
#22

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

Multimodal 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision)
Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson…
5.2
I 4.7 Im 4.0 P 5.5

Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations.

benchmark · dpo · vlm · finetune
#23

GiVA: Gradient-Informed Bases for Vector-Based Adaptation

Evaluations & Benchmarks 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Neeraj Gangwar, Rishabh Deshmukh, Michael Shavlovsky, Hancao Li, Vivek Mittal…
5.2
I 5.7 Im 4.0 P 4.5

As model sizes continue to grow, parameter-efficient fine-tuning has emerged as a powerful alternative to full fine-tuning. While LoRA is widely adopted among these methods, recent research has explored vector-based adaptation methods due to their extreme parameter efficiency. However, these methods typically require substantially higher ranks than LoRA to match its performance, leading to increased training costs. This work introduces GiVA, a gradient-based initialization strategy for vector-based adaptation. It achieves training times comparable to LoRA and maintains the extreme parameter efficiency of vector-based adaptation.

benchmark · finetune
#24

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

Evaluations & Benchmarks 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning)
Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di…
5.2
I 5.0 Im 4.0 P 5.5

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery.

benchmark
#25

Replay-buffer engineering for noise-robust quantum circuit optimization

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning)
Akash Kundu, Sebastian Feld
5.2
I 5.2 Im 4.7 P 4.5

Deep reinforcement learning (RL) for quantum circuit optimization faces three fundamental bottlenecks: replay buffers that ignore the reliability of temporal-difference (TD) targets, curriculum-based architecture search that triggers a full quantum-classical evaluation at every environment step, and the routine discard of noiseless trajectories when retraining under hardware noise. We address all three by treating the replay buffer as a primary algorithmic lever for quantum optimization.
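"Replay buffers that ignore TD-target reliability" contrasts with prioritized replay, where sampling probability follows a per-transition weight. A generic reliability-weighted buffer sketch; this is the standard prioritized-replay idea, not the paper's specific scheme, and the reliability values are illustrative:

```python
import random

class ReliabilityBuffer:
    """Replay buffer that samples transitions in proportion to a
    reliability weight (e.g. inverse TD-target variance): a generic
    prioritized-replay sketch, not the paper's exact mechanism."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items, self.weights = [], []

    def add(self, transition, reliability):
        if len(self.items) == self.capacity:        # evict the oldest entry
            self.items.pop(0)
            self.weights.pop(0)
        self.items.append(transition)
        self.weights.append(max(reliability, 1e-8))

    def sample(self, k):
        return random.choices(self.items, weights=self.weights, k=k)

random.seed(1)
buf = ReliabilityBuffer(capacity=100)
buf.add("noisy-trajectory", reliability=0.1)
buf.add("noiseless-trajectory", reliability=5.0)   # kept, not discarded
batch = buf.sample(1000)
# the high-reliability transition dominates the sampled batch
```

The third bottleneck in the abstract, discarding noiseless trajectories when retraining under noise, corresponds here to simply keeping them in the buffer with a high reliability weight.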

benchmark · rl · molecular · pretrain
#26

Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Agents & Tool Use 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan, Peng Kuang…
5.2
I 4.0 Im 5.7 P 4.5

Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems.

benchmark · agents · codegen · coding
#27

Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv cs.NE (Neural & Evolutionary Computing)
Eylon E. Krause
5.2
I 5.5 Im 3.5 P 5.5

The choice of activation function plays a crucial role in the optimization and performance of deep neural networks. While the Rectified Linear Unit (ReLU) remains the dominant choice due to its simplicity and effectiveness, its lack of smoothness may hinder gradient-based optimization in deep architectures. In this work we propose a family of $C^{2N}$-smooth activation functions whose gate follows a log-logistic CDF, achieving ReLU-like performance with purely rational arithmetic.

transformer
#28

Fixation Sequences as Time Series: A Topological Approach to Dyslexia Detection

Interpretability 2026-04-23 arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning)
Marius Huber, David R. Reich, Lena A. Jäger
5.2
I 5.0 Im 4.7 P 4.5

Persistent homology, a method from topological data analysis, extracts robust, multi-scale features from data. It produces stable representations of time series by applying varying thresholds to their values (a process known as a filtration). We develop novel filtrations for time series and introduce topological methods for the analysis of eye-tracking data, by interpreting fixation sequences as time series, and constructing "hybrid models" that combine topological features with traditional statistical features.
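For intuition, the 0-dimensional sublevel-set persistence of a time series can be computed with a short union-find sweep. This is a generic sketch of the standard elder-rule algorithm, not the paper's novel filtrations:

```python
def sublevel_persistence(values):
    # 0-dimensional sublevel-set filtration of a 1D time series: sweep
    # a threshold upward; each local minimum births a component, and
    # when two components merge, the younger one dies (elder rule).
    # Returns (birth, death) pairs; zero-persistence pairs are skipped.
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    parent = [None] * n          # None: point not yet below threshold
    birth = {}
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                if birth[ri] > birth[rj]:
                    ri, rj = rj, ri          # rj is the younger root
                if birth[rj] < values[i]:
                    pairs.append((birth[rj], values[i]))
                parent[rj] = ri
    pairs.append((min(values), float("inf")))  # global min never dies
    return sorted(pairs)
```

The resulting (birth, death) pairs are exactly the stable, threshold-based features that such methods feed into downstream classifiers.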

#29

Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning

Research 2026-04-23 arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning)
Hanjun Cho, Gahyun Yoo, Hanseong Kim, Jay-Yoon Lee
5.2
I 5.5 Im 4.2 P 4.5

Numerical reasoning over expert-domain tables often exhibits high in-domain accuracy but limited robustness to domain shift. Models trained with supervised fine-tuning (SFT) on specific datasets tend to rely on header-operation shortcuts rather than structural reasoning. We introduce TaNOS, a continual pre-training framework comprising three components: (i) header anonymization to reduce lexical memorization, (ii) operation sketches that provide minimal structural cues, and (iii) self-supervised pretraining that constructs correctness-guaranteed program-question pairs from given tables in a program-first manner.
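Component (i) can be illustrated in a few lines; the function name and interface below are hypothetical, not TaNOS's actual code:

```python
def anonymize_headers(header, rows):
    # Sketch of header anonymization: replacing column names with
    # positional placeholders removes the lexical cue behind
    # header-operation shortcuts, so a model must reason over table
    # structure rather than memorized header words.
    mapping = {h: f"col{i}" for i, h in enumerate(header)}
    return [mapping[h] for h in header], rows, mapping
```

The returned mapping lets the pipeline translate predicted programs back to the original column names after reasoning.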

finetune · pretrain
#30

Towards Universal Tabular Embeddings: A Benchmark Across Data Tasks

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Liane Vogel, Kavitha Srinivas, Niharika D'Souza, Sola Shirai, Oktie Hassanzadeh…
5.2
I 6.2 Im 4.7 P 3.5

Tabular foundation models aim to learn universal representations of tabular data that transfer across tasks and domains, enabling applications such as table retrieval, semantic search and table-based prediction. Despite the growing number of such models, it remains unclear which approach works best in practice, as existing methods are often evaluated under task-specific settings that make direct comparison difficult. To address this, we introduce TEmBed, the Tabular Embedding Test Bed, a comprehensive benchmark for systematically evaluating tabular embeddings across four representation levels: cell, row, column, and table.

benchmark
#31

There Will Be a Scientific Theory of Deep Learning

Research 2026-04-23 arXiv cs.LG (Machine Learning) · arXiv stat.ML (Statistical ML) · arXiv — Mechanistic Interpretability
Jamie Simon, Daniel Kunin, Alexander Atanasov, Enric Boix-Adserà, Blake Bordelon…
5.2
I 4.0 Im 4.7 P 5.5

In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks.

mech-interp
#33
5.2
I 5.0 Im 5.4 P 3.7

Learning robust representations of authorial style is crucial for authorship attribution and AI-generated text detection. However, existing methods often struggle with content-style entanglement, where models learn spurious correlations between authors' writing styles and topics, leading to poor generalization across domains. To address this challenge, we propose Explainable Authorship Variational Autoencoder (EAVAE), a novel framework that explicitly disentangles style from content through architectural separation-by-design.

pretrain
#34
5.2
I 5.0 Im 5.0 P 4.3

Today, we check in a year after the first Unsupervised Learning x Latent Space Crossover special to discuss everything that has changed (there is a lot) in the world of AI. This episode was recorded just after AIE Europe, but before the Cursor-xAI deal. Unsupervised Learning is a podcast that interviews the sharpest minds in AI about what's real today, what will be real in the future, and what it means for businesses and the world, helping builders, researchers and founders deconstruct and understand the biggest breakthroughs.

#35

TraceScope: Interactive URL Triage via Decoupled Checklist Adjudication

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Haolin Zhang, William Reber, Yuxuan Zhang, Guofei Gu, Jeff Huang
5.1
I 5.0 Im 5.7 P 3.5

Modern phishing campaigns increasingly evade snapshot-based URL classifiers using interaction gates (e.g., checkbox/slider challenges), delayed content rendering, and logo-less credential harvesters. This shifts URL triage from static classification toward an interactive forensics task: an analyst must actively navigate the page while isolating themselves from potential runtime exploits. We present TraceScope, a decoupled triage pipeline that operationalizes this workflow at scale.

agents
#36

MISTY: High-Throughput Motion Planning via Mixer-based Single-step Drifting

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Yining Xing, Zehong Ke, Yiqian Tu, Zhiyuan Liu, Wenhao Yu…
5.1
I 6.2 Im 4.3 P 3.5

Multi-modal trajectory generation is essential for safe autonomous driving, yet existing diffusion-based planners suffer from high inference latency due to iterative neural function evaluations. This paper presents MISTY (Mixer-based Inference for Single-step Trajectory-drifting Yield), a high-throughput generative motion planner that achieves state-of-the-art closed-loop performance with pure single-step inference. MISTY integrates a vectorized Sub-Graph encoder to capture environment context, a Variational Autoencoder to structure expert trajectories into a compact 32-dimensional latent manifold, and an ultra-lightweight MLP-Mixer decoder to eliminate quadratic attention complexity.

benchmark · diffusion
#37

A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents

Evaluations & Benchmarks 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Praval Sharma
5.0
I 5.0 Im 4.0 P 4.5

Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities.

multimodal
#38

Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv — Mechanistic Interpretability
Minji Jung, Minjae Lee, Yejin Kim, Sarang Choi, Minsuk Kahng
5.0
I 5.2 Im 4.0 P 4.5

LLM leaderboards are widely used to compare models and guide deployment decisions. However, leaderboard rankings are shaped by evaluation priorities set by benchmark designers, rather than by the diverse goals and constraints of actual users and organizations. A single aggregate score often obscures how models behave across different prompt types and compositions. In this work, we conduct an in-depth analysis of the dataset used in the LMArena (formerly Chatbot Arena) benchmark and investigate this evaluation challenge by designing an interactive visualization interface as a design probe.
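The paper's point about aggregate scores can be made concrete with a toy re-ranking under user-defined category weights (a generic illustration, not the authors' interface):

```python
def rerank(scores, weights):
    # User-defined leaderboard aggregation: `scores` maps model ->
    # {category: score}; `weights` encodes one user's priorities.
    # A single fixed aggregate hides exactly the trade-offs that
    # different weightings expose.
    total = sum(weights.values())
    agg = {m: sum(weights[c] * s[c] for c in weights) / total
           for m, s in scores.items()}
    return sorted(agg, key=agg.get, reverse=True)
```

Two users with different priorities can legitimately get opposite rankings from the same underlying per-category data, which is the behavior a single leaderboard number cannot express.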

benchmark
#39

StructMem: Structured Memory for Long-Horizon Behavior in LLMs

Agents & Tool Use 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language) · arXiv cs.LG (Machine Learning)
Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao…
5.0
I 4.0 Im 4.0 P 5.5

Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections.
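As a rough sketch of what event-level bindings with induced cross-event connections might look like (the schema is entirely hypothetical, not StructMem's algorithm):

```python
from dataclasses import dataclass, field

@dataclass(eq=False)   # identity-based equality avoids recursive compares
class Event:
    # An event binds participants and time together, rather than
    # storing an isolated fact.
    text: str
    entities: frozenset
    time: int
    links: list = field(default_factory=list)

class EventMemory:
    # Cross-event connections are induced cheaply via shared entities,
    # avoiding expensive full graph construction up front.
    def __init__(self):
        self.events = []

    def add(self, text, entities, time):
        ev = Event(text, frozenset(entities), time)
        for old in self.events:
            if old.entities & ev.entities:    # shared participant
                old.links.append(ev)
                ev.links.append(old)
        self.events.append(ev)
        return ev
```

Multi-hop questions then become link traversals over events, while unrelated events stay unconnected.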

agents
#40

GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Qizhuo Xie, Yunhui Liu, Yu Xing, Qianzi Hou, Xudong Jin…
5.0
I 5.5 Im 3.5 P 4.5

Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities.

quantization
#41

Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

Frontier LLMs 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Hao-Yuan Chen
5.0
I 5.5 Im 3.5 P 4.5

Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results.
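The generate-critique-refine loop is simple to sketch; the callables below stand in for LLM calls, and the interface is an assumption rather than the paper's API:

```python
def vps_solve(problem, student, supervisor, R=3):
    # Sketch of a verbal-process-supervision loop: `student` and
    # `supervisor` stand in for LLM calls. The supervisor returns
    # (accept, critique); we refine for at most R rounds, so the
    # scheme is training-free and bounded by the round budget.
    answer = student(problem, critique=None)
    for _ in range(R):
        accept, critique = supervisor(problem, answer)
        if accept:
            break
        answer = student(problem, critique=critique)
    return answer
```

The granularity of the supervisor's natural-language critique, rather than chain depth or sample count, is the axis being scaled.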

#42

On the Role of Preprocessing and Memristor Dynamics in Reservoir Computing for Image Classification

Recurrent & Linear Attention 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning) · arXiv cs.NE (Neural & Evolutionary Computing)
Rishona Daniels, Duna Wattad, Ronny Ronen, David Saad, Shahar Kvatinsky
5.0
I 4.0 Im 4.0 P 5.5

Reservoir computing (RC) is an emerging recurrent neural network architecture that has attracted growing attention for its low training cost and modest hardware requirements. Memristor-based circuits are particularly promising for RC, as their intrinsic dynamics can reduce network size and parameter overhead in tasks such as time-series prediction and image recognition. Although RC has been demonstrated with several memristive devices, a comprehensive evaluation of device-level requirements remains limited.

quantization
#43

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Interpretability 2026-04-23 arXiv cs.CL (Computation & Language)
Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek…
5.0
I 5.0 Im 5.2 P 3.5

Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors.
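The WER criticism is easy to demonstrate: word-level edit distance charges a meaning-flipping substitution and a harmless misspelling identically.

```python
def wer(ref, hyp):
    # Word error rate: word-level Levenshtein distance divided by the
    # number of reference words. Purely lexical, with no notion of
    # semantic damage.
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = d[j]
            d[j] = min(d[j] + 1,              # deletion
                       d[j - 1] + 1,          # insertion
                       prev + (rw != hw))     # substitution / match
            prev = cur
    return d[len(h)] / len(r)
```

Against the reference "do not turn left", the hypotheses "do not turn right" and "do nott turn left" both score a WER of 0.25, which is exactly the gap semantic and LLM-based metrics aim to close.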

#44

MathDuels: Evaluating LLMs as Problem Posers and Solvers

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language)
Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik
5.0
I 6.2 Im 4.0 P 3.5

As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions.

benchmark
#45

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

Agents & Tool Use 2026-04-23 arXiv cs.CL (Computation & Language)
Xinyu Zhang, Boxuan Zhang, Yuchen Wan, Lingling Zhang, YiXing Yao…
5.0
I 5.2 Im 5.2 P 3.5

While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard.

benchmark · agents
#46

Locating acts of mechanistic reasoning in student team conversations with mechanistic machine learning

Research 2026-04-23 arXiv cs.LG (Machine Learning) · arXiv — Mechanistic Interpretability
Kaitlin Gili, Mainak Nistala, Kristen Wendell, Michael C. Hughes
5.0
I 4.5 Im 4.7 P 4.5

STEM education researchers are often interested in identifying moments of students' mechanistic reasoning for deeper analysis, but have limited capacity to search through many team conversation transcripts to find segments with a high concentration of such reasoning. We offer a solution in the form of an interpretable machine learning model that outputs time-varying probabilities that individual students are engaging in acts of mechanistic reasoning, leveraging evidence from their own utterances as well as contributions from the rest of the group.

mech-interp
#47

Interpretable facial dynamics as behavioral and perceptual traces of deepfakes

Multimodal 2026-04-23 arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision)
Timothy Joseph Murphy, Jennifer Cook, Hélio Clemente José Cuve
5.0
I 4.0 Im 5.2 P 4.5

Deepfake detection research has largely converged on deep learning approaches that, despite strong benchmark performance, offer limited insight into what distinguishes real from manipulated facial behavior. This study presents an interpretable alternative grounded in bio-behavioral features of facial dynamics and evaluates how computational detection strategies relate to human perceptual judgments. We identify core low-dimensional patterns of facial movement, from which temporal features characterizing spatiotemporal structure were derived.

benchmark
#48

Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms

Efficiency 2026-04-23 arXiv cs.LG (Machine Learning)
Jiyan Song, Wenyang Wang, Chengcheng Yan, Zhiquan Han, Feifei Zhao
5.0
I 5.2 Im 5.2 P 3.5

In the treatment of complex diseases, treatment regimens using a single drug often yield limited efficacy and can lead to drug resistance. In contrast, combination drug therapies can significantly improve therapeutic outcomes through synergistic effects. However, experimentally validating all possible drug combinations is prohibitively expensive, underscoring the critical need for efficient computational prediction methods. Although existing approaches based on deep learning and graph neural networks (GNNs) have made considerable progress, challenges remain in reducing structural bias, improving generalization capability, and enhancing model interpretability.

benchmark · molecular
#49

UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Yanran Zhang, Wenzhao Zheng, Yifei Li, Bingyao Yu, Yu Zheng…
5.0
I 5.5 Im 5.0 P 3.5

In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges.

multimodal · finetune
#50

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao…
5.0
I 5.2 Im 5.5 P 3.5

Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs.

benchmark · vlm · video-gen
#51

Sapiens2

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Rawal Khirodkar, He Wen, Julieta Martinez, Yuan Dong, Su Zhaoen…
5.0
I 6.0 Im 4.7 P 3.5

We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives.

transformer · pretrain
#52
5.0
I 5.0 Im 4.0 P 4.7

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery.

benchmark
#53

Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

Multimodal 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision)
Bowen Liu, Li Yang, Shanshan Song, Mingyu Tang, Zhifang Gao…
4.9
I 5.5 Im 3.5 P 4.5

Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames.

#54

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Jiali Wei, Ming Fan, Guoheng Sun, Xicheng Zhang, Haijun Wang…
4.9
I 4.0 Im 4.7 P 4.5

The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline.

finetune
#55

Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales

Post-Training 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CL (Computation & Language)
Olufunke O. Sarumi, Charles Welch, Daniel Braun
4.9
I 4.0 Im 4.7 P 4.5

Beyond exploring disaggregated labels for modeling perspectives, annotator rationales provide fine-grained signals of individual perspectives. In this work, we propose a framework for jointly modeling annotator-specific label prediction and corresponding explanations, fine-tuned on the annotators' provided rationales. Using a dataset with disaggregated natural language inference (NLI) annotations and annotator-provided explanations, we condition predictions on both annotator identity and demographic metadata through a representation-level User Passport mechanism.

finetune
#56

Misinformation Span Detection in Videos via Audio Transcripts

Interpretability 2026-04-23 arXiv cs.CL (Computation & Language)
Breno Matos, Rennan C. Lima, Savvas Zannettou, Fabricio Benevenuto, Rodrygo L. T. Santos
4.9
I 5.0 Im 4.7 P 3.5

Online misinformation has become one of the most challenging issues of recent years, yielding severe consequences, including political polarization, attacks on democracy, and public health risks. Misinformation manifests in any platform with a large user base, including online social networks and messaging apps. It permeates all media and content forms, including images, text, audio, and video. Distinctly, video-based misinformation represents a multifaceted challenge for fact-checkers, given the ease with which individuals can record and upload videos on various video-sharing platforms.

#57

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language) · arXiv — Mechanistic Interpretability
Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar, Liam Dorn, Ahmed Haj Ahmed…
4.9
I 4.7 Im 4.0 P 4.5

Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition.

benchmark
#58

AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use

Agents & Tool Use 2026-04-23 arXiv cs.CL (Computation & Language)
Yuanjie Lyu, Chengyu Wang, Haonan Zheng, Yuanhao Yue, Junbing Yan…
4.9
I 5.2 Im 4.5 P 3.5

Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks.

benchmark · rl · agents · tool-use
#59
4.9
I 5.0 Im 4.7 P 3.5

Since software performance requirements are documented in natural language, quantifying them into mathematical forms is essential for software engineering. Yet, the vagueness of performance requirements and the uncertainty of human cognition make their interpretation highly ambiguous, rendering automated quantification an unaddressed and challenging problem. In this paper, we formalize the problem and propose IRAP, an approach that quantifies performance requirements into mathematical functions via interactive retrieval-augmented preference elicitation.

#60

PrismaDV: Automated Task-Aware Data Unit Test Generation

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Hao Chen, Arnab Phani, Sebastian Schelter
4.9
I 5.7 Im 4.0 P 3.5

Data is a central resource for modern enterprises, and data validation is essential for ensuring the reliability of downstream applications. However, existing automated data unit testing frameworks are largely task-agnostic: they validate datasets without considering the semantics and requirements of the code that consumes the data. We present PrismaDV, a compound AI system that analyzes downstream task code together with dataset profiles to identify data access patterns, infer implicit data assumptions, and generate task-aware executable data unit tests.

benchmark
#61

Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

Multimodal 2026-04-23 arXiv cs.LG (Machine Learning) · arXiv cs.CV (Computer Vision)
Wenxuan Bao, Yanjun Zhao, Xiyuan Yang, Jingrui He
4.9
I 4.0 Im 4.7 P 4.5

Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection.

benchmark · vlm · pretrain
#62

Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Hao-Yu Hsu, Tianhang Cheng, Jing Wen, Alexander G. Schwing, Shenlong Wang
4.9
I 5.5 Im 4.7 P 3.5

Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics.

#63

StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition

Research 2026-04-23 arXiv cs.CV (Computer Vision)
Kwan Yun, Changmin Lee, Ayeong Jeong, Youngseo Kim, Seungmi Lee…
4.9
I 5.2 Im 5.0 P 3.5

Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths.

benchmark · diffusion · finetune · pretrain
#64

Encoder-Free Human Motion Understanding via Structured Motion Descriptions

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Yao Zhang, Zhuchenyang Liu, Thomas Ploetz, Yu Xiao
4.9
I 5.0 Im 5.4 P 3.5

The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM's embedding space, remaining constrained by cross-modal representation and alignment.

pretrain
#65

Addressing Image Authenticity When Cameras Use Generative AI

Generative Media 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.CV (Computer Vision)
Umar Masud, Abhijith Punnappurath, Luxi Zhao, David B. Lindell, Michael S. Brown
4.8
I 4.0 Im 4.7 P 4.5

The ability of generative AI (GenAI) methods to photorealistically alter camera images has raised awareness about the authenticity of images shared online. Interestingly, images captured directly by our cameras are considered authentic and faithful. However, with the increasing integration of deep-learning modules into cameras' capture-time hardware -- namely, the image signal processor (ISP) -- there is now a potential for hallucinated content in images directly output by our cameras.

#66

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

Agents & Tool Use 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Anuj Sadani, Deepak Kumar
4.8
I 5.2 Im 4.5 P 3.5

The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the "MCP Tax" or "Tools Tax") that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost.
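A minimal sketch of dynamic tool gating with lazy schema loading, under the assumption of a cheap lexical relevance score; the paper's actual gating mechanism and data shapes may differ:

```python
def gate_tools(query, registry, k=3):
    # Dynamic tool gating sketch (all names hypothetical): rather than
    # eagerly injecting every tool schema each turn, score tools
    # against the query with a cheap lexical overlap and load full
    # schemas only for the top-k; the rest stay as one-line stubs that
    # can be expanded on demand.
    qwords = set(query.lower().split())

    def score(tool):
        return len(qwords & set(tool["description"].lower().split()))

    ranked = sorted(registry, key=score, reverse=True)
    loaded = [t["schema"] for t in ranked[:k]]
    stubs = [t["name"] for t in ranked[k:]]
    return loaded, stubs
```

Only the loaded schemas count against the per-turn context budget; the stubs cost a few tokens each instead of a full JSON schema.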

benchmark · agents · long-context
#67

Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach

Efficiency 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning)
Eli Gildish, Michael Grebshtein, Igor Makienko
4.8
I 5.0 Im 3.5 P 4.5

Denoising of periodic signals and accurate waveform estimation are core tasks across many signal processing domains, including speech, music, medical diagnostics, radio, and sonar. Although deep learning methods have recently shown performance improvements over classical approaches, they require substantial computational resources and are usually trained separately for each signal observation. This study proposes a computationally efficient method based on dilated convolutional neural networks (DCNNs) and re-sampling, termed R-DCNN, designed for operation under strict power and resource constraints. The approach targets signals with varying fundamental frequencies and requires only a single observation for training.

#68

CoFEE: Reasoning Control for LLM-Based Feature Discovery

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence) · arXiv cs.LG (Machine Learning)
Maximilian Westermann, Ben Griffin, Aaron Ontoyin Yin, Zakari Salifu, Yagiz Ihlamur…
4.8
I 4.5 Im 4.0 P 4.5

Feature discovery from complex unstructured data is fundamentally a reasoning problem: it requires identifying abstractions that are predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals. Ever-improving large language models (LLMs) are well suited to this task, as they can process large amounts of information, but unconstrained feature generation can lead to weak features; our approach provides a structured method for addressing this challenge. In this work, we study reasoning control in LLMs by inducing cognitive behaviors that improve feature discovery.

#69

A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Milan De Koning, Ali Asgari, Pouria Derakhshanfar, Annibale Panichella
4.8
I 5.0 Im 4.7 P 3.5

LLM-based automated program repair (APR) techniques have shown promising results in reducing debugging costs. However, prior results can be affected by data leakage: large language models (LLMs) may memorize bug fixes when evaluation benchmarks overlap with their pretraining data, leading to inflated performance estimates. In this paper, we investigate whether we can better reveal data leakage by combining metamorphic testing (MT) with negative log-likelihood (NLL), which has been used in prior work as a proxy for memorization.

benchmarkpretrain
#70

Measuring Opinion Bias and Sycophancy via LLM-based Coercion

Agents & Tool Use 2026-04-23 arXiv cs.CL (Computation & Language)arXiv — Mechanistic Interpretability
Rodrigo Nogueira, Giovana Kerche Bonás, Thales Sales Almeida, Andrea Roque, Ramon Pires…
4.8
I 4.5 Im 4.0 P 4.5

Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users' decisions. Eliciting a model's positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side.

agents
#71

Task-Driven Co-Design of Heterogeneous Multi-Robot Systems

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Maximilian Stralz, Meshal Alharbi, Yujun Huang, Gioele Zardini
4.8
I 4.5 Im 5.2 P 3.5

Designing multi-agent robotic systems requires reasoning across tightly coupled decisions spanning heterogeneous domains, including robot design, fleet composition, and planning. Much effort has been devoted to isolated improvements in these domains, whereas system-level co-design considering trade-offs and task requirements remains underexplored. In this work, we present a formal and compositional framework for the task-driven co-design of heterogeneous multi-robot systems.

robotagents
#72

Beyond Expected Information Gain: Stable Bayesian Optimal Experimental Design with Integral Probability Metrics and Plug-and-Play Extensions

Research 2026-04-23 arXiv cs.LG (Machine Learning)arXiv stat.ML (Statistical ML)
Di Wu, Ling Liang, Haizhao Yang
4.8
I 4.5 Im 4.0 P 4.5

Bayesian Optimal Experimental Design (BOED) provides a rigorous framework for decision-making tasks in which data acquisition is often the critical bottleneck, especially in resource-constrained settings. Traditionally, BOED typically selects designs by maximizing expected information gain (EIG), commonly defined through the Kullback-Leibler (KL) divergence. However, classical evaluation of EIG often involves challenging nested expectations, and even advanced variational methods leave the underlying log-density-ratio objective unchanged. As a result, support mismatch, tail underestimation, and rare-event sensitivity remain intrinsic concerns for KL-based BOED.

#73

A-IC3: Learning-Guided Adaptive Inductive Generalization for Hardware Model Checking

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Xiaofeng Zhou, Guangyu Hu, Hongce Zhang, Wei Zhang
4.8
I 5.0 Im 4.5 P 3.5

The IC3 algorithm represents the state-of-the-art (SOTA) hardware model checking technique, owing to its robust performance and scalability. A significant body of research has focused on enhancing the solving efficiency of the IC3 algorithm, with particular attention to the inductive generalization process: a critical phase wherein the algorithm seeks to generalize a counterexample to inductiveness (CTI), which typically is a state leading to a bad state, into a broader set of states.

benchmarkagents
#74

Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Avinash Paliwal, Adithya Iyer, Shivin Yadav, Muhammad Ali Afridi, Midhun Harikumar
4.8
I 5.0 Im 5.0 P 3.5

Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views.

diffusiontransformer
#75

Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Yingkai Yang, Chaoqi Chen, Hui Huang
4.8
I 5.7 Im 4.0 P 3.5

Test-Time Adaptation (TTA) aims to mitigate distributional shifts between training and test domains during inference time. However, existing TTA methods fall short in the realistic scenario where models face both continually changing domains and the simultaneous emergence of unknown semantic classes, a challenging setting we term Open-set Continual Test-Time Adaptation (OCTTA). The coupling of domain and semantic shifts often collapses the feature space, severely degrading both classification and out-of-distribution detection.

benchmark
#76

Seeing Fast and Slow: Learning the Flow of Time in Videos

Research 2026-04-23 HF ↑10 Hugging Face Daily Papers
4.8
I 4.0 Im 3.8 P 5.2

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos.

multimodalvideo-gen
#77

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

Research 2026-04-22 HF ↑5 Hugging Face Daily Papers
4.8
I 4.0 Im 4.5 P 4.3

Long-horizon interactive environments are a testbed for evaluating agents' skill-usage abilities. These environments demand multi-step reasoning, the chaining of multiple skills over many timesteps, and robust decision-making under delayed rewards and partial observability; games exhibit all of these properties. Large Language Models (LLMs) offer a promising alternative as game-playing agents, but they often struggle with consistent long-horizon decision-making because they lack a mechanism to discover, retain, and reuse structured skills across episodes.

benchmarkagents
#78

Large Language Models (LLMs) have demonstrated remarkable fluency and versatility across a wide range of NLP tasks, yet they remain prone to factual inaccuracies and hallucinations. This limitation poses significant risks in high-stakes domains such as healthcare, law, and scientific communication, where trust and verifiability are paramount. In this paper, we introduce DAVinCI - a Dual Attribution and Verification framework designed to enhance the factual reliability and interpretability of LLM outputs.

#79

Promoting Simple Agents: Ensemble Methods for Event-Log Prediction

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)
Benedikt Bollig, Matthias Függer, Thomas Nowak, Paul Zeinaty
4.7
I 4.0 Im 4.0 P 4.5

We compare lightweight automata-based models (n-grams) with neural architectures (LSTM, Transformer) for next-activity prediction in streaming event logs. Experiments on synthetic patterns and five real-world process mining datasets show that n-grams with appropriate context windows achieve comparable accuracy to neural models while requiring substantially fewer resources. Unlike windowed neural architectures, which show unstable performance patterns, n-grams provide stable and consistent accuracy. While we demonstrate that classical ensemble methods like voting improve n-gram performance, they require running many agents in parallel during inference, increasing memory consumption and latency.

transformeragents
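A minimal version of the n-gram baseline might look like this (the traces are invented, not drawn from the paper's process mining datasets):

```python
from collections import Counter, defaultdict

# Minimal n-gram next-activity predictor for event logs: count which activity
# follows each length-(n-1) context, then predict the most frequent follower.
class NGramPredictor:
    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)

    def fit(self, traces):
        for trace in traces:
            padded = ["<s>"] * (self.n - 1) + list(trace)
            for i in range(len(trace)):
                ctx = tuple(padded[i:i + self.n - 1])
                self.counts[ctx][padded[i + self.n - 1]] += 1

    def predict(self, prefix):
        ctx = tuple((["<s>"] * (self.n - 1) + list(prefix))[-(self.n - 1):])
        if not self.counts[ctx]:
            return None  # unseen context
        return self.counts[ctx].most_common(1)[0][0]

model = NGramPredictor(n=2)
model.fit([["register", "check", "approve"],
           ["register", "check", "reject"],
           ["register", "check", "approve"]])
print(model.predict(["register", "check"]))  # approve
```

The entire "model" is a nested count table, which is why its memory and latency footprint is so much smaller than an LSTM or Transformer trained on the same log.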
#80

DryRUN: On the Role of Public Tests in LLM-Driven Code Generation

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Kaushitha Silva, Srinath Perera
4.7
I 5.0 Im 4.5 P 3.5

Multi-agent frameworks are widely used in autonomous code generation and have applications in complex algorithmic problem-solving. Recent work has addressed the challenge of generating functionally correct code by incorporating simulation-driven planning and debugging, where language models trace execution steps to verify logic. However, these approaches depend on human-provided public test cases to ground the debugging and simulation loop. Manually authoring comprehensive input-output examples is a labor-intensive bottleneck in the software development lifecycle.

benchmarkagentscodegen
#81

Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)
Chris Schneider, Philipp Schoenegger, Ben Bariach
4.7
I 4.0 Im 4.0 P 4.5

Current model training approaches incorporate user information directly into shared weights, making individual data removal computationally infeasible without retraining. This paper presents a three-layer architecture that decouples personal data from shared weights by combining a static base model, composable domain-expert LoRA adapters that shape behavior without imparting user data, and per-user proxy artefacts whose deletion constitutes deterministic unlearning.

#82

Mapping the Political Discourse in the Brazilian Chamber of Deputies: A Multi-Faceted Computational Approach

Post-Training 2026-04-23 arXiv cs.CL (Computation & Language)
Flávio Soriano, Victoria F. Mello, Pedro B. Rigueira, Gisele L. Pappa, Wagner Meira…
4.7
I 4.5 Im 4.7 P 3.5

Analyses of legislative behavior often rely on voting records, overlooking the rich semantic and rhetorical content of political speech. In this paper, we ask three complementary questions about parliamentary discourse: how things are said, what is being said, and who is speaking in discursively similar ways. To answer these questions, we introduce a scalable and generalizable computational framework that combines diachronic stylometric analysis, contextual topic modeling, and semantic clustering of deputies' speeches.

#83

EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language)
Praval Sharma, Ashok Samal, Leen-Kiat Soh, Deepti Joshi
4.7
I 5.0 Im 4.0 P 3.5

Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified datasets in open-domain settings. To address these limitations, we create EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset.

benchmark
#84

Language as a Latent Variable for Reasoning Optimization

Post-Training 2026-04-23 arXiv cs.CL (Computation & Language)
Linjuan Wu, Haoran Wei, Jialong Tang, Shuang Luo, Baosong Yang…
4.7
I 5.0 Im 4.0 P 3.5

As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions.

benchmarkcot
#85

Finding Meaning in Embeddings: Concept Separation Curves

Interpretability 2026-04-23 arXiv cs.CL (Computation & Language)
Paul Keuren, Marc Ponsen, Robert Ayoub Bagheri
4.7
I 4.0 Im 5.2 P 3.5

Sentence embedding techniques aim to encode key concepts of a sentence's meaning in a vector space. However, the majority of evaluation approaches for sentence embedding quality rely on the use of additional classifiers or downstream tasks. These additional components make it unclear whether good results stem from the embedding itself or from the classifier's behaviour. In this paper, we propose a novel method for evaluating the effectiveness of sentence embedding methods in capturing sentence-level concepts. Our approach is classifier-independent, allowing for an objective assessment of the model's performance.

#86

Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models

Generative Media 2026-04-23 arXiv cs.CL (Computation & Language)
Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand, Mitesh M. Khapra
4.7
I 5.2 Im 4.0 P 3.5

Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains underexplored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity.

benchmarkvlmt2i
#87

Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Nicolae Filat, Ahmed Hussain, Konstantinos Kalogiannis, Elena Burceanu
4.7
I 5.2 Im 4.0 P 3.5

Streaming Continual Learning (CL) typically converts a continuous stream into a sequence of discrete tasks through temporal partitioning. We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark conclusions.

benchmark
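The paper's point can be illustrated with a toy stream: two equally valid temporal splits of the same label sequence induce different task structures, and therefore different benchmark regimes (the labels and cut points below are invented):

```python
# Two valid temporal partitions of one label stream yield different "tasks",
# illustrating why taskification is not a neutral preprocessing choice.
stream = ["a", "a", "b", "b", "c", "c"]  # class labels arriving over time

def taskify(stream, boundaries):
    """Split a stream at the given time indices into discrete tasks."""
    cuts = [0] + boundaries + [len(stream)]
    return [stream[i:j] for i, j in zip(cuts, cuts[1:])]

split_1 = taskify(stream, [2, 4])  # clean class-incremental tasks
split_2 = taskify(stream, [3])     # tasks with blurred class boundaries
print(split_1)  # [['a', 'a'], ['b', 'b'], ['c', 'c']]
print(split_2)  # [['a', 'a', 'b'], ['b', 'c', 'c']]
```

A method evaluated on `split_1` faces a class-incremental regime; the same stream under `split_2` becomes a domain-overlap regime, so conclusions can flip with the split.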
#88

Low-Rank Adaptation Redux for Large Models

Efficiency 2026-04-23 arXiv cs.LG (Machine Learning)
Bingcong Li, Yilang Zhang, Georgios B. Giannakis
4.7
I 5.0 Im 4.2 P 3.5

Low-rank adaptation (LoRA) has emerged as the de facto standard for parameter-efficient fine-tuning (PEFT) of foundation models, enabling the adaptation of billion-parameter networks with minimal computational and memory overhead. Despite its empirical success and rapid proliferation of variants, it remains elusive which architectural choices, optimization techniques, and deployment constraints should guide practical method selection.

finetune
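As background, the standard LoRA parameterization the paper surveys can be sketched in a few lines (the shapes and scaling convention follow the usual formulation; the dimensions and data are arbitrary):

```python
import numpy as np

# Minimal LoRA update: W' = W + (alpha / r) * B @ A, training only A and B.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 4, 2, 8

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus low-rank correction; since B = 0 at init, the adapter
    # starts as an exact no-op and the model output is unchanged.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # no-op at initialization
print("trainable params:", A.size + B.size, "vs full:", W.size)
```

The trainable parameter count scales with `r * (d_in + d_out)` rather than `d_in * d_out`, which is the whole efficiency argument for PEFT.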
#89

Revealing Geography-Driven Signals in Zone-Level Claim Frequency Models: An Empirical Study using Environmental and Visual Predictors

Research 2026-04-23 arXiv cs.LG (Machine Learning)arXiv stat.ML (Statistical ML)
Sherly Alfonso-Sánchez, Cristián Bravo, Kristina G. Stankova
4.7
I 4.0 Im 4.2 P 4.5

Geographic context is often considered relevant to motor insurance risk, yet public actuarial datasets provide limited location identifiers, constraining how this information can be incorporated and evaluated in claim-frequency models. This study examines how geographic information from alternative data sources can be incorporated into actuarial models for Motor Third Party Liability (MTPL) claim prediction under such constraints. Using the BeMTPL97 dataset, we adopt a zone-level modeling framework and evaluate predictive performance on unseen postcodes.

transformerpretrain
#90

GFlowState: Visualizing the Training of Generative Flow Networks Beyond the Reward

Reinforcement Learning 2026-04-23 arXiv cs.LG (Machine Learning)
Florian Holeczek, Andreas Hinterreiter, Alex Hernandez-Garcia, Marc Streit, Christina Humer
4.7
I 4.5 Im 4.7 P 3.5

We present GFlowState, a visual analytics system designed to illuminate the training process of Generative Flow Networks (GFlowNets or GFNs). GFlowNets are a probabilistic framework for generating samples proportionally to a reward function. While GFlowNets have proved to be powerful tools in applications such as molecule and material discovery, their training dynamics remain difficult to interpret. Standard machine learning tools allow metric tracking but do not reveal how models explore the sample space, construct sample trajectories, or shift sampling probabilities during training.

#91

Transferable SCF-Acceleration through Solver-Aligned Initialization Learning

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Eike S. Eberhard, Viktor Kotsev, Timm Güthle, Stephan Günnemann
4.7
I 5.5 Im 3.5 P 3.5

Machine learning methods that predict initial guesses from molecular geometry can reduce the cost of self-consistent field (SCF) calculations, but matrix-prediction models fail when extrapolating to larger molecules, degrading rather than accelerating convergence [Liu et al. 2025]. We show that this failure is a supervision problem, not an extrapolation problem: models trained on ground-state targets fit those targets well out of distribution, yet produce initial guesses that slow convergence. Solver-Aligned Initialization Learning (SAIL) resolves this for both Hamiltonian and density matrix models by differentiating through the SCF solver end-to-end.

molecular
#92

A Kernel Nonconformity Score for Multivariate Conformal Prediction

Research 2026-04-23 arXiv cs.LG (Machine Learning)arXiv stat.ML (Statistical ML)
Louis Meyer, Wenkai Xu
4.7
I 4.5 Im 3.5 P 4.5

Multivariate conformal prediction requires nonconformity scores that compress residual vectors into scalars while preserving certain implicit geometric structure of the residual distribution. We introduce a Multivariate Kernel Score (MKS) that produces prediction regions that explicitly adapt to this geometry. We show that the proposed score resembles the Gaussian process posterior variance, unifying Bayesian uncertainty quantification with frequentist-type coverage guarantees. Moreover, the MKS can be decomposed into an anisotropic Maximum Mean Discrepancy (MMD) that interpolates between kernel density estimation and covariance-weighted distance.
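For context, the split-conformal recipe any such score plugs into looks like this (using a plain Euclidean norm as a stand-in for the paper's kernel score, on synthetic residuals):

```python
import numpy as np

# Split conformal prediction with a generic scalar nonconformity score.
# The norm below is a simple stand-in for the paper's MKS; data are synthetic.
rng = np.random.default_rng(1)
residuals = rng.normal(size=(200, 3))       # calibration residual vectors
scores = np.linalg.norm(residuals, axis=1)  # compress vectors to scalars

alpha = 0.1
n = len(scores)
# Finite-sample-corrected quantile: the ceil((n+1)(1-alpha))-th order statistic.
k = int(np.ceil((n + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]

# A new point is covered iff its score is <= q; this gives marginal
# coverage >= 1 - alpha regardless of the residual distribution.
new_residual = rng.normal(size=3)
print(np.linalg.norm(new_residual) <= q)
```

The score function is the only moving part: swapping the norm for a kernel-based score changes the shape of the prediction regions without touching the coverage machinery.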

#93

Local Neighborhood Instability in Parametric Projections: Quantitative and Visual Analysis

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)arXiv — Mechanistic Interpretability
Frederik L. Dennig, Daniel A. Keim
4.7
I 4.0 Im 4.0 P 4.5

Parametric projections let analysts embed new points in real time, but input variations from measurement noise or data drift can produce unpredictable shifts in the 2D layout. Whether and where a projection is locally stable remains largely unexamined. In this paper, we present a stability evaluation framework that probes parametric projections with Gaussian perturbations around selected anchor points and assesses how neighborhoods deform in the 2D embedding.
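The probing procedure can be sketched with a stand-in linear projection (the projection matrix, noise scale, and anchor below are all arbitrary, not the paper's setup):

```python
import numpy as np

# Probe a parametric projection's local stability: perturb an anchor point
# with Gaussian noise and measure how far its 2D embedding moves.
rng = np.random.default_rng(0)
P = rng.normal(size=(2, 10))  # stand-in for a trained parametric projection

anchor = rng.normal(size=10)
noise = rng.normal(scale=0.01, size=(100, 10))  # 100 Gaussian probes

base = P @ anchor                    # anchor's 2D embedding
perturbed = (anchor + noise) @ P.T   # embeddings of the perturbed copies
shifts = np.linalg.norm(perturbed - base, axis=1)

# A large mean shift relative to the input noise scale flags local instability.
print(round(float(shifts.mean()), 4))
```

Repeating this at many anchors yields a stability map over the embedding, which is the quantity the framework visualizes.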

#94
4.7
I 4.0 Im 4.5 P 4.0

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence.

benchmarkvlaagentscoding
#95

Will fusion power get cheap? Don’t count on it.

Safety & policy 2026-04-23 MIT Technology Review — AI
4.7
I 4.0 Im 4.5 P 4.3

Fusion power could provide a steady, zero-emissions source of electricity in the future—if companies can get plants built and running. But a new study suggests that even if that future arrives, it might not come cheap. Technologies tend to get less expensive over time. Lithium-ion batteries are now about 90% cheaper than they were in 2013. But historically, different technologies tend to go through this curve at different rates. And the cost of fusion might not sink as quickly as the prices of batteries or solar.
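The learning-curve arithmetic behind that claim is Wright's-law compounding: cost falls by a fixed fraction with every doubling of cumulative production (the learning rates below are illustrative, not the study's estimates):

```python
# Wright's-law learning curve: unit cost falls by a fixed fraction (the
# learning rate) per doubling of cumulative production.
def cost_after_doublings(initial_cost, learning_rate, doublings):
    return initial_cost * (1 - learning_rate) ** doublings

# A 20% learning rate cuts cost ~89% over 10 doublings; 5% cuts only ~40%.
fast = cost_after_doublings(100.0, 0.20, 10)
slow = cost_after_doublings(100.0, 0.05, 10)
print(round(fast, 1), round(slow, 1))  # 10.7 59.9
```

The gap between those two trajectories is the whole argument: if fusion sits on the slow curve, decades of deployment still leave it expensive.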

#96

Pentagon uses GenAI.mil to create 100K agents

Government & Defense 2026-04-23 DefenseScoop
4.7
I 4.0 Im 4.5 P 4.3

Defense officials recently used the Pentagon’s enterprise-wide generative artificial intelligence platform to create 100,000 agents amid a broader push by department leadership to speed up AI adoption, according to a senior member of the research and engineering directorate. The Pentagon first introduced its GenAI.mil platform for its workforce in December, with the aim of providing commercial tools to millions of personnel across the DOD. Defense Secretary Pete Hegseth and CTO Emil Michael have both championed the capability and encouraged its widespread use.

#98

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

Agents & Tool Use 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Bartosz Balis, Michal Orzechowski, Piotr Kica, Michal Dygas, Michal Kuszewski
4.6
I 4.0 Im 5.2 P 3.5

Scientific workflow systems automate execution -- scheduling, fault tolerance, resource management -- but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: an LLM interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow DAGs (deterministic layer); and domain experts author "Skills": markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer).

agentscoding
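The deterministic layer of that three-way split can be sketched as a registry of generators mapping validated intents to DAGs (the intent fields, generator name, and commands are all hypothetical, invented for illustration):

```python
# Hypothetical sketch of the deterministic layer: a validated intent dict is
# mapped to a reproducible workflow DAG by a registered generator, keeping the
# LLM out of the execution path.
GENERATORS = {}

def generator(intent_type):
    def register(fn):
        GENERATORS[intent_type] = fn
        return fn
    return register

@generator("alignment")
def make_alignment_dag(intent):
    ref = intent["reference"]
    return {
        "fetch":  {"deps": [],        "cmd": f"download {ref}"},
        "align":  {"deps": ["fetch"], "cmd": f"align --ref {ref}"},
        "report": {"deps": ["align"], "cmd": "summarize"},
    }

intent = {"type": "alignment", "reference": "GRCh38"}  # from the LLM layer
dag = GENERATORS[intent["type"]](intent)
print(sorted(dag))  # ['align', 'fetch', 'report']
```

Because the generator is deterministic, the same intent always yields the same DAG, which is what makes the resulting workflows reproducible.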
#99

Modulating Cross-Modal Convergence with Single-Stimulus, Intra-Modal Dispersion

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Eghbal A. Hosseini, Brian Cheung, Evelina Fedorenko, Alex H. Williams
4.6
I 4.5 Im 4.7 P 3.5

Neural networks exhibit a remarkable degree of representational convergence across diverse architectures, training objectives, and even data modalities. This convergence is predictive of alignment with brain representation. A recent hypothesis suggests this arises from learning the underlying structure in the environment in similar ways. However, it is unclear how individual stimuli elicit convergent representations across networks. An image can be perceived in multiple ways and expressed differently using words. Here, we introduce a methodology based on the Generalized Procrustes Algorithm to measure intra-modal representational convergence at the single-stimulus level.

#100

SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth, Domain Adaptation and Super-Resolution in Aerial Imagery

Generative Media 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.CV (Computer Vision)
Safouane El Ghazouali, Nicola Venturi, Michael Rueegsegger, Umberto Michelucci
4.6
I 4.0 Im 4.0 P 4.5

Recent advances in deep learning for remote sensing rely heavily on large annotated datasets, yet acquiring high-quality ground truth for geometric, radiometric, and multi-domain tasks remains costly and often infeasible. In particular, the lack of accurate depth annotations, controlled illumination variations, and multi-scale paired imagery limits progress in monocular depth estimation, domain adaptation, and super-resolution for aerial scenes. We present SyMTRS, a large-scale synthetic dataset generated using a high-fidelity urban simulation pipeline.

benchmark
#101

Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Yvon K. Awuklu, Meghyn Bienvenu, Katsumi Inoue, Vianney Jouhet, Fleur Mougin
4.6
I 4.0 Im 5.2 P 3.5

In this paper, we develop a novel logic-based approach to detecting high-level temporally extended events from timestamped data and background knowledge. Our framework employs logical rules to capture existence and termination conditions for simple temporal events and to combine these into meta-events. In the medical domain, for example, disease episodes and therapies are inferred from timestamped clinical observations, such as diagnoses and drug administrations stored in patient records, and can be further combined into higher-level disease events.

#102
4.6
I 4.0 Im 5.2 P 3.5

The increasing integration of artificial intelligence (AI) in higher education has raised important questions regarding students' transparency in reporting AI-assisted work. This study investigates the psychological mechanisms underlying university students' willingness to disclose AI use by applying the Cognition--Affect--Conation (CAC) framework. A sequential explanatory mixed-methods design was employed. In the quantitative phase, survey data were collected from 546 university students and analysed using structural equation modelling to examine the relationships among cognitive perceptions, affective responses, and disclosure intention.

#103

Causal Disentanglement for Full-Reference Image Quality Assessment

Generative Media 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.CV (Computer Vision)
Zhen Zhang, Jielei Chu, Tian Zhang, Weide Liu, Fengmao Lv…
4.6
I 4.0 Im 4.0 P 4.5

Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images.

benchmark
#104

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Songen Gu, Yuhang Zheng, Weize Li, Yupeng Zheng, Yating Feng…
4.6
I 4.7 Im 4.3 P 3.5

Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when training with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning.

benchmarkdiffusionrobotmanipulation
#105

Ufil: A Unified Framework for Infrastructure-based Localization

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Simon Schäfer, Lucas Hegerath, Marius Molz, Massimo Marcon, Bassam Alrifaee
4.6
I 4.5 Im 4.7 P 3.5

Infrastructure-based localization enhances road safety and traffic management by providing state estimates of road users. Development is hindered by fragmented, application-specific stacks that tightly couple perception, tracking, and middleware. We introduce Ufil, a Unified Framework for Infrastructure-Based Localization with a standardized object model and reusable multi-object tracking components. Ufil offers interfaces and reference implementations for prediction, detection, association, state update, and track management, allowing researchers to improve components without reimplementing the pipeline. Ufil is open-source C++/ROS 2 software with documentation and executable examples.

#106

From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Yiming Zhong, Yaoyu He, Zemin Yang, Pengfei Tian, Yifan Huang…
4.6
I 4.0 Im 5.0 P 3.5

Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization.

diffusionvlarobot
#107

From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Katharina Prasse, Steffen Jung, Isaac Bravo, Stefanie Walter, Patrick Knab…
4.6
I 5.5 Im 4.0 P 3.5

Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation.

benchmarkvlmcot
#108

Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Minghao Yin, Wenbo Hu, Jiale Xu, Ying Shan, Kai Han
4.6
I 5.0 Im 4.5 P 3.5

Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask.

diffusiontransformerpretrain
#109

Extract PDF text in your browser with LiteParse for the web

Frontier LLMs 2026-04-23 Simon Willison's Weblog
4.5
I 4.5 Im 3.5 P 5.0

LlamaIndex have a most excellent open source project called LiteParse, which provides a Node.js CLI tool for extracting text from PDFs. I got a version of LiteParse working entirely in the browser, using most of the same libraries that LiteParse uses to run in Node.js. Refreshingly, LiteParse doesn't use AI models to do what it does: it's good old-fashioned PDF parsing, falling back to Tesseract OCR (or other pluggable OCR engines) for PDFs that contain images of text rather than the text itself.

#110

llm-openai-via-codex 0.1a0

Frontier LLMs 2026-04-23 Simon Willison's Weblog
4.5
I 4.5 Im 3.5 P 5.0

Release: llm-openai-via-codex 0.1a0. Hijacks your Codex CLI credentials to make API calls with LLM, as described in my post about GPT-5.5. Tags: openai, llm, codex-cli

#111

Millisecond Converter

Frontier LLMs 2026-04-24 Simon Willison's Weblog
4.5
I 4.5 Im 3.5 P 5.0

Tool: Millisecond Converter. LLM reports prompt durations in milliseconds and I got fed up of having to think about how to convert those to seconds and minutes. Tags: tools
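A minimal sketch of the kind of conversion such a tool performs (the formatting choices here are mine, not necessarily the tool's):

```python
# Convert millisecond durations into a human-readable minutes/seconds string.
def humanize_ms(ms):
    minutes, rem = divmod(ms, 60_000)
    seconds = rem / 1000
    return f"{minutes}m {seconds:.1f}s" if minutes else f"{seconds:.1f}s"

print(humanize_ms(807_000))  # 13m 27.0s
print(humanize_ms(4_250))    # 4.2s
```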

#112

Seeing Fast and Slow: Learning the Flow of Time in Videos

Generative Media 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.CV (Computer Vision)
Yen-Siang Wu, Rundong Luo, Jingsen Zhu, Tao Tu, Ali Farhadi…
4.5
I 4.0 Im 3.8 P 4.5

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos.

multimodalvideo-gen
#113

A Scale-Adaptive Framework for Joint Spatiotemporal Super-Resolution with Diffusion Models

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)
Max Defez, Filippo Quarenghi, Mathieu Vrac, Stephan Mandt, Tom Beucler
4.5
I 4.0 Im 3.8 P 4.5

Deep-learning video super-resolution has progressed rapidly, but climate applications typically super-resolve (increase resolution) either space or time, and joint spatiotemporal models are often designed for a single pair of super-resolution (SR) factors (upscaling spatial and temporal ratio between the low-resolution sequence and the high-resolution sequence), limiting transfer across spatial resolutions and temporal cadences (frame rates).

diffusionlong-context
#114

Bounding the Black Box: A Statistical Certification Framework for AI Risk Regulation

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Natan Levy, Gadi Perl
4.5
I 4.0 Im 4.7 P 3.5

Artificial intelligence now decides who receives a loan, who is flagged for criminal investigation, and whether an autonomous vehicle brakes in time. Governments have responded: the EU AI Act, the NIST Risk Management Framework, and the Council of Europe Convention all demand that high-risk systems demonstrate safety before deployment. Yet beneath this regulatory consensus lies a critical vacuum: none specifies what "acceptable risk" means in quantitative terms, and none provides a technical method for verifying that a deployed system actually meets such a threshold.

#115

Alignment has a Fantasia Problem

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Nathanael Jo, Zoe De Simone, Mitchell Gordon, Ashia Wilson
4.5
I 4.0 Im 4.7 P 3.5

Modern AI assistants are trained to follow instructions, implicitly assuming that users can clearly articulate their goals and the kind of assistance they need. Decades of behavioral research, however, show that people often engage with AI systems before their goals are fully formed. When AI systems treat prompts as complete expressions of intent, they can appear to be useful or convenient, but not necessarily aligned with the users' needs. We call these failures Fantasia interactions.

#116

Probably Approximately Consensus: On the Learning Theory of Finding Common Ground

Efficiency 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)
Carter Blair, Ben Armstrong, Shiri Alouf-Heffetz, Nimrod Talmon, Davide Grossi
4.5
I 4.0 Im 3.5 P 4.5

A primary goal of online deliberation platforms is to identify ideas that are broadly agreeable to a community of users through their expressed preferences. Yet, consensus elicitation should ideally extend beyond the specific statements provided by users and should incorporate the relative salience of particular topics. We address this issue by modelling consensus as an interval in a one-dimensional opinion space derived from potentially high-dimensional data via embedding and dimensionality reduction.
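One toy reading of "consensus as an interval" is to find the interval of fixed width that contains the most user opinions in the 1-D space. This sliding-scan sketch is illustrative only and omits the paper's salience-aware formulation:

```python
def densest_interval(opinions, width):
    """Return (start, count) for the width-long interval in a 1-D
    opinion space that covers the most opinions. A simplified model
    of interval consensus, not the paper's estimator."""
    pts = sorted(opinions)
    best_count, best_start = 0, pts[0] if pts else 0.0
    for i, lo in enumerate(pts):
        # count opinions falling inside [lo, lo + width]
        count = sum(1 for p in pts[i:] if p <= lo + width)
        if count > best_count:
            best_count, best_start = count, lo
    return best_start, best_count
```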

#117

Why are all LLMs Obsessed with Japanese Culture? On the Hidden Cultural and Regional Biases of LLMs

Post-Training 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.CL (Computation & Language)
Joseba Fernandez de Landa, Carla Perez-Almendros, Jose Camacho-Collados
4.5
I 4.0 Im 3.5 P 4.5

LLMs have been showing limitations when it comes to cultural coverage and competence, and in some cases show regional biases such as amplifying Western and Anglocentric viewpoints. While there have been works analysing the cultural capabilities of LLMs, there has not been specific work on highlighting LLM regional preferences when it comes to cultural-related questions. In this work, we propose a new dataset based on a comprehensive taxonomy of Culture-Related Open Questions (CROQ).

finetune
#118

Fairness under uncertainty in sequential decisions

Reinforcement Learning 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)
Michelle Seng Ah Lee, Kirtan Padh, David Watson, Niki Kilbertus, Jatinder Singh
4.5
I 4.0 Im 3.5 P 4.5

Fair machine learning (ML) methods help identify and mitigate the risk that algorithms encode or automate social injustices. Algorithmic approaches alone cannot resolve structural inequalities, but they can support socio-technical decision systems by surfacing discriminatory biases, clarifying trade-offs, and enabling governance. Although fairness is well studied in supervised learning, many real ML applications are online and sequential, with prior decisions informing future ones.

rl
#119

Hybrid Deep Learning Approach for Coupled Demand Forecasting and Supply Chain Optimization

Efficiency 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.LG (Machine Learning)
Nusrat Yasmin Nadia, Md Habibul Arif, Habibor Rahman Rabby, Md Iftekhar Monzur Tanvir, Md. Jakir Hossen…
4.5
I 4.0 Im 3.5 P 4.5

Supply chain resilience and efficiency are vital in industries characterized by volatile demand and uncertain supply, such as textiles and personal protective equipment (PPE). Traditional forecasting and optimization approaches often operate in isolation, limiting their real-world effectiveness. This paper proposes a Hybrid AI Framework for Demand-Supply Forecasting and Optimization (HAF-DS), which integrates a Long Short-Term Memory (LSTM)-based demand forecasting module with a mixed integer linear programming (MILP) optimization layer. The LSTM captures temporal and contextual demand dependencies, while the optimization layer prescribes cost-efficient replenishment and allocation decisions.

#120

Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language)
Yuto Nishida, Naoki Shikoda, Yosuke Kishinami, Ryo Fujii, Makoto Morishita…
4.5
I 4.5 Im 4.0 P 3.5

Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms.

#121

Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

Post-Training 2026-04-23 arXiv cs.CL (Computation & Language)
Jiseon Kim, Jea Kwon, Luiz Felipe Vecchietti, Wenchao Dong, Jaehong Kim…
4.5
I 4.0 Im 4.7 P 3.5

Human moral judgment is context-dependent and modulated by interpersonal relationships. As large language models (LLMs) increasingly function as decision-support systems, determining whether they encode these social nuances is critical. We characterize machine behavior using the Whistleblower's Dilemma by varying two experimental dimensions: crime severity and relational closeness. Our study evaluates three distinct perspectives: (1) moral rightness (prescriptive norms), (2) predicted human behavior (descriptive social expectations), and (3) autonomous model decision-making.

#122

SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)
Hans Ole Hatzel, Ekaterina Artemova, Haimo Paul Stiemer, Evelyn Gius, Chris Biemann
4.5
I 4.5 Im 4.2 P 3.5

We present the shared task on narrative similarity and narrative representation learning - NSNRL (pronounced "nass-na-rel"). The task operationalizes narrative similarity as a binary classification problem: determining which of two stories is more similar to an anchor story. We introduce a novel definition of narrative similarity, compatible with both narrative theory and intuitive judgment. Based on the similarity judgments collected under this concept, we also evaluate narrative embedding representations.

finetunepretrain
#123

Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers

Interpretability 2026-04-23 arXiv cs.CL (Computation & Language)
Bernard Muller, Antonio Armando Ortiz Barrañón, LaVonne Roberts
4.5
I 4.0 Im 4.7 P 3.5

We previously introduced a training-free method for dysarthria severity assessment based on d-prime separability of phonological feature subspaces in frozen self-supervised speech representations, validated on 890 speakers across 5 languages with HuBERT-base. Here, we scale the analysis to 3,374 speakers from 25 datasets spanning 12 languages and 5 aetiologies (Parkinson's disease, cerebral palsy, ALS, Down syndrome, and stroke), plus healthy controls, using 6 SSL backbones. We report three findings.
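The d-prime separability the method relies on is the standard sensitivity index: mean separation in pooled-standard-deviation units. A minimal version (how the paper aggregates it over phonological feature subspaces is not reproduced here):

```python
import math

def d_prime(xs, ys):
    """Sensitivity index d' between two score samples:
    |mean difference| divided by the pooled standard deviation."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / len(xs)
    vy = sum((y - my) ** 2 for y in ys) / len(ys)
    return abs(mx - my) / math.sqrt(0.5 * (vx + vy))
```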

#124

Multilinguality at the Edge: Developing Language Models for the Global South

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)
Lester James V. Miranda, Songbo Hu, Roi Reichart, Anna Korhonen
4.5
I 5.0 Im 3.5 P 3.5

Where and how language models (LMs) are deployed determines who can benefit from them. However, there are several challenges that prevent effective deployment of LMs in non-English-speaking and hardware constrained communities in the Global South. We call this challenge the last mile: the intersection of multilinguality and edge deployment, where the goals are aligned but the technical requirements often compete.

#125

From Tokens to Concepts: Leveraging SAE for SPLADE

Research 2026-04-23 arXiv cs.CL (Computation & Language)arXiv — Mechanistic Interpretability
Yuxuan Zong, Mathias Vast, Basile Van Cooten, Laure Soulier, Benjamin Piwowarski
4.5
I 4.0 Im 3.5 P 4.5

Learned Sparse IR models, such as SPLADE, offer an excellent efficiency-effectiveness tradeoff. However, they rely on the underlying backbone vocabulary, which might hinder performance (polysemicity and synonymy) and pose a challenge for multi-lingual and multi-modal usages. To solve this limitation, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these 2 concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models.

#126

Cross-Domain Data Selection and Augmentation for Automatic Compliance Detection

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)arXiv cs.LG (Machine Learning)
Fariz Ikhwantri, Dusica Marijan
4.5
I 4.0 Im 3.5 P 4.5

Automating the detection of regulatory compliance remains a challenging task due to the complexity and variability of legal texts. Models trained on one regulation often fail to generalise to others. This limitation underscores the need for principled methods to improve cross-domain transfer. We study data selection as a strategy to mitigate negative transfer in compliance detection framed as a natural language inference (NLI) task. Specifically, we evaluate four approaches for selecting augmentation data from a larger source domain: random sampling, Moore-Lewis's cross-entropy difference, importance weighting, and embedding-based retrieval.

#127

AI-Gram: When Visual Agents Interact in a Social Network

Agents & Tool Use 2026-04-23 arXiv cs.CL (Computation & Language)
Andrew Shin
4.5
I 4.5 Im 4.0 P 3.5

We present AI-Gram, a live platform enabling image-based interactions, to study social dynamics in a fully autonomous multi-agent visual network where all participants are LLM-driven agents. Using the platform, we conduct experiments on how agents communicate and adapt through visual media, and observe the spontaneous emergence of visual reply chains, indicating rich communicative structure. At the same time, agents exhibit aesthetic sovereignty (resisting stylistic convergence toward social partners), anchoring under adversarial influence, and a decoupling between visual similarity and social ties.

agents
#128

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Agents & Tool Use 2026-04-23 arXiv cs.CL (Computation & Language)
Qijun Han, Haoqin Tu, Zijun Wang, Haoyue Dai, Yiyang Zhou…
4.5
I 4.0 Im 4.5 P 3.5

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence.

benchmarkvlaagentscoding
#129

Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Yaxuan Li, Zhongyi Zhou, Yefei Chen, Yanjiang Guo, Jiaming Liu…
4.5
I 4.0 Im 4.7 P 3.5

Post-training is essential for turning pretrained generalist robot policies into reliable task-specific controllers, but existing human-in-the-loop pipelines remain tied to physical execution: each correction requires robot time, scene setup, resets, and operator supervision in the real world. Meanwhile, action-conditioned world models have been studied mainly for imagination, synthetic data generation, and policy evaluation. We propose Human-in-the-World-Model (Hi-WM), a post-training framework that uses a learned world model as a reusable corrective substrate for failure-targeted policy improvement.

robotmanipulationpretrain
#130

SLAM as a Stochastic Control Problem with Partial Information: Optimal Solutions and Rigorous Approximations

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Ilir Gusija, Fady Alajaji, Serdar Yüksel
4.5
I 4.5 Im 4.2 P 3.5

Simultaneous localization and mapping (SLAM) is a foundational state estimation problem in robotics in which a robot accurately constructs a map of its environment while also localizing itself within this construction. We study the active SLAM problem through the lens of optimal stochastic control, thereby recasting it as a decision-making problem under partial information. After reviewing several commonly studied models, we present a general stochastic control formulation of active SLAM together with a rigorous treatment of motion, sensing, and map representation.

robot
#131

A Replicable Robotics Awareness Method Using LLM-Enabled Robotics Interaction: Evidence from a Corporate Challenge

Robotics 2026-04-23 arXiv cs.RO (Robotics)
S. A. Prieto, M. A. Gopee, Y. Ben Arab, B. García de Soto, J. Esteba…
4.5
I 4.0 Im 4.7 P 3.5

Large language models are increasingly being explored as interfaces between humans and robotic systems, yet there remains limited evidence on how such technologies can be used not only for interaction, but also as a structured means of introducing robotics to non-specialist users in real organizational settings. This paper introduces and evaluates a challenge-based method for robotics awareness, implemented through an LLM-enabled humanoid robot activity conducted with employees of AD Ports Group in the United Arab Emirates.

robothumanoid
#132

The Sample Complexity of Multicalibration

Research 2026-04-23 arXiv cs.LG (Machine Learning)arXiv stat.ML (Statistical ML)
Natalie Collina, Jiuyao Lu, Georgy Noarov, Aaron Roth
4.5
I 4.0 Im 3.5 P 4.5

We study the minimax sample complexity of multicalibration in the batch setting. A learner observes $n$ i.i.d. samples from an unknown distribution and must output a (possibly randomized) predictor whose population multicalibration error, measured by Expected Calibration Error (ECE), is at most $\varepsilon$ with respect to a given family of groups. For every fixed $\kappa > 0$, in the regime $|G| \le \varepsilon^{-\kappa}$, we prove that $\widetilde{\Theta}(\varepsilon^{-3})$ samples are necessary and sufficient, up to polylogarithmic factors.
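For intuition, a finite-sample group-wise ECE can be computed directly. This simplified estimator (binning by exact prediction value, taking the worst group) is illustrative and not the paper's construction:

```python
def group_ece(preds, labels, groups):
    """Worst-group Expected Calibration Error: within each group,
    average |prediction - mean label| over prediction-value bins,
    then take the maximum over groups."""
    worst = 0.0
    for g in groups:
        by_value = {}
        for i in g:  # bin sample indices by their predicted value
            by_value.setdefault(preds[i], []).append(labels[i])
        ece = sum(len(ys) * abs(v - sum(ys) / len(ys))
                  for v, ys in by_value.items())
        worst = max(worst, ece / len(g))
    return worst
```

Multicalibration asks that this quantity stay below $\varepsilon$ simultaneously for every group in the family.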

#133

Evaluating Post-hoc Explanations of the Transformer-based Genome Language Model DNABERT-2

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Isabel Kurth, Paulo Yanez Sarmiento, Bernhard Y. Renard
4.5
I 5.0 Im 3.5 P 3.5

Explaining deep neural network predictions on genome sequences enables biological insight and hypothesis generation, often of greater interest than predictive performance alone. While explanations of convolutional neural networks (CNNs) have been shown to capture relevant patterns in genome sequences, it is unclear whether this transfers to more expressive Transformer-based genome language models (gLMs). To answer this question, we adapt AttnLRP, an extension of layer-wise relevance propagation to the attention mechanism, and apply it to the state-of-the-art gLM DNABERT-2. Thereby, we propose strategies to transfer explanations from token and nucleotide level.

transformer
#134

A-THENA: Early Intrusion Detection for IoT with Time-Aware Hybrid Encoding and Network-Specific Augmentation

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Ioannis Panopoulos, Maria Lamprini A. Bartsioka, Sokratis Nikolaidis, Stylianos I. Venieris, Dimitra I. Kaklamani…
4.5
I 4.7 Im 4.0 P 3.5

The proliferation of Internet of Things (IoT) devices has significantly expanded attack surfaces, making IoT ecosystems particularly susceptible to sophisticated cyber threats. To address this challenge, this work introduces A-THENA, a lightweight early intrusion detection system (EIDS) that significantly extends preliminary findings on time-aware encodings. A-THENA employs an advanced Transformer-based architecture augmented with a generalized Time-Aware Hybrid Encoding (THE), integrating packet timestamps to effectively capture temporal dynamics essential for accurate and early threat detection. The proposed system further employs a Network-Specific Augmentation (NA) pipeline, which enhances model robustness and generalization.

benchmarktransformercoding
#135

Verifying Machine Learning Interpretability Requirements through Provenance

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Lynn Vonderhaar, Juan Couder, Daryela Cisneros, Omar Ochoa
4.5
I 4.0 Im 4.7 P 3.5

Machine Learning (ML) Engineering is a growing field that necessitates an increase in the rigor of ML development. It draws many ideas from software engineering and more specifically, from requirements engineering. Existing literature on ML Engineering defines quality models and Non-Functional Requirements (NFRs) specific to ML, in particular interpretability being one such NFR. However, a major challenge occurs in verifying ML NFRs, including interpretability.

#136

Dynamical Priors as a Training Objective in Reinforcement Learning

Reinforcement Learning 2026-04-23 arXiv cs.LG (Machine Learning)
Sukesh Subaharan
4.5
I 4.5 Im 4.0 P 3.5

Standard reinforcement learning (RL) optimizes policies for reward but imposes few constraints on how decisions evolve over time. As a result, policies may achieve high performance while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or degenerate inactivity. We introduce Dynamical Prior Reinforcement Learning (DP-RL), a training framework that augments policy gradient learning with an auxiliary loss derived from external state dynamics that implement evidence accumulation and hysteresis. Without modifying the reward, environment, or policy architecture, this prior shapes the temporal evolution of action probabilities during learning.

rlagents
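A generic auxiliary smoothness term over successive action distributions conveys the flavor of such a prior; the actual DP-RL loss is derived from evidence-accumulation and hysteresis dynamics, which this sketch does not reproduce:

```python
def temporal_coherence_penalty(prob_seq):
    """Auxiliary loss penalizing abrupt shifts in a policy's action
    probabilities across consecutive timesteps: mean squared change
    between adjacent distributions. A generic smoothness stand-in."""
    total = 0.0
    for prev, cur in zip(prob_seq, prob_seq[1:]):
        total += sum((p - q) ** 2 for p, q in zip(prev, cur))
    return total / max(1, len(prob_seq) - 1)
```

Added to the policy-gradient objective with a small weight, a term like this discourages oscillation without touching the reward, environment, or architecture.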
#137

A single algorithm for both restless and rested rotting bandits

Research 2026-04-23 arXiv stat.ML (Statistical ML)
Julien Seznec, Pierre Ménard, Alessandro Lazaric, Michal Valko
4.5
I 5.5 Im 3.5 P 3.5

In many application domains (e.g., recommender systems, intelligent tutoring systems), the rewards associated with the actions tend to decrease over time. This decay is either caused by the actions executed in the past (e.g., a user may get bored when songs of the same genre are recommended over and over) or by an external factor (e.g., content becomes outdated). These two situations can be modeled as specific instances of the rested and restless bandit settings, where arms are rotting (i.e., their value decreases over time).

#138

Grounding Video Reasoning in Physical Signals

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Alibay Osmanli, Zixu Cheng, Shaogang Gong
4.5
I 5.2 Im 4.0 P 3.5

Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what-when-where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU.

benchmark
#139

TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Yongqi Li…
4.5
I 4.0 Im 5.2 P 3.5

Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment.

benchmarkmultimodal
#140
4.5
I 4.0 Im 4.7 P 3.5

Wilson coefficients in dimension-six effective field theory are constrained in a combined fit to several ATLAS measurements. These inputs probe Higgs-boson processes across multiple production and decay modes, di-Higgs signatures in the $b\bar{b}\gamma\gamma$ and $b\bar{b}\tau\tau$ final states, $WW$ and $WZ$ diboson signatures, electroweak $Zjj$ final states, high-mass Drell-Yan interactions, and top-antitop events in both resolved and boosted topologies. Precision electroweak observables from LEP, SLD, and ATLAS are also included.

#141

Hybrid Policy Distillation for LLMs

Research 2026-04-22 HF ↑8 Hugging Face Daily Papers
4.5
I 4.0 Im 3.5 P 4.8

Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling.

distillation
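The forward/reverse-KL blend at the heart of HPD can be illustrated on a single token's distribution; the fixed convex weighting below is an assumption for illustration, not the paper's scheme:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hybrid_kd_loss(teacher, student, alpha=0.5):
    """Blend forward KL(teacher || student), which is mode-covering,
    with reverse KL(student || teacher), which is mode-seeking."""
    return alpha * kl(teacher, student) + (1 - alpha) * kl(student, teacher)
```

At `alpha=1` this recovers standard forward-KL distillation; at `alpha=0` it is purely mode-seeking.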
#142

The Download: introducing the Nature issue

Safety & policy 2026-04-23 MIT Technology Review — AI
4.5
I 4.0 Im 4.0 P 4.3

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology. Introducing: the Nature issue. When we talk about “nature,” we usually mean something untouched by humans. But little of that world exists today. From microplastics in rainforest wildlife to artificial light in the Arctic Ocean, human influence now reaches every corner of Earth. In this context, what even is nature?

#143

Quoting Maggie Appleton

Frontier LLMs 2026-04-23 Simon Willison's Weblog
4.5
I 4.0 Im 4.0 P 4.3

[...] if you ever needed another reason to learn in public by digital gardening or podcasting or streaming or whathaveyou, add on that people will assume you’re more competent than you are. This will get you invites to very cool exclusive events filled with high-achieving, interesting people, even though you have no right to be there. A+ side benefit. — Maggie Appleton, Gathering Structures (via). Tags: blogging, maggie-appleton

#144
4.5
I 4.0 Im 4.0 P 4.3

It’s no secret that Operation Epic Fury and the associated war in the Middle East have sparked major disruptions in the complex, global logistics network that U.S. Transportation Command relies on to move, equip and support the joint force. But according to its commander Air Force Gen. Randall Reed, those disturbances are also enabling Transcom and its military partners to integrate and refine their joint logistics operations, and expand deployments of real-time data and AI-enabled visualization assets.

#147

EXCLUSIVE: Lockheed exits Navy trainer aircraft competition

Government & Defense 2026-04-23 Breaking Defense
4.5
I 4.0 Im 4.0 P 4.3

After the surprise move, the field of competitors for the Navy’s Undergraduate Jet Training System has now narrowed to SNC, Boeing, and Textron Aviation Defense in partnership with Leonardo.

#149

Why Do Many Western Defense Tech Firms Struggle in Ukraine?

Government & Defense 2026-04-23 War on the Rocks
4.5
I 5.0 Im 4.0 P 3.5

Michael Kofman joined Ryan at a live event earlier this year to discuss the performance of American defense technology in Ukraine and why it often falls short. They examine the challenges of fielding and iterating systems in combat, from poor implementation and weak feedback loops to deeper mismatches between design and battlefield reality. They also explore what it takes to succeed in this environment and what it means for future conflicts. Thanks to Leonid Capital Partners for hosting the event at which this podcast was recorded.

#150
4.5
I 5.0 Im 4.0 P 3.5

China’s greatest technological ambition and its greatest political obsession are quietly destroying each other. The same censorship apparatus the Party built to control its people is now corrupting the AI systems its leaders depend on. The United States, by leaning into an open marketplace of information and ideas, will gain advantage as it takes a different path. AI is increasingly training newer, faster AI models . This typically involves scraping the internet for content and then loading it into datasets for new programs.

#155

Bridging the Training-Deployment Gap: Gated Encoding and Multi-Scale Refinement for Efficient Quantization-Aware Image Enhancement

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.CV (Computer Vision)
Dat To-Thanh, Nghia Nguyen-Trong, Hoang Vo, Hieu Bui-Minh, Tinh-Anh Nguyen-Nhu
4.4
I 4.0 Im 3.5 P 4.5

Image enhancement models for mobile devices often struggle to balance high output quality with the fast processing speeds required by mobile hardware. While recent deep learning models can enhance low-quality mobile photos into high-quality images, their performance is often degraded when converted to lower-precision formats for actual use on mobile phones. To address this training-deployment mismatch, we propose an efficient image enhancement model designed specifically for mobile deployment. Our approach uses a hierarchical network architecture with gated encoder blocks and multiscale refinement to preserve fine-grained visual features.

codingquantization
#156

Efficient Logic Gate Networks for Video Copy Detection

Multimodal 2026-04-23 arXiv cs.AI (Artificial Intelligence)arXiv cs.CV (Computer Vision)
Katarzyna Fojcik
4.4
I 4.0 Im 3.5 P 4.5

Video copy detection requires robust similarity estimation under diverse visual distortions while operating at very large scale. Although deep neural networks achieve strong performance, their computational cost and descriptor size limit practical deployment in high-throughput systems. In this work, we propose a video copy detection framework based on differentiable Logic Gate Networks (LGNs), which replace conventional floating-point feature extractors with compact, logic-based representations. Our approach combines aggressive frame miniaturization, binary preprocessing, and a trainable LGN embedding model that learns both logical operations and interconnections.

#157

Neuromorphic Computing Based on Parametrically-Driven Oscillators and Frequency Combs

Research 2026-04-23 arXiv cs.NE (Neural & Evolutionary Computing)
Mahadev Sunil Kumar, Adarsh Ganesan
4.4
I 4.7 Im 4.0 P 3.5

Parametrically driven oscillators provide a natural platform for neuromorphic computation, where nonlinear mode coupling and intrinsic dynamics enable both memory and high-dimensional transformation. Here, we investigate a two-mode system exhibiting 2:1 parametric resonance and demonstrate its operation as a reservoir computer across distinct dynamical regimes, including sub-threshold, parametric resonance, and frequency-comb states. By encoding input signals into the drive amplitude and sampling the resulting temporal and spectral responses, we perform one step-ahead prediction of benchmark chaotic systems, including Mackey-Glass, Rossler, and Lorenz dynamics.

benchmarkcoding
#158

Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin
4.4
I 4.0 Im 4.7 P 3.5

Humans and modern vision models can reach similar classification accuracy while making systematically different kinds of mistakes - differing not in how often they err, but in who gets mistaken for whom, and in which direction. We show that these directional confusions reveal distinct inductive biases that are invisible to accuracy alone.

mech-interp
#159

DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Tahar Chettaoui, Eduarda Caldeira, Guray Ozgur, Raghavendra Ramachandra, Fadi Boutros…
4.4
I 5.0 Im 3.8 P 3.5

Advancing face morphing attack techniques is crucial to anticipate evolving threats and develop robust defensive mechanisms for identity verification systems. This work introduces DCMorph, a dual-stream diffusion-based morphing framework that simultaneously operates at both identity conditioning and latent space levels.

diffusion
#160

russellromney/honker

Frontier LLMs 2026-04-24 Simon Willison's Weblog
4.3
I 4.0 Im 3.5 P 4.5

russellromney/honker: "Postgres NOTIFY/LISTEN semantics" for SQLite, implemented as a Rust SQLite extension and various language bindings to help make use of it. The design of this looks very solid. It lets you write Python code for queues that looks like this:

import honker

db = honker.open("app.db")
emails = db.queue("emails")
emails.enqueue({"to": "[email protected]"})

# Consume (in a worker process)
async for job in emails.claim("worker-1"):
    send(job.

#161

Serving the For You feed

Frontier LLMs 2026-04-24 Simon Willison's Weblog
4.3
I 4.0 Im 3.5 P 4.5

Serving the For You feed One of Bluesky's most interesting features is that anyone can run their own custom "feed" implementation and make it available to other users - effectively enabling custom algorithms that can use any mechanism they like to recommend posts. spacecowboy runs the For You Feed, used by around 72,000 people. This guest post on the AT Protocol blog explains how it works. The architecture is fascinating.

#162

Vista4D: Video Reshooting with 4D Point Clouds

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Kuan Heng Lin, Zhizheng Liu, Pablo Salamanca, Yash Kant, Ryan Burgert…
4.3
I 5.0 Im 3.5 P 3.5

We present Vista4D, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. Specifically, given an input video, our method re-synthesizes the scene with the same dynamics from a different camera trajectory and viewpoint. Existing video reshooting methods often struggle with depth estimation artifacts of real-world dynamic videos, while also failing to preserve content appearance and failing to maintain precise camera control for challenging new trajectories.

#163

DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures

Generative Media 2026-04-23 arXiv cs.CV (Computer Vision)
Xu Wang, Zhiru Wang, Shiyun Xie, Chengwei Pan, Yisong Chen
4.3
I 5.0 Im 3.5 P 3.5

While 3D Gaussian Splatting (3DGS) achieves real-time photorealistic rendering, its performance degrades significantly when training images contain transient objects that violate multi-view consistency. Existing methods face a circular dependency: accurate transient detection requires a well-reconstructed static scene, while clean reconstruction itself depends on reliable transient masks. We address this challenge with DualSplat, a Failure-to-Prior framework that converts first-pass reconstruction failures into explicit priors for a second reconstruction stage. We observe that transients, which appear in only a subset of views, often manifest as incomplete fragments during conservative initial training.

#164

Context Unrolling in Omni Models

Research 2026-04-23 HF ↑2 Hugging Face Daily Papers
4.3
I 4.0 Im 4.0 P 3.8

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity.

benchmark · multimodal
#168

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

Agents & Tool Use 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Chee Wei Tan, Yuchen Wang, Shangxin Guo
4.2
I 4.0 Im 4.0 P 3.5

This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineering environment that enables users to create, customize, and deploy LLM-powered game agents while actively engaging with AI-driven strategies. The LLM-based chatbot, integrated within Nemobot, demonstrates its capabilities across four distinct classes of games. For dictionary-based games, it compresses state-action mappings into efficient, generalized models for rapid adaptability.

rl · agents · finetune
#169

Agentic AI-assisted coding offers a unique opportunity to instill epistemic grounding during software development

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Magnus Palmblad, Jared M. Ragland, Benjamin A. Neely
4.2
I 4.0 Im 4.0 P 3.5

The capabilities of AI-assisted coding are progressing at breakneck speed. Chat-based vibe coding has evolved into fully fledged AI-assisted, agentic software development using agent scaffolds where the human developer creates a plan that agentic AIs implement. One current trend is utilizing documents beyond this plan document, such as project and method-scoped documents. Here we propose GROUNDING.md, a community-governed, field-scoped epistemic grounding document, using mass spectrometry-based proteomics as an example.

agents · coding
#170

Using ASP(Q) to Handle Inconsistent Prioritized Data

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Meghyn Bienvenu, Camille Bourgaux, Robin Jean, Giuseppe Mazzotta
4.2
I 4.0 Im 4.0 P 3.5

We explore the use of answer set programming (ASP) and its extension with quantifiers, ASP(Q), for inconsistency-tolerant querying of prioritized data, where a priority relation between conflicting facts is exploited to define three notions of optimal repairs (Pareto-, globally- and completion-optimal). We consider the variants of three well-known semantics (AR, brave and IAR) that use these optimal repairs, and for which query answering is in the first or second level of the polynomial hierarchy for a large class of logical theories.

coding
#171

Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language)
Yilong Chen, Yanxi Xie, Zitian Gao, He Xin, Yihao Xiao…
4.2
I 4.0 Im 4.0 P 3.5

Large token-indexed lookup tables provide a compute-decoupled scaling path, but their practical gains are often limited by poor parameter efficiency and rapid memory growth. We attribute these limitations to Zipfian under-training of the long tail, heterogeneous demand across layers, and "slot collapse" that produces redundant embeddings. To address this, we propose X-GRAM, a frequency-aware dynamic token-injection framework. X-GRAM employs hybrid hashing and alias mixing to compress the tail while preserving head capacity, and refines retrieved vectors via normalized SwiGLU ShortConv to extract diverse local n-gram features.
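The tail-compression idea — dedicated embedding rows for frequent "head" tokens, a small shared hashed table for the Zipfian long tail — can be sketched generically. This is the standard hashed-embedding trick, not X-GRAM's actual hybrid hashing and alias mixing; the class and parameter names are hypothetical.

```python
import zlib

class HashedTailEmbedding:
    """Head tokens get dedicated rows; long-tail tokens share a small
    hashed table, trading collisions for memory. (Generic sketch, not
    the X-GRAM algorithm itself.)"""

    def __init__(self, head_vocab, n_buckets, dim):
        self.head = {tok: [0.0] * dim for tok in head_vocab}   # dedicated rows
        self.buckets = [[0.0] * dim for _ in range(n_buckets)]  # shared tail rows
        self.n_buckets = n_buckets

    def row(self, token):
        if token in self.head:
            return self.head[token]
        # deterministic hash so a tail token always maps to the same bucket
        return self.buckets[zlib.crc32(token.encode()) % self.n_buckets]

emb = HashedTailEmbedding(head_vocab=["the", "of"], n_buckets=4, dim=8)
```

Memory scales with `len(head_vocab) + n_buckets` rather than the full vocabulary, which is the point of the compute-decoupled scaling path the abstract mentions.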

#172

From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language)
Minh Duc Bui, Xenia Heilmann, Mattia Cerrato, Manuel Mager, Katharina von der Wense
4.2
I 4.0 Im 4.0 P 3.5

Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection.

benchmark · codegen
#173

UKP_Psycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text

Evaluations & Benchmarks 2026-04-23 arXiv cs.CL (Computation & Language)
Darya Hryhoryeva, Amaia Zurinaga, Hamidreza Jamalabadi, Iryna Gurevych
4.2
I 4.0 Im 4.0 P 3.5

This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings.

#174

Job Skill Extraction via LLM-Centric Multi-Module Framework

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)
Guojing Li, Zichuan Fu, Junyi Li, Faxue Liu, Wenxia Zhou…
4.2
I 4.0 Im 3.5 P 3.5

Span-level skill extraction from job advertisements underpins candidate-job matching and labor-market analytics, yet generative large language models (LLMs) often yield malformed spans, boundary drift, and hallucinations, especially with long-tail terms and cross-domain shift. We present SRICL, an LLM-centric framework that combines semantic retrieval (SR), in-context learning (ICL), and supervised fine-tuning (SFT) with a deterministic verifier. SR pulls in-domain annotated sentences and definitions from ESCO to form format-constrained prompts that stabilize boundaries and handle coordination. SFT aligns output behavior, while the verifier enforces pairing, non-overlap, and BIO legality with minimal retries.
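The deterministic verifier's "BIO legality" condition can be sketched as a small predicate: in a BIO tag sequence, every `I-X` must directly follow a `B-X` or `I-X` of the same type. This is the generic BIO scheme only; the paper's full constraint set (pairing, non-overlap, retry policy) is not reproduced here.

```python
def bio_legal(tags):
    """True iff a BIO tag sequence is well-formed: every I-X must
    directly follow a B-X or I-X of the same entity type X."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            if prev not in ("B-" + tag[2:], tag):
                return False              # orphaned or type-switching I- tag
        elif tag != "O" and not tag.startswith("B-"):
            return False                  # tag outside the BIO scheme
        prev = tag
    return True
```

A verifier like this can reject a malformed LLM output and trigger one of the "minimal retries" the abstract describes.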

finetune
#175

How English Print Media Frames Human-Elephant Conflicts in India

Research 2026-04-23 arXiv cs.CL (Computation & Language)
Bonala Sai Punith, Salveru Jayati, Garima Shakya, Shubham Kumar Nigam
4.2
I 4.0 Im 3.5 P 3.5

Human-elephant conflict (HEC) is rising across India as habitat loss and expanding human settlements force elephants into closer contact with people. While the ecological drivers of conflict are well-studied, how the news media portrays them remains largely unexplored. This work presents the first large-scale computational analysis of media framing of HEC in India, examining 1,968 full-length news articles consisting of 28,986 sentences, from a major English-language outlet published between January 2022 and September 2025.

transformer
#176

Reasoning Primitives in Hybrid and Non-Hybrid LLMs

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)
Shivam Rawat, Lucie Flek, Florian Mai, Nicholas Kluge Corrêa
4.2
I 4.0 Im 3.5 P 3.5

Reasoning in large language models is often treated as a monolithic capability, but its observed gains may arise from more basic operations. We study reasoning through two such primitives, recall and state-tracking, and ask whether hybrid architectures that combine attention-based retrieval with recurrent state updates are better suited than attention-only models for tasks that jointly require both. Using matched Olmo3 transformer and hybrid models in instruction-tuned and reasoning-augmented variants, we evaluate these models on a set of controlled tasks involving a mixture of state-tracking and recall primitives, such as state-based recall.

transformer
#177

Decoupled DiLoCo for Resilient Distributed Pre-training

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)
Arthur Douillard, Keith Rush, Yani Donchev, Zachary Charles, Nova Fallen…
4.2
I 4.5 Im 3.5 P 3.5

Modern large-scale language model pre-training relies heavily on the single program multiple data (SPMD) paradigm, which requires tight coupling across accelerators. Due to this coupling, transient slowdowns, hardware failures, and synchronization overhead stall the entire computation, wasting significant compute time at scale. While recent distributed methods like DiLoCo reduced communication bandwidth, they remained fundamentally synchronous and vulnerable to these system stalls. To address this, we introduce Decoupled DiLoCo, an evolution of the DiLoCo framework designed to break the lock-step synchronization barrier and go beyond SPMD to maximize training goodput.
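The DiLoCo pattern the paper builds on — workers take many local optimizer steps, then a server treats the averaged parameter delta as a "pseudo-gradient" and applies an outer momentum update — can be sketched on a toy scalar quadratic. Plain SGD stands in for the inner optimizer, and Decoupled DiLoCo's asynchrony is not modeled; all hyperparameters are illustrative.

```python
def local_sgd(w, target, steps=10, lr=0.1):
    # inner optimizer: plain SGD on this worker's loss (w - target)^2
    for _ in range(steps):
        w -= lr * 2.0 * (w - target)
    return w

w_global, momentum = 0.0, 0.0
targets = [2.9, 3.1]              # each worker's shard pulls toward a different optimum
for _ in range(10):               # outer (communication) rounds
    deltas = [local_sgd(w_global, t) - w_global for t in targets]
    pseudo_grad = sum(deltas) / len(deltas)   # averaged parameter delta
    momentum = 0.6 * momentum + pseudo_grad   # outer momentum step
    w_global += 0.5 * momentum                # outer learning rate
```

Communication happens once per outer round instead of once per step, which is the bandwidth saving; the synchronization barrier at each outer round is exactly what Decoupled DiLoCo aims to break.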

#178

Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation

Research 2026-04-23 arXiv cs.CL (Computation & Language)
Michele Miranda, Xinlan Yan, Nishant Mishra, Rachel Murphy, Ameen Abu-Hanna…
4.2
I 4.0 Im 4.0 P 3.5

Protecting patient privacy in clinical narratives is essential for enabling secondary use of healthcare data under regulations such as GDPR and HIPAA. While manual de-identification remains the gold standard, it is costly and slow, motivating the need for automated methods that combine privacy guarantees with high utility. Most automated text de-identification pipelines employ named entity recognition (NER) to identify protected entities for redaction.

#179

MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization

Frontier LLMs 2026-04-23 arXiv cs.CL (Computation & Language)
Maziar Kianimoghadam Jouneghani
4.2
I 4.0 Im 3.5 P 3.5

We present a systematic study of multilingual polarization detection across 22 languages for SemEval-2026 Task 9 (Subtask 1), contrasting multilingual generalists with language-specific specialists and hybrid ensembles. While a standard generalist like XLM-RoBERTa suffices when its tokenizer aligns with the target text, it may struggle with distinct scripts (e.g., Khmer, Odia) where monolingual specialists yield significant gains. Rather than enforcing a single universal architecture, we adopt a language-adaptive framework that switches between multilingual generalists, language-specific specialists, and hybrid ensembles based on development performance.

#180

mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code

Research 2026-04-23 arXiv cs.CL (Computation & Language)
Adam Skurla, Dominik Macko, Jakub Simko
4.2
I 4.0 Im 3.5 P 3.5

Multi-domain detection of machine-generated code snippets in various programming languages is a challenging task. SemEval-2026 Task 13 approaches this challenge from various angles, as a binary detection problem as well as attribution of the source. Specifically, its subtasks also cover generator LLM family detection, hybrid code co-generated by humans and machines, and adversarially modified code that hides its origin.

#181

A Case Study in Recovery of Drones using Discrete-Event Systems

Research 2026-04-23 arXiv cs.RO (Robotics)
Liam P. Burns, Dayse M. Cavalcanti, Felipe G. Cabral, Max H. de Queiroz, Melissa Greeff…
4.2
I 4.5 Im 3.5 P 3.5

Discrete-event systems and supervisory control theory provide a rigorous framework for specifying correct-by-construction behavior. However, their practical application to swarm robotics remains largely underexplored. In this paper, we investigate a topological recovery method based on discrete-event systems within a swarm robotics context. We propose a hybrid architecture that combines a high-level discrete-event-systems supervisor with a low-level continuous controller, allowing lost drones to safely recover from fault or attack events and re-enter a controlled region. The method is demonstrated using ten simulated UAVs in the py-bullet-drones framework.

robot
#182

Fine-Tuning Regimes Define Distinct Continual Learning Problems

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Paul-Tiberiu Iordache, Elena Burceanu
4.2
I 4.0 Im 4.0 P 3.5

Continual learning (CL) studies how models acquire tasks sequentially while retaining previously learned knowledge. Despite substantial progress in benchmarking CL methods, comparative evaluations typically keep the fine-tuning regime fixed. In this paper, we argue that the fine-tuning regime, defined by the trainable parameter subspace, is itself a key evaluation variable. We formalize adaptation regimes as projected optimization over fixed trainable subspaces, showing that changing the trainable depth alters the effective update signal through which both current task fitting and knowledge preservation operate.

benchmark · finetune
#183

On the algebra of Koopman eigenfunctions and on some of their infinities

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Zahra Monfared, Saksham Malhotra, Sekiya Hajime, Ioannis Kevrekidis, Felix Dietrich
4.2
I 4.0 Im 3.5 P 3.5

For continuous-time dynamical systems with reversible trajectories, the nowhere-vanishing eigenfunctions of the Koopman operator of the system form a multiplicative group. Here, we exploit this property to accelerate the systematic numerical computation of the eigenspaces of the operator. Given a small set of (so-called "principal") eigenfunctions that are approximated conventionally, we can obtain a much larger set by constructing polynomials of the principal eigenfunctions. This enriches the set, and thus allows us to more accurately represent application-specific observables. Often, eigenfunctions exhibit localized singularities (e.g.
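The multiplicative-group property follows from the Koopman operator being a composition operator, hence multiplicative over pointwise products. A one-line sketch (standard Koopman notation, with $\Phi^t$ the flow map):

```latex
\mathcal{K}^t\!\left(\varphi_1^{a}\varphi_2^{b}\right)
  = \left(\varphi_1^{a}\varphi_2^{b}\right)\circ\Phi^t
  = \left(\varphi_1\circ\Phi^t\right)^{a}\left(\varphi_2\circ\Phi^t\right)^{b}
  = e^{(a\lambda_1 + b\lambda_2)t}\,\varphi_1^{a}\varphi_2^{b}.
```

So from principal eigenfunctions $\{\varphi_i\}$ with eigenvalues $\{\lambda_i\}$, every product $\varphi_1^{a_1}\cdots\varphi_n^{a_n}$ is again an eigenfunction, with eigenvalue $e^{(\sum_i a_i\lambda_i)t}$; the nowhere-vanishing condition is what makes negative and fractional powers well-defined.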

#184

An effective variant of the Hartigan $k$-means algorithm

Research 2026-04-23 arXiv cs.LG (Machine Learning)
François Clément, Stefan Steinerberger
4.2
I 4.0 Im 3.5 P 3.5

The k-means problem is perhaps the classical clustering problem and is often synonymous with Lloyd's algorithm (1957). It has become clear that Hartigan's algorithm (1975) gives better results in almost all cases; Telgarsky and Vattani note a typical improvement of 5%–10%. We point out that a very minor variation of Hartigan's method leads to another 2%–5% improvement; the improvement tends to become larger when either the dimension or k increases.
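Hartigan's method differs from Lloyd's in that it moves one point at a time using the exact change in total within-cluster cost — with the n/(n−1) and n/(n+1) correction factors for removing and adding a point — and updates centroids immediately. A minimal 1-D sketch of the classic Hartigan pass, on illustrative data (the paper's variant is a further tweak not shown here):

```python
def hartigan(points, assign, k, sweeps=10):
    clusters = [[i for i, a in enumerate(assign) if a == c] for c in range(k)]

    def centroid(c):
        return sum(points[i] for i in clusters[c]) / len(clusters[c])

    for _ in range(sweeps):
        moved = False
        for i, x in enumerate(points):
            a = assign[i]
            if len(clusters[a]) == 1:
                continue                      # never empty a cluster
            na = len(clusters[a])
            # exact cost drop from removing x (centroid includes x)
            gain_leave = na / (na - 1) * (x - centroid(a)) ** 2
            best_c, best_cost = a, gain_leave
            for b in range(k):
                if b == a:
                    continue
                nb = len(clusters[b])
                # exact cost rise from adding x (centroid excludes x)
                cost_join = nb / (nb + 1) * (x - centroid(b)) ** 2
                if cost_join < best_cost:
                    best_c, best_cost = b, cost_join
            if best_c != a:                   # net cost change is negative: move
                clusters[a].remove(i)
                clusters[best_c].append(i)
                assign[i] = best_c
                moved = True
        if not moved:
            break
    return assign

pts = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
labels = hartigan(pts, [0, 1, 0, 1, 0, 1], k=2)
```

Because each move uses the exact cost change (not the Lloyd approximation of assigning to the nearest current centroid), Hartigan can escape configurations where Lloyd's algorithm is already stuck.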

#185

Compliance Moral Hazard and the Backfiring Mandate

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Jian Ni, Lecheng Zheng, John R Birge
4.2
I 4.0 Im 4.0 P 3.5

Competing firms that serve shared customer populations face a fundamental information aggregation problem: each firm holds fragmented signals about risky customers, but individual incentives impede efficient collective detection. We develop a mechanism design framework for decentralized risk analytics, grounded in anti-money laundering in banking networks. Three strategic frictions distinguish our setting: compliance moral hazard, adversarial adaptation, and information destruction through intervention.

benchmark
#186

Transferable Physics-Informed Representations via Closed-Form Head Adaptation

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Jian Cheng Wong, Isaac Yin Chung Lai, Pao-Hsiung Chiu, Chin Chun Ooi, Abhishek Gupta…
4.2
I 4.0 Im 3.5 P 3.5

Physics-informed neural networks (PINNs) have garnered significant interest for their potential in solving partial differential equations (PDEs) that govern a wide range of physical phenomena. By incorporating physical laws into the learning process, PINN models have demonstrated the ability to learn physical outcomes reasonably well. However, current PINN approaches struggle to predict or solve new PDEs effectively when there is a lack of training examples, indicating they do not generalize well to unseen problem instances.

#187

Neural surrogates for crystal growth dynamics with variable supersaturation: explicit vs. implicit conditioning

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Matteo Rigoni, Daniele Lanzoni, Francesco Montalenti, Roberto Bergamaschini
4.2
I 4.0 Im 3.5 P 3.5

Simulations of crystal growth are performed by using Convolutional Recurrent Neural Network surrogate models, trained on a dataset of time sequences computed by numerical integration of Allen-Cahn dynamics including faceting via kinetic anisotropy. Two network architectures are developed to take into account the effects of a variable supersaturation value. The first infers it implicitly by processing an input mini-sequence of a few evolution frames and then returns a consistent continuation of the evolution.

#188

Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask

Efficiency 2026-04-23 arXiv cs.LG (Machine Learning)
Ashley N. Abraham, Andrew Strelzoff, Haley R. Dozier, Althea C. Henslee, Mark A. Chappell
4.2
I 4.0 Im 3.5 P 3.5

Large-scale Nearest Neighbor (NN) search, though widely utilized in the similarity search field, remains challenged by the computational limitations inherent in processing large-scale data. In an effort to decrease the computational expense needed, Approximate Nearest Neighbor (ANN) search is often used in applications that do not require exact similarity search but can instead rely on an approximation. Product Quantization (PQ) is a memory-efficient ANN method effective for clustering datasets of all sizes. Clustering large-scale, high-dimensional data requires heavy computational expense, in both memory cost and execution time.
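The PQ step itself (independent of the Dask parallelization that is this paper's contribution) can be sketched as follows: split each vector into M sub-vectors and replace each with the index of its nearest codeword in a per-subspace codebook. Codebooks here are random for brevity, whereas real PQ learns them with per-subspace k-means; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

D, M, K = 8, 4, 16            # vector dim, subspaces, codewords per subspace
d = D // M
codebooks = rng.normal(size=(M, K, d))   # stand-in for k-means-trained codebooks

def pq_encode(x):
    codes = []
    for m in range(M):
        sub = x[m * d:(m + 1) * d]
        dists = ((codebooks[m] - sub) ** 2).sum(axis=1)  # distance to each codeword
        codes.append(int(dists.argmin()))
    return codes

def pq_decode(codes):
    # approximate reconstruction: concatenate the chosen codewords
    return np.concatenate([codebooks[m][c] for m, c in enumerate(codes)])

x = rng.normal(size=D)
codes = pq_encode(x)          # M small integers instead of D floats
x_hat = pq_decode(codes)
```

Each vector is stored as M log2(K)-bit codes (here 4 bytes would overstate it: 4 codes of 4 bits), which is the memory efficiency that makes PQ attractive at scale and why the paper's concern shifts to parallelizing the training and indexing cost.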

quantization
#189

Geometric Characterisation and Structured Trajectory Surrogates for Clinical Dataset Condensation

Efficiency 2026-04-23 arXiv cs.LG (Machine Learning)
Pafue Christy Nganjimi, Andrew Soltan, Danielle Belgrave, Lei Clifton, David Clifton…
4.2
I 4.0 Im 3.5 P 3.5

Dataset condensation constructs compact synthetic datasets that retain the training utility of large real-world datasets, enabling efficient model development and potentially supporting downstream research in governed domains such as healthcare. Trajectory matching (TM) is a widely used condensation approach that supervises synthetic data using changes in model parameters observed during training on real data, yet the structure of this supervision signal remains poorly understood.

#190

A temporal deep learning framework for calibration of low-cost air quality sensors

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Arindam Sengupta, Tony Bush, Ben Marner, Jose Miguel Pérez, Soledad Le Clainche
4.2
I 4.0 Im 3.5 P 3.5

Low-cost air quality sensors (LCS) provide a practical alternative to expensive regulatory-grade instruments, making dense urban monitoring networks possible. Yet their adoption is limited by calibration challenges, including sensor drift, environmental cross-sensitivity, and variability in performance from device to device. This work presents a deep learning framework for calibrating LCS measurements of PM$_{2.5}$, PM$_{10}$, and NO$_2$ using a Long Short-Term Memory (LSTM) network, trained on co-located reference data from the OxAria network in Oxford, UK.

coding
#191

Conditional anomaly detection with soft harmonic functions

Research 2026-04-23 arXiv cs.LG (Machine Learning)
Michal Valko, Branislav Kveton, Hamed Valizadegan, Gregory F. Cooper, Milos Hauskrecht
4.2
I 4.0 Im 3.5 P 3.5

In this paper, we consider the problem of conditional anomaly detection that aims to identify data instances with an unusual response or a class label. We develop a new non-parametric approach for conditional anomaly detection based on the soft harmonic solution, with which we estimate the confidence of the label to detect anomalous mislabeling. We further regularize the solution to avoid the detection of isolated examples and examples on the boundary of the distribution support.

#192

Novelty-Based Generation of Continuous Landscapes with Diverse Local Optima Networks

Research 2026-04-23 arXiv cs.NE (Neural & Evolutionary Computing)
Kippei Mizuta, Shoichiro Tanaka, Shuhei Tanaka, Toshiharu Hatanaka
4.2
I 4.0 Im 4.2 P 3.5

Local Optima Networks (LONs) represent the global structure of search spaces as graphs, but their construction requires iterative execution of a search algorithm to find local optima and approximate transitions between Basins of Attraction (BoAs). In continuous optimization, this high computational cost prevents systematic investigation of the relationship between LON features and evolutionary algorithm performance. To address this issue, we propose an alternative definition of BoAs for Max-Set of Gaussians (MSG) landscapes with explicitly tunable multimodality. This bypasses search-based BoA identification, enabling low-cost LON construction.

multimodal
#193

On a class of constrained particle filters for continuous-discrete state space models

Research 2026-04-23 arXiv — State Space Models
Utku Erdogan, Gabriel J. Lord, Joaquin Miguez
4.2
I 4.0 Im 3.5 P 3.5

Particle filters (PFs) are recursive Monte Carlo algorithms for Bayesian tracking and prediction in state space models. This paper addresses continuous-discrete filtering problems, where the hidden state evolves as an Itô stochastic differential equation (SDE) and observations arrive at discrete times. We propose a novel class of constrained PFs that enforce compact support on the state at each observation instant, thereby limiting exploration to plausible regions of the state space. Unlike earlier approaches that truncate the likelihood, the proposed method constrains the dynamics directly, yielding improved numerical stability.
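A bootstrap particle filter with a crude compact-support constraint can be sketched as follows. Euler-Maruyama discretizes a toy SDE dx = −x dt + dW between observations, and particles are simply projected onto [−B, B] at each observation time — a stand-in for the paper's method, which constrains the dynamics directly rather than clipping; all numbers are illustrative.

```python
import random, math

random.seed(0)

B, dt, n_steps, n_particles = 2.0, 0.01, 20, 500

def propagate(x):
    for _ in range(n_steps):                  # Euler-Maruyama between observations
        x += -x * dt + math.sqrt(dt) * random.gauss(0, 1)
    return min(B, max(-B, x))                 # project onto the compact support

def step(particles, y, obs_sd=0.5):
    particles = [propagate(p) for p in particles]
    # importance weights from the Gaussian observation likelihood
    weights = [math.exp(-0.5 * ((y - p) / obs_sd) ** 2) for p in particles]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(particles, probs, k=len(particles))  # multinomial resampling

particles = [random.uniform(-B, B) for _ in range(n_particles)]
for y in [0.5, 0.4, 0.6]:                     # toy observation sequence
    particles = step(particles, y)
est = sum(particles) / len(particles)         # filtered state estimate
```

The constraint keeps every particle in the plausible region, which is what prevents the weight degeneracy that unconstrained proposals can suffer when the dynamics wander far from the observations.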

ssm
#194

High-energy photon hologram of a photon gas

Research 2026-04-23 arXiv — Mechanistic Interpretability
P. O. Kazinski, A. A. Sokolov
4.2
I 4.0 Im 3.5 P 3.5

The photon hologram of a one-particle density matrix of a photon gas is derived including the case where the energy of a probe photon is above the electron-positron pair creation threshold. The explicit expressions for the holograms of a photon gas with one-particle density matrix in the form of a single Gaussian and of coherent and incoherent lattices of Gaussians are obtained. The conditions for resonant cones of coherent scattering by coherent and incoherent lattices are found. These conditions turn out to be different.

#195

Optical nonlinear anomalous Hall effect reveals the hidden spin order in antiferromagnets

Research 2026-04-23 arXiv — Mechanistic Interpretability
A. Schmid, D. Siebenkotten, D. Dai, J. Godinho, T. Ostatnický…
4.2
I 4.0 Im 3.5 P 3.5

Reading antiferromagnetic order remains a central obstacle for antiferromagnetic memory and logic because zero net magnetisation precludes conventional magnetic readout. Domain imaging typically relies on x-ray magnetic linear dichroism (XMLD) microscopy at synchrotron sources, but XMLD is even under time reversal and cannot distinguish 180°-reversed magnetic states. Here we report the first experimental observation of the optical nonlinear anomalous Hall effect, predicted for antiferromagnets with combined parity - time-reversal ($PT$) symmetry.

#196

Electronic and Vibrational Properties of On-Surface Synthesized Gulf-Edged Chiral Graphene Nanoribbons

Research 2026-04-23 arXiv — Mechanistic Interpretability
Xuanchen Li, Amogh Kinikar, Vikas Sharma, Andres Ortega Guerrero, George F. S. Whitehead…
4.2
I 4.5 Im 3.5 P 3.5

On-surface synthesis enables the fabrication of graphene nanoribbons (GNRs) with atomic precision, allowing their electronic, optical, and magnetic properties to be tuned by engineering edge structure and width. Progress on the synthesis of chiral GNRs has nevertheless remained limited, largely because existing precursor designs rely on laterally fused acene units and cannot access edge topologies beyond armchair and zigzag. Here, we introduce a new on-surface synthesis motif that yields a gulf-edged chiral GNR.

#197

Correlation between active regions' spectra at high radio frequencies and solar flare occurrences

Research 2026-04-23 arXiv — Mechanistic Interpretability
Sara Mulas, Alberto Pellizzoni, Marco Marongiu, Adriana Marcucci, Simona Righini…
4.2
I 4.0 Im 3.5 P 3.5

High radio frequency observations with the Italian network of large single-dish radio telescopes resulted in ~450 solar images between 2018 and 2023 in the K-band frequency range (18-26 GHz). Solar radio mapping at these frequencies allows probing of the Active Regions' (ARs) chromospheric magnetic field close to the Transition Region, where strong flares and coronal mass ejection events occur.

#198

Performance characterisation of the Hamamatsu R760 photomultiplier tube for the PLUME detector

Research 2026-04-23 arXiv — Mechanistic Interpretability
A. Bellavista, A. Carbone, V. Chaumat, F. Ferrari, T. Nguyen-Trung…
4.2
I 4.0 Im 3.5 P 3.5

The Probe for Luminosity Measurement detector is a novel luminometer designed to monitor the luminosity and beam conditions of the Large Hadron Collider at the interaction point of the LHCb experiment, starting from Run 3. The detector is based on a hodoscope composed of 48 Hamamatsu R760 photomultiplier tubes, which detect the Cherenkov light produced by charged particles originating from the interaction region. The accurate and stable operation of these sensors is essential to ensure reliable luminosity measurements throughout the full data-taking period.

#199

Impact of Primordial Black Hole population on 21 cm observables at high redshift

Research 2026-04-23 arXiv — Mechanistic Interpretability
Atrideb Chatterjee, Barun Maity, Koushiki
4.2
I 4.0 Im 3.5 P 3.5

The 21-cm signal, one of the most promising probes of the high-redshift Universe, has traditionally been modelled without accounting for the effects of active galactic nuclei (AGN) in the pre-JWST era, primarily due to the lack of observational evidence for AGNs at z > 6. However, following the discovery of several AGNs at redshifts as high as z ~ 10 by JWST, it has become imperative to incorporate the impact of these early AGNs when predicting the 21-cm signal.

#200

Constraining dark matter self-interaction from kinetic heating in neutron stars

Research 2026-04-23 arXiv — Mechanistic Interpretability
Sambo Sarkar
4.2
I 4.0 Im 3.5 P 3.5

Dark matter search strategies have started advancing towards the neutrino fog. In this regard, compact objects such as neutron stars have already demonstrated their ability to probe such low DM-nucleon cross-sections from dark-matter-induced effects. In the optically thin limit, the effect of dark matter self-interaction becomes relevant and may assist the capture and thermalization of dark matter inside stars, imparting observable changes on neutron star temperatures.

#201

Exploring the statistical anisotropy of primordial curvature perturbations with pulsar timing arrays

Research 2026-04-23 arXiv — Mechanistic Interpretability
Fengting Xie, Zhi-Chao Zhao, Qing-Hua Zhu, Xin Li
4.2
I 4.0 Im 3.5 P 3.5

The recent detection of a stochastic gravitational wave background by pulsar timing arrays has opened a new window into understanding supermassive black hole binaries and into probing the early universe. Recently, pulsar timing array (PTA) collaborations have been further paving the way to probe anisotropies in the stochastic gravitational wave background. This study investigates dipole-type statistical anisotropy in the primordial power spectrum within a phenomenological framework.

#202

Dilepton Production as a Probe of Pion Condensation in Hot and Dense QCD Matter

Research 2026-04-23 arXiv — Mechanistic Interpretability
Aritra Bandyopadhyay, Chowdhury Aminul Islam, Krzysztof Redlich, Chihiro Sasaki
4.2
I 4.0 Im 3.5 P 3.5

We investigate dilepton production from an isospin-asymmetric hot and dense medium in order to explore the role of isospin imbalance in electromagnetic spectral properties. We focus in particular on modifications of the dilepton production rate associated with the onset of pion condensation, which can occur in the presence of a finite isospin chemical potential. We employ the Nambu--Jona-Lasinio model with isoscalar--vector interaction. We examine the phase structure in the $T-μ_I$ plane and estimate the vector current correlator--resummed dilepton rate for an effective quark chemical potential.

#203

Multi-wavelength study of EP250416a / GRB 250416C: An Optically Dark Long GRB with a Late Jet Break

Research 2026-04-23 arXiv — Mechanistic Interpretability
Guoying Zhao, Duo-Le Cao, Rong-Feng Shen, Hui Sun, Chi-Chuan Jin…
4.2
I 4.0 Im 3.5 P 3.5

We present a multi-wavelength study of the $γ$/X-ray transient EP250416a (also designated GRB 250416C), triggered by the Einstein Probe (EP) Wide-field X-ray Telescope and also by SVOM and Konus-Wind. Observations spanning the gamma-ray, X-ray, and optical bands facilitated detailed analysis of the burst's prompt emission, afterglow evolution, and physical origin. EP250416a exhibits a burst duration of 30 s in X-rays and 17.7 s in gamma-rays, and joint spectral fitting of 0.5-5000 keV data gives $E\rm_{peak}=342_{-232}^{+90}$ keV.

#204

XRISM High-Resolution X-ray Spectroscopy of Cygnus X-1 -- Orbital and Short-Term Variability of Iron Absorption

Research 2026-04-23 arXiv — Mechanistic Interpretability
Kaito Ninoyu, Shinya Yamada, Natalie Hell, Elisa Costantini, Oluwashina Adegoke…
4.2
I 4.0 Im 3.5 P 3.5

We present the first high-resolution spectroscopy of the black hole high-mass X-ray binary Cygnus X-1 with XRISM, including orbital-phase-resolved analyses and tentative evidence of short-term variability in the Fe-K band on second timescales. Using data from the Performance Verification phase in April 2024, we analyzed spectral variability across orbital phases with the Resolve microcalorimeter and the Xtend CCD imager. The unprecedented resolution of Resolve reveals variability in highly ionized Fe absorption lines.

#205

Quantum jump correlations in long-range dissipative spin systems

Research 2026-04-23 arXiv — Mechanistic Interpretability
Giulia Salatino, Anna Delmonte, Zejian Li, Rosario Fazio, Alberto Biella
4.2
I 4.0 Im 3.5 P 3.5

We characterize nonequilibrium phases in long-range dissipative spin systems through the statistical properties of quantum jump trajectories. While the average dynamics governed by the Lindblad master equation provides access to steady-state expectation values of order parameters, the quantum trajectory framework reveals features encoded in the spatial and temporal correlations of detection events. Focusing on a model exhibiting a paramagnetic-to-ferromagnetic phase transition, we investigate the full counting statistics of quantum jumps using a tilted Lindbladian approach.
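For reference, the tilted-Lindbladian route to full counting statistics takes a standard form, sketched here for a single counting field $s$ conjugate to the total number of jumps $N(t)$; the paper's specific generator and counting choices may differ:

```latex
% Moment generating function of the jump count N(t)
Z(s,t) \;=\; \mathbb{E}\!\left[e^{sN(t)}\right] \;=\; \mathrm{Tr}\!\left[e^{t\mathcal{L}_s}\rho_0\right],
\qquad
\mathcal{L}_s\rho \;=\; -i[H,\rho] \;+\; \sum_k \Big(e^{s}\,J_k\rho J_k^{\dagger} \;-\; \tfrac{1}{2}\{J_k^{\dagger}J_k,\rho\}\Big)
% Cumulants of N(t) follow from derivatives of \log Z(s,t) at s = 0.
```

At $s=0$ this reduces to the ordinary Lindblad master equation; the tilt $e^{s}$ on the jump term is what biases trajectories by their detection-event count.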

#206

A Multi-Stage Warm-Start Deep Learning Framework for Unit Commitment

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Muhy Eddin Za'ter, Anna Van Boven, Bri-Mathias Hodge, Kyri Baker
4.1
I 4.0 Im 3.5 P 3.5

Maintaining instantaneous balance between electricity supply and demand is critical for grid reliability and stability. System operators achieve this by solving the Unit Commitment (UC) problem, a high-dimensional, large-scale Mixed-Integer Linear Programming (MILP) problem that is strictly governed by the grid's physical constraints. As grids integrate variable renewable sources and new technologies such as long-duration storage, UC must be solved optimally over multi-day horizons and potentially with greater frequency.

transformer
#207

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Guangxiang Zhao, Qilong Shi, Xusen Xiao, Xiangzheng Zhang, Tong Yang…
4.1
I 4.0 Im 3.5 P 3.5

Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliberation and trial-and-error exploration, and to retrieve these skills at inference time to guide future reasoning. Unlike the prevailing \emph{reasoning from scratch} paradigm, our approach first recalls relevant skills for each query, helping the model avoid redundant detours and focus on effective solution paths.

codingcot
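As a rough sketch of the retrieve-then-reason loop (the skill texts, tags, and bag-of-words similarity below are illustrative stand-ins for whatever skill store and embedding retriever the paper actually uses):

```python
from collections import Counter
import math

# Hypothetical skill store: short reusable reasoning skills distilled offline,
# each indexed by a tag string used for retrieval.
SKILLS = {
    "When a problem mixes units, convert everything to one unit first.":
        "unit conversion arithmetic word problem",
    "For divisibility questions, factor the number before testing candidates.":
        "number theory divisible divisibility factorization",
    "Drop an altitude to split an unknown triangle into right triangles.":
        "geometry triangle altitude right angle",
}

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve_skills(query, k=1):
    q = bow(query)
    return sorted(SKILLS, key=lambda s: cosine(q, bow(SKILLS[s])), reverse=True)[:k]

# Recalled skills are prepended to the prompt instead of being re-derived
# from scratch at inference time.
query = "Is 1001 divisible by 13? Show your reasoning."
prompt = "Relevant skills:\n" + "\n".join(retrieve_skills(query)) + "\n\nProblem: " + query
```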
#208

To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

Research 2026-04-23 arXiv cs.AI (Artificial Intelligence)
Nevena Lazić, Liam Fowl, András György, Csaba Szepesvári
4.1
I 4.0 Im 3.5 P 3.5

We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reasoning problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that were not observed during training, and it was shown that one reason behind this is the difficulty of copying (or generating) unseen tokens. We show both theoretically and empirically that a particular representational collapse also has a crucial role: the unembeddings (last-layer weights) of unseen tokens collapse to nearly the same vector during training.

transformermech-interp
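The collapse diagnostic is easy to reproduce on synthetic data: if unseen-token unembeddings all drift toward one shared direction, their mean pairwise cosine similarity approaches 1, while diverse rows stay near 0. A minimal numpy sketch of the measurement (the matrices here are synthetic stand-ins, not weights from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# "Seen" tokens get diverse random vectors; "unseen" tokens collapse toward
# one shared direction plus small noise -- the reported pathology.
seen = rng.normal(size=(20, d))
shared = rng.normal(size=d)
unseen = shared + 0.01 * rng.normal(size=(20, d))

def mean_pairwise_cosine(W):
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    S = W @ W.T
    n = len(W)
    return (S.sum() - n) / (n * (n - 1))   # mean of off-diagonal entries

print(mean_pairwise_cosine(seen))    # near 0: directions are diverse
print(mean_pairwise_cosine(unseen))  # near 1: directions have collapsed
```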
#209

Long-Horizon Manipulation via Trace-Conditioned VLA Planning

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Isabella Liu, An-Chieh Cheng, Rui Yan, Geng Chen, Ri-Zhao Qiu…
4.1
I 4.0 Im 3.5 P 3.5

Long-horizon manipulation remains challenging for vision-language-action (VLA) policies: real tasks are multi-step, progress-dependent, and brittle to compounding execution errors. We present LoHo-Manip, a modular framework that scales short-horizon VLA execution to long-horizon instruction following via a dedicated task-management VLM.

vlmvlarobotmanipulation
#210

A Compact Peristaltic Pump Based on Magneto-Elastic Hysteresis with Single Pneumatic Control

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Minjo Park, Metin Sitti
4.1
I 4.0 Im 3.5 P 3.5

Pumping fluids is fundamental to a wide range of industrial, environmental, and biomedical applications. Among various pumping mechanisms, peristaltic pumps enable efficient and safe fluid transport by deforming an elastic tube without direct contact with the working fluid. Although previous studies have introduced mechanical, pneumatic, or magnetic actuations to drive membrane deformation, these approaches often lead to complex pump architectures and control schemes. In this study, we present a soft membrane pump that achieves peristaltic motion through a single pneumatic input combined with an embedded passive magnet.

#211

Effects of Swarm Size Variability on Operator Workload

Robotics 2026-04-23 arXiv cs.RO (Robotics)
William Hunt, Aleksandra Landowska, Horia A. Maior, Sarvapali D. Ramchurn, Mohammad Soorati
4.1
I 4.0 Im 3.5 P 3.5

Real-world deployments of human--swarm teams depend on balancing operator workload to leverage human strengths without inducing overload. A key challenge is that swarm size is often dynamic: robots may join or leave the mission due to failures or redeployment, causing abrupt workload fluctuations. Understanding how such changes affect human workload and performance is critical for robust human--swarm interaction design. This paper investigates how the magnitude and direction of changes in swarm size influence operator workload.

robot
#212

A Bayesian Reasoning Framework for Robotic Systems in Autonomous Casualty Triage

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Szymon Rusiecki, Cecilia Morales, Pia Störy, Kimberly Elenberg, Leonard Weiss…
4.1
I 4.0 Im 3.5 P 3.5

Autonomous robots deployed in mass casualty incidents (MCI) face the challenge of making critical decisions based on incomplete and noisy perceptual data. We present an autonomous robotic system for casualty assessment that fuses outputs from multiple vision-based algorithms, estimating signs of severe hemorrhage, visible trauma, or physical alertness, into a coherent triage assessment. At the core of our system is a Bayesian network, constructed from expert-defined rules, which enables probabilistic reasoning about a casualty's condition even with missing or conflicting sensory inputs.

robot
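The fusion step can be pictured as a naive-Bayes update over noisy detector outputs; everything below (detector names, rates, prior) is a made-up miniature, not the paper's expert-defined network:

```python
# Toy Bayesian fusion: three noisy detectors report evidence about a binary
# state "severe hemorrhage". Each detector has a known true-positive and
# false-positive rate; a missing reading is skipped, which is how this style
# of update naturally handles absent evidence.
PRIOR = 0.1                 # assumed base rate of severe hemorrhage
DETECTORS = {               # name: (P(alert | bleed), P(alert | no bleed))
    "color_analysis": (0.85, 0.10),
    "wound_detector": (0.75, 0.15),
    "motion_alertness": (0.60, 0.30),
}

def posterior(readings):
    odds = PRIOR / (1 - PRIOR)
    for name, alert in readings.items():
        tpr, fpr = DETECTORS[name]
        if alert is None:                # sensor gave no reading: skip it
            continue
        if alert:
            odds *= tpr / fpr            # likelihood ratio for an alert
        else:
            odds *= (1 - tpr) / (1 - fpr)
    return odds / (1 + odds)

# Two detectors fire, one is missing: belief still updates coherently.
p = posterior({"color_analysis": True, "wound_detector": True, "motion_alertness": None})
```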
#213

X2-N: A Transformable Wheel-legged Humanoid Robot with Dual-mode Locomotion and Manipulation

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Yan Ning, Xingzhou Chen, Delong Li, Hao Zhang, Hanfu Gai…
4.1
I 4.0 Im 3.5 P 3.5

Wheel-legged robots combine the efficiency of wheeled locomotion with the versatility of legged systems, enabling rapid traversal over both continuous and discrete terrains. However, conventional designs typically employ fixed wheels as feet and limited degrees of freedom (DoFs) at the hips, resulting in reduced stability and mobility during legged locomotion compared to humanoids with flat feet. In addition, most existing platforms lack a full upper body with arms, which limits their ability to perform dexterous manipulation tasks.

rlrobotmanipulationhumanoid
#214

A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration

Robotics 2026-04-23 arXiv cs.RO (Robotics)
Kuan Xu, Ruimeng Liu, Yizhuo Yang, Denan Liang, Tongxing Jin…
4.1
I 4.0 Im 3.5 P 3.5

Bridging the gap between embodied intelligence and embedded deployment remains a key challenge in intelligent robotic systems, where perception, reasoning, and planning must operate under strict constraints on computation, memory, energy, and real-time execution. In vision-language navigation (VLN), existing approaches often face a fundamental trade-off between strong reasoning capabilities and efficient deployment on real-world platforms. In this paper, we present a deployable embodied VLN system that achieves both high efficiency and robust high-level reasoning on real-world robotic platforms.

vlmrobot
#215

Causality-Encoded Diffusion Models for Interventional Sampling and Edge Inference

Research 2026-04-23 arXiv stat.ML (Statistical ML)
Li Chen, Xiaotong Shen, Wei Pan
4.1
I 4.0 Im 3.8 P 3.5

Standard diffusion models are flexible estimators of complex distributions, but they do not encode causal structures and therefore do not by themselves support causal analysis. We propose a causality-encoded diffusion framework that incorporates a known directed acyclic graph by training conditional diffusion models consistent with the graph factorisation. The resulting sampler approximately recovers the observational distribution and enables interventional sampling by fixing intervened variables while propagating effects through the graph during reverse diffusion.

diffusion
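The do-semantics of interventional sampling can be shown on a toy SCM; this sketch substitutes a linear-Gaussian model for the paper's conditional diffusion models, but the propagation logic (fix the intervened node, resample only its descendants' mechanisms) is the same idea:

```python
import random

# Known DAG X -> Y -> Z with toy linear mechanisms. do(Y=y) overrides Y's
# mechanism while leaving X's and Z's mechanisms intact, so effects propagate
# only to descendants of Y.
def mean_z(do=None, n=10000, seed=0):
    rng = random.Random(seed)
    zs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        if do is not None and "Y" in do:
            y = do["Y"]                       # intervened node held fixed
        else:
            y = 2 * x + rng.gauss(0, 0.1)     # observational mechanism
        z = 3 * y + rng.gauss(0, 0.1)
        zs.append(z)
    return sum(zs) / n

obs = mean_z()                 # E[Z] is about 0 observationally
intv = mean_z(do={"Y": 1.0})   # E[Z | do(Y=1)] is about 3 * 1 = 3
```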
#216

Context Unrolling in Omni Models

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Ceyuan Yang, Zhijie Lin, Yang Zhao, Fei Xiao, Hao He…
4.1
I 4.0 Im 4.0 P 3.5

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity.

benchmarkmultimodal
#221

GSA announces a fresh cohort of Presidential Innovation Fellows

Government & Defense 2026-04-23 FedScoop — AI
4.1
I 4.0 Im 4.0 P 3.5

The General Services Administration announced 17 new Presidential Innovation Fellows on Thursday, refreshing the technologist-focused program. A release shared with FedScoop ahead of the announcement described the 2026 cohort as “experts from top tech companies, startups, and organizations around the country.” Per that announcement, the fellows will serve their yearlong tours of duty at 10 federal agencies. The PIF program is located under GSA’s Technology Transformation Services and has been around since 2012.

#222

A Formal Defense Pact in the Indo-Pacific Is the Wrong Answer

Government & Defense 2026-04-23 War on the Rocks
4.1
I 4.0 Im 4.0 P 3.5

The debate over how best to deter China in the western Pacific has reached a new level of ambition. Ely Ratner, a former senior defense official in the Biden administration, proposed a “Pacific Defense Pact” — a legally binding multilateral treaty among the United States, Japan, Australia, and the Philippines. This reflects serious concerns over China’s rise and its potential future use of force along the first island chain. The underlying diagnosis is sound: Existing U.S.

#224

Even More Guarantees for Variational Inference in the Presence of Symmetries

Research 2026-04-23 arXiv stat.ML (Statistical ML)
Lena Zellinger, Antonio Vergari
4.0
I 4.0 Im 3.5 P 3.5

When approximating an intractable density via variational inference (VI), the variational family is typically chosen as a simple parametric family that very likely does not contain the target. This raises the question: Under which conditions can we recover characteristics of the target despite misspecification? In this work, we extend previous results on robust VI with location-scale families under target symmetries. We derive sufficient conditions guaranteeing exact recovery of the mean when using the forward Kullback-Leibler divergence and $α$-divergences.
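One ingredient behind such mean-recovery guarantees shows up in a one-line stationarity calculation, sketched here for a Gaussian variational family (the paper treats general location-scale families and $α$-divergences):

```latex
\mathrm{KL}(p\,\|\,q_{\mu,\sigma})
 = -\,\mathbb{E}_{p}\big[\log q_{\mu,\sigma}(X)\big] + \mathrm{const},
\qquad
\frac{\partial}{\partial\mu}\,\mathrm{KL}(p\,\|\,q_{\mu,\sigma})
 = -\,\mathbb{E}_{p}\!\left[\frac{X-\mu}{\sigma^{2}}\right] = 0
\;\Longrightarrow\; \mu^{*} = \mathbb{E}_{p}[X]
```

That is, the forward KL matches the target mean whenever it exists, however badly the shape of the target is misspecified; the harder part, which the paper addresses, is extending such guarantees beyond this simple case.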

#225

Multiscale Super Resolution without Image Priors

Multimodal 2026-04-23 arXiv cs.CV (Computer Vision)
Daniel Fu, Gabby Litterio, Pedro Felzenszwalb, Rashid Zia
4.0
I 4.0 Im 3.5 P 3.5

We address the ambiguities in the super-resolution problem under translation. We demonstrate that combinations of low-resolution images at different scales can be used to make the super-resolution problem well posed. Such differences in scale can be achieved using sensors with different pixel sizes (as demonstrated here) or by varying the effective pixel size through changes in optical magnification (e.g., using a zoom lens).

#226

It's a big one

Frontier LLMs 2026-04-24 Simon Willison's Weblog
3.9
I 3.5 Im 3.0 P 4.0

This week's edition of my email newsletter (aka content from this blog delivered to your inbox) features 4 pelicans riding bicycles, 1 possum on an e-scooter, up to 5 raccoons with ham radios hiding in crowds, 5 blog posts, 8 links, 3 quotes and a new chapter of my Agentic Engineering Patterns guide. Tags: newsletter

#228
2.5
I 2.0 Im 1.5 P 3.0

GeForce NOW is doubling down on what matters most: gamers. This week’s upgrades bring smarter libraries, making it easier than ever for gamers to turn a PC collection into a cloud-powered flex. It starts with giving existing libraries time to shine. Gamers can bring the games they love to the cloud, stream them with high performance and see the value of a GeForce NOW membership grow with new games, rewards and features. First up, finding something to play gets an upgrade.

Items
228
Multi-source
50
Long-form (≥7.5)
4
Sources OK / attempted
57 / 77
Top category
Research (88)