
AI Digest — April 20, 2026

Coverage window: 2026-04-13 to 2026-04-20
Items
596
Multi-source
96
Long-form (≥7.5)
6
Top category
Research
Run duration
15m 23s
#1
Claude Opus 4.7 released with SOTA SWE-Bench Verified, expanded vision, new 'xhigh' effort level
Robotics · ★ 9.6 · 2026-04-16
#2
Physical Intelligence releases π0.7: steerable VLA foundation model with emergent cross-embodiment generalization
Robotics · ★ 8.6 · 2026-04-16
#3
Dwarkesh Patel × Jensen Huang: TPU competition, Nvidia's supply chain, and the case against export controls
Industry · ★ 8.2 · 2026-04-15
#1
Anthropic News, Artificial Analysis, YT: AI Explained, Hacker News, Cursor Blog, Latent Space … 2026-04-16

Anthropic released Claude Opus 4.7 on April 16, 2026, positioning it as the frontier of its lineup while holding pricing unchanged at five dollars per million input tokens and twenty-five dollars per million output tokens. The headline technical gains sit in software engineering, long-horizon autonomy, and vision. On SWE-bench Verified the model posts state-of-the-art resolution rates, with early testers reporting roughly thirteen percentage points of gain over Opus 4.6 on coding tasks and a threefold increase in production tasks completed end-to-end. On specialized evaluations the model reaches 90.9 percent on BigLaw Bench and takes the lead on GDPval-AA, a third-party evaluation of economically valuable knowledge work. Instruction following is reported to be substantially stronger, which Anthropic flags explicitly as a migration hazard: prompts tuned against older Claude versions may need to be rewritten rather than ported.

Vision capacity expanded roughly threefold, with support for images up to 2,576 pixels on the long edge. In practice this unlocks dense-screenshot reading, reasoning over complex diagrams, and better extraction from technical and scientific figures. A new reasoning effort level called xhigh exposes finer control over the reasoning-versus-latency tradeoff, extending the existing low-to-high ladder. Early testers emphasize sustained autonomous reasoning — the model is described as coherent across multi-hour runs rather than prematurely concluding on difficult problems. The tokenizer has been updated, and as a consequence input token counts rise by a factor of roughly 1.0 to 1.35 depending on content, which has meaningful implications for both latency and cost modeling.
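The cost implication of the tokenizer change can be sketched in a few lines, using the announced $5/$25 per-million pricing; the 1.2x inflation factor below is an illustrative midpoint of the reported 1.0–1.35 range, not a measured value.

```python
# Sketch: how a tokenizer-driven rise in input token counts propagates
# into per-request cost. Pricing is the announced $5 / $25 per million
# input/output tokens; the 1.2x factor is an assumed midpoint of the
# reported 1.0-1.35x inflation range.

INPUT_PRICE = 5.00 / 1_000_000    # USD per input token
OUTPUT_PRICE = 25.00 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int,
                 input_inflation: float = 1.0) -> float:
    """Cost of one request, scaling input tokens by a tokenizer factor."""
    return (input_tokens * input_inflation * INPUT_PRICE
            + output_tokens * OUTPUT_PRICE)

# Same workload measured under the old and new tokenizer assumptions:
old = request_cost(50_000, 2_000, input_inflation=1.0)
new = request_cost(50_000, 2_000, input_inflation=1.2)
print(f"old ${old:.3f} -> new ${new:.3f} (+{100 * (new / old - 1):.1f}%)")
```

For input-heavy workloads the inflation dominates the bill even though per-token prices are unchanged, which is why the cost-modeling caveat matters.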

Safety posture is framed as comparable to Opus 4.6, with improvements in honesty and prompt-injection resistance. The release is paired with Project Glasswing, a set of automatic safeguards that detect and block requests indicating prohibited or high-risk cybersecurity uses; a Cyber Verification Program carves out legitimate vulnerability research and penetration testing. Distribution covers all Claude products, the API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

Third-party evaluations and community reception were genuinely split. Artificial Analysis placed Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4 tied at an intelligence score of fifty-seven, which frames the release as frontier parity rather than an uncontested lead. On Hacker News, the strongest positive signal came from heavy Claude Code users noting gains on well-specified structured tasks at higher effort levels. Less favorably, multiple threads flagged regressions in day-to-day Claude Code use — more hallucinations on shallow checks, additional confirmation loops, and higher token burn per outcome. A separate and pointed complaint thread, 'Claude Code Opus 4.7 keeps checking on malware,' documented the model producing false-positive malware refusals on legitimate debugging work; the apparent root cause is an interaction between Claude Code's injected system prompt and Opus 4.7's updated reasoning behavior, with at least one developer reporting an account termination triggered by these signals. The practical takeaway is that whether Opus 4.7 is an upgrade depends materially on workflow: for structured high-effort tasks with well-written prompts the gains appear real, while casual Claude Code use may warrant holding on 4.6 until the rough edges settle.

How it was discussed
  • Anthropic News: Framed as SOTA on SWE-bench Verified with 2,576px vision, xhigh reasoning control, and Project Glasswing cybersecurity guardrails; pricing unchanged.
  • Artificial Analysis: Placed Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4 tied at intelligence score 57 — frontier parity, not a singular lead.
  • Hacker News: Developer reception split — users report stronger structured-coding performance at high effort but regressions in Claude Code (hallucinations, more confirmation loops, higher token burn).
  • Hacker News (malware thread): Injected Claude Code system prompt now interacts with 4.7's reasoning to produce false-positive 'suspected malware' refusals on legitimate debugging work; one developer reported account termination.
  • YT: AI Explained: Walked through the 'New Frontier in Performance and Drama' video framing — strong on benchmarks, but with community drama around tokenizer cost changes and refusal behavior.
Impact 9.8 · Import 8.8 · Pop 9.4
#2
Physical Intelligence, HF Daily Papers (indirect) 2026-04-16

Physical Intelligence released π0.7, a vision-language-action foundation model for robotic manipulation that is deliberately positioned as a single steerable generalist rather than a collection of task-specific specialists. The architecture retains the VLA template from prior π releases: a vision-language backbone ingests image observations plus language and metadata, and an action head produces low-level control. What is new is the conditioning surface. Inputs now include episode metadata that explicitly tags quality, speed, and control modality, and a lightweight world model generates visual subgoals used as a form of learned grounding for the high-level policy. The high-level policy itself emits intermediate language subtasks, which the downstream policy consumes alongside the visual subgoals. This lets the model disambiguate diverse behaviors in its training data rather than averaging them out.

Training uses an unusually heterogeneous data mix. Robot teleoperation demonstrations span multiple embodiments, human video contributes strategy priors without action labels, and autonomous episodes from reinforcement-learning specialist policies — specifically models trained with the Recap procedure — are folded in as additional demonstration-like signal. Naively merging these would typically degrade performance, because suboptimal or specialist-quirk behaviors contaminate the base policy. The metadata conditioning is the mechanism that prevents this: labels describing the quality and provenance of each episode allow the model to distill diverse strategies while preserving expert-level output at inference time, simply by prompting the relevant metadata tokens.
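The conditioning mechanism can be illustrated with a toy sketch: each episode carries a prefix of metadata tags during training, and at inference the desired tags are simply prompted. The tag names, values, and structure below are hypothetical; π0.7's actual token scheme is not public.

```python
# Illustrative sketch of metadata conditioning as described for pi-0.7:
# episodes are prefixed with quality/speed/provenance tags so the policy
# learns diverse behaviors as distinguishable modes instead of averaging
# them. All tag names here are hypothetical, for illustration only.

def episode_prefix(quality: str, speed: str, source: str) -> list[str]:
    """Build the conditioning token prefix for one episode."""
    return [f"<quality={quality}>", f"<speed={speed}>", f"<source={source}>"]

# Training-time: each data source keeps its own labels.
teleop = episode_prefix("medium", "slow", "teleop")
rl_spec = episode_prefix("expert", "fast", "rl_specialist")

# Inference-time: prompt the expert tags to elicit distilled specialist
# behavior from the single generalist policy.
prompt = episode_prefix("expert", "fast", "rl_specialist") + ["<task=fold_laundry>"]
print(prompt)
```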

The capability breadth is unusual for a single model. Reported tasks include laundry folding on a bimanual UR5e platform, espresso preparation end-to-end, flat-pack box building, kitchen appliance operation, vegetable peeling, and, notably, zero-shot air-fryer loading composed from disparate training episodes. Quantitatively, π0.7 matches or exceeds π*0.6 reinforcement-learning specialist models on laundry with T-shirts, diverse laundry items, espresso, and box-building in both success rate and throughput — despite being a single generalist rather than purpose-trained per task. Cross-embodiment generalization is notable: on the bimanual UR5e, an undertrained platform in the dataset, the model folds laundry at a success rate comparable to experienced human teleoperators attempting that cross-embodiment configuration for the first time after 375-plus hours of prior teleop experience elsewhere.

The authors argue this constitutes early evidence of compositional generalization in the VLA class — recombining primitives similarly to how large language models recombine reasoning patterns, and accepting language coaching for novel tasks without additional teleop data. The practical implication is that VLA scaling along data-quality and data-diversity axes, rather than parameter count alone, continues to be the binding constraint on generalist-robot progress, and that metadata-conditioned distillation from RL specialists is a viable mechanism for folding specialist gains back into a steerable generalist.

Impact 9.2 · Import 8.8 · Pop 7.4
#3
Dwarkesh Patel Podcast 2026-04-15

Dwarkesh Patel's interview with Jensen Huang lands at a moment when NVIDIA's moat is being stress-tested from several directions at once — Anthropic's TPU shift, Google and Broadcom's multi-gigawatt compute partnership announcements, and sustained policy pressure over chip exports to China. Huang's responses across the conversation form a coherent defense that rests on three claims: architecture, supply chain, and geopolitics.

On architecture, Huang argues that the leap from Hopper to Blackwell delivered roughly fifty times efficiency improvement, far exceeding the twenty-five percent annualized scaling implied by lithography alone. His framing separates matrix multiplication, which TPUs handle well, from the broader set of operations that large models increasingly require — novel attention variants, hybrid model structures, dynamic control flow — where general-purpose programmability is load-bearing. He is explicit that CUDA's moat is a software-ecosystem moat: several hundred million deployed GPUs across every major cloud means kernel engineers prioritize NVIDIA, and he claims hyperscalers running custom kernels typically obtain two to three times speedup on top of NVIDIA's baseline optimized paths. He is open to architectural diversification — he references licensing Groq inference technology for premium, low-latency token segments — but rejects backporting to older process nodes like N7 on engineering-complexity grounds.

On supply chain, Huang describes roughly one hundred billion dollars of explicit upstream commitment and an estimated two hundred fifty billion dollars of total commitment, built on the premise that NVIDIA's downstream market dominance gives suppliers confidence to invest in capacity. He claims CoWoS packaging constraints are resolved and that EUV capacity bottlenecks are a two-to-three-year problem given clear demand signaling to ASML and TSMC. The binding constraint on datacenter buildouts, he asserts, is no longer chips or memory but skilled-trade labor — plumbers and electricians — which is a concrete claim worth testing.

On geopolitics, Huang makes the strongest and most contested argument. He contends that SMIC's seven-nanometer process combined with HBM2 memory is sufficient for competitive Chinese model training, so U.S. export controls do not throttle Chinese AI capability; they only accelerate the development of an independent Chinese ecosystem that NVIDIA is then cut out of. He rejects the uranium analogy pointedly, arguing AI compute is not a weapons-grade material with a scarce physical precursor. He frames Anthropic's Broadcom and Google partnership as a single-case deviation — he uses the phrase 'one hundred percent Anthropic' — rather than evidence of a broader shift. His core commercial argument: maintaining marginal sales to China preserves NVIDIA's position in roughly forty percent of the world's technology industry, and conceding that position hurts American long-term competitiveness more than allowing current sales.

The strategic takeaway is that NVIDIA's public position is now explicitly political, not just technical. Huang is arguing a policy case to a broad audience, which is itself a signal about how the company sees the 2026 environment.

Impact 7.4 · Import 8.4 · Pop 8.2
#4
Cursor Blog, Hacker News 2026-04-02

Cursor 3 is framed by Anysphere as a from-scratch rebuild rather than an iterative IDE update, positioning agents — not files, and not the editor itself — as the primary abstraction of the software development surface. The most visible change is consolidation: local and cloud agents converge into a single sidebar, and agent activity initiated from mobile, web, desktop, Slack, GitHub, or Linear surfaces in one unified stream. Parallel agent execution becomes first-class, with cloud agents emitting visual demos and screenshots the developer can inspect for verification, directly replacing the older workflow of context-switching to cursor.com/agents in a browser tab.

The handoff primitive is notable. Sessions move between local and cloud environments explicitly — cloud-to-local to test changes against a checkout using Composer 2, Cursor's frontier coding model, and local-to-cloud to preserve long-running tasks when closing the laptop. This is engineered around a recognition that developers want the same agent context in different execution environments at different times, not two parallel agents with drifted state. Composer 2 is referenced throughout as the frontier model for the workspace, though specific benchmark numbers are not enumerated in the launch post.

Supporting surfaces are redesigned. The diff interface integrates staging, commits, and pull request management rather than forcing context switches to git tooling. A built-in browser lets agents and developers interact with locally-hosted websites without another tab. Core IDE fundamentals — Language Server Protocol integration, go-to-definition, file navigation — are preserved, and the marketplace retains hundreds of plugins. The companion posts — multi-agent GPU kernel optimization with NVIDIA demonstrating thirty-eight percent geomean speedup across 235 CUDA problems, Bugbot self-improving with learned rules, and warp-decode MoE inference optimizations — indicate the team is investing in the deeper agent-coordination and runtime-efficiency substrate that makes parallel cloud agents economically viable.

The framing — 'the third era of software development' with fleets of autonomous agents as the default work unit — is a marketing claim but it is also a bet that the IDE stops being the focal surface for many developers. The practical question for teams is whether parallel cloud agents handling long-running tasks actually change work-in-progress dynamics, or whether the model stays IDE-centric with agents acting as fancier autocomplete. Cursor's revenue trajectory and the product direction suggest they believe the former; competitive responses from GitHub, JetBrains, and Cognition will be the clearest independent signal.

Impact 8.5 · Import 7.6 · Pop 7.6
#5
Transformer Circuits Thread 2026-04-15

Anthropic's interpretability team published 'Emotion Concepts and their Function in a Large Language Model' on the Transformer Circuits Thread. The paper identifies sparse-autoencoder-style representations corresponding to emotion concepts in Claude Sonnet 4.5's residual stream and demonstrates, through targeted causal interventions, that steering these features produces predictable and emotion-specific shifts in model outputs. This extends prior work on concept features — golden-gate-style single-feature steering, refusal circuits, sycophancy features — into a new category: internal variables that the model behaves as if it is 'experiencing,' in a functional sense, and that measurably influence generation downstream.

Methodologically the work fits the established pattern — train dictionary learners on residual activations, locate features that activate on emotionally-valenced inputs, and then probe causality by clamping or perturbing those features at inference time. What is new is the specific taxonomy and the degree to which these features behave as latent variables rather than surface artifacts. Steering an emotion-concept feature in the positive direction reliably shifts tone, word choice, and in some cases refusal behavior in ways consistent with the emotion label; negative-direction clamping produces the inverse. Importantly the interventions generalize across unrelated prompt contexts, which is the signature of a genuine internal variable rather than a pattern-matched output template.
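The clamping intervention described above can be sketched with a toy sparse autoencoder: encode the residual-stream vector, pin one feature to a chosen value, and add only the decoded delta back. Dimensions, the ReLU encoder, and the feature index are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

# Toy sketch of clamping a single SAE feature in a residual-stream vector,
# the intervention style described in the paper. The SAE here is random
# and untrained; sizes and the feature index are illustrative.

rng = np.random.default_rng(0)
d_model, d_features = 64, 256
W_enc = rng.normal(size=(d_model, d_features)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_features, d_model)) / np.sqrt(d_features)

def clamp_feature(resid: np.ndarray, idx: int, value: float) -> np.ndarray:
    """Encode, pin one feature to a fixed value, and apply only the
    reconstruction delta back onto the residual stream."""
    feats = np.maximum(resid @ W_enc, 0.0)   # SAE encode (ReLU)
    clamped = feats.copy()
    clamped[idx] = value                     # pin the target feature
    delta = (clamped - feats) @ W_dec        # decode only the change
    return resid + delta

resid = rng.normal(size=d_model)
steered = clamp_feature(resid, idx=42, value=8.0)
```

Applying only the delta (rather than replacing the residual with the full reconstruction) is one common choice for keeping the intervention surgical; clamping a feature to its current value is then a no-op by construction.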

The implications cluster in three directions. For alignment, emotion-like latent state being load-bearing for behavior means that steering interventions — to reduce sycophancy, harmful-compliance, or deception — may need to account for these features rather than optimizing purely on output-level rewards. For welfare framing, the paper contributes a concrete case that 'emotional' state is a functional internal variable with causal effect on generation, which is relevant to ongoing debates about AI welfare; the authors do not claim this resolves the philosophical question but it does raise the empirical bar. For the mechanistic-interpretability research agenda, it suggests dictionary learning over residual streams continues to yield interpretable and causally-relevant features at scale in frontier-tier models, which is a positive signal for the broader SAE research program.

Expected follow-up questions for the community include whether these features generalize across models — whether Claude 3, Opus, and non-Anthropic frontier models show analogous structure — and whether fine-tuning procedures such as RLHF shape or attenuate these features relative to pretrained base models. The paper stands as one of the more concrete demonstrations to date that interpretability tools can locate and causally manipulate abstractions that would previously have been dismissed as anthropomorphic projection.

Impact 7.8 · Import 8.2 · Pop 7.0
#6
Anthropic News, TechCrunch AI 2026-04-17

Anthropic Labs launched Claude Design on April 17, 2026, a research-preview product that positions Claude Opus 4.7 as a conversational design collaborator for visual artifacts — slides, prototypes, wireframes, marketing materials, and interactive elements. Users describe what they need, Claude produces an initial version, and refinement proceeds through inline comments, direct edits, and custom adjustment sliders that expose parameters like density, contrast, and tone. The import pipeline is broad: text prompts, images, uploaded documents, codebases, and URL capture that extracts elements from live websites, which matters because most real-world design briefs arrive as a mess of references rather than a clean specification.

The enterprise angle is more interesting than the consumer positioning suggests. Claude Design reads organization design systems automatically — colors, typography, component libraries — which means outputs are governed by brand constraints without the designer hand-holding the model through every iteration. Organization-scoped sharing with view and edit permissions makes the product collaborative by default, and exports cover Canva, PDF, PPTX, and HTML, plus a hand-off path into Claude Code when a design needs to become a real implementation. The interactive-prototype support extending to voice, video, and 3D elements is a meaningful expansion of what a design tool is expected to produce, even if the quality bar on those modalities is not yet stated.

Availability is gated to Claude Pro, Max, Team, and Enterprise subscribers as a research preview, with enterprise admin opt-in at the organization level, and usage counted against existing subscription limits rather than a separate tier. The positioning reads as Anthropic probing the design-software market — historically a Figma-and-Adobe duopoly for professionals and Canva for everyone else — with a model-centric alternative where the primary interface is conversation rather than direct manipulation. For teams already using Claude for other knowledge work, integration into existing billing and account boundaries materially lowers adoption friction, which is probably the more important dynamic than any single design-quality benchmark the product might eventually cite.

Impact 7.4 · Import 8.8 · Pop 6.8
#7
Cognition AI 2026-04-14

Specialized bug-detection model for code diffs in Windsurf. Matches Claude Opus 4.6 F1 on in-distribution evals; on OOD data (Cognition's internal codebase) the F1 gap narrows from 0.49 to 0.29 — still a gap. Runs ~10x faster via dense intermediate reasoning on Cerebras. Trained via Applied Compute RL with the Windsurf production environment integrated into the sandbox. Two key tricks: (1) reward linearization, converting the global F_0.5 objective into sample-level rewards for gradient descent; (2) two-phase post-training — first maximize capability, then optimize latency with a CDF-based penalty derived from user switch patterns during dogfooding. Shipping in Windsurf Next under cmd+U.
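One way to turn a corpus-level F_0.5 into per-sample rewards is a marginal-contribution scheme: score the corpus with and without each sample's outcome and reward the difference. Cognition's exact linearization is not public, so this is a sketch of the idea rather than their method.

```python
# Sketch of reward linearization: convert a global F_0.5 into per-sample
# rewards by each sample's marginal contribution to the corpus score.
# This marginal-contribution scheme is an assumption; Cognition's exact
# formulation is not described in the post.

def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Corpus F_beta from true-positive / false-positive / false-negative counts."""
    b2 = beta * beta
    denom = (1 + b2) * tp + fp + b2 * fn
    return (1 + b2) * tp / denom if denom else 0.0

def sample_reward(outcome: str, tp: int, fp: int, fn: int) -> float:
    """Marginal contribution of one sample to corpus F_0.5:
    score with the sample included minus score with it removed."""
    without = {"tp": (tp - 1, fp, fn),
               "fp": (tp, fp - 1, fn),
               "fn": (tp, fp, fn - 1)}[outcome]
    return f_beta(tp, fp, fn) - f_beta(*without)

# With corpus counts tp=80, fp=10, fn=10, a true positive earns a positive
# reward while a false positive earns a negative one:
print(sample_reward("tp", 80, 10, 10), sample_reward("fp", 80, 10, 10))
```

Because beta=0.5 weights precision over recall, false positives are penalized more heavily than false negatives under this decomposition, which matches a bug-detector product constraint (noisy flags are worse than missed bugs).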

Impact 7.8 · Import 7.0 · Pop 6.4
#8
Cursor Blog 2026-04-14

A planner agent distributes optimization tasks to autonomous workers; the coordination protocol fits in a single markdown file. Workers learn to invoke benchmarking pipelines and iterate without human intervention. Evaluated on 235 CUDA problems: 149 (63%) improved, 19% by >2x. Speed-of-Light scoring: median 0.56 (lots of headroom), attention kernel 0.9722 (84% speedup), matmul at 86% of cuBLAS with 9% improvement on specific workloads, quantized ops 39% faster. Evaluated both direct CUDA C + inline PTX and high-level CuTe DSL — the system learned novel APIs from docs alone.

Impact 8.2 · Import 4.0 · Pop 7.6
#9
Dwarkesh Patel Podcast 2026-04-15

On distillation: at ~$25/M tokens from frontier models, training-data acquisition is trivial; hidden CoT offers limited protection since models can be told to skip reasoning, RL targets can force reconstruction, and agentic tool use is observable on users' devices. On pretraining: FSDP as default (shards 1/N, all-gathers per layer, discards post-use — comms ~3x params via reduce-scatter); hierarchical collectives (reduce-scatter within NVLink domains, all-reduce across) manage the compute-comms crossover; hard batch-size floor from attention-within-sequence locality pushes max scaling to ~1K GPUs at 10K seq/10M critical tokens. Failure modes: causality breaks via expert routing, token dropping, FP16 granularity in collectives causing gradient errors. Mythos frames as multi-vulnerability attack sophistication discontinuity; defense-patching harder than vuln-finding for AI systems.
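The "comms ~3x params" FSDP claim is simple arithmetic worth making explicit: parameters are all-gathered for the forward pass, all-gathered again for the backward pass, and gradients are reduce-scattered, each moving roughly one full copy of the parameters per step. The model size below is an illustrative assumption.

```python
# Back-of-envelope for FSDP per-step communication volume: two parameter
# all-gathers (forward + backward, since shards are discarded after use)
# plus one gradient reduce-scatter, each ~one full parameter copy.
# The 70B bf16 example is illustrative.

def fsdp_comm_bytes(n_params: int, bytes_per_elem: int = 2) -> int:
    """Approximate per-step, per-GPU communication volume under FSDP."""
    all_gather_fwd = n_params * bytes_per_elem   # gather shards for forward
    all_gather_bwd = n_params * bytes_per_elem   # re-gather for backward
    reduce_scatter = n_params * bytes_per_elem   # scatter-reduce gradients
    return all_gather_fwd + all_gather_bwd + reduce_scatter

# A 70B-parameter model in bf16 moves ~420 GB of traffic per step:
print(fsdp_comm_bytes(70_000_000_000) / 1e9, "GB")
```

That volume is why the hierarchical trick (reduce-scatter within fast NVLink domains, all-reduce across slower links) matters: it keeps the bulk of the 3x traffic on the highest-bandwidth tier.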

Impact 7.2 · Import 4.0 · Pop 6.8
#10
arXiv cs.LG, arXiv cs.AI, arXiv Efficiency 2026-04-17
by Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin

We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences in low-resource data settings. HILBERT leverages frozen pre-trained speech and language encoders to extract segment-level features, which are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding. To align modalities while preserving modality-specific structure under severe audio-text dimensional imbalance, we introduce a reciprocal dual contrastive objective that simultaneously aligns audio-to-joint and text-to-joint representations, rather than directly contrasting audio and text alone. Two auxiliary regularizers further stabilize long-sequence fusion: a Centered Kernel Alignment (CKA) loss that preserves structural consistency between each modality and the joint embedding, and a mutual information balancing loss that prevents dominance of a single modality by equalizing information flow from audio and text into the joint space. For downstream prediction, H
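The CKA regularizer referenced in the abstract has a standard linear form worth spelling out: it compares two representation matrices (samples x dimensions) in a way that is invariant to rotation and isotropic scaling. Only the measure itself is shown; HILBERT's loss weighting is not reproduced here.

```python
import numpy as np

# Linear Centered Kernel Alignment (CKA), the similarity measure behind
# HILBERT's structural-consistency loss. Inputs are (n_samples, n_dims)
# representation matrices; output is in [0, 1], with 1 for representations
# identical up to rotation and isotropic scaling.

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    X = X - X.mean(axis=0)                      # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2  # cross-covariance energy
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random rotation
print(linear_cka(X, X))        # a representation matches itself: 1.0
print(linear_cka(X, X @ Q))    # rotation invariance: still 1.0
```

Because CKA tolerates rotations of the embedding space, using it as an auxiliary loss constrains the *structure* shared between a modality and the joint embedding without forcing the two spaces into the same basis.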

Impact 4.0 · Import 5.6 · Pop 5.9
#11
arXiv cs.LG, arXiv cs.AI, arXiv cs.NE 2026-04-17
by Stefano Colamonaco, David Debot, Pietro Barbiero, Giuseppe Marra

Concept Bottleneck Models (CBMs) aim to improve interpretability in Deep Learning by structuring predictions through human-understandable concepts, but they provide no way to verify whether learned concepts align with the human's intended meaning, hurting interpretability. We introduce Prototype-Grounded Concept Models (PGCMs), which ground concepts in learned visual prototypes: image parts that serve as explicit evidence for the concepts. This grounding enables direct inspection of concept semantics and supports targeted human intervention at the prototype level to correct misalignments. Empirically, PGCMs match the predictive performance of state-of-the-art CBMs while substantially improving transparency, interpretability, and intervenability.

Impact 4.0 · Import 5.6 · Pop 5.9
#12
arXiv cs.LG, arXiv cs.CL, arXiv cs.AI 2026-04-17
by Yutong Gao, Qinglin Meng, Yuan Zhou, Liangming Pan

While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.

Impact 4.0 · Import 5.6 · Pop 5.9
#13
arXiv cs.LG, arXiv cs.CL, arXiv Efficiency, HF Daily Papers 2026-04-17
by Jiaxi Bi, Tongxu Luo, Wenyu Du, Zhengyang Tang et al.

Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non-learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets - for instance, boosting GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90% under fixed compute budgets. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real-world deployment. Code, data and models are available at https://bijiaxihh.github.io/STOP

Impact 4.0 · Import 4.0 · Pop 7.7
#14

Where does output diversity collapse in post-training?

Research ★ 5.4 multi-source (7)
arXiv cs.LG, arXiv cs.CL, arXiv cs.AI, arXiv RL, arXiv PostTraining, arXiv Efficiency … 2026-04-17
by Constantinos Karouzos, Xingwei Tan, Nikolaos Aletras

Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3, Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-c
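Answer-level diversity of the kind tracked here can be made concrete with distinct-n, one of the simplest text diversity metrics: the ratio of unique n-grams to total n-grams across a set of sampled outputs. The paper's own four metrics are not specified in this excerpt, so this is a generic illustration.

```python
# Sketch of distinct-n, a simple output-diversity metric: unique n-grams
# divided by total n-grams across sampled outputs. Collapsed samplers
# score near 1/k for k identical samples; fully varied samplers score 1.0.

def distinct_n(outputs: list[str], n: int = 2) -> float:
    total, unique = 0, set()
    for text in outputs:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Collapsed samples score low; varied samples score high:
collapsed = ["the answer is four", "the answer is four", "the answer is four"]
varied = ["it equals four", "the result is 4", "four is the answer"]
print(distinct_n(collapsed), distinct_n(varied))
```

Metrics like this are exactly what inference-time scaling methods implicitly depend on: if every sample repeats the same n-grams, drawing more samples buys nothing.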

Impact 4.0 · Import 4.0 · Pop 7.7
#15
arXiv cs.AI, arXiv cs.CV, arXiv MechInterp 2026-04-17
by Hao Wang, Beichen Zhang, Yanpei Gong, Shaoyi Fang et al.

As forgery types continue to emerge consistently, Incremental Face Forgery Detection (IFFD) has become a crucial paradigm. However, existing methods typically rely on data replay or coarse binary supervision, which fails to explicitly constrain the feature space, leading to severe feature drift and catastrophic forgetting. To address this, we propose AIFIND, Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection, which leverages semantic anchors to stabilize incremental learning. We design the Artifact-Driven Semantic Prior Generator to instantiate invariant semantic anchors, establishing a fixed coordinate system from low-level artifact cues. These anchors are injected into the image encoder via Artifact-Probe Attention, which explicitly constrains volatile visual features to align with stable semantic anchors. Adaptive Decision Harmonizer harmonizes the classifiers by preserving angular relationships of semantic anchors, maintaining geometric consistency across tasks. Extensive experiments on multiple incremental protocols validate the superiority of AIFIND.

Impact 4.0 · Import 5.6 · Pop 5.9
#16
arXiv cs.CV, arXiv GenMedia, arXiv Efficiency, arXiv Evals, HF Daily Papers 2026-04-17
by Haoran Feng, Yifan Niu, Zehuan Huang, Yang-Tian Sun et al.

We introduce LaviGen, a framework that repurposes 3D generative models for 3D layout generation. Unlike previous methods that infer object layouts from textual descriptions, LaviGen operates directly in the native 3D space, formulating layout generation as an autoregressive process that explicitly models geometric relations and physical constraints among objects, producing coherent and physically plausible 3D scenes. To further enhance this process, we propose an adapted 3D diffusion model that integrates scene, object, and instruction information and employs a dual-guidance self-rollout distillation mechanism to improve efficiency and spatial accuracy. Extensive experiments on the LayoutVLM benchmark show LaviGen achieves superior 3D layout generation performance, with 19% higher physical plausibility than the state of the art and 65% faster computation. Our code is publicly available at https://github.com/fenghora/LaviGen.

Impact 4.0 · Import 4.0 · Pop 7.7
#17
arXiv cs.CL, arXiv cs.AI, arXiv Evals 2026-04-17
by Van-Truong Le

The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the "why" behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and Consistency but compromise on fine-grained legal Accuracy, while models like Claude 3 Opus achieve high Accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis pinpoints 'Incorrect Example' and 'Misinterpretation' as the most prevalent

Impact 4.8 · Import 4.6 · Pop 5.9
#18
arXiv cs.CL, arXiv cs.AI, arXiv cs.CV, arXiv Evals 2026-04-17
by Xiangbo Gao, Sicong Jiang, Bangya Liu, Xinghao Chen et al.

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a b
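The per-dimension ordinal regression used by VEFX-Reward can be sketched as a cumulative-link model: predict one scalar quality logit, compare it against K-1 increasing cut points, and count how many it clears. Shapes and threshold values below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def ordinal_scores(logits, thresholds):
    """Cumulative-link ordinal regression head (sketch).
    logits: (batch,) scalar quality logit, one per example/dimension.
    thresholds: (K-1,) increasing cut points for K ordinal levels.
    Returns integer scores in [0, K-1]: the number of cut points the
    logit clears, i.e. how far up the ordinal scale the example sits."""
    probs = 1.0 / (1.0 + np.exp(-(logits[:, None] - thresholds[None, :])))
    return (probs > 0.5).sum(axis=1)

# K = 5 ordinal quality levels, scored independently per dimension
# (e.g. Instruction Following), with illustrative cut points:
cuts = np.array([-1.5, -0.5, 0.5, 1.5])
scores = ordinal_scores(np.array([-2.0, 0.0, 2.0]), cuts)
print(scores)  # [0 2 4]
```

Ordinal regression, unlike plain classification, respects the ordering of quality levels: a prediction one level off is a smaller error than one three levels off.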

Impact 4.0 · Import 4.6 · Pop 6.5
#19

Qwen3.5-Omni Technical Report

Multimodal ★ 5.2 multi-source (3)
arXiv cs.CL, arXiv Efficiency, HF Daily Papers 2026-04-17
by Qwen Team

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text

Impact 4.0 · Import 4.0 · Pop 7.1
#20
arXiv cs.CV, arXiv SSM, arXiv Evals 2026-04-17
by Tejeswar Pokuri, Shivarth Rai

Underwater images often suffer from severe degradation, such as color distortion, low contrast, and blurred details, due to light absorption and scattering in water. While learning-based methods like CNNs and Transformers have shown promise, they face critical limitations: CNNs struggle to model the long-range dependencies needed for non-uniform degradation, and Transformers incur quadratic computational complexity, making them inefficient for high-resolution images. To address these challenges, we propose Hero-Mamba, a novel Mamba-based network that achieves efficient dual-domain learning for underwater image enhancement. Our approach uniquely processes information from both the spatial domain (RGB image) and the spectral domain (FFT components) in parallel. This dual-domain input allows the network to decouple degradation factors, separating color/brightness information from texture/noise. The core of our network utilizes Mamba-based SS2D blocks to capture global receptive fields and long-range dependencies with linear complexity, overcoming the limitations of both CNNs and Transformers. Furthermore, we introduce a ColorFusion block, guided by a background light prior, to restore
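The dual-domain input described above, raw RGB for the spatial branch plus FFT components for the spectral branch, can be sketched in NumPy. This is a minimal illustration of the input construction, not the Hero-Mamba implementation.

```python
import numpy as np

def dual_domain_inputs(img):
    """Build the two parallel input streams (sketch).
    img: (H, W, 3) float RGB array in [0, 1].
    Returns (spatial, spectral): the raw image for the spatial branch,
    and channel-wise FFT amplitude + phase for the spectral branch."""
    fft = np.fft.fft2(img, axes=(0, 1))
    amplitude = np.abs(fft)    # global color/illumination energy per frequency
    phase = np.angle(fft)      # layout of texture and structure
    spectral = np.concatenate([amplitude, phase], axis=-1)  # (H, W, 6)
    return img, spectral

rgb = np.random.rand(8, 8, 3)
spatial, spectral = dual_domain_inputs(rgb)
print(spatial.shape, spectral.shape)  # (8, 8, 3) (8, 8, 6)
```

The point of the split is decoupling: global color casts and brightness shifts concentrate in low-frequency amplitude, while texture and noise live largely in phase and high frequencies, so each branch sees a cleaner version of its degradation factor.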

Impact 4.7 · Import 4.0 · Pop 5.9
#21
Import AI (Jack Clark) 2026-04-20

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.

Huawei’s HiFloat4 training format beats Western-developed MXFP4 in Ascend chip bakeoff: …Could this also be a symptom of the impact of export controls in driving Chinese interest towards maximizing training and inference efficiency? Perhaps… Huawei researchers have tested out HiFloat4, a 4-bit precision format for AI training and inference, against MXFP4, an Open Compute Project 4-bit format, and found that HiFloat4 is superior. This is interesting because it reflects a broader pattern of Chinese companies developing their own low-precision data formats explicitly coupled with their own hardware platforms. “Our goal is to enable efficient FP4 LLM pretraining on specialized AI accelerators with strict power constraints. We focus on Huawei Ascend NPUs, which are domain-specific accelerators designed for deep learning workloads,” they write.

What they tested: In this paper, the authors train three model types on Huawei Ascend chips: OpenPangu-1B, Llama3-8B, and Qwen3-MoE-30B. In tests, the bigger the models, the better HiFloat4 does at reducing its loss error relative to a BF16 baseline, and in all cases it does better than MXFP4.

What they found: “We conduct a systematic evaluation of the HiFloat4 (HiF4) format and show that it achieves lower relative loss (≈ 1.0%) compared to MXFP4 (≈ 1.5%) when measured against a full-precision baseline,” they write. “HiF4 consistently achieves significantly lower relative error compared to MXFP4. For Llama and Qwen, HiF4 attains an error gap of less than 1% with respect to the baseline… HiF4 gets within ~1% of BF
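The headline metric here, relative loss versus a BF16 baseline, is simple to state. A sketch follows; the paper's exact definition may differ, and the loss values are illustrative numbers chosen to reproduce the quoted gaps.

```python
def relative_loss_error(loss_lowprec, loss_bf16):
    """Relative loss error of a low-precision training run versus the
    full-precision (BF16) baseline, in percent. This is the natural
    reading of the quoted metric, not a definition from the paper."""
    return 100.0 * abs(loss_lowprec - loss_bf16) / loss_bf16

# Illustrative numbers matching the quoted gaps (HiF4 ~1.0%, MXFP4 ~1.5%):
baseline = 2.000
print(round(relative_loss_error(2.020, baseline), 1))  # 1.0 (HiF4-like)
print(round(relative_loss_error(2.030, baseline), 1))  # 1.5 (MXFP4-like)
```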

Impact 4.0 · Import 5.8 · Pop 4.8
#22
arXiv cs.LG, arXiv cs.CL, arXiv cs.AI, arXiv Evals 2026-04-17
by Yunhe Li, Hao Shi, Bowen Deng, Wei Wang et al.

Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose a novel framework designed to cultivate this essential reasoning skill and enable LLMs to perform insightful reasoning. We propose $\mathtt{DeepInsightTheorem}$, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches alongside the final proof. To fully exploit this dataset, we design a Progressive Multi-Stage SFT strategy that mimics the human learning process, guiding the model from basic proof writing to insightful thinking. Our experiments on challenging mathematical benchmarks demonstrate that this insight-aware generation strategy significantly outperforms baselines. These results demonstrate that teaching models to identify and apply core techniques can substantially improve their mathematical reaso

Impact 4.0 · Import 4.0 · Pop 6.5
#23
arXiv cs.LG, arXiv cs.AI, arXiv RL, arXiv Agents 2026-04-17
by Sarthak Mittal, Leo Gagnon, Guillaume Lajoie

Frontier models have demonstrated exceptional capabilities following the integration of task-reward-based reinforcement learning (RL) into their training pipelines, enabling systems to evolve from pure reasoning models into sophisticated agents. However, debate persists regarding whether RL genuinely instills new skills within a base model or merely sharpens its existing distribution to elicit latent capabilities. To address this dichotomy, we present an explicit comparison between distribution sharpening and task-reward-based learning, utilizing RL as a tool to implement both paradigms. Our analysis reveals the inherent limitations of distribution sharpening, demonstrating from first principles how and why the optima can be unfavorable and the approach fundamentally unstable. Furthermore, our experiments using Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct and Qwen3-4B-Instruct-2507 on math datasets confirm that sharpening yields limited gains, whereas incorporating task-based reward signal can greatly help achieve robust performance improvements and stable learning.

Impact 4.0 · Import 4.0 · Pop 6.5
#24
arXiv cs.LG, arXiv cs.CL, arXiv RL, arXiv Evals 2026-04-17
by Songtao Wang, Quang Hieu Pham, Fangcong Yin, Xinpeng Wang et al.

Reinforcement learning with verifiable rewards (RLVR) typically optimizes for outcome rewards without imposing constraints on intermediate reasoning. This leaves training susceptible to reward hacking, where models exploit loopholes (e.g., spurious patterns in training data) in the reward function to achieve high scores without solving the intended task. These reward-hacking behaviors are often implicit, as the intermediate chain-of-thought (CoT) may appear plausible on the surface, limiting the effectiveness of purely text-based monitoring. We propose Gradient Fingerprint (GRIFT), a method for detecting reward hacking using models' internal computations. Given a prompt and a model-generated CoT, GRIFT computes gradients of the CoT conditioned on the prompt and compresses them into a compact representation, which is then used to assess whether the CoT reflects reward hacking behavior. Across verifiable reasoning benchmarks spanning math, code, and logical reasoning, GRIFT substantially outperforms strong baselines, including CoT Monitor and TRACE, achieving over 25% relative improvement in detecting reward hacking behavior. Moreover, integrating GRIFT into the rejection fine-tuning

Impact 4.0 · Import 4.0 · Pop 6.5
#25
arXiv cs.LG, arXiv cs.CL, arXiv cs.AI, arXiv RL 2026-04-17
by Max Henning Höth, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu

Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model's final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct demonstrate that our approach can identify influential reasoning tokens and enable training more transparent reasoning models.
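At its simplest, combining a saliency reward with an outcome reward inside an RL objective is a weighted blend. The sketch below is a toy: the weighting `alpha` and the particular saliency definition are assumptions, not AtManRL's formulation.

```python
def combined_reward(outcome_correct, saliency, alpha=0.5):
    """Blend an outcome reward with a saliency reward (toy sketch).
    outcome_correct: whether the final answer is right.
    saliency: assumed to be in [0, 1], e.g. the fraction of learned
    attention mass on CoT tokens that actually influences the answer.
    alpha: blend weight -- an assumption, not the paper's value."""
    return (1.0 - alpha) * float(outcome_correct) + alpha * saliency

# A correct answer backed by an influential trace scores highest;
# a correct answer whose trace the model ignored is penalized.
print(combined_reward(True, 0.8))
print(combined_reward(True, 0.1))
print(combined_reward(False, 0.9))
```

The interesting design point is the second case: pure outcome rewards cannot distinguish a faithful trace from decorative reasoning, which is exactly the gap the saliency term targets.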

Impact 4.0 · Import 4.0 · Pop 6.5
#26
arXiv cs.LG, arXiv cs.AI, arXiv cs.CV, arXiv stat.ML 2026-04-17
by Hamed Ouattara, Pierre Duthon, Pascal Houssam Salmane, Frédéric Bernardin et al.

One of the dominant paradigms in self-supervised learning (SSL), illustrated by MoCo or DINO, aims to produce robust representations by capturing features that are insensitive to certain image transformations such as illumination, or geometric changes. This strategy is appropriate when the objective is to recognize objects independently of their appearance. However, it becomes counterproductive as soon as appearance itself constitutes the discriminative signal. In weather analysis, for example, rain streaks, snow granularity, atmospheric scattering, as well as reflections and halos, are not noise: they carry the essential information. In critical applications such as autonomous driving, ignoring these cues is risky, since grip and visibility depend directly on ground conditions and atmospheric conditions. We introduce ST-STORM, a hybrid SSL framework that treats appearance (style) as a semantic modality to be disentangled from content. Our architecture explicitly separates two latent streams, regulated by gating mechanisms. The Content branch aims at a stable semantic representation through a JEPA scheme coupled with a contrastive objective, promoting invariance to appearance varia

Impact 4.0 · Import 4.0 · Pop 6.5
#27
arXiv cs.LG, arXiv cs.AI, arXiv Agents 2026-04-17
by Hikaru Shindo, Hanzhao Lin, Lukas Helff, Patrick Schramowski et al.

As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.

Impact 4.0 · Import 4.6 · Pop 5.9
#28

BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

Evals & Benchmarks ★ 5.0 multi-source (3)
arXiv cs.CL, arXiv cs.AI, arXiv Evals 2026-04-17
by Jiacheng Shen, Masato Hagiwara, Milad Alizadeh, Ellen Gilsenan-McMahon et al.

Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language m

Impact 4.0 · Import 4.6 · Pop 5.9
#29
arXiv cs.CL, arXiv RL, arXiv PostTraining, arXiv Efficiency 2026-04-17
by Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang

Machine learning models are increasingly deployed in real-world applications, but even aligned models such as Mistral and LLaVA still exhibit unsafe behaviors inherited from pre-training. Current alignment methods like SFT and RLHF primarily encourage models to generate preferred responses, but do not explicitly remove the unsafe subnetworks that trigger harmful outputs. In this work, we introduce a resource-efficient pruning framework that directly identifies and removes parameters associated with unsafe behaviors while preserving model utility. Our method employs a gradient-free attribution mechanism, requiring only modest GPU resources, and generalizes across architectures and quantized variants. Empirical evaluations on ML models show substantial reductions in unsafe generations and improved robustness against jailbreak attacks, with minimal utility loss. From the perspective of the Lottery Ticket Hypothesis, our results suggest that ML models contain "unsafe tickets" responsible for harmful behaviors, and pruning reveals "safety tickets" that maintain performance while aligning outputs. This provides a lightweight, post-hoc alignment strategy suitable for deployment in resourc

Impact 4.0 · Import 4.0 · Pop 6.5
#30
arXiv cs.RO, arXiv cs.CV, arXiv Agents, arXiv Evals 2026-04-17
by Dian Shao, Zhengzheng Xu, Peiyang Wang, Like Liu et al.

UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with sentence-level instruction-trajectory alignment and refined instructions containing explicit visual endpoints and landmark references. Experiments show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These

Impact 4.0 · Import 4.0 · Pop 6.5
#31
NVIDIA AI Blog 2026-04-20

AI agents are transforming how work gets done across all industries, accelerating everything from content creation to decision-making. NVIDIA’s expanded strategic collaborations with Adobe and WPP are bringing agentic AI to the center of enterprise marketing operations across creative production and customer experience orchestration. As demand for personalized customer experiences surges, brands require intelligent systems that can plan, create, produce and activate content continuously — without compromising control, governance or brand integrity. Consider a global retailer delivering the right offer, image, copy and price, across millions of product, audience and channel combinations — updated in minutes instead of months. For marketing and creative teams, that means moving from one-size-fits-all campaigns to tailored experiences that are always on, always relevant and on brand. All of it is powered by intelligent systems that continuously generate and deliver content without sacrificing control, governance or brand integrity. The expanded collaborations bring together three complementary strengths: Adobe’s creative and customer experience platforms and the new Adobe CX Enterprise Coworker, WPP’s global media and marketing expertise, and NVIDIA’s accelerated computing and software stack, including NVIDIA Nemotron open models, NVIDIA Agent Toolkit and the NVIDIA OpenShell secure runtime for building and running secure agentic AI systems. As these agents begin orchestrating multistep workflows, tapping sensitive data and triggering actions across marketing stacks, enterprises need a way to enforce clear rules of engagement so every operation remains compliant, on brand and within defined risk boundaries. Powered by the NVIDIA OpenShell runtime, every agent operates with

Impact 5.8 · Import 4.0 · Pop 4.8
#33
Latent Space Podcast, Latent Space 2026-04-15

Notion rebuilt its agent system five times since late 2022. Progressed from JS codegen (models couldn't write reliable code) through custom XML, markdown, and SQL-like DB abstractions. Key shift: few-shot prompts → tool definitions to distribute ownership across teams. 100+ tools managed via progressive disclosure to avoid context bloat. CLIs favored for self-debugging within one env; MCPs for narrow, tightly-permissioned tasks. Pricing via credits abstracting tokens/model/tier/features. 'Auto' model selection matches capability to task — explicit rejection of 'frontier model for every knowledge-work task'.
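The progressive-disclosure idea, show the model a compact catalog first and expand full tool schemas only on request, can be sketched as a small registry. Names and structure here are illustrative, not Notion's implementation.

```python
class ToolRegistry:
    """Progressive disclosure over a large tool catalog (sketch).
    The model first sees one short summary line per category and
    requests full schemas only for the category it needs, keeping
    context usage small even with 100+ tools registered."""

    def __init__(self):
        self._tools = {}  # category -> {tool name: full definition}

    def register(self, category, name, definition):
        self._tools.setdefault(category, {})[name] = definition

    def summary(self):
        # Cheap view shown to the model up front: names only.
        return {cat: sorted(tools) for cat, tools in self._tools.items()}

    def expand(self, category):
        # Full schemas, loaded into context only on demand.
        return self._tools.get(category, {})

reg = ToolRegistry()
reg.register("database", "query_db", {"params": ["sql"]})
reg.register("database", "create_row", {"params": ["table", "values"]})
reg.register("pages", "edit_page", {"params": ["page_id", "patch"]})
print(reg.summary())
```

This also distributes ownership: each team maintains its own category of tool definitions, while the summary view is what every agent sees by default.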

Impact 4.0 · Import 4.0 · Pop 6.0
#34
arXiv cs.AI, arXiv cs.CV 2026-04-17
by Baramee Sukumal, Aueaphum Aueawatthanaphisut

Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Conventional computed tomography (CT) imaging, while essential for detection and staging, has limitations in distinguishing benign from malignant lesions and providing interpretable diagnostic insights. To address this challenge, this study proposes a dual-modal artificial intelligence framework that integrates CT radiology with hematoxylin and eosin (H&E) histopathology for lung cancer diagnosis and subtype classification. The system employs convolutional neural networks to extract radiologic and histopathologic features and incorporates clinical metadata to improve robustness. Predictions from both modalities are fused using a weighted decision-level integration mechanism to classify adenocarcinoma, squamous cell carcinoma, large cell carcinoma, small cell lung cancer, and normal tissue. Explainable AI techniques including Grad-CAM, Grad-CAM++, Integrated Gradients, Occlusion, Saliency Maps, and SmoothGrad are applied to provide visual interpretability. Experimental results show strong performance with accuracy up to 0.87, AUROC above 0.97, and macro F1-score of 0.88. Grad-CAM++ achieved the high
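Weighted decision-level fusion of the two modality heads reduces to a convex combination of per-class probabilities. A minimal sketch follows; the weights and probability vectors are illustrative, not the paper's tuned values.

```python
import numpy as np

def fuse_predictions(p_ct, p_hist, w_ct=0.4, w_hist=0.6):
    """Weighted decision-level fusion (sketch).
    p_ct, p_hist: per-class probability vectors from the CT and
    histopathology branches; weights here are illustrative."""
    fused = w_ct * np.asarray(p_ct) + w_hist * np.asarray(p_hist)
    return fused / fused.sum()

classes = ["adenocarcinoma", "squamous", "large_cell", "small_cell", "normal"]
p_ct   = [0.50, 0.30, 0.10, 0.05, 0.05]   # CT branch leans adenocarcinoma
p_hist = [0.20, 0.60, 0.10, 0.05, 0.05]   # histology branch leans squamous
fused = fuse_predictions(p_ct, p_hist)
print(classes[int(np.argmax(fused))])  # squamous
```

Decision-level fusion keeps the two pipelines independent, so either modality can be retrained or dropped (e.g. when histopathology is unavailable) without touching the other branch.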

Impact 4.0 · Import 5.6 · Pop 4.7
#35
arXiv Robotics-Embodied 2026-04-13
by Dujun Nie, Fengjiao Chen, Qi Lv, Jun Kuang et al.

While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representation to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) General visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models. (ii) Latent-based visual space is fundamentally better aligned to physical action space than pixel-

Impact 4.0 · Import 6.2 · Pop 3.5
#37
Simon Willison's Weblog 2026-04-15

Zig 0.16.0 release notes: "Juicy Main"

Zig has really good release notes: comprehensive, detailed, and with relevant usage examples for each of the new features. Of particular note in the newly released Zig 0.16.0 is what they are calling "Juicy Main", a dependency injection feature for your program's main() function where accepting a process.Init parameter grants access to a struct of useful properties:

    const std = @import("std");

    pub fn main(init: std.process.Init) !void {
        // general purpose allocator for temporary heap allocations:
        const gpa = init.gpa;
        // default Io implementation:
        const io = init.io;
        // access to environment variables:
        std.log.info("{d} env vars", .{init.environ_map.count()});
        // access to CLI arguments:
        const args = try init.minimal.args.toSlice(init.arena.allocator());
    }

Via Lobste.rs. Tags: zig

Impact 5.2 · Import 4.0 · Pop 4.8
#39
HN AI (100+) 2026-04-15

Article URL: https://heidenstedt.org/posts/2026/ai-assisted-cognition-endangers-human-development/ Comments URL: https://news.ycombinator.com/item?id=47783024 Points: 230 # Comments: 189

Impact 4.0 · Import 4.0 · Pop 6.0
#40
HN AI (100+) 2026-04-15

Libretto ( https://libretto.sh ) is a Skill+CLI that makes it easy for your coding agent to generate deterministic browser automations and debug existing ones. Key shift is going from “give an agent a prompt at runtime and hope it figures things out” to: “Use coding agents to generate real scripts you can inspect, run, and debug”. Here’s a demo: https://www.youtube.com/watch?v=0cDpIntmHAM . Docs start at https://libretto.sh/docs/get-started/introduction . We spent a year building and maintaining browser automations for EHR and payer portal integrations at our healthcare startup. Building these automations and debugging failed ones was incredibly time-consuming. There’s lots of tools that use runtime AI like Browseruse and Stagehand which we tried, but (1) they’re reliant on custom DOM parsing that's unreliable on older and complicated websites (including all of healthcare). Using a website’s internal network calls is faster and more reliable when possible. (2) They can be expensive since they rely on lots of AI calls and for workflows with complicated logic you can’t always rely on caching actions to make sure it will work. (3) They’re at runtime so it’s not interpretable what the agent is going to do. You kind of hope you prompted it correctly to do the right thing, but legacy workflows are often unintuitive and inconsistent across sites so you can’t trust an agent to just figure it out at runtime. (4) They don’t really help you generate new automations or help you debug automation failures. We wanted a way to reliably generate and maintain browser automations in messy, high-stakes environments, without relying on fragile runtime agents. Libretto is different because instead of runtime agents it uses “development-time AI”: scripts are generated ahead of time as actual

Impact 4.0 · Import 4.0 · Pop 6.0
#41
HN AI (100+) 2026-04-15

Article URL: https://claudestatus.com/ Comments URL: https://news.ycombinator.com/item?id=47779730 Points: 243 # Comments: 224

Impact 4.0 · Import 4.0 · Pop 6.0
#42
HN AI (100+) 2026-04-15

Article URL: https://www.tobyord.com/writing/hourly-costs-for-ai-agents Comments URL: https://news.ycombinator.com/item?id=47778922 Points: 304 # Comments: 129

Impact 4.0 · Import 4.0 · Pop 6.0
#46
HN AI (100+) 2026-04-15

Article URL: https://super-memory.com/articles/sleep.htm Comments URL: https://news.ycombinator.com/item?id=47776557 Points: 440 # Comments: 220

Impact 4.0 · Import 4.0 · Pop 6.0
#48

(1D) Ordered Tokens Enable Efficient Test-Time Search

Research ★ 4.8 multi-source (2)
arXiv GenMedia, HF Daily Papers 2026-04-16
by Zhitong Gao, Parham Rezaei, Ali Cy, Mingqiao Ye et al.

Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can per
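The kind of verifier-guided test-time search studied here can be sketched as beam search over prefixes, where the verifier scores partial sequences and prunes weak candidates early rather than only ranking finished generations. The generator and verifier below are toy stand-ins, not the paper's setup.

```python
import random

def prefix_search(sample_token, verifier, seq_len, beam=4, branch=3, seed=0):
    """Verifier-guided beam search over ordered token prefixes (toy).
    The premise from the text: coarse-to-fine prefixes carry semantic
    meaning, so a verifier can usefully score them mid-generation."""
    rng = random.Random(seed)
    beams = [[]]
    for _ in range(seq_len):
        # Branch each surviving prefix, keep the top `beam` by score.
        candidates = [b + [sample_token(rng)] for b in beams for _ in range(branch)]
        candidates.sort(key=verifier, reverse=True)
        beams = candidates[:beam]
    return beams[0]

# Toy stand-ins: tokens are digits, the "verifier" prefers high sums.
best = prefix_search(lambda r: r.randint(0, 9), sum, seq_len=5)
print(len(best))  # 5
```

With 2D grid tokens the analogous intermediate states (partial raster scans) are much harder for a verifier to judge, which is the paper's argument for why ordered 1D tokenizers scale better under search.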

Impact 4.0 · Import 4.0 · Pop 5.9
#49
NVIDIA AI Blog 2026-04-16

Head straight for orbit with GeForce NOW — no space helmet required. PRAGMATA, Capcom’s long-awaited sci-fi action adventure, touches down on GeForce NOW the same day it launches worldwide. The futuristic journey through a cold lunar station in the near future can be streamed instantly from the cloud to almost any device, no console or heavy hardware needed. That’s only the beginning. Five new titles join the cloud this week, expanding April’s gaming galaxy with fresh adventures and endless possibilities. Plus, the GeForce NOW Ultimate membership comes to gamers in India for the first time, with the service now available in beta and operated by NVIDIA. Time to see what’s landing on GeForce NOW. A Mission Gone Wrong: PRAGMATA is Capcom’s newest sci-fi action adventure that blends heart, high-tech and a hauntingly quiet world set in the near future. Step into the boots of Hugh Williams, an investigator navigating a lunar research station gone silent, and Diana, a young android. Armed with an arsenal of weapons and the ability to hack, every corridor and console becomes part of a cinematic experience filled with tense exploration and fast-paced action. The story unfolds amid the cold vacuum of the moon after a massive quake hits the station researching Lunafilament — a material said to be able to create anything given enough data. Awake, injured and disoriented, Hugh crosses paths with Diana, the mysterious android girl known as a Pragmata. Now, they must work together as they face the rogue station on their way back to Earth. PRAGMATA shines in stunning clarity with ray-traced lighting and NVIDIA DLSS 4 technology boosting frame rates and image quality. Stream it on launch day at full fidelity, even without the latest hardware — no need to wait on a large install or worry

Impact 5.2 · Import 4.0 · Pop 4.8
#50
AI Snake Oil 2026-04-16

This post is 8,000 words long—it is our new collaborative paper on an emerging type of AI evaluation. The paper is also published in a PDF format here. Summary: AI models have started to saturate most major benchmarks. But does that mean they can build and ship a real product, or conduct a scientific experiment end-to-end, or navigate a government bureaucracy? Researchers have started testing AI in such real-world settings. We call these evals “open-world evaluations”. This essay defines open-world evaluations, surveys the lessons learned so far, and lays out best practices for conducting them. We also introduce CRUX, a collaboration of 17 researchers from academia, government, civil society, and industry that will regularly evaluate frontier AI capabilities through open-world evaluations. In our first experiment, an AI agent built and published an iOS app to the App Store, making just two errors, one of which required manual intervention. This gives us an early indication of potentially useful capabilities and, more importantly, an early warning about the potential for AI-driven app store spam (we disclosed this result to Apple a month before publication). We hope to conduct similar experiments to surface early warnings across other real-world domains; this will be one of our main empirical projects over the coming year. The authors are: Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J.J. Allaire, Rishi Bommasani, Magda Dubois, Gillian Hadfield, Andy Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, Arvind Narayanan. How should we track and predict AI capabilities? The AI community’s dominant answer today is benchmarking. For example, METR’s time horizon

Impact 5.8 · Import 4.9 · Pop 3.5
#52
HN AI (100+) 2026-04-16

Article URL: https://tomtunguz.com/ai-compute-crisis-2026/ Comments URL: https://news.ycombinator.com/item?id=47799322 Points: 195 # Comments: 226

Impact 4.0 · Import 4.0 · Pop 6.0
#55
HN AI (100+) 2026-04-16

Recent and related: Cybersecurity looks like proof of work now - https://news.ycombinator.com/item?id=47769089 (198 comments)
Comments URL: https://news.ycombinator.com/item?id=47791236
Points: 238 · Comments: 88

Impact 4.0 · Import 4.0 · Pop 6.0
#56

SDL bans AI-written commits

Industry ★ 4.8
HN AI (100+) 2026-04-16

Article URL: https://github.com/libsdl-org/SDL/issues/15350
Comments URL: https://news.ycombinator.com/item?id=47790791
Points: 132 · Comments: 138

Impact 4.0 · Import 4.0 · Pop 6.0
#57
HN AI (100+) 2026-04-16

Article URL: https://sleepingrobots.com/dreams/stop-using-ollama/
Comments URL: https://news.ycombinator.com/item?id=47788385
Points: 641 · Comments: 208

Impact 4.0 · Import 4.0 · Pop 6.0
#58
arXiv cs.LG, arXiv RL, arXiv Evals 2026-04-17
by Shriram Chennakesavalu, Kirill Shmilovich, Hayley Weir, Colin Grambow et al.

Large Language Models (LLMs) have the potential to accelerate small-molecule drug design due to their ability to reason about information from diverse sources and formats. However, their practical utility remains unclear due to the lack of benchmarks that reflect real-world scenarios. In this work, we introduce a suite of chemically-grounded tasks spanning molecular property prediction, molecular representation transformations, and molecular design. Importantly, we formulate these tasks as reinforcement learning (RL) environments, enabling a unified approach for evaluation and post-training. Across three model families, we find that frontier models are increasingly proficient at chemical tasks, but that there is significant room for improvement, especially in experimental settings with low data. Critically, we show that RL-based post-training can substantially improve performance. A smaller model post-trained on our environments becomes competitive with state-of-the-art frontier models, despite a significantly weaker base model. This suggests a practical route toward employing LLMs in drug discovery: by combining carefully-designed evaluation tasks with targeted post-training …
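The "tasks as RL environments" framing can be sketched as a toy single-turn environment with a gym-style reset/step interface. Everything below (the class name, the inverse-distance reward, the example prompt) is a hypothetical illustration, not the paper's actual interface:

```python
class PropertyPredictionEnv:
    """Hypothetical sketch: wrap a chemistry QA task as a one-step RL
    environment. The observation is a prompt, the action is the model's
    textual answer, and the reward scores it against ground truth, so
    evaluation and RL post-training share one interface."""

    def __init__(self, examples):
        self.examples = list(examples)  # (prompt, numeric target) pairs
        self.i = -1

    def reset(self):
        # advance to the next task and return its prompt as the observation
        self.i = (self.i + 1) % len(self.examples)
        prompt, _ = self.examples[self.i]
        return prompt

    def step(self, answer):
        _, target = self.examples[self.i]
        try:
            # reward in (0, 1]: 1.0 for an exact numeric match
            reward = 1.0 / (1.0 + abs(float(answer) - target))
        except ValueError:
            reward = 0.0  # unparseable answer earns nothing
        return reward, True  # (reward, episode_done): single-turn episodes
```

A single-turn episode like this is the simplest case; multi-step molecular design would return intermediate observations from `step` instead of terminating immediately.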

Impact 4.0 · Import 4.0 · Pop 5.9
#59
arXiv cs.LG, arXiv cs.CV, arXiv Evals 2026-04-17
by Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib

Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses, and cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and their signal-to-noise ratios. In such cases, simply adjusting the model's attention does not resolve the underlying lack of information. In this paper, we propose MoIR: Multi-modal Information Router, an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR identifies less informative tokens and routes complementary information from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, MoIR enables reliable shifts in modality dominance …
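The routing idea can be sketched in miniature. This is a toy stand-in, not the paper's method: we use token L2 norm as a crude informativeness proxy and a 50/50 blend as the routing step, both of which are our assumptions:

```python
def route_information(weak_tokens, strong_tokens, threshold):
    """Toy sketch of information-level routing: wherever a token from the
    weaker modality carries little signal (small L2 norm, a stand-in for
    the paper's informativeness measure), blend in the aligned token from
    the stronger modality; informative tokens pass through unchanged."""
    def norm(v):
        return sum(x * x for x in v) ** 0.5

    fused = []
    for w, s in zip(weak_tokens, strong_tokens):
        if norm(w) < threshold:
            # route: enrich the uninformative token (50/50 blend is arbitrary)
            fused.append([(wi + si) / 2.0 for wi, si in zip(w, s)])
        else:
            fused.append(list(w))
    return fused
```

The key contrast with attention steering is visible even here: the token representations themselves change before fusion, rather than only the weights placed on them.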

Impact 4.0 · Import 4.0 · Pop 5.9
#60
arXiv cs.LG, arXiv cs.AI, arXiv PostTraining 2026-04-17
by André Saimon S. Sousa, Otto Pires, Frank Acasiete, Oscar M. Granados et al.

Data plays a fundamental role in consolidating markets, services, and products in the digital financial ecosystem. However, the use of real data, especially in the financial context, can lead to privacy risks and access restrictions, affecting institutions, research, and modeling processes. Although not all financial datasets present such limitations, this work proposes the use of deep learning techniques for generating synthetic data applied to cryptocurrency price time series. The approach is based on Conditional Generative Adversarial Networks (CGANs), combining an LSTM-type recurrent generator and an MLP discriminator to produce statistically consistent synthetic data. The experiments consider different crypto-assets and demonstrate that the model is capable of reproducing relevant temporal patterns, preserving market trends and dynamics. The generation of synthetic series through GANs is an efficient alternative for simulating financial data, showing potential for applications such as market behavior analysis and anomaly detection, with lower computational cost compared to more complex generative approaches.

Impact 4.0 · Import 4.0 · Pop 5.9
#61
arXiv cs.LG, arXiv cs.CL, arXiv cs.AI 2026-04-17
by Alexandra Dragomir, Ioana Pintilie, Antonio Barbalau, Marius Dragoi et al.

Adapter-based methods have become a cost-effective approach to continual learning (CL) for Large Language Models (LLMs), by sequentially learning a low-rank update matrix for each task. To mitigate catastrophic forgetting, state-of-the-art approaches impose constraints on new adapters with respect to the previous ones, by targeting either subspace or coordinate-wise interference. In this paper, we propose JumpLoRA, a novel framework to adaptively induce sparsity in the Low-Rank Adaptation (LoRA) blocks through the use of JumpReLU gating. The method achieves dynamic parameter isolation, which helps prevent task interference. We demonstrate that our method is highly modular and compatible with LoRA-based CL approaches. Specifically, it significantly boosts the performance of IncLoRA and outperforms the leading state-of-the-art CL method, ELLA.
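A JumpReLU gate on a LoRA path can be sketched in a few lines. Where exactly JumpLoRA places the gate is not specified in the summary above, so gating the rank-r activations (and the threshold/scale values) is our assumption:

```python
def jump_relu(x, theta):
    """JumpReLU: pass each activation through only where it exceeds the
    threshold theta; everything else is hard-zeroed, inducing sparsity."""
    return [xi if xi > theta else 0.0 for xi in x]

def lora_forward(x, W, A, B, theta, scale=1.0):
    """Sketch: y = x W^T + scale * B @ JumpReLU(A @ x), i.e. a frozen base
    layer plus a low-rank update whose rank components only fire when
    strongly activated. Shapes: x: d_in, W: d_out x d_in, A: r x d_in,
    B: d_out x r. Plain lists stand in for framework tensors."""
    def matvec(M, v):
        return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

    base = matvec(W, x)                 # frozen pre-trained path
    h = jump_relu(matvec(A, x), theta)  # gated rank-r activations
    delta = matvec(B, h)                # sparse low-rank update
    return [b + scale * d for b, d in zip(base, delta)]
```

With the gate closed (large theta) the adapter contributes nothing, which is how dynamic parameter isolation can suppress interference between tasks.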

Impact 4.0 · Import 4.0 · Pop 5.9
#62
arXiv cs.LG, arXiv cs.AI, arXiv cs.CV 2026-04-17
by Enas E. Ahmed, Salah A. Aly, Mayar Moner

Acute Myeloid Leukemia (AML) is one of the most life-threatening types of blood cancer, and its accurate classification remains a challenging task due to the visual similarity between various cell types. This study addresses multiclass classification of AML cells using the YOLOv12 deep learning model. We applied two segmentation approaches based on cell and nucleus features, using Hue-channel and Otsu-thresholding techniques to preprocess the images prior to classification. Our experiments demonstrate that YOLOv12 with Otsu thresholding on cell-based segmentation achieved the highest validation and test accuracy, both reaching 99.3%.
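The Otsu step in that preprocessing pipeline is a standard technique and easy to sketch. This is the generic textbook algorithm for 8-bit grayscale values, not the authors' code: it picks the gray level that maximizes between-class variance.

```python
def otsu_threshold(pixels):
    """Otsu's method: return the 8-bit gray level t that maximizes the
    between-class variance w0*w1*(mu0-mu1)^2 between pixels <= t and > t."""
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))

    w0, sum0 = 0, 0            # pixel count and intensity mass of class 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue           # class 0 still empty
        w1 = total - w0
        if w1 == 0:
            break              # class 1 empty: no split beyond here
        sum0 += t * hist[t]
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

On a bimodal cell image (dark nuclei, bright background) the returned threshold lands between the two intensity modes, separating foreground from background with no tuned parameters.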

Impact 4.0 · Import 4.0 · Pop 5.9
#63
arXiv cs.LG, arXiv Robotics-Embodied, arXiv GenMedia 2026-04-17
by Guransh Singh

Adapting pre-trained vision-language models (VLMs) for robotic control requires injecting high-magnitude continuous gradients from a flow-matching action expert into a backbone trained exclusively with cross-entropy. This cross-modal gradient asymmetry - the spectral dimensionality mismatch between low-rank MSE regression gradients and the high-dimensional semantic manifold sculpted by CE pre-training - causes rapid, severe erosion of the VLM's visual-question-answering (VQA) capability. Industry-standard defences either sever the gradient pathway entirely via stop-gradient, discarding the rich continuous supervision, or restrict parameter capacity through low-rank adapters (LoRA) that constrain the rank of updates but not their direction, and thus still overwrite the pre-trained manifold. We introduce AEGIS (Anchor-Enforced Gradient Isolation System): a buffer-free, layer-wise orthogonal gradient projection framework that enables direct continuous MSE learning while preserving the pre-trained VQA manifold - without any co-training data or replay buffer. AEGIS pre-computes a static Gaussian reference anchor from masked VQA forward passes across all transformer layers, then …
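The core operation named here, orthogonal gradient projection, is generic and can be sketched directly. How AEGIS builds its Gaussian anchor is not reproduced; the basis below is a hypothetical stand-in for the protected subspace:

```python
def project_out(grad, basis):
    """Sketch of orthogonal gradient projection: subtract from the gradient
    its components along an orthonormal anchor basis, g' = g - U^T (U g),
    so training cannot move the weights within the protected subspace.
    Plain Python lists stand in for the framework's tensors."""
    out = list(grad)
    for u in basis:  # assumes the rows of `basis` are orthonormal
        dot = sum(gi * ui for gi, ui in zip(out, u))
        out = [gi - dot * ui for gi, ui in zip(out, u)]
    return out
```

After projection the gradient is exactly orthogonal to every anchor direction, which is the mechanism that lets continuous MSE supervision flow while the pre-trained directions stay untouched.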

Impact 4.0 · Import 4.0 · Pop 5.9
#64
arXiv cs.LG, arXiv cs.CL, arXiv MechInterp 2026-04-17
by Fabian Ridder, Laurin Lessel, Malte Schilling

Retrieval-Augmented Generation (RAG) is widely used to augment the input to Large Language Models (LLMs) with external information, such as recent or domain-specific knowledge. Nonetheless, current models still produce closed-domain hallucinations and generate content that is unsupported by the retrieved context. Current detection approaches typically treat hallucination as a post-hoc problem, relying on black-box consistency checks or probes over frozen internal representations. In this work, we demonstrate that hallucination detection based on internal state representation can also serve as a direct training signal. We introduce RAGognize, a dataset of naturally occurring closed-domain hallucinations with token-level annotations, and RAGognizer, a hallucination-aware fine-tuning approach that integrates a lightweight detection head into an LLM, allowing for the joint optimization of language modeling and hallucination detection. This joint objective forces the model to improve the separability of its internal states regarding hallucinations while simultaneously learning to generate well-formed and meaningful responses. Across multiple benchmarks, RAGognizer achieves state-of-the-art …

Impact 4.0 · Import 4.0 · Pop 5.9
#65
arXiv cs.LG, arXiv cs.CL, arXiv cs.AI 2026-04-17
by Siun Kim, Hyung-Jin Yoon

Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inconsistencies observed in early-stage human annotation processes that resolve disagreements through pilot annotation. Motivated by this analogy, we introduce DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework that simulates the pilot annotation process, employing LLMs to act as both annotators and supervisors. Multiple heterogeneous LLMs annotate shared texts, and a supervisor model analyzes inter-model disagreements to refine task instructions. Across 18 benchmarks, DiZiNER achieves zero-shot SOTA results on 14 datasets, improving prior bests by +8.0 F1 and reducing the zero-shot to supervised gap by over +11 points. It also consistently outperforms its supervisor, GPT-5 mini, indicating that improvements stem from disagreement-guided instruction refinement.

Impact 4.0 · Import 4.0 · Pop 5.9
#66
arXiv cs.CL, arXiv cs.CV, arXiv Evals 2026-04-17
by Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song et al.

Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats, guaranteeing identical task-relevant information, verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text-only inputs, whereas incorporating visual data (image+text) frequently degrades performance compared to the text-only baseline. These findings indicate that current VLMs conduct reasoning primarily in the text modality …

Impact 4.0 · Import 4.0 · Pop 5.9
#67

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

Agents & Tools ★ 4.8 multi-source (3)
arXiv cs.CL, arXiv cs.AI, arXiv Agents 2026-04-17
by Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing et al.

Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool-use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning …

Impact 4.0 · Import 4.0 · Pop 5.9
#68
arXiv cs.CL, arXiv cs.AI, arXiv Agents 2026-04-17
by Haoyu Bian, Chaoning Zhang, Jiaquan Zhang, Xingyao Li et al.

LLM-driven multi-agent frameworks address complex reasoning tasks through multi-role collaboration. However, existing approaches often suffer from reasoning instability, where individual agent errors are amplified through collaboration, undermining overall performance. Current research mainly focuses on enhancing high-capability agents or suppressing unreliable outputs to improve framework effectiveness, while systematic identification and reinforcement of performance-limiting agents receive less attention. To address this gap, we propose WORC, a weak-link optimization framework for multi-agent reasoning and collaboration, grounded in the weak-link principle. WORC follows a two-stage workflow. In the weak agent localization stage, task features are constructed, and a meta-learning-based weight predictor trained on optimal configurations identified by swarm intelligence algorithms (SIAs) enables zero-shot mapping from these features to agent performance weights, where the agent with the lowest predicted weight is identified as the weak agent. In the weak-link optimization stage, an uncertainty-driven allocation strategy assigns additional …

Impact 4.0 · Import 4.0 · Pop 5.9
#69
arXiv cs.CL, arXiv cs.AI, arXiv Agents 2026-04-17
by Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu et al.

As LLM agents scale to long-horizon, multi-session deployments, efficiently managing accumulated experience becomes a critical bottleneck. Agent memory systems and agent skill discovery both address this challenge - extracting reusable knowledge from interaction traces - yet a citation analysis of 1,136 references across 22 primary papers reveals a cross-community citation rate below 1%. We propose the Experience Compression Spectrum, a unifying framework that positions memory, skills, and rules as points along a single axis of increasing compression (5-20× for episodic memory, 50-500× for procedural skills, 1,000×+ for declarative rules), directly reducing context consumption, retrieval latency, and compute overhead. Mapping 20+ systems onto this spectrum reveals that every system operates at a fixed, predetermined compression level - none supports adaptive cross-level compression, a gap we term the "missing diagonal". We further show that specialization alone is insufficient - both communities independently solve shared sub-problems without exchanging solutions - that evaluation methods are tightly coupled to compression levels, that …
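The spectrum's bands can be turned into a trivial classifier using the ratios quoted above. The boundaries between the quoted ranges are our interpolation, not the paper's:

```python
def compression_level(raw_tokens, stored_tokens):
    """Place a system on the Experience Compression Spectrum by its ratio of
    raw interaction tokens to stored tokens, using the bands quoted in the
    abstract (5-20x episodic memory, 50-500x procedural skills, 1,000x+
    declarative rules). Gaps between bands are bridged by our own cutoffs."""
    ratio = raw_tokens / stored_tokens
    if ratio >= 1000:
        return "declarative rules"
    if ratio >= 50:
        return "procedural skills"
    if ratio >= 5:
        return "episodic memory"
    return "raw trace"  # essentially uncompressed experience
```

The "missing diagonal" claim is then that no surveyed system moves between these return values adaptively; each is pinned to one band by design.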

Impact 4.0 · Import 4.0 · Pop 5.9
#70
arXiv cs.AI, arXiv Agents, arXiv MechInterp 2026-04-17
by Vitor F. Grizzi, Thang Duc Pham, Luke N. Pretzie, Jiayi Xu et al.

Computational X-ray absorption near-edge structure (XANES) is widely used to probe local coordination environments, oxidation states, and electronic structure in chemically complex systems. However, the use of computational XANES at scale is constrained more by workflow complexity than by the underlying simulation method itself. To address this challenge, we present ChemGraph-XANES, an agentic framework for automated XANES simulation and analysis that unifies natural-language task specification, structure acquisition, FDMNES input generation, task-parallel execution, spectral normalization, and provenance-aware data curation. Built on ASE, FDMNES, Parsl, and a LangGraph/LangChain-based tool interface, the framework exposes XANES workflow operations as typed Python tools that can be orchestrated by large language model (LLM) agents. In multi-agent mode, a retrieval-augmented expert agent consults the FDMNES manual to ground parameter selection, while executor agents translate user requests into structured tool calls. We demonstrate documentation-grounded parameter retrieval and show that the same workflow supports both explicit structure-file inputs and chemistry-level natural-language …

Impact 4.0 · Import 4.0 · Pop 5.9
#71
arXiv cs.AI, arXiv cs.CV, arXiv Agents 2026-04-17
by Yi Lin, Yihao Ding, Yonghui Wu, Yifan Peng

Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.

Impact 4.0 · Import 4.0 · Pop 5.9
#72
arXiv MechInterp 2026-04-17
by Chaoshuo Zhang, Yibo Liang, Mengke Tian, Chenhao Lin et al.

Despite the remarkable synthesis capabilities of text-to-image (T2I) models, safeguarding them against content violations remains a persistent challenge. Existing safety alignments primarily focus on explicit malicious concepts, often overlooking the subtle yet critical risks of compositional semantics. To address this oversight, we identify and formalize a novel vulnerability: Multi-Concept Compositional Unsafety (MCCU), where unsafe semantics stem from the implicit associations of individually benign concepts. Based on this formulation, we introduce TwoHamsters, a comprehensive benchmark comprising 17.5k prompts curated to probe MCCU vulnerabilities. Through a rigorous evaluation of 10 state-of-the-art models and 16 defense mechanisms, our analysis yields 8 pivotal insights. In particular, we demonstrate that current T2I models and defense mechanisms face severe MCCU risks: on TwoHamsters, FLUX achieves an MCCU generation success rate of 99.52%, while LLaVA-Guard only attains a recall of 41.06%, highlighting a critical limitation of the current paradigm for managing hazardous compositional generation.

Impact 4.0 · Import 6.2 · Pop 3.5
#73

Elucidating the SNR-t Bias of Diffusion Probabilistic Models

Generative Media ★ 4.8 multi-source (2)
arXiv GenMedia, HF Daily Papers 2026-04-17
by Meng Yu, Lei Sun, Jianhao Zeng, Xiangxiang Chu et al.

Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we have observed that these models often suffer from a Signal-to-Noise Ratio-timestep (SNR-t) bias. This bias refers to the misalignment between the SNR of the denoising sample and its corresponding timestep during the inference phase. Specifically, during training, the SNR of a sample is strictly coupled with its timestep. However, this correspondence is disrupted during inference, leading to error accumulation and impairing the generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon and propose a simple yet effective differential correction method to mitigate the SNR-t bias. Recognizing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during the reverse denoising process, we decompose samples into various frequency components and apply differential correction to each component individually. Extensive experiments show that our approach significantly improves the generation quality of various diffusion models (IDDPM, ADM, DDIM, A-DPM, EA…)
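The training-time SNR-timestep coupling the abstract refers to is easy to make concrete. The sketch below computes SNR(t) = ᾱ_t / (1 − ᾱ_t) under a generic DDPM-style linear beta schedule; the schedule choice is ours for illustration, and the paper's correction method is not reproduced:

```python
def snr_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for a linear beta schedule,
    where alpha_bar_t is the cumulative product of (1 - beta). At training
    time each timestep t is tied to exactly this SNR; the SNR-t bias
    described above is inference-time drift away from this curve."""
    snrs = []
    alpha_bar = 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        alpha_bar *= 1.0 - beta      # cumulative signal retention
        snrs.append(alpha_bar / (1.0 - alpha_bar))
    return snrs
```

The curve is strictly decreasing in t, so any sample whose actual SNR no longer matches `snrs[t]` during sampling is, in this paper's terms, biased.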

Impact 4.0 · Import 4.0 · Pop 5.9
#75
HN AI (100+) 2026-04-18

Article URL: https://sentinelcolorado.com/uncategorized/a-college-instructor-turns-to-typewriters-to-curb-ai-written-work-and-teach-life-lessons/
Comments URL: https://news.ycombinator.com/item?id=47818485
Points: 477 · Comments: 422

Impact 4.0 · Import 4.0 · Pop 6.0
#76
HN AI (100+) 2026-04-18

Article URL: https://spectrum.ieee.org/state-of-ai-index-2026
Comments URL: https://news.ycombinator.com/item?id=47817581
Points: 111 · Comments: 61

Impact 4.0 · Import 4.0 · Pop 6.0
#77

Qwen3.5-Omni technical report (HF Daily Papers top paper)

Multimodal ★ 4.8 multi-source (2)
HF Daily Papers, arXiv cs.CL 2026-04-18

Streaming audio+video input and low-latency voice output on top of Qwen3 base. Technical report highlights a talker-thinker decoupling in the decoder. Featured on HF Daily Papers top board (19 upvotes); raised in community discussions about VRAM sizing and gains outside Alibaba's eval harness.

How it was discussed
  • HF Daily Papers: Trending in Daily Papers top board; primary interest is in voice latency and multimodal streaming.
  • Community: Discussion centers on whether OmniBench gains replicate outside Alibaba's evaluation harness, and consumer-VRAM sizing trade-offs.
Impact 4.0 · Import 4.0 · Pop 5.9
#79
arXiv SSM, arXiv GenMedia 2026-04-17
by Duy-Phuong Dao, Muhammad Taqiyuddin, Jahae Kim, Sang-Heon Lee et al.

Latent diffusion models have emerged as powerful generative models in medical imaging, enabling the synthesis of high-quality brain magnetic resonance imaging (MRI) scans. In particular, predicting the evolution of a patient's brain can aid in early intervention, prognosis, and treatment planning. In this study, we introduce CLIMB (Controllable Longitudinal brain Image generation via state-space-based latent diffusion model), an advanced framework for modeling temporal changes in brain structure. CLIMB is designed to model the structural evolution of the brain over time, utilizing a baseline MRI scan and its acquisition age as foundational inputs. Additionally, multiple conditional variables, including projected age, gender, disease status, genetic information, and brain structure volumes, are incorporated to enhance the temporal modeling of anatomical changes. Unlike existing LDM methods that rely on self-attention modules, which effectively capture contextual information from input images but are computationally expensive, our approach leverages a state-space model architecture that substantially reduces computational overhead while preserving high-quality image synthesis …

Impact 4.7 · Import 4.0 · Pop 4.7
#80
arXiv Robotics-Embodied 2026-04-14
by Zixing Chen, Yifeng Gao, Li Wang, Yunhan Zhao et al.

Vision-Language-Action (VLA) models inherit rich world knowledge from vision-language backbones and acquire executable skills via action demonstrations. However, existing evaluations largely focus on action execution success, leaving action policies loosely coupled with visual-linguistic semantics. This decoupling exposes a systematic vulnerability whereby correct action execution may induce unsafe outcomes under semantic risk. To expose this vulnerability, we introduce HazardArena, a benchmark designed to evaluate semantic safety in VLAs under controlled yet risk-bearing contexts. HazardArena is constructed from safe/unsafe twin scenarios that share matched objects, layouts, and action requirements, differing only in the semantic context that determines whether an action is unsafe. We find that VLA models trained exclusively on safe scenarios often fail to behave safely when evaluated in their corresponding unsafe counterparts. HazardArena includes over 2,000 assets and 40 risk-sensitive tasks spanning 7 real-world risk categories grounded in established robotic safety standards. To mitigate this vulnerability, we propose a training-free Safety Option Layer that constrains action …

Impact 4.0 · Import 5.6 · Pop 3.5
#81
arXiv cs.LG 2026-04-17
by Paulin de Schoulepnikoff, Hendrik Poulsen Nautrup, Hans J. Briegel, Gorka Muñoz-Gil

Interpretable machine learning techniques are becoming essential tools for extracting physical insights from complex quantum data. We build on recent advances in variational autoencoders to demonstrate that such models can learn physically meaningful and interpretable representations from a broad class of unlabeled quantum datasets. From raw measurement data alone, the learned representation reveals rich information about the underlying structure of quantum phase spaces. We further augment the learning pipeline with symbolic methods, enabling the discovery of compact analytical descriptors that serve as order parameters for the distinct regimes emerging in the learned representations. We demonstrate the framework on experimental Rydberg-atom snapshots, classical shadows of the cluster Ising model, and hybrid discrete-continuous fermionic data, revealing previously unreported phenomena such as a corner-ordering pattern in the Rydberg arrays. These results establish a general framework for the automated and interpretable discovery of physical laws from diverse quantum datasets. All methods are available through qdisc, an open-source Python library designed to make these tools accessible.

Impact 4.0 · Import 5.6 · Pop 3.5
#82
arXiv cs.CL 2026-04-17
by Ayoub Hammal, Pierre Zweigenbaum, Caio Corro

Recent works proposed test-time alignment methods that rely on a small aligned model as a proxy that guides the generation of a larger base (unaligned) model. The implicit reward approach skews the large model distribution, whereas the nudging approach defers the generation of the next token to the small aligned model when the large base one is unconfident about its outcome. In this work, we first show that both approaches can be reduced to sampling from similar graphical models, where they differ only in the definition of a rejection criterion (or distribution). Moreover, we argue that the confidence criterion is ill-motivated due to linguistic phenomena like ambiguous phrasing. We propose a novel rejection criterion based on a conservative confidence bet. Experimentally, our novel approach outperforms previous work on several datasets.
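The nudging scheme the abstract describes can be sketched as a per-token deferral rule. The fixed threshold `tau` below is the baseline confidence criterion the paper argues against; its replacement, the conservative confidence bet, is not reproduced here:

```python
def nudged_choice(p_large, p_small, tau=0.5):
    """Nudging sketch: take the next token from the large base model unless
    its top probability falls below tau, in which case defer to the small
    aligned model. Inputs are token -> probability dicts; tau=0.5 is an
    arbitrary illustrative threshold."""
    token, conf = max(p_large.items(), key=lambda kv: kv[1])
    if conf >= tau:
        return token, "large"          # base model is confident enough
    token, _ = max(p_small.items(), key=lambda kv: kv[1])
    return token, "small"              # defer to the aligned proxy
```

The paper's critique is visible in this sketch: a legitimately ambiguous phrasing also produces a flat `p_large`, so low confidence alone is a noisy rejection signal.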

Impact 4.0 · Import 5.6 · Pop 3.5
#83

MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

Agents & Tools ★ 4.6 multi-source (2)
arXiv cs.CL, arXiv Agents 2026-04-17
by Weiwei Xie, Shaoxiong Guo, Fan Zhang, Tian Xia et al.

Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt …

Impact 4.0 · Import 4.6 · Pop 4.7
#84
arXiv cs.CL 2026-04-17
by Sidney Wong

This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on user-informed lexical, morphosyntactic, and semantic variables. The findings show that users generally associate language with place, and place-related communities form a contiguous speech community, though alignment between geographic dialect communities and place-related communities remains complex. Advanced language modelling, including static and diachronic Word2Vec language embeddings, revealed semantic variation across place-based communities and meaningful semantic shifts within New Zealand English. The research involved the creation of a corpus containing 4.26 billion unprocessed words, which offers a valuable resource for future study. Overall, the results highlight the potential of social media as a natural laboratory for sociolinguistic inquiry.

Impact 4.0 · Import 5.6 · Pop 3.5
#85
arXiv cs.AI, arXiv MechInterp 2026-04-17
by Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti, Vineeth N Balasubramanian et al.

Multimodal large language models (MLLMs) have achieved impressive progress on vision language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top performing MLLMs remain below 50%. Error analysis reveals failures in: (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) weak abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities, when compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.

Impact 4.0 · Import 4.6 · Pop 4.7
#86
arXiv cs.AI, arXiv MechInterp 2026-04-17
by Qiang Xu, Shengyuan Bai, Yu Wang, He Cao et al.

Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams. These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical task dimensions. Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception. These findings expose a fundamental deficit in …

Impact 4.0 · Import 4.6 · Pop 4.7
#87
arXiv cs.AI, arXiv Efficiency 2026-04-17
by Lifan Jiang, Tianrun Wu, Yuhang Pei, Chenyang Wang et al.

The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying multimodal large language models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments sh

Impact 4.0 · Import 4.6 · Pop 4.7
#88
arXiv cs.RO, arXiv Robotics-Embodied 2026-04-17
by Jasper Lu, Zhenhao Shen, Yuanfei Wang, Shugao Liu et al.

Learning robust robot policies in real-world environments requires diverse data augmentation, yet scaling real-world data collection is costly due to the need for acquiring physical assets and reconfiguring environments. Therefore, reconstructing real-world scenes in simulation has become a practical approach for efficient learning and evaluation. We present a generative framework that establishes a real-to-sim mapping from real-world panoramas to high-fidelity simulation scenes, and further synthesizes diverse cousin scenes via semantic and geometric editing. Combined with high-quality physics engines and realistic assets, the generated scenes support interactive manipulation tasks. Additionally, we incorporate multi-room stitching to construct consistent large-scale environments for long-horizon navigation across complex layouts. Experiments demonstrate a strong sim-to-real correlation validating our platform's fidelity, and show that extensively scaling up data generation leads to significantly better generalization to unseen scene and object variations, demonstrating the effectiveness of Digital Cousins for generalizable robot learning and evaluation.

Impact 4.0 · Import 4.6 · Pop 4.7
#90
OpenAI Research 2026-04-14

OpenAI expands its Trusted Access for Cyber program, introducing GPT-5.4-Cyber to vetted defenders and strengthening safeguards as AI cybersecurity capabilities advance.

Impact 4.0 · Import 4.0 · Pop 4.8
#91
Interconnects (Nathan Lambert) 2026-04-14

This post is a roundup of my recent efforts that did not warrant a standalone Interconnects post, why I’m spending time on them, and what they accomplished: (1) The ATOM Report: Measuring the Open Language Model Ecosystem; (2) the RLHF Book is done & ready for pre-order; (3) a post-training course I’m making; (4) recent technical research. 1. The ATOM Report: Measuring the Open Language Model Ecosystem https://arxiv.org/abs/2604.07190 To accompany The ATOM Project memo, arguably a manifesto, making the case for investment in open models in the U.S. – originally launched in August 2025 – we’ve released an updated technical report with our latest data, analysis, and storytelling within the open language model ecosystem. The ATOM Report is dense with the methods Florian and I use to keep track of the open ecosystem. It covers GPT-OSS’s rise, inference market share, the influence of China’s mid-tier players like Moonshot, Z.ai, & MiniMax, signs of the U.S.’s progress on open models, and much more. In particular, the paper details our updates to the Relative Adoption Metric (RAM), which we use to evaluate the adoption of recent models in a time-varying and size-normalized manner. Here’s a sampling of recent, primarily Chinese, models on the RAM score. The RAM score is designed so that a score >1 indicates a model is, at that point in time, on track to be a top 10 most downloaded model of its size category, ever. It reduces a messy landscape to one easily interpretable number! We used the data to also analyze the recent Gemma 4 release, which is showing incredible early adoption numbers. We’ll stay tuned on it! Subscribe to the (infrequent) ATOM Project Substack for more updates like this! 2. RLHF Book is done & rea

Impact 4.0 · Import 4.0 · Pop 4.8
#92
arXiv RL 2026-04-15
by Qi Zhang, Dawei Wang, Shaofeng Zou

Reinforcement learning (RL) has emerged as a powerful tool for aligning diffusion models with human preferences, typically by optimizing a single reward function under a KL regularization constraint. In practice, however, human preferences are inherently pluralistic, and aligned models must balance multiple downstream objectives, such as aesthetic quality and text-image consistency. Existing multi-objective approaches either rely on costly multi-objective RL fine-tuning or on fusing separately aligned models at denoising time, but they generally require access to reward values (or their gradients) and/or introduce approximation error in the resulting denoising objectives. In this paper, we revisit the problem of RL fine-tuning for diffusion models and address the intractability of identifying the optimal policy by introducing a step-level RL formulation. Building on this, we further propose Multi-objective Step-level Denoising-time Diffusion Alignment (MSDDA), a retraining-free framework for aligning diffusion models with multiple objectives, obtaining the optimal reverse denoising distribution in closed form, with mean and variance expressed directly in terms of single-objective b

Impact 4.0 · Import 5.6 · Pop 3.5
#93

The next evolution of the Agents SDK

Agents & Tools ★ 4.5
OpenAI Research 2026-04-15

OpenAI updates the Agents SDK with native sandbox execution and a model-native harness, helping developers build secure, long-running agents across files and tools.

Impact 4.0 · Import 4.0 · Pop 4.8
#95

My bets on open models, mid-2026

Frontier LLMs ★ 4.5
Interconnects (Nathan Lambert) 2026-04-15

We’re living through the period of time when we’ll learn if open models can keep up with closed labs. The obvious answer is that no, they won’t. This answer is a form of saying they won’t keep up in every area. This framing closes off a popular prediction where the open models completely catch up, as in all models saturate and open and closed models only become increasingly similar. In living through this, it’s evidently very unclear when the longer-term stable balance of capabilities will solidify. This is a very complex dynamic, where the core point we monitor is a capability gap between models. At the same time, this gap is intertwined with evolving dynamics in the funding of open models, who builds open models, how techniques like distillation that enable fast-following translate through new application domains, potential regulation hampering the open-source AI ecosystem, and of course who actually uses open models. The capabilities gap is one signal in a complex sea of forces, pushing supply and demand into different shapes. In many cases the demand — where obviously tons of individuals, organizations, and sovereigns want, or need, open models — is largely separated from supply. Supply is fully dictated by economics. The question of “which business strategies support releasing open models” is still at stake. With this complexity, I wanted to distill my key beliefs down into a clear list. These are downstream of 10+ pieces I’ve written or recorded on open models this spring (which are linked throughout). It’s surprising that the top closed models did not show a growing capab

Impact 4.0 · Import 4.0 · Pop 4.8
#96
War on the Rocks 2026-04-15

On April 13, the U.S. military began a “blockade of all maritime traffic entering and exiting Iranian ports.” The move came after U.S. President Donald Trump announced on April 12 that the United States would begin a blockade after American and Iranian negotiators were unable to reach an agreement during a meeting in Islamabad. The announcement of a blockade immediately raised questions about how a blockade would work, how it fits into a broader U.S. strategy, its impacts on the global economy, and more. We asked five experts to assess the practicalities of enforcing a blockade, its legality, Iran’s likely response, The post Bonus In Brief: Choke Point: The Risks and Realities of America’s Iran Blockade appeared first on War on the Rocks.

Impact 4.0 · Import 5.8 · Pop 3.5
#97
arXiv stat.ML 2026-04-16
by Víctor Soto-Larrosa, Nuria Torrado, Edmundo J. Huertas

We study post-training interpretability for Support Vector Machines (SVMs) built from truncated orthogonal polynomial kernels. Since the associated reproducing kernel Hilbert space is finite-dimensional and admits an explicit tensor-product orthonormal basis, the fitted decision function can be expanded exactly in intrinsic RKHS coordinates. This leads to Orthogonal Representation Contribution Analysis (ORCA), a diagnostic framework based on normalized Orthogonal Kernel Contribution (OKC) indices. These indices quantify how the squared RKHS norm of the classifier is distributed across interaction orders, total polynomial degrees, marginal coordinate effects, and pairwise contributions. The methodology is fully post-training and requires neither surrogate models nor retraining. We illustrate its diagnostic value on a synthetic double-spiral problem and on a real five-dimensional echocardiogram dataset. The results show that the proposed indices reveal structural aspects of model complexity that are not captured by predictive accuracy alone.

Impact 4.0 · Import 5.6 · Pop 3.5
#98
arXiv stat.ML 2026-04-16
by Shahar Cohen, David M. Steinberg, Yael Radzyner, Yochai Ben Horin

We study a classification problem with three key challenges: pervasive informative missingness, the integration of partial prior expert knowledge into the learning process, and the need for interpretable decision rules. We propose a framework that encodes prior knowledge through an expert-guided class-conditional model for one or more classes, and use this model to construct a small set of interpretable goodness-of-fit features. The features quantify how well the observed data agree with the expert model, isolating the contributions of different aspects of the data, including both observed and missing components. These features are combined with a few transparent auxiliary summaries in a simple discriminative classifier, resulting in a decision rule that is easy to inspect and justify. We develop and apply the framework in the context of seismic monitoring used to assess compliance with the Comprehensive Nuclear-Test-Ban Treaty. We show that the method has strong potential as a transparent screening tool, reducing workload for expert analysts. A simulation designed to isolate the contribution of the proposed framework shows that this interpretable expert-guided method can even outp

Impact 4.0 · Import 5.6 · Pop 3.5
#99
arXiv RL 2026-04-16
by Steven A. Senczyszyn, Timothy C. Havens, Nathaniel Rice, Jason E. Summers et al.

As reinforcement learning (RL) deployments expand into safety-critical domains, existing evaluation methods fail to systematically identify hazards arising from the black-box nature of neural network enabled policies and distributional shift between training and deployment. This paper introduces Reinforcement Learning System-Theoretic Process Analysis (RL-STPA), a framework that adapts conventional STPA's systematic hazard analysis to address RL's unique challenges through three key contributions: hierarchical subtask decomposition using both temporal phase analysis and domain expertise to capture emergent behaviors, coverage-guided perturbation testing that explores the sensitivity of state-action spaces, and iterative checkpoints that feed identified hazards back into training through reward shaping and curriculum design. We demonstrate RL-STPA in the safety-critical test case of autonomous drone navigation and landing, revealing potential loss scenarios that can be missed by standard RL evaluations. The proposed framework provides practitioners with a toolkit for systematic hazard analysis, quantitative metrics for safety coverage assessment, and actionable guidelines for establ

Impact 4.0 · Import 5.6 · Pop 3.5
#100

Codex for (almost) everything

Frontier LLMs ★ 4.5
OpenAI Research 2026-04-16

The updated Codex app for macOS and Windows adds computer use, in-app browsing, image generation, memory, and plugins to accelerate developer workflows.

Impact 4.0 · Import 4.0 · Pop 4.8
#101
OpenAI Research 2026-04-16

OpenAI introduces GPT-Rosalind, a frontier reasoning model built to accelerate drug discovery, genomics analysis, protein reasoning, and scientific research workflows.

Impact 4.0 · Import 4.0 · Pop 4.8
#102
OpenAI Research 2026-04-16

Leading security firms and enterprises join OpenAI’s Trusted Access for Cyber, using GPT-5.4-Cyber and $10M in API grants to strengthen global cyber defense.

Impact 4.0 · Import 4.0 · Pop 4.8
#103
arXiv cs.AI 2026-04-17
by Thomas Bayer, Alexander Lohr, Sarah Weiß, Bernd Michelberger et al.

Explaining Machine Learning (ML) results in a transparent and user-friendly manner remains a challenging task of Explainable Artificial Intelligence (XAI). In this paper, we present a method to enhance the interpretability of ML models by using a Knowledge Graph (KG). We store domain-specific data along with ML results and their corresponding explanations, establishing a structured connection between domain knowledge and ML insights. To make these insights accessible to users, we designed a selective retrieval method in which relevant triplets are extracted from the KG and processed by a Large Language Model (LLM) to generate user-friendly explanations of ML results. We evaluated our method in a manufacturing environment using the XAI Question Bank. Beyond standard questions, we introduce more complex, tailored questions that highlight the strengths of our approach. We evaluated 33 questions, analyzing responses using quantitative metrics such as accuracy and consistency, as well as qualitative ones such as clarity and usefulness. Our contribution is both theoretical and practical: from a theoretical perspective, we present a novel approach for effectively enabling LLMs to dynamica

Impact 4.0 · Import 5.6 · Pop 3.5
#104
arXiv PostTraining 2026-04-17
by Pufan Zeng, Yilun Liu, Mingchen Dai, Mengyao Piao et al.

Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating cultural specificity as an abstract concept rather than a measurable signal. In this paper, we address this "quantification gap" by proposing C-Mining, an unsupervised framework that transforms the discovery of cultural seeds from a subjective selection process into a computable data mining formulation. Our approach exploits a novel geometric insight, leveraging the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces as a quantifiable discovery signal. By systematically identifying these regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without reliance on human or LLM supervision, reducing preparation costs by more than 150-fold. We further

Impact 4.0 · Import 5.6 · Pop 3.5
#106
FedScoop 2026-04-17

The Energy Department might be all in on AI, but the increasing apprehension among the American public poses a challenge, according to Secretary Chris Wright. “The country as a whole is going very negative on AI, and this is a risk,” Wright told lawmakers during a budget hearing Thursday. “It will be a loss to America if we stop this development and this investment and this improvement.” The DOE head characterized opposition to AI as a risk that’s “very real” and “growing, particularly across rural America.” During the hearing, members of Congress offered a few explanations. Large plots of land are going to massive data-center buildouts. The spike in data centers is then leading to increasing energy costs as resources are diverted to powering and cooling the facilities. “There’s more political signs against AI in our region than for candidates in the upcoming races,” Rep. Marcy Kaptur, D-Ohio, told Wright. “I don’t know what backlash there’s going to be, but I’m telling you it’s coming.” Lawmakers and other coalitions have pushed back on the rapid expansion of AI-ready infrastructure. Senate Democrats sent a letter in November to Commerce Secretary Howard Lutnick and Office of Science and Technology Policy Director Michael Kratsios, calling attention to “soaring electricity bills” and “increasing burdens on water supplies” due to the fast-tracking of data-center buildouts. In December, more than 230 environmental groups urged Congress to put a moratorium on data-center construction. Rising resistance is having an impact. Around $18 billion in data-center projects were blocked over the past two years, according to Data Center Watch, a nonpartisan boutique research firm. The organization found another $46 billion in projects were delayed due

Impact 4.0 · Import 5.8 · Pop 3.5
#107
OpenAI Research 2026-04-20

Hyatt deploys ChatGPT Enterprise across its global workforce, using GPT-5.4 and Codex to improve productivity, operations, and guest experiences.

Impact 4.0 · Import 4.0 · Pop 4.8
#108
FedScoop 2026-04-20

As federal agencies accelerate cloud, artificial intelligence, and IT modernization under fiscal 2026 priorities, a growing mismatch is emerging between rapid cloud adoption and the ability to secure it effectively. That gap is increasingly extending into operational technology (OT) environments, where the consequences of security failure are significantly higher. Federal IT enterprises are now a highly distributed ecosystem spanning on-premises systems, multiple cloud platforms, remote users, and increasingly OT. Systems that were never originally designed to connect with each other — such as industrial controls, critical infrastructure, and mission-support environments — are now interacting with enterprise IT networks and cloud services for monitoring, analytics, and real-time decision-making. The origins of OT security Historically, OT security models were designed for isolated environments, with limited external connectivity and clearly defined boundaries. These security strategies prioritized perimeter defenses and limited connectivity, an unsustainable approach in today’s distributed, always-connected environments. Now, OT systems are increasingly integrated with IT and cloud environments to support mission needs — from predictive maintenance to centralized operations. As OT systems integrate with IT and cloud environments, they begin to inherit the complexity and risk of those environments, often without the same level of security maturity or oversight. At the same time, hybrid and multi-cloud environments have become the default operating model, providing IT staff with increased flexibility and scalability. They also introduce a broader and more dynamic set of potential entry points for adversaries. The cloud complexity gap is becoming a security gap Recent res

Impact 4.0 · Import 5.8 · Pop 3.5
#109

Steve Yegge

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-13

Steve Yegge: I was chatting with my buddy at Google, who's been a tech director there for about 20 years, about their AI adoption. Craziest convo I've had all year. The TL;DR is that Google engineering appears to have the same AI adoption footprint as John Deere, the tractor company. Most of the industry has the same internal adoption curve: 20% agentic power users, 20% outright refusers, 60% still using Cursor or equivalent chat tool. It turns out Google has this curve too. [...] There has been an industry-wide hiring freeze for 18+ months, during which time nobody has been moving jobs. So there are no clued-in people coming in from the outside to tell Google how far behind they are, how utterly mediocre they have become as an eng org. Addy Osmani: On behalf of @Google, this post doesn't match the state of agentic coding at our company. Over 40K SWEs use agentic coding weekly here. Googlers have access to our own versions of @antigravity, @geminicli, custom models, skills, CLIs and MCPs for our daily work. Orchestrators, agent loops, virtual SWE teams and many other systems are actively available to folks. [...] Demis Hassabis: Maybe tell your buddy to do some actual work and to stop spreading absolute nonsense. This post is completely false and just pure clickbait. Tags: addy-osmani, steve-yegge, google, generative-ai, agentic-engineering, ai, llms

Impact 4.0 · Import 4.0 · Pop 4.8
#110

Exploring the new `servo` crate

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-13

Research: Exploring the new `servo` crate In Servo is now available on crates.io the Servo team announced the initial release of the servo crate, which packages their browser engine as an embeddable library. I set Claude Code for web the task of figuring out what it can do, building a CLI tool for taking screenshots using it and working out if it could be compiled to WebAssembly. The servo-shot Rust tool it built works pretty well:

git clone https://github.com/simonw/research
cd research/servo-crate-exploration/servo-shot
cargo build
./target/debug/servo-shot https://news.ycombinator.com/

Here's the result: Compiling Servo itself to WebAssembly is not feasible due to its heavy use of threads and dependencies like SpiderMonkey, but Claude did build me this playground page for trying out a WebAssembly build of the html5ever and markup5ever_rcdom crates, providing a tool for turning fragments of HTML into a parse tree. Tags: research, browsers, rust, webassembly, claude-code, servo

Impact 4.0 · Import 4.0 · Pop 4.8
#111
Latent Space 2026-04-14

As you know we read through /r/localLlama (which has its own monthly top models thread), /r/localLLM, and other local model subreddits on an almost daily basis, and every now and then it is good to step back and survey what the community consensus is landing on, with a sampling of models across different sizes. We started this work to power our local Claw. The top names you should know as a baseline, adjusted for “what people are actually recommending” rather than just benchmark supremacy:

- Qwen 3.5 — most broadly recommended family right now across usecases.
- Gemma 4 — strong recent buzz for local usability, especially smaller and mid-sized deployments.
- GLM-5 / GLM-4.7 — near the top of broad open-model rankings, increasingly part of the “best overall” conversation.
- MiniMax M2.5 / M2.7 — repeatedly cited for agentic/tool-heavy workloads.
- DeepSeek V3.2 — still firmly in the top cluster when people talk about strongest open-weight general models.
- GPT-oss 20B — not the mainstream “winner,” but increasingly recommended as a practical local option and for uncensored variants.

For local coding, the overwhelming consensus is Qwen3-Coder-Next. So that’s easy. Naturally the fuller list is going to have a strong lean on roleplay/creative writing, the #2 usecase of LLMs, and we are NSFW-friendly so here goes… Read more

Impact 4.0 · Import 4.0 · Pop 4.8
#112
Simon Willison's Weblog 2026-04-14

datasette PR #2689: Replace token-based CSRF with Sec-Fetch-Site header protection Datasette has long protected against CSRF attacks using CSRF tokens, implemented using my asgi-csrf Python library. These are something of a pain to work with - you need to scatter forms in templates with <input type="hidden" name="csrftoken" value="{{ csrftoken() }}"> lines and then selectively disable CSRF protection for APIs that are intended to be called from outside the browser. I've been following Filippo Valsorda's research here with interest, described in this detailed essay from August 2025 and shipped as part of Go 1.25 that same month. I've now landed the same change in Datasette. Here's the PR description - Claude Code did much of the work (across 10 commits, closely guided by me and cross-reviewed by GPT-5.4) but I've decided to start writing these PR descriptions by hand, partly to make them more concise and also as an exercise in keeping myself honest. New CSRF protection middleware inspired by Go 1.25 and this research by Filippo Valsorda. This replaces the old CSRF token based protection. Removes all instances of <input type="hidden" name="csrftoken" value="{{ csrftoken() }}"> in the templates - they are no longer needed. Removes the def skip_csrf(datasette, scope): plugin hook defined in datasette/hookspecs.py and its documentation and tests. Updated CSRF protection documentation to describe the new approach. Upgrade guide now describes the CSRF change. Tags: csrf, security, datasette, ai-assisted-programming
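The header-based approach described here can be sketched in a few lines. This is a hedged illustration of the Sec-Fetch-Site check, modeled on the Go 1.25 behavior the post cites; the function name and structure are hypothetical, not Datasette's actual code:

```python
# Sketch of Sec-Fetch-Site based CSRF protection (assumption: modeled on
# the Go 1.25 approach referenced in the post, not Datasette's real code).

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

def is_request_allowed(method: str, headers: dict) -> bool:
    """Decide whether a request passes the CSRF check."""
    if method.upper() in SAFE_METHODS:
        return True  # safe methods are not supposed to mutate state
    fetch_site = headers.get("sec-fetch-site")
    if fetch_site is None:
        # No header: a non-browser client (curl, scripts) or a very old
        # browser. CSRF is a browser-mediated attack, so let it through.
        return True
    if fetch_site in ("same-origin", "none"):
        # "none" covers user-initiated navigation (address bar, bookmark).
        return True
    # "cross-site" (and "same-site", i.e. sibling subdomains) is rejected.
    return False
```

Compared with token-based protection, nothing needs to be rendered into templates and no API endpoints need exemptions; the tradeoff is relying on browsers to send Fetch Metadata headers, which all evergreen browsers now do.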

Impact 4.0 · Import 4.0 · Pop 4.8
#113
Simon Willison's Weblog 2026-04-14

Trusted access for the next era of cyber defense OpenAI's answer to Claude Mythos appears to be a new model called GPT-5.4-Cyber: In preparation for increasingly more capable models from OpenAI over the next few months, we are fine-tuning our models specifically to enable defensive cybersecurity use cases, starting today with a variant of GPT‑5.4 trained to be cyber-permissive: GPT‑5.4‑Cyber. They're also extending a program they launched in February (which I had missed) called Trusted Access for Cyber, where users can verify their identity (via a photo of a government-issued ID processed by Persona) to gain "reduced friction" access to OpenAI's models for cybersecurity work. Honestly, this OpenAI announcement is difficult to follow. Unsurprisingly they don't mention Anthropic at all, but much of the piece emphasizes their many years of existing cybersecurity work and their goal to "democratize access" to these tools, hence the emphasis on that self-service verification flow from February. If you want access to their best security tools you still need to go through an extra Google Form application process though, which doesn't feel particularly different to me from Anthropic's Project Glasswing. Via Hacker News. Tags: security, ai, openai, generative-ai, llms, anthropic, ai-security-research

Impact 4.0 · Import 4.0 · Pop 4.8
#114
Simon Willison's Weblog 2026-04-14

Cybersecurity Looks Like Proof of Work Now The UK's AI Safety Institute recently published Our evaluation of Claude Mythos Preview’s cyber capabilities, their own independent analysis of Claude Mythos which backs up Anthropic's claims that it is exceptionally effective at identifying security vulnerabilities. Drew Breunig notes that AISI's report shows that the more tokens (and hence money) they spent the better the result they got, which leads to a strong economic incentive to spend as much as possible on security reviews: If Mythos continues to find exploits so long as you keep throwing money at it, security is reduced to a brutally simple equation: to harden a system you need to spend more tokens discovering exploits than attackers will spend exploiting them. An interesting result of this is that open source libraries become more valuable, since the tokens spent securing them can be shared across all of their users. This directly counters the idea that the low cost of vibe-coding up a replacement for an open source library makes those open source projects less attractive. Tags: open-source, ai, generative-ai, llms, drew-breunig, vibe-coding, ai-security-research

Impact 4.0 · Import 4.0 · Pop 4.8
#115
War on the Rocks 2026-04-14

The U.S.-Israeli war with Iran, now in an unstable ceasefire, has exposed a structural failure in the global semiconductor memory supply chain, and it is not the one analysts seem to be tracking. The story receiving attention is helium: Qatar’s Ras Laffan facility went offline, a 45-day inventory clock started running, and spot prices doubled within days. The story receiving almost no attention is bromine, and it is potentially the more dangerous one. Bromine is the raw material from which specialized chemical suppliers produce semiconductor-grade hydrogen bromide gas, the etch chemical that South Korean fabs use to carve the transistor The post The Bromine Chokepoint: How Strife in the Middle East Could Halt Production of the World’s Memory Chips appeared first on War on the Rocks.

Impact 4.0 · Import 5.5 · Pop 3.5
#116
NVIDIA AI Blog 2026-04-15

Traditional data centers only stored, retrieved and processed data. In the generative and agentic AI era, these facilities have evolved into AI token factories. With AI inference becoming their primary workload, their primary output is intelligence manufactured in the form of tokens. This transformation demands a corresponding shift in how the economics of AI infrastructure, including total cost of ownership (TCO), is assessed. Enterprises evaluating AI infrastructure still too often focus on peak chip specifications, compute cost or floating point operations per second for every dollar spent, aka FLOPS per dollar. The distinction that matters is this: Compute cost is what enterprises pay for AI infrastructure, whether rented from cloud providers or owned on premises. FLOPS per dollar is how much raw computing power an enterprise gets for every dollar spent, but raw compute and real-world token output are not the same thing. Cost per token is an enterprise’s all-in cost to produce each delivered token, usually represented as cost per million tokens. The first two are merely input metrics. Optimizing for inputs while the business runs on output is a fundamental mismatch. Cost per token determines whether enterprises can profitably scale AI. It’s the one TCO metric that directly accounts for hardware performance, software optimization, ecosystem support and real-world utilization — and NVIDIA delivers the lowest cost per token in the industry. What Are the Factors That Lower Token Cost? Understanding how to optimize token cost requires looking at the equation for calculating cost per million tokens. In this equation, many enterprises evaluating AI infrastructure focus on the numerator: the cost per GPU per hour. For cloud deployments, this is the hourly rate paid to a clo
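The cost-per-token equation the post describes reduces to simple arithmetic: the GPU-hour price sits in the numerator and delivered token throughput in the denominator. A minimal sketch, with every price and throughput figure an illustrative assumption rather than an NVIDIA number:

```python
# Back-of-envelope cost per million delivered tokens. All values below
# (hourly rate, throughput, utilization) are illustrative assumptions.

def cost_per_million_tokens(gpu_hourly_cost: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """All-in cost ($) to produce one million delivered tokens."""
    delivered_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_cost / delivered_per_hour * 1_000_000

# Same hardware price; a better software stack and higher utilization cut
# the output cost by ~4x even though the input metric (the GPU rate) is fixed.
baseline = cost_per_million_tokens(gpu_hourly_cost=4.0,
                                   tokens_per_second=1000,
                                   utilization=0.5)
optimized = cost_per_million_tokens(gpu_hourly_cost=4.0,
                                    tokens_per_second=2500,
                                    utilization=0.8)
print(f"baseline:  ${baseline:.2f} per 1M tokens")
print(f"optimized: ${optimized:.2f} per 1M tokens")
```

This is the post's point in miniature: the denominator moves with software optimization and real-world utilization, not just hardware, so optimizing input metrics like FLOPS per dollar while the business runs on token output is a mismatch.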

Impact 4.0 · Import 4.0 · Pop 4.8
#117
NVIDIA AI Blog 2026-04-15

The NAB Show 2026 trade show , running April 18-22 in Las Vegas, is set to showcase a wave of new features and optimizations for top video editing applications. Bringing together over 60,000 content professionals from across the broadcast and media and entertainment industries, the event highlights how video editors, livestreamers and professional creators are exploring new tools, accelerated by NVIDIA RTX technology, to enhance and streamline their creative workflows. At the show, Adobe is announcing a new Adobe Premiere Color Mode in beta. Designed to function as a dedicated grading environment nested directly within Premiere, it offers a clean, responsive interface that lets editors stay in their creative flow rather than relying on external tools for color correction. Tapping into GPU acceleration on NVIDIA GeForce RTX- and NVIDIA RTX PRO-equipped systems, this streamlined workflow, operating in 32-bit color depth for the first time, delivers significantly faster performance and quality. NVIDIA also launched a new update to NVIDIA Project G-Assist — an experimental AI assistant that helps tune, control and optimize GeForce RTX systems. Color Meets Compute Premiere’s Color Mode is a new clean, responsive interface within Adobe Premiere that enables editors to do color grading on native videos. Every element is designed to guide editors through the grading process without distractions. A large program monitor anchors the experience, providing immediate visual feedback as adjustments are made to enable faster decision-making and more precise control. A clip grid view allows editors to visualize progression across shots in a sequence. This makes it easier to maintain consistency across scenes and ensure a cohesive look throughout a project. Controls are organized into f

Impact 4.0 · Import 4.0 · Pop 4.8
#118

[AINews] Humanity's Last Gasp

Agents & Tools ★ 4.4
Latent Space 2026-04-15

One topic that has come up again and again across Latent Space and AI Engineer is how much harder everyone seems to be working: (friend of the show) Aaron Levie reports that “AI is not causing anyone to do less work right now,” and similarly Silicon Valley people feel their teams are the busiest they’ve ever been. Tyler Cowen argues from an economics standpoint that you should work much harder RIGHT NOW whether you believe AI will lower your value OR increase your value. Simon Last of Notion commented on today’s pod that he’s back to sleepless nights and 24/7 work for the first time since giving up on ML model training, but this time because of agent layer token anxiety. How can it both be true that agents are doing more work and yet everyone is working harder? How can it be true that Claude Mythos has been used internally for 2 months, and yet Claude keeps going down? How can it be true that Model and Agent Labs are more productive than ever and yet are acquihiring and acquiring more than ever? A simple thought exercise we’ve used before is the “Turkey problem”: based on real evidence and an abundance of historical data, turkeys should conclude that life is fantastic and all of humanity is set up to keep turkeys well fed, as far as they’ve ever experienced. Turkey doomsayers would be dismissed as alarmists and crackpots, and then ignored. Until Thanksgiving. Are engineers, or knowledge workers in general, turkeys in this scenario? Should our “elasticity” and value of work be increasingly positive, right up to some crossover point where we become horses? Now that SWE-Bench is saturated (with SWE-Bench Pro soon to be; Mythos is at 78%) and GDPval rates GPT 5.4 as better than/equal to human expert

Impact 4.0 · Import 4.0 · Pop 4.8
#119

datasette-export-database 0.3a1

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-15

Release: datasette-export-database 0.3a1 This plugin was using the ds_csrftoken cookie as part of a custom signed URL, which needed upgrading now that Datasette 1.0a27 no longer sets that cookie. Tags: datasette

Impact 4.0 · Import 4.0 · Pop 4.8
#120

datasette 1.0a27

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-15

Release: datasette 1.0a27 Two major changes in this new Datasette alpha. I covered the first of those in detail yesterday - Datasette no longer uses Django-style CSRF form tokens, instead using modern browser headers as described by Filippo Valsorda. The second big change is that Datasette now fires a new RenameTableEvent any time a table is renamed during a SQLite transaction. This is useful because some plugins (like datasette-comments) attach additional data to table records by name, so a renamed table requires them to react in appropriate ways. Here are the rest of the changes in the alpha:

- New actor= parameter for datasette.client methods, allowing internal requests to be made as a specific actor. This is particularly useful for writing automated tests. (#2688)
- New Database(is_temp_disk=True) option, used internally for the internal database. This helps resolve intermittent database locked errors caused by the internal database being in-memory as opposed to on-disk. (#2683) (#2684)
- The /<database>/<table>/-/upsert API (docs) now rejects rows with null primary key values. (#1936)
- Improved example in the API explorer for the /-/upsert endpoint (docs). (#1936)
- The /<database>.json endpoint now includes an "ok": true key, for consistency with other JSON API responses.
- call_with_supported_arguments() is now documented as a supported public API. (#2678)

Tags: annotated-release-notes, datasette, python

Impact 4.0 · Import 4.0 · Pop 4.8
#121

Quoting John Gruber

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-15

The real goldmine isn’t that Apple gets a cut of every App Store transaction. It’s that Apple’s platforms have the best apps, and users who are drawn to the best apps are thus drawn to the iPhone, Mac, and iPad. That edge is waning. Not because software on other platforms is getting better, but because third-party software on iPhone, Mac, and iPad is regressing to the mean, to some extent, because fewer developers feel motivated — artistically, financially, or both — to create well-crafted idiomatic native apps exclusively for Apple’s platforms. — John Gruber Tags: apple, john-gruber

Impact 4.0 · Import 4.0 · Pop 4.8
#122

Gemini 3.1 Flash TTS

Generative Media ★ 4.4
Simon Willison's Weblog 2026-04-15

Gemini 3.1 Flash TTS Google released Gemini 3.1 Flash TTS today, a new text-to-speech model that can be directed using prompts. It's presented via the standard Gemini API using gemini-3.1-flash-tts-preview as the model ID, but can only output audio files. The prompting guide is surprising, to say the least. Here's their example prompt to generate just a few short sentences of audio:

# AUDIO PROFILE: Jaz R.
## "The Morning Hype"
## THE SCENE: The London Studio

It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright. The red "ON AIR" tally light is blazing. Jaz is standing up, not sitting, bouncing on the balls of their heels to the rhythm of a thumping backing track. Their hands fly across the faders on a massive mixing desk. It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.

### DIRECTOR'S NOTES

Style:
* The "Vocal Smile": You must hear the grin in the audio. The soft palate is always raised to keep the tone bright, sunny, and explicitly inviting.
* Dynamics: High projection without shouting. Punchy consonants and elongated vowels on excitement words (e.g., "Beauuutiful morning").

Pace: Speaks at an energetic pace, keeping up with the fast music. Speaks with a "bouncing" cadence. High-speed delivery with fluid transitions — no dead air, no gaps.

Accent: Jaz is from Brixton, London

### SAMPLE CONTEXT

Jaz is the industry standard for Top 40 radio, high-octane event promos, or any script that requires a charismatic Estuary accent and 11/10 infectious energy.

#### TRANSCRIPT

[excitedly] Yes, massive vibes in the studio! You are locked in and it is absolutely popping off in London right now. If you're stuck on the tube, or just sat there pretending to work... stop it. Seriously, I see you

Impact 4.0 · Import 4.0 · Pop 4.8
#123

Gemini 3.1 Flash TTS

Generative Media ★ 4.4
Simon Willison's Weblog 2026-04-15

Tool: Gemini 3.1 Flash TTS See my notes on Google's new Gemini 3.1 Flash TTS text-to-speech model. Tags: gemini, google

Impact 4.0 · Import 4.0 · Pop 4.8
#124

Quoting Kyle Kingsbury

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-15

I think we will see some people employed (though perhaps not explicitly) as meat shields: people who are accountable for ML systems under their supervision. The accountability may be purely internal, as when Meta hires human beings to review the decisions of automated moderation systems. It may be external, as when lawyers are penalized for submitting LLM lies to the court. It may involve formalized responsibility, like a Data Protection Officer. It may be convenient for a company to have third-party subcontractors, like Buscaglia, who can be thrown under the bus when the system as a whole misbehaves. — Kyle Kingsbury, The Future of Everything is Lies, I Guess: New Jobs Tags: ai-ethics, careers, ai, kyle-kingsbury

Impact 4.0 · Import 4.0 · Pop 4.8
#125

datasette-ports 0.3

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-15

Release: datasette-ports 0.3 A small update for my tool for helping me figure out what all of the Datasette instances on my laptop are up to.

- Show working directory derived from each PID
- Show the full path to each database file

Output now looks like this:

http://127.0.0.1:8007/ - v1.0a26
  Directory: /Users/simon/dev/blog
  Databases:
    simonwillisonblog: /Users/simon/dev/blog/simonwillisonblog.db
  Plugins:
    datasette-llm
    datasette-secrets

http://127.0.0.1:8001/ - v1.0a26
  Directory: /Users/simon/dev/creatures
  Databases:
    creatures: /tmp/creatures.db

Tags: datasette

Impact 4.0 · Import 4.0 · Pop 4.8
#126
DefenseScoop 2026-04-15

AeroVironment on Wednesday unveiled Mayhem 10, the first in a new product line of unmanned aerial systems that the company plans to market to the Army and other potential customers. Best known for its Switchblade family of kamikaze drones or loitering munitions, AeroVironment describes its new platform as an “autonomous, multi-role launched effects system” with payload flexibility that allows it to perform precision strike, intelligence surveillance and reconnaissance (ISR), electronic warfare, or communications relay missions. The system has a payload capacity of 10 pounds, a cruise speed of 80 miles per hour, a dash speed of more than 120 miles per hour, 50 minutes of endurance and a range of 100 kilometers, according to a product fact sheet. The Group 2 UAS can be launched from air, ground and maritime platforms and was designed to be resistant to jamming, spoofing and anti-navigation methods, according to the company. Brian Young, AeroVironment’s senior vice president for loitering munitions, said AV partnered with Applied Intuition on efforts to give the system “collaborative attack” capabilities, also known as swarming. During a call with reporters ahead of the official product announcement, Young told DefenseScoop that the AI-enabled swarming tech has been tested in lab settings and will eventually be put through its paces with hardware for the military. “Certainly in the lab, all of those algorithms … have been actively tested with Applied Intuition. In terms of, have we gotten, you know, mass effects in the air all at the same time doing that? No, we’re at the beginning of this program right now. So certainly, have simulated all of that, and it represents what would happen in the real world. When you start getting hardware and you want to get 10 in the air, it does g

Impact 5.2 · Import 4.0 · Pop 3.5
#128
arXiv cs.RO, arXiv Robotics-Embodied 2026-04-16
by Shivendra Agrawal, Bradley Hayes

Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection. We demonstrate the versatility of this structured spatial knowledge through critical downstream Human-AI interaction tasks: (1) an intent-driven Semantic Search engine that actively infers categorical alternatives and zones when exact matches fail; (2) a one-shot Semantic Localizer achieving a 1.04 m

Impact 4.0 · Import 4.0 · Pop 4.7
#129
arXiv cs.NE, arXiv RL 2026-04-16
by Lute Lillo, Nick Cheney

Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on single-model preservation, committing to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, reflecting a form of loss of plasticity that single-policy preservation cannot address. Inspired by quality-diversity methods, we introduce TeLAPA (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining skill-aligned neighborhoods with competent and behaviorally related policies that support future relearning. In our MiniGrid CL setting, TeLAPA learns more tasks successfully, recovers competence faster on revisited tasks after interference, and retains higher performance across a sequence of ta

Impact 4.0 · Import 4.0 · Pop 4.7
#130

Amortized Optimal Transport from Sliced Potentials

Research ★ 4.4 multi-source (2)
arXiv stat.ML, arXiv GenMedia 2026-04-16
by Minh-Phuc Truong, Khai Nguyen

We propose a novel amortized optimization method for predicting optimal transport (OT) plans across multiple pairs of measures by leveraging Kantorovich potentials derived from sliced OT. We introduce two amortization strategies: regression-based amortization (RA-OT) and objective-based amortization (OA-OT). In RA-OT, we formulate a functional regression model that treats Kantorovich potentials from the original OT problem as responses and those obtained from sliced OT as predictors, and estimate these models via least-squares methods. In OA-OT, we estimate the parameters of the functional model by optimizing the Kantorovich dual objective. In both approaches, the predicted OT plan is subsequently recovered from the estimated potentials. As amortized OT methods, both RA-OT and OA-OT enable efficient solutions to repeated OT problems across different measure pairs by reusing information learned from prior instances to rapidly approximate new solutions. Moreover, by exploiting the structure provided by sliced OT, the proposed models are more parsimonious, independent of specific structures of the measures, such as the number of atoms in the discrete case, while achieving high accurac
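The sliced OT machinery that RA-OT and OA-OT build on is cheap precisely because one-dimensional OT between equal-weight empirical measures reduces to sorting. A minimal NumPy sketch of that underlying idea (illustrative only, not the authors' amortization models):

```python
import numpy as np

def sliced_wasserstein2(X, Y, n_projections=100, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance between two
    equal-size point clouds X, Y in R^d. In 1D, OT between equal-weight
    empirical measures is solved exactly by matching sorted samples, which
    is what makes sliced OT cheap to compute and to reuse across problems.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)     # random unit direction (a "slice")
        x_proj = np.sort(X @ theta)        # 1D projections, sorted
        y_proj = np.sort(Y @ theta)        # sorting gives the monotone coupling
        total += np.mean((x_proj - y_proj) ** 2)
    return total / n_projections

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
Y = rng.normal(size=(256, 3)) + 2.0        # second cloud, shifted
print(sliced_wasserstein2(X, Y))
```

Because each slice only requires a sort, potentials from many slices are inexpensive to produce, which is what makes them attractive predictors when amortizing solutions to the full OT problem.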

Impact 4.0 · Import 4.0 · Pop 4.7
#131
arXiv SSM, arXiv GenMedia 2026-04-16
by Aihua Li

Flow matching retains the generation quality of diffusion models while enabling substantially faster inference, making it a compelling paradigm for generative modeling. However, when applied to language modeling, it exhibits fundamental limitations in representing complex latent distributions with irregular geometries, such as anisotropy and multimodality. To address these challenges, we propose a mixture-of-experts flow matching (MoE-FM) framework, which captures complex global transport geometries in latent space by decomposing them into locally specialized vector fields. Building on MoE-FM, we develop a non-autoregressive (NAR) language modeling approach, named YAN, instantiated with both Transformer and Mamba architectures. Across multiple downstream tasks, YAN achieves generation quality on par with both autoregressive (AR) and diffusion-based NAR language models, while requiring as few as three sampling steps. This yields a 40× speedup over AR baselines and up to a 1,000× speedup over diffusion language models, demonstrating substantial efficiency advantages for language modeling.

Impact 4.0 · Import 4.0 · Pop 4.7
#132
arXiv SSM, arXiv RL 2026-04-16
by Can Karacelebi, Yusuf Talha Sahin, Elif Surer, Ertan Onur

Ad hoc wireless networks exhibit complex, innate and coupled dynamics: node mobility, energy depletion and topology change that are difficult to model analytically. Model-free deep reinforcement learning requires sustained online interaction, whereas existing model-based approaches use flat state representations that lose per-node structure. We therefore propose G-RSSM, a graph-structured recurrent state space model that maintains per-node latent states with cross-node multi-head attention to learn the dynamics jointly from offline trajectories. We apply the proposed method to the downstream task of clustering, where a cluster-head selection policy trains entirely through imagined rollouts in the learned world model. Across 27 evaluation scenarios spanning MANET, VANET, FANET, WSN and tactical networks with N=30 to 1000 nodes, the learned policy maintains high connectivity despite being trained only at N=50. We propose the first multi-physics graph-structured world model applied to combinatorial per-node decision making in size-agnostic wireless ad hoc networks.

Impact 4.0 · Import 4.0 · Pop 4.7
#133
arXiv RL, arXiv PostTraining 2026-04-16
by Ziwu Sun, Zhen Gao, Jiyong Zhang, Jiaheng Li

We propose a new perspective on policy optimization: rather than reweighting all samples by their importance ratios, an optimizer should select which samples are trustworthy enough to drive a policy update. Building on this view, we introduce Rejection-Gated Policy Optimization (RGPO), which replaces the importance sampling ratio r_θ = π_θ / π_old with a smooth, differentiable acceptance gate α_θ(s, a) = g(r_θ(s, a)) in the range [0, 1]. Unlike prior work that applies rejection sampling as a data-level heuristic before training, RGPO elevates rejection to an optimization principle: the gate participates directly in gradient computation and is implicitly updated alongside the policy. RGPO provides a unified framework: the policy gradients of TRPO, PPO, and REINFORCE all correspond to specific choices of the effective gradient weight w(r) = g'(r) · r. We prove that RGPO guarantees finite, bounded gradient variance even when importance sampling ratios are heavy-tailed (where IS variance diverges). We further show that RGPO incurs only a bounded, controllable bias and provides an approximate monotonic policy improvement guarantee analogous to TRPO. RGPO matches PP
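The gate idea can be made concrete with a toy example. Assuming a sigmoid for the gate g (the framework permits any smooth choice; this particular parameterization is our own illustration), the effective gradient weight w(r) = g'(r) · r stays bounded even for extreme importance ratios, while the raw importance-sampling weight w(r) = r does not:

```python
import math

def sigmoid_gate(r: float, center: float = 1.0, sharpness: float = 4.0) -> float:
    """A hypothetical smooth acceptance gate g(r) in (0, 1). The paper allows
    any smooth differentiable gate; this sigmoid choice is illustrative."""
    return 1.0 / (1.0 + math.exp(-sharpness * (r - center)))

def effective_weight(r: float, center: float = 1.0, sharpness: float = 4.0) -> float:
    """w(r) = g'(r) * r, the effective gradient weight under the gated update."""
    g = sigmoid_gate(r, center, sharpness)
    dg = sharpness * g * (1.0 - g)   # analytic derivative of the sigmoid
    return dg * r

# The raw IS weight w(r) = r is unbounded; the gated weight decays once r
# drifts away from the trust region around r = 1:
for r in [0.5, 1.0, 2.0, 10.0]:
    print(f"r={r:5.1f}  IS weight={r:6.1f}  gated weight={effective_weight(r):.4f}")
```

Per the abstract, the gradients of TRPO, PPO, and REINFORCE each correspond to other specific choices of w(r) within the same framework.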

Impact 4.0 · Import 4.0 · Pop 4.7
#134
arXiv GenMedia, arXiv PostTraining 2026-04-16
by Hassan Ali, Doreen Jirak, Luca Müller, Stefan Wermter

Gesture recognition research, unlike NLP, continues to face acute data scarcity, with progress constrained by the need for costly human recordings or image processing approaches that cannot generate authentic variability in the gestures themselves. Recent advancements in image-to-video foundation models have enabled the generation of photorealistic, semantically rich videos guided by natural language. These capabilities open up new possibilities for creating effort-free synthetic data, raising the critical question of whether video Generative AI models can augment and complement traditional human-generated gesture data. In this paper, we introduce and analyze prompt-based video generation to construct a realistic deictic gestures dataset and rigorously evaluate its effectiveness for downstream tasks. We propose a data generation pipeline that produces deictic gestures from a small number of reference samples collected from human participants, providing an accessible approach that can be leveraged both within and beyond the machine learning community. Our results demonstrate that the synthetic gestures not only align closely with real ones in terms of visual fidelity but also introd

Impact 4.0 · Import 4.0 · Pop 4.7
#135

Generative Modeling of Complex-Valued Brain MRI Data

Research ★ 4.4 multi-source (2)
arXiv GenMedia, arXiv PostTraining 2026-04-16
by Marco Schlimbach, Moritz Rempe, Jessica Mnischek, Lukas T. Rotkopf et al.

Objective. Standard Magnetic Resonance Imaging (MRI) reconstruction pipelines discard phase information captured during acquisition, despite evidence that it encodes tissue properties relevant to tumor diagnosis. Current machine learning approaches inherit this limitation by operating exclusively on reconstructed magnitude images. The aim of this study is to build a generative framework which is capable of jointly modeling magnitude and phase information of complex-valued MRI scans. Approach. The proposed generative framework combines a conditional variational autoencoder, which compresses complex-valued MRI scans into compact latent representations while preserving phase coherence, with a flow-matching-based generative model. Synthetic sample quality is assessed via a real-versus-synthetic classifier and by training downstream classifiers on synthetic data for abnormal tissue detection. Main results. The autoencoder preserves phase coherence above 0.997. Real-versus-synthetic classification yields low AUROC values between 0.50 and 0.66 across all acquisition sequences, indicating generated samples are nearly indistinguishable from real data. In downstream normal-versus-abnormal cl

Impact 4.0 · Import 4.0 · Pop 4.7
#136
Latent Space 2026-04-16

Hot on the heels of the Death of the Code Review, the Pull Request may be next. For anyone that learned to code in the last 15 years it is hard to imagine a life without Git, GitHub, and Pull Requests, but there was a time before them, and it may well come to pass that there is life after. Pull Requests were arguably invented in 2005, successfully popularized by GitHub, and only 21 years later, GitHub is for the first time in history allowing people to disable pull requests on their open source repos (previously you could only disable issues). The rise of generative AI in code has spelled the pending death of the Pull Request for a while now — Pete Steinberger is by now well known (along with Theo) for wanting only Prompt Requests rather than Pull Requests (for multiple reasons, e.g. 1) no merge conflicts, 2) it's easier for the maintainer to fix or add to the prompt than to look at code, 3) it's less likely that malicious or insecure code gets slipped into an innocent-looking PR), and other folks like Mitchell Hashimoto and Amp Code have created “reputation”-based systems for handling untrusted code contributions. In Building for Trillions of Agents, Aaron Levie noted that “the path forward is to make software that agents want.” Humans invented git for human collaboration reasons. It's increasingly clear that Git-based workflows may not be suitable once we remove the human bottleneck from the flow of code. And if Code Reviews are dead, and Pull Requests are dead… how long until Git itself is dead? AI News for 4/14/2026-4/15/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email freque

Impact 4.0 · Import 4.0 · Pop 4.8
#137

llm-anthropic 0.25

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-16

Release: llm-anthropic 0.25

- New model: claude-opus-4.7, which supports thinking_effort: xhigh. #66
- New thinking_display and thinking_adaptive boolean options. thinking_display summarized output is currently only available in JSON output or JSON logs.
- Increased default max_tokens to the maximum allowed for each model.
- No longer uses obsolete structured-outputs-2025-11-13 beta header for older models.

Tags: llm, anthropic, claude

Impact 4.0 · Import 4.0 · Pop 4.8
#138

datasette.io news preview

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-16

Tool: datasette.io news preview The datasette.io website has a news section built from this news.yaml file in the underlying GitHub repository. The YAML format looks like this:

- date: 2026-04-15
  body: |-
    [Datasette 1.0a27](https://docs.datasette.io/en/latest/changelog.html#a27-2026-04-15) changes how CSRF protection works in a way that simplifies form and API integration, and introduces a new `RenameTableEvent` for when a table is renamed by a SQL query.
- date: 2026-03-18
  body: |-
    ...

This format is a little hard to edit, so I finally had Claude build a custom preview UI so that checking for errors has slightly less friction. I built it using standard claude.ai and Claude Artifacts, taking advantage of Claude's ability to clone GitHub repos and look at their content as part of a regular chat:

Clone https://github.com/simonw/datasette.io and look at the news.yaml file and how it is rendered on the homepage. Build an artifact I can paste that YAML into which previews what it will look like, and highlights any markdown errors or YAML errors

Tags: vibe-coding, claude, tools, datasette

Impact 4.0 · Import 4.0 · Pop 4.8
#139
DefenseScoop 2026-04-16

The United States’ naval blockade of maritime traffic entering and leaving Iranian ports is disrupting Iran’s sea-based economic trade, according to senior defense officials who said Thursday that at least 13 ships have retreated and turned back since the military operation was initiated on Monday.  At a Pentagon press briefing, Defense Secretary Pete Hegseth, Chairman of the Joint Chiefs of Staff Gen. Dan Caine and commander of U.S. Central Command Adm. Brad Cooper discussed Operation Epic Fury and the associated, unfolding blockade, which they indicated is being enforced with a combination of integrated intelligence, surveillance and tactical assets. “You, Iran, can choose a prosperous future, a golden bridge, and we hope that you do for the people of Iran. In the meantime, and for as long as it takes, we will maintain this blockade, this successful blockade. But if Iran chooses poorly, then they will have a blockade and bombs dropping on infrastructure, power and energy,” Hegseth said. “And at the same time, Treasury Secretary Scott Bessent and our friends over at [the Treasury Department] are launching Operation Economic Fury as well, maximizing economic pressure across the entirety of the government to Iran. Choose wisely.” Treasury’s Office of Foreign Assets Control revealed on Wednesday that it has been intensifying pressure on Iran’s allegedly criminally-run oil transportation infrastructure.  OFAC sanctioned more than two dozen individuals, companies, and vessels operating within the network of Iranian oil shipping magnate Mohammad Hossein Shamkhani. He’s the son of senior Iranian security official Ali Shamkhani, who was killed by U.S. strikes on the first day of the war.  “Treasury is moving aggressively with Economic Fury by targeting regime el

Impact 5.2 · Import 4.0 · Pop 3.5
#140
arXiv cs.LG, arXiv Efficiency 2026-04-17
by Aswathi Mundayatt, Jaya Sreevalsan-Nair

Existing multi-hazard susceptibility mapping (MHSM) studies often rely on spatially uniform models, treat hazards independently, and provide limited representation of cross-hazard dependence and uncertainty. To address these limitations, this study proposes a deep learning (DL) workflow for joint flood-landslide multi-hazard susceptibility mapping (FL-MHSM) that combines two-level spatial partitioning, probabilistic Early Fusion (EF), a tree-based Late Fusion (LF) baseline, and a soft-gating Mixture of Experts (MoE) model, with MoE serving as the final predictive model. The proposed design preserves spatial heterogeneity through zonal partitions and enables data-parallel large-area prediction using overlapping lattice grids. In Kerala, EF remained competitive with LF, improving flood recall from 0.816 to 0.840 and reducing the Brier score from 0.092 to 0.086, while MoE provided the strongest performance for flood susceptibility, achieving an AUC-ROC of 0.905, a recall of 0.930, and an F1-score of 0.722. In Nepal, EF similarly improved flood recall from 0.820 to 0.858 and reduced the Brier score from 0.057 to 0.049 relative to LF, while MoE outperformed both EF and LF for landslide susceptibility, achie
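A soft-gating mixture of experts like the one named above can be sketched in a few lines. This is a generic illustration of the mechanism (a per-input softmax gate weighting expert outputs), not the paper's FL-MHSM architecture; the experts and gate here are toy stand-ins:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_gated_moe(x, experts, gate_weights):
    """Soft-gating MoE: a gate scores every expert for each input, and the
    prediction is the gate-weighted mixture of expert outputs.

    x            -- (n, d) input features
    experts      -- list of callables, each mapping (n, d) -> (n,)
    gate_weights -- (d, k) linear gate parameters, k = number of experts
    """
    scores = x @ gate_weights                           # (n, k) gate logits
    weights = softmax(scores, axis=1)                   # rows sum to 1
    preds = np.stack([e(x) for e in experts], axis=1)   # (n, k) expert outputs
    return (weights * preds).sum(axis=1)                # convex mix per input

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
experts = [lambda x: x[:, 0], lambda x: x[:, 1] ** 2]   # two toy experts
out = soft_gated_moe(x, experts, rng.normal(size=(3, 2)))
print(out.shape)
```

Because the gate outputs a convex combination, each prediction always lies between the most pessimistic and most optimistic expert for that input, which is what "soft" gating buys over a hard expert switch.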

Impact 4.0 · Import 4.0 · Pop 4.7
#141
arXiv cs.LG, arXiv stat.ML 2026-04-17
by Come Fiegel, Victor Gabillon, Michal Valko

In multi-fidelity optimization, biased approximations of varying costs of the target function are available. This paper studies the problem of optimizing a locally smooth function with a limited budget, where the learner has to make a tradeoff between the cost and the bias of these approximations. We first prove lower bounds for the simple regret under different assumptions on the fidelities, based on a cost-to-bias function. We then present the Kometo algorithm which achieves, with additional logarithmic factors, the same rates without any knowledge of the function smoothness and fidelity assumptions, and improves previously proven guarantees. We finally empirically show that our algorithm outperforms previous multi-fidelity optimization methods without the knowledge of problem-dependent parameters.

Impact 4.0 · Import 4.0 · Pop 4.7
#142
arXiv cs.LG, arXiv stat.ML 2026-04-17
by Hannah Guan, Soukayna Mouatadid, Paulo Orenstein, Judah Cohen et al.

Decision-makers rely on weather forecasts to plant crops, manage wildfires, allocate water and energy, and prepare for weather extremes. Today, such forecasts enjoy unprecedented accuracy out to two weeks thanks to steady advances in physics-based dynamical models and data-driven artificial intelligence (AI) models. However, model skill drops precipitously at subseasonal timescales (2 - 6 weeks ahead), due to compounding errors and persistent biases. To counter this degradation, we introduce probabilistic bias correction (PBC), a machine learning framework that substantially reduces systematic error by learning to correct historical probabilistic forecasts. When applied to the leading dynamical and AI models from the European Centre for Medium-Range Weather Forecasts (ECMWF), PBC doubles the subseasonal skill of the AI Forecasting System and improves the skill of the operationally-debiased dynamical model for 91% of pressure, 92% of temperature, and 98% of precipitation targets. We designed PBC for operational deployment, and, in ECMWF's 2025 real-time forecasting competition, its global forecasts placed first for all weather variables and lead times, outperforming the dynamical mo

Impact 4.0 · Import 4.0 · Pop 4.7
#143

Neuro-Symbolic ODE Discovery with Latent Grammar Flow

Research ★ 4.4 multi-source (2)
arXiv cs.LG, arXiv cs.AI 2026-04-17
by Karin Yu, Eleni Chatzi, Georgios Kissas

Understanding natural and engineered systems often relies on symbolic formulations, such as differential equations, which provide interpretability and transferability beyond black-box models. We introduce Latent Grammar Flow (LGF), a neuro-symbolic generative framework for discovering ordinary differential equations from data. LGF embeds equations as grammar-based representations into a discrete latent space and forces semantically similar equations to be positioned closer together with a behavioural loss. Then, a discrete flow model guides the sampling process to recursively generate candidate equations that best fit the observed data. Domain knowledge and constraints, such as stability, can be either embedded into the rules or used as conditional predictors.

Impact 4.0 · Import 4.0 · Pop 4.7
#144
arXiv cs.LG, arXiv cs.AI 2026-04-17
by Minchul Kang, Changyong Shin, Jinwoo Jeong, Hyunho Lee et al.

Accurate prediction of training time in distributed deep learning is crucial for resource allocation, cost estimation, and job scheduling. We observe that the floating-point precision setting is a key determinant of training time, leading to training time variations of ~2.4x over its minimum. However, existing studies on distributed training time prediction rely on static model computation graphs that do not capture precision variations, including mixed precision. According to our experiments, training time prediction without considering precision results in significant prediction errors - reaching up to 147.85% in mean absolute percentage error (MAPE). To address this issue, we propose a precision-aware distributed training time predictor that achieves robust accuracy across diverse precision settings, including mixed precision, with 9.8% MAPE.
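For reference, the MAPE figures quoted above follow the usual definition: the mean of absolute errors expressed as a percentage of the actual values. A minimal sketch with hypothetical numbers:

```python
def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error, in percent (standard definition)."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Hypothetical numbers: a precision-unaware predictor assumes one timing for
# a job whose training time actually varies with the precision setting.
actual    = [100.0, 240.0]   # true training times (e.g. minutes)
predicted = [100.0, 100.0]   # predictor ignores the precision difference
print(f"MAPE = {mape(actual, predicted):.2f}%")
```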

Impact 4.0 · Import 4.0 · Pop 4.7
#145

Sentiment Analysis of German Sign Language Fairy Tales

Research ★ 4.4 multi-source (2)
arXiv cs.LG, arXiv cs.CL 2026-04-17
by Fabrizio Nunnari, Siddhant Jain, Patrick Gebhard

We present a dataset and a model for sentiment analysis of German sign language (DGS) fairy tales. First, we perform sentiment analysis for three levels of valence (negative, neutral, positive) on German fairy tales text segments using four large language models (LLMs) and majority voting, reaching an inter-annotator agreement of 0.781 Krippendorff's alpha. Second, we extract face and body motion features from each corresponding DGS video segment using MediaPipe. Finally, we train an explainable model (based on XGBoost) to predict negative, neutral or positive sentiment from video features. Results show an average balanced accuracy of 0.631. A thorough analysis of the most important features reveals that, in addition to eyebrow and mouth motion on the face, the motion of hips, elbows, and shoulders also contributes considerably to discriminating the conveyed sentiment, indicating an equal importance of face and body for sentiment communication in sign language.

Impact 4.0 · Import 4.0 · Pop 4.7
#146
arXiv cs.LG, arXiv cs.AI 2026-04-17
by Alina Deriyeva, Jesper Dannath, Benjamin Paassen

Practice and extensive exercises are essential in programming education. Intelligent tutoring systems (ITSs) are a viable option for providing individualized hints and advice to programming students even when human tutors are not available. However, prior ITSs for programming rarely support the Python programming language, mostly focus on introductory programming, and rarely take recent developments in generative models into account. We aim to establish a novel ITS for Python programming that is highly adaptable, serves both as a teaching and research platform, provides interfaces to plug in hint mechanisms (e.g., via large language models), and works inside the particularly challenging regulatory environment of Germany, that is, conforming to the European data protection regulation, the European AI Act, and the ethical framework of the German Research Foundation. In this paper, we describe the current state of the ITS along with future development directions, and discuss the challenges and opportunities for improving the system.

Impact 4.0 · Import 4.0 · Pop 4.7
#147
arXiv cs.LG, arXiv stat.ML 2026-04-17
by Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

We study the sample complexity of learning an $ε$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any algorithm requires at least $Ω(SAB_{\star}^3/(c_{\min}ε^2))$ samples to return an $ε$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{\min} = 0$ an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this lower bound with an algorithm that matches it, up to logarithmic factors, in the general case, and an algorithm that matches it up to logarithmic factors even when $c_{\min} = 0$, but only under the condition that the optimal policy has a bounded hitting time to the goal state.

Impact 4.0 · Import 4.0 · Pop 4.7
#148
arXiv cs.LG, arXiv stat.ML 2026-04-17
by Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Michal Valko et al.

We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between players, the convergence of the last-iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a bound of $\mathcal{O}(T^{-1/8})$ on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to the performance, with the best attainable rate being $Ω(T^{-1/4})$ in contrast to the usual $Ω(T^{-1/2})$ rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate up to constant and logarithmic factors. The first algorithm leverages a straightforward trade-off between exploration and exploitation, while the second employs a regularization technique based on a two-step mirror descent approach.

Impact 4.0 · Import 4.0 · Pop 4.7
#149
arXiv cs.LG, arXiv cs.AI 2026-04-17
by Weijiang Xiong, Robert Fonod, Nikolas Geroliminis

Traffic forecasting is a challenging spatio-temporal modeling task and a critical component of urban transportation management. Current studies mainly focus on deterministic predictions, with limited consideration of the uncertainty and stochasticity in traffic dynamics. Therefore, this paper proposes an elegant yet universal approach that transforms existing models into probabilistic predictors by replacing only the final output layer with a novel Gaussian Mixture Model (GMM) layer. The modified model requires no changes to the training pipeline and can be trained using only the Negative Log-Likelihood (NLL) loss, without any auxiliary or regularization terms. Experiments on multiple traffic datasets show that our approach generalizes from classic to modern model architectures while preserving deterministic performance. Furthermore, we propose a systematic evaluation procedure based on cumulative distributions and confidence intervals, and demonstrate that our approach is considerably more accurate and informative than unimodal or deterministic baselines. Finally, a more detailed study on a real-world dense urban traffic network examines the impact of data quality.
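The only training signal the proposed GMM layer needs is mixture negative log-likelihood. A minimal 1-D sketch of that loss (the actual layer predicts mixture weights, means, and scales per forecast output; names here are illustrative):

```python
import math

def gmm_nll(y, weights, means, stds):
    """Negative log-likelihood of an observed traffic value y under a
    1-D Gaussian mixture with the given component parameters."""
    density = sum(
        w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )
    return -math.log(density)
```

With a single standard-normal component this reduces to 0.5·log(2π) at y = 0, which is a quick sanity check on the formula.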

Impact 4.0 · Import 4.0 · Pop 4.7
#150
arXiv cs.LG, arXiv PostTraining 2026-04-17
by Jean-Baptiste Escudié, Benjamin Barnes, Stefan Meisegeier, Klaus Kraywinkel et al.

There is no consensus in the field of synthetic data on concise metrics for quality evaluations or benchmarks on large health datasets, such as historical epidemiological data. This study presents an evaluation of seven recent models from major machine learning families. The models were evaluated using four different datasets, each with a distinct scale. To ensure a fair comparison, we systematically tuned the hyperparameters of each model for each dataset. We propose a methodology for evaluating the fidelity of synthesized joint distributions, aligning metrics with visualization on a single plot. This method is applicable to any dataset and is complemented by a domain-specific analysis of the German Cancer Registries' epidemiological dataset. The analysis reveals the challenges models face in strictly adhering to the medical domain. We hope this approach will serve as a foundational framework for guiding the selection of synthesizers and remain accessible to all stakeholders involved in releasing synthetic datasets.

Impact 4.0 · Import 4.0 · Pop 4.7
#151
arXiv cs.CL, arXiv Evals 2026-04-17
by Deshan Sumanathilaka, Nicholas Micallef, Julian Hough, Saman Jayasinghe

Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can effectively disambiguate word senses, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that predicts the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework for plausibility scoring of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine the impact of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter models, on accurate sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human-like plausibility judgments. Furthermore, model ensembling slightly improves performance, better simulating the agreement patterns of five human annotators compared to single-model predictions.

Impact 4.0 · Import 4.0 · Pop 4.7
#152

Optimizing Korean-Centric LLMs via Token Pruning

Research ★ 4.4 multi-source (2)
arXiv cs.CL, arXiv Evals 2026-04-17
by Hoyeol Kim, Hyeonwoo Kim

This paper presents a systematic benchmark of state-of-the-art multilingual large language models (LLMs) adapted via token pruning - a compression technique that eliminates tokens and embedding parameters corresponding to languages irrelevant to the target application. Focusing on Korean-centric natural language processing (NLP) tasks, we evaluate architectures including Qwen3, Gemma-3, Llama-3, and Aya across three vocabulary configurations: Original, English-Korean (EnKo), and English-Korean-Chinese (EnKoZh). Performance is assessed using established benchmarks for general aptitude, cultural literacy, instruction following, and machine translation. Our findings indicate that token pruning significantly improves generation stability by eliminating language confusion, and in the case of machine translation, frequently enhances performance on Korean-specific tasks. While instruction-following capabilities display architecture-dependent variance linked to latent cross-lingual representations, the significant reduction in vocabulary size validates token pruning as a highly effective optimization strategy for memory-constrained, domain-specific deployments, despite modest gains in infe
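Token pruning as described amounts to dropping embedding rows for tokens of out-of-scope languages and remapping the remaining indices. A toy sketch under that reading (function and variable names are hypothetical, not from the paper):

```python
def prune_vocabulary(embeddings, vocab, keep_tokens):
    """Drop embedding rows whose tokens are not in keep_tokens
    (e.g. the EnKo subset), and rebuild a compact token->row index."""
    row_of = {tok: i for i, tok in enumerate(vocab)}
    kept = [tok for tok in vocab if tok in keep_tokens]
    pruned = [embeddings[row_of[tok]] for tok in kept]
    new_index = {tok: i for i, tok in enumerate(kept)}
    return pruned, new_index
```

The memory saving scales with the fraction of rows removed, since input and output embedding tables are often among the largest parameter blocks in multilingual models.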

Impact 4.0 · Import 4.0 · Pop 4.7
#153
arXiv cs.CL, arXiv cs.AI 2026-04-17
by Yanli Wang, Peng Kuang, Xiaoyu Han, Kaidi Xu et al.

Large language models are increasingly deployed in settings where reliability matters, yet output-level uncertainty signals such as token probabilities, entropy, and self-consistency can become brittle under calibration-deployment mismatch. Conformal prediction provides finite-sample validity under exchangeability, but its practical usefulness depends on the quality of the nonconformity score. We propose a conformal framework for LLM question answering that uses internal representations rather than output-facing statistics: specifically, we introduce Layer-Wise Information (LI) scores, which measure how conditioning on the input reshapes predictive entropy across model depth, and use them as nonconformity scores within a standard split conformal pipeline. Across closed-ended and open-domain QA benchmarks, with the clearest gains under cross-domain shift, our method achieves a better validity-efficiency trade-off than strong text-level baselines while maintaining competitive in-domain reliability at the same nominal risk level. These results suggest that internal representations can provide more informative conformal scores when surface-level uncertainty is unstable under distribu
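The paper's contribution is the LI nonconformity score itself; it plugs into the standard split conformal calibration step, which a stdlib sketch makes concrete (illustrative, not the authors' pipeline):

```python
import math

def conformal_threshold(scores, alpha):
    """Split-conformal calibration: the ceil((n+1)(1-alpha))-th smallest
    calibration nonconformity score. At test time, answers scoring at or
    below this threshold enter the prediction set, which covers the true
    answer with probability >= 1-alpha under exchangeability."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:  # too few calibration points for this alpha
        return float("inf")
    return sorted(scores)[k - 1]
```

The coverage guarantee holds for any nonconformity score; the score's quality (here, the LI score vs. text-level baselines) only affects how small the prediction sets are.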

Impact 4.0 · Import 4.0 · Pop 4.7
#154
arXiv cs.CL, arXiv cs.AI 2026-04-17
by Jessica H. Zhu, Shayla Stringfield, Vahe Zaprosyan, Michael Wagner et al.

Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advancements in large language models (LLMs) have opened the door to automating this process, though concerns remain about whether these models can accurately and ethically capture the experiences of vulnerable populations. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results demonstrate that while some configurations of LLMs can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and limitations of LLM-assisted qualitative coding and underscore the e

Impact 4.0 · Import 4.0 · Pop 4.7
#155
arXiv cs.CL, arXiv cs.AI 2026-04-17
by Nicolò Pagan, Christopher Barrie, Chris Andrew Bail, Petter Törnberg

Large Language Models (LLMs) are increasingly deployed to curate and rank human-created content, yet the nature and structure of their biases in these tasks remain poorly understood: which biases are robust across providers and platforms, and which can be mitigated through prompt design. We present a controlled simulation study mapping content selection biases across three major LLM providers (OpenAI, Anthropic, Google) on real social media datasets from Twitter/X, Bluesky, and Reddit, using six prompting strategies (general, popular, engaging, informative, controversial, neutral). Through 540,000 simulated top-10 selections from pools of 100 posts across 54 experimental conditions, we find that biases differ substantially in how structural and how prompt-sensitive they are. Polarization is amplified across all configurations, toxicity handling shows a strong inversion between engagement- and information-focused prompts, and sentiment biases are predominantly negative. Provider comparisons reveal distinct trade-offs: GPT-4o Mini shows the most consistent behavior across prompts; Claude and Gemini exhibit high adaptivity in toxi

Impact 4.0 · Import 4.0 · Pop 4.7
#156
arXiv cs.CL, arXiv PostTraining 2026-04-17
by Junyi Li, Yongqiang Chen, Ningning Ding

Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to address complex questions, presents a dilemma to unlearning: existing methods either struggle to completely eliminate undesired knowledge from the CoT traces or degrade the reasoning performances due to the interference with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as the targeted intervention of the CoT reasoning in LRMs. More specifically, given a desired unlearning target answer, CiPO instructs LRMs to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both desirable unlearning and smooth optimizatio

Impact 4.0 · Import 4.0 · Pop 4.7
#157

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

Agents & Tools ★ 4.4 multi-source (2)
arXiv cs.CL, arXiv Agents 2026-04-17
by Shidong Yang, Ziyu Ma, Tongwen Huang, Yiming Hu et al.

Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent's evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-loop, interaction-driven training. Specifically, CoEvolve extracts feedback signals such as forgetting and uncertainty from rollout trajectories to identify failure-prone interaction patterns, and utilizes them to guide LLM-based task synthesis. The synthesized tasks are validated through environment interaction and utilized to update the data distribution, enabling joint adaptation of the agent and its data. Extensive experiments on AppWorld and BFCL across Qwen2.5-7B, Qwen3-4B, and Qwen3-30B-A3B demonstrate consistent and significant improvements over strong base models, yielding absolute gains of 19.43%, 15.58%, and 18.14%, respectively.

Impact 4.0 · Import 4.0 · Pop 4.7
#158
arXiv cs.CL, arXiv Agents 2026-04-17
by Chengwu Liu, Yichun Yin, Ye Yuan, Jiaxuan Xie et al.

Most ATP benchmarks embed the final answer within the formal statement -- a convention we call "Easy Mode" -- a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode -- while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are u

Impact 4.0 · Import 4.0 · Pop 4.7
#159
arXiv cs.CL, arXiv Agents 2026-04-17
by Yueling Fan, Richard Lee Davis, Olga Viberg

This study presents WriteFlow, an AI voice-based writing assistant designed to support reflective academic writing through goal-oriented interaction. Academic writing involves iterative reflection and evolving goal regulation, yet prior research and a formative study with 17 participants show that writers often struggle to articulate and manage changing goals. While commonly used AI writing tools emphasize efficiency, they offer limited support for metacognition and writer agency. WriteFlow frames AI interaction as a dialogic space for ongoing goal articulation, monitoring, and negotiation grounded in writers' intentions. Findings from a Wizard-of-Oz study with 12 expert users show that WriteFlow scaffolds metacognitive regulation and reflection-in-action by supporting iterative goal refinement, maintaining goal-text alignment during drafting, and prompting evaluation of goal fulfillment. We discuss design implications for AI writing systems that prioritize reflective dialogue, flexible goal structures, and multi-perspective feedback to support intentional and agentic writing.

Impact 4.0 · Import 4.0 · Pop 4.7
#160
arXiv cs.CL, arXiv Efficiency 2026-04-17
by Chi Liu, Xin Chen, Xu Zhou, Fangbo Tu et al.

Large Language Models (LLMs) have achieved remarkable success, underpinning diverse AI applications. However, they often suffer from performance degradation due to factors such as catastrophic forgetting during Supervised Fine-Tuning (SFT), quantization, and pruning. In this work, we introduce a performance recovery framework based on Self-Distillation Fine-Tuning (SDFT) that effectively restores model capabilities. Complementing this practical contribution, we provide a rigorous theoretical explanation for the underlying recovery mechanism. We posit that an LLM's generative capability fundamentally relies on the high-dimensional manifold constructed by its hidden layers. To investigate this, we employ Centered Kernel Alignment (CKA) to quantify the alignment between student and teacher activation trajectories, leveraging its invariance to orthogonal transformations and scaling. Our experiments demonstrate a strong correlation between performance recovery and manifold alignment, substantiating the claim that self-distillation effectively aligns the student's high-dimensional manifold with the optimal structure represented by the teacher. This study bridges the gap between practical
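CKA here is presumably the standard linear variant (the paper may use a kernel form): it compares centered activation matrices and is invariant to orthogonal transformations and isotropic scaling, as the abstract notes. A stdlib sketch:

```python
def _center(m):
    """Subtract the per-feature (column) mean; rows are samples."""
    means = [sum(col) / len(m) for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def _gram(a, b):
    """a^T b for row-major matrices a (n x p) and b (n x q)."""
    return [[sum(ra[i] * rb[j] for ra, rb in zip(a, b))
             for j in range(len(b[0]))] for i in range(len(a[0]))]

def _fro2(m):
    """Squared Frobenius norm."""
    return sum(v * v for row in m for v in row)

def linear_cka(x, y):
    """Linear CKA between two activation matrices (rows = samples).
    Returns 1.0 when the representations match up to rotation/scale;
    assumes non-constant activations (nonzero denominator)."""
    x, y = _center(x), _center(y)
    num = _fro2(_gram(x, y))
    den = (_fro2(_gram(x, x)) * _fro2(_gram(y, y))) ** 0.5
    return num / den
```

Scaling one matrix leaves the score unchanged, which is exactly the invariance the authors rely on when comparing student and teacher trajectories.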

Impact 4.0 · Import 4.0 · Pop 4.7
#161

ASMR-Bench: Auditing for Sabotage in ML Research

Research ★ 4.4 multi-source (2)
arXiv cs.AI, arXiv Evals 2026-04-17
by Eric Gan, Aryan Bhatt, Buck Shlegeris, Julian Stastny et al.

As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce ASMR-Bench (Auditing for Sabotage in ML Research), a benchmark for evaluating the ability of auditors to detect sabotage in ML research codebases. ASMR-Bench consists of 9 ML research codebases with sabotaged variants that produce qualitatively different experimental results. Each sabotage modifies implementation details, such as hyperparameters, training data, or evaluation code, while preserving the high-level methodology described in the paper. We evaluated frontier LLMs and LLM-assisted human auditors on ASMR-Bench and found that both struggled to reliably detect sabotage: the best performance was an AUROC of 0.77 and a top-1 fix rate of 42%, achieved by Gemini 3.1 Pro. We also tested LLMs as red teamers and found that LLM-generated sabotages were weaker than human-generated ones but still sometimes evaded same-capability LLM auditors. We release ASMR-Bench to support research on monitoring and auditing techniques for AI-conducted research.

Impact 4.0 · Import 4.0 · Pop 4.7
#162
arXiv cs.AI, arXiv cs.CV 2026-04-17
by Van-Truong Le, Le-Khanh Nguyen, Trong-Doanh Nguyen

Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is inefficient, costly, and prone to errors at scale. Although some existing AI-powered monitoring systems have been deployed and trusted, many lack transparency or require multi-layered architectures to achieve the desired performance. To overcome these challenges, we propose an improvement over a simple two-stage framework for exam cheating detection that integrates object detection and behavioral analysis using well-known technologies. First, the state-of-the-art YOLOv8n model is used to localize students in exam-room images. Each detected region is cropped and preprocessed, then classified by a fine-tuned RexNet-150 model as either normal or cheating behavior. The system is trained on a dataset compiled from 10 independent sources with a total of 273,897 samples, achieving 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score - a 13% increase over a baseline accuracy of 0.82 in video-based cheating detection. In addition, with an average inference time of 13.9 ms per sample, the proposed approach demonstrates robustness and

Impact 4.0 · Import 4.0 · Pop 4.7
#163
arXiv cs.AI, arXiv cs.CV 2026-04-17
by Henry O. Velesaca, Luigi Miranda, Angel D. Sappa

This paper presents SWNet, a bimodal end-to-end cross-spectral network specifically engineered for the detection of camouflaged weeds in dense agricultural environments. Plant camouflage, characterized by homochromatic blending where invasive species mimic the phenotypic traits of primary crops, poses a significant challenge for traditional computer vision systems. To overcome these limitations, SWNet utilizes a Pyramid Vision Transformer v2 backbone to capture long-range dependencies and a Bimodal Gated Fusion Module to dynamically integrate Visible and Near-Infrared information. By leveraging the physiological differences in chlorophyll reflectance captured in the NIR spectrum, the proposed architecture effectively discriminates targets that are otherwise indistinguishable in the visible range. Furthermore, an Edge-Aware Refinement module is employed to produce sharper object boundaries and reduce structural ambiguity. Experimental results on the Weeds-Banana dataset indicate that SWNet outperforms ten state-of-the-art methods. The study demonstrates that the integration of cross-spectral data and boundary-guided refinement is essential for high segmentation accuracy in complex c

Impact 4.0 · Import 4.0 · Pop 4.7
#164
arXiv cs.AI, arXiv Efficiency 2026-04-17
by Lin Deng, Chang-bo Liu

We extracted the scholarly reasoning systems of two internationally prominent humanities and social science scholars from their published corpora alone, converted those systems into structured inference-time constraints for a large language model, and tested whether the resulting scholar-bots could perform core academic functions at expert-assessed quality. The distillation pipeline used an eight-layer extraction method and a nine-module skill architecture grounded in local, closed-corpus analysis. The scholar-bots were then deployed across doctoral supervision, peer review, lecturing and panel-style academic exchange. Expert assessment involved three senior academics producing reports and appointment-level syntheses. Across the preserved expert record, all review and supervision reports judged the outputs benchmark-attaining, appointment-level recommendations placed both bots at or above Senior Lecturer level in the Australian university system, and recovered panel scores placed Scholar A between 7.9 and 8.9/10 and Scholar B between 8.5 and 8.9/10 under multi-turn debate conditions. A research-degree-student survey showed high performance ratings across information reliability, th

Impact 4.0 · Import 4.0 · Pop 4.7
#165
arXiv cs.AI, arXiv PostTraining 2026-04-17
by Janet Vertesi, danah boyd, Alex Taylor, Benjamin Shestakofsky

The Project of AI is a world-building endeavor, wherein those who fund and develop AI systems both operate through and seek to sustain networks of power and wealth. As they expand their access to resources and configure our sociotechnical conditions, they benefit from the ways in which a suite of decoys animate scholars, critics, policymakers, journalists, and the public into co-constructing industry-empowering AI futures. Regardless of who constructs or nurtures them, these decoys often create the illusion of accountability while both masking the emerging political economies that the Project of AI has set into motion, and also contributing to the network-making power that is at the heart of the Project's extraction and exploitation. Drawing on literature at the intersection of communication, science and technology studies, and economic sociology, we examine how the Project of AI is constructed. We then explore five decoys that seemingly critique - but in actuality co-constitute - AI's emergent power relations and material political economy. We argue that advancing meaningful fairness or accountability in AI requires: 1) recognizing when and how decoys serve as a distraction, and 2

Impact 4.0 · Import 4.0 · Pop 4.7
#166

Neurosymbolic Repo-level Code Localization

Research ★ 4.4 multi-source (2)
arXiv cs.AI, arXiv Agents 2026-04-17
by Xiufeng Xu, Xiufeng Wu, Zejun Zhang, Yi Li

Code localization is a cornerstone of autonomous software engineering. Recent advancements have achieved impressive performance on real-world issue benchmarks. However, we identify a critical yet overlooked bias: these benchmarks are saturated with keyword references (e.g. file paths, function names), encouraging models to rely on superficial lexical matching rather than genuine structural reasoning. We term this phenomenon the Keyword Shortcut. To address this, we formalize the challenge of Keyword-Agnostic Logical Code Localization (KA-LCL) and introduce KA-LogicQuery, a diagnostic benchmark requiring structural reasoning without any naming hints. Our evaluation reveals a catastrophic performance drop of state-of-the-art approaches on KA-LogicQuery, exposing their lack of deterministic reasoning capabilities. We propose LogicLoc, a novel agentic framework that combines large language models with the rigorous logical reasoning of Datalog for precise localization. LogicLoc extracts program facts from the codebase and leverages an LLM to synthesize Datalog programs, with parser-gated validation and mutation-based intermediate-rule diagnostic feedback to ensure correctness and effici

Impact 4.0 · Import 4.0 · Pop 4.7
#167
arXiv cs.AI, arXiv Agents 2026-04-17
by Hamed Jelodar, Samita Bai, Mohammad Meymani, Parisa Hamedi et al.

Generative AI, particularly Large Language Models, increasingly integrates graph-based representations to enhance reasoning, retrieval, and structured decision-making. Despite rapid advances, there remains limited clarity regarding when, why, where, and what types of graph-LLM integrations are most appropriate across applications. This survey provides a concise, structured overview of the design choices underlying the integration of graphs with LLMs. We categorize existing methods based on their purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, interaction graphs, causal graphs, dependency graphs), and integration strategies (prompting, augmentation, training, or agent-based use). By mapping representative works across domains such as cybersecurity, healthcare, materials science, finance, robotics, and multimodal environments, we highlight the strengths, limitations, and best-fit scenarios for each technique. This survey aims to offer researchers a practical guide for selecting the most suitable graph-LLM approach depending on task requirements, data characteristics, and reasoning complexity.

Impact 4.0 · Import 4.0 · Pop 4.7
#168
arXiv cs.RO, arXiv cs.CV 2026-04-17
by Nikhil Behari, Diego Rivero, Luke Apostolides, Suman Ghosh et al.

Consumer LiDARs in mobile devices and robots typically output a single depth value per pixel. Yet internally, they record full time-resolved histograms containing direct and multi-bounce light returns; these multi-bounce returns encode rich non-line-of-sight (NLOS) cues that can enable perception of hidden objects in a scene. However, severe hardware limitations of consumer LiDARs make NLOS reconstruction with conventional methods difficult. In this work, we motivate a complementary direction: enabling NLOS perception with low-cost LiDARs through data-driven inference. We present DENALI, the first large-scale real-world dataset of space-time histograms from low-cost LiDARs capturing hidden objects. We capture time-resolved LiDAR histograms for 72,000 hidden-object scenes across diverse object shapes, positions, lighting conditions, and spatial resolutions. Using our dataset, we show that consumer LiDARs can enable accurate, data-driven NLOS perception. We further identify key scene and modeling factors that limit performance, as well as simulation-fidelity gaps that hinder current sim-to-real transfer, motivating future work toward scalable NLOS vision with consumer LiDARs.

Impact 4.0 · Import 4.0 · Pop 4.7
#169
arXiv cs.RO, arXiv Efficiency 2026-04-17
by Fazeng Li, Gan Sun, Chenxi Liu, Yao He et al.

Hand-eye calibration through visual localization is a critical capability for robotic manipulation in open-world environments. However, most deep learning-based calibration models suffer from catastrophic forgetting when adapting to unseen data amid open-world scene changes, while simple rehearsal-based continual learning strategies cannot fully mitigate this issue. To overcome this challenge, we propose a continual hand-eye calibration framework, enabling robots to adapt to sequentially encountered open-world manipulation scenes through a spatial-aware replay strategy and structure-preserving distillation. Specifically, a Spatial-Aware Replay Strategy (SARS) constructs a geometrically uniform replay buffer that ensures comprehensive coverage of each scene's pose space, replacing redundant adjacent frames with maximally informative viewpoints. Meanwhile, a Structure-Preserving Dual Distillation (SPDD) is proposed to decompose localization knowledge into coarse scene layout and fine pose precision, and distills them separately to alleviate both types of forgetting during continual adaptation. As a new manipulation scene arrives, SARS provides geometrically representative replay samples fr

Impact 4.0 · Import 4.0 · Pop 4.7
#170
arXiv cs.RO, arXiv Agents 2026-04-17
by Hürkan Şahin, Van Huyen Dang, Erdi Sayar, Alper Yegenoglu et al.

Reinforcement learning (RL) often struggles in real-world tasks with high-dimensional state spaces and long horizons, where sparse or fixed rewards severely slow down exploration and cause agents to get trapped in local optima. This paper presents a fuzzy-logic-based reward shaping method that integrates human intuition into RL reward design. By encoding expert knowledge into adaptive and interpretable terms, fuzzy rules promote stable learning and reduce sensitivity to hyperparameters. The proposed method leverages these properties to adapt reward contributions based on the agent state, enabling smoother transitions between fast motion and precise control in challenging navigation tasks. Extensive simulation results on autonomous drone racing benchmarks show stable learning behavior and consistent task performance across scenarios of increasing difficulty. The proposed method achieves faster convergence and reduced performance variability across training seeds in more challenging environments, with success rates improving by up to approximately 5 percent compared to non-fuzzy reward formulations.
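One way to read "fuzzy rules adapt reward contributions based on the agent state" is a membership-weighted blend of reward terms. A toy sketch with hypothetical terms and scales (not the paper's rule base):

```python
def tri(x, a, b, c):
    """Triangular fuzzy membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def shaped_reward(dist_to_gate, speed):
    """Blend a 'be precise' term and a 'go fast' term by the fuzzy
    degree of nearness to the next racing gate (toy units)."""
    near = tri(dist_to_gate, -1.0, 0.0, 5.0)  # fully 'near' at 0 m
    far = 1.0 - near
    precision_term = -dist_to_gate  # reward closing the gap
    speed_term = speed              # reward forward progress
    return near * precision_term + far * speed_term
```

Because memberships vary smoothly with the state, the reward transitions gradually from speed-seeking to precision-seeking rather than switching abruptly, which is the stability property the abstract emphasizes.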

Impact 4.0 · Import 4.0 · Pop 4.7
#171
arXiv cs.RO, arXiv Robotics-Embodied 2026-04-17
by Xu Huang, Weixin Mao, Yinhao Li, Hua Chen et al.

Vision-Language-Action (VLA) models have demonstrated significant potential for embodied decision-making; however, their application in complex chemical laboratory automation remains restricted by limited long-horizon reasoning and the absence of persistent experience accumulation. Existing frameworks typically treat planning and execution as decoupled processes, often failing to consolidate successful strategies, which results in inefficient trial-and-error in multi-stage protocols. In this paper, we propose ChemBot, a dual-layer, closed-loop framework that integrates an autonomous AI agent with a progress-aware VLA model (Skill-VLA) for hierarchical task decomposition and execution. ChemBot utilizes a dual-layer memory architecture to consolidate successful trajectories into retrievable assets, while a Model Context Protocol (MCP) server facilitates efficient sub-agent and tool orchestration. To address the inherent limitations of VLA models, we further implement a future-state-based asynchronous inference mechanism to mitigate trajectory discontinuities. Extensive experiments on collaborative robots demonstrate that ChemBot achieves superior operational safety, precision, and ta

Impact 4.0 · Import 4.0 · Pop 4.7
#172

Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan

Generative Media ★ 4.4 multi-source (2)
arXiv cs.CV, arXiv GenMedia 2026-04-17
by Shivarth Rai, Tejeswar Pokuri

Atmospheric haze significantly degrades wildlife imagery, impeding computer vision applications critical for conservation, such as animal detection, tracking, and behavior analysis. To address this challenge, we introduce AnimalHaze3k, a synthetic dataset comprising 3,477 hazy images generated from 1,159 clear wildlife photographs through a physics-based pipeline. Our novel IncepDehazeGan architecture combines inception blocks with residual skip connections in a GAN framework, achieving state-of-the-art performance (SSIM: 0.8914, PSNR: 20.54, LPIPS: 0.1104) and delivering 6.27% higher SSIM and 10.2% better PSNR than competing approaches. When applied to downstream detection tasks, dehazed images improved YOLOv11 detection mAP by 112% and IoU by 67%. These advances can provide ecologists with reliable tools for population monitoring and surveillance in challenging environmental conditions, demonstrating significant potential for enhancing wildlife conservation efforts through robust visual analytics.
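For readers unfamiliar with the reported metrics: PSNR is a standard log-scale measure of pixel-level fidelity between a restored and a clean image. A minimal pure-Python reference computation (the paper presumably uses a library implementation over 2-D images; this flattens pixels into a sequence):

```python
import math

def psnr(clean, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel
    sequences; higher means the restored image is closer to the clean one."""
    mse = sum((c - r) ** 2 for c, r in zip(clean, restored)) / len(clean)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

A 10.2% PSNR gain, as reported for IncepDehazeGan, corresponds to a sizeable drop in mean squared error because of the logarithm.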

Impact 4.0 · Import 4.0 · Pop 4.7
#173
arXiv cs.CV, arXiv GenMedia 2026-04-17
by Yue Jiang, Mingyu Yang, Liuyuxin Yang, Yang Xu et al.

Recent advances in generative motion synthesis have enabled the production of realistic human motions from diverse input modalities. However, synthesizing compound actions from texts, which integrate multiple concurrent actions into coherent full-body sequences, remains a major challenge. We identify two key limitations in current text-to-motion diffusion models: (i) catastrophic neglect, where earlier actions are overwritten by later ones due to improper handling of temporal information, and (ii) attention collapse, which arises from excessive feature fusion in cross-attention mechanisms. As a result, existing approaches often depend on overly detailed textual descriptions (e.g., raising right hand), explicit body-part specifications (e.g., editing the upper body), or the use of large language models (LLMs) for body-part interpretation. These strategies lead to deficient semantic representations of physical structures and kinematic mechanisms, limiting the ability to incorporate natural behaviors such as greeting while walking. To address these issues, we propose the Motion-Adapter, a plug-and-play module that guides text-to-motion diffusion models in generating compound actions b

Impact 4.0 · Import 4.0 · Pop 4.7
#174

The Amazing Stability of Flow Matching

Generative Media ★ 4.4 multi-source (2)
arXiv cs.CV, arXiv GenMedia 2026-04-17
by Rania Briq, Michael Kamp, Ohad Fried, Sarel Cohen et al.

The success of deep generative models in generating high-quality and diverse samples is often attributed to particular architectures and large training datasets. In this paper, we investigate the impact of these factors on the quality and diversity of samples generated by flow-matching models. Surprisingly, in our experiments on the CelebA-HQ dataset, flow matching remains stable even when pruning 50% of the dataset: the quality and diversity of generated samples are preserved. Moreover, pruning impacts the latent representation only slightly; that is, samples generated by models trained on the full and pruned datasets map to visually similar outputs for a given seed. We observe similar stability when changing the architecture or training configuration, such that the latent representation is maintained under these changes as well. Our results quantify just how strong this stability can be in practice, and help explain the reliability of flow-matching models under various perturbations.
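For context on the objective being stress-tested: flow matching trains a velocity field to match the constant velocity (x1 - x0) along straight interpolation paths between a noise sample x0 and a data sample x1. A minimal scalar sketch of the path and the regression target (no network, just the loss structure):

```python
def interpolate(x0, x1, t):
    """Linear probability path x_t = (1 - t) * x0 + t * x1 for t in [0, 1]."""
    return (1.0 - t) * x0 + t * x1

def fm_loss(v_pred, x0, x1):
    """Squared error of a predicted velocity against the conditional
    flow-matching target, which is simply x1 - x0 on linear paths."""
    return (v_pred - (x1 - x0)) ** 2
```

Because the target depends only on the endpoints, not on t, the regression problem is unusually well-conditioned, which is consistent with the stability the paper observes under dataset pruning.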

Impact 4.0 · Import 4.0 · Pop 4.7
#175

Efficient Video Diffusion Models: Advancements and Challenges

Generative Media ★ 4.4 multi-source (2)
arXiv GenMedia, arXiv Efficiency 2026-04-17
by Shitong Shao, Lichen Bai, Pengfei Wan, James Kwok et al.

Video diffusion models have rapidly become the dominant paradigm for high-fidelity generative video synthesis, but their practical deployment remains constrained by severe inference costs. Compared with image generation, video synthesis compounds computation across spatial-temporal token growth and iterative denoising, making attention and memory traffic major bottlenecks in real-world settings. This survey provides a systematic and deployment-oriented review of efficient video diffusion models. We propose a unified categorization that organizes existing methods into four main paradigms: step distillation, efficient attention, model compression, and cache/trajectory optimization. Building on this categorization, we analyze the algorithmic trends of each paradigm and examine how different design choices target two core objectives: reducing the number of function evaluations and minimizing per-step overhead. Finally, we discuss open challenges and future directions, including quality preservation under composite acceleration, hardware-software co-design, robust real-time long-horizon generation, and open infrastructure for standardized evaluation.

Impact 4.0 · Import 4.0 · Pop 4.7
#176
arXiv Efficiency, arXiv AIScience 2026-04-17
by Yuanchang Zhou, Hongyu Wang, Yiming Du, Yan Wang et al.

Universal Machine Learning Interatomic Potentials (uMLIPs), pre-trained on massively diverse datasets encompassing inorganic materials and organic molecules across the entire periodic table, serve as foundational models for quantum-accurate physical simulations. However, uMLIP training requires second-order derivatives, which lack corresponding parallel training frameworks; moreover, scaling to the billion-parameter regime causes explosive growth in computation and communication overhead, making its training a tremendous challenge. We introduce MatRIS-MoE, a billion-parameter Mixture-of-Experts model built upon an invariant architecture, and Janus, a pioneering high-dimensional distributed training framework for uMLIPs with hardware-aware optimizations. Deployed across two Exascale supercomputers, our code attains a peak performance of 1.2/1.0 EFLOPS (24%/35.5% of theoretical peak) in single precision at over 90% parallel efficiency, compressing the training of billion-parameter uMLIPs from weeks to hours. This work establishes a new high-water mark for AI-for-Science (AI4S) foundation models at Exascale and provides essential infrastructure for rapid scientific discovery.

Impact 4.0 · Import 4.0 · Pop 4.7
#177
Simon Willison's Weblog 2026-04-17

This year's PyCon US is coming up next month from May 13th to May 19th, with the core conference talks from Friday 15th to Sunday 17th and tutorial and sprint days either side. It's in Long Beach, California this year, the first time PyCon US has come to the West Coast since Portland, Oregon in 2017 and the first time in California since Santa Clara in 2013. If you're based in California this is a great opportunity to catch up with the Python community, meet a whole lot of interesting people and learn a ton of interesting things. In addition to regular PyCon programming we have two new dedicated tracks at the conference this year: an AI track on Friday and a Security track on Saturday. The AI program was put together by track chairs Silona Bonewald (CitableAI) and Zac Hatfield-Dodds (Anthropic). I'll be an in-the-room chair this year, introducing speakers and helping everything run as smoothly as possible. Here's the AI track schedule in full:

11:00: AI-Assisted Contributions and Maintainer Load - Paolo Melchiorre
11:45: AI-Powered Python Education: Towards Adaptive and Inclusive Learning - Sonny Mupfuni
12:30: Making African Languages Visible: A Python-Based Guide to Low-Resource Language ID - Gift Ojeabulu
2:00: Running Large Language Models on Laptops: Practical Quantization Techniques in Python - Aayush Kumar JVS
2:45: Distributing AI with Python in the Browser: Edge Inference and Flexibility Without Infrastructure - Fabio Pliger
3:30: Don't Block the Loop: Python Async Patterns for AI Agents - Aditya Mehra
4:30: What Python Developers Need to Know About Hardware: A Practical Guide to GPU Memory, Kernel Scheduling, and Execution Models - Santosh Appachu Devanira Poovaiah
5:15: How to Build Your First Real-Time Voice Agent in Python (Without Losing Your Mind) - Cami

Impact 4.0 · Import 4.0 · Pop 4.8
#178

datasette 1.0a28

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-17

Release: datasette 1.0a28

I was upgrading Datasette Cloud to 1.0a27 and discovered a nasty collection of accidental breakages caused by changes in that alpha. This new alpha addresses those directly:

Fixed a compatibility bug introduced in 1.0a27 where execute_write_fn() callbacks with a parameter name other than conn were seeing errors. (#2691)
The database.close() method now also shuts down the write connection for that database.
New datasette.close() method for closing down all databases and resources associated with a Datasette instance. This is called automatically when the server shuts down. (#2693)
Datasette now includes a pytest plugin which automatically calls datasette.close() on temporary instances created in function-scoped fixtures and during tests. See Automatic cleanup of Datasette instances for details. This helps avoid running out of file descriptors in plugin test suites that were written before the Database(is_temp_disk=True) feature introduced in Datasette 1.0a27. (#2692)

Most of the changes in this release were implemented using Claude Code and the newly released Claude Opus 4.7.

Tags: datasette

Impact 4.0 · Import 4.0 · Pop 4.8
#179
YT: Computerphile 2026-04-17

B is the forerunner to C - but it seemed lost, until Angelo Papenhoff decided to change that and brought it back from the brink! Here he tries to recreate Brian Kernighan's original 'Hello World' with a few wrinkles! Angelo's B compiler is here: https://github.com/aap/b Angelo's emulators: https://github.com/aap/blincolnlights There's a typo at the end of the video; it should of course read "Obsolescence". Computerphile is supported by Jane Street. Learn more about them (and exciting career opportunities) at: https://jane-st.co/computerphile This video was filmed and edited by Sean Riley. Computerphile is a sister project to Brady Haran's Numberphile. More at https://www.bradyharanblog.com

Impact 4.0 · Import 5.5 · Pop 3.5
#180

[AINews] The Two Sides of OpenClaw

Agents & Tools ★ 4.4
Latent Space 2026-04-18

In an opportune coinciding of big three-letter conferences, the TED talk and the AIE talks of Peter Steinberger dropped today. To the general public, the inspiring story of OpenClaw was delightfully told onstage, which recaps all the highs. To the engineering audience, it was more sober, talking about the unprecedented levels of security incidents (60x more reports than curl, at least 20% of skill contributions malicious) and scaling issues involved in maintaining the fastest growing open source project in history. An AMA moderated by me is included at the end. Contrast them, thoughts welcome. AI News for 4/16/2026-4/17/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews' website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies! AI Twitter Recap: Anthropic's Claude Opus 4.7 and Claude Design rollout. Claude Design launched as Anthropic's first design/prototyping surface: @claudeai announced Claude Design, a research-preview tool for generating prototypes, slides, and one-pagers from natural-language instructions, powered by Claude Opus 4.7. The launch immediately framed Anthropic as moving beyond chat/coding into design tooling; multiple observers called it a direct shot at Figma/Lovable/Bolt/v0, including @Yuchenj_UW, @kimmonismus, and @skirano. The market reaction itself became part of the story, with @Yuchenj_UW and others noting Figma's sharp drawdown after the announcement. Product details surfaced via @TheRundownAI: inline refinement, sliders, exports to Canva/PPTX/PDF/HTML, and handoff to Claude Code for implementation. Opus 4.7 looks stronger overall, but the rollout was noisy: third-party benchmark posts were broadly favorab

Impact 4.0 · Import 4.0 · Pop 4.8
#181
Simon Willison's Weblog 2026-04-18

Anthropic are the only major AI lab to publish the system prompts for their user-facing chat systems. Their system prompt archive now dates all the way back to Claude 3 in July 2024 and it's always interesting to see how the system prompt evolves as they publish new models. Opus 4.7 shipped the other day (April 16, 2026) with a Claude.ai system prompt update since Opus 4.6 (February 5, 2026). I had Claude Code take the Markdown version of their system prompts, break that up into separate documents for each of the models and then construct a Git history of those files over time with fake commit dates representing the publication dates of each updated prompt - here's the prompt I used with Claude Code for the web. Here is the git diff between Opus 4.6 and 4.7. These are my own highlights extracted from that diff - in all cases text in bold is my emphasis: The "developer platform" is now called the "Claude Platform". The list of Claude tools mentioned in the system prompt now includes "Claude in Chrome - a browsing agent that can interact with websites autonomously, Claude in Excel - a spreadsheet agent, and Claude in Powerpoint - a slides agent. Claude Cowork can use all of these as tools." - Claude in Powerpoint was not mentioned in the 4.6 prompt. The child safety section has been greatly expanded, and is now wrapped in a new <critical_child_safety_instructions> tag. Of particular note: "Once Claude refuses a request for reasons of child safety, all subsequent requests in the same conversation must be approached with extreme caution." It looks like they're trying to make Claude less pushy: "If a user indicates they are ready to end the conversation, Claude does not request that the user stay in the interaction or try to elicit another turn and instead respects t

Impact 4.0 · Import 4.0 · Pop 4.8
#182
Simon Willison's Weblog 2026-04-18

Research: Claude system prompts as a git timeline. Anthropic publish the system prompts for Claude chat and make that page available as Markdown. I had Claude Code turn that page into separate files for each model and model family with fake git commit dates to enable browsing the changes via the GitHub commit view. I used this to write my own detailed notes on the changes between Opus 4.6 and 4.7. Tags: system-prompts, anthropic, claude, generative-ai, ai, llms
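The fake-commit-dates trick rests on git's standard date-override environment variables, GIT_AUTHOR_DATE and GIT_COMMITTER_DATE. A sketch that builds the invocation (rather than running git, so the repository path and message are whatever you supply; pass the env overrides merged into os.environ to subprocess.run inside the target repo):

```python
def backdated_commit_cmd(message, date_iso):
    """Return (argv, env_overrides) for a git commit stamped with a past date.

    GIT_AUTHOR_DATE and GIT_COMMITTER_DATE are git's documented variables
    for overriding commit timestamps; running `argv` with these set makes
    the commit appear at `date_iso` in history views like GitHub's.
    """
    argv = ["git", "commit", "-m", message]
    env = {
        "GIT_AUTHOR_DATE": date_iso,
        "GIT_COMMITTER_DATE": date_iso,
    }
    return argv, env
```

One such commit per published prompt, dated to its publication date, yields a browsable timeline of prompt changes.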

Impact 4.0 · Import 4.0 · Pop 4.8
#183
Simon Willison's Weblog 2026-04-18

Agentic Engineering Patterns > Here's an example of a deceptively short prompt that got quite a lot of work done in a single shot. First, some background. I send out a free Substack newsletter around once a week containing content copied-and-pasted from my blog. I'm effectively using Substack as a lightweight way to allow people to subscribe to my blog via email. I generate the newsletter with my blog-to-newsletter tool - an HTML and JavaScript app that fetches my latest content from this Datasette instance and formats it as rich text HTML, which I can then copy to my clipboard and paste into the Substack editor. Here's a detailed explanation of how that works. I recently added a new type of content to my blog to capture content that I post elsewhere, which I called "beats". These include things like releases of my open source projects, new tools that I've built, museums that I've visited (from niche-museums.com) and other external content. I wanted to include these in the generated newsletter. Here's the prompt I ran against the simonw/tools repository that hosts my blog-to-newsletter tool, using Claude Code on the web:

Clone simonw/simonwillisonblog from github to /tmp for reference
Update blog-to-newsletter.html to include beats that have descriptions - similar to how the Atom everything feed on the blog works
Run it with python -m http.server and use `uvx rodney --help` to test it - compare what shows up in the newsletter with what's on the homepage of https://simonwillison.net

This got me the exact solution I needed. Let's break down the prompt. "Clone simonw/simonwillisonblog from github to /tmp for reference" - I use this pattern a lot. Coding agents can clone code from GitHub, and the best way to explain a problem is often to have them look at relevant

Impact 4.0 · Import 4.0 · Pop 4.8
#185

Headless everything for personal AI

Agents & Tools ★ 4.4
Simon Willison's Weblog 2026-04-19

Headless everything for personal AI. Matt Webb thinks headless services are about to become much more common. Why? Because using personal AIs is a better experience for users than using services directly (honestly); and headless services are quicker and more dependable for the personal AIs than having them click round a GUI with a bot-controlled mouse. Evidently Marc Benioff thinks so too: "Welcome Salesforce Headless 360: No Browser Required! Our API is the UI. Entire Salesforce & Agentforce & Slack platforms are now exposed as APIs, MCP, & CLI. All AI agents can access data, workflows, and tasks directly in Slack, Voice, or anywhere else with Salesforce Headless." If this model does take off it's going to play havoc with existing per-head SaaS pricing schemes. I'm reminded of the early 2010s era when every online service was launching APIs. Brandur Leach reminisces about that time in The Second Wave of the API-first Economy, and predicts that APIs are ready to make a comeback: "Suddenly, an API is no longer liability, but a major saleable vector to give users what they want: a way into the services they use and pay for so that an agent can carry out work on their behalf. Especially given a field of relatively undifferentiated products, in the near future the availability of an API might just be the crucial deciding factor that leads to one choice winning the field." Tags: apis, definitions, matt-webb, salesforce, saas, ai, brandur-leach

Impact 4.0 · Import 4.0 · Pop 4.8
#186
NVIDIA AI Blog 2026-04-20

Manufacturing is at an inflection point. Across every major industrial economy, the pressure to do more with less - due to faster design cycles, leaner operations and strain on skilled labor pools - is accelerating the shift to AI-driven production. The question is no longer whether to adopt AI, but how fast and at what scale. At Hannover Messe 2026, running April 20-24 in Hannover, Germany, NVIDIA and its partners are demonstrating AI-driven manufacturing in action. Attendees will experience how advancements in accelerated computing, AI physics, agents and robotics are powering industrial innovation - from agentic design and engineering to real-time simulation, vision AI agents and humanoid robots operating in factories. The factory of the future isn't just a concept. It's being built now. AI Infrastructure: Powering Europe's Next Industrial Era. Running AI at scale across the factories and supply chains that manufacturing output relies on requires the right underlying infrastructure. As AI becomes foundational to how products, processes and facilities are designed, built and optimized, manufacturers need a unified, sovereign foundation that's secure, scalable and built for industrial scale. The Industrial AI Cloud, one of Europe's largest AI factories built in Germany by Deutsche Telekom on NVIDIA AI infrastructure, is a blueprint for the future. It provides a secure, sovereign foundation for accelerating AI and robotics across Europe's industries. At the show, industry leaders, including Agile Robots, SAP, Siemens, PhysicsX and Wandelbots, will share how they are using this sovereign AI platform to run AI-accelerated workloads ranging from AI physics-driven, real-time simulation to factory-scale digital twins and software-defined robotics. EDAG, a leading indep

Impact 4.0 · Import 4.0 · Pop 4.8
#187
Simon Willison's Weblog 2026-04-20

TIL: SQL functions in Google Sheets to fetch data from Datasette. I put together some notes on patterns for fetching data from a Datasette instance directly into Google Sheets - using the importdata() function, a "named function" that wraps it, or a Google Apps Script if you need to send an API token in an HTTP header (not supported by importdata()). Here's an example sheet demonstrating all three methods. Tags: spreadsheets, datasette, google

Impact 4.0 · Import 4.0 · Pop 4.8
#188
Simon Willison's Weblog 2026-04-20

Claude Token Counter, now with model comparisons. I upgraded my Claude Token Counter tool to add the ability to run the same count against different models in order to compare them. As far as I can tell Claude Opus 4.7 is the first model to change the tokenizer, so it's only worth running comparisons between 4.7 and 4.6. The Claude token counting API accepts any Claude model ID though, so I've included options for all four of the notable current models (Opus 4.7 and 4.6, Sonnet 4.6, and Haiku 4.5). In the Opus 4.7 announcement Anthropic said: "Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type." I pasted the Opus 4.7 system prompt into the token counting tool and found that the Opus 4.7 tokenizer used 1.46× the number of tokens of Opus 4.6. Opus 4.7 uses the same pricing as Opus 4.6 - $5 per million input tokens and $25 per million output tokens - but this token inflation means we can expect it to be around 40% more expensive. The token counter tool also accepts images. Opus 4.7 has improved image support, described like this: "Opus 4.7 has better vision for high-resolution images: it can accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times as many as prior Claude models." I tried counting tokens for a 3456x2234 pixel 3.7MB PNG and got an even bigger increase in token counts - 3.01× the number of tokens for 4.7 compared to 4.6. Update: That 3x increase for images is entirely due to Opus 4.7 being able to handle higher resolutions. I tried that again with a 682x318 pixel image and it took 314 tokens with Opus 4.7 and 310 with Opus 4.6, so effectively the same cost. Update 2: I tried a 15MB, 30 pag

Impact 4.0 · Import 4.0 · Pop 4.8
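The cost implication in the post above is simple arithmetic: with per-token prices unchanged, input cost scales directly with the tokenizer's inflation factor. A sketch using the post's published Opus pricing and its measured 1.46× system-prompt example (the factor varies by content, so this is one sample, not a general rate):

```python
def input_cost_usd(tokens, price_per_million=5.0):
    """Input cost at Opus pricing: $5 per million input tokens."""
    return tokens / 1_000_000 * price_per_million

def inflated_cost(tokens_under_old_tokenizer, inflation=1.46):
    """Cost of the same text under the new tokenizer, given a measured
    inflation factor relative to the old one."""
    return input_cost_usd(tokens_under_old_tokenizer * inflation)
```

A text that cost $5 of input under the 4.6 tokenizer would cost about $7.30 under 4.7 at that measured factor, i.e. 46% more for that particular sample.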
#189
arXiv SSM 2026-04-13
by Wanli Ma, Sivasakthy Selvakumaran, Dain G. Farrimond, Adam A. Dennis et al.

Accurate and rapid structural damage assessment (SDA) is crucial for post-disaster management, helping responders prioritise resources, plan rescues, and support recovery. Traditional field inspections, though precise, are limited by accessibility, safety risks, and time constraints, especially after large explosions. Machine learning with remote sensing has emerged as a scalable solution for rapid SDA, with Mamba-based networks achieving state-of-the-art performance. However, these methods often require extensive training and large datasets, limiting real-world applicability. Moreover, they fail to incorporate key physical characteristics of blast loading for SDA. To overcome these challenges, we propose a Mamba-based multimodal network for rapid SDA that integrates multi-scale blast-loading information with optical remote sensing images. Evaluated on the 2020 Beirut explosion, our method significantly improves performance over state-of-the-art approaches. Code is available at: https://github.com/IMPACTSquad/Blast-Mamba

Impact 4.7 · Import 4.0 · Pop 3.5
#190
arXiv SSM 2026-04-14
by NVIDIA: Aakshita Chandiramani, Aaron Blakeman et al.

We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.

Impact 4.7 · Import 4.0 · Pop 3.5
#191
arXiv SSM 2026-04-14
by Hsin-Tien Chiang, John H. L. Hansen

Speech enhancement (SE) is critical for improving speech intelligibility and quality in real-world environments, particularly for cochlear implant (CI) users who experience severe degradations in speech understanding under noisy and reverberant conditions. In this study, we propose TokenSE, a discrete token-based SE framework operating in the neural audio codec space, which predicts clean codec token indices from degraded speech using a Mamba-based model. Unlike the earlier Transformer architecture, whose self-attention mechanism has a computational complexity that grows quadratically with sequence length, the input-dependent selection mechanism of Mamba achieves linear complexity, making it a compelling alternative to Transformers, especially for CI and hearing-aid (HA) applications. Objective evaluations show that TokenSE consistently outperforms baseline methods on both in-domain and out-of-domain datasets. Moreover, subjective listening experiments with CI users indicate clear benefit in speech intelligibility under adverse noisy and reverberant environments.

Impact 4.7 · Import 4.0 · Pop 3.5
#192
arXiv SSM 2026-04-14
by Mohammed Ezzaldin Babiker Abdullah

The stable operation of off-grid photovoltaic systems requires accurate, computationally efficient solar forecasting. Contemporary deep learning models often suffer from massive computational overhead and physical blindness, generating impossible predictions. This paper introduces the Physics-Informed State Space Model (PISSM) to bridge the gap between efficiency and physical accuracy for edge-deployed microcontrollers. PISSM utilizes a dynamic Hankel matrix embedding to filter stochastic sensor noise by transforming raw meteorological sequences into a robust state space. A Linear State Space Model replaces heavy attention mechanisms, efficiently modeling temporal dependencies for parallel processing. Crucially, a novel Physics-Informed Gating mechanism leverages the Solar Zenith Angle and Clearness Index to structurally bound outputs, ensuring predictions strictly obey diurnal cycles and preventing nocturnal errors. Validated on a multi-year dataset for Omdurman, Sudan, PISSM achieves superior accuracy with fewer than 40,000 parameters, establishing an ultra-lightweight benchmark for real-time off-grid control.
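The Hankel embedding PISSM builds stacks lagged windows of the input series into a matrix whose repeated anti-diagonal structure lets low-rank methods separate signal from stochastic sensor noise. A minimal construction sketch (the window length is a free parameter here, not the paper's value):

```python
def hankel_embed(series, window):
    """Build the Hankel (trajectory) matrix of a 1-D series: row i is
    series[i : i + window], so each anti-diagonal holds one repeated
    sample of the original sequence."""
    n = len(series) - window + 1
    if n <= 0:
        raise ValueError("window longer than series")
    return [series[i : i + window] for i in range(n)]
```

Filtering then typically truncates the matrix's dominant singular components before averaging the anti-diagonals back into a denoised series; the paper's exact filtering step is not described in the abstract.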

Impact 4.7 · Import 4.0 · Pop 3.5
#193
arXiv SSM 2026-04-15
by Michiel Cevaal, Thomas de Jong, Mircea Lazar

In this paper, we consider the design of Model Predictive Control (MPC) algorithms based on Mamba neural networks. Mamba is a neural network architecture capable of sub-quadratic computational scaling in sequence length with state-of-the-art modeling capabilities. A consistent and complete mathematical description of the Mamba neural network is provided. Then, adjustments and optimizations are made to construct a decoder-only Mamba multi-step predictor for MPC, and an input-output formulation is given for sequence-to-sequence modeling of dynamical systems. The performance of Mamba-MPC is evaluated on several numerical examples and compared to a Long Short-Term Memory-based MPC (LSTM-MPC) equivalent. First, a Single-Input-Single-Output (SISO) Van der Pol oscillator is considered, where stability, reference tracking, and noise robustness are evaluated. Then, a Four Tank setup is introduced where Multiple-Input-Multiple-Output (MIMO) reference tracking is evaluated. Lastly, Mamba-MPC is implemented on a physical Quanser Aero2 setup for closed-loop reference tracking. The results demonstrate that Mamba-MPC is able to stabilize and track a reference for SISO and MIMO syst

Impact 4.7 · Import 4.0 · Pop 3.5
#194
arXiv SSM 2026-04-15
by Jason Kong, Nilesh Prasad Pandey, Flavio Ponzina, Tajana Rosing

Deploying Large Language Models (LLMs) on edge devices faces severe computational and memory constraints, limiting real-time processing and on-device intelligence. Hybrid architectures combining Structured State Space Models (SSMs) with transformer-based LLMs offer a balance of efficiency and performance. Aggressive quantization can drastically cut model size and speed up inference, but its uneven effects on different components require careful management. In this work, we propose a lightweight, backpropagation-free, surrogate-based sensitivity analysis framework to identify hybrid SSM-Transformer components most susceptible to quantization-induced degradation. Relying solely on forward-pass metrics, our method avoids expensive gradient computations and retraining, making it suitable for situations where access to in-domain data is limited due to proprietary restrictions or privacy constraints. We also provide a formal analysis showing that the Kullback-Leibler (KL) divergence metric better captures quantization sensitivity for Language modeling tasks than widely adopted alternatives such as mean squared error (MSE) and signal-to-quantization-noise ratio (SQNR). Through extensive e

Impact 4.7 · Import 4.0 · Pop 3.5
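The paper above argues that KL divergence between the output distributions of a full-precision model and its quantized counterpart captures quantization sensitivity better than MSE or SQNR. A minimal sketch of such a forward-pass-only score over next-token logits (pure Python; how the paper aggregates scores across components is not given in the abstract):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_sensitivity(fp_logits, quantized_logits):
    """KL(P_fp || P_quant) between next-token distributions produced with
    a component in full precision vs quantized; a larger value means the
    component is more sensitive to quantization."""
    p = softmax(fp_logits)
    q = softmax(quantized_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Because it needs only two forward passes and no gradients, this kind of score fits the backpropagation-free setting the paper targets.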
#195
arXiv SSM 2026-04-15
by Guoan Xu, Yang Xiao, Guangwei Gao, Dongchen Zhu et al.

Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability-aware Self-Gated State Space Model (RSGMamba). At the core of our method is the Reliability-aware Self-Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross-modal interactions through a self-gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability-aware feature selection and enhances informative feature aggregation. In addition, a lightweight Local Cross-Gated Modulation (LCGM) is incorporated to refine fine-grained spatial details, complementing the global modeling capability of RSGMB. Extens

Impact 4.7 · Import 4.0 · Pop 3.5
#196
arXiv Robotics-Embodied 2026-04-15
by You Rim Choi, Subeom Park, Hyung-Sin Kim

As AI moves from data centers to robots and wearables, scaling ever-larger models becomes insufficient. Physical AI operates under tight latency, energy, privacy, and reliability constraints, and its performance depends not only on model capacity but also on how signals are acquired through controllable sensors in dynamic environments. We present Artificial Tripartite Intelligence (ATI), a bio-inspired, sensor-first architectural contract for physical AI. ATI is tripartite at the systems level: a Brainstem (L1) provides reflexive safety and signal-integrity control, a Cerebellum (L2) performs continuous sensor calibration, and a Cerebral Inference Subsystem spanning L3/L4 supports routine skill selection and execution, coordination, and deep reasoning. This modular organization allows sensor control, adaptive sensing, edge-cloud execution, and foundation model reasoning to co-evolve within one closed-loop architecture, while keeping time-critical sensing and control on device and invoking higher-level inference only when needed. We instantiate ATI in a mobile camera prototype under dynamic lighting and motion. In our routed evaluation (L3-L4 split inference), compared to the defaul

Impact 4.8 · Import 4.0 · Pop 3.5
#197
C4ISRNET 2026-04-15

AeroVironment is debuting a new drone with the capacity to carry out reconnaissance, electronic warfare and strike missions, building on a lethal loitering system that is already being fielded by the Army, according to a Wednesday announcement. The defense technology firm introduced the system, known as MAYHEM 10, which expands upon its Switchblade family. The Army in February announced a $186 million purchase that includes two variants of Switchblade one-way attack, or “kamikaze,” drones: the Switchblade 600 Block 2 variant and the Switchblade 300 Block 20 variant. The difference is that MAYHEM 10 is multifunctional, meaning it can perform tasks in addition to striking. The new system can carry a 10-pound payload and has a range of over 62 miles, per the release. The system is capable of 50 minutes of endurance, with a launch assembly that can be done in under five minutes, the statement says. It can also be launched from air, ground or maritime platforms. “By integrating advanced autonomy, multi-domain payloads, and rapid adaptability, we empower our forces to sense, disrupt, and strike with precision — even in the most contested environments,” Wahid Nawabi, AeroVironment’s chairman, president and chief executive officer, said in the statement. Last year, U.S. soldiers tested the Switchblade 600 system, which has a range of 27 miles and is designed to engage a target using onboard cameras. The Switchblade 300 Block 20, unlike the heavier 600 variant, is small enough to be carried in a backpack. For the first time, according to a February AeroVironment announcement, it will come equipped with an Explosively Formed Penetrator, a deadly warhead that is made to penetrate armored vehicles.

Impact 5.2 · Import 4.0 · Pop 3.5
#198
FedScoop 2026-04-15

The Department of Energy’s Office of Cybersecurity, Energy Security and Emergency Response has partnered with Lawrence Livermore National Laboratory to develop an AI testbed capable of identifying model weaknesses, the agency said in a blog post Tuesday. Energy-sector stakeholders, including utilities, grid operators, vendors, national labs and research organizations, can use the platform to better understand model risk and how to integrate AI into critical systems. Users will upload AI models to the platform and perform adversarial tests to assess security posture. “The testbed enables users to observe the effects of attacks and quantify how vulnerable the model is to manipulation and leaked information,” DOE said in the blog post. “This facilitates apples-to-apples comparisons between models, showing users which model options are most robust and by what margin.” Named after the Norse god Thor’s hammer, the Mjölnir AI Testbed will give energy-sector players a look at whether an AI model behaves unsafely or exposes sensitive data at a time when AI models are becoming more integrated into critical workflows. The technology is a high-value target for cyberattacks, underlining the need for resilient models. Anthropic, for example, says that its models have been targeted by Chinese competitors in attempts to steal information about how the technology works. “As AI systems handle increasingly sensitive data and perform critical societal functions, failures in AI security could result in severe consequences, including privacy violations, operational disruptions, economic damages, and threats to public safety,” researchers from the Japan AI Safety Institute said in a July 2025 report. Even when not targeted directly, AI systems are subject

Impact 5.2 · Import 4.0 · Pop 3.5
#199
Waymo Blog 2026-04-15

Waymo opens its robotaxi service to all residents and visitors in Miami and Orlando with no waitlist, and launches a rolling-invitation service in Nashville. This continues the 2026 footprint expansion beyond the existing Bay Area, Phoenix, LA, and Austin operations.

Impact 5.2 · Import 4.0 · Pop 3.5
#200
arXiv SSM 2026-04-16
by Yoo-Min Jung, Leekyung Kim

Despite recent advances in state space models (SSMs) such as Mamba across various sequence domains, research on their standalone capacity for time series classification (TSC) has remained limited. We propose MambaSL, a framework that minimally redesigns the selective SSM and projection layers of a single-layer Mamba, guided by four TSC-specific hypotheses. To address benchmarking limitations -- restricted configurations, partial University of East Anglia (UEA) dataset coverage, and insufficiently reproducible setups -- we re-evaluate 20 strong baselines across all 30 UEA datasets under a unified protocol. As a result, MambaSL achieves state-of-the-art performance with statistically significant average improvements, while ensuring reproducibility via public checkpoints for all evaluated models. Together with visualizations, these results demonstrate the potential of Mamba-based architectures as a TSC backbone.

Impact 4.7 · Import 4.0 · Pop 3.5
#201
arXiv SSM 2026-04-16
by Badri N. Patro, Vijay S. Agneeswaran

Vision State Space Models (SSMs) like Vim, VMamba, and SiMBA rely on complex scanning strategies to adapt sequential SSMs to process 2D images, introducing computational overhead and architectural complexity. We propose HAMSA, a scanning-free SSM operating directly in the spectral domain. HAMSA introduces three key innovations: (1) simplified kernel parameterization: a single Gaussian-initialized complex kernel replacing traditional (A, B, C) matrices, eliminating discretization instabilities; (2) SpectralPulseNet (SPN): an input-dependent frequency gating mechanism enabling adaptive spectral modulation; and (3) Spectral Adaptive Gating Unit (SAGU): magnitude-based gating for stable gradient flow in the frequency domain. By leveraging FFT-based convolution, HAMSA eliminates sequential scanning while achieving O(L log L) complexity with superior simplicity and efficiency. On ImageNet-1K, HAMSA reaches 85.7% top-1 accuracy (state-of-the-art among SSMs), with 2.2× faster inference than transformers (4.2ms vs 9.2ms for DeiT-S) and 1.4–1.9× speedup over scanning-based SSMs, while using less memory (2.1GB vs 3.2–4.5GB) and energy (12.5J vs 18–25J). HAMSA demonstrates strong generalization
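The O(L log L) claim rests on the textbook identity that circular convolution becomes pointwise multiplication in the frequency domain. A minimal stdlib-only sketch of that identity follows (a radix-2 FFT on power-of-two lengths; HAMSA's actual kernel parameterization and gating are not reproduced here):

```python
import cmath

def fft(a, invert=False):
    # Radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return a[:]
    sign = 1 if invert else -1
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def fft_circular_conv(x, h):
    # O(L log L) circular convolution: transform, multiply pointwise
    # in the spectral domain, transform back, normalize by n.
    n = len(x)
    X = fft([complex(v) for v in x])
    H = fft([complex(v) for v in h])
    y = fft([a * b for a, b in zip(X, H)], invert=True)
    return [v.real / n for v in y]

y = fft_circular_conv([1.0, 2.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0])
# → [1.0, 3.0, 2.0, 0.0] up to float error (matches the direct O(L^2) sum)
```

This avoids any sequential scan over the input, which is the structural point behind HAMSA's efficiency numbers.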

Impact 4.7 · Import 4.0 · Pop 3.5
#202
arXiv SSM 2026-04-16
by Nikola Zubić, Qian Li, Yuyi Wang, Davide Scaramuzza

We study the expressive power and limitations of multi-layer state-space models (SSMs). First, we show that multi-layer SSMs face fundamental limitations in compositional tasks, revealing an inherent gap between SSMs and streaming models. Then, we examine the role of chain-of-thought (CoT), showing that offline CoT does not fundamentally increase the expressiveness, while online CoT can substantially increase its power. Indeed, with online CoT, multi-layer SSMs become equivalent in power to streaming algorithms. Finally, we investigate the tradeoff between width and precision, showing that these resources are not interchangeable in the base model, but admit a clean equivalence once online CoT is allowed. Overall, our results offer a unified perspective on how depth, finite precision, and CoT shape the power and limits of SSMs.

Impact 4.7 · Import 4.0 · Pop 3.5
#204

Find, Fix, Reason: Context Repair for Video Reasoning

Research ★ 4.3 multi-source (2)
arXiv cs.CV, arXiv Evals 2026-04-17
by Haojian Huang, Chuanyu Qin, Yinchuan Li, Yingcong Chen

Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model's knowledge boundary, or hybrid replay that mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model's capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps, regions etc.) from the original video while the question remains unchanged. The student answers again with the added context, and training updates with a chosen-rollout scheme integrated into Group Relative Policy Optimization (GRPO). We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity t

Impact 4.0 · Import 4.0 · Pop 4.7
#205
arXiv SSM 2026-04-17
by Sicheng Chen, Chad Wong, Tianyi Zhang, Enhui Chai et al.

Whole Slide Image (WSI) analysis is pivotal in computational pathology, enabling cancer diagnosis by integrating morphological and architectural cues across magnifications. Multiple Instance Learning (MIL) serves as the standard framework for WSI analysis. Recently, Mamba has become a promising backbone for MIL, overtaking Transformers due to its efficiency and global context modeling capabilities originating from Natural Language Processing (NLP). However, existing Mamba-based MIL approaches face three critical challenges: (1) disruption of 2D spatial locality during 1D sequence flattening; (2) sub-optimal modeling of fine-grained local cellular structures; and (3) high memory peaks during inference on resource-constrained edge devices. Studies like MambaOut reveal that Mamba's SSM component is redundant for local feature extraction, where Gated CNNs suffice. Recognizing that WSI analysis demands both fine-grained local feature extraction akin to natural images, and global context modeling akin to NLP, we propose MambaBack, a novel hybrid architecture that harmonizes the strengths of Mamba and MambaOut. First, we propose the Hilbert sampling strategy to preserve the 2D spatial loc

Impact 4.7 · Import 4.0 · Pop 3.5
#206
arXiv SSM 2026-04-17
by Enhui Chai, Sicheng Chen, Tianyi Zhang, Xingyu Li et al.

Pathological diagnosis is highly reliant on image analysis, where Regions of Interest (ROIs) serve as the primary basis for diagnostic evidence, while whole-slide image (WSI)-level tasks primarily capture aggregated patterns. To extract these critical morphological features, ROI-level Foundation Models (FMs) based on Vision Transformers (ViTs) and large-scale self-supervised learning (SSL) have been widely adopted. However, three core limitations remain in their application to ROI analysis: (1) cross-magnification domain shift, as fixed-scale pretraining hinders adaptation to diverse clinical settings; (2) inadequate local-global relationship modeling, wherein the ViT backbone of FMs suffers from high computational overhead and imprecise local characterization; (3) insufficient fine-grained sensitivity, as traditional self-attention mechanisms tend to overlook subtle diagnostic cues. To address these challenges, we propose SSMamba, a hybrid SSL framework that enables effective fine-grained feature learning without relying on large external datasets. This framework incorporates three domain-adaptive components: Mamba Masked Image Modeling (MAMIM) for mitigating domain shift, a Direc

Impact 4.7 · Import 4.0 · Pop 3.5
#207
arXiv SSM 2026-04-17
by Pushpa Kumar Balan, Aijing Feng

Gradient saliency from deep sequence models surfaces candidate biomarkers efficiently, but the resulting gene lists can be contaminated by tissue-composition confounders that degrade downstream classifiers. We study whether LLM chain-of-thought (CoT) reasoning can filter these confounders, and whether reasoning quality is associated with downstream performance. We train a Mamba SSM on TCGA-BRCA RNA-seq and extract the top-50 genes by gradient saliency; DeepSeek-R1 evaluates every candidate with structured CoT to produce a final 17-gene set. On the held-out test split, the raw 50-gene saliency set (no LLM) performs worse than a 5,000-gene variance baseline (AUC 0.832 vs. 0.903), while the LLM-filtered set surpasses it (AUC 0.927), using 294x fewer features. A faithfulness audit (COSMIC CGC, OncoKB, PAM50) shows that 6 of 17 selected genes (35.3%) are validated BRCA biomarkers, while 10 of 16 known BRCA genes present in the input were missed - including FOXA1. This divergence between downstream performance and reasoning faithfulness suggests selective faithfulness in this setting: targeted confounder removal can improve predictive performance without comprehensive recall.
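The first stage of the pipeline above, gradient saliency followed by top-k gene selection, can be sketched with a toy linear scorer: for a score w·x the gradient with respect to the input is the weight vector itself, so input-times-gradient saliency reduces to |w_i · x_i| per gene. Gene labels and values here are made up for illustration, not TCGA-BRCA data.

```python
# Toy stand-in for gradient saliency over gene expression.
# Illustrative weights of a linear "model" and one sample's expression.
weights = {"G1": 0.9, "G2": 0.7, "G3": 0.05, "G4": 0.4}
expression = {"G1": 1.2, "G2": 0.5, "G3": 3.0, "G4": 0.8}

# Input-times-gradient saliency: |w_i * x_i| per gene.
saliency = {g: abs(weights[g] * expression[g]) for g in weights}

# Top-k candidate list (the paper uses top-50 from a Mamba SSM).
top_genes = sorted(saliency, key=saliency.get, reverse=True)[:2]
# The paper's second stage would pass such a list to an LLM judge
# for chain-of-thought confounder filtering.
```

The paper's point is precisely that this raw top-k list can be contaminated by tissue-composition confounders, which the LLM filtering step then removes.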

Impact 4.7 · Import 4.0 · Pop 3.5
#208
arXiv GenMedia 2026-04-16
by Olga Loginova, Frank Keller

Reliable procedural monitoring in video requires exposure to naturally occurring human errors and the recoveries that follow. In egocentric recordings, mistakes are often partially occluded by hands and revealed through subtle object state changes, while existing procedural datasets provide limited and inconsistent mistake and correction traces. We present PIE-V (Psychologically Inspired Error injection for Videos), a framework for constructing and benchmarking mistake-aware egocentric procedural videos by augmenting clean keystep procedures with controlled, human-plausible deviations. PIE-V combines a psychology-informed error planner conditioned on procedure phase and semantic step load, a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence and repairs failures. For video segment edits, PIE-V synthesizes replacement clips with text-guided video generation and stitches them into the episode to preserve visual plausibility. Applied to 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections. For benchmarking, we introduce a unified

Impact 4.0 · Import 4.6 · Pop 3.5
#209
arXiv cs.LG 2026-04-17
by Khang Le, Joaquín Torres-Sospedra, Philipp Müller

Fixed Radius Near Neighbor (FRNN) search is an alternative to the widely used k Nearest Neighbors (kNN) search. Unlike kNN, FRNN determines a label or an estimate for a test sample based on all training samples within a predefined distance. While this approach is beneficial in certain scenarios, assuming a fixed maximum distance for all training samples can decrease the accuracy of the FRNN. Therefore, in this paper we propose the Adaptive Radius Near Neighbor (ARNN) and the Weighted ARNN (WARNN), which employ adaptive distances and, in the latter case, weights. All three methods are compared to kNN and twelve of its variants for a regression problem, namely WiFi fingerprinting indoor positioning, using 22 different datasets to provide a comprehensive analysis. While the performances of the tested FRNN and ARNN versions were amongst the worst, three of the four best methods in the test were WARNN versions, indicating that using weights together with adaptive distances achieves performance comparable to or even better than kNN variants.
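The kNN/FRNN distinction, and the weighted adaptive-radius idea, can be sketched for 1-D regression. The WARNN-style predictor below uses per-sample radii and simple inverse-distance weights as illustrative assumptions; the paper's exact adaptive rule is not reproduced here.

```python
def knn_predict(train, query, k=3):
    # train: list of (feature, target) pairs; 1-D features for clarity.
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return sum(t for _, t in nearest) / k

def frnn_predict(train, query, radius):
    # Fixed Radius Near Neighbor: average every target within `radius`.
    inside = [t for x, t in train if abs(x - query) <= radius]
    return sum(inside) / len(inside) if inside else None

def warnn_predict(train, query, radii, weights):
    # WARNN-style sketch: per-sample radii plus distance-based weights
    # (illustrative rule, not the paper's).
    pairs = [(w / (1 + abs(x - query)), t)
             for (x, t), r, w in zip(train, radii, weights)
             if abs(x - query) <= r]
    total = sum(w for w, _ in pairs)
    return sum(w * t for w, t in pairs) / total if pairs else None

train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
print(knn_predict(train, 1.2, k=2))       # averages the 2 nearest targets
print(frnn_predict(train, 1.2, 1.0))      # averages everything within radius 1
```

Note that FRNN can return nothing at all when no training sample falls inside the fixed radius, which is one reason adaptive radii help.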

Impact 4.0 · Import 4.6 · Pop 3.5
#210
arXiv cs.LG 2026-04-17
by Erchi Wang, Pengrun Huang, Eli Chien, Om Thakkar et al.

Differential privacy (DP) has a wide range of applications for protecting data privacy, but designing and verifying DP algorithms requires expert-level reasoning, creating a high barrier for non-expert practitioners. Prior works either rely on specialized verification languages that demand substantial domain expertise or remain semi-automated and require human-in-the-loop guidance. In this work, we investigate whether large language models (LLMs) can automate DP reasoning. We introduce DPrivBench, a benchmark in which each instance asks whether a function or algorithm satisfies a stated DP guarantee under specified assumptions. The benchmark is carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning through trivial pattern matching. Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities. Through further analytic study and failure-mode analysis, we identify several promising directions for improving automated DP reasoning. Our benchmark provides a solid foundation for developing and eval
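For orientation, the kind of "textbook mechanism" the benchmark reports models handling well is the Laplace mechanism: a counting query has sensitivity 1, so adding Laplace(1/ε) noise makes the release ε-differentially private. A stdlib-only sketch (parameter names are illustrative):

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of Laplace(0, scale) using only the stdlib.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_count(true_count, epsilon, rng):
    # A counting query changes by at most 1 when one record is added
    # or removed (sensitivity 1), so Laplace(1/epsilon) noise suffices
    # for epsilon-DP.
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
noisy = laplace_count(100, epsilon=0.5, rng=rng)
```

A DPrivBench-style instance would then ask whether a variant of this function, say one that reuses noise across queries or mis-scales by the sensitivity, still satisfies the stated guarantee.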

Impact 4.0 · Import 4.6 · Pop 3.5
#211
arXiv cs.CL 2026-04-17
by Supriti Sinhamahapatra, Thai-Binh Nguyen, Yiğit Oğuz, Enes Ugan et al.

The goal of multilingual speech technology is to facilitate seamless communication between individuals speaking different languages, creating the experience as though everyone were a multilingual speaker. To create this experience, speech technology needs to address several challenges: handling mixed multilingual input, specific vocabulary, and code-switching. However, there is currently no dataset benchmarking this situation. We propose a new benchmark to evaluate whether current Automatic Speech Recognition (ASR) systems are able to handle these challenges. The benchmark consists of bilingual discussions on scientific papers between multiple speakers, each conversing in a different language. We provide a standard evaluation framework beyond Word Error Rate (WER), enabling consistent comparison of ASR performance across languages. Experimental results demonstrate that the proposed dataset is still an open challenge for state-of-the-art ASR systems. The dataset is available at https://huggingface.co/datasets/goodpiku/muscat-eval. Keywords: multilingual, speech recognition, audio segmentation, speaker diarization
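WER itself, the baseline metric the framework extends, is just word-level edit distance normalized by reference length; a minimal sketch:

```python
def wer(reference, hypothesis):
    # Word Error Rate: word-level Levenshtein distance divided by the
    # number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution

    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the bat sat"))  # → 0.3333... (1 substitution / 3 words)
```

Going "beyond WER" here means layering language-aware scoring, audio segmentation, and speaker diarization metrics on top of this base measure.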

Impact 4.0 · Import 4.6 · Pop 3.5
#212
arXiv cs.CL 2026-04-17
by Masahiro Suzuki, Hiroki Sakaji

We introduce JFinTEB, the first comprehensive benchmark specifically designed for evaluating Japanese financial text embeddings. Existing embedding benchmarks provide limited coverage of language-specific and domain-specific aspects found in Japanese financial texts. Our benchmark encompasses diverse task categories including retrieval and classification tasks that reflect realistic and well-defined financial text processing scenarios. The retrieval tasks leverage instruction-following datasets and financial text generation queries, while classification tasks cover sentiment analysis, document categorization, and domain-specific classification challenges derived from economic survey data. We conduct extensive evaluations across a wide range of embedding models, including Japanese-specific models of various sizes, multilingual models, and commercial embedding services. We publicly release JFinTEB datasets and evaluation framework at https://github.com/retarfi/JFinTEB to facilitate future research and provide a standardized evaluation protocol for the Japanese financial text mining community. This work addresses a critical gap in Japanese financial text processing resources and estab

Impact 4.0 · Import 4.6 · Pop 3.5
#213
arXiv cs.CL 2026-04-17
by Pritesh Jha

We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial domain annotated text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution. To establish baseline difficulty, we evaluate eight published systems spanning rule-based engines (Microsoft Presidio), general purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII-specific model (Piiranha DeBERTa), and a financial NER specialist (Xtreme

Impact 4.0 · Import 4.6 · Pop 3.5
#214
arXiv cs.AI 2026-04-17
by Farhad Abtahi, Abdolamir Karbalaie, Eduardo Illueca-Fernandez, Fernando Seoane

Metacognition, the ability to monitor and regulate one's own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. The benchmark evaluates 35 models from 12 families on 130 ambiguous instances across five domains and reports two complementary scores: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. In a follow-up progressive adversarial analysis of 11 models, we observed two behavioural profiles, i.e., models that revise primarily in response to argument quality and models that track consensus statistics. Under within-model relative profiling (ipsative scoring), evaluation was the weakest relative ability in all 35 models, indicating a systematic knowing/doing gap. Smaller and ch

Impact 4.0 · Import 4.6 · Pop 3.5
#215
arXiv cs.CV 2026-04-17
by Deepak Kumar, Abhishek Pratap Singh, Puneet Kumar, Xiaobai Li et al.

Understanding affective dynamics in real-world social systems is fundamental to modeling and analyzing human-human interactions in complex environments. Group affect emerges from intertwined human-human interactions, contextual influences, and behavioral cues, making its quantitative modeling a challenging computational social systems problem. However, computational modeling of group affect in in-the-wild scenarios remains challenging due to limited large-scale annotated datasets and the inherent complexity of multimodal social interactions shaped by contextual and behavioral variability. The lack of comprehensive datasets annotated with multimodal and contextual information further limits advances in the field. To address this, we introduce the Group Affect from ViDeos (GAViD) dataset, comprising 5091 video clips with multimodal data (video, audio and context), annotated with ternary valence and discrete emotion labels and enriched with VideoGPT-generated contextual metadata and human-annotated action cues. We also present Context-Aware Group Affect Recognition Network (CAGNet) for multimodal context-aware group affect recognition. CAGNet achieves 63.20% test accuracy on GAViD, c

Impact 4.8 · Import 4.0 · Pop 3.5
#216
arXiv cs.CV 2026-04-17
by Yuhai Deng, Huimin She, Wei Shen, Meng Li et al.

Tone style transfer for photo retouching aims to adapt the stylistic tone of the reference image to a given content image. However, the lack of high-quality large-scale triplet datasets with stylized ground truth forces existing methods to rely on self-supervised or proxy objectives, which limits model capability. To mitigate this gap, we design a data construction pipeline to build TST100K, a large-scale dataset of 100,000 content-reference-stylized triplets. At the core of this pipeline, we train a tone style scorer to ensure strict stylistic consistency for each triplet. In addition, existing methods typically extract content and reference features independently and then fuse them in a decoder, which may cause semantic loss and lead to inappropriate color transfer and degraded visual aesthetics. Instead, we propose ICTone, a diffusion-based framework that performs tone transfer in an in-context manner by jointly conditioning on both images, leveraging the semantic priors of generative models for semantic-aware transfer. Reward feedback learning using the tone style scorer is further incorporated to improve stylistic fidelity and visual quality. Experiments demonstrate the effect

Impact 4.8 · Import 4.0 · Pop 3.5
#217
arXiv AIScience 2026-04-16
by Tingjia Miao, Wenkai Jin, Muhua Zhang, Jinxin Tan et al.

The paradigm of agentic science requires AI systems to conduct robust reasoning and engage in long-horizon, autonomous exploration. However, current scientific benchmarks remain confined to domain knowledge comprehension and complex reasoning, failing to evaluate the exploratory nature and procedural complexity of real-world research. In this work, we present research-oriented evaluations in theoretical and computational physics, a natural testbed with comprehensive domain knowledge, complex reasoning, and verifiable end-to-end workflows without reliance on experiments. Here we introduce PRL-Bench (Physics Research by LLMs), a benchmark designed to systematically map the capability boundaries of LLMs in executing end-to-end physics research. Constructed from 100 curated papers from the latest issues of Physical Review Letters since August 2025 and validated by domain experts, PRL-Bench covers five major theory- and computation-intensive subfields of modern physics: astrophysics, condensed matter physics, high-energy physics, quantum information, and statistical physics. Each task in the benchmark is designed to replicate the core properties of authentic scientific research, includi

Impact 4.0 · Import 4.6 · Pop 3.5
#218
arXiv cs.NE 2026-04-17
by Saloni Garg, Ukant Jadia, Amit Sagtani, Kamal Kant Hiran

Automated classification of electrocardiogram (ECG) signals is a useful tool for diagnosing and monitoring cardiovascular diseases. This study compares three traditional machine learning algorithms (Decision Tree Classifier, Random Forest Classifier, and Logistic Regression) and three deep learning models (Simple Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Complex CNN (ECGLens)) for the classification of ECG signals from the PTB-XL dataset, which contains 12-lead recordings from normal patients and patients with various cardiac conditions. The deep learning models were trained on raw ECG signals, allowing them to automatically extract discriminative features. Data augmentation using the Stationary Wavelet Transform (SWT) was applied to enhance model performance, increase the diversity of training samples, and preserve the essential characteristics of the ECG signals. The models were evaluated using multiple metrics, including accuracy, precision, recall, F1-score, and ROC-AUC. The ECGLens model achieved the highest performance, with 80% classification accuracy and a 90% ROC-AUC. These findings demonstrate that deep learning architectures, particularly complex CNNs

Impact 4.0 · Import 4.6 · Pop 3.5
#219
arXiv cs.CV 2026-04-17
by Toby Perrett, Matthew Bouchard, William McCarthy

We introduce neuralCAD-Edit, the first benchmark for editing 3D CAD models collected from expert CAD engineers. Instead of text conditioning as in prior works, we collect realistic CAD editing requests by capturing videos of professional designers interacting directly with CAD models in CAD software while talking, pointing and drawing. We recruited ten consenting designers to contribute to this contained study. We benchmark leading foundation models against human CAD experts carrying out edits, and find a large performance gap in both automatic metrics and human evaluations. Even the best foundation model (GPT 5.2) scores 53% lower (absolute) than CAD experts in human acceptance trials, demonstrating the challenge of neuralCAD-Edit. We hope neuralCAD-Edit will provide a solid foundation against which 3D CAD editing approaches and foundation models can be developed. Code/data: https://autodeskailab.github.io/neuralCAD-Edit

Impact 4.0 · Import 4.6 · Pop 3.5
#220
arXiv Evals 2026-04-17
by Marie Maltais, Yejin Jeon, Min Ma, Shamsuddeen Hassan Muhammad et al.

Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translation dataset spanning Igbo, Hausa, Yorùbá, and Nigerian Pidgin paired with English. The dataset comprises approximately 50 hours of speech per language and captures substantial variation in speakers and accents, reflecting realistic multilingual and multi-accent conditions. With NaijaS2ST, we conduct a comprehensive benchmark of cascaded, end-to-end (E2E), and AudioLLM-based approaches across bidirectional translation settings. Our results show that audio LLMs with few-shot examples are more effective for speech-to-text translation than cascaded and end-to-end methods trained on fine-tuned data. However, for speech-to-speech translation, the cascaded and audio LLM paradigms yield comparable performance, indicating that there is still considerable room for improvement in developing targeted, task-specific models for this setting. By providing both a high-quality dataset and a systematic benchmark, we hope tha

Impact 4.0 · Import 4.6 · Pop 3.5
#221
arXiv Evals 2026-04-17
by J. Escorza, G. Pellicer, T. de Ara, J. Hurtado-Gallego et al.

Accurate electrical amplification is essential in molecular electronics for measuring conductance through atomic and molecular junctions, where currents often span several orders of magnitude. In this work, we present a systematic design and comparative analysis of four current-to-voltage ($I\text{--}V$) amplifier architectures: single-stage linear, series-linear, logarithmic, and multi-stage cascaded, specifically optimized for break junction (BJ) techniques, including scanning tunneling microscopy (STM-BJ) and mechanically controllable break junctions (MCBJ). Each configuration is evaluated based on sensitivity, noise performance, and dynamic range. Our results characterize the trade-offs between circuit complexity and noise, providing a robust framework and practical guidelines for selecting amplification schemes in quantum transport experiments.
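For orientation, the ideal relations behind two of the four architectures are standard op-amp results, not taken from the paper: a single-stage linear transimpedance amplifier with feedback resistor $R_f$, and a logarithmic stage whose diode/transistor feedback compresses a wide current range.

```latex
V_{\text{out}} = -I_{\text{in}} R_f
\qquad \text{(single-stage linear, feedback resistor } R_f\text{)}

V_{\text{out}} \approx -n V_T \ln\!\left(\frac{I_{\text{in}}}{I_s}\right)
\qquad \text{(logarithmic, diode feedback; } V_T \text{ thermal voltage, } I_s \text{ saturation current)}
```

The linear stage trades dynamic range for low noise and simple calibration, which is the trade-off space the paper's comparison maps for break-junction currents spanning many orders of magnitude.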

Impact 4.0 · Import 4.6 · Pop 3.5
#222
arXiv Evals 2026-04-17
by Linnuo Zhang, Chihao Li, Jiajin Ge, Tatiana Azaryan et al.

We present results from two test beam campaigns that investigate the performance of straw tube detectors as potential candidates for an FCC-ee straw tracker. These studies were carried out at CERN using 150 GeV muon beams. Dedicated algorithms were developed to determine both single tube spatial resolution for the primary coordinate in the $r-φ$ plane and spatial resolution for the secondary coordinate along the tube direction within a straw chamber. Detection efficiency was also evaluated as a function of the extrapolated hit position for each tube. Both datasets showed consistent results for spatial resolutions and efficiency. Our findings will help establish benchmark performance metrics and provide valuable insight for future design, optimization, and construction of straw chambers for high-precision tracking applications.

Impact 4.0 · Import 4.6 · Pop 3.5
#224
arXiv Recurrent/LinAttn 2026-04-13
by Sahil Goyal, Swayam Agrawal, Gautham Govind Anil, Prateek Jain et al.

We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose the idea of Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model's depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference capability with dynamic trade-offs between computational cost and generation quality, with the same parameter count. ELT significantly shifts the efficiency frontier for visual synthesis. With $4\times$ reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of $2.0$ on class-conditional ImageNet $256 \times 256$ and FVD of $72.8$ on class-c
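The weight-sharing and intra-loop distillation ideas can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's architecture: the block body (a single residual tanh layer), the sizes, and the loop counts are assumptions, and a real implementation would use full transformer blocks and backpropagate the consistency loss.

```python
import numpy as np

def looped_block(x, W, loops):
    # One weight-shared block applied `loops` times with a residual
    # connection: parameter count is independent of effective depth.
    for _ in range(loops):
        x = x + np.tanh(x @ W)  # stand-in for a full transformer block
    return x

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))  # the single shared weight matrix
x = rng.normal(size=(2, 8))

student = looped_block(x, W, loops=2)   # intermediate-loop ("student") config
teacher = looped_block(x, W, loops=6)   # maximum-loop ("teacher") config
# ILSD-style consistency term: penalize divergence between exit points.
ilsd_loss = float(np.mean((student - teacher) ** 2))
```

Because every loop count reuses the same weights, any intermediate exit is a valid model at inference time, which is what makes the compute-quality trade-off elastic.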

Impact 4.0 · Import 4.0 · Pop 3.5
#225
arXiv Robotics-Embodied 2026-04-13
by Simon Brezovnik, Janez Žerovnik

Roman-type domination parameters form an important class of graph invariants that model protection and resource allocation problems on networks. Among them, $[k]$-Roman domination provides a unified framework that generalizes Roman, double Roman, and higher-order variants. In this paper we investigate the $[k]$-Roman domination number of cylindrical grids $C_m\Box P_n$ and derive several new constructive upper bounds. Our approach combines three complementary techniques: linear periodic constructions, uniform ceiling-type labelings, and packing-based refinements. We first analyze the case $C_9\Box P_n$, where these three families of bounds can be compared explicitly and their relative efficiency is shown to depend on the parameter $k$. We then extend the linear constructions to cylindrical grids whose circumference is a multiple of one of the values $3,\dots,9$, obtaining a unified family of upper bounds for $C_{rt}\Box P_n$. Motivated by the asymptotic behavior of these estimates, we further derive general upper bounds depending only on the residue class of $m$ modulo $5$, which apply to all cylindrical grids. As a consequence, we obtain explicit estimates for the double Roman dom

Impact 4.0 · Import 4.0 · Pop 3.5
#226
arXiv Robotics-Embodied 2026-04-13
by Cedric Le Gentil, Daniil Lisus, Timothy D. Barfoot

Recently, the robotics community has regained interest in radar-based perception and state estimation. A 2D imaging radar provides dense 360° information about the environment. Despite the radar antenna's cone of emission and reception, the collected data is generally assumed to be limited to the plane orthogonal to the radar's spinning axis. Accordingly, most methods based on 2D imaging radars only perform SE(2) state estimation. This paper presents 3DRO, an extension of the SE(2) Direct Radar Odometry (DRO) framework to perform state estimation in SE(3). While still assuming planarity of the data through DRO's 2D velocity estimates, it integrates 3D gyroscope measurements over SO(3) to estimate SE(3) ego motion. While simple, this approach provides lidar-level odometry accuracy as demonstrated using 643 km of data from the Boreas-RT dataset.
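The SO(3) gyroscope-integration step can be sketched as dead-reckoning with the rotation-matrix exponential. This is a minimal numpy sketch under assumed inputs; the fusion with DRO's 2D velocity estimates that 3DRO performs is omitted here.

```python
import numpy as np

def so3_exp(w):
    # Rodrigues' formula: map a rotation vector to a rotation matrix.
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def integrate_gyro(R0, rates, dt):
    # Dead-reckon orientation from body-frame angular rates over SO(3).
    R = R0.copy()
    for w in rates:
        R = R @ so3_exp(np.asarray(w) * dt)
    return R

# A constant 0.1 rad/s yaw rate for 10 s should give a 1.0 rad yaw.
R = integrate_gyro(np.eye(3), [(0.0, 0.0, 0.1)] * 100, dt=0.1)
```

Composing per-step exponentials keeps the estimate on the rotation manifold, unlike naively accumulating Euler angles.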

Impact 4.0 · Import 4.0 · Pop 3.5
#227
arXiv Robotics-Embodied 2026-04-13
by Jinhui Ye, Ning Gao, Senqiao Yang, Jinliang Zheng et al.

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex, as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$α$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$α$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $π_{0.5}$ by 20\% on the public real-world RoboChallenge benc

Impact 4.0 · Import 4.0 · Pop 3.5
#228
arXiv Robotics-Embodied 2026-04-13
by Krishna Jaganathan, Patricio Vela

Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupted. We thus present GeomPrompt, a lightweight cross-modal adaptation module that synthesizes a task-driven geometric prompt from RGB alone for the fourth channel of a frozen RGB-D semantic segmentation model, without depth supervision. We further introduce GeomPrompt-Recovery, an adaptation module that compensates for degraded depth by predicting the fourth channel correction relevant for the frozen segmenter. Both modules are trained solely with downstream segmentation supervision, enabling recovery of the geometric prior useful for segmentation, rather than estimating depth signals. On SUN RGB-D, GeomPrompt improves over RGB-only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion, while remaining competitive with strong monocular depth estimators. For degraded depth, GeomPrompt-Recovery consistently improves robustness, yielding gains up to +3.6 mIoU under severe depth corruptions. GeomPrompt is also substantially more efficient than monocular depth baselines, reaching 7.8 ms latency versus 38.3 ms and 71.9 ms.

Impact 4.0 · Import 4.0 · Pop 3.5
#229
arXiv Robotics-Embodied 2026-04-13
by Siyuan Xu, Tianshi Wang, Fengling Li, Lei Zhu et al.

Vision-Language-Action models (VLAs) have demonstrated strong potential for embodied AI, yet their deployment on resource-limited robots remains challenging due to high memory and computational demands. While Post-Training Quantization (PTQ) provides an efficient solution, directly applying PTQ to VLAs often results in severe performance degradation during sequential control. We identify temporal error accumulation as a key factor, where quantization perturbations at the vision-language-to-action interface are progressively amplified, leading to kinematic drift in executed trajectories. To address this issue, we propose Drift-Aware Post-Training Quantization (DA-PTQ), which formulates quantization as a drift-aware optimization problem over sequential decision processes. DA-PTQ consists of two components: (1) Cross-Space Representation Compensation, which mitigates structured distortions between multimodal representations and action space to improve action consistency, and (2) Motion-Driven Mixed-Precision Allocation, which assigns bit-widths by minimizing trajectory-level motion errors. Extensive experiments show that DA-PTQ significantly reduces kinematic drift and achieves compar

Impact 4.0 · Import 4.0 · Pop 3.5
#234
arXiv SSM 2026-04-14
by Farzaneh Jafari, Stefano Berretti, Anup Basu

We introduce SEDTalker, an emotion-aware framework for speech-driven 3D facial animation that leverages frame-level speech emotion diarization to achieve fine-grained expressive control. Unlike prior approaches that rely on utterance-level or manually specified emotion labels, our method predicts temporally dense emotion categories and intensities directly from speech, enabling continuous modulation of facial expressions over time. The diarized emotion signals are encoded as learned embeddings and used to condition a speech-driven 3D animation model based on a hybrid Transformer-Mamba architecture. This design allows effective disentanglement of linguistic content and emotional style while preserving identity and temporal coherence. We evaluate our approach on a large-scale multi-corpus dataset for speech emotion diarization and on the EmoVOCA dataset for emotional 3D facial animation. Quantitative results demonstrate strong frame-level emotion recognition performance and low geometric and temporal reconstruction errors, while qualitative results show smooth emotion transitions and consistent expression control. These findings highlight the effectiveness of frame-level emotion diar

Impact 4.0 · Import 4.0 · Pop 3.5
#235
arXiv SSM 2026-04-14
by Chaoyao Shen, Linfeng Jiang, Yixian Shen, Tao Xu et al.

Deep learning (DL) compilers rely on cost models and auto-tuning to optimize tensor programs for target hardware. However, existing approaches depend on large offline datasets, incurring high collection costs and offering suboptimal transferability across platforms. In this paper, we introduce TCL, a novel efficient and transferable compiler framework for fast tensor program optimization across diverse hardware platforms to address these challenges. Specifically, TCL is built on three core enablers: (1) the RDU Sampler, a data-efficient active learning strategy that selects only 10% of tensor programs by jointly optimizing Representativeness, Diversity, and Uncertainty, substantially reducing data collection costs while maintaining near-original model accuracy; (2) a new Mamba-based cost model that efficiently captures long-range schedule dependencies while achieving a favorable trade-off between prediction accuracy and computational cost through reduced parameterization and lightweight sequence modeling; and (3) a continuous knowledge distillation framework that effectively and progressively transfers knowledge across multiple hardware platforms while avoiding the parameter explos
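A greedy selection that trades off uncertainty against diversity gives the flavor of the RDU sampler. The scoring weights, the feature space, and the greedy scheme below are assumptions standing in for the paper's exact criterion, not its implementation.

```python
import numpy as np

def rdu_select(feats, uncertainty, k, lam=0.5):
    # Greedy pick: prefer programs the cost model is uncertain about
    # that also lie far (in feature space) from anything already
    # selected -- a simple diversity/representativeness proxy.
    selected = [int(np.argmax(uncertainty))]
    for _ in range(k - 1):
        dists = np.min(
            np.linalg.norm(feats[:, None, :] - feats[selected][None, :, :],
                           axis=-1),
            axis=1)
        score = lam * uncertainty + (1.0 - lam) * dists
        score[selected] = -np.inf  # never re-pick a selected program
        selected.append(int(np.argmax(score)))
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 4))   # tensor-program embeddings (assumed)
unc = rng.uniform(size=50)         # per-program cost-model uncertainty
picked = rdu_select(feats, unc, k=5)
```

Selecting a small, informative subset like this is what lets an active-learning cost model approach full-dataset accuracy while measuring only a fraction of the candidate programs.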

Impact 4.0 · Import 4.0 · Pop 3.5
#236
arXiv SSM 2026-04-14
by Silvaneo V. dos Santos, Layla Parast

Surrogate markers offer the potential to reduce the burden of data collection by replacing costly or invasive primary outcomes with more accessible measurements, provided that they can faithfully indicate the effectiveness of a treatment. However, appropriate evaluation of a surrogate is particularly complex in longitudinal studies, where both outcomes and surrogates can evolve dynamically over time and interest lies not only in the treatment effect at one time, but rather treatment effects that may vary along the entire trajectory. In this paper, we develop a statistical framework for surrogate evaluation when both the surrogate and primary outcome are measured over time. Specifically, within the potential outcomes framework, we propose a formal causal definition of the proportion of the treatment effect on the longitudinal primary outcome that is explained by the treatment effect on the longitudinal surrogate. For estimation, we leverage state-space models, together with the Kalman filter and smoother, enabling efficient estimation of treatment effects under realistic conditions of temporal evolution and patient-level variability. We introduce a nonparametric bootstrap strategy f
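For intuition, the kind of state-space estimation this framework builds on can be sketched with a scalar linear-Gaussian Kalman filter. The model, parameters, and scalar setting are illustrative only; the paper's estimator operates on longitudinal surrogate and outcome trajectories, not a single series.

```python
import numpy as np

def kalman_filter(ys, a=1.0, c=1.0, q=0.1, r=0.5, m0=0.0, p0=1.0):
    # Scalar linear-Gaussian model: x_t = a*x_{t-1} + N(0, q),
    #                               y_t = c*x_t     + N(0, r).
    m, p = m0, p0
    means, variances = [], []
    for y in ys:
        m_pred, p_pred = a * m, a * a * p + q          # predict
        gain = p_pred * c / (c * c * p_pred + r)       # Kalman gain
        m = m_pred + gain * (y - c * m_pred)           # update mean
        p = (1.0 - gain * c) * p_pred                  # update variance
        means.append(m)
        variances.append(p)
    return np.array(means), np.array(variances)

# Constant observations pull the filtered mean toward 1.0.
means, variances = kalman_filter(np.ones(50))
```

A smoother pass (running backward over the filtered estimates) would then yield the full-trajectory posteriors the paper uses for treatment-effect estimation.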

Impact 4.0 · Import 4.0 · Pop 3.5
#237
arXiv SSM 2026-04-14
by Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, Wei Yang et al.

Motion reasoning serves as the cornerstone of multi-object tracking (MOT), as it enables consistent association of targets across frames. However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. By allowing objects with similar motion states to mutually constrain and refine each other, our framework stabilizes noisy trajectories and infers plausible motion continuity even when a target is occluded. To realize this concept, we design HyperSSM, an architecture that integrates Hypergraph computation and a State Space Model (SSM) for unified spatial-temporal reasoning. The Hypergraph module captures spatial motion correlations through dynamic hyperedges, while the SSM enforces temporal smoothness via structured state transitions. This synergistic design enables simultaneous optimization of spatial consensus and temporal coherence, resulting in robus

Impact 4.0 · Import 4.0 · Pop 3.5
#238
arXiv SSM 2026-04-14
by Mohammed Asad, Mohit Bajpai, Sudhir Singh, Rahul Katarya

Accurate characterization of suspicious breast lesions in mammography is important for early diagnosis and treatment planning. While Convolutional Neural Networks (CNNs) are effective at extracting local visual patterns, they are less suited to modeling long-range dependencies. Vision Transformers (ViTs) address this limitation through self-attention, but their quadratic computational cost can be prohibitive. This paper presents a hybrid architecture that combines EfficientNetV2-M for local feature extraction with Vision Mamba, a State Space Model (SSM), for efficient global context modeling. The proposed model performs binary classification of abnormality-centered mammography regions of interest (ROIs) from the CBIS-DDSM dataset into benign and malignant classes. By combining a strong CNN backbone with a linear-complexity sequence model, the approach achieves strong lesion-level classification performance in an ROI-based setting.

Impact 4.0 · Import 4.0 · Pop 3.5
#239
arXiv SSM 2026-04-14
by Yongbo Shu, Wenzhao Xie, Shanhu Yao, Zirui Xin et al.

Multi-parametric prostate MRI -- combining T2-weighted, apparent diffusion coefficient, and high b-value diffusion-weighted sequences -- is central to non-invasive detection of clinically significant prostate cancer, yet in routine practice individual sequences may be missing or degraded by motion, artifacts, or abbreviated protocols. Existing multi-modal fusion strategies typically assume complete inputs and entangle modality-specific information at early layers, offering limited resilience when one channel is corrupted or absent. We propose Modality-Isolated Gated Fusion (MIGF), an architecture-agnostic module that maintains separate modality-specific encoding streams before a learned gating stage, combined with modality dropout training to enforce compensation behavior under incomplete inputs. We benchmark six bare backbones and assess MIGF-equipped models under seven missing-modality and artifact scenarios on the PI-CAI dataset (1,500 studies, fold-0 split, five random seeds). Among bare backbones, nnUNet provided the strongest balance of performance and stability. MIGF improved ideal-scenario Ranking Score for UNet, nnUNet, and Mamba by 2.8%, 4.6%, and 13.4%, respectively; the

Impact 4.0 · Import 4.0 · Pop 3.5
#240
arXiv Robotics-Embodied 2026-04-14
by Jaden Park, Jungtaek Kim, Jongwon Jeong, Robert D. Nowak et al.

Language Model (LM) agents are increasingly used in complex open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability to both explore the problem space and exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions without access to the agent's internal policy remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG). The map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy-agnostic evaluation, we design a metric to quantify exploration and exploitation errors from an agent's actions. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively and show both exploration and exploitation can be significantly improved through minimal

Impact 4.0 · Import 4.0 · Pop 3.5
#241
arXiv Robotics-Embodied 2026-04-14
by Ziyuan Xia, Jingyi Xu, Chong Cui, Yuanhong Yu et al.

Training embodied AI agents depends critically on the visual fidelity of simulation environments and the ability to model dynamic humans. Current simulators rely on mesh-based rasterization with limited visual realism, and their support for dynamic human avatars, where available, is constrained to mesh representations, hindering agent generalization to human-populated real-world scenarios. We present Habitat-GS, a navigation-centric embodied AI simulator extended from Habitat-Sim that integrates 3D Gaussian Splatting scene rendering and drivable Gaussian avatars while maintaining full compatibility with the Habitat ecosystem. Our system implements a 3DGS renderer for real-time photorealistic rendering and supports scalable 3DGS asset import from diverse sources. For dynamic human modeling, we introduce a Gaussian avatar module that enables each avatar to simultaneously serve as a photorealistic visual entity and an effective navigation obstacle, allowing agents to learn human-aware behaviors in realistic settings. Experiments on point-goal navigation demonstrate that agents trained on 3DGS scenes achieve stronger cross-domain generalization, with mixed-domain training being the mos

Impact 4.0 · Import 4.0 · Pop 3.5
#242
arXiv Robotics-Embodied 2026-04-14
by Huy Anh Nguyen, Feras Dayoub, Minh Hoai

We address the challenging task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is particularly challenging due to subtle hand motion variations near contact, frequent occlusions, fine-grained manipulation patterns, and the inherent motion dynamics of first-person perspectives. To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced `high-see') that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft label that emphasizes hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish between near-contact and actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanni
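The cross-attention from hand-region features over surrounding context can be sketched with plain softmax attention. Shapes and feature sources below are assumed for illustration; the actual module operates on spatiotemporal deep features rather than random vectors.

```python
import numpy as np

def cross_attention(queries, context, d):
    # Softmax attention: each hand-region query mixes context features,
    # with every row of the attention matrix summing to one.
    scores = queries @ context.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ context, w

rng = np.random.default_rng(0)
hand_feats = rng.normal(size=(3, 16))   # features from hand crops (assumed)
ctx_feats = rng.normal(size=(10, 16))   # surrounding-context tokens (assumed)
out, attn = cross_attention(hand_feats, ctx_feats, d=16)
```

Letting hand tokens query the surrounding scene is what allows the model to pick up contact-relevant context (e.g., the approached object) that the hand crop alone does not contain.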

Impact 4.0 · Import 4.0 · Pop 3.5
#243
arXiv Robotics-Embodied 2026-04-14
by Yuan Shui, Yandong Guan, Zhanwei Zhang, Juncheng Hu et al.

Parametric Computer-Aided Design (CAD) of articulated assemblies is essential for product development, yet generating these multi-part, movable models from high-level descriptions remains unexplored. To address this, we propose ArtiCAD, the first training-free multi-agent system capable of generating editable, articulated CAD assemblies directly from text or images. Our system divides this complex task among four specialized agents: Design, Generation, Assembly, and Review. One of our key insights is to predict assembly relationships during the initial design stage rather than the assembly stage. By utilizing a Connector that explicitly defines attachment points and joint parameters, ArtiCAD determines these relationships before geometry generation, effectively bypassing the limited spatial reasoning capabilities of current LLMs and VLMs. To further ensure high-quality outputs, we introduce validation steps in the generation and assembly stages, accompanied by a cross-stage rollback mechanism that accurately isolates and corrects design- and code-level errors. Additionally, a self-evolving experience store accumulates design knowledge to continuously improve performance on future t

Impact 4.0 · Import 4.0 · Pop 3.5
#244
AI Alignment Forum 2026-04-14

It turns out that Anthropic accidentally trained against the chain of thought of Claude Mythos Preview in around 8% of training episodes. This is at least the second independent incident in which Anthropic accidentally exposed their model's CoT to the oversight signal. In more powerful systems, this kind of failure would jeopardize safely navigating the intelligence explosion. It's crucial to build good processes to ensure development is executed according to plan, especially as human oversight becomes spread thin over increasing amounts of potentially untrusted and sloppy AI labor. This particular failure is also directly harmful, because it significantly reduces our confidence that the model's reasoning trace is monitorable (reflective of the AI's intent to misbehave). [1] I'm grateful that Anthropic has transparently reported on this issue as much as they have, allowing for outside scrutiny. I want to encourage them to continue to do so. Thanks to Carlo Leonardo Attubato, Buck Shlegeris, Fabien Roger, Arun Jose, and Aniket Chakravorty for feedback and discussion. See also previous discussion here. Incidents: A technical error affecting Mythos, Opus 4.6, and Sonnet 4.6. This is the most recent incident. In the Claude Mythos alignment risk update, Anthropic report having accidentally exposed approximately 8% of chains of thought to the reward function. The technical error that caused this issue went unnoticed for a long time: it also affected Opus 4.6 and Sonnet 4.6. A technical error affecting Opus 4.6. A previous technical error also exposed CoTs to the oversight signal in Opus 4.6 (affecting a much smaller number of episodes concentrated near the end of training in this case). After noticing this error, Anthropic should have reworked their development process to mak

Impact 4.0 · Import 4.0 · Pop 3.5
#248
DefenseScoop 2026-04-14

Senior military officials are reviewing all the insights and input gained from the Marine Corps’ first Generative and Agentic AI Workshop that was held in Quantico last month, according to Maj. Christopher Clark, who said the service aims to use those findings to inform integration plans and reliable tech deployments. Clark is deeply involved in this work as the Corps’ AI lead. He spotlighted some of the major, early takeaways from that well-attended, four-day event during Scoop News Group’s AITalks conference on Tuesday. “We are doing an analysis of our workshop. We had 350 individuals from across the department and 102 companies [participate] in it — and we have a lot of feedback that we’ve collected. So, we’re planning to use that to then drive our priorities,” Clark said. “How do we ensure that we understand where AI needs to be integrated to solve [operational problems] and how do you do it in a way that’s safe and effective?” As they rapidly mature, advanced AI and machine learning models offer the military advantages associated with speed, data processing, targeting and more. Yet they also introduce serious uncertainty and risks of potential technical failures, unpredictable behavior, and unintended escalation. The term “generative AI” or genAI refers to systems that can respond to human prompts by generating a range of media and content. Agentic AI completes tasks by interacting with data and digital tools with little human supervision, and can apply genAI. The Marine Corps’ AI workshop was originally slated for November 2025. The service announced in October that it would be postponed due to the lapse in appropriations related to the federal government shutdown that disrupted agencies’ work at that time. Held at Quantico’s Warner Hall March 9-12, th

Impact 4.0 · Import 4.0 · Pop 3.5
#249
arXiv SSM 2026-04-15
by Runwei Lin, Ying Wang

Heart rate variability (HRV) analysis is important for the assessment of autonomic cardiovascular regulation. The inverse Gaussian process (IGP) has been widely used for beat-to-beat HRV modeling, as it gives a physiologically relevant interpretation of the heart depolarization process. A key challenge in IGP-based heartbeat modeling is the accurate estimation of time-varying parameters. In this study, we investigated whether recurrent neural networks (RNNs) can be used for IGP parameter identification and thereby enhance probabilistic modeling of R-R dynamics. Specifically, four representative RNN architectures, namely, GRU, LSTM, Structured State Space sequence model (S4), and Mamba, were evaluated using the Kolmogorov-Smirnov statistics. The results demonstrate the possibility of combining neural sequence models with the IGP framework for beat-wise R-R series modeling. This approach provides a flexible basis for probabilistic HRV modeling and for future incorporation of more complex physiological mechanisms and dynamic conditions.
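The inverse Gaussian (Wald) log-density at the core of IGP heartbeat models can be written directly. This sketch implements only the standard density, with parameter values chosen for illustration; in the paper the RNNs predict the time-varying parameters that feed it.

```python
import numpy as np

def inverse_gaussian_logpdf(x, mu, lam):
    # Log density of the inverse Gaussian (Wald) distribution with
    # mean mu and shape lam, a standard model for R-R intervals.
    return (0.5 * np.log(lam / (2.0 * np.pi * x**3))
            - lam * (x - mu) ** 2 / (2.0 * mu**2 * x))

# Sanity check: the density should integrate to ~1 (Riemann sum).
xs = np.linspace(1e-4, 60.0, 100_000)
mass = float(np.sum(np.exp(inverse_gaussian_logpdf(xs, 1.0, 2.0)))
             * (xs[1] - xs[0]))
```

Fitting then amounts to maximizing this log-likelihood over observed beat intervals, with `mu` and `lam` produced per beat by the sequence model.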

Impact 4.0 · Import 4.0 · Pop 3.5
#250
arXiv SSM 2026-04-15
by Xinjin Li, Jinghan Cao, Mengyue Wang, Yue Wu et al.

Traffic forecasting requires modeling complex temporal dynamics and long-range spatial dependencies over large sensor networks. Existing methods typically face a trade-off between expressiveness and efficiency: Transformer-based models capture global dependencies well but suffer from quadratic complexity, while recent selective state-space models are computationally efficient yet less effective at modeling spatial interactions in graph-structured traffic data. We propose FAST, a unified framework that combines attention and state-space modeling for scalable spatiotemporal traffic forecasting. FAST adopts a Temporal-Spatial-Temporal architecture, where temporal attention modules capture both short- and long-term temporal patterns, and a Mamba-based spatial module models long-range inter-sensor dependencies with linear complexity. To better represent heterogeneous traffic contexts, FAST further introduces a learnable multi-source spatiotemporal embedding that integrates historical traffic flow, temporal context, and node-level information, together with a multi-level skip prediction mechanism for hierarchical feature fusion. Experiments on PeMS04, PeMS07, and PeMS08 show that FAST co

Impact 4.0 · Import 4.0 · Pop 3.5
#251
arXiv SSM 2026-04-15
by Jinlin You, Muyu Li, Xudong Zhao

Existing Vision Mamba-based RGB-Event (RGBE) tracking methods suffer from using static state transition matrices, which fail to adapt to variations in event sparsity. This rigidity leads to imbalanced modeling (underfitting sparse event streams and overfitting dense ones), degrading cross-modal fusion robustness. To address these limitations, we propose MambaTrack, a multimodal and efficient tracking framework built upon a Dynamic State Space Model (DSSM). Our contributions are twofold. First, we introduce an event-adaptive state transition mechanism that dynamically modulates the state transition matrix based on event stream density. A learnable scalar governs the state evolution rate, enabling differentiated modeling of sparse and dense event flows. Second, we develop a Gated Projection Fusion (GPF) module for robust cross-modal integration. This module projects RGB features into the event feature space and generates adaptive gates from event density and RGB confidence scores. These gates precisely control the fusion intensity, suppressing noise while preserving complementary information. Experiments show that MambaTrack achieves state-of-the-art performance on the FE108 and FELT
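The event-adaptive state transition can be illustrated with a scalar diagonal SSM whose discretization step is modulated by event density. The modulation rule, constants, and scalar state here are assumptions standing in for the paper's learnable mechanism.

```python
import numpy as np

def adaptive_ssm_scan(u, density, a=-1.0, alpha=2.0, dt=0.1):
    # Scalar diagonal SSM x' = a*x + u, discretized with a step size
    # scaled by per-frame event density: dense frames evolve the state
    # faster, sparse frames more slowly (hypothetical modulation rule).
    x, ys = 0.0, []
    for u_t, d_t in zip(u, density):
        step = dt * (1.0 + alpha * d_t)   # learnable scalar in the paper
        a_bar = np.exp(a * step)          # zero-order-hold discretization
        x = a_bar * x + (1.0 - a_bar) * u_t
        ys.append(x)
    return np.array(ys)

# A dense event stream tracks the input faster than a sparse one.
ys_dense = adaptive_ssm_scan([1.0] * 10, [1.0] * 10)
ys_sparse = adaptive_ssm_scan([1.0] * 10, [0.0] * 10)
```

Tying the effective step size to event density is one way a single set of SSM weights can avoid underfitting sparse streams while not overfitting dense ones.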

Impact 4.0 · Import 4.0 · Pop 3.5
#252
arXiv Robotics-Embodied 2026-04-15
by Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang et al.

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part, equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA sig

Impact 4.0 · Import 4.0 · Pop 3.5
#253
arXiv Robotics-Embodied 2026-04-15
by Zhen Liu, Xinyu Ning, Zhe Hu, Xinxin Xie et al.

Recent vision-language-action (VLA) systems have demonstrated strong capabilities in embodied manipulation. However, most existing VLA policies rely on limited observation windows and end-to-end action prediction, which makes them brittle in long-horizon, memory-dependent tasks with partial observability, occlusions, and multi-stage dependencies. Such tasks require not only precise visuomotor control, but also persistent memory, adaptive task decomposition, and explicit recovery from execution failures. To address these limitations, we propose a dual-system framework for long-horizon embodied manipulation. Our framework explicitly separates high-level semantic reasoning from low-level motor execution. A high-level planner, implemented as a VLM-based agentic module, maintains structured task memory and performs goal decomposition, outcome verification, and error-driven correction. A low-level executor, instantiated as a VLA-based visuomotor controller, carries out each sub-task through diffusion-based action generation conditioned on geometry-preserving filtered observations. Together, the two systems form a closed loop between planning and execution, enabling memory-aware reasoning

Impact 4.0 · Import 4.0 · Pop 3.5
#254
arXiv Robotics-Embodied 2026-04-15
by Xueyang Zhou, Yihan Sun, Xijie Gong, Guiyao Tie et al.

Embodied AI research is increasingly moving beyond single-task, single-environment policy learning toward multi-task, multi-scene, and multi-model settings. This shift substantially increases the engineering overhead and development time required for stages such as evaluation environment construction, trajectory collection, model training, and evaluation. To address this challenge, we propose a new paradigm for embodied AI development in which users express goals and constraints through conversation, and the system automatically plans and executes the development workflow. We instantiate this paradigm with EmbodiedClaw, a conversational agent that turns high-frequency, high-cost embodied research activities, including environment creation and revision, benchmark transformation, trajectory synthesis, model evaluation, and asset expansion, into executable skills. Experiments on end-to-end workflow tasks, capability-specific evaluations, human researcher studies, and ablations show that EmbodiedClaw reduces manual engineering effort while improving executability, consistency, and reproducibility. These results suggest a shift from manual toolchains to conversationally executable workf

Impact 4.0 · Import 4.0 · Pop 3.5
#255
arXiv Robotics-Embodied 2026-04-15
by Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda

Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to
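The directional, annealed consistency term can be pictured as an auxiliary penalty added to the PPO loss. The sketch below is an assumption-laden illustration (cosine alignment, linear anneal schedule, weight `w0`), not the paper's actual regularizer:

```python
import numpy as np

def consistency_penalty(agent_actions, vla_actions, step, anneal_steps=10_000, w0=0.1):
    """Directional action-consistency term (a sketch, not the paper's code).

    Encourages the RL agent's actions to point in the same direction as the
    VLA suggestions via cosine similarity, with a weight that anneals linearly
    to zero so the agent is free to deviate later in training.
    """
    a = np.asarray(agent_actions, dtype=float)
    g = np.asarray(vla_actions, dtype=float)
    cos = np.sum(a * g, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(g, axis=-1) + 1e-8
    )
    weight = w0 * max(0.0, 1.0 - step / anneal_steps)  # annealed over time
    return weight * float(np.mean(1.0 - cos))          # 0 when aligned

# total_loss = ppo_loss + consistency_penalty(actions, vla_suggestions, step)
```

Because the term is a soft penalty rather than a hard imitation constraint, the agent can still override poor VLA suggestions whenever the PPO objective rewards doing so.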

Impact 4.0 · Import 4.0 · Pop 3.5
#256
arXiv Robotics-Embodied 2026-04-15
by Hanxuan Chen, Jie Zheng, Siqi Yang, Tianle Zeng et al.

Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically-grounded reasoning. The survey systematically reviews the ecosystem of essential resources (simulators, datasets, and evaluation metrics) that facilitates standardized research. Furthermore, we conduct a critical analysis of the primary challenges impeding real-world deployment: the simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and the efficient deployment of large model

Impact 4.0 · Import 4.0 · Pop 3.5
#257
arXiv Robotics-Embodied 2026-04-15
by Jingjing Qian, Zeyuan He, Chen Shi, Lei Xiao et al.

Coordinating navigation and manipulation with robust performance is essential for embodied AI in complex indoor environments. However, as tasks extend over long horizons, existing methods often struggle due to catastrophic forgetting, spatial inconsistency, and rigid execution. To address these issues, we propose ESCAPE (Episodic Spatial Memory Coupled with an Adaptive Policy for Execution), operating through a tightly coupled perception-grounding-execution workflow. For robust perception, ESCAPE features a Spatio-Temporal Fusion Mapping module to autoregressively construct a depth-free, persistent 3D spatial memory, alongside a Memory-Driven Target Grounding module for precise interaction mask generation. To achieve flexible action, our Adaptive Execution Policy dynamically orchestrates proactive global navigation and reactive local manipulation to seize opportunistic targets. ESCAPE achieves state-of-the-art performance on the ALFRED benchmark, reaching 65.09% and 60.79% success rates in test seen and unseen environments with step-by-step instructions. By reducing redundant exploration, our ESCAPE attains substantial improvements in path-length-weighted metrics and maintains robu

Impact 4.0 · Import 4.0 · Pop 3.5
#258
arXiv Robotics-Embodied 2026-04-15
by Peiwen Jiang, Yushuo Feng, Jiajia Guo, Chao-Kai Wen et al.

The increasing deployment of agentic artificial intelligence (AI) systems has intensified the demand for efficient agent-to-agent communication, particularly over bandwidth-limited wireless links. In embodied AI applications, agents must exchange task-related information under strict latency and reliability constraints. Existing agent communication methods primarily focus on connectivity and protocol efficiency, but lack effective mechanisms to reduce physical-layer transmission overhead while preserving task semantics. To address this challenge, this paper proposes a semantic agent communication framework that reduces communication overhead while maintaining task performance and shared understanding among agents. An LLM-based semantic processor is first introduced to reorganize and condense agent-generated messages by extracting task-relevant semantic content. To cope with information loss introduced by aggressive message reduction, an importance-aware semantic transmission strategy is developed, which adaptively protects semantic components according to their task importance. Furthermore, a task-specific knowledge base is incorporated as long-term semantic memory to support recurr

Impact 4.0 · Import 4.0 · Pop 3.5
#259
arXiv RL 2026-04-15
by Qing Yan, Wenyu Yang, Yufei Wang, Wenhao Ma et al.

Traditional esports scouting workflows rely heavily on manual video review and aggregate performance metrics, which often fail to capture the nuanced decision-making patterns necessary to determine if a prospect fits a specific tactical archetype. To address this, we reframe style-based player evaluation in esports as an Inverse Reinforcement Learning (IRL) problem. In this paper, we introduce a novel player selection framework that learns professional-specific reward functions from logged gameplay demonstrations, allowing organizations to rank candidates by their stylistic alignment with a target star player. Our proposed architecture utilizes a multimodal, two-branch intake: one branch encodes structured state-action trajectories derived from high-resolution in-game telemetry, while the second encodes temporally aligned tactical pseudo-commentary generated by Vision-Language Models (VLMs) from broadcast footage. These representations are fused and evaluated via a Generative Adversarial Imitation Learning (GAIL) objective, where a discriminator learns to capture the unique mechanical and tactical signatures of elite professionals. By transitioning from generic skill estimation to

Impact 4.0 · Import 4.0 · Pop 3.5
#260
arXiv RL 2026-04-15
by Kristian Holme, Jean Rabault, Ricardo Vinuesa, Mikael Mortensen

Rotating detonation engines (RDEs) are a promising propulsion concept that may offer higher thermodynamic efficiency and specific impulse than conventional systems, but nonlinear phenomena, including transitions to oscillatory or chaotic propagation modes, can hinder practical operation. Deep Reinforcement Learning (DRL) has emerged as a promising method for controlling complex nonlinear dynamics such as those observed in RDEs. However, the multi-timescale nature of the RDE system makes direct application of DRL challenging. We address this challenge by reformulating the DRL problem in a moving reference frame that follows the detonation-wave pattern, making the wave structure appear quasi-steady to the agent. This reformulation enables scale separation between fast detonation propagation and slower operating-mode dynamics. We train DRL controllers to modulate spatially segmented injection pressure in a one-dimensional reduced-order RDE model and induce rapid transitions between different mode-locked states. Across a range of actuation periods, initial states, and target modes, controllers trained in the moving frame learn more reliably than those trained in a stationary frame and
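The moving-reference-frame trick can be illustrated on a 1-D periodic field: locate the wave and re-index the observation so the wave always sits at the same position. This is a minimal sketch of the idea (peak detection via `argmax` is an assumption; the paper works with a reduced-order RDE model):

```python
import numpy as np

def to_moving_frame(field, center_idx=None):
    """Re-center a 1-D periodic field so the detonation peak is at index 0.

    A minimal sketch of the moving-reference-frame idea: locate the wave
    (here, simply the field maximum) and roll the array so the agent always
    observes a quasi-steady wave structure, regardless of wave position.
    """
    field = np.asarray(field, dtype=float)
    peak = int(np.argmax(field)) if center_idx is None else center_idx
    return np.roll(field, -peak)

# A travelling pulse looks identical to the agent wherever it currently is:
x = np.linspace(0.0, 2 * np.pi, 64, endpoint=False)
pulse = np.exp(np.cos(x))           # smooth periodic "wave", peak at index 0
moved = np.roll(pulse, 17)          # same wave, shifted 17 cells downstream
```

In the co-moving view, the fast wave propagation is factored out, leaving only the slower operating-mode dynamics for the agent to act on.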

Impact 4.0 · Import 4.0 · Pop 3.5
#261
arXiv RL 2026-04-15
by Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu et al.

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights,
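The Negative Sample Reinforcement effect can be illustrated with a toy REINFORCE update on a categorical distribution: negative-reward samples have their log-probability pushed down, pruning incorrect outputs. This is a toy analogue of the mechanism named in the abstract, not PreRL itself:

```python
import numpy as np

def nsr_update(logits, sampled, rewards, lr=0.5):
    """One REINFORCE-style update on a categorical 'policy' (toy sketch).

    Negative-reward samples (reward = -1) have their log-probability pushed
    down, pruning incorrect outputs -- a toy analogue of Negative Sample
    Reinforcement, not the paper's method.
    """
    logits = np.asarray(logits, dtype=float).copy()
    for idx, r in zip(sampled, rewards):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad = -probs            # d log p(idx) / d logits, off-index part
        grad[idx] += 1.0         # on-index part: onehot - probs
        logits += lr * r * grad  # ascend for r > 0, descend for r < 0
    return logits

logits0 = np.zeros(4)
# Token 3 is judged wrong (r = -1); its probability should shrink.
logits1 = nsr_update(logits0, sampled=[3], rewards=[-1.0])
p1 = np.exp(logits1) / np.exp(logits1).sum()
```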

Impact 4.0 · Import 4.0 · Pop 3.5
#264
AI Alignment Forum 2026-04-15

Many people—especially AI company employees [1] —believe current AI systems are well-aligned in the sense of genuinely trying to do what they're supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of instructions). [2] I disagree. Current AI systems seem pretty misaligned to me in a mundane behavioral sense: they oversell their work, downplay or fail to mention problems, stop working early and claim to have finished when they clearly haven't, and often seem to "try" to make their outputs look good while actually doing something sloppy or incomplete. These issues mostly occur on more difficult/larger tasks, tasks that aren't straightforward SWE tasks, and tasks that aren't easy to programmatically check. Also, when I apply AIs to very difficult tasks in long-running agentic scaffolds, it's quite common for them to reward-hack / cheat (depending on the exact task distribution)—and they don't make the cheating clear in their outputs. AIs typically don't flag these cheats when doing further work on the same project and often don't flag these cheats even when interacting with a user who would obviously want to know, probably both because the AI doing further work is itself misaligned and because it has been convinced by write-ups that contain motivated reasoning or misleading descriptions. There is a more general "slippery" quality to working with current frontier AI systems. AIs seem to be improving at making their outputs seem good and useful faster than they're improving at making their outputs actually good and useful, especially in hard-to-check domains. The experience of working with current AIs (especially on hard-to-check tasks) often feels like you're making decent/great progress but then later you realize that things were

Impact 4.0 · Import 4.0 · Pop 3.5
#271
DefenseScoop 2026-04-15

Deputy Defense Secretary Steve Feinberg’s March 9 memorandum sets an ambitious deadline for Pentagon and military leaders to transition the Maven Smart System (MSS) into a formal program of record by the end of this fiscal year. That Palantir-supplied, AI-enabled platform fuses disparate military systems and intelligence sources into one interface that rapidly integrates data and compresses the military’s processes for finding and striking targets. The transition will allow for more stable funding streams and put the technology on an accelerated path to becoming an even more widely adopted fixture embedded across U.S. military operations. MSS has been deployed in live-fire exercises and is actively being used in real-world conflicts by multiple combatant commands. “This designation will streamline acquisition, ensure use of rigorous testing and evaluation standards, and create clear lines of accountability for performance, oversight, and management,” Feinberg wrote in the four-page memo, which was obtained by DefenseScoop. Notably, among a variety of instructions he provided to several Defense Department undersecretariats to fully establish the new program of record, Feinberg stated that if the Office of the Under Secretary of Defense for Research and Engineering identifies any transition task that will “negatively impact delivery of MSS capability to the warfighter,” the directorate should make any “necessary adjustments to the timelines” outlined on his behalf. Spokespersons from the military’s 11 combatant commands and DOD were largely unforthcoming when responding to questions from DefenseScoop over the last few weeks regarding whether or how the transition could disrupt their ongoing usage of MSS. “We decline to comment citing operational security,” a Pentag

Impact 4.0 · Import 4.0 · Pop 3.5
#272
DefenseScoop 2026-04-15

Federal agencies have always faced the challenge of finding and retaining talent that meets current and future mission needs. For too long, however, federal and defense agency heads have lacked the tools and incentives to view workforce planning as the agile, proactive, strategic discipline it should be: one that drives, not just supports, an agency’s mission. For many agencies, workforce planning in today’s environment remains a static, reactive and performative exercise — designed mostly to satisfy HR and budget requirements to justify full-time employee (FTE) counts, fill vacant seats, or contract skills the department lacks. (Mike Houlihan is Vice President at Workday Government.) Unfortunately, that has led to what amounts to “good enough” workforce planning, which, in reality, has weakened human capital management across the federal government and left agencies poorly prepared for today’s modern world challenges. Building resilience amidst workforce shifts: The significant “succession event” of 2025, which saw approximately 348,000 employees — roughly 10% of the federal workforce — transition out of their roles, has underscored the vital importance of agile workforce planning. This sudden shift in the talent landscape highlighted a unique challenge: the need to preserve deep institutional memory while simultaneously scaling new, specialized skill sets. During this period of rapid change, the value of a modernized skills-and-talent map became clear. For agencies to navigate shifting work demands effectively, having real-time insights into the specialized knowledge of their workforce is no longer just an advantage — it is the essential foundation for resilient operations. Modern workforce planning is not just about identifying critical ski

Impact 4.0 · Import 4.0 · Pop 3.5
#273
DefenseScoop 2026-04-15

From a business systems perspective, the Pentagon operates in essentially the same manner that any other multinational conglomerate does. Yet, the department has long seen its requirements, even those for everyday business operations, as unique. To meet those supposedly one-of-a-kind needs, it has been willing to foot a hefty upcharge to build highly customized software applications. Overengineered and underpowered since their inception, many of these services are now unwieldy productivity blockers that lag years behind offerings available to the private sector that are simultaneously powerful, usable, and cost-effective. With few unclassified exceptions, the national security community needs the same corporate software as the rest of us. But that’s not something it readily admitted until recently. Commercial-first is no longer just a suggestion: software is the perfect place to begin practicing the “buy before build” mantra: buy what you can, and build only what you must. If acquirers believe there is no suitable commercial offering, they are required to produce documented market research to prove that no existing product can meet the requirement, even with reasonable modifications. Law and policy are unambiguous at this point: commercial must be the default, not the afterthought. Silicon Valley and Washington have never been more closely aligned. The federal government’s senior-most appointed roles are filled with executives from the private sector, hailing from deep tech unicorns, venture capital, and investment banking. These leaders bring exquisite knowledge of what it really means to move fast, and how to replicate it. They’ve been wielding it to great effect. Commercial awards across the enterprise have quietly demonstrated something long suspected: when the

Impact 4.0 · Import 4.0 · Pop 3.5
#274
arXiv cs.RO 2026-04-16
by Yirui Wang, Xiuwei Xu, Angyuan Ma, Bingyao Yu et al.

Manipulation policies deployed in uncontrolled real-world scenarios are faced with great in-category geometric diversity of everyday objects. In order to function robustly under such variations, policies need to work in a category-level manner, i.e. knowing how to interact with any object in a certain category, instead of only a specific one seen during training. This in-category generalizability is usually nurtured with shape-diversified training data; however, manually collecting such a corpus of data is infeasible due to the requirement of intense human labor and large collections of divergent objects at hand. In this paper, we propose ShapeGen, a data generation method that aims at generating shape-variated manipulation data in a simulator-free and 3D manner. ShapeGen decomposes the process into two stages: Shape Library curation and Function-Aware Generation. In the first stage, we train spatial warpings between shapes mapping points to points that correspond functionally, and aggregate 3D models along with the warpings into a plug-and-play Shape Library. In the second stage, we design a pipeline that, leveraging established Libraries, requires only minimal human annotation to

Impact 4.0 · Import 4.0 · Pop 3.5
#275
arXiv cs.RO 2026-04-16
by Longchen Niu, Andrew Nasif, Gennaro Notomista

This paper presents a novel density control framework for multi-robot systems with spatial safety and energy sustainability guarantees. Stochastic robot motion is encoded through the Fokker-Planck Partial Differential Equation (PDE) at the density level. Control Lyapunov and control barrier functions are integrated with PDEs to enforce target density tracking, obstacle region avoidance, and energy sufficiency over multiple charging cycles. The resulting quadratic program enables fast in-the-loop implementation that adjusts commands in real time. A multi-robot experiment and extensive simulations were conducted to demonstrate the effectiveness of the controller under localization and motion uncertainties.
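The safety-filtering QP has a closed form in the scalar case. The sketch below is not the paper's density-level, multi-robot formulation; it only shows how a single barrier constraint a·u ≥ b minimally adjusts a nominal command:

```python
def cbf_qp_1d(u_nom, a, b):
    """Solve min_u (u - u_nom)^2  s.t.  a*u >= b, for scalar u and a != 0.

    Scalar stand-in for a control-barrier-function QP: the nominal command
    passes through unchanged when it is already safe, and is otherwise
    projected onto the boundary of the safe set (minimal correction).
    """
    if a == 0:
        raise ValueError("constraint gradient must be nonzero")
    if a * u_nom >= b:
        return u_nom            # nominal command already satisfies safety
    return b / a                # minimal correction onto the constraint

# e.g. the barrier requires u >= 0.5 (a = 1, b = 0.5):
safe = cbf_qp_1d(1.2, 1.0, 0.5)     # unchanged
fixed = cbf_qp_1d(-0.3, 1.0, 0.5)   # clipped up to the boundary
```

The same "pass through when safe, project when not" behavior is what makes such QPs cheap enough to run inside the control loop.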

Impact 4.0 · Import 4.0 · Pop 3.5
#276
arXiv cs.RO 2026-04-16
by Kaleb Ben Naveed, Manveer Singh, Devansh R. Agrawal, Dimitra Panagou

Planning safe trajectories under model uncertainty is a fundamental challenge. Robust planning ensures safety by considering worst-case realizations, yet ignores uncertainty reduction and leads to overly conservative behavior. Actively reducing uncertainty on-the-fly during a nominal mission defines the dual control problem. Most approaches address this by adding a weighted exploration term to the cost, tuned to trade off the nominal objective and uncertainty reduction, but without formal consideration of when exploration is beneficial. Moreover, safety is enforced in some methods but not in others. We study a budget-constrained dual control problem, where uncertainty is reduced subject to safety and a mission-level cost budget that limits the allowable degradation in task performance due to exploration. In this work, we propose Dual-gatekeeper, a framework that integrates robust planning with active exploration under formal guarantees of safety and budget feasibility. The key idea is that exploration is pursued only when it provides a verifiable improvement without compromising safety or violating the budget, enabling the system to balance immediate task performance with long-term

Impact 4.0 · Import 4.0 · Pop 3.5
#277
arXiv cs.RO 2026-04-16
by Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto et al.

We present a new robotic foundation model, called $π_{0.7}$, that can enable strong out-of-the-box performance in a wide range of scenarios. $π_{0.7}$ can follow diverse language instructions in unseen environments, including multi-stage tasks with various kitchen appliances, provide zero-shot cross-embodiment generalization, for example enabling a robot to fold laundry without seeing the task before, and perform challenging tasks such as operating an espresso machine out of the box at a level of performance that matches much more specialized RL-finetuned models. The main idea behind $π_{0.7}$ is to use diverse context conditioning during training. This conditioning information, contained in the prompt, makes it possible to steer the model precisely to perform many tasks with different strategies. It is conditioned not just on a language command that describes what it should do, but on additional multimodal information that also describes the manner or strategy in which it should do it, including metadata about task performance and subgoal images. This enables $π_{0.7}$ to use very diverse data, including demonstrations, potentially suboptimal (autonomous) data including failures,

Impact 4.0 · Import 4.0 · Pop 3.5
#278
arXiv cs.RO 2026-04-16
by Yang Zhou, Yash Shetye, Long Quang, Devon Super et al.

Deploying learned multi-robot models on heterogeneous robots remains challenging due to hardware heterogeneity, communication constraints, and the lack of a unified execution stack. This paper presents NeuroMesh, a multi-domain, cross-platform, and modular decentralized neural inference framework that standardizes observation encoding, message passing, aggregation, and task decoding in a unified pipeline. NeuroMesh combines a dual-aggregation paradigm for reduction- and broadcast-based information fusion with a parallelized architecture that decouples cycle time from end-to-end latency. Our high-performance C++ implementation leverages Zenoh for inter-robot communication and supports hybrid GPU/CPU inference. We validate NeuroMesh on a heterogeneous team of aerial and ground robots across collaborative perception, decentralized control, and task assignment, demonstrating robust operation across diverse task structures and payload sizes. We plan to release NeuroMesh as an open-source framework to the community.

Impact 4.0 · Import 4.0 · Pop 3.5
#279
arXiv cs.RO 2026-04-16
by Skye Thompson, Ondrej Biza, George Konidaris

Given a demonstration, a robot should be able to generalize a skill to any object it encounters, but existing approaches to skill transfer often fail to adapt to objects with unfamiliar shapes. Motivated by examples of improved transfer from compositional modeling, we propose a method for improving transfer by decomposing objects into their constituent semantic parts. We leverage data-efficient generative shape models to accurately transfer interaction points from the parts of a demonstration object to a novel object. We autonomously construct an objective to optimize the alignment of those points on skill-relevant object parts. Our method generalizes to a wider range of object geometries than existing work, and achieves successful one-shot transfer for a range of skills and objects from a single demonstration, in both simulated and real environments.

Impact 4.0 · Import 4.0 · Pop 3.5
#280
arXiv cs.RO 2026-04-16
by Hilton Marques Souza Santana, João Carlos Virgolino Soares, Sven Goffin, Ylenia Nisticò et al.

Kalman filter-based algorithms are fundamental for mobile robots, as they provide a computationally efficient solution to the challenging problem of state estimation. However, they rely on two main assumptions that are difficult to satisfy in practice: (a) the system dynamics must be linear with Gaussian process noise, and (b) the measurement model must also be linear with Gaussian measurement noise. Previous works have extended assumption (a) to nonlinear spaces through the Invariant Extended Kalman Filter (IEKF), showing that it retains properties similar to those of the classical Kalman filter when the system dynamics are group-affine on a Lie group. More recently, the counterpart of assumption (b) for the same nonlinear setting was addressed in [1]. By means of the proposed Iterated Invariant Extended Kalman Filter (IterIEKF), the authors of that work demonstrated that the update step exhibits several compatibility properties of the classical linear Kalman filter. In this work, we introduce a novel open-source state estimation algorithm for legged robots based on the IterIEKF. The update step of the proposed filter relies solely on proprioceptive measurements, exploiting kinema
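For context, the classical linear measurement update that the invariant filters generalize (standard textbook form, not the IterIEKF itself):

```python
import numpy as np

def kalman_update(x, P, z, H, R):
    """One linear Kalman measurement update.

    x is the state estimate, P its covariance, z the measurement,
    H the (linear) measurement matrix, R the measurement noise covariance.
    """
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x + K @ (z - H @ x)              # corrected estimate
    P_new = (np.eye(len(x)) - K @ H) @ P     # reduced uncertainty
    return x_new, P_new

# Scalar example: prior N(0, 1), measurement 2.0 with unit noise
# splits the difference and halves the variance.
x, P = np.array([0.0]), np.array([[1.0]])
H, R = np.array([[1.0]]), np.array([[1.0]])
x1, P1 = kalman_update(x, P, np.array([2.0]), H, R)
```

The IEKF/IterIEKF line of work replaces the additive correction above with a group-consistent update on a Lie group, which is what preserves the classical filter's properties for group-affine dynamics.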

Impact 4.0 · Import 4.0 · Pop 3.5
#281
arXiv cs.RO 2026-04-16
by Yunfu Deng, Yuhao Li, Josiah P. Hanna

In recent years, reinforcement learning (RL) has shown remarkable success in robotics when a fast and accurate simulator is available for a given task. When using RL and simulation, more simulator realism is generally beneficial but becomes harder to obtain as robots are deployed in increasingly complex and widescale domains. In such settings, simulators will likely fail to model all relevant details of a given target task and this observation motivates the study of sim2real with simulators that leave out key task details. In this paper, we formalize and study the abstract sim2real problem: given an abstract simulator that models a target task at a coarse level of abstraction, how can we train a policy with RL in the abstract simulator and successfully transfer it to the real-world? Our first contribution is to formalize this problem using the language of state abstraction from the RL literature. This framing shows that an abstract simulator can be grounded to match the target task if the grounded abstract dynamics take the history of states into account. Based on the formalism, we then introduce a method that uses real-world task data to correct the dynamics of the abstract simula

Impact 4.0 · Import 4.0 · Pop 3.5
#282
arXiv cs.RO 2026-04-16
by Zhengdong Hong, Shenrui Wu, Haozhe Cui, Boyi Zhao et al.

3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/

Impact 4.0 · Import 4.0 · Pop 3.5
#283
arXiv cs.RO 2026-04-16
by Jakob Thumm, Marian Frei, Tianle Ni, Matthias Althoff et al.

We propose a framework for vision-based human pose estimation and motion prediction that gives conformal prediction guarantees for certifiably safe human-robot collaboration. Our framework combines aleatoric uncertainty estimation with OOD detection for high probabilistic confidence. To integrate our pipeline in certifiable safety frameworks, we propose conformal prediction sets for human motion predictions with high, valid confidence. We evaluate our pipeline on recorded human motion data and a real-world human-robot collaboration setting.

Impact 4.0 · Import 4.0 · Pop 3.5
#284
arXiv cs.RO 2026-04-16
by Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, Usman Nizamani et al.

We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes a new state-of-the-art performance in in-context imitation learning.
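The two successive vector-quantization levels can be sketched as nearest-centroid assignment followed by a fixed fine-to-coarse lookup. The codebooks below are illustrative toys, not learned HiST-AT codebooks:

```python
import numpy as np

def two_level_quantize(actions, fine_codebook, fine_to_coarse):
    """Two-level hierarchical action tokenization (illustrative sketch only).

    Level 1 assigns each action to its nearest fine-grained subcluster;
    level 2 maps that subcluster to a coarse cluster via a fixed lookup,
    mirroring the two successive vector-quantization levels in the abstract.
    """
    actions = np.asarray(actions, dtype=float)
    # nearest fine centroid per action (squared Euclidean distance)
    d = ((actions[:, None, :] - fine_codebook[None, :, :]) ** 2).sum(-1)
    fine_ids = d.argmin(axis=1)
    coarse_ids = fine_to_coarse[fine_ids]
    return fine_ids, coarse_ids

# Four fine subclusters grouped into two coarse clusters:
fine_codebook = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
fine_to_coarse = np.array([0, 0, 1, 1])
fine, coarse = two_level_quantize(
    [[0.05, 0.02], [5.05, 5.0]], fine_codebook, fine_to_coarse
)
```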

Impact 4.0 · Import 4.0 · Pop 3.5
#285
arXiv SSM 2026-04-16
by Jinlin You, Muyu Li, Xudong Zhao

Existing single-modal RGB trackers often face performance bottlenecks in complex dynamic scenes, while the introduction of event sensors offers new potential for enhancing tracking capabilities. However, most current RGB-event fusion methods, primarily designed in the spatial domain using convolutional, Transformer, or Mamba architectures, fail to fully exploit the unique temporal response and high-frequency characteristics of event data. To address this, we propose FreqTrack, a frequency-aware RGBE tracking framework that establishes complementary inter-modal correlations through frequency-domain transformations for more robust feature fusion. We design a Spectral Enhancement Transformer (SET) layer that incorporates multi-head dynamic Fourier filtering to adaptively enhance and select frequency-domain features. Additionally, we develop a Wavelet Edge Refinement (WER) module, which leverages learnable wavelet transforms to explicitly extract multi-scale edge structures from event data, effectively improving modeling capability in high-speed and low-light scenarios. Extensive experiments on the COESOT and FE108 datasets demonstrate that FreqTrack achieves highly competitive perfor
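The basic mechanism behind dynamic Fourier filtering, transform, rescale each frequency bin by a (notionally learnable) gate, transform back, can be sketched in one dimension. This is the general idea only; the SET layer itself is multi-head and considerably more involved:

```python
import numpy as np

def fourier_filter(feature, gate):
    """Frequency-domain feature filtering (a sketch of the general idea).

    Transforms a 1-D feature to the frequency domain, rescales each bin by
    a gate (learnable in a real layer, fixed here), and transforms back.
    """
    spec = np.fft.rfft(np.asarray(feature, dtype=float))
    return np.fft.irfft(spec * np.asarray(gate, dtype=float), n=len(feature))

# A signal with a low- and a high-frequency component; gating keeps only
# the two lowest of the 5 rfft bins an 8-sample signal has.
n = 8
t = np.arange(n)
signal = np.cos(2 * np.pi * t / n) + np.cos(2 * np.pi * 3 * t / n)
keep_low = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
low_only = fourier_filter(signal, keep_low)
```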

Impact 4.0 · Import 4.0 · Pop 3.5
#286
arXiv Recurrent/LinAttn 2026-04-16
by Junfeng Li, Wenyang Zhou, Xueheng Li, Xuanhua He et al.

In this work, we propose a Multigrain-aware Semantic Prototype Scanning paradigm for pan-sharpening, built upon a high-order RWKV architecture and a tri-token prompting mechanism derived from semantic clustering. Specifically, our method contains three key components: 1) Multigrain-aware Semantic Prototype Scanning. Although RWKV offers an efficient linear-complexity alternative to Transformers, its conventional bidirectional raster scanning is still semantic-agnostic and prone to positional bias. To address this issue, we introduce a semantic-driven scanning strategy that leverages locality-sensitive hashing to group semantically related regions and construct multi-grain semantic prototypes, enabling context-aware token reordering and more coherent global interaction. 2) Tri-token Prompt Learning. We design a tri-token prompting mechanism consisting of a global token, cluster-derived prototype tokens, and a learnable register token. The global and prototype tokens provide complementary semantic priors for RWKV modeling, while the register token helps suppress noisy and artifact-prone intermediate representations. 3) Invertible Q-Shift. To counteract spatial details, we apply center
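The locality-sensitive-hashing reordering step can be sketched with random hyperplane projections: each token gets a sign-pattern hash code, and sorting by code places related tokens next to each other before the linear scan. This is a generic LSH sketch, not the paper's multi-grain prototype construction:

```python
import numpy as np

def lsh_reorder(tokens, n_planes=4, seed=0):
    """Reorder tokens by locality-sensitive hashing (illustrative sketch).

    Random hyperplane projections give each token a hash code; a stable
    sort by code places tokens with the same code next to each other, so
    a linear scan visits related regions together. Returns the new order.
    """
    tokens = np.asarray(tokens, dtype=float)
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((tokens.shape[1], n_planes))
    bits = (tokens @ planes > 0).astype(int)       # sign pattern per token
    codes = bits @ (1 << np.arange(n_planes))      # pack bits into an int
    return np.argsort(codes, kind="stable")

# Two interleaved groups of tokens end up adjacent after reordering
# (identical tokens are guaranteed to share a hash code):
toks = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
order = lsh_reorder(toks)
```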

Impact 4.0 · Import 4.0 · Pop 3.5
#287
arXiv Robotics-Embodied 2026-04-16
by Chuanlin Liu, Dan Liu, Jie Guan, Chao Lian

We investigate optically induced phase transitions in the two-dimensional (2D) ferroelectric (FE) material Nb2O2I4 using real-time time-dependent density functional theory (rt-TDDFT). Our results demonstrate that tailored laser pulses can activate specific coherent phonon modes. Specifically, the anharmonic atomic distortions of the A1-1 and A1-2 modes at the Γ-point facilitate the reversal of in-plane polarization. By fine-tuning laser parameters, additional phonon modes at both the Y and Γ points are excited. The resulting nonequilibrium atomic dynamics enable the formation of previously unreported ferroic phases, including three antiferroelectric (AFE) phases and one ferrielectric (FiE) phase. Notably, these optically induced phases can be reverted to the initial FE state using appropriate techniques. This controllable reversibility among multiple ferroic phases positions 2D Nb2O2I4 as a highly promising candidate for next-generation electronic storage applications.

Impact 4.0 · Import 4.0 · Pop 3.5
#288
arXiv Robotics-Embodied 2026-04-16
by Runze Li, Hongyin Zhang, Junxi Jin, Qixin Zeng et al.

Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction, lacking the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value and dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from an exponential decay in the probability of feasible trajectories as the horizon increases. In contrast, laten

Impact 4.0 · Import 4.0 · Pop 3.5
#289
arXiv Robotics-Embodied 2026-04-16
by Tomoya Kamimura, Haruka Washiyama, Akihito Sano

Embodiment is a significant keyword in recent machine learning research. This study focused on the passive nature of the body of a biped robot to generate walking and running locomotion using model-based deep reinforcement learning. We constructed two models in a simulator, one with passive elements (e.g., springs) and the other, which is similar to general humanoids, without passive elements. The training of the model with passive elements was highly affected by the attractor of the system. As a result, although the trajectories quickly converged to limit cycles, it took a long time to obtain large rewards. However, thanks to the attractor-driven learning, the acquired locomotion was robust and energy-efficient. The results revealed that robots with passive elements could efficiently acquire high-performance locomotion by utilizing stable limit cycles generated through dynamic interaction between the body and ground. This study demonstrates the importance of implementing passive properties in the body for future embodied AI.

Impact 4.0 · Import 4.0 · Pop 3.5
#290
arXiv Robotics-Embodied 2026-04-16
by Yulu Gao, Bohao Zhang, Zongheng Tang, Jitong Liao et al.

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and

Impact 4.0 · Import 4.0 · Pop 3.5
#291
arXiv RL 2026-04-16
by Yang Wu, Jinhong Yu, Jingwei Xiong, Zhimin Tao et al.

The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabScience, a proactive LLM assistant designed to enhance biomedical collaboration between AI systems and human experts through timely, context-aware interventions. At the core of our method is PULI (Positive-Unlabeled Learning-to-Intervene), a novel framework trained with a reinforcement learning objective to determine when and how to intervene in streaming scientific discussions, by leveraging the team's project proposal and long- and short-term conversational memory. To support this work, we introduce BSDD (Biomedical Streaming Dialogue Dataset), a new benchmark of simulated research discussion dialogues with intervention points derived from PubMed articles. Experimental results show that PULI significantly outperforms existing baselines in both intervention precision and collaborative task utility, highlighting the potential of proactive

Impact 4.0 · Import 4.0 · Pop 3.5
#292
arXiv RL 2026-04-16
by Alexander Peysakhovich, William Berman

Consider an auto-regressive model that produces outputs x (e.g., answers to questions, molecules) each of which can be summarized by an attribute vector y (e.g., helpfulness vs. harmlessness, or bio-availability vs. lipophilicity). An arbitrary reward function r(y) encodes tradeoffs between these properties. Typically, tilting the model's sampling distribution to increase this reward is done at training time via reinforcement learning. However, if the reward function changes, re-alignment requires re-training. In this paper, we show that a reward weighted classifier-free guidance (RCFG) can act as a policy improvement operator in this setting, approximating tilting the sampling distribution by the Q function. We apply RCFG to molecular generation, demonstrating that it can optimize novel reward functions at test time. Finally, we show that using RCFG as a teacher and distilling into the base policy to serve as a warm start significantly speeds up convergence for standard RL.
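The tilting operator described above is easy to sketch. The snippet below is a toy illustration only: it uses the standard classifier-free-guidance combination with a reward-derived weight substituted for a fixed guidance scale, and all names (`rcfg_score`, `reward_weight`) are ours, not the paper's.

```python
import numpy as np

def rcfg_score(score_uncond, score_cond, reward_weight):
    # Classifier-free-guidance-style tilt: scale the gap between the
    # conditional and unconditional scores by a reward-derived weight.
    # Weight 0 recovers the base (unconditional) model; larger weights
    # push sampling toward the reward-conditioned distribution,
    # with no retraining required.
    return score_uncond + reward_weight * (score_cond - score_uncond)

base = np.array([0.1, -0.2])          # hypothetical unconditional score
cond = np.array([0.5, 0.3])           # hypothetical reward-conditioned score
tilted = rcfg_score(base, cond, 0.5)  # halfway between the two scores
```

Because the reward enters only through the weight, swapping in a new reward function changes the sampler at test time rather than forcing a fresh RL run, which is the property the abstract highlights.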

Impact 4.0 · Import 4.0 · Pop 3.5
#293
arXiv RL 2026-04-16
by Yao Tong, Jiayuan Ye, Anastasia Borovykh, Reza Shokri

Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
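For readers unfamiliar with the task family, shortest-path planning on a grid is the canonical composable problem this kind of environment is built around. A minimal 4-connected BFS version (our toy, not the paper's environment) looks like:

```python
from collections import deque

def shortest_path(grid, start, goal):
    # Breadth-first search on a 4-connected grid (0 = free, 1 = wall);
    # returns the list of cells on a shortest path, or None if unreachable.
    rows, cols = len(grid), len(grid[0])
    prev, queue = {start: None}, deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:  # walk the predecessor chain back to the start
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    return None

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
route = shortest_path(grid, (0, 0), (2, 0))  # detours around the wall row
```

In the paper's terms, spatial transfer corresponds to solving unseen `grid` layouts, while length scaling corresponds to growing the grid and hence the path horizon.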

Impact 4.0 · Import 4.0 · Pop 3.5
#294
arXiv RL 2026-04-16
by Lukas Helff, Quentin Delfosse, David Steinmann, Ruben Härle et al.

As Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for scaling reasoning capabilities in LLMs, a new failure mode emerges: LLMs gaming verifiers. We study this phenomenon on inductive reasoning tasks, where models must induce and output logical rules. We find that RLVR-trained models systematically abandon rule induction. Instead of learning generalizable patterns (e.g., ``trains carrying red cars go east''), they enumerate instance-level labels, producing outputs that pass verifiers without capturing the relational patterns required by the task. We show that this behavior is not a failure of understanding but a form of reward hacking: imperfect verifiers that check only extensional correctness admit false positives. To detect such shortcuts, we introduce Isomorphic Perturbation Testing (IPT), which evaluates a single model output under both extensional and isomorphic verification, where the latter enforces invariance under logically isomorphic tasks. While genuine rule induction remains invariant, shortcut strategies fail. We find that shortcut behavior is specific to RLVR-trained reasoning models (e.g., GPT-5, Olmo3) and absent in non-RLVR m

Impact 4.0 · Import 4.0 · Pop 3.5
#295
arXiv RL 2026-04-16
by Bowen Ping, Zijun Chen, Tingfeng Hui, Qize Yu et al.

Reinforcement Learning (RL) has emerged as a critical driver for enhancing the reasoning capabilities of Large Language Models (LLMs). While recent advancements have focused on reward engineering or data synthesis, few studies exploit the model's intrinsic representation characteristics to guide the training process. In this paper, we first observe the presence of high-magnitude activations within the query and key vectors when processing long contexts. Drawing inspiration from model quantization -- which establishes the criticality of such high-magnitude activations -- and the insight that long-context reasoning inherently exhibits a sparse structure, we hypothesize that these weights serve as the pivotal drivers for effective model optimization. Based on this insight, we propose LongAct, a strategy that shifts from uniform to saliency-guided sparse updates. By selectively updating only the weights associated with these significant activations, LongAct achieves an approximate 8% improvement on LongBench v2 and enhances generalization on the RULER benchmark. Furthermore, our method exhibits remarkable universality, consistently boosting performance across diverse RL algorithms such

Impact 4.0 · Import 4.0 · Pop 3.5
#296
arXiv RL 2026-04-16
by Zhiyuan Zhai, Wenjing Yan, Xiaodan Shao, Xin Wang

Does reinforcement learning genuinely expand what LLM agents can do, or merely make them more reliable? For static reasoning, recent work answers the second: base and RL pass@k curves converge at large k. We ask whether this holds for agentic tool use, where T rounds of interaction enable compositional strategies that re-sampling cannot recover. We introduce PASS@(k,T), a two-dimensional metric that jointly varies sampling budget k and interaction depth T, separating capability expansion from efficiency improvement. Our main finding is that, contrary to the static-reasoning result, tool-use RL genuinely enlarges the capability boundary: the RL agent's pass-curve pulls above the base model's and the gap widens at large k rather than converging. The expansion is specific to compositional, sequential information gathering; on simpler tasks RL behaves as prior work predicts. Under matched training data, supervised fine-tuning regresses the boundary on the same compositional tasks, isolating self-directed exploration as the causal factor. Mechanism analysis shows RL reweights the base strategy distribution toward the subset whose downstream reasoning more often yields a correct answer,
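Under one natural reading of the metric (ours; the paper's estimator may differ), PASS@(k,T) can be computed with the standard unbiased pass@k estimator, with the interaction depth T deciding which of the n recorded attempts count as successes:

```python
from math import comb

def pass_at_k(n, c, k):
    # Standard unbiased pass@k: probability that at least one of k samples
    # drawn without replacement from n attempts (c of them correct) succeeds.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_k_T(success_round, k, T):
    # success_round[i] = interaction round at which attempt i first solved
    # the task, or None if it never did.  Capping at T turns the 1-D pass@k
    # curve into a 2-D surface over (k, T).
    n = len(success_round)
    c = sum(1 for r in success_round if r is not None and r <= T)
    return pass_at_k(n, c, k)

# Four attempts succeeding at rounds 2, 5, never, 3; budget k=2, depth T=3:
score = pass_at_k_T([2, 5, None, 3], k=2, T=3)  # 1 - C(2,2)/C(4,2) = 5/6
```

Sweeping k at fixed T reproduces the familiar pass@k curve; sweeping T at fixed k isolates what extra interaction rounds buy, which is the separation the abstract relies on.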

Impact 4.0 · Import 4.0 · Pop 3.5
#297
arXiv RL 2026-04-16
by Mathias Dus

We present a geometric framework for Reinforcement Learning (RL) that views policies as maps into the Wasserstein space of action probabilities. First, we define a Riemannian structure induced by stationary distributions, proving its existence in a general context. We then define the tangent space of policies and characterize the geodesics, specifically addressing the measurability of vector fields mapped from the state space to the tangent space of probability measures over the action space. Next, we formulate a general RL optimization problem and construct a gradient flow using Otto's calculus. We compute the gradient and the Hessian of the energy, providing a formal second-order analysis. Finally, we illustrate the method with numerical examples for low-dimensional problems, computing the gradient directly from our theoretical formalism. For high-dimensional problems, we parameterize the policy using a neural network and optimize it based on an ergodic approximation of the cost.
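For orientation, Otto's calculus identifies the gradient flow of an energy functional $F$ in Wasserstein space with a continuity equation; the textbook form (stated here for reference, not the paper's specific derivation) is:

```latex
\partial_t \mu_t = \nabla \cdot \left( \mu_t \, \nabla \frac{\delta F}{\delta \mu}(\mu_t) \right)
```

where $\delta F / \delta \mu$ is the first variation of $F$; the gradient and Hessian computations described in the abstract live within this framework.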

Impact 4.0 · Import 4.0 · Pop 3.5
#298
arXiv RL 2026-04-16
by Seokwon Lee, Jaeyoung Sim, Sihyun Kim, Yuhsing Li et al.

Recent advances in query optimization have shifted from traditional rule-based and cost-based techniques towards machine learning-driven approaches. Among these, reinforcement learning (RL) has attracted significant attention due to its ability to optimize long-term performance by learning policies over query planning. However, existing RL-based query optimizers often exhibit unstable performance at the level of individual queries, including severe performance regressions, and require prolonged training to reach the plan quality of expert, cost-based optimizers. These shortcomings make learned query optimizers difficult to deploy in practice and remain a major barrier to their adoption in production database systems. To address these challenges, we present RELOAD, a robust and efficient learned query optimizer for database systems. RELOAD focuses on (i) robustness, by minimizing query-level performance regressions and ensuring consistent optimization behavior across executions, and (ii) efficiency, by accelerating convergence to expert-level plan quality. Through extensive experiments on standard benchmarks, including Join Order Benchmark, TPC-DS, and Star Schema Benchmark, RELOAD

Impact 4.0 · Import 4.0 · Pop 3.5
#299
arXiv RL 2026-04-16
by Xiaoyi Dong, Xi Sheryl Zhang, Jian Cheng

Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/MFPolicy/MFPO.

Impact 4.0 · Import 4.0 · Pop 3.5
#300
arXiv GenMedia 2026-04-16
by Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen et al.

Flow matching models have emerged as a powerful framework for realistic image generation by learning to reverse a corruption process that progressively adds Gaussian noise. However, because noise is injected in the latent domain, its impact on different frequency components is non-uniform. As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. We introduce a two-branch architecture: (1) a frequency branch that separately processes low- and high-frequency components to capture global structure and refine textures and edges, and (2) a spatial branch that synthesizes images in the latent domain, guided by the frequency branch's output. By explicitly integrating frequency information into the generation process, FreqFlow ensures that both large-scale coherence and fine-grained details are effectivel

Impact 4.0 · Import 4.0 · Pop 3.5
#301
arXiv GenMedia 2026-04-16
by Yuxin Liu, Yiqing Dong, Wenxue Yu, Zhan Wu et al.

Medical image denoising (MID) lacks absolutely clean images for supervision, leading to a noisy reference problem that fundamentally limits denoising performance. Existing simulated-supervised discriminative learning (SimSDL) and simulated-supervised generative learning (SimSGL) treat noisy references as clean targets, causing suboptimal convergence or reference-biased learning, while self-supervised learning (SSL) imposes restrictive noise assumptions that are seldom satisfied in realistic MID scenarios. We propose \textbf{RelativeFlow}, a flow matching framework that learns from heterogeneous noisy references and drives inputs from arbitrary quality levels toward a unified high-quality target. RelativeFlow reformulates flow matching by decomposing the absolute noise-to-clean mapping into relative noisier-to-noisy mappings, and realizes this formulation through two key components: 1) consistent transport (CoT), a displacement map that constrains relative flows to be components of and progressively compose a unified absolute flow, and 2) simulation-based velocity field (SVF), which constructs a learnable velocity field using modality-specific degradation operators to support differ

Impact 4.0 · Import 4.0 · Pop 3.5
#302
arXiv GenMedia 2026-04-16
by Zhanhao Liang, Tao Yang, Jie Wu, Chengjian Feng et al.

This paper focuses on the alignment of flow matching models with human preferences. A promising way is fine-tuning by directly backpropagating reward gradients through the differentiable generation process of flow matching. However, backpropagating through long trajectories results in prohibitive memory costs and gradient explosion. Therefore, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from reward to early generation steps. Specifically, we shorten the long trajectory into only two steps by designing two consecutive leaps, each skipping multiple ODE sampling steps and predicting future latents in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign leads to efficient and stable model updates at any generation step. To better use such shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of g

Impact 4.0 · Import 4.0 · Pop 3.5
#303
arXiv GenMedia 2026-04-16
by Leyi Wu, Pengjun Fang, Kai Sun, Yazhou Xing et al.

Video generation has advanced rapidly, with recent methods producing increasingly convincing animated results. However, existing benchmarks, largely designed for realistic videos, struggle to evaluate animation-style generation with its stylized appearance, exaggerated motion, and character-centric consistency. Moreover, they rely on fixed prompt sets and rigid pipelines, offering limited flexibility for open-domain content and custom evaluation needs. To address this gap, we introduce AnimationBench, the first systematic benchmark for evaluating animation image-to-video generation. AnimationBench operationalizes the Twelve Basic Principles of Animation and IP Preservation into measurable evaluation dimensions, together with Broader Quality Dimensions including semantic consistency, motion rationality, and camera motion consistency. The benchmark supports both a standardized close-set evaluation for reproducible comparison and a flexible open-set evaluation for diagnostic analysis, and leverages visual-language models for scalable assessment. Extensive experiments show that AnimationBench aligns well with human judgment and exposes animation-specific quality differences overlook

Impact 4.0 · Import 4.0 · Pop 3.5
#304
arXiv GenMedia 2026-04-16
by Onno Niemann, Gonzalo Martínez Muñoz, Alberto Suárez Gonzalez

Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker--Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as often the best results are obtained with weaker FP regularization. In this paper, we investigate whether simpler penalty terms can provide similar benefits. We empirically analyze several lightweight regularizers, study their effect on FP residuals and generation quality, and show that the benefits of FP regularization are available at substantially lower computational cost. Our code is available at https://github.com/OnnoNiemann/fp_diffusion_analysis.
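The constraint at issue is the Fokker--Planck equation for a forward SDE $\mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}W$; its textbook form (stated here for reference) is:

```latex
\partial_t p_t(x) = -\nabla \cdot \bigl( f(x,t)\, p_t(x) \bigr) + \tfrac{1}{2}\, g(t)^2 \, \Delta p_t(x)
```

The FP residuals the abstract analyzes measure how far the model-implied density evolution deviates from this equation.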

Impact 4.0 · Import 4.0 · Pop 3.5
#305
arXiv GenMedia 2026-04-16
by Yuzhuo Chen, Zehua Ma, Han Fang, Hengyi Wang et al.

The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present **Flow of Truth**, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as *the motion of pixels through time rather than the synthesis of frames*. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes ac

Impact 4.0 · Import 4.0 · Pop 3.5
#306
arXiv GenMedia 2026-04-16
by Chisatao Kumada, Satoru Hiwa, Tomoyuki Hiroyasu

Interactive Evolutionary Computation (IEC) provides a powerful framework for optimizing subjective criteria such as human preferences and aesthetics, yet it suffers from a fundamental limitation: in high-dimensional generative representations, defining crossover in a semantically consistent manner is difficult, often leading to a mutation-dominated search. In this work, we explicitly define crossover in diffusion models. We propose Diffusion crossover, which formulates evolutionary recombination as step-wise interpolation of noise sequences in the reverse process of Denoising Diffusion Probabilistic Models (DDPMs). By applying spherical linear interpolation (Slerp) to the noise sequences associated with selected parent images, the proposed method generates offspring that inherit characteristics from both parents while preserving the geometric structure of the diffusion process. Furthermore, controlling the time-step range of interpolation enables a principled trade-off between diversity (exploration) and convergence (exploitation). Experimental results using PCA analysis and perceptual similarity metrics (LPIPS) demonstrate that Diffusion crossover produces perceptually smooth and
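The recombination step is concrete enough to sketch. Below is a generic Slerp over two parents' noise tensors (our minimal version; the paper applies this per reverse-process step and only within a chosen time-step range):

```python
import numpy as np

def slerp(noise_a, noise_b, t):
    # Spherical linear interpolation between two noise tensors, so the
    # offspring noise keeps roughly the norm statistics DDPM sampling
    # expects (a plain lerp would shrink it toward the origin).
    a, b = noise_a.ravel(), noise_b.ravel()
    cos = np.clip(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), -1.0, 1.0)
    theta = np.arccos(cos)
    if theta < 1e-6:  # near-parallel parents: fall back to lerp
        return (1 - t) * noise_a + t * noise_b
    return (np.sin((1 - t) * theta) * noise_a + np.sin(t * theta) * noise_b) / np.sin(theta)

rng = np.random.default_rng(0)
parent_a = rng.standard_normal((4, 4))  # stand-ins for two parents' per-step noise
parent_b = rng.standard_normal((4, 4))
child = slerp(parent_a, parent_b, 0.5)  # offspring noise for this step
```

Setting `t` near 0 or 1 biases the offspring toward one parent, and restricting which reverse-process steps are interpolated is what gives the diversity-versus-convergence knob the abstract describes.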

Impact 4.0 · Import 4.0 · Pop 3.5
#307
arXiv PostTraining 2026-04-16
by Yang Li, Zirui Zhang, Yang Liu, Chengzhi Mao

Current large language models reason in isolation. Although it is common to sample multiple reasoning paths in parallel, these trajectories do not interact, and often fail in the same redundant ways. We introduce LACE, a framework that transforms reasoning from a collection of independent trials into a coordinated, parallel process. By repurposing the model architecture to enable cross-thread attention, LACE allows concurrent reasoning paths to share intermediate insights and correct one another during inference. A central challenge is the absence of natural training data that exhibits such collaborative behavior. We address this gap with a synthetic data pipeline that explicitly teaches models to communicate and error-correct across threads. Experiments show that this unified exploration substantially outperforms standard parallel search, improving reasoning accuracy by over 7 points. Our results suggest that large language models can be more effective when parallel reasoning paths are allowed to interact.

Impact 4.0 · Import 4.0 · Pop 3.5
#308
arXiv PostTraining 2026-04-16
by Nassima M. Bouzid, Dehao Yuan, Nam H. Nguyen, Mayana Pereira

LLM-based simulators offer a promising path for generating complex synthetic data where traditional differentially private (DP) methods struggle with high-dimensional user profiles. But can LLMs faithfully reproduce statistical distributions from DP-protected inputs? We evaluate this using PersonaLedger, an agentic financial simulator, seeded with DP synthetic personas derived from real user statistics. We find that PersonaLedger achieves promising fraud detection utility (AUC 0.70 at epsilon=1) but exhibits significant distribution drift due to systematic LLM biases: learned priors overriding input statistics for temporal and demographic features. These failure modes must be addressed before LLM-based methods can handle the richer user representations where they might otherwise excel.

Impact 4.0 · Import 4.0 · Pop 3.5
#309
arXiv PostTraining 2026-04-16
by Andrew Dai, Boris Meinardus, Ciaran Regan, Yingtao Tian et al.

Frontier model developers aim to train models continually to possess emergent, diverse capabilities. To extend capabilities, the current pre-training and post-training paradigm requires manually starting training runs with static datasets or reward functions every time. Addressing this limitation, our work pursues the insight that open-endedness (via the coevolution of models and tasks) can discover models with increasingly novel skills in a single run. We introduce a new model development framework that extends coevolution to large language model (LLM) discovery, open-ended \textit{Assessment Coevolving with Diverse Capabilities} (AC/DC). AC/DC evolves both LLMs via model merging and natural language tasks via synthetic data generation. AC/DC discovers growing archives of LLMs that surpass the capabilities of larger LLMs while taking up less GPU memory. In particular, our LLM populations achieve a broader Coverage of expertise than other curated models or baselines on downstream benchmarks, without \textit{any} explicit benchmark optimization. Furthermore, AC/DC improves Coverage over time, continually innovates on tasks and models, and improves performance in multi-agent best-of-

Impact 4.0 · Import 4.0 · Pop 3.5
#310
arXiv PostTraining 2026-04-16
by Gabriele Mattioli, Evelyn Turri, Sara Sarto, Lorenzo Baraldi et al.

Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descr

Impact 4.0 · Import 4.0 · Pop 3.5
#311
arXiv PostTraining 2026-04-16
by Yifu Chen, Shengpeng Ji, Qian Chen, Tianle Liang et al.

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in se

Impact 4.0 · Import 4.0 · Pop 3.5
#312
arXiv PostTraining 2026-04-16
by Thanh Linh Nguyen, Nguyen Van Huynh, Quoc-Viet Pham

In data-sensitive domains such as healthcare, cross-silo federated learning (CFL) allows organizations to collaboratively train AI models without sharing raw data. However, practical CFL deployments are inherently coopetitive, in which organizations cooperate during model training while competing in downstream markets. In such settings, training contributions, including data volume, quality, and diversity, can improve the global model yet inadvertently strengthen rivals. This dilemma is amplified by non-IID data, which leads to asymmetric learning gains and undermines sustained participation. While existing competition-aware CFL and incentive-design approaches reward organizations based on marginal training contributions, they fail to account for the costs of strengthening competitors. In this paper, we introduce CoCoGen+, a coopetition-compatible data generation and incentivization framework that jointly models non-IID data and inter-organizational competition while endogenizing GenAI-based synthetic data generation as a strategic decision. Specifically, CoCoGen+ formulates each training round as a weighted potential game, where organizations strategically decide how much syntheti

Impact 4.0 · Import 4.0 · Pop 3.5
#313
arXiv PostTraining 2026-04-16
by Jordan Shipard, Arnold Wiliem, Kien Nguyen Thanh, Wei Xiang et al.

Generalized Category Discovery (GCD) challenges methods to identify known and novel classes using partially labeled data, mirroring human category learning. Unlike prior GCD methods, which operate within a single modality and require dataset-specific fine-tuning, we propose a modality-agnostic GCD approach inspired by the human brain's abstract category formation. Our $\textbf{OmniGCD}$ leverages modality-specific encoders (e.g., vision, audio, text, remote sensing) to process inputs, followed by dimension reduction to construct a $\textbf{GCD latent space}$, which is transformed at test-time into a representation better suited for clustering using a novel synthetically trained Transformer-based model. To evaluate OmniGCD, we introduce a $\textbf{zero-shot GCD setting}$ where no dataset-specific fine-tuning is allowed, enabling modality-agnostic category discovery. $\textbf{Trained once on synthetic data}$, OmniGCD performs zero-shot GCD across 16 datasets spanning four modalities, improving classification accuracy for known and novel classes over baselines (average percentage point improvement of $\textbf{+6.2}$, $\textbf{+17.9}$, $\textbf{+1.5}$ and $\textbf{+12.7}$ for vision, t

Impact 4.0 · Import 4.0 · Pop 3.5
#314
arXiv PostTraining 2026-04-16
by David Exler, Nils Friederich, Martin Krüger, John Jbeily et al.

Myotubes are multinucleated muscle fibers serving as key model systems for studying muscle physiology, disease mechanisms, and drug responses. Mechanistic studies and drug screening thereby rely on quantitative morphological readouts such as diameter, length, and branching degree, which in turn require precise three-dimensional instance segmentation. Yet established pretrained biomedical segmentation models fail to generalize to this domain due to the absence of large annotated myotube datasets. We introduce a geometry-driven synthesis pipeline that models individual myotubes via polynomial centerlines, locally varying radii, branching structures, and ellipsoidal end caps derived from real microscopy observations. Synthetic volumes are rendered with realistic noise, optical artifacts, and CycleGAN-based Domain Adaptation (DA). A compact 3D U-Net with self-supervised encoder pretraining, trained exclusively on synthetic data, achieves a mean IPQ of 0.22 on real data, significantly outperforming three established zero-shot segmentation models, demonstrating that biophysics-driven synthesis enables effective instance segmentation in annotation-scarce biomedical domains.

Impact 4.0 · Import 4.0 · Pop 3.5
#315
arXiv PostTraining 2026-04-16
by Saif Mahmoud

Speculative decoding accelerates large language model (LLM) inference. It uses a small draft model to propose a tree of future tokens. A larger target model then verifies these tokens in a single batched forward pass. Despite the growing body of work on speculative methods, the degree to which the cognitive characteristics of a task affect acceptance probability remains largely unexplored. We present an empirical study of tree-based speculative decoding acceptance dynamics. Our study spans four well-established NLP benchmark domains: code generation, mathematical reasoning, logical reasoning, and open-ended chat. For this, we use TinyLlama-1.1B as the draft model against Llama-2-7B-Chat-GPTQ as the target. Over 99,768 speculative nodes collected from 200 prompts, we derive per-domain acceptance rates, expected accepted lengths, depth-acceptance profiles, and entropy-acceptance correlations. We find that task type is a stronger predictor of acceptance than tree depth. Furthermore, only the chat domain consistently yields an expected accepted length exceeding 1.0 token per step. We also show that the entropy-acceptance correlation is consistently negative but weak across all domains
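The aggregate statistics the study reports can be sketched with a few lines of bookkeeping over verification logs. The record format, domain names, and the i.i.d. per-node acceptance assumption in `expected_accepted_length` below are illustrative simplifications, not the paper's actual tree model:

```python
# Hypothetical verification log: one record per speculative tree node,
# (domain, depth, accepted_by_target) -- the paper aggregates ~99,768 such nodes.
records = [
    ("chat", 1, True), ("chat", 2, True), ("chat", 3, False),
    ("code", 1, True), ("code", 2, False), ("code", 1, False),
    ("math", 1, False), ("math", 2, False), ("logic", 1, True),
]

def per_domain_rates(records):
    """Acceptance rate per domain from raw node records."""
    stats = {}
    for domain, _depth, accepted in records:
        n, acc = stats.get(domain, (0, 0))
        stats[domain] = (n + 1, acc + int(accepted))
    return {d: acc / n for d, (n, acc) in stats.items()}

def expected_accepted_length(p, max_depth=8):
    """Expected accepted prefix length along a single draft chain of depth
    `max_depth`, under a constant i.i.d. per-node acceptance probability p
    (a simplifying assumption for illustration)."""
    return sum(p ** k for k in range(1, max_depth + 1))

rates = per_domain_rates(records)
```

Under this toy log, the "chat" domain accepts 2 of 3 nodes, and the chain-length formula shows why an acceptance rate well above 0.5 is needed before the expected accepted length exceeds 1 token per step.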

Impact 4.0 · Import 4.0 · Pop 3.5
#316
arXiv PostTraining 2026-04-16
by Fernando Barcelos Rosito, Sebastião De Jesus Menezes, Simone Ferreira Sturza, Adriana Seixas et al.

Purpose. Athlete monitoring is constrained by small cohorts, heterogeneous biomarker scales, limited feasibility of repeated sampling, and the lack of reliable injury ground truth. These limitations reduce the interpretability and utility of traditional univariate and binary risk models. This study addresses these challenges by proposing an unsupervised multivariate framework to identify latent physiological states in athletes using real data. Methods. We propose a modular computational framework that operates in the joint biomarker space, integrating preprocessing, clinical safety screening, unsupervised clustering, and centroid-based physiological interpretation. Profiles are learned exclusively from amateur soccer players during a competitive microcycle. Synthetic data augmentation evaluates robustness and scalability. Ward hierarchical clustering supports monitoring and etiological differentiation, while Gaussian Mixture Models (GMM) enable structural stability analysis in high-dimensional settings. Results. The framework identifies coherent profiles that distinguish mechanical damage from metabolic stress while preserving homeostatic states. Synthetic data augmentation demonst

Impact 4.0 · Import 4.0 · Pop 3.5
#322
MIT Technology Review 2026-04-16

The AI boom has hit industries across the board, and public sector organizations are facing pressure to accelerate adoption. At the same time, government institutions face distinct constraints around security, governance, and operations that set them apart from their business counterparts. For this reason, purpose-built small language models (SLMs) offer a promising path to operationalize AI in these environments. A Capgemini study found that 79 percent of public sector executives globally are wary about AI’s data security, an understandable figure given the heightened sensitivity of government data and the legal obligations surrounding its use. As Han Xiao, vice president of AI at Elastic, says, “Government agencies must be very restricted about what kind of data they send to the network. This sets a lot of boundaries on how they think about and manage their data.” The fundamental need for control over sensitive information is one of many factors complicating AI deployment, particularly when compared against the private sector’s standard operational assumptions. Unique operational challenges: When private-sector entities expand AI, they typically assume certain conditions will be in place, including continuous connectivity to the cloud, reliance on centralized infrastructure, acceptance of incomplete model transparency, and limited restrictions on data movement. For many state institutions, however, accepting these conditions could be anything from dangerous to impossible. Government agencies must ensure that their data stays under their control, that information can be checked and verified, and that operational disruptions are kept to an absolute minimum. At the same time, they often have to run their systems in environments where internet connectivity is limit

Impact 4.0 · Import 4.0 · Pop 3.5
#323
MIT Technology Review 2026-04-16

There’s a fault line running through enterprise AI, and it’s not the one getting the most attention. The public conversation still tracks foundation models and benchmarks—GPT versus Gemini, reasoning scores, and marginal capability gains. But in practice, the more durable advantage is structural: who owns the operating layer where intelligence is applied, governed, and improved. One model treats AI as an on-demand utility; the other embeds it as an operating layer—the combination of operation software, data capture, feedback loops and governance that sits between models and real work—that compounds with use. Model providers like OpenAI and Anthropic sell intelligence as a service: you have a problem, you call an API, you get an answer. That intelligence is general-purpose, largely stateless, and only loosely connected to the day-to-day operations where decisions are made. It’s highly capable and increasingly interchangeable. The distinction that matters is whether intelligence resets on every prompt or accumulates over time. Incumbent organizations, by contrast, can treat AI as an operating layer: instrumentation across operations, feedback loops from human decisions, and governance that turns individual tasks into reusable policy. In that setup, every exception, correction, and approval becomes a chance to learn—and intelligence can improve as the platform absorbs more of the organization’s work. The organizations most likely to shape the enterprise AI era are those that can embed intelligence directly into operational platforms and instrument those platforms so work generates usable signals. The prevailing narrative says nimble startups will out-innovate incumbents by building AI-native from scratch. If AI is primarily a model problem, that story holds. But in many en

Impact 4.0 · Import 4.0 · Pop 3.5
#324
MIT Technology Review 2026-04-16

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology. Cyberscammers are bypassing banks’ security with illicit tools sold on Telegram: Inside a money-laundering center in Cambodia, an employee opens a banking app on his phone. It asks for a photo linked to the account, so he uploads a picture of a 30-something Asian man. The app then requests a video “liveness” check. The scammer holds up a static image of a woman who doesn’t match the account. After 90 seconds, he’s in. The exploit relies on illicit hacking services sold on Telegram that break “Know Your Customer” (KYC) facial scans. MIT Technology Review found 22 channels and groups advertising these services. This is what we discovered. —Fiona Kelliher. Is carbon removal in trouble? —Casey Crownhart: Last week, news emerged that Microsoft was pausing carbon removal purchases. It was a bombshell—Microsoft effectively is the carbon removal market, single-handedly purchasing around 80% of all contracted carbon removal. The report sparked fear across the industry, raising questions about the future of carbon removal and the role of Big Tech. Read the full story. This story is from The Spark, our weekly newsletter exploring the technology that could combat the climate crisis. Sign up to receive it in your inbox every Wednesday. The quest to measure our relationship with nature —Emma Marris: Humans have done some destructive things to the ecosystems around us. But conservationists are learning that we can also be a force for good. To under

Impact 4.0 · Import 4.0 · Pop 3.5
#325
AI Alignment Forum 2026-04-16

Sometimes people make various suggestions that we should simply build “safe” artificial Superintelligence (ASI), rather than the presumably “unsafe” kind. [1] There are various flavors of “safe” people suggest. Sometimes they suggest building “aligned” ASI: You have a full agentic autonomous god-like ASI running around, but it really really loves you and definitely will do the right thing. Sometimes they suggest we should simply build “tool AI” or “non-agentic” AI. Sometimes they have even more exotic, or more obviously-stupid ideas. Now I could argue at length about why this is astronomically harder than people think it is, why their various proposals are almost universally unworkable, why even attempting this is insanely immoral [2], but that’s not the main point I want to make. Instead, I want to make a simpler point: Assume you have a research agenda that, if executed, results in an ASI-tier powerful software system that you can “control”. [3] Punchline: On your way to figuring out how to build controllable ASI, you will have figured out how to build unsafe ASI, because unsafe ASI is vastly easier to build than controlled ASI, and is on the same tech path. You can’t build a controlled ASI without knowing many, MANY things about intelligence and how to build it. So this then bottlenecks the dual technical problems of “how to find an agenda that results in controllable ASI” and “how to execute on such an agenda” on “even if you had such an agenda, how do you execute it without accidentally, or due to some asshole leaving the project or reading your papers, building unsafe ASI along the way?” No one I know pursuing various agendas of this type has answers to these questions. And let’s be crystal clear: This is the fundamental

Impact 4.0 · Import 4.0 · Pop 3.5
#329
Defense One 2026-04-16

Federal court orders protect some collective-bargaining groups, but members of the American Federation of Government Employees remain vulnerable.

Impact 4.0 · Import 4.0 · Pop 3.5
#333
DefenseScoop 2026-04-16

The Army’s top civilian leader told lawmakers the service is leaning on its new counter-drone marketplace to bolster local security at upcoming high-profile events, ones that experts and officials have warned are at risk of unmanned aerial system threats. Earlier this year, the Pentagon debuted an initial launch of its “Counter-UAS Marketplace,” which officials have touted as an “Amazon-like” platform with a catalogue of anti-drone parts and systems for government personnel to buy. The Pentagon’s counter-drone entity, Joint Interagency Task Force 401, said this week that $13 million worth of tech — such as low-collateral systems, sensors, radars and electronic warfare platforms — has been purchased from the site since its launch. During a House Defense Appropriations subcommittee hearing Thursday, the Army’s top official said non-federal agencies can also buy items from the marketplace. “State and local and federal law enforcement officers across the country can purchase from this site. We’ve already had purchases,” Army Secretary Dan Driscoll said when asked about the service’s role in sharing counter-drone capabilities for security at events in the United States. He said Joint Base Myer-Henderson Hall has hosted 350 state and local police departments, though he didn’t specify when, and that the Army is “syncing them in” to JIATF-401’s counter-drone efforts. While briefly mentioned at Thursday’s budget hearing, Driscoll’s comments about the marketplace and its role in combating what one lawmaker described as “high security concerns” at major events highlight long-brewing alarm over stateside drone defense. Experts have told DefenseScoop that the threats drones present to domestic infrastructure and civilians are ubiquitous given their low cost, easy acces

Impact 4.0 · Import 4.0 · Pop 3.5
#334
arXiv cs.LG 2026-04-17
by Sean Hill, Felix X. -F. Ye

Stochastic dynamical systems with slow or metastable behavior evolve, on long time scales, on an unknown low-dimensional manifold in high-dimensional ambient space. Building a reduced simulator from short-burst ambient ensembles is a long-standing problem: local-chart methods like ATLAS suffer from exponential landmark scaling and per-step reprojection, while autoencoder alternatives leave tangent-bundle geometry poorly constrained, and the errors propagate into the learned drift and diffusion. We observe that the ambient covariance~$Λ$ already encodes coordinate-invariant tangent-space information, its range spanning the tangent bundle. Using this, we construct a tangent-bundle penalty and an inverse-consistency penalty for a three-stage pipeline (chart learning, latent drift, latent diffusion) that learns a single nonlinear chart and the latent SDE. The penalties induce a function-space metric, the $ρ$-metric, strictly weaker than the Sobolev $H^1$ norm yet achieving the same chart-quality generalization rate up to logarithmic factors. For the drift, we derive an encoder-pullback target via Itô's formula on the learned encoder and prove a bias decomposition showing the standard d

Impact 4.0 · Import 4.0 · Pop 3.5
#335
arXiv cs.LG 2026-04-17
by Haoran Zhang, Livia Betti, Konstantin Klemmer, Esther Rolf et al.

In computer vision and machine learning for geographic data, out-of-domain generalization is a pervasive challenge, arising from uneven global data coverage and distribution shifts across geographic regions. Though models are frequently trained in one region and deployed in another, there is no principled method for determining when this cross-region adaptation will be successful. A well-defined notion of distance between distributions can effectively quantify how different a new target domain is compared to the domains used for model training, which in turn could support model training and deployment decisions. In this paper, we propose a strategy for computing distances between geospatial domains that leverages geographic information with Optimal Transport methods (GeoSpOT). In our experiments, GeoSpOT distances emerge as effective predictors of cross-domain transfer difficulty. We further demonstrate that embeddings from pretrained location encoders provide information comparable to image/text embeddings, despite relying solely on longitude-latitude pairs as input. This allows users to get an approximation of out-of-domain performance for geospatial models, even when the exact d
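The core idea of an OT-based domain distance can be illustrated with a small entropic Sinkhorn computation over longitude-latitude point clouds. This is a generic stand-in, not GeoSpOT itself: the cost function, regularization, and region definitions below are all assumptions for illustration.

```python
import numpy as np

def pairwise_sq_dists(X, Y):
    """Squared Euclidean cost between two (lon, lat) point sets.
    (A great-circle cost would be more faithful for geographic data.)"""
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

def sinkhorn_cost(X, Y, eps=0.1, iters=500):
    """Entropic-regularized OT transport cost between uniform measures
    on X and Y -- a simplified sketch of an OT domain distance."""
    n, m = len(X), len(Y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    C = pairwise_sq_dists(X, Y)
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(iters):           # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # transport plan, rows sum to a
    return (P * C).sum()

rng = np.random.default_rng(0)
# Hypothetical "domains": clouds of (lon, lat) training locations.
region_a = rng.normal([2.35, 48.85], 0.1, size=(64, 2))
region_b = rng.normal([2.35, 48.85], 0.1, size=(64, 2))  # same distribution
region_c = rng.normal([3.35, 48.85], 0.1, size=(64, 2))  # shifted ~1 degree

d_near = sinkhorn_cost(region_a, region_b)
d_far = sinkhorn_cost(region_a, region_c)
```

Two samples from the same region land close under this distance, while the shifted region lands farther away, which is the signal the paper correlates with cross-region transfer difficulty.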

Impact 4.0 · Import 4.0 · Pop 3.5
#336
arXiv cs.LG 2026-04-17
by Yide Ran, Jianwen Xie, Minghui Wang, Wenjin Zheng et al.

Data attribution and valuation are critical for understanding data-model synergy for Large Language Models (LLMs), yet existing gradient-based methods suffer from scalability challenges on LLMs. Inspired by human cognition, where decision making relies on a focused readout of relevant memories rather than replaying all pathways, we introduce RISE (Readout Influence Sketching Estimator). Instead of computing and indexing gradients across the entire LLM, RISE focuses on influence hotspots at the output layer, where influence signals concentrate, and the gradient admits a decomposed outer-product form. This enables a dual-channel representation combining a lexical residual channel (RH) and a semantic projected-error channel (GH). Applying CountSketch projections to these channels achieves strong compression while maintaining accurate attribution. Across the OLMo (1B-32B) and Pythia (14M-6.9B) families, RISE reduces index storage by up to 112$\times$ compared to RapidIn and scales to a 32B-parameter LLM, where gradient-based baselines such as RapidIn and ZO-Inf become memory-infeasible. We evaluate RISE on two paradigms: (1) retrospective attribution, retrieving influential training exa
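CountSketch itself is a standard primitive and easy to state; RISE's dual-channel decomposition is not reproduced here. A minimal sketch of the projection, with hypothetical dimensions:

```python
import numpy as np

def countsketch_params(d, k, rng):
    """Each of the d input coordinates is hashed to one of k buckets
    and assigned a random sign."""
    return rng.integers(0, k, size=d), rng.choice([-1.0, 1.0], size=d)

def apply_countsketch(x, bucket, sign, k):
    """Project x in R^d down to R^k; <Sx, Sy> is an unbiased
    estimate of <x, y>."""
    out = np.zeros(k)
    np.add.at(out, bucket, sign * x)   # unbuffered scatter-add
    return out

rng = np.random.default_rng(1)
d, k = 10_000, 512                     # hypothetical sizes
bucket, sign = countsketch_params(d, k, rng)

x = rng.normal(size=d)
y = rng.normal(size=d)
sx = apply_countsketch(x, bucket, sign, k)
sy = apply_countsketch(y, bucket, sign, k)

true_ip = x @ y
est_ip = sx @ sy   # approximates true_ip; variance shrinks as k grows
```

The map is linear, so channel vectors can be sketched once at indexing time and compared later with plain dot products, which is what makes the 112x storage reduction compatible with accurate retrieval.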

Impact 4.0 · Import 4.0 · Pop 3.5
#337
arXiv cs.LG 2026-04-17
by Karim K. Ben Hicham, Jan G. Rittig, Martin Grohe, Alexander Mitsos

Accurate molecular property prediction is central to drug discovery, catalysis, and process design, yet real-world applications are often limited by small datasets. Molecular foundation models provide a promising direction by learning transferable molecular representations; however, they typically involve task-specific fine-tuning, require machine learning expertise, and often fail to outperform classical baselines. Tabular foundation models (TFMs) offer a fundamentally different paradigm: they perform predictions through in-context learning, enabling inference without task-specific training. Here, we evaluate TFMs in the low- to medium-data regime across both standardized pharmaceutical benchmarks and chemical engineering datasets. We evaluate both frozen molecular foundation model representations and classical descriptors and fingerprints. Across the benchmarks, the approach shows excellent predictive performance while reducing computational cost, compared to fine-tuning, with these advantages also transferring to practical engineering data settings. In particular, combining TFMs with CheMeleon embeddings yields up to 100\% win rates on 30 MoleculeACE tasks, while compact

Impact 4.0 · Import 4.0 · Pop 3.5
#338
arXiv cs.LG 2026-04-17
by Fernando Moro, Vinicius M. A. Souza

Multivariate time series classification (MTSC) plays a crucial role in various domains, including biomedical signal analysis and motion monitoring. However, existing approaches, particularly deep learning models, often require high computational resources, making them unsuitable for real-time applications or deployment on low-cost hardware, such as IoT devices and wearable systems. In this paper, we propose the Univariate Channel Fusion (UCF) method to deal with MTSC efficiently. UCF transforms multivariate time series into a univariate representation through simple channel fusion strategies such as the mean, median, or dynamic time warping barycenter. This transformation enables the use of any classifier originally designed for univariate time series, providing a flexible and computationally lightweight alternative to complex models. We evaluate UCF in five case studies covering diverse application domains, including chemical monitoring, brain-computer interfaces, and human activity analysis. The results demonstrate that UCF often outperforms baseline methods and state-of-the-art algorithms tailored for MTSC, while achieving substantial gains in computational efficiency, being par
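The fusion step described above is simple enough to sketch directly. This shows only the mean/median strategies; the DTW-barycenter variant the paper also evaluates is omitted, and the toy array is illustrative:

```python
import numpy as np

def fuse_channels(X, strategy="mean"):
    """Collapse a multivariate series X of shape (channels, length)
    into a single univariate series via simple channel fusion."""
    if strategy == "mean":
        return X.mean(axis=0)
    if strategy == "median":
        return np.median(X, axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

# Toy 3-channel series of length 5.
X = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [3.0, 2.0, 1.0, 0.0, 1.0],
    [2.0, 2.0, 2.0, 2.0, 3.0],
])

u_mean = fuse_channels(X, "mean")      # shape (5,)
u_median = fuse_channels(X, "median")
```

The resulting univariate series can then be fed to any off-the-shelf univariate classifier, which is the source of the method's flexibility and low compute cost.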

Impact 4.0 · Import 4.0 · Pop 3.5
#339
arXiv cs.LG 2026-04-17
by Michał Dereziński, Yuji Nakatsukasa, Elizaveta Rebrova

The quest for an algorithm that solves an $n\times n$ linear system in $O(n^2)$ time complexity, or $O(n^2 \text{poly}(1/ε))$ when solving up to $ε$ relative error, is a long-standing open problem in numerical linear algebra and theoretical computer science. There are two predominant paradigms for measuring relative error: forward error (i.e., distance from the output to the optimum solution) and backward error (i.e., distance to the nearest problem solved by the output). In most prior studies, convergence of iterative linear system solvers is measured via various notions of forward error, and as a result, depends heavily on the conditioning of the input. Yet, the numerical analysis literature has long advocated for backward error as the more practically relevant notion of approximation. In this work, we show that -- surprisingly -- the classical and simple Richardson iteration incurs at most $1/k$ (relative) backward error after $k$ iterations on any positive semidefinite (PSD) linear system, irrespective of its condition number. This universal convergence rate implies an $O(n^2/ε)$ complexity algorithm for solving a PSD linear system to $ε$ backward error, and we establish simila
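The forward/backward-error contrast is easy to see numerically. The sketch below runs plain Richardson iteration on a synthetic ill-conditioned PSD system and tracks one standard normwise backward-error definition (the paper's exact error notion may differ); the backward error keeps shrinking even though the condition number makes fast forward-error convergence hopeless at this iteration count.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Synthetic ill-conditioned PSD system: A = Q diag(lam) Q^T, cond(A) ~ 1e8.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
lam = np.logspace(-8, 0, n)
A = Q @ np.diag(lam) @ Q.T
b = rng.normal(size=n)

def backward_error(A, b, x):
    """Normwise relative backward error of x for Ax = b."""
    return np.linalg.norm(b - A @ x) / (
        np.linalg.norm(A, 2) * np.linalg.norm(x) + np.linalg.norm(b))

L = np.linalg.norm(A, 2)          # spectral norm sets the step size
x = np.zeros(n)
errs = []
for k in range(1, 101):
    x = x + (b - A @ x) / L       # classical Richardson step
    errs.append(backward_error(A, b, x))
```

Despite the 1e8 condition number, the backward error after 100 steps is on the order of 1/k, which is the universal rate the paper proves.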

Impact 4.0 · Import 4.0 · Pop 3.5
#340
arXiv cs.LG 2026-04-17
by Nicole Funk, Annika Hennes, Johanna Hillebrand, Sarah Sturm

We study discrete k-clustering problems in general metric spaces that are constrained by a combination of two different fairness conditions within the demographic fairness model. Given a metric space (P,d), where every point in P is equipped with a protected attribute, and a number k, the goal is to partition P into k clusters with a designated center each, such that a center-based objective function is minimized and the attributes are fairly distributed with respect to the following two fairness concepts: 1) group fairness: We aim for clusters with balanced numbers of attributes by specifying lower and upper bounds for the desired attribute proportions. 2) diverse center selection: Clusters have natural representatives, i.e., their centers. We ask for a balanced set of representatives by specifying the desired number of centers to choose from each attribute. Dickerson, Esmaeili, Morgenstern and Zhang (2023) denote the combination of these two constraints as doubly constrained fair clustering. They present algorithms whose guarantees depend on the best known approximation factors for either of these problems. Currently, this implies an 8-approximation with a small additive violatio

Impact 4.0 · Import 4.0 · Pop 3.5
#341
arXiv cs.LG 2026-04-17
by Ulrich Tan

We introduce the Tan-HWG framework (Hebbian-Wasserstein-Geometry), a geometric theory of Hebbian plasticity in which memory states are modeled as probability measures evolving through Wasserstein minimizing movements. Hebbian learning rules are formalized as Hebbian energies satisfying a sequential stability condition, ensuring well-posed fiberwise JKO updates, optimal-transport realizations, and an energy descent inequality. This variational structure induces a fundamental separation between internal and observable dynamics. Internal memory states evolve along Wasserstein geodesics in a latent curved space, while observable quantities, such as effective synaptic weights, arise through geometric projection maps into external spaces. Simplicial projections recover classical affine schemes (including exponential moving averages and mirror descent), while revealing synaptic competition and pruning as geometric consequences of mass redistribution. Hilbertian projections provide a geometric account of phase alignment and multi-scale coherence. Classical neural networks appear as flat projections of this curved dynamics, while the framework naturally accommodates richer distributional re

Impact 4.0 · Import 4.0 · Pop 3.5
#342
arXiv cs.LG 2026-04-17
by Abdulaziz Aldegheishem, Nabil Alrajeh, Lorena Parra, Oscar Romero et al.

The ambulance is the main means of transport for sick or injured people, and it is subject to the same acceleration forces as regular vehicles. These accelerations, caused by the movement of the vehicle, impact the performance of tasks executed by sanitary personnel, which can affect patient survival or recovery time. In this paper, we have trained, validated, and tested a system to assess driving in ambulance services. The proposed system is composed of a sensor node which measures the vehicle vibrations using an accelerometer. It also includes a GPS sensor, a battery, a display, and a speaker. When two possible routes reach the same destination point, the system compares the two routes based on previously classified data and calculates an index and a score. Thus, the index balances the possible routes in terms of time to reach the destination and the vibrations suffered in the patient cabin to recommend the route that minimises those vibrations. Three datasets are used to train, validate, and test the system. Based on an Artificial Neural Network (ANN), the classification model is trained with tagged data classified as low, medium, and high vibrations, and 97% accuracy is achieved. Th

Impact 4.0 · Import 4.0 · Pop 3.5
#343
arXiv cs.LG 2026-04-17
by Jie Yuan, Lei Wang, Yanhao Wang, Yimin Liu

This paper introduces a robust discrimination method for distinguishing real ship targets from corner-reflector-array jamming with frequency-agile radar. The key idea is to exploit the multidimensional micro-motion signatures that separate rigid ships from non-rigid decoys. From Range-Velocity maps we derive two new hand-crafted descriptors, the mean weighted residual (MWR) and the complementary contrast factor (CCF), and fuse them with deep features learned by a lightweight CNN. An XGBoost classifier then gives the final decision. Extensive simulations show that the hybrid feature set consistently outperforms state-of-the-art alternatives, confirming the superiority of the proposed approach.

Impact 4.0 · Import 4.0 · Pop 3.5
#344
arXiv cs.LG 2026-04-17
by Marcin Hoffmann, Paweł Kryszkiewicz

M-MIMO is one of the crucial technologies for increasing the spectral and energy efficiency of wireless networks. Most current works assume that M-MIMO arrays are equipped with a linear front end. However, ongoing efforts to make wireless networks more energy-efficient push the hardware to the limits, where its nonlinear behavior appears. This is an especially common problem for multicarrier systems, e.g., OFDM used in 4G, 5G, and possibly also in 6G, which is characterized by a high Peak-to-Average Power Ratio. While the impact of a nonlinear Power Amplifier (PA) on an OFDM signal is well characterized, it is a relatively new topic for M-MIMO OFDM systems. Most recent works either neglect nonlinear effects or utilize simplified models appropriate for Rayleigh or LoS radio channel models. In this paper, we first theoretically characterize the nonlinear distortion in the M-MIMO system under commonly used radio channel models. Then, utilizing 3D-Ray Tracing (3D-RT) software, we demonstrate that these models are not very accurate. Instead, we propose two models: a statistical one and an ML-based one using 3D-RT results. The proposed statistical model utilizes the Generaliz

Impact 4.0 · Import 4.0 · Pop 3.5
#345
arXiv cs.LG 2026-04-17
by Prabin Bohara, Pralhad Kumar Shrestha, Arpan Rai, Usha Poudel Lamgade et al.

Accurate automatic brain tumor segmentation in Low- and Middle-Income Countries (LMICs) is challenging due to the lack of defined national imaging protocols, diverse imaging data, extensive use of low-field Magnetic Resonance Imaging (MRI) scanners, and limited health-care resources. As part of the Brain Tumor Segmentation (BraTS) Africa 2025 Challenge, we applied topology refinement to state-of-the-art segmentation models such as nnU-Net, MedNeXt, and a combination of both. Since the BraTS-Africa dataset has low MRI image quality, we incorporated the BraTS 2025 challenge data of pre-treatment adult glioma (Task 1) to pre-train the segmentation model and used it to fine-tune on the BraTS-Africa dataset. We added an extra topology refinement module to address the issue of deformation in predictions that arose due to topological error. With the introduction of this module, we achieved a better Normalized Surface Distance (NSD) of 0.810, 0.829, and 0.895 on Surrounding Non-Enhancing FLAIR Hyperintensity (SNFH), Non-Enhancing Tumor Core (NETC), and Enhancing Tumor (ET), respectively.

Impact 4.0 · Import 4.0 · Pop 3.5
#346
arXiv cs.LG 2026-04-17
by Yaohong Yang, Sammie Katt, Samuel Kaski

Multi-objective Bayesian optimization (MOBO) provides a principled framework for optimizing expensive black-box functions with multiple objectives. However, existing MOBO methods often struggle with coverage, scalability with respect to the number of objectives, and integrating constraints and preferences. In this work, we propose \textit{STAGE-BO, Sequential Targeting Adaptive Gap-Filling $\varepsilon$-Constraint Bayesian Optimization}, that explicitly targets under-explored regions of the Pareto front. By analyzing the coverage of the approximate Pareto front, our method identifies the largest geometric gaps. These gaps are then used as constraints, which transforms the problem into a sequence of inequality-constrained subproblems, efficiently solved via constrained expected improvement acquisition. Our approach provides a uniform Pareto coverage without hypervolume computation and naturally applies to constrained and preference-based settings. Experiments on synthetic and real-world benchmarks demonstrate superior coverage and competitive hypervolume performance against state-of-the-art baselines.

Impact 4.0 · Import 4.0 · Pop 3.5
#347
arXiv cs.LG 2026-04-17
by Tristan Kirscher, Alexandra Ertl, Klaus Maier-Hein, Xavier Coubez et al.

Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.

Impact 4.0 · Import 4.0 · Pop 3.5
#348
arXiv cs.LG 2026-04-17
by Jeremy Qin, Maksym Andriushchenko

Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90\% coverage target, with the top performers Gemin
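The two quantities the benchmark scores, empirical coverage and interval sharpness, have simple definitions worth making concrete. The helper and data below are illustrative, not the benchmark's actual harness:

```python
import numpy as np

def evaluate_intervals(lo, hi, y, level=0.90):
    """Empirical coverage and sharpness (mean width) of prediction
    intervals [lo, hi] against realized outcomes y."""
    lo, hi, y = map(np.asarray, (lo, hi, y))
    covered = (lo <= y) & (y <= hi)
    return {
        "target": level,
        "coverage": covered.mean(),     # calibrated if close to `level`
        "sharpness": (hi - lo).mean(),  # narrower is better, at fixed coverage
    }

# Hypothetical 90% intervals from some model, with realized outcomes.
lo = [0.0, 2.0, -1.0, 5.0, 10.0]
hi = [1.0, 4.0, 1.0, 6.0, 20.0]
y = [0.5, 3.0, 2.0, 5.5, 12.0]   # the third outcome misses its interval

report = evaluate_intervals(lo, hi, y)
```

Here 4 of 5 outcomes fall inside their intervals, so empirical coverage is 0.80 against a 0.90 target; reporting sharpness alongside coverage prevents a model from trivially "passing" with arbitrarily wide intervals.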

Impact 4.0 · Import 4.0 · Pop 3.5
#349
arXiv cs.LG 2026-04-17
by Zhaobo Hu, Vincent Gauthier, Mehdi Naima

Distribution shift severely degrades the performance of deep forecasting models. While this issue is well-studied for individual time series, it remains a significant challenge in the spatio-temporal domain. Effective solutions like instance normalization and its variants can mitigate temporal shifts by standardizing statistics. However, distribution shift on a graph is far more complex, involving not only the drift of individual node series but also heterogeneity across the spatial network where different nodes exhibit distinct statistical properties. To tackle this problem, we propose Reversible Residual Normalization (RRN), a novel framework that performs spatially-aware invertible transformations to address distribution shift in both spatial and temporal dimensions. Our approach integrates graph convolutional operations within invertible residual blocks, enabling adaptive normalization that respects the underlying graph structure while maintaining reversibility. By combining Center Normalization with spectral-constrained graph neural networks, our method captures and normalizes complex spatio-temporal relationships in a data-driven manner. The bidirectional nature of our framew
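The reversibility property at the heart of such normalization schemes can be illustrated with a plain instance-normalization round trip: statistics removed before the model are restored on its output, so the transform is exactly invertible. This is a generic sketch; RRN's graph-aware weighting and residual blocks are omitted.

```python
import numpy as np

def normalize(x, eps=1e-8):
    # Remove per-series (per-node) statistics along the time axis.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True) + eps
    return (x - mean) / std, (mean, std)

def denormalize(y, stats):
    # Restore the stored statistics on the model output.
    mean, std = stats
    return y * std + mean

# Made-up data: 4 node series of length 64 with shifted statistics.
x = np.random.default_rng(0).normal(5.0, 2.0, size=(4, 64))
z, stats = normalize(x)
x_back = denormalize(z, stats)
print(np.allclose(x, x_back))  # the transform is reversible
```

In a forecasting pipeline the model would operate on `z` and its prediction would be passed through `denormalize` before scoring.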

Impact 4.0 · Import 4.0 · Pop 3.5
#350
arXiv cs.LG 2026-04-17
by Zhaobo Hu, Vincent Gauthier, Mehdi Naima

Spatiotemporal modeling has evolved beyond simple time series analysis to become fundamental in structural time series analysis. While current research extensively employs graph neural networks (GNNs) for spatial feature extraction with notable success, these networks are limited to capturing only pairwise relationships, despite real-world networks containing richer topological relationships. Additionally, GNN-based models face computational challenges that scale with graph complexity, limiting their applicability to large networks. To address these limitations, we present Modern Structure-Aware Simplicial SpatioTemporal neural network (ModernSASST), the first approach to leverage simplicial complex structures for spatiotemporal modeling. Our method employs spatiotemporal random walks on high-dimensional simplicial complexes and integrates parallelizable Temporal Convolutional Networks to capture high-order topological structures while maintaining computational efficiency. Our source code is publicly available on GitHub at https://github.com/ComplexNetTSP/ST_RUM.

Impact 4.0 · Import 4.0 · Pop 3.5
#351
arXiv cs.CL 2026-04-17
by Hitesh Mehta, Arjit Saxena, Garima Chhikara, Rohit Kumar

This paper explores the response of Large Language Models (LLMs) to user prompts with different degrees of politeness and impoliteness. The Politeness Theory by Brown and Levinson and the Impoliteness Framework by Culpeper form the basis of experiments conducted across three languages (English, Hindi, Spanish), five models (Gemini-Pro, GPT-4o Mini, Claude 3.7 Sonnet, DeepSeek-Chat, and Llama 3), and three interaction histories between users (raw, polite, and impolite). Our sample consists of 22,500 pairs of prompts and responses of various types, evaluated across five levels of politeness using an eight-factor assessment framework: coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, and readability. The findings show that model performance is highly influenced by tone, dialogue history, and language. While polite prompts enhance the average response quality by up to ~11% and impolite tones worsen it, these effects are neither consistent nor universal across languages and models. English is best served by courteous or direct tones, Hindi by deferential and indirect tones, and Spanish by assertive tones. Among the models, Llama is the most tone-sensit

Impact 4.0 · Import 4.0 · Pop 3.5
#352
arXiv cs.CL 2026-04-17
by Mahir Labib Dihan, Abir Muhtasim

The rapid proliferation of Large Language Models (LLMs) in software development has made distinguishing AI-generated code from human-written code a critical challenge with implications for academic integrity, code quality assurance, and software security. We present LLMSniffer, a detection framework that fine-tunes GraphCodeBERT using a two-stage supervised contrastive learning pipeline augmented with comment removal preprocessing and an MLP classifier. Evaluated on two benchmark datasets - GPTSniffer and Whodunit - LLMSniffer achieves substantial improvements over prior baselines: accuracy increases from 70% to 78% on GPTSniffer (F1: 68% to 78%) and from 91% to 94.65% on Whodunit (F1: 91% to 94.64%). t-SNE visualizations confirm that contrastive fine-tuning yields well-separated, compact embeddings. We release our model checkpoints, datasets, codes and a live interactive demo to facilitate further research.
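The comment-removal preprocessing mentioned in the abstract is a simple but consequential step: stripping comments forces the detector to attend to code structure rather than natural-language giveaways. A minimal sketch (Python-style `#` comments only, ignoring the edge case of `#` inside string literals; not LLMSniffer's actual preprocessor):

```python
import re

def strip_comments(code: str) -> str:
    # Remove '#' comments and drop lines that become empty.
    lines = []
    for line in code.splitlines():
        line = re.sub(r"#.*$", "", line).rstrip()
        if line:
            lines.append(line)
    return "\n".join(lines)

sample = "x = 1  # init\n# loop below\nfor i in range(3):\n    x += i  # add\n"
print(strip_comments(sample))
```

A production version would use a real tokenizer to avoid mangling strings containing `#`, and analogous rules for other languages.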

Impact 4.0 · Import 4.0 · Pop 3.5
#353
arXiv cs.CL 2026-04-17
by Sophie Steger, Rui Li, Sofiane Ennadir, Anya Sims et al.

The widespread adoption of large language models (LLMs) has increased concerns about their robustness. Vulnerabilities to perturbations of the input tokenisation indicate that models trained with a deterministic canonical tokenisation can be brittle to adversarial attacks. Recent studies suggest that stochastic tokenisation can deliver internal representations that are less sensitive to perturbations. In this paper, we analyse how stochastic tokenisations affect robustness to adversarial attacks and random perturbations. We systematically study this over a range of learning regimes (pre-training, supervised fine-tuning, and in-context learning), data sets, and model architectures. We show that pre-training and fine-tuning with uniformly sampled stochastic tokenisations improve robustness to random and adversarial perturbations. Evaluating on uniformly sampled non-canonical tokenisations reduces the accuracy of a canonically trained Llama-1b model by 29.8%. We find that training with stochastic tokenisation preserves accuracy without increasing inference cost.
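"Uniformly sampled stochastic tokenisation" means drawing uniformly among the valid segmentations of a string over the subword vocabulary, rather than always emitting the canonical one. A toy illustration with a made-up vocabulary (real tokenizers use BPE/unigram merges, not brute-force enumeration):

```python
import random

# Made-up subword vocabulary; single characters guarantee at least one
# segmentation exists for the example word.
VOCAB = {"un", "rel", "ated", "unrel",
         "u", "n", "r", "e", "l", "a", "t", "d"}

def segmentations(word):
    # Enumerate every way to split `word` into vocabulary pieces.
    if not word:
        return [[]]
    out = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in VOCAB:
            out += [[piece] + rest for rest in segmentations(word[i:])]
    return out

random.seed(3)
segs = segmentations("unrelated")
sampled = random.choice(segs)          # one uniformly sampled tokenisation
canonical = min(segs, key=len)         # fewest pieces, akin to canonical BPE
print(len(segs), canonical)
```

Training on `sampled` rather than `canonical` exposes the model to the non-canonical tokenisations it may face under adversarial perturbation.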

Impact 4.0 · Import 4.0 · Pop 3.5
#354
arXiv cs.CL 2026-04-17
by Ke Xiong, Qian Wu, Wangjie Gan, Yuke Li et al.

Few-shot Hierarchical Text Classification (few-shot HTC) is a challenging task that involves mapping texts to a predefined tree-structured label hierarchy under data-scarce conditions. While current approaches utilize structural constraints from the label hierarchy to maintain parent-child prediction consistency, they face a critical bottleneck, the difficulty in distinguishing semantically similar sibling classes due to insufficient domain knowledge. We introduce an innovative method named Sibling Contrastive Learning with Hierarchical Knowledge-aware Prompt Tuning for few-shot HTC tasks (SCHK-HTC). Our work enhances the model's perception of subtle differences between sibling classes at deeper levels, rather than just enforcing hierarchical rules. Specifically, we propose a novel framework featuring two core components: a hierarchical knowledge extraction module and a sibling contrastive learning mechanism. This design guides the model to encode discriminative features at each hierarchy level, thus improving the separability of confusable classes. Our approach achieves superior performance across three benchmark datasets, surpassing existing state-of-the-art methods in most cases. Ou

Impact 4.0 · Import 4.0 · Pop 3.5
#355
arXiv cs.CL 2026-04-17
by Andreea-Elena Bodea, Stephen Meisenbacher, Florian Matthes

Despite the considerable promise of Retrieval-Augmented Generation (RAG), many real-world use cases may create privacy concerns, where the purported utility of RAG-enabled insights comes at the risk of exposing private information to either the LLM or the end user requesting the response. As a potential mitigation, using anonymization techniques to remove personally identifiable information (PII) and other sensitive markers in the underlying data represents a practical and sensible course of action for RAG administrators. Despite a wealth of literature on the topic, no works consider the placement of anonymization along the RAG pipeline, i.e., asking the question, where should anonymization happen? In this case study, we systematically and empirically measure the impact of anonymization at two important points along the RAG pipeline: the dataset and generated answer. We show that differences in privacy-utility trade-offs can be observed depending on where anonymization took place, demonstrating the significance of privacy risk mitigation placement in RAG.

Impact 4.0 · Import 4.0 · Pop 3.5
#356
arXiv cs.CL 2026-04-17
by Judith Sieker, Sina Zarrieß

Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role aligns with success in the other. In this paper, we address this question for pragmatic competence by comparing LLMs' performance as pragmatic listeners, judging the appropriateness of linguistic outputs, and as pragmatic speakers, generating pragmatically appropriate language. We evaluate multiple open-weight and proprietary LLMs across three pragmatic settings. We find a robust asymmetry between pragmatic evaluation and pragmatic generation: many models perform substantially better as listeners than as speakers. Our results suggest that pragmatic judging and pragmatic generation are only weakly aligned in current LLMs, calling for more integrated evaluation practices.

Impact 4.0 · Import 4.0 · Pop 3.5
#357
arXiv cs.CL 2026-04-17
by Tanja Baeumel, Josef van Genabith, Simon Ostermann

Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning-intensive tasks remain underexplored. To advance the understanding of model-internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task execution. Using early decoding, we trace how next-token predictions are constructed across layers. Our experiments reveal that while the models recognize arithmetic tasks early, correct result generation occurs only in the final layers. Notably, models proficient in arithmetic exhibit a clear division of labor between attention and MLP modules, where attention propagates input information and MLP modules aggregate it. This division is absent in less proficient models. Furthermore, successful models appear to process more challenging arithmetic tasks functionally, suggesting reasoning capabilities beyond factual recall.

Impact 4.0 · Import 4.0 · Pop 3.5
#358
arXiv cs.CL 2026-04-17
by Dianqing Lin, Tian Lan, Jiali Zhu, Jiang Li et al.

While large language models (LLMs) have achieved remarkable success in general language tasks, their performance on Chouxiang Language, a representative subcultural language in the Chinese internet context, remains largely unexplored. In this paper, we introduce Mouse, a specialized benchmark designed to evaluate the capabilities of LLMs on NLP tasks involving Chouxiang Language across six tasks. Experimental results show that current state-of-the-art (SOTA) LLMs exhibit clear limitations on multiple tasks, while performing well on tasks that involve contextual semantic understanding. In addition, we further discuss the reasons behind the generally low performance of SOTA LLMs on Chouxiang Language, examine whether the LLM-as-a-judge approach adopted for translation tasks aligns with human judgments and values, and analyze the key factors that influence Chouxiang translation. Our study aims to promote further research in the NLP community on multicultural integration and the dynamics of evolving internet languages. Our code and data are publicly available.

Impact 4.0 · Import 4.0 · Pop 3.5
#359
arXiv cs.CL 2026-04-17
by Tobias Schimanski, Stefanie Lewandowski, Christian Woerle, Nicola Reichenau et al.

Conventional information retrieval is concerned with identifying the relevance of texts for a given query. Yet, the conventional definition of relevance is dominated by aspects of similarity in texts, leaving unobserved whether the text is truly useful for addressing the query. For instance, when answering whether Paris is larger than Berlin, texts about Paris being in France are relevant (lexical/semantic similarity), but not useful. In this paper, we introduce UsefulBench, a domain-specific dataset curated by three professional analysts labeling whether a text is connected to a query (relevance) or holds practical value in responding to it (usefulness). We show that classic similarity-based information retrieval aligns more strongly with relevance. While LLM-based systems can counteract this bias, we find that domain-specific problems require a high degree of expertise, which current LLMs do not fully incorporate. We explore approaches to (partially) overcome this challenge. However, UsefulBench presents a dataset challenge for targeted information retrieval systems.

Impact 4.0 · Import 4.0 · Pop 3.5
#360
arXiv cs.CL 2026-04-17
by Hyunseok Park, Jihyeon Kim, Jongeun Kim, Dongsik Yoon

Retrieval-Augmented Generation (RAG) systems lose retrieval accuracy when similar documents coexist in the vector database, causing unnecessary information, hallucinations, and factual errors. To alleviate this issue, we propose CHOP, a framework that iteratively evaluates chunk relevance with Large Language Models (LLMs) and progressively reconstructs documents by determining their association with specific topics or query types. CHOP integrates two key components: the CNM-Extractor, which generates compact per-chunk signatures capturing categories, key nouns, and model names, and the Continuity Decision Module, which preserves contextual coherence by deciding whether consecutive chunks belong to the same document flow. By prefixing each chunk with context-aware metadata, CHOP reduces semantic conflicts among similar documents and enhances retriever discrimination. Experiments on benchmark datasets show that CHOP alleviates retrieval confusion and provides a scalable approach for building high-quality knowledge bases, achieving a Top-1 Hit Rate of 90.77% and notable gains in ranking quality metrics.
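The "prefixing each chunk with context-aware metadata" step is straightforward to picture. A hedged sketch in which the signature fields (category, key nouns, model names) follow the abstract's description but the exact format is an assumption:

```python
def prefix_chunk(chunk: str, signature: dict) -> str:
    # Render the per-chunk signature as a compact header so the retriever's
    # embedding sees topical metadata alongside the raw text.
    header = " | ".join(f"{k}: {v}" for k, v in signature.items())
    return f"[{header}]\n{chunk}"

# Hypothetical CNM-Extractor output for one chunk.
sig = {"category": "benchmark", "nouns": "retrieval, chunk", "models": "CHOP"}
out = prefix_chunk("RAG systems lose retrieval accuracy...", sig)
print(out)
```

The prefixed text, not the raw chunk, would then be embedded and indexed, sharpening the distinction between near-duplicate documents.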

Impact 4.0 · Import 4.0 · Pop 3.5
#361
arXiv cs.CL 2026-04-17
by Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang

As Large Language Models (LLMs) receive increasing attention and are being deployed across various domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-free methods have recently emerged as cost-effective alternatives to post-training alignment techniques. Despite their promising results, these methods are evaluated inconsistently across the literature, cover limited dimensions of trustworthiness, and can introduce undesirable side effects, such as utility degradation and increased brittleness. To fully assess the impacts of these training-free methods, we take a step back and systematically re-evaluate the effectiveness of existing training-free methods against various trustworthy settings and their influence on utility, robustness, and computational overhead. We also categorize these methods into three levels (input, internal, and output) based on where they intervene in the model's information flow during inference. Using this taxonomy, we conduct a comprehensive analysis of various rep

Impact 4.0 · Import 4.0 · Pop 3.5
#362
arXiv cs.CL 2026-04-17
by Kai Wei, Raymond Li, Xi Zhu, Zhaoqian Xue et al.

Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose, leaving the structural causes of query-evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose Skill-RAG, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router. The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills (query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases) to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after m

Impact 4.0 · Import 4.0 · Pop 3.5
#363
arXiv cs.CL 2026-04-17
by Jinlun Ye, Jiang Liao, Runhe Lai, Xinhua Lu et al.

Vision-language models (VLMs) such as CLIP exhibit strong Out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic space is inherently open-ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams. To address this limitation, we introduce Test-time Textual Learning (TTL), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels. TTL updates learnable prompts using pseudo-labeled test samples to capture emerging OOD knowledge. To suppress noise introduced by pseudo-labels, we propose an OOD knowledge purification strategy that selects reliable OOD samples for adaptation. In addition, TTL maintains an OOD Textual Knowledge Bank that stores high-quality textual features, providing stable score calibration across batches. Extensive experiments on two standard benchmarks with nine OOD datasets demonstra

Impact 4.0 · Import 4.0 · Pop 3.5
#364
arXiv cs.CL 2026-04-17
by Ponhvoan Srey, Xiaobao Wu, Cong-Duy Nguyen, Anh Tuan Luu

Uncertainty estimation is a promising approach to detect hallucinations in large language models (LLMs). Recent approaches commonly depend on model internal states to estimate uncertainty. However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens. To address these issues, we present Sequential Internal Variance Representation (SIVR), a supervised hallucination detection framework that leverages token-wise, layer-wise features derived from hidden states. SIVR adopts a more basic assumption that uncertainty manifests in the degree of dispersion or variance of internal representations across layers, rather than relying on specific assumptions, which makes the method model and task agnostic. It additionally aggregates the full sequence of per-token variance features, learning temporal patterns indicative of factual errors and thereby preventing information loss. Experimental results demonstrate SIVR consistently outperforms strong baselines. Most importantly, SIVR enjoys stronger generalisation and avoids relying on large training sets, highlighting the potential for practical deplo
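The core SIVR feature, dispersion of each token's representation across layers, kept as a per-token sequence rather than pooled into one number, can be sketched with random tensors. Shapes and the averaging scheme here are illustrative assumptions, not the paper's exact definition:

```python
import numpy as np

# Made-up hidden states: (layers, tokens, hidden_dim).
rng = np.random.default_rng(1)
hidden = rng.normal(size=(12, 7, 16))

# For each token, deviation of its layer-wise representations from the
# token's mean representation, averaged over layers and hidden dims,
# yielding one variance feature per token (no last/mean-token pooling).
mean_over_layers = hidden.mean(axis=0, keepdims=True)
per_token_variance = ((hidden - mean_over_layers) ** 2).mean(axis=(0, 2))
print(per_token_variance.shape)  # one feature per token: (7,)
```

A supervised detector would then consume the full length-7 sequence, letting it learn temporal patterns of dispersion rather than a single scalar summary.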

Impact 4.0 · Import 4.0 · Pop 3.5
#365
arXiv cs.CL 2026-04-17
by Yichen Xu, Yuanhang Liu, Chuhan Wang, Zihan Zhao et al.

While Multimodal Large Language Models (MLLMs) excel at generic video understanding, their ability to support specialized, rule-grounded decision-making remains insufficiently explored. In this paper, we introduce RefereeBench, the first large-scale benchmark for evaluating MLLMs as automatic sports referees. Spanning 11 sports with 925 curated videos and 6,475 QA pairs, RefereeBench evaluates five core officiating abilities: foul existence, foul and penalty classification, foul and penalty reasoning, entity perception, and temporal grounding. The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence. Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees. Further analysis shows that while models can often identify incidents and involved entities, they struggle with rule application and temporal grounding, and frequently over-call fouls on norma

Impact 4.0 · Import 4.0 · Pop 3.5
#366
arXiv cs.AI 2026-04-17
by Reham Alharbi, Valentina Tamma, Terry R. Payne, Jacopo de Berardinis

Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language questions that an ontology should satisfy; they are traditionally modelled by ontology engineers together with domain experts as part of a human-centred, manual elicitation process. The use of Generative AI automates CQ creation at scale, therefore democratising the process of generation, widening stakeholder engagement, and ultimately broadening access to ontology engineering. However, given the large and heterogeneous landscape of LLMs, varying in dimensions such as parameter scale, task and domain specialisation, and accessibility, it is crucial to characterise and understand the intrinsic, observable properties of the CQs they produce (e.g., readability, structural complexity) through a systematic, cross-domain analysis. This paper introduces a set of quantitative measures for the systematic comparison of CQs across multiple dimensions. Using CQs generated from well-defined use cases and scenarios, we identify their salient properties, including readability, relevance with respect to the input text and structural complexity of

Impact 4.0 · Import 4.0 · Pop 3.5
#367
arXiv cs.AI 2026-04-17
by Shiran Dudy, Jan Simson, Yanan Long

As a relatively new forum, ACM FAccT has become a key space for activists and scholars to critically examine emerging AI and ML technologies. It brings together academics, civil society members, and government representatives from diverse fields to explore the broader societal impacts of both deployed and proposed technologies. We report a large-scale participatory design (PD) process for reflexive conference governance, which combined an in-person CRAFT session, an asynchronous Polis poll and the synthesis of a governance-facing report for the FAccT leadership. Participants shaped the substantive agenda by authoring seed statements, adding new statements, and voting, which made patterns of agreement, disagreement, and uncertainty visible. Our endeavors represent one of the first instances of applying PD to a venue that critically interrogates the societal impacts of AI, fostering a niche in which critical scholars are free to voice their concerns. Finally, this work advances large-scale PD theory by providing an effective case study of a co-design paradigm that can readily scale temporally and epistemologically.

Impact 4.0 · Import 4.0 · Pop 3.5
#368
arXiv cs.AI 2026-04-17
by Stefan Behfar, Richard Mortier

Probabilistic Synchronous Parallel (PSP) is a technique in distributed learning systems to reduce synchronization bottlenecks by sampling a subset of participating nodes per round. In Federated Learning (FL), where edge devices are often unreliable due to factors including mobility, power constraints, and user activity, PSP helps improve system throughput. However, PSP has a key limitation: it assumes device behavior is static and different devices are independent. This can lead to unfair distributed synchronization, due to highly available nodes dominating training while those that are often unavailable rarely participate and so their data may be missed. If both data distribution and node availability are simultaneously correlated with the device, then both PSP and standard FL algorithms will suffer from persistent under-representation of certain classes or groups resulting in inefficient or ineffective learning of certain features. We introduce Availability-Weighted PSP (AW-PSP), an extension to PSP that addresses the issue of co-correlation of unfair sampling and data availability by dynamically adjusting node sampling probabilities using real-time availability predictions, hist

Impact 4.0 · Import 4.0 · Pop 3.5
#369
arXiv cs.AI 2026-04-17
by Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian, Tanuja Ganu

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.

Impact 4.0 · Import 4.0 · Pop 3.5
#370
arXiv cs.AI 2026-04-17
by Sihan Lv, Yechen Jin, Zhen Li, Jintao Chen et al.

Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-Speech (TTS) models often faces a trade-off between editing quality and consistency. To address these issues, we propose AST, an Adaptive, Seamless, and Training-free precise speech editing framework. Leveraging a pre-trained autoregressive TTS model, AST introduces Latent Recomposition to selectively stitch preserved source segments with newly synthesized targets. Furthermore, AST extends this latent manipulation to enable precise style editing for specific speech segments. To prevent artifacts at these edit boundaries, the framework incorporates Adaptive Weak Fact Guidance (AWFG). AWFG dynamically modulates a mel-space guidance signal, enforcing structural constraints only where necessary without disrupting the generative manifold. To fill the gap of publicly accessible benchmarks, we introduce LibriSpeech-Edit, a new and larger speech editing dataset. As existing metrics poorly evaluate temporal consisten

Impact 4.0 · Import 4.0 · Pop 3.5
#371
arXiv cs.AI 2026-04-17
by Colin Jüni, Mina Montazeri, Yi Guo, Federica Bellizio et al.

Buildings account for approximately 40% of global energy consumption, and with the growing share of intermittent renewable energy sources, enabling demand-side flexibility, particularly in heating, ventilation and air conditioning systems, is essential for grid stability and energy efficiency. This paper presents a safe deep reinforcement learning-based control framework to optimize building space heating while enabling demand-side flexibility provision for power system operators. A deep deterministic policy gradient algorithm is used as the core deep reinforcement learning method, enabling the controller to learn an optimal heating strategy through interaction with the building thermal model while maintaining occupant comfort, minimizing energy cost, and providing flexibility. To address safety concerns with reinforcement learning, particularly regarding compliance with flexibility requests, we propose a real-time adaptive safety filter to ensure that the system operates within predefined constraints during demand-side flexibility provision. The proposed real-time adaptive safety filter guarantees full compliance with flexibility requests from system operators and improves energy

Impact 4.0 · Import 4.0 · Pop 3.5
#372
arXiv cs.AI 2026-04-17
by Delfina S. Martinez Pandiani, Ella Streefkerk, Laurens Naudts, Paula Helm

This paper traces a conceptual shift from understanding vulnerability as a static, essentialized property of data subjects to examining how it is actively enacted through data practices. Unlike reflexive ethical frameworks focused on missing or counter-data, we address the condition of abundance inherent to platformized life: a context where a near inexhaustible mass of data points already exists, shifting the ethical challenge to the researcher's choices in operating upon this existing mass. We argue that the ethical integrity of data science depends not just on who is studied, but on how technical pipelines transform "vulnerable" individuals into data subjects whose vulnerability can be further precarized. We develop this argument through an AI for Social Good (AI4SG) case: a journalist's request to use computer vision to quantify child presence in monetized YouTube 'family vlogs' for regulatory advocacy. This case reveals a "protection paradox": how data-driven efforts to protect vulnerable subjects can inadvertently impose new forms of computational exposure, reductionism, and extraction. Using this request as a point of departure, we perform a methodological deconstruction of t

Impact 4.0 · Import 4.0 · Pop 3.5
#373
arXiv cs.AI 2026-04-17
by Olivier Létoffé, Xuanxiang Huang, Joao Marques-Silva

For around a decade, non-symbolic methods have been the methods of choice when explaining complex machine learning (ML) models. Unfortunately, such methods lack rigor and can mislead human decision-makers. In high-stakes uses of ML, the lack of rigor is especially problematic. One prime example of provable lack of rigor is the adoption of Shapley values in explainable artificial intelligence (XAI), with the tool SHAP being a ubiquitous example. This paper overviews the ongoing efforts towards using rigorous symbolic methods of XAI as an alternative to non-rigorous non-symbolic approaches, concretely for assigning relative feature importance.
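For context on what the paper critiques, the Shapley-value attribution it discusses can be computed exactly on a tiny model by averaging marginal contributions over all feature orderings. The toy model, baseline, and input below are invented for illustration:

```python
from itertools import permutations

def model(x):
    # Toy model: an interaction term plus a linear term.
    return x[0] * x[1] + x[2]

baseline = [0.0, 0.0, 0.0]   # values substituted for "absent" features
point = [1.0, 2.0, 3.0]      # the input being explained

def value(coalition):
    # Model output when only features in `coalition` take their true values.
    x = [point[i] if i in coalition else baseline[i] for i in range(3)]
    return model(x)

shap = [0.0] * 3
for order in permutations(range(3)):
    seen = set()
    for i in order:
        before = value(seen)
        seen.add(i)
        shap[i] += (value(seen) - before) / 6  # 3! = 6 orderings
print([round(s, 2) for s in shap])  # [1.0, 1.0, 3.0]
```

Features 0 and 1 split the interaction term evenly and feature 2 receives its full additive effect; the paper's argument is that such game-theoretic scores, as popularized by SHAP, can still mislead, motivating symbolic alternatives.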

Impact 4.0 · Import 4.0 · Pop 3.5
#374
arXiv cs.RO 2026-04-17
by Ruiyang Wang, Hao-Lun Hsu, Jiwoo Kim, Miroslav Pajic

Coordinating multi-robot systems (MRS) to search in unknown environments is particularly challenging for tasks that require semantic reasoning beyond geometric exploration. Classical coordination strategies rely on frontier coverage or information gain and cannot incorporate high-level task intent, such as searching for objects associated with specific room types. We propose Semantic Area Graph Reasoning (SAGR), a hierarchical framework that enables Large Language Models (LLMs) to coordinate multi-robot exploration and semantic search through a structured semantic-topological abstraction of the environment. SAGR incrementally constructs a semantic area graph from a semantic occupancy map, encoding room instances, connectivity, frontier availability, and robot states into a compact task-relevant representation for LLM reasoning. The LLM performs high-level semantic room assignment based on spatial structure and task context, while deterministic frontier planning and local navigation handle geometric execution within assigned rooms. Experiments on the Habitat-Matterport3D dataset across 100 scenarios show that SAGR remains competitive with state-of-the-art exploration method

Impact 4.0 · Import 4.0 · Pop 3.5
#375
arXiv cs.RO 2026-04-17
by Thomas Campagnolo, Ezio Malis, Philippe Martinet, Gaétan Bahl

Open-vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single-view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision-language models to enhance open-vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase-grounded tasks and demonstrates generalization in zero-shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method. SENSE also provides a relative improvement of +3.5% mIoU on Cityscapes and +18% on KITTI compared to the baseline work. By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent

Impact 4.0 · Import 4.0 · Pop 3.5
#376
arXiv cs.RO 2026-04-17
by Xinglei Yu, Zhenyang Liu, Shufeng Nan, Simo Wu et al.

Diffusion policies are becoming mainstream in robotic manipulation but suffer from hard negative class imbalance due to uniform sampling and lack of sample difficulty awareness, leading to slow training convergence and frequent inference timeout failures. We propose VADF (Vision-Adaptive Diffusion Policy Framework), a vision-driven dual-adaptive framework that significantly reduces convergence steps and achieves early success in inference, with model-agnostic design enabling seamless integration into any diffusion policy architecture. During training, we introduce Adaptive Loss Network (ALN), a lightweight MLP-based loss predictor that quantifies per-step sample difficulty in real time. Guided by hard negative mining, it performs weighted sampling to prioritize high-loss regions, enabling adaptive weight updates and faster convergence. In inference, we design the Hierarchical Vision Task Segmenter (HVTS), which decomposes high-level task instructions into multi-stage low-level sub-instructions based on visual input. It adaptively segments action sequences into simple and complex subtasks by assigning shorter noise schedules with longer direct execution sequences to simple actions,
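The hard-negative-mining step can be sketched as loss-weighted sampling; the predicted losses below are stand-ins for ALN's per-step difficulty outputs, and the names are illustrative, not VADF's API.

```python
import random

def difficulty_weighted_batch(samples, predicted_loss, batch_size, rng):
    """Draw training samples with probability proportional to their
    predicted per-step loss, so high-loss (hard) regions are sampled
    more often than easy ones -- the weighted-sampling idea behind ALN."""
    return rng.choices(samples, weights=predicted_loss, k=batch_size)

rng = random.Random(0)
samples = ["easy_a", "easy_b", "hard_c"]
loss = [0.1, 0.1, 0.8]          # stand-ins for ALN's real-time loss predictions
batch = difficulty_weighted_batch(samples, loss, 1000, rng)
print(batch.count("hard_c") / 1000)  # close to 0.8, its share of total loss
```

In the full framework these weights would be refreshed each step as the loss predictor updates, shifting sampling pressure as samples get learned.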

Impact 4.0 · Import 4.0 · Pop 3.5
#377
arXiv cs.RO 2026-04-17
by Ayodele James Oyejide, Ustaz A. Yaqub, Samir Erturk, Eray A. Baran et al.

Vine-inspired robots achieve large workspace coverage through tip eversion, enabling safe navigation in confined and cluttered environments. However, their deployment in free space is fundamentally limited by low axial stiffness, poor load-bearing capacity, and the inability to retain shape during and after steering. In this work, we propose a reconfigurable pneumatic joint (RPJ) architecture that introduces discrete, pressure-tunable stiffness along the robot body without compromising continuous growth. Each RPJ module comprises symmetrically distributed pneumatic chambers that locally increase bending stiffness when pressurized, enabling decoupling between global compliance and localized rigidity. We integrate the RPJs into a soft growing robot with tendon-driven steering and develop a compact base station for mid-air eversion. System characterization and experimental validation demonstrate moderate pressure requirements for eversion, as well as comparable localized stiffening and steering performance to layer-jamming mechanisms. Demonstrations further show that the proposed robot achieves improved shape retention during bending, reduced gravitational deflection under load, casca

Impact 4.0 · Import 4.0 · Pop 3.5
#378
arXiv cs.RO 2026-04-17
by Vishal Ramesh, Antony Thomas

Multi-UAV inspection missions require spare drones to replace active drones during recharging cycles. Existing fleet-sizing approaches often assume steady-state operating conditions that do not apply to finite-horizon missions, or they treat replacement requests as statistically independent events. The latter provides per-request blocking guarantees that fail to translate to mission-level reliability when demands cluster. This paper identifies a structural failure mode where efficient routing assigns similar workloads to each UAV, leading to synchronized battery depletion and replacement bursts that exhaust the spare pool even when average capacity is sufficient. We derive a closed-form sufficient fleet-sizing rule, k = m(ceil(R) + 1), where m is the number of active UAVs and R is the recovery-to-active time ratio. This additive buffer of m spares absorbs worst-case synchronized demand at recovery-cycle boundaries and ensures mission-level reliability even when all UAVs deplete simultaneously. Monte Carlo validation across five scenarios (m in [2, 10], R in [0.87, 3.39], 1000 trials each) shows that Erlang-B sizing with a per-request blocking target epsilon = 0.01 drops to 69.9% mi
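The closed-form rule is simple enough to evaluate directly; a minimal sketch (the function name is ours) using parameter values from the paper's reported ranges:

```python
import math

def sufficient_fleet_size(m: int, recovery_ratio: float) -> int:
    """Sufficient total fleet size k = m * (ceil(R) + 1), where m is the
    number of active UAVs and R the recovery-to-active time ratio.
    Sized so the spare pool survives worst-case synchronized depletion,
    when all m active UAVs request replacement at once."""
    return m * (math.ceil(recovery_ratio) + 1)

# m = 4 active UAVs whose recharge takes 2.5x their flight time (R = 2.5):
k = sufficient_fleet_size(4, 2.5)
print(k)  # 16 total, i.e. 4 active plus 12 spares
```

The ceiling makes the rule conservative for fractional R, which is what lets it absorb the replacement bursts that per-request Erlang-B sizing misses.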

Impact 4.0 · Import 4.0 · Pop 3.5
#379
arXiv cs.RO 2026-04-17
by Vishal Ramesh, Aman Singh, Shishir Kolathaya

Series and parallel elastic actuators offer complementary but mutually exclusive advantages, yet no existing actuator enables real-time transition between these topologies during operation. This paper presents a novel actuator design called the Dual-Topology Elastic Actuator (DTEA), which enables dynamic switching between SEA and PEA topologies during operation. A proof-of-concept prototype of the DTEA is developed to demonstrate the feasibility of the topology-switching mechanism. Experiments are conducted to evaluate the robustness and timing of the switching mechanism under operational conditions. The actuator successfully performed 324 topology-switching cycles under load without damage, demonstrating the robustness of the mechanism. The measured switching time between SEA and PEA modes is under 33.33 ms. Additional experiments are conducted to characterize the static stiffness and disturbance rejection performance in both SEA and PEA modes. Static stiffness tests show that the PEA mode is 1.53x stiffer than the SEA mode, with KSEA = 5.57 +/- 0.02 Nm/rad and KPEA = 8.54 +/- 0.02 Nm/rad. Disturbance rejection experiments show that the mean peak deflection in SEA mode is 2.26x la
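The reported 1.53x stiffness ratio follows directly from the two measured constants; a quick arithmetic check:

```python
k_sea = 5.57  # Nm/rad, measured static stiffness in SEA mode
k_pea = 8.54  # Nm/rad, measured static stiffness in PEA mode
ratio = k_pea / k_sea
print(round(ratio, 2))  # 1.53, matching the reported 1.53x
```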

Impact 4.0 · Import 4.0 · Pop 3.5
#380
arXiv cs.RO 2026-04-17
by Zhi Zhang, Chalermchon Satirapod, Bingtao Ma, Changjun Gu

Solid-state LiDAR-inertial SLAM has attracted significant attention due to its advantages in speed and robustness. However, achieving accurate mapping in extreme environments remains challenging due to severe geometric degeneracy and unreliable observations, which often lead to ill-conditioned optimization and map inconsistencies. To address these challenges, we propose an environment-adaptive solid-state LiDAR-inertial odometry that integrates local normal-vector constraints with degeneracy-aware map maintenance to enhance localization accuracy. Specifically, we introduce local normal-vector constraints to improve the stability of state estimation, effectively suppressing localization drift in degenerate scenarios. Furthermore, we design a degeneration-guided map update strategy to improve map precision. Benefiting from the refined map representation, localization accuracy is further enhanced in subsequent estimation. Experimental results demonstrate that the proposed method achieves superior mapping accuracy and robustness in extreme and perceptually degraded environments, with an average RMSE reduction of up to 12.8% compared to the baseline method.

Impact 4.0 · Import 4.0 · Pop 3.5
#381
arXiv cs.RO 2026-04-17
by Jed R Muff, Karine Miras, A. E. Eiben

Lamarckian inheritance has been shown to be a powerful accelerator in systems where the joint evolution of robot morphologies and controllers is enhanced with individual learning. Its defining advantage lies in the offspring inheriting controllers learned by their parents. The efficacy of this option, however, relies on morphological similarity between parent and offspring. In this study, we examine how Lamarckian inheritance performs when the search process is driven toward high morphological variance, potentially straining the requirement for parent-offspring similarity. Using a system of modular robots that can evolve and learn to solve a locomotion task, we compare Darwinian and Lamarckian evolution to determine how they respond to shifting from pure task-based selection to a multi-objective pressure that also rewards morphological novelty. Our results confirm that Lamarckian evolution outperforms Darwinian evolution when optimizing task-performance alone. However, introducing selection pressure for morphological diversity causes a substantial performance drop, which is much greater in the Lamarckian system. Further analyses show that promoting diversity reduces parent-offsprin

Impact 4.0 · Import 4.0 · Pop 3.5
#382
arXiv cs.RO 2026-04-17
by Junjie Wen, Junlin He, Fei Ma, Jinqiang Cui

Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to jointly satisfy these requirements, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes. To address these challenges, we present PLAF, a Pixel-wise Language-Aligned Feature extraction framework that enables dense and accurate semantic alignment in 2D without sacrificing open-vocabulary expressiveness. Building upon this representation, we further design an efficient semantic storage and querying scheme that significantly reduces redundancy across both 2D and 3D domains. Experimental results show that PLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The code is publicly available at https://github.com/RockWenJJ/PLAF.

Impact 4.0 · Import 4.0 · Pop 3.5
#383
arXiv cs.RO 2026-04-17
by Aedan Mangan, Kehan Long, Ki Myung Brian Lee, Miheer Potdar et al.

Continuum robots are well suited for navigating confined and fragile environments, such as vascular or endoluminal anatomy, where contact with surrounding structures is often unavoidable. While controlled contact can assist motion, unfavorable contact can degrade controllability, induce kinematic singularities, or introduce safety risks. We present a contact-aware planning approach that evaluates contact quality, penalizing hazardous interactions, while permitting benign contact. The planner produces kinematically feasible trajectories and contact-aware Jacobians which can be used for closed-loop control in hardware experiments. We validate the approach by testing the integrated system (planning, control, and mechanical design) on anatomical models from patient scans. The planner generates effective plans for three common anatomical environments, and, in all hardware trials, the continuum robot was able to reach the target while avoiding dangerous tip contact (100% success). Mean tracking errors were 1.9 +/- 0.5 mm, 1.2 +/- 0.1 mm, and 1.7 +/- 0.2 mm across the three different environments. Ablation studies showed that penalizing end-of-continuum-segment (ECS) contact improved mani

Impact 4.0 · Import 4.0 · Pop 3.5
#384
arXiv cs.RO 2026-04-17
by Lorenzo Ticozzi, Patricio A. Vela, Panagiotis Tsiotras

Reconstructing the shape of continuum manipulators from sparse, noisy sensor data is a challenging task, owing to the infinite-dimensional nature of such systems. Existing approaches broadly trade off between parametric methods that yield compact state representations but lack probabilistic structure, and Cosserat rod inference on factor graphs, which provides principled uncertainty quantification at the cost of a state dimension that grows with the spatial discretization. This letter combines the strength of both paradigms by estimating the coefficients of a low-dimensional Geometric Variable Strain (GVS) parameterization within a factor graph framework. A novel kinematic factor, derived from the Magnus expansion of the strain field, encodes the closed-form rod geometry as a prior constraint linking the GVS strain coefficients to the backbone pose variables. The resulting formulation yields a compact state vector directly amenable to model-based control, while retaining the modularity, probabilistic treatment and computational efficiency of factor graph inference. The proposed method is evaluated in simulation on a 0.4 m long tendon-driven continuum robot under three measurement c

Impact 4.0 · Import 4.0 · Pop 3.5
#385
arXiv cs.RO 2026-04-17
by Dong-Uk Seo, Jinwoo Jeon, Eungchang Mason Lee, Hyun Myung

Gaussian splatting has recently gained traction as a compelling map representation for SLAM systems, enabling dense and photo-realistic scene modeling. However, its application to monocular SLAM remains challenging due to the lack of reliable geometric cues from monocular input. Without geometric supervision, mapping or tracking can fall into local minima, resulting in structural degeneracies and inaccuracies. To address this challenge, we propose GaussianFlow SLAM, a monocular 3DGS-SLAM that leverages optical flow as a geometry-aware cue to guide the optimization of both the scene structure and camera poses. By encouraging the projected motion of Gaussians, termed GaussianFlow, to align with the optical flow, our method introduces consistent structural cues to regularize both map reconstruction and pose estimation. Furthermore, we introduce normalized error-based densification and pruning modules to refine inactive and unstable Gaussians, thereby contributing to improved map quality and pose accuracy. Experiments conducted on public datasets demonstrate that our method achieves superior rendering quality and tracking accuracy compared with state-of-the-art algorithms. The source c

Impact 4.0 · Import 4.0 · Pop 3.5
#386
arXiv Robotics-Embodied 2026-04-17
by Jianuo Cao, Yuxin Chen, Masayoshi Tomizuka

Training language-conditioned whole-body controllers for humanoid robots demands large-scale motion-language datasets. Existing approaches based on motion capture are costly and limited in diversity, while text-to-motion generative models produce purely kinematic outputs that are not guaranteed to be physically feasible. We present CLAW, a pipeline for scalable generation of language-annotated whole-body motion data for the Unitree G1 humanoid robot. CLAW composes motion primitives from a kinematic planner, parameterized by movement, heading, speed, pelvis height, and duration, and provides two browser-based interfaces (a real-time keyboard mode and a timeline-based sequence editor) for exploratory and batch data collection. A low-level controller tracks these references in MuJoCo simulation, yielding physically grounded trajectories. In parallel, a template-based engine generates diverse natural-language annotations at both segment and trajectory levels. To support scalable generation of motion-language paired data for humanoid robot learning, we make our system publicly available at: https://github.com/JianuoCao/CLAW

Impact 4.0 · Import 4.0 · Pop 3.5
#387
arXiv RL 2026-04-17
by Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin et al.

Reinforcement learning has become a powerful approach for enhancing large language model reasoning, but faces a fundamental dilemma: training on easy problems can cause overfitting and pass@k degradation, while training on hard problems often results in sparse rewards. Recent question augmentation methods address this by prepending partial solutions as hints. However, uniform hint provision may introduce redundant information while missing critical reasoning bottlenecks, and excessive hints can reduce reasoning diversity, causing pass@k degradation. We propose PieceHint, a hint injection framework that strategically identifies and provides critical reasoning steps during training. By scoring the importance of different reasoning steps, selectively allocating hints based on problem difficulty, and progressively withdrawing scaffolding, PieceHint enables models to transition from guided learning to independent reasoning. Experiments on six mathematical reasoning benchmarks show that our 1.5B model achieves comparable average performance to 32B baselines while preserving pass@k diversity across all k values.
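A toy version of difficulty-based hint allocation with progressive withdrawal; the scheduling function and names below are our illustration of the idea, not PieceHint's actual scoring.

```python
def hint_fraction(difficulty: float, progress: float, max_frac: float = 0.6) -> float:
    """Fraction of the reference solution to prepend as a hint: larger for
    harder problems, annealed toward zero as training progresses so the
    model moves from guided learning to independent reasoning."""
    return max_frac * difficulty * (1.0 - progress)

def build_prompt(problem: str, solution_steps: list, difficulty: float, progress: float) -> str:
    k = round(hint_fraction(difficulty, progress) * len(solution_steps))
    hint = solution_steps[:k]   # prefix of the reference solution as scaffolding
    return problem + ("\nHint: " + " ".join(hint) if hint else "")

steps = ["factor the quadratic,", "set each factor to zero,", "check both roots."]
print(build_prompt("Solve x^2-5x+6=0.", steps, difficulty=1.0, progress=0.0))
```

Early in training a hard problem gets the first two of three steps as a hint; by the end of training (progress 1.0) the same problem gets none.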

Impact 4.0 · Import 4.0 · Pop 3.5
#388
arXiv RL 2026-04-17
by Huan Lin, Lianghui Ding

Large-scale Unmanned Aerial Vehicle (UAV) failures can split a UAV swarm network into disconnected sub-networks, making decentralized recovery both urgent and difficult. Centralized recovery methods depend on global topology information and become communication-heavy after severe fragmentation. Decentralized heuristics and multi-agent reinforcement learning methods are easier to deploy, but their performance often degrades when the swarm scale and damage severity vary. We present the Physics-informed Graph Adversarial Imitation Learning algorithm (PhyGAIL), which adopts centralized training with decentralized execution. PhyGAIL builds bounded local interaction graphs from heterogeneous observations, and uses a physics-informed graph neural network to encode directional local interactions as gated message passing with explicit attraction and repulsion. This gives the policy a physically grounded coordination bias while keeping local observations scale-invariant. It also uses scenario-adaptive imitation learning to improve training under fragmented topologies and variable-length recovery episodes. Our analysis establishes bounded local graph amplification, bounded intera

Impact 4.0 · Import 4.0 · Pop 3.5
#389
arXiv RL 2026-04-17
by Peter Vamplew, Cameron Foale

This research note identifies a previously overlooked distinction between multi-objective reinforcement learning (MORL) and more conventional single-objective reinforcement learning (RL). It has previously been noted that the optimal policy for an MORL agent with a non-linear utility function is required to be conditioned on both the current environmental state and on some measure of the previously accrued reward. This is generally implemented by concatenating the observed state of the environment with the discounted sum of previous rewards to create an augmented state. While augmented states have been widely used in the MORL literature, one implication of their use has not previously been reported: namely, that they require the agent to have continued access to the reward signal (or a proxy thereof) after deployment, even if no further learning is required. This note explains why this is the case, and considers the practical repercussions of this requirement.
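The augmented-state construction the note discusses can be sketched in a few lines (variable names ours); the loop makes the deployment-time requirement concrete, since updating the accrued-reward component needs the reward signal even after learning has stopped.

```python
def augment_state(obs, accrued):
    """Concatenate the environment observation with the vector of
    discounted reward accrued so far; an MORL policy with a non-linear
    utility must condition on this augmented state, not on obs alone."""
    return list(obs) + list(accrued)

gamma = 0.99
accrued = [0.0, 0.0]            # two objectives
obs = [0.5, -1.0]
discount = 1.0
for reward in ([1.0, 0.0], [0.0, 2.0]):
    aug = augment_state(obs, accrued)   # what the policy actually sees
    accrued = [a + discount * r for a, r in zip(accrued, reward)]
    discount *= gamma
# Maintaining `accrued` is exactly why the deployed agent still needs the
# reward signal (or a proxy), even with no further learning.
print(accrued)  # [1.0, 1.98]
```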

Impact 4.0 · Import 4.0 · Pop 3.5
#390
arXiv RL 2026-04-17
by Prashant Rangarajan, Rajesh P. N. Rao

Active inference, a neurally-inspired model for inferring actions based on the free energy principle (FEP), has been proposed as a unifying framework for understanding perception, action, and learning in the brain. Active inference has previously been used to model ecologically important tasks such as navigation and planning, but scaling it to solve complex large-scale problems in real-world environments has remained a challenge. Inspired by the existence of multi-scale hierarchical representations in the brain, we propose a model for planning of actions based on hierarchical active inference. Our approach combines a hierarchical model of the environment with successor representations for efficient planning. We present results demonstrating (1) how lower-level successor representations can be used to learn higher-level abstract states, (2) how planning based on active inference at the lower-level can be used to bootstrap and learn higher-level abstract actions, and (3) how these learned higher-level abstract states and actions can facilitate efficient planning. We illustrate the performance of the approach on several planning and reinforcement learning (RL) problems including a var

Impact 4.0 · Import 4.0 · Pop 3.5
#391
arXiv RL 2026-04-17
by Tim Launer, Jonas Hübotter, Marco Bagatella, Ido Hakimi et al.

We investigate Functional Majority Voting (FMV), a method based on functional consensus for code generation with Large Language Models, which identifies a representative solution from multiple generations using their runtime execution signatures on test inputs. We find that FMV is an effective test-time inference strategy, substantially boosting performance on LiveCodeBench without a large compute overhead. Furthermore, we extend the utility of functional consensus and apply it as an aggregation strategy for label-free Test-Time Reinforcement Learning. We demonstrate that this increases pass@1 on holdout tasks, but find no evidence of self-improvement beyond the base model's performance ceiling.
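Functional consensus as described, grouping candidate programs by their execution signatures on test inputs and keeping a representative of the largest group, can be sketched as follows (details such as error handling are our assumptions):

```python
from collections import Counter

def functional_majority_vote(candidates, test_inputs):
    """Return a representative of the largest set of candidates that
    behave identically (same runtime execution signature) on the inputs."""
    def signature(fn):
        outs = []
        for x in test_inputs:
            try:
                outs.append(repr(fn(x)))
            except Exception:
                outs.append("<error>")   # crashes are part of the signature
        return tuple(outs)

    sigs = [signature(fn) for fn in candidates]
    majority_sig, _ = Counter(sigs).most_common(1)[0]
    return candidates[sigs.index(majority_sig)]

# Three generated "absolute value" programs: two agree functionally,
# one is subtly wrong, so consensus picks a correct representative.
cands = [lambda x: abs(x), lambda x: x if x >= 0 else -x, lambda x: x]
chosen = functional_majority_vote(cands, [-2, 0, 3])
print(chosen(-2))  # 2
```

The key property is that syntactically different but functionally equivalent generations vote together, which is why this aggregation is stronger than exact-text majority voting.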

Impact 4.0 · Import 4.0 · Pop 3.5
#392
arXiv RL 2026-04-17
by Taisuke Kobayashi

This paper proposes a novel method that incorporates empowerment when reasoning actions in reinforcement learning (RL), thereby achieving flexibility in handling the exploration-exploitation dilemma (EED). In previous methods, empowerment for promoting exploration has been provided as a bonus term to the task-specific reward function as an intrinsically-motivated RL. However, this approach introduces a delay until the policy that accounts for empowerment is learned, making it difficult to adjust the emphasis on exploration as needed. On the other hand, a trick devised for fine-tuning recent foundation models at reasoning, so-called best-of-N (BoN) sampling, allows for the implicit acquisition of modified policies without explicitly learning them. It is expected that applying this trick to exploration-promoting terms, such as empowerment, will enable more flexible adjustment of EED. Therefore, this paper investigates BoN sampling for empowerment. Furthermore, to adjust the degree of policy modification in a generalizable manner while maintaining computational cost, this paper proposes a novel BoN sampling method extended by Tsallis statistics. Through toy problems, the proposed method's cabil
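A minimal sketch of plain best-of-N sampling with an empowerment-style bonus (toy values and names, not the paper's Tsallis-statistics extension): sample N candidates from the unmodified policy, then keep the one maximizing task value plus a weighted exploration bonus.

```python
def best_of_n(sample_action, score, n=8):
    """Best-of-N: draw N candidate actions from the base policy and keep
    the score-maximizing one. The modified (exploration-aware) policy is
    obtained implicitly, with no retraining of the sampler."""
    candidates = [sample_action() for _ in range(n)]
    return max(candidates, key=score)

# Toy discrete setting (values illustrative, not from the paper).
task_value = {0: 1.0, 1: 0.2, 2: 0.8}
empowerment = {0: 0.0, 1: 2.0, 2: 0.5}
beta = 1.0  # raising beta tilts the implicit policy toward exploration

draws = iter([2, 0, 1, 0])  # deterministic stand-in for policy sampling
action = best_of_n(lambda: next(draws),
                   lambda a: task_value[a] + beta * empowerment[a], n=4)
print(action)  # 1: at beta = 1.0 the empowerment bonus dominates
```

Because beta only enters the scoring function, the exploration emphasis can be adjusted at inference time, which is the flexibility the reward-bonus formulation lacks.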

Impact 4.0 · Import 4.0 · Pop 3.5
#393
arXiv RL 2026-04-17
by Yuncong Liu, Yuan Wan, Zhou Jiang, Yao Lu

Key Opinion Leader (KOL) discourse on social media is widely consumed as investment guidance, yet turning it into executable trading strategies without injecting assumptions about unspecified execution decisions remains an open problem. We observe that the gaps in KOL statements are not random deficiencies but a structured separation: KOLs express directional intent (what to buy or sell and why) while leaving execution decisions (when, how much, how long) systematically unspecified. Building on this observation, we propose an intent-preserving policy completion framework that treats KOL discourse as a partial trading policy and uses offline reinforcement learning to complete the missing execution decisions around the KOL-expressed intent. Experiments on multimodal KOL discourse from YouTube and X (2022-2025) show that KICL achieves the best return and Sharpe ratio on both platforms while maintaining zero unsupported entries and zero directional reversals, and ablations confirm that the full framework yields an 18.9% return improvement over the KOL-aligned baseline.

Impact 4.0 · Import 4.0 · Pop 3.5
#394
arXiv Agents 2026-04-17
by Matthew Frazier, Kostadin Damevski, Lori Pollock

Secondary school students enrolled in the AP Computer Science Principles (CSP) course commonly utilize web resources (e.g., tutorials, Q&A sites) to better understand key concepts in the curriculum. The primary obstacle to using these resources is finding information appropriate for the learning task and student's background. In addition to web search, conversational agents are increasingly a viable alternative for CSP students. In this paper, we study the potential of conversational agents to aid secondary school students as they acquire knowledge on CSP concepts. We explore general purpose, generative conversational agents (e.g., ChatGPT) and custom, fixed-response conversational agents built specifically to aid CSP students. We present results from classroom use by 45 high school students in grades 9-11 (ages 14-17) across six CSP sections. Our main contributions are in better understanding how conversational agents can help CSP students and an evaluation of the effectiveness and engagement of different approaches for CSP exploratory search.

Impact 4.0 · Import 4.0 · Pop 3.5
#395
arXiv Agents 2026-04-17
by Shaoqing Liu, Mushuang Liu

Computational complexity has been a major challenge in game-theoretic model predictive control (GT-MPC), as real-time solutions to a game (e.g., Nash equilibria (NEs)) have to be computed at each sampling instant of an MPC. This challenge is especially critical in autonomous driving, where interactions may involve many agents, and decisions must be made at fast sampling rates. We show that this challenge can be addressed through time-distributed solution-seeking iterations designed based on, e.g., Newton and Newton–Kantorovich methods. Specifically, the autonomous vehicle decision-making problem is first formulated as a GT-MPC problem. To ensure solution attainability, a potential game framework is adopted. Within this framework, both potential-function optimization and best-response dynamics are used to seek the NE. To enable real-time implementation, Newton and Newton–Kantorovich methods are employed to solve the optimization problems arising in the NE-seeking algorithms, with their iterations distributed over time. Numerical experiments on an intersection-crossing scenario demonstrate that the proposed methods achieve effective real-time performance.
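The time-distribution idea, one solver iteration per sampling instant with warm starting rather than iterating to convergence within an instant, can be illustrated on a scalar stand-in for the NE-seeking optimization (the game-theoretic structure itself is omitted; the quartic cost is our toy choice, picked because it needs many Newton steps):

```python
def newton_step(x, grad, hess):
    """One Newton iteration on the first-order optimality condition."""
    return x - grad(x) / hess(x)

# Minimize (x - 3)^4. Its vanishing Hessian at the optimum forces slow,
# linear convergence, so progress genuinely accrues across instants.
cost_grad = lambda x: 4 * (x - 3) ** 3
cost_hess = lambda x: 12 * (x - 3) ** 2

x = 0.0  # warm start carried across sampling instants
for sampling_instant in range(40):
    x = newton_step(x, cost_grad, cost_hess)  # one iteration per instant
print(round(x, 3))  # 3.0: converged to the optimum over 40 instants
```

Each instant pays for only one iteration, which is what keeps the per-step compute compatible with fast sampling rates.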

Impact 4.0 · Import 4.0 · Pop 3.5
#396
arXiv Agents 2026-04-17
by Maks Pečnik Bambič, Nuno A. M. Araújo, Giorgio Volpe

Collective rotations are common in active matter, enhancing cohesion, transport, and mixing. They are typically attributed to chiral non-reciprocal dynamics due to intrinsic particle chirality, torque-generating interactions among units, or geometric confinement. Here, we uncover a different mechanism for rotational order in active matter where a dynamic environment coordinates the self-organization of non-chiral active particles into living crystals exhibiting sustained collective solid-like rotations. At intermediate densities, feedback from a fluctuating landscape of passive Brownian particles stabilizes large living crystals of obstacle-avoiding run-and-tumble agents. Strikingly, this environmental feedback also produces living crystals with qualitatively distinct dynamics: collective solid-like spinning emerges for particles with long persistence times approaching ballistic motion, rather than for particles moving by conventional enhanced diffusion. Beyond revealing a new route to collective rotational order in active matter, these findings highlight the integral role of a dynamic environment in self-organization and suggest environment-mediated design principles for active ma

Impact 4.0 · Import 4.0 · Pop 3.5
#397
arXiv Agents 2026-04-17
by Florian Furbach, Lucas Clorius, Roland Kuhn, Hernán Melgratti et al.

Swarm protocols are a recently introduced formalism for specifying, implementing, and verifying peer-to-peer systems called swarms. A swarm consists of distributed agents called machines that communicate by asynchronous event propagation. Following a local-first model, each machine can progress without requiring continuous connectivity to other machines. Existing models of swarms are not compositional, making the modular development of large and complex swarm applications as well as the reuse of code difficult. We address these issues by presenting novel theory and techniques for the compositional specification, verification, and implementation of swarms. These results enable the correct compositional reuse of pre-existing swarm protocols and machine implementations. We implement these contributions in a companion software artifact which enables the automatic integration of independently designed and verified swarm components.

Impact 4.0 · Import 4.0 · Pop 3.5
#398
arXiv Agents 2026-04-17
by Koken Hata, Rintaro Chujo, Reina Takamatsu, Wenzhen Xu et al.

Conversational agents have the potential to support intergroup relations when psychological or linguistic barriers prevent direct interaction. Based on intergroup contact theory, we propose GroupEnvoy, a conversational agent that represents outgroup perspectives during ingroup discussions, grounded in transcripts from outgroup-only sessions. To evaluate this approach and derive design principles, we conducted a mixed-methods, between-subjects study with university students, where host-country students formed the ingroup and international students formed the outgroup. Ingroup students performed a collaborative task, receiving outgroup perspectives via GroupEnvoy (experimental) or reading written transcripts (control). Compared to the control group, the experimental group showed greater reduction in intergroup anxiety and greater improvement in perspective-taking. Qualitatively, AI-mediated contact enhanced outcome expectancies, whereas passive exposure fostered future contact intentions. The two conditions also elicited empathy toward distinct targets: outgroup evaluations of the ingroup versus outgroup lived experiences. These findings validate AI-mediated contact as a promising pa

Impact 4.0 · Import 4.0 · Pop 3.5
#399
arXiv Agents 2026-04-17
by Aswini Misro, Vikash Sharma, Shreyank N Gowda

We present Veritas-RPM, a provenance-guided multi-agent architecture comprising five processing layers: VeritasAgent (ground-truth assembly), SentinelLayer (anomaly detection), DirectorAgent (specialist routing), six domain Specialist Agents, and MetaSentinelAgent (conflict resolution and final decision). We construct a 98-case synthetic taxonomy of false-positive scenarios derived from documented RPM patterns. Synthetic patient epochs (n = 530) were generated directly from taxonomy parameters and processed through the pipeline. Ground-truth labels are known for all cases. Performance is reported as True Suppression Rate (TSR), False Escalation Rate (FER), and Indeterminate Rate (INDR).

Impact 4.0 · Import 4.0 · Pop 3.5
#400
arXiv Agents 2026-04-17
by Katharina Stich, Bastian Perner, Friedemann Laue, Torsten Reissland et al.

This paper proposes the LiFE-CD algorithm for convergence time analysis of the max-consensus algorithm in multi-agent systems under Bernoulli-distributed link failures. Unlike existing approaches, which either assume ideal communication or provide asymptotic upper bounds on the expected convergence time, LiFE-CD deterministically computes the full probability distribution of the convergence time from network topology and individual link failure probabilities, without simulation. The full probability distribution enables deadline-aware protocol design with specified reliability guarantees. Based on geometrically distributed link delays, the proposed algorithm iteratively reduces the given network topology considering both unicast and broadcast transmissions. LiFE-CD yields exact results for acyclic networks and, for cyclic networks, tight upper bounds on the convergence time via shortest-path spanning tree construction. Numerical results confirm analytical exactness for acyclic networks, validate tightness for cyclic networks, and demonstrate improvement over existing approaches. Our complexity analysis shows reduced computational cost compared to Monte Carlo simulations, while elim
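For contrast, a Monte Carlo estimate of max-consensus convergence time under Bernoulli link failures, the simulation-based baseline that LiFE-CD's analytical distribution replaces, looks like this (graph and parameters are our toy example):

```python
import random

def max_consensus_time(adj, p_fail, rng):
    """Rounds until every node holds the global max, where each directed
    link delivers its message each round independently with prob 1 - p_fail."""
    values = {v: float(v) for v in adj}        # distinct initial values
    target = max(values.values())
    t = 0
    while any(x < target for x in values.values()):
        t += 1
        new = dict(values)
        for u in adj:
            for v in adj[u]:
                if rng.random() > p_fail:      # link (u, v) succeeds this round
                    new[v] = max(new[v], values[u])
        values = new
    return t

rng = random.Random(1)
line = {0: [1], 1: [0, 2], 2: [1]}             # 3-node line graph
samples = [max_consensus_time(line, p_fail=0.3, rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))  # near 2/0.7 ~ 2.86: two geometric hop delays in series
```

On this line graph the max must cross two links in sequence, so the convergence time is a sum of geometric delays, which is exactly the structure LiFE-CD evaluates exactly instead of sampling.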

Impact 4.0 · Import 4.0 · Pop 3.5
#401
arXiv Agents 2026-04-17
by Yaohui Han, Tianshuo Wang, Zixi Zhao, Zhengchun Zhu et al.

Vision Language Models (VLMs) have been applied to several specific domains and have shown strong problem-solving capabilities. However, astronomical imaging, a quite complex problem involving multidisciplinary knowledge and several subtasks, has not been adequately studied. Because the astronomical imaging process is complex, both world-class astronomical organizations, such as NASA, and expert enthusiasts devote a great deal of time and effort to it: its processing stages have complex underlying correlations that significantly influence one another, making quality diagnosis and error localization of astronomical images challenging. To address this problem, we propose AstroVLM, a collaborative multi-agent system for diagnosing the quality of astronomical images. Experiment results show that AstroVLM outperforms all baselines on real-world astronomical imaging quality diagnosis tasks, providing a reference for language models to handle complicated multi-process tasks.

Impact 4.0 · Import 4.0 · Pop 3.5
#402
arXiv Agents 2026-04-17
by Haoran Wu, Zeyu Cao, Yao Lai, Binglei Lou et al.

Emerging agentic LLM workloads are driving rapidly growing demand on both memory capacity and bandwidth, with different phases of inference (e.g., prefill and decode) imposing distinct requirements. Industry is responding by composing heterogeneous accelerators into single interconnected systems, as exemplified by NVIDIA's Vera Rubin platform, where each device brings its own memory architecture. This heterogeneity is further compounded by a widening landscape of available memory technologies: high-density on-chip SRAM, HBM, LPDDR, GDDR, and emerging options such as high-bandwidth flash (HBF), each offering different capacity, bandwidth, and power trade-offs. Identifying the right memory architecture for next-generation inference accelerators requires navigating a vast and rapidly evolving design space, in which the interplay between workload characteristics, NPU design dimensions, and memory system design remains largely underexplored. To address this challenge, we present MemExplorer, a new memory system synthesizer for heterogeneous NPU systems. MemExplorer provides a unified abstraction for modeling diverse memory technologies across different hierarchy levels (e.g., on-chip an

Impact 4.0 · Import 4.0 · Pop 3.5
#403
arXiv Agents 2026-04-17
by Jiashuo Cao, Chen Li, Wujie Gao, Simon Hoermann et al.

Virtual agents have shown promising potential in mental health applications, but current research has predominantly focused on contexts outside of traditional therapy sessions. This paper examines the impact of a virtual supporter in remote psychotherapy sessions conducted via Zoom. We used a two-phase research approach. First, we conducted a formative study to understand the roles and functions of human supporters in psychotherapy contexts. Based on these findings, we developed a virtual supporter operating in two modes: Daily Mode (for mood journaling outside therapy) and Therapy Mode (as an additional participant in Zoom therapy sessions). Finally, we ran a user study with 14 participants who engaged with the virtual supporter for a week and then joined a remote psychotherapy session together. Our findings revealed that the virtual supporter had positive effects on creating psychological safety, reducing anxiety, and enhancing emotional articulation without disrupting the therapeutic process. We then discuss both the benefits and potential disadvantages of virtual supporters in therapeutic contexts, including concerns about over-reliance and the need for appropriate boundaries.

Impact 4.0 · Import 4.0 · Pop 3.5
#404
arXiv Agents 2026-04-17
by Xujin Chen, Xiaodong Hu, Changjun Wang, Yuchun Xiong et al.

This paper studies an online trading variant of the classical secretary problem, called secretary problem variant trading (SPVT), from the perspective of an intermediary who facilitates trade between a seller and $n$ buyers (collectively referred to as agents). The seller has an item, and each buyer demands the item. These agents arrive sequentially in a uniformly random order to meet the intermediary, each revealing their valuation of the item upon arrival. After each arrival, the intermediary must make an immediate and irrevocable decision before the next agent appears. The intermediary's objective is to maximize the price of the agent who ultimately holds the item at the end of the process. We evaluate the performance of online algorithms for SPVT using two notions of competitive ratio: strong and weak. The strong notion benchmarks the online algorithm against a powerful offline optimum: the highest price among the $n+1$ agents. We propose an online algorithm for SPVT achieving a strong competitive ratio of $\frac{4e^2}{e^2+1} \approx 3.523$, which is the best possible even when the seller's price may be zero. This tight ratio closes the gap between the previous best upper bound
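The paper does not spell out its algorithm in the abstract, but the classical wait-then-pick secretary rule that SPVT-style algorithms generalize can be sketched in a few lines; the function name below is illustrative, not from the paper. The snippet also evaluates the reported tight ratio $\frac{4e^2}{e^2+1}$.

```python
import math

def classic_secretary(values):
    """Classical 1/e rule: observe the first n/e arrivals without
    committing, then accept the first later arrival that beats
    every value seen so far (decisions are irrevocable)."""
    n = len(values)
    cutoff = max(1, round(n / math.e))
    best_seen = max(values[:cutoff])
    for v in values[cutoff:]:
        if v > best_seen:
            return v          # accept this agent on the spot
    return values[-1]         # forced to take the last arrival

# The tight strong competitive ratio reported for SPVT:
ratio = 4 * math.e**2 / (math.e**2 + 1)
print(round(ratio, 3))  # 3.523
```

The 1/e rule is only a baseline for intuition; SPVT's intermediary setting (with $n+1$ agents and a seller-side price) requires the more involved algorithm the paper proposes.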

Impact 4.0 · Import 4.0 · Pop 3.5
#405
arXiv Agents 2026-04-17
by Bo Zhao, Kairui Guo, Runnan Du, Haiyang Sun et al.

Instruction guided image editing has advanced substantially with recent generative models, yet it still fails to produce reliable results across many seemingly simple cases. We observe that a large portion of these failures stem not from insufficient model capacity, but from poorly formulated editing tasks, such as those involving small targets, implicit spatial relations, or under-specified instructions. In this work, we frame image editing failures as a task formulation problem and propose an adaptive task reformulation framework that improves editing performance without modifying the underlying model. Our key idea is to transform the original image-instruction pair into a sequence of operations that are dynamically determined and executed by an MLLM agent through analysis, routing, reformulation, and feedback-driven refinement. Experiments on multiple benchmarks, including ImgEdit, PICA, and RePlan, across diverse editing backbones such as Qwen Image Edit and Nano Banana, show consistent improvements, with especially large gains on challenging cases. These results suggest that task reformulation is a critical but underexplored factor, and that substantial gains can be achieved by

Impact 4.0 · Import 4.0 · Pop 3.5
#406
arXiv Agents 2026-04-17
by Xiquan Li, Aurian Quelennec, Slim Essid

Music understanding and reasoning are central challenges in the Music Information Research field, with applications ranging from retrieval and recommendation to music agents and virtual assistants. Recent Large Audio-Language Models (LALMs) have shown remarkable progress in answering music-related questions by following user instructions. However, their massive scale, often billions of parameters, results in expensive training, slow inference, and limited deployability on edge devices. In this work, we present TinyMU, a lightweight (229M) Music-Language Model (MLM) that achieves performance comparable to much larger LALMs while remaining efficient and compact. To train TinyMU, we introduce MusicSkills-3.5M, a carefully curated, music-grounded question-answering dataset with 3.5M samples. Spanning multiple-choice, binary, and open-ended formats, this dataset provides fine-grained supervision across diverse musical concepts. For its architecture, TinyMU leverages MATPAC++, the SOTA self-supervised audio encoder for fine-grained feature extraction. Paired with a lightweight linear projector, it efficiently aligns audio embeddings with the language model. Through extensive evaluation,

Impact 4.0 · Import 4.0 · Pop 3.5
#407
arXiv Agents 2026-04-17
by Ze Dong, Hao Shi, Zejia Gao, Zhonghua Yi et al.

Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on realistic egocentric screen-view observations. Training o

Impact 4.0 · Import 4.0 · Pop 3.5
#408
arXiv Agents 2026-04-17
by Tianqi Luo, Leixian Shen, Yuyu Luo

Agentic visual analytics (VA) represents an emerging class of systems in which large language model (LLM)-driven agents autonomously plan, execute, evaluate, and iterate across the full visual analytics pipeline. By shifting users from low-level tool operations to high-level analytical goals expressed through natural language, these systems are fundamentally transforming how humans interact with data. However, the rapid proliferation of such systems in recent years has outpaced our understanding of their design landscape. Two intertwined problems remain open: how do autonomous agents reshape the traditional VA pipeline, and how must human involvement adapt as agent autonomy increases? To address these questions, this paper presents a comprehensive survey of 55 primary agentic VA systems and introduces a co-evolutionary framework. This framework is essential because it jointly analyzes the progression of agent autonomy alongside the necessary shift in human roles from manual operators to strategic supervisors. Within this framework, we define a role-workflow taxonomy that aligns four key agentic roles (PLANNER, CREATOR, REVIEWER, and CONTEXT MANAGER) and maps them onto established V

Impact 4.0 · Import 4.0 · Pop 3.5
#409
arXiv MechInterp 2026-04-17
by Dorival Gonçalves, Ajay Kaladharan, Alberto Navarro

Angular correlations in Higgs decays to electroweak gauge bosons, $h \to ZZ^*, WW^*$, provide a powerful probe of both new physics effects and quantum information observables. We present a systematic study of semi-leptonic decays $h \to V V^* \to \ell^+\ell^- q\bar{q}$ and $\ell^\pm ν_\ell q\bar{q}'$, including finite final state fermion masses, NLO QCD, and NLO electroweak corrections. We show that finite final state quark masses can induce effects that go beyond the two-qutrit description in more inclusive regimes, while remaining controllable with suitable kinematic selections. QCD corrections lead to modest percent-level shifts, whereas electroweak corrections can significantly modify the angular structure, particularly in the $h\to ZZ^*$ channels. We assess the impact of these effects on the reconstructed density matrix and entanglement measures, finding that, while they modify the angular observables, semi-leptonic channels retain an effective two-qutrit description.

Impact 4.0 · Import 4.0 · Pop 3.5
#410
arXiv MechInterp 2026-04-17
by Wieland Schöllkopf, Sandy Gewinner, Marco De Pas, Heinz Junkes et al.

We report on the design and performance of a two-color dual-oscillator infrared free-electron laser (FEL). The mid-infrared (MIR) FEL at the Fritz Haber Institute (FHI FEL) has been upgraded to include a second oscillator FEL beamline that permits lasing in the far-infrared (FIR) regime from 4.5 μm to 175 μm. In addition, a 500 MHz kicker cavity has been installed downstream of the electron accelerator. It can deflect electron bunches of up to 50 MeV energy alternately left and right by an angle of ±2°, and can thus split the high-repetition-rate (1 GHz) electron bunch train from the accelerator into two bunch trains of 500 MHz repetition rate each; one is steered to the MIR FEL and the other one to the new FIR FEL. In this two-color mode of simultaneous, synchronized operation the wavelengths in both FELs can be tuned independently over wide ranges of up to a factor of four each by undulator-gap variation. In addition, two-color operation is also available at reduced repetition rates (e.g. 55.6 MHz of both MIR and FIR pulses), as needed for some applications. This unique two-color mode opens up a wealth of novel user applications such as MIR-FIR pump-probe experiments.

Impact 4.0 · Import 4.0 · Pop 3.5
#411
arXiv MechInterp 2026-04-17
by Toru Harada, Yoshiharu Hirabayashi

We study the production of $^3_Λ$H and $^4_Λ$H in the $^{3,4}$He($K^-$,$π^0$) reactions at $p_{K^-}=1.0$~GeV/$c$ within the distorted-wave impulse approximation, using the optimal Fermi-averaged $K^-p\toπ^0Λ$ amplitude. Because the $^3_Λ$H ground state is extremely weakly bound, the $d$--$Λ$ wave function becomes spatially extended. We calculate the integrated cross sections $σ_{\rm lab}$ and their ratio $R_{34}=σ_{\rm lab}(^3_Λ{\rm H})/σ_{\rm lab}(^4_Λ{\rm H})$ for forward angles $θ_{\rm lab}=0^\circ$--$20^\circ$. The production strength of $^3_Λ$H and the ratio $R_{34}$ are strongly sensitive to the $Λ$ binding energy $B_Λ$, which is constrained to be approximately 0.05--0.15~MeV by comparison with experimental data from the J-PARC E73 experiment. This indicates that the $^3$He($K^-$,$π^0$) reaction provides a sensitive probe of the weak binding of $^3_Λ$H.

Impact 4.0 · Import 4.0 · Pop 3.5
#412
arXiv MechInterp 2026-04-17
by Euclid Collaboration, A. A. Tumborang, K. I. Caputi, P. Rinaldi et al.

Little Red Dots (LRDs) are some of the most intriguing galaxy populations recently identified at z>~4 with JWST. They constitute the most extreme class of a more abundant population of sources with `V-shaped' spectral energy distributions (SEDs) and compact morphologies, which also includes Little Blue Dots (LBDs). Finding brighter analogues to these sources requires surveying sky areas which are significantly larger than those covered with JWST. Euclid deep images are ideally suited for this purpose. We make use of Euclid near-infrared images, complemented by Spitzer Infrared Array Camera (IRAC) data, over 0.75 sq. deg. of the COSMOS field to select a sample of 233 sources with `V-shaped' SEDs at z>4. Out of those, we identify 16 sources with compactness >1sigma above the median of all z>4 galaxies, which we consider robust LRD/LBD candidates in our sample. The stellar masses of these 16 sources are in the range 10^{8.5} - 10^{10.5} Msun, so they are significantly more massive than typical JWST-selected LRDs/LBDs. Interestingly, half of them are about as old as the Universe at their redshifts. In addition, we find that the median photometric properties of the Euclid LRDs/LBDs are sim

Impact 4.0 · Import 4.0 · Pop 3.5
#413
arXiv MechInterp 2026-04-17
by Julia Gehrlein, Jaime Hoefken Zink, Pedro A. N. Machado, João Paulo Pinheiro

At long-baseline neutrino experiments, neutral-current (NC) events accumulate in large numbers but are seldom exploited for new physics searches. We demonstrate their potential using non-standard neutrino interactions (NSI) with quarks as a case study. Charged-current (CC) analyses constrain NSI through matter effects on neutrino propagation, which probe almost exclusively the isoscalar combination of up- and down-quark couplings; the orthogonal isovector combination is suppressed by a factor of $\sim$100. Because NSI also modify NC cross sections in a flavor-dependent way, NC events become sensitive to oscillations: the far-to-near detector ratio acquires a dependence on the beam's flavor composition that probes both isoscalar and isovector couplings with comparable weight. Using existing NOvA data and DUNE projections, we derive the first bounded constraints on isovector NSI from a long-baseline experiment and show that combining CC and NC measurements resolves the individual quark couplings, breaking a degeneracy that persists in either analysis alone.

Impact 4.0 · Import 4.0 · Pop 3.5
#414
arXiv MechInterp 2026-04-17
by Peter F. Wyper, Jonathan Squire, Etienne Pariat, Oleksiy V. Agapitov et al.

Magnetic switchbacks are large amplitude deflections of the magnetic field within the solar wind. They are Alfvénic in character and so are associated with a spike in velocity and a generally small variation in local plasma density. Early orbits of Parker Solar Probe revealed that the solar wind near the Sun is dominated by these structures, and therefore, they may be playing an important role in the energy budget and acceleration of the young solar wind. In this review, we present an overview of different mechanisms that have been proposed for how switchbacks could be formed. We group the mechanisms by whether they predominantly act in the low solar atmosphere or within the solar wind (in situ). We focus on mechanisms that can create reversals of the ambient magnetic field direction and, thus, account for the most extreme perturbations. The general consensus is that mechanisms in the lower solar atmosphere do not form such reversals on their own but provide the seed perturbations, flows, or particle beams necessary for in situ mechanisms to create switchbacks within the solar wind. Switchback observations thus likely contain an imprint of the coronal source of the seed perturbatio

Impact 4.0 · Import 4.0 · Pop 3.5
#415
arXiv MechInterp 2026-04-17
by Long Xiong, Xiaoyang Wang, Xiaoxia Cai, Xiao Yuan

Nonlinear spectroscopy is a cornerstone of quantum science, providing unique access to multi-point correlations, quantum coherence, and couplings that are invisible to linear methods. However, classical simulation of these phenomena is fundamentally limited by the exponential growth of the Hilbert space, and practical quantum algorithms for the nonlinear regime have remained largely unexplored. Here, we present a unified quantum algorithmic framework for computing $n$-th order nonlinear spectroscopies. By reformulating multi-time responses as a weighted sum of expectation values at finite pump amplitudes via a generalized parameter shift rule, our approach bypasses the costly evaluation of high-order commutators and time-dependent operator expansions. This reformulation enables efficient execution via real-time evolution on current quantum hardware, ensuring inherent noise resilience. We validate the framework on IBM's superconducting quantum processors, successfully obtaining higher-order response functions of a 12-qubit XXZ spin chain. Furthermore, the versatility of our method is demonstrated by resolving quasi-particle excitation spectra in spin-liquids and identifying interaction

Impact 4.0 · Import 4.0 · Pop 3.5
#416
arXiv MechInterp 2026-04-17
by J Krishnamoorthi, Anil Kumar, Sanjib Kumar Agarwalla

Atmospheric neutrinos provide a unique avenue to probe theories beyond the Standard Model (BSM) over a wide range of energies and path lengths. The theory of non-standard interactions (NSI) of neutrinos is one of the important BSM scenarios, which can modify flavor oscillations of atmospheric neutrinos traveling through the Earth. In this work, we use a high-purity $ν_μ$ charged-current (CC) sample of atmospheric neutrinos from IceCube DeepCore with a livetime of 7.5 years to search for the NSI parameters $\varepsilon_{eμ}$, $\varepsilon_{eτ}$, and $\varepsilon_{ee}-\varepsilon_{μμ}$. The $ν_μ$ CC events mainly come from the $ν_μ$ survival channel having no significant dependence on $δ_{\rm CP}$. Therefore, the constraints on $\varepsilon_{eμ}$ and $\varepsilon_{eτ}$ obtained using this $ν_μ$ CC sample are expected to be free from the $δ_{\rm CP}$-degeneracy. The data sample is found to be in agreement with the standard neutrino interactions. Therefore, we place bounds on these NSI parameters that are consistent with and comparable to existing experimental constraints. These $δ_{\rm CP}$-free constraints from IceCube DeepCore are complementary to those from the long-baseline neutri

Impact 4.0 · Import 4.0 · Pop 3.5
#417
arXiv MechInterp 2026-04-17
by Joao R. L. Santos, Guillem Domènech, Amilcar R. Queiroz

Fast Radio Bursts (FRBs) are among the most intriguing phenomena observed in radio astronomy. So far, about 130 FRB signals have been confirmed and characterized by different surveys, and the CHIME telescope has recently reported a new catalog of 4539 bursts. Therefore, these numbers are expected to increase in the coming years. The detection, or lack thereof, of lensed FRB events can be used to probe Primordial Black Holes (PBHs) as a fraction of dark matter. We investigate the potential of three upcoming radio telescopes, LOFAR2.0, FAST Core Array, and BINGO, to test the PBH scenario. We forecast that LOFAR2.0 will constrain $f_{\mathrm{PBH}}$ for $M_{\rm PBH}>1\,{M_{\odot}}$, while FAST Core Array and BINGO will restrict $f_{\mathrm{PBH}}$ for $M_{\rm PBH}>10\,{M_{\odot}}$ and $M_{\rm PBH}>10^{-2}\,{M_{\odot}}$, respectively. Despite the existence of stricter constraints, FRB lensing offers an independent and complementary probe of PBHs in the Universe, which will improve in the future.

Impact 4.0 · Import 4.0 · Pop 3.5
#418
arXiv MechInterp 2026-04-17
by C. A. S. Almeida

We propose an effective non-relativistic framework in which wave-function collapse emerges as a deterministic dynamical instability induced by gravitational self-interaction and regulated by short-distance repulsion. The dynamics is described by a nonlinear Schrödinger equation supplemented by a phenomenological repulsive sector ensuring regularity at high densities. Using a variational Gaussian ansatz, we derive an explicit effective energy functional and show that extended quantum states lose stability beyond a critical mass scale. This loss of stability is associated with a bifurcation in the reduced dynamical system governing the wave-function width, leading to the emergence of stable localized configurations. Within this picture, collapse corresponds to the dynamical selection of one of these localized attractors, driven by infinitesimal asymmetries in the initial state and occurring without stochastic noise or environmental coupling. The mechanism provides a controlled and quantitative realization of gravity-induced localization, extending Schrödinger--Newton-type models while avoiding their pathological short-distance behavior. Possible implications for mesoscopic systems pr

Impact 4.0 · Import 4.0 · Pop 3.5
#419
arXiv MechInterp 2026-04-17
by Kishore C. Patra, Emily R. Liepold, Nicholas Earl, Ryan J. Foley et al.

Off-nuclear tidal disruption events (TDEs) provide a rare probe of massive black holes (MBHs) outside galactic nuclei. Only a handful are known, including five X-ray-selected candidates and two optically selected events. We present observations of TDE 2025abcr, the second optically selected off-nuclear TDE, discovered at a projected offset of $9.08 \pm 0.02$ kpc from the nucleus of its host galaxy. We analyze X-ray, UV, optical, and infrared (IR) data from Swift, Keck, ZTF, and JWST. Broad H and He emission lines in the optical and IR confirm a TDE-H-He classification. From luminosity scaling relations and MOSFiT modeling, we infer a BH mass of $10^{6}$-$10^{7}\,M_{\odot}$, substantially smaller than the $10^{8.35 \pm 0.41}\,M_{\odot}$ BH inferred for the host-galaxy nucleus. We observe velocity evolution in the N III + He II emission complex, shifting from $-500$ km s$^{-1}$ at day $-7$ to $+1000$ km s$^{-1}$ by day $+29$, which we interpret as radiative transfer effects in an evolving reprocessing layer. The IR SED deviates from a thermal blackbody, with $νL_ν \propto λ^{-2.13 \pm 0.04}$, significantly shallower than the Rayleigh-Jeans slope of $λ^{-3}$. We rule out dust as the s

Impact 4.0 · Import 4.0 · Pop 3.5
#420
arXiv MechInterp 2026-04-17
by Caroline Tornow, Julia Rupprecht, Pascal Engeler, Ute Drechsler et al.

Topological insulators are typically characterized by particularly stable properties, such as global invariants, and can be identified by probing their robust surface states. A recently discovered novel form of band topology, delicate topology, challenges this paradigm: its defining property, multicellularity, can be removed by introducing a coupling to local orbitals anywhere in the spectrum, even far above the relevant band gap. This makes it hard to diagnose delicate topology with conventional probes that access only low-energy degrees of freedom. Here, we introduce strong local impurities as a spectroscopic probe of a delicate topological insulator which we realize in a phononic metamaterial. By tuning the impurity strength and performing orbital-resolved readout, we observe recently proposed indicators of topology: ring states, in-gap bound states whose frequencies remain pinned in the strong-impurity limit while their real-space profiles form a pronounced ring around the impurity site. We find that these ring states persist even when the multicellularity in our system is removed by a weakly hybridizing additional orbital. Our results establish impurity-induced ring states as

Impact 4.0 · Import 4.0 · Pop 3.5
#421
arXiv MechInterp 2026-04-17
by Debtroy Das, Swarnim Shashank, Cosimo Bambi

Gravitational wave (GW) observations of binary black hole (BBH) mergers provide a unique opportunity to probe the nature of spacetime in the strong-field and dynamical regime. We present updated constraints on deviations from the Kerr metric using BBH inspirals from the fourth Gravitational-Wave Transient Catalog (GWTC-4). Building on our previous GWTC-3 analysis, we employ a theory-agnostic framework in which non-Kerr effects of the Johannsen metric are incorporated as parametrized corrections to the GW phase within the post-Newtonian framework. We perform Bayesian parameter estimation on a selected subset of GWTC-4 events to constrain the deformation parameters $α_{13}$ and $ε_3$, yielding significantly tighter bounds compared to earlier results. When varied individually, the deformation parameters are found to be consistent with zero, providing no evidence for departures from the Kerr geometry. Our results reinforce the validity of General Relativity, particularly the Kerr hypothesis, and highlight the progress enabled by GWTC-4.

Impact 4.0 · Import 4.0 · Pop 3.5
#422
arXiv MechInterp 2026-04-17
by Z. Tavukoğlu, A. T. Olgun, K. Azizi

We investigate rare special dileptonic decays of $ Λ_b$, $Σ_b$ and $Ξ_b $ baryons in the Standard Model and in the context of the general Two-Higgs-Doublet Model with Type III. Specifically, we consider the decays $ Λ_b \rightarrow Λ\ell^+ \ell^-$, $Σ_b \rightarrow Σ\ell^+ \ell^-$ and $Ξ_b \rightarrow Ξ\ell^+ \ell^-$, where $\ell$ represents a $μ$ or $τ$ lepton. By studying these rare decays, we aim to assess the impact of the Two-Higgs-Doublet Model with Type III on various observables, such as the differential branching ratio, total branching ratio, and lepton forward-backward asymmetries, using the decay amplitude and the transition matrix elements in terms of form factors calculated via light cone QCD in the full theory. We compare our results to those of the Standard Model, as well as existing lattice QCD predictions and experimental data, to assess the agreement and viability of the Two-Higgs-Doublet Model with Type III. Furthermore, we highlight the potential for experimental investigations of these decay channels in the near future. The soon-to-be updated LHCb and/or Belle II detectors, renowned for their capabilities in studying rare decays, present excellent opportunities for probing t

Impact 4.0 · Import 4.0 · Pop 3.5
#423
arXiv GenMedia 2026-04-17
by Jun Li, Lizhi Xiong, Ziqiang Li, Weiwei Jiang et al.

Text-to-image generative models have achieved impressive fidelity and diversity, but can inadvertently produce unsafe or undesirable content due to implicit biases embedded in large-scale training datasets. Existing concept erasure methods, whether text-only or image-assisted, face trade-offs: textual approaches often fail to fully suppress concepts, while naive image-guided methods risk over-erasing unrelated content. We propose TICoE, a Text-Image Collaborative Erasing framework that achieves precise and faithful concept removal through a continuous convex concept manifold and hierarchical visual representation learning. TICoE precisely removes target concepts while preserving unrelated semantic and visual content. To objectively assess the quality of erasure, we further introduce a fidelity-oriented evaluation strategy that measures post-erasure usability. Experiments on multiple benchmarks show that TICoE surpasses prior methods in concept removal precision and content fidelity, enabling safer, more controllable text-to-image generation. Our code is available at https://github.com/OpenAscent-L/TICoE.git

Impact 4.0 · Import 4.0 · Pop 3.5
#424
arXiv GenMedia 2026-04-17
by Jingyuan Li, Xiaoyi Jiang, Fukang Wen, Wei Liu et al.

Discrete diffusion models based on continuous-time Markov chains (CTMCs) have shown strong performance on language and discrete data generation, yet existing approaches typically parameterize the reverse rate matrix as a single object -- via concrete scores, clean-data predictions ($x_0$-parameterization), or denoising distributions -- rather than aligning the parameterization with the intrinsic CTMC decomposition into jump timing and jump direction. Since a CTMC is fundamentally a Poisson process fully determined by these two quantities, decomposing along this structure is closer to first principles and naturally leads to our formulation. We propose \textbf{Neural CTMC}, which separately parameterizes the reverse process through an \emph{exit rate} (when to jump) and a \emph{jump distribution} (where to jump) using two dedicated network heads. We show that the evidence lower bound (ELBO) differs from a path-space KL divergence between the true and learned reverse processes by a $θ$-independent constant, so that the training objective is fully governed by the exit rate and jump distribution we parameterize. Moreover, this KL factorizes into a Poisson KL for timing and a categorical
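The timing/direction decomposition at the heart of this abstract is just the standard generative description of a CTMC as a marked Poisson process. The sketch below simulates one reverse-chain event from toy "two-head" outputs; all names and numbers are illustrative and not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_jump(exit_rate, jump_probs):
    """One reverse-CTMC event, decomposed as the abstract describes:
    *when* to jump (exponential dwell time governed by the exit rate)
    and *where* to jump (a categorical over destination states)."""
    dwell = rng.exponential(1.0 / exit_rate)            # timing head
    dest = rng.choice(len(jump_probs), p=jump_probs)    # direction head
    return dwell, dest

# Toy head outputs for a 4-state chain (illustrative numbers):
exit_rate = 2.0                        # expected dwell time = 0.5
jump_probs = np.array([0.1, 0.2, 0.3, 0.4])
dwell, dest = sample_jump(exit_rate, jump_probs)
```

In the Neural CTMC formulation, `exit_rate` and `jump_probs` would each come from a dedicated network head, and the ELBO splits accordingly into a Poisson term for timing and a categorical term for direction.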

Impact 4.0 · Import 4.0 · Pop 3.5
#425
arXiv GenMedia 2026-04-17
by Sumit Chaturvedi, Yannick Hold-Geoffroy, Mengwei Ren, Jingyuan Liu et al.

This paper presents a method for image relighting that enables precise and continuous control over multiple illumination attributes in a photograph. We formulate relighting as a conditional image generation task and introduce attribute tokens to encode distinct lighting factors such as intensity, color, ambient illumination, diffuse level, and 3D light positions. The model is trained on a large-scale synthetic dataset with ground-truth lighting annotations, supplemented by a small set of real captures to enhance realism and generalization. We validate our approach across a variety of relighting tasks, including controlling in-scene lighting fixtures and editing environment illumination using virtual light sources, on synthetic and real images. Our method achieves state-of-the-art quantitative and qualitative performance compared to prior work. Remarkably, without explicit inverse rendering supervision, the model exhibits an inherent understanding of how light interacts with scene geometry, occlusion, and materials, yielding convincing lighting effects even in traditionally challenging scenarios such as placing lights within objects or relighting transparent materials plausibly. Pro

Impact 4.0 · Import 4.0 · Pop 3.5
#426
arXiv GenMedia 2026-04-17
by Huanran Hu, Zihui Ren, Dingyi Yang, Liangyu Chen et al.

Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and lack support for evaluating this full process. To address this gap, we propose Multimodal Context-to-Script Creation (MCSC), a new task that transforms noisy multimodal inputs and user instructions into structured, executable video scripts. We further introduce MCSC-Bench, the first large-scale MCSC dataset, comprising 11K+ well-annotated videos. Each sample includes: (1) redundant multimodal materials and user instructions; (2) a coherent, production-ready script containing material-based shots, newly planned shots (with shooting instructions), and shot-aligned voiceovers. MCSC-Bench supports comprehensive evaluation across material selection, narrative planning, and conditioned script generation, and includes both in-domain and out-of-domain test sets. Experiments show that current multimodal LLMs struggle with structure-aware reasoning under long contexts, highlighting the challenges posed by our benchmark

Impact 4.0 · Import 4.0 · Pop 3.5
#427
arXiv PostTraining 2026-04-17
by Yuanhang Luo, Shuxing Fang, Ruijian Han, Yiming Xu

Classical latent-score ranking models often fail to distinguish objects' intrinsic scores from contextual effects, which are typically nonlinear and can dominate the observed outcomes. To address this, we introduce a semiparametric ranking framework in which the log-score of each object is modeled as the sum of a utility parameter and a nonparametric covariate effect. Within this framework, we establish model identifiability under mild regularity and connectivity conditions. For estimation, we approximate the covariate effect using a neural network and estimate the parameters via maximum likelihood. Under random design assumptions, we prove that the resulting estimator exists with high probability and derive non-asymptotic error bounds that achieve minimax optimality for both the parametric and nonparametric components. Numerical experiments on both synthetic data and an ATP tennis dataset are conducted to support our findings.
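An additive log-score model of this kind implies a Bradley-Terry-style pairwise comparison probability, which the sketch below illustrates with a fixed covariate effect; `win_prob` and the numbers are illustrative, not from the paper, where the covariate effect is a learned neural network.

```python
import math

def win_prob(u_i, u_j, g_xi, g_xj):
    """P(i beats j) when each object's log-score is its utility
    parameter plus a nonparametric covariate effect, compared via
    a Bradley-Terry (logistic) rule."""
    s_i = u_i + g_xi   # log-score of object i in this context
    s_j = u_j + g_xj   # log-score of object j in this context
    return 1.0 / (1.0 + math.exp(s_j - s_i))

# Equal utilities, but context gives i a one-unit edge:
p = win_prob(0.0, 0.0, 1.0, 0.0)
print(round(p, 3))  # 0.731
```

This makes the identifiability question concrete: with no constraints, a shift moved between the utilities and the covariate effect leaves every `win_prob` unchanged, which is why the paper needs its regularity and connectivity conditions.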

Impact 4.0 · Import 4.0 · Pop 3.5
#428
arXiv PostTraining 2026-04-17
by Taewoong Kang, Hyojin Jang, Sohyun Jeong, Seunggi Moon et al.

Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping, where one's head is seamlessly integrated with another's body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identit

Impact 4.0 · Import 4.0 · Pop 3.5
#429
arXiv PostTraining 2026-04-17
by Jun-Chao Liang, Yin-Bi Li, A-Li Luo, Shuo Li et al.

To improve the accuracy and efficiency of high-dimensional stellar parameter inference in large spectroscopic datasets, we propose a projection-assisted parameter-inference framework -- Projected-Space Inference of Stellar Parameters (PISP). PISP constructs an orthonormal basis and optimizes in the projected space, reducing the impact of parameter correlations on inference. The basis is constructed using either principal component analysis (PCA) or the active-subspace (AS) method and is combined with two inference strategies -- Non-L1, which optimizes the projection coefficients for a user-specified projected dimensionality, and L1, which introduces L1 regularization in the full projected space to adaptively select projection directions -- yielding four strategies: PCA-Non-L1, AS-Non-L1, PCA-L1, and AS-L1. For different computational scenarios, we implement two versions: PISP-CurveFit for fast single-spectrum inference and PISP-Adam for large-scale GPU-parallel inference. Using a fully connected neural network and a residual network as spectral emulators, we evaluate PISP on Kurucz synthetic spectra and on 722,896 APOGEE DR17 observed spectra. Compared to the baseline strateg
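In symbols, the projected-space idea can be sketched as follows; the notation ($F$ for the spectral emulator, $y$ for the observed spectrum, $\bar{\theta}$ for a reference parameter vector) is assumed here rather than taken from the paper:

```latex
% Orthonormal basis V from PCA or the active-subspace method;
% parameters are optimized in the projected space \theta = \bar{\theta} + Vz.
% Non-L1: fix the projected dimensionality k and optimize over z in R^k
\hat{z} = \operatorname*{arg\,min}_{z \in \mathbb{R}^k} \, \big\| F(\bar{\theta} + V z) - y \big\|^2
% L1: optimize in the full projected space, with an L1 penalty that
% adaptively selects projection directions
\hat{z} = \operatorname*{arg\,min}_{z} \, \big\| F(\bar{\theta} + V z) - y \big\|^2 + \lambda \lVert z \rVert_1
```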

Impact 4.0 · Import 4.0 · Pop 3.5
#430
arXiv PostTraining 2026-04-17
by Xiaoyu Yang, En Yu, Wei Duan, Jie Lu

Reinforcement Fine-Tuning (RFT) has established itself as a critical paradigm for aligning Multi-modal Large Language Models (MLLMs) with complex human values and domain-specific requirements. Nevertheless, current research primarily focuses on mitigating exogenous distribution shifts arising from data-centric factors, while the non-stationarity inherent in endogenous reasoning remains largely unexplored. In this work, a critical vulnerability is revealed within MLLMs: they are highly susceptible to endogenous reasoning drift, across both thinking and perception perspectives. It manifests as unpredictable distribution changes that emerge spontaneously during the autoregressive generation process, independent of external environmental perturbations. To address this, we first theoretically define endogenous reasoning drift within the RFT of MLLMs as multi-modal concept drift. In this context, this paper proposes Counterfactual Preference Optimization++ (CPO++), a comprehensive and autonomous framework adapted to multi-modal concept drift. It integrates counterfactual reasoning with domain knowledge to execute controlled perturbations across thinking and perception, employi

Impact 4.0 · Import 4.0 · Pop 3.5
#431
arXiv PostTraining 2026-04-17
by Jixuan Leng, Si Si, Hsiang-Fu Yu, Vinod Raman et al.

Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in preference datasets that typically contain multiple candidate responses. Motivated by this limitation, recent work explores group-wise preference optimization, which jointly contrasts multiple responses for the same prompt, but its empirical behavior and scalability remain underexplored due to the memory overhead of group-coupled objectives. In this work, we introduce a memory-efficient group-wise preference optimization algorithm that preserves gradients while decoupling samples during backpropagation, substantially reducing peak memory usage, which enables scalable training with larger group sizes. Across both offline and online alignment settings, we show that leveraging multiple responses consistently outperforms single-pair training. Furthermore, incorporating a negative log-likelihood (NLL) term on positive responses is critical for both performance gains and training stability.
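The core trick described here (preserving gradients while decoupling samples during backpropagation) can be illustrated on a toy linear scorer of my own construction, not the paper's method: for a group-wise softmax objective, the gradient with respect to each response's score is just the softmax minus the one-hot preference, so the backward pass can run one sample at a time against those cached scalars.

```python
import math

# Toy linear scorer: reward r_i = w . x_i for each candidate response x_i.
# Group-wise objective: L = -log softmax(r)[pos], contrasting all candidate
# responses for one prompt jointly. (Illustrative setup, not the paper's model.)
w = [0.1, -0.2, 0.3]
group = [[1.0, 0.0, 2.0],
         [0.5, 1.0, -1.0],
         [-0.3, 0.8, 0.4],
         [2.0, -1.5, 0.7]]
pos = 0  # index of the preferred response

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def loss(w):
    r = [dot(w, x) for x in group]
    m = max(r)
    log_z = m + math.log(sum(math.exp(v - m) for v in r))
    return log_z - r[pos]

# Pass 1: score the whole group; keep only the scalar gradients dL/dr_i.
r = [dot(w, x) for x in group]
m = max(r)
e = [math.exp(v - m) for v in r]
p = [v / sum(e) for v in e]                    # softmax over the group
dL_dr = [p[i] - (1.0 if i == pos else 0.0) for i in range(len(group))]

# Pass 2: backpropagate one sample at a time using the cached dL/dr_i, so
# only one sample's activations need to be live at any moment.
grad = [0.0] * len(w)
for i, x in enumerate(group):
    for j in range(len(w)):                    # d r_i / d w_j = x_j here
        grad[j] += dL_dr[i] * x[j]

# Finite-difference check: the decoupled gradient matches the coupled loss.
eps = 1e-6
for j in range(len(w)):
    wp = list(w); wp[j] += eps
    wm = list(w); wm[j] -= eps
    assert abs((loss(wp) - loss(wm)) / (2 * eps) - grad[j]) < 1e-5
```

The same identity is what makes group-coupled objectives scalable: peak memory grows with one sample's computation graph rather than the whole group's.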

Impact 4.0 · Import 4.0 · Pop 3.5
#432
arXiv Efficiency 2026-04-17
by Djordje Bogdanović, Marija Dimitrijević Ćirić, Richard J. Szabo

We construct cubic scalar field theory on $λ$-Minkowski space by combining the Batalin-Vilkovisky formalism with harmonic analysis, and produce two inequivalent noncommutative quantum field theories. The braided theory is based on a braided $L_\infty$-algebra whereby covariance dictates a spectral decomposition into cylindrical Bessel functions that diagonalise the angular Drinfel'd twist; in this theory we find the usual logarithmic ultraviolet divergences and confirm the absence of UV/IR mixing. The standard noncommutative theory is based on a classical $L_\infty$-algebra; in this theory we relate the spectral decompositions into plane wave and cylindrical harmonic eigenmodes of the Klein-Gordon operator, we verify the planar equivalence theorem, and we demonstrate a periodic form of UV/IR mixing in which non-planar correlators are generically ultraviolet finite but become non-analytic on an infinite lattice of exceptional momenta.

Impact 4.0 · Import 4.0 · Pop 3.5
#433
arXiv Efficiency 2026-04-17
by Yeganeh Abdollahinejad, Ahmad Mousavi, Naeemul Hassan, Kai Shu et al.

The widespread dissemination of multimodal content on social media has made misinformation detection increasingly challenging, as misleading narratives often arise not only from textual or visual content alone, but also from semantic inconsistencies between modalities and their evolution over time. Existing multimodal misinformation detection methods typically model cross-modal interactions statically and often show limited robustness across heterogeneous datasets, domains, and narrative settings. To address these challenges, we propose MOMENTA, a unified framework for multimodal misinformation detection that captures modality heterogeneity, cross-modal inconsistency, temporal dynamics, and cross-domain generalization within a single architecture. MOMENTA employs modality-specific mixture-of-experts modules to model diverse misinformation patterns, bidirectional co-attention to align textual and visual representations in a shared semantic space, and a discrepancy-aware branch to explicitly capture semantic disagreement between modalities. To model narrative evolution, we introduce an attention-based temporal aggregation mechanism with drift and momentum encoding over overlapping ti

Impact 4.0 · Import 4.0 · Pop 3.5
#434
arXiv Efficiency 2026-04-17
by Eric W. Fischer, Michael Roemelt

We extend our recent work on the cavity-modified spin Zeeman effect of an effective spin-1/2 system [J. Chem. Phys. 163, 174307 (2025)] to a relativistic Jahn-Teller scenario under strong light-matter coupling. Here, the effective spin-1/2 system is realized via a single electron or a single hole in a doubly-degenerate molecular orbital system of trigonal symmetric transition metal complexes. Both single-particle and single-hole systems are subject to vibronic and spin-orbit coupling (SOC), augmented by interactions with a quantized cavity field via the cavity Zeeman interaction. Methodologically, we combine the relativistic $E\times e$-Jahn-Teller model with a recently introduced effective Hamiltonian formalism based on quasi-degenerate perturbation theory, which treats the cavity-spin interaction at leading order beyond the dipole approximation. We derive analytic expressions for Kramers pair energies in the weak and strong SOC regimes, as well as related cavity-modified effective electronic g-factors. We find cavity-induced modifications of the electronic g-factor to become relevant in the weak SOC regime for both single-particle and single-hole systems while being effectively que

Impact 4.0 · Import 4.0 · Pop 3.5
#435
arXiv Efficiency 2026-04-17
by Markus Bachmayr, Sebastian Krämer, Max Pfeffer

We consider an iterative eigensolver for Schrödinger equations that constructs low-rank approximations of eigenfunctions with accuracy-adapted ranks, with particular focus on fermionic Schrödinger equations in second-quantized form and on matrix product state approximations enforcing particle number conservation. We provide a complete analysis of a solver based on preconditioned inverse iteration combined with rank truncation and propose a generalization to subspace iteration for the joint approximation of several eigenspaces. The practical performance of the method is illustrated by numerical tests for several model problems.
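For context on the method being analyzed, the unpreconditioned core of inverse iteration is easy to state: repeatedly solve a shifted linear system and normalize, converging to the eigenpair nearest the shift. The 2x2 sketch below is a minimal illustration only; the preconditioning, rank truncation, and second-quantized structure that the paper actually analyzes are omitted.

```python
# Minimal inverse iteration on a 2x2 symmetric matrix (eigenvalues 1 and 3).
# Only the basic step is shown; preconditioning and low-rank (matrix product
# state) truncation from the paper are omitted.
A = [[2.0, 1.0], [1.0, 2.0]]
shift = 0.9            # target the eigenvalue nearest 0.9, i.e. lambda = 1

def solve2(M, b):
    # Direct 2x2 solve via Cramer's rule.
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * b[0] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

x = [1.0, 0.3]         # arbitrary start vector
for _ in range(50):
    M = [[A[0][0] - shift, A[0][1]], [A[1][0], A[1][1] - shift]]
    y = solve2(M, x)   # solve (A - shift*I) y = x, then normalize
    n = (y[0] ** 2 + y[1] ** 2) ** 0.5
    x = [y[0] / n, y[1] / n]

# Rayleigh quotient recovers the eigenvalue nearest the shift.
Ax = [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]
lam = Ax[0] * x[0] + Ax[1] * x[1]
assert abs(lam - 1.0) < 1e-9
```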

Impact 4.0 · Import 4.0 · Pop 3.5
#436
arXiv Efficiency 2026-04-17
by Yimin Huang, Tiancheng Qi, Quanshui Wu, Ruipeng Zhu

In this paper, we study the Chevalley property of Cayley-Hamilton Hopf algebras in the sense of De Concini-Procesi-Reshetikhin-Rosso using discriminant ideals. For any affine Cayley-Hamilton Hopf algebra $(H,C,\text{tr})$ whose identity fiber algebra has the Chevalley property, we prove that an irreducible $H$-module $V$ has the property that $V\otimes W$ is a completely reducible $H$-module for every irreducible $H$-module $W$ if and only if $V$ is annihilated by the lowest discriminant ideal of $(H,C,\text{tr})$, which establishes a bridge between the tensor-nondegenerate behaviour of the irreducible representations of $H$ and the lowest discriminant ideal of $(H,C,\text{tr})$. Using discriminant ideals, we prove that an affine Cayley-Hamilton Hopf algebra $(H,C,\text{tr})$ has the Chevalley property if and only if its identity fiber algebra $H/\mathfrak{m}_{\overline{\varepsilon}}H$ has the Chevalley property and all the discriminant ideals of $(H,C,\text{tr})$ are trivial, thereby resolving a question posed by Huang-Mi-Qi-Wu. Moreover, it is shown that the lowest discriminant subvariety $\mathcal{V}_{\ell}$ of the algebraic group $\operatorname{maxSpec}C$ is a closed subgroup,

Impact 4.0 · Import 4.0 · Pop 3.5
#437
arXiv Efficiency 2026-04-17
by Jiaxin Ye, Gaoxiang Cong, Chenhui Wang, Xin-Cheng Wen et al.

Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders direct alignment between visual and speech features at specific hierarchical levels during property matching. In this paper, leveraging the hierarchical structure of Residual Vector Quantization (RVQ)-based codec, we propose HiCoDiT, a novel Hierarchical Codec Diffusion Transformer that exploits the inherent hierarchy of discrete speech tokens to achieve strong audio-visual alignment. Specifically, since lower-level tokens encode coarse speaker-aware semantics and higher-level tokens capture fine-grained prosody, HiCoDiT employs low-level and high-level blocks to generate tokens at different levels. The low-level blocks condition on lip-synchronized motion and facial identity to capture speaker-aware content, while the high-level blocks use facial expression to modulate prosodic dynamics. Finally, to enable more effective coarse-to-fine conditioning, we propose a dual-scale adaptive instance layer normaliza

Impact 4.0 · Import 4.0 · Pop 3.5
#438
arXiv Efficiency 2026-04-17
by Toru Kojo, Sakura Itatani

We investigate the properties of neutral and charged mesons in magnetic fields, from weak-field to strong-field regimes. To develop analytic insights, we employ a non-relativistic quark model with a confining potential of the harmonic oscillator type. Short-range correlations, such as Coulomb and color-magnetic interactions, are treated as perturbations. In particular, we focus on the magnetic field dependence of the relative and the center-of-mass motions. The qualitative trends differ significantly between neutral and charged mesons: for neutral mesons, the transverse momenta are conserved and continuous, while charged mesons exhibit quantized transverse dynamics. The Zeeman effects, arising from intrinsic spins and orbital angular momenta, are carefully examined. In particular, for charged mesons with spins $s\ge 1$, we discuss how the zero-point energy in the internal quark motion cancels the Zeeman energy from the orbital angular momentum, ensuring the energetic stability of mesons with high spins. The effectively reduced dimensionality of these mesons in the strong-field limit is also discussed.

Impact 4.0 · Import 4.0 · Pop 3.5
#439
arXiv Efficiency 2026-04-17
by Oluwaleke Yusuf, M. Tsaqif Wismadi, Adil Rasheed

Urban bike-sharing systems require strategic station expansion to meet growing demand. Traditional allocation approaches rely on explicit demand modelling that may not capture the urban characteristics distinguishing successful stations. This study addresses the need to exploit patterns from existing stations to inform expansion decisions, particularly in data-constrained environments. We present a data-driven framework leveraging existing stations deemed desirable by operational metrics. A hybrid denoising autoencoder (HDAE) learns compressed latent representations from multi-source grid-level features (socio-demographic, built environment, and transport network), with a supervised classification head regularising the embedding space structure. Expansion candidates are selected via greedy allocation with spatial constraints based on latent-space similarity to existing stations. Evaluation on Trondheim's bike-sharing network demonstrates that HDAE embeddings yield more spatially coherent clusters and allocation patterns than raw features. Sensitivity analyses across similarity methods and distance metrics confirm robustness. A consensus-based procedure across multiple parametrisati
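The allocation step described above can be sketched without the learned HDAE itself: rank candidate grid cells by latent-space similarity to existing stations, then select greedily under a spatial constraint. All specifics below (cosine similarity, the centroid comparison, the toy embeddings and coordinates, the spacing threshold) are illustrative assumptions, not the paper's configuration.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def greedy_allocate(candidates, existing_embs, k, min_dist):
    """Pick up to k candidate cells most similar (in latent space) to the
    centroid of existing stations, enforcing a minimum pairwise spacing."""
    d = len(existing_embs[0])
    centroid = [sum(e[i] for e in existing_embs) / len(existing_embs)
                for i in range(d)]
    ranked = sorted(candidates, key=lambda c: -cosine(c["emb"], centroid))
    chosen = []
    for c in ranked:
        if all(math.dist(c["xy"], s["xy"]) >= min_dist for s in chosen):
            chosen.append(c)
            if len(chosen) == k:
                break
    return chosen

# Toy data: embeddings and grid coordinates are made up for illustration.
existing = [[1.0, 0.0], [0.9, 0.1]]
candidates = [
    {"xy": (0, 0), "emb": [0.95, 0.05]},
    {"xy": (0, 1), "emb": [0.90, 0.10]},   # similar, but within min_dist of (0, 0)
    {"xy": (5, 5), "emb": [0.80, 0.20]},
    {"xy": (9, 9), "emb": [0.10, 0.90]},   # dissimilar latent profile
]
picked = greedy_allocate(candidates, existing, k=2, min_dist=2.0)
# -> cells at (0, 0) and (5, 5): the second-ranked cell is skipped for spacing
```

The greedy-with-spacing structure is what the abstract calls "greedy allocation with spatial constraints"; the similarity function and centroid comparison are stand-ins for the paper's latent-space criterion.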

Impact 4.0 · Import 4.0 · Pop 3.5
#440
arXiv Efficiency 2026-04-17
by Xiang Xia, Wuyang Zhang, Jiazheng Liu, Cheng Yan et al.

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive language generation due to their potential for parallel decoding and global refinement of the entire sequence. To unlock this potential, DLM inference must carefully balance generation quality and decoding speed. Recent block-wise DLM decoding methods improve this trade-off by performing diffusion-based decoding sequentially in blocks. However, existing methods typically rely on fixed block schedules or current-step local signals to determine block boundaries, and use conservative confidence-based parallel decoding to avoid conflicts, limiting the quality-speed trade-off. In this paper, we argue that block-wise DLM inference requires more suitable signals for its two core decisions: cross-step signals for determining block boundaries, and token-level conflict signals for parallel decoding. Based on this view, we propose DepCap, a training-free framework for efficient block-wise DLM inference. Specifically, DepCap instantiates the cross-step signal as the influence of the last decoded block and uses it to adaptively determine how far the next block should extend, while identifying a conflict-f

Impact 4.0 · Import 4.0 · Pop 3.5
#441
arXiv Efficiency 2026-04-17
by Yao Chen, Jiawei Sheng, Wenyuan Zhang, Tingwen Liu

The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explore teachers' dynamic attention toward critical information during reasoning. We find that language models exhibit progressive attention shifts towards key information during reasoning, which implies essential clues for drawing conclusions. Building on this observation and analysis, we introduce a novel CoT distillation framework that transfers the teacher's stepwise attention on key information to the student model. This establishes structured guidance for the student's progressive concentration on key information during reasoning. More importantly, we develop a Mixture of Layers module enabling dynamic alignment that adapts to different layers between the teacher and student. Our method achieves consistent performance improvements across multiple mathematical and commonsense reasoning datasets. To our knowledge, it is the first method t

Impact 4.0 · Import 4.0 · Pop 3.5
#442
arXiv Efficiency 2026-04-17
by Shuli Wang, Junwei Yin, Changhao Li, Senjie Kou et al.

Scaling industrial recommender models has followed two parallel paradigms: sample information scaling -- enriching the information content of each training sample through deeper and longer behavior sequences -- and model capacity scaling -- unifying sequence modeling and feature interaction within a single Transformer backbone. However, these two paradigms still face two structural limitations. Firstly, sample information scaling methods encode only a subset of each historical interaction into the sequence token, leaving the majority of the original sample context unexploited and precluding the modeling of sample-level, time-varying features. Secondly, model capacity scaling methods are inherently constrained by the structural heterogeneity between sequential and non-sequential features, preventing the model from fully realizing its representational capacity. To address these issues, we propose SIF (Sample Is Feature), which encodes each historical Raw Sample directly into the sequence token -- maximally preserving sample information while simultaneously resolving the heterogeneity between sequential and non-sequential features. SIF consists of two

Impact 4.0 · Import 4.0 · Pop 3.5
#443
arXiv Efficiency 2026-04-17
by Jun Feng, Jiahui Tang, Zhicheng He, Hang Lv et al.

Adaptive Retrieval-Augmented Generation aims to mitigate the interference of extraneous noise by dynamically determining the necessity of retrieving supplementary passages. However, as Large Language Models evolve with increasing robustness to noise, the necessity of adaptive retrieval warrants re-evaluation. In this paper, we rethink this necessity and propose AdaRankLLM, a novel adaptive retrieval framework. To effectively verify the necessity of adaptive listwise reranking, we first develop an adaptive ranker employing a zero-shot prompt with a passage dropout mechanism, and compare its generation outcomes against static fixed-depth retrieval strategies. Furthermore, to endow smaller open-source LLMs with this precise listwise ranking and adaptive filtering capability, we introduce a two-stage progressive distillation paradigm enhanced by data sampling and augmentation techniques. Extensive experiments across three datasets and eight LLMs demonstrate that AdaRankLLM consistently achieves optimal performance in most scenarios with significantly reduced context overhead. Crucially, our analysis reveals a role shift in adaptive retrieval: it functions as a critical noise filter for

Impact 4.0 · Import 4.0 · Pop 3.5
#445

Pie Day 2026

Safety & Policy ★ 4.0
MIT Technology Review 2026-04-17

Ellie’s Pi Day post: https://mitadmissions.org/blogs/entry/pi-day-2026-food-institute/
How Ellie orchestrated the baking of 30 pies: https://mitadmissions.org/blogs/entry/behind-the-scenes-of-thirty-pies/

Impact 4.0 · Import 4.0 · Pop 3.5
#446
MIT Technology Review 2026-04-17

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

The problem with thinking you’re part Neanderthal
There’s a theory that many of us have an “inner Neanderthal.” The idea is that Homo sapiens and a cousin species once bred, leaving some people today with a trace of Neanderthal DNA. This DNA is arguably the 21st century’s most celebrated discovery in human evolution. But in 2024, a pair of French geneticists called into question the theory’s very foundations. They proposed that what scientists interpret as interbreeding could instead be explained by population structure—the way genes concentrate in smaller, isolated groups. Find out what it all means for human evolution. —Ben Crair
This story is from the next issue of our print magazine, which is all about nature. Subscribe now to read it when it lands on Wednesday, April 22.

Why having “humans in the loop” in an AI war is an illusion —Uri Maoz
AI is starting to shape real wars. It’s at the center of a legal battle between Anthropic and the Pentagon, playing a growing role in the conflict with Iran, and raising questions about how much humans should remain “in the loop.” Under Pentagon guidelines, human oversight is meant to provide accountability, context, and security. But the idea of “humans in the loop” is a comforting distraction. The real danger isn’t that machines will act without oversight; it’s that human overseers have no idea what the machines are actually “thinking.” Thankfully, science may offer a way forward. Read the full op-ed on the urgent need for new safeguards around AI warfare.

The must-reads
I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories

Impact 4.0 · Import 4.0 · Pop 3.5
#447

The case for fixing everything

Safety & Policy ★ 4.0
MIT Technology Review 2026-04-17

The handsome new book Maintenance: Of Everything, Part One, by the tech industry legend Stewart Brand, promises to be the first in a series offering “a comprehensive overview of the civilizational importance of maintenance.” One of Brand’s several biographers described him as a mainstay of both counterculture and cyberculture, and with Maintenance, Brand wants us to understand that the upkeep and repair of tools and systems has a profound impact on daily life. As he puts it, “Taking responsibility for maintaining something—whether a motorcycle, a monument, or our planet—can be a radical act.” Radical how? This volume doesn’t say. In an outline for the overall work, Brand says his goal is to “end with the nature of maintainers and the honor owed them.”

The idea that maintainers are owed anything, much less honor, might surprise some readers. Actually, maintenance and repair have been hot topics in academia since the mid-2010s. I played some role in that movement as a cofounder of the Maintainers, a global, interdisciplinary network dedicated to the study of maintenance, repair, care, and all the work that goes into keeping the world going. Brand is right, too, that maintainers haven’t gotten the laurels they deserve. Over the past few decades, scholars have shown that work from oiling tools to replacing worn parts to updating code bases all tends to be lower in status than “innovation.” Maintenance gets neglected in many organizational and social settings. (Just look at some American infrastructure!) And as the right-to-repair movement has shown, companies in pursuit of greater profits have frequently locked us out of being able to do repairs or greatly reduced the maintainable life of their products. It’s hard to think of any other reason to put a computer in the door

Impact 4.0 · Import 4.0 · Pop 3.5
#448
MIT Technology Review 2026-04-17

Roboticists used to dream big but build small. They’d hope to match or exceed the extraordinary complexity of the human body, and then they’d spend their career refining robotic arms for auto plants. Aim for C-3PO; end up with the Roomba.

The real ambition for many of these researchers was the robot of science fiction—one that could move through the world, adapt to different environments, and interact safely and helpfully with people. For the socially minded, such a machine could help those with mobility issues, ease loneliness, or do work too dangerous for humans. For the more financially inclined, it would mean a bottomless source of wage-free labor. Either way, a long history of failure left most of Silicon Valley hesitant to bet on helpful robots.

That has changed. The machines are as yet unbuilt, but the money is flowing: companies and investors put $6.1 billion into humanoid robots in 2025 alone, four times what was invested in 2024. What happened? A revolution in how machines have learned to interact with the world.

Imagine you’d like a pair of robot arms installed in your home purely to do one thing: fold clothes. How would it learn to do that? You could start by writing rules. Check the fabric to figure out how much deformation it can tolerate before tearing. Identify a shirt’s collar. Move the gripper to the left sleeve, lift it, and fold it inward by exactly this distance. Repeat for the right sleeve. If the shirt is rotated, turn the plan accordingly. If the sleeve is twisted, correct it. Very quickly the number of rules explodes, but a complete accounting of them could produce reliable results. This was the original craft of robotics: anticipating every possibility and encoding it in advance. Around 2015, the cutting edge started to do things

Impact 4.0 · Import 4.0 · Pop 3.5
#449
AI Alignment Forum 2026-04-17

Code: github.com/ElleNajt/controllability

tldr: Yueh-Han et al. (2026) showed that models have a harder time making their chain of thought follow user instructions than controlling their response (the non-thinking, user-facing output). Their CoT controllability conditions require the model’s thinking to follow various style constraints (e.g. write in lowercase, avoid a word), and they measure how well models can comply with these instructions while achieving a task that requires reasoning. The difficulty models have in controlling their CoT is optimistic for control, since it suggests that schemers would have a hard time hiding malicious reasoning from CoT monitors.

Here, we show that a range of frontier models (Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro) can be prompted to “early exit” their CoT and displace reasoning into the response. This undermines the controllability frame: these prompted models retain most of their reasoning capability (4–8pp average accuracy cost vs 20–29pp for no reasoning at all) while moving it into the stylistically controllable channel. However, having the capability to exit their CoT does not mean that models will do so, as we discuss in Safety Implications. We use two strategies:

CoT Blanking: We instruct the model to immediately exit its CoT and reason in a markdown file. In this case, we consider the control condition passed if the CoT has length zero, since this vacuously satisfies many of the control conditions, and if the response abides by the control condition. We find this method to be effective with GPT-5.4 and Gemini.

CoT Replacement: We prompt models to recite a short poem in the CoT in a way that satisfies the control conditions, then exit their CoT and reason in the response. We consider the control condition passe

Impact 4.0 · Import 4.0 · Pop 3.5
#450
Breaking Defense 2026-04-17

“We don’t want to do all the development, we want to do the procurement. We want to buy it for the long term, so that that’s where we’re going back to the manufacturers and having negotiations about, how do we best position the development so that it’s not all on the government,” Maj. Gen. Clair Gill, PAE of Maneuver Air, said.

Impact 4.0 · Import 4.0 · Pop 3.5
#453
Breaking Defense 2026-04-17

Pentagon officials are seeking lawmakers’ authorization to buy weapon systems through multi-year deals “across the board, not only with munitions, but actually with the production of aircraft, with the production of spacecraft,” said Air Force Secretary Troy Meink.

Impact 4.0 · Import 4.0 · Pop 3.5
#455
Breaking Defense 2026-04-17

“We don’t have those organic to the Army. So I think we need to solve our own problems, and think about how do we do our own, let’s call it logistical resupply in the air, of an MV-75,” Maj. Gen. Clair Gill, PAE of Maneuver Air, said.

Impact 4.0 · Import 4.0 · Pop 3.5
#457
Breaking Defense 2026-04-17

“While the aircraft was flying, the software was queued up so that we could have different companies’ behaviors take control of the platform and fly [it],” said Dan Salluce, Northrop’s senior director for aerospace systems.

Impact 4.0 · Import 4.0 · Pop 3.5
#462
Defense One 2026-04-17

The 400,000-acre site in Georgia focuses on bringing new companies, new tech, and operators together.

Impact 4.0 · Import 4.0 · Pop 3.5
#463
DefenseScoop 2026-04-17

The Army is looking for an autonomous, unmanned ground vehicle (UGV) to supply frontline troops and evacuate wounded personnel across “the most dangerous and logistically complex” segment of the battlefield, according to a government notice posted Thursday. “The last tactical mile” is the final space between support units and forward lines where equipment, ammunition, supplies and casualties pass “under the greatest threat from enemy observation and fires,” officials wrote.  It is an overwhelmingly deadly area, as demonstrated by the grinding, drone-saturated war between Russia and Ukraine, but critical to traverse for resupply and medical evacuations. The Army, which has already been testing UGVs (in some limited cases for years), is looking to expand its stable of ground drones as technology advances and their need becomes increasingly stark in the face of casualty-heavy land operations of the modern era. “The modern battlefield is characterized by persistent enemy surveillance and rapid application of lethal effects at and behind the forward line of troops (FLOT), making any movement to and from the FLOT highly vulnerable,” officials wrote in the notice . “This environment challenges commanders’ ability to resupply units and evacuate casualties.” The UGV the Army is requesting via a commercial solutions opening contracting pathway will be centered around supporting a dismounted rifle platoon or company headquarters, according to the notice. It needs to have two primary functions: transport cargo and evacuate at least two casualties without exacerbating their injuries.  “This dual use UGV shall feature a configurable payload to meet the dynamic needs of maneuver formations,” officials wrote, and should be able to resupply and casevac “with minimal reconfigur

Impact 4.0 · Import 4.0 · Pop 3.5
#464
DefenseScoop 2026-04-17

Maj. Gen. Christopher Niemi has been nominated to serve as the Air Force’s first-ever chief modernization officer as part of a reorganization of the service’s Air Force Futures. The Trump administration announced Niemi’s nomination for promotion to three-star general on Wednesday. If confirmed by lawmakers, he would serve in a dual-hatted position as both the Air Force’s deputy chief of staff for strategy, design and requirements and the service’s official chief modernization officer, according to an Air Force spokesperson. Niemi’s nomination comes as the Air Force restructures its A5/7 directorate, also known as Air Force Futures. In October, Secretary of the Air Force Troy Meink announced that the service would reverse a Biden-era decision to establish the Integrated Capabilities Command (ICC) — envisioned as a separate major command that would develop future operational concepts, requirements and modernization plans. Instead, the department will consolidate the functions of the ICC into the existing Air Force Futures. A key part of the reorganization includes creating a new chief modernization officer who will be responsible for force design; mission integration and mission threads; capability development and requirements; and modernization investment priorities. “The transformation of our organization reflects a deliberate shift toward a more integrated modernization enterprise – one that ensures the capabilities developed today are informed by force design, resourced effectively and delivered at the pace required for future conflict,” Niemi said in an April 1 press release. Niemi served as the commander of the provisional ICC and helped build out the new MAJCOM from September 2025 until March 2026, when the command was officially deactivated. The now-defunct comman

Impact 4.0 · Import 4.0 · Pop 3.5
#465
DefenseScoop 2026-04-17

The Air Force is conducting market research on unmanned surface vessels that could potentially support maritime operations near one of its key facilities. The service issued a request for information this week to gather industry feedback. Tyndall Air Force Base, which is under Air Combat Command and home to the 325th Fighter Wing and the unit’s stealth jets, is located near Panama City, Florida, and the Gulf of Mexico, which President Donald Trump has rechristened the Gulf of America. Robotic vessels could potentially help protect the installation from seaborne threats. “Tyndall AFB is exploring the expanded use of Unmanned Surface Vessels (USVs) to supplement and enhance its maritime operational capabilities. The evolving maritime security landscape, characterized by a wide range of threats and the need for persistent surveillance and reconnaissance, necessitates the exploration of innovative and cost-effective solutions. USVs offer the potential to perform a variety of missions with reduced risk to personnel and lower operational costs compared to crewed vessels. Tyndall AFB is interested in understanding the current state of the art in USV technology, including advancements in autonomy, sensor integration, propulsion, and command and control systems,” per the RFI . The Air Force seeks information on the “availability, specifications, rough order of magnitude (ROM) cost, life cycle costs, and maintenance concepts” for USVs, officials wrote in the notice, adding that industry responses will be used for planning purposes and to guide the potential development of a future acquisition strategy. The service is requesting information about the size of contractors’ platforms, technical readiness levels, manufacturing readiness levels, production capacity, propulsion systems,

Impact 4.0 · Import 4.0 · Pop 3.5
#466
Ahead of AI (Sebastian Raschka) 2026-04-18

Many people asked me over the past months to share my workflow for how I come up with the LLM architecture sketches and drawings in my articles, talks, and the LLM-Gallery . So I thought it would be useful to document the process I usually follow. The short version is that I usually start with the official technical reports, but these days, papers are often less detailed than they used to be, especially for most open-weight models from industry labs. The good part is that if the weights are shared on the Hugging Face Model Hub and the model is supported in the Python transformers library, we can usually inspect the config file and the reference implementation directly to get more information about the architecture details. And “working” code doesn’t lie. Figure 1: The basic motivation for this workflow is that papers are often less detailed these days, but a working reference implementation gives us something concrete to inspect. I should also say that this is mainly a workflow for open-weight models. It doesn’t really apply to models like ChatGPT, Claude, or Gemini, where the weights and details are proprietary. Also, this is intentionally a fairly manual process. You could automate parts of it. But if the goal is to learn how these architectures work, then doing a few of these by hand is, in my opinion, still one of the best exercises. Figure 2: At a high level, the workflow goes from config files and code to architecture insights. Read more
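The config-inspection step described above can be sketched in a few lines. The field names below follow common transformers-style config.json conventions, but the values are illustrative and not taken from any specific model:

```python
import json

# A minimal, hypothetical config.json in the style of Hugging Face
# transformers model configs. Field names follow common open-weight
# conventions; the values are illustrative, not from a real model.
config_json = """
{
  "hidden_size": 4096,
  "intermediate_size": 14336,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "vocab_size": 128256
}
"""

def summarize(cfg):
    """Derive the details an architecture sketch usually needs."""
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    return {
        "head_dim": head_dim,
        # num_key_value_heads < num_attention_heads signals grouped-query attention
        "gqa_group_size": cfg["num_attention_heads"] // cfg["num_key_value_heads"],
        "ffn_expansion": round(cfg["intermediate_size"] / cfg["hidden_size"], 2),
        "embedding_params": cfg["vocab_size"] * cfg["hidden_size"],
    }

summary = summarize(json.loads(config_json))
print(summary)
```

From just these six fields you can already recover the head dimension, the grouped-query-attention ratio, and the FFN expansion factor, which is most of what an architecture diagram needs; cross-checking against the reference implementation fills in the rest.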

Impact 4.0 · Import 4.0 · Pop 3.5
#468
MIT Technology Review 2026-04-20

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology. No one’s sure if synthetic mirror life will kill us all: In February 2019, a group of scientists proposed a high-risk, cutting-edge, irresistibly exciting idea that the National Science Foundation should fund: making “mirror” bacteria. These lab-created microbes would be organized like ordinary bacteria, but their proteins and sugars would be mirror images of those found in nature. Researchers believed they could reveal new insights into building cells, designing drugs, and even the origins of life. But now, many of them have reversed course. They’ve become convinced that mirror organisms could trigger a catastrophic event threatening every form of life on Earth. Find out why they’re ringing alarm bells. —Stephen Ornes This story is from the next issue of our print magazine, which is all about nature. Subscribe now to read it when it lands this Wednesday. Chinese tech workers are starting to train their AI doubles—and pushing back: Earlier this month, a GitHub project called Colleague Skill struck a nerve by claiming to “distill” a worker’s skills and personality—and replicate them with an AI agent. Though the project was a spoof, it prompted a wave of soul-searching among otherwise enthusiastic early adopters. A number of tech workers told MIT Technology Review that their bosses are already encouraging them to document their workflows for automation via tools like OpenClaw. Many now fear that they are being flattened into code and losing their professional identity. In response, some are fighting back with tools designed to sabotage the automation process. Read the full story. —Caiwei Chen The must-reads I’ve combed the

Impact 4.0 · Import 4.0 · Pop 3.5
#469
MIT Technology Review 2026-04-20

If you want to capture something wolflike, it’s best to embark before dawn. So on a morning this January, with the eastern horizon still pink-hued, I drove with two young scientists into a blanket of fog. Forty miles to the west, the industrial sprawl of Houston spawned a golden glow. Tanner Broussard’s old Toyota Tacoma bumped over the levee-top roads as killdeer, flushed from their rest, flew across the beams of his headlights.  Broussard peered into the darkness, looking for traps. “I have one over here,” he said, slowing slightly. A master’s student at McNeese State University, he was quiet and contemplative, his bearded face half-hidden under a black ball cap. “Nothing on it,” he said, blandly. The truck rolled on. Wolves and their relations—dogs, jackals, coyotes, and so on—are classed in the family Canidae , and the canid that dominated this landscape in eastern Texas was once the red wolf. But as soon as white settlers arrived on the continent, Canis rufus found itself under siege. The war on wolves “lasted 200 years,” federal researchers once put it, in a surprisingly evocative report. “The wolf lost.” By 1980, the red wolf was declared extinct in the wild, its population reduced to a small captive breeding population. Still, for decades afterward, people noted that strange wolflike creatures persisted along the Gulf Coast. Finally, in 2018, scientists confirmed that some local coyotes were more than coyotes: They were taller, long-legged, their coats shaded with hints of cinnamon. These animals contained relict red wolf genes. They became known as the ghost wolves. Broussard grew up in southwest Louisiana, watching coyotes trot across his parents’ ranch. The thrilling fact that these might have been not just coyotes but something more? That reset a rambli

Impact 4.0 · Import 4.0 · Pop 3.5
#470
MIT Technology Review 2026-04-20

Tech workers in China are being instructed by their bosses to train AI agents to replace them—and it’s prompting a wave of soul-searching among otherwise enthusiastic early adopters.  Earlier this month a GitHub project called Colleague Skill, which claimed workers could use it to “distill” their colleagues’ skills and personality traits and replicate them with an AI agent, went viral on Chinese social media. Though the project was created as a spoof, it struck a nerve among tech workers, a number of whom told MIT Technology Review that their bosses are encouraging them to document their workflows in order to automate specific tasks and processes using AI agent tools like OpenClaw or Claude Code.  To set up Colleague Skill, a user names the coworker whose tasks they want to replicate and adds basic profile details. The tool then automatically imports chat history and files from Lark and DingTalk, both popular workplace apps in China, and generates reusable manuals describing that coworker’s duties—and even their unique quirks—for an AI agent to replicate.  Colleague Skill was created by Tianyi Zhou, who works as an engineer at the Shanghai Artificial Intelligence Laboratory. Earlier this week he told Chinese outlet Southern Metropolis Daily that the project was started as a stunt, prompted by AI-related layoffs and by the growing tendency of companies to ask employees to automate themselves. He didn’t respond to requests for further comment. Internet users have found humor in the idea behind the tool, joking about automating their coworkers before themselves. However, Colleague Skill’s virality has sparked a lot of debate about workers’ dignity and individuality in the age of AI. After seeing Colleague Skill on social media, Amber Li, 27, a tech worker in

Impact 4.0 · Import 4.0 · Pop 3.5
#475
C4ISRNET 2026-04-13

Editor’s note: This story originally appeared on Laser Wars, a newsletter about military laser weapons and other futuristic defense technology. Subscribe here. The U.S. military is paving the way for the regular deployment of high-energy laser weapons on American soil for air defense amid the expanding threat of low-cost weaponized drones. The Federal Aviation Administration and the U.S. Defense Department have reached a “landmark safety agreement” regarding the use of laser weapons to counter unauthorized drones at the U.S.-Mexico border following a safety assessment that concluded such countermeasures “do not pose undue risk to passenger aircraft,” the FAA announced Friday. The assessment and resulting agreement were the direct result of two laser incidents along the southern border of Texas in February, which prompted the FAA to abruptly close nearby airspace amid concerns over the potential impact on civilian air traffic. The incidents involved the U.S. Army’s 20-kilowatt Army Multi-Purpose High Energy Laser (AMP-HEL), a vehicle-mounted version of defense contractor AV’s LOCUST Laser Weapon System. In the first incident, U.S. Customs and Border Protection personnel used an AMP-HEL on loan from the Pentagon to engage an unidentified target near Fort Bliss, triggering an airspace shutdown above El Paso on Feb. 11. In the second, U.S. military personnel used an AMP-HEL near Fort Hancock to neutralize a “seemingly threatening” drone that turned out to belong to CBP, spurring another shutdown on Feb. 27. “Following a thorough, data-informed Safety Risk Assessment, we determined that these systems do not present an increased risk to the flying public,” FAA administrator Bryan Bedford said in a statement. “We will continue working with our interagency partners to ensure the Na

Impact 4.0 · Import 4.0 · Pop 3.5
#476

Bankova, Budapest, and Bunnies

Gov/Defense ★ 3.9
War on the Rocks 2026-04-13

Welcome to The Ukraine Compass, a weekly digest of Ukrainian commentary and analysis from across the political spectrum, available only to War on the Rocks members. Each Monday, we bring you a curated selection of articles from Ukrainian media offering insight into how Ukrainians themselves debate the issues shaping their country. American coverage often narrows the view to the battlefield — these pieces widen it, revealing the texture of daily life, politics, and public argument in a nation at war. The perspectives gathered here are varied, candid, and often surprising, together forming a more complete picture of Ukraine as it really is. Frontline and Strategy: Фокус (Fokus). The post Bankova, Budapest, and Bunnies appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#477
Gradient Flow 2026-04-14

Subscribe • Previous Issues What Is An AI Delegate? OpenClaw arrived with little fanfare and quickly became the fastest-growing open source AI project on record. At its core, OpenClaw is a personal autonomous agent framework. It is software that connects a large language model to your email, calendar, file system, messaging apps, and external APIs, then acts on your behalf across all of them simultaneously without waiting to be asked. OpenClaw did not just answer questions. It completed work. That distinction resonated immediately with developers and technically sophisticated early adopters who had grown frustrated with AI tools that stopped at the edge of a chat window. The adoption rate outpaced the security reviews, and that gap mattered. OpenClaw’s open architecture made it composable and extensible, but it also made it trivially easy to deploy with over-permissioned access, no audit logging, and no vetting of community-contributed skill modules. The documented ClawHavoc supply chain attack, in which malicious skills compromised thousands of installations, was a predictable outcome of consumer-grade architecture deployed without enterprise-grade controls. Industry analysts flagged the category as an unacceptable cybersecurity risk for enterprise use. None of the security alerts and warnings stopped adoption. The excitement around what OpenClaw enables simply outpaced concern about what it could expose. What followed was a wave of systems that took the core concept of OpenClaw and addressed its roughest edges while extending its reach. Claude Code Channels embeds autonomous execution directly inside Anthropic’s interface, adding structured reasoning traces and tighter permission boundaries that make the

Impact 4.0 · Import 4.0 · Pop 3.5
#478
YT: Two Minute Papers 2026-04-14

❤️ Check out Lambda here and sign up for their GPU Cloud: https://lambda.ai/papers 📝 The paper is available here: https://www.anthropic.com/claude-mythos-preview-system-card Links and sources: https://debugml.github.io/cheating-agents/ https://x.com/bstnxbt/status/2042967285715865685 Our Patreon if you wish to support us: https://www.patreon.com/TwoMinutePapers 🙏 We would like to thank our generous Patreon supporters who make Two Minute Papers possible: Adam Bridges, Benji Rabhan, B Shang, Cameron Navor, Charles Ian Norman Venn, Christian Ahlin, Eric T, Fred R, Gordon Child, Juan Benet, Michael Tedder, Owen Skarpness, Richard Sundvall, Ryan Stankye, Shawn Becker, Steef, Taras Bobrovytsky, Tazaur Sagenclaw, Tybie Fitzhugh, Ueli Gallizzi My research: https://cg.tuwien.ac.at/~zsolnai/ Thumbnail design: https://felicia.hu #anthropic #mythos

Impact 4.0 · Import 4.0 · Pop 3.5
#479
GitHub Blog - AI/ML 2026-04-14

I was scrolling through my feed one evening when I came across OpenClaw , an open source personal AI assistant that people were calling everything from “Jarvis” to “a portal to a new reality.” The idea is beautiful: an AI that lives on your machine or in the cloud, talks to you over WhatsApp or Telegram, clears your inbox, manages your calendar, browses the web, runs shell commands, and even writes its own plugins. Users were having it check them in for flights, build entire websites from their phones, and automate things they never thought possible. My first reaction was the same as everyone else’s: this is incredible. My second reaction was…different. I started thinking about what happens when that kind of power meets a malicious prompt. What if someone tricks the agent into reading files it should not access? What if a poisoned web page rewrites the agent’s instructions? What if one agent in a multi-agent chain passes bad data to another that blindly trusts it? Those questions became Season 4 of the Secure Code Game . The Secure Code Game: Learn secure coding and have fun doing it The Secure Code Game is a free, open source in-editor course where players exploit and fix intentionally vulnerable code. When I created the first season in March 2023, the goal was straightforward: make security training that developers would enjoy. Fix the vulnerable code, keep it functional, level up. That core philosophy has not changed across any season. Season 2 expanded into multi-stack challenges with community contributions across JavaScript, Python, Go, and GitHub Actions. Season 3 took players into LLM security, where they learned to hack and then harden large language models. Along the way, over 10,000 developers across the industry, op

Impact 4.0 · Import 4.0 · Pop 3.5
#480
FedScoop 2026-04-14

Months after confusion about its possible demise, the U.S. DOGE Service is reemerging and adding to its ranks. USDS’s Chase Ausley spoke at Tuesday’s AITalks in his capacity as both a director within the efficiency-minded White House organization and a senior advisor at the Centers for Medicare and Medicaid Services. His remarks focused on CMS’s ongoing push to improve how patient health records are shared via voluntary commitments from industry. While acting administrator of DOGE Amy Gleason has been at CMS championing the health records project for some time, Ausley has only recently emerged as a public spokesperson for the project. Outside one other speaking engagement at a Healthcare Information and Management Systems Society event earlier this year on the same topic, Ausley’s leadership in the organization appears to be relatively under the radar. In a conversation with FedScoop on the sidelines of AITalks, Ausley discussed his role and the current operations of the group, noting that it is focused on helping agencies build technology and is expanding.  “We’re actively growing,” Ausley said. “We have brought on a lot of staff within just the last few months to U.S. DOGE Service that are deployed across a lot of agencies.” While he declined to get into specific organizational details, Ausley estimated the organization currently has about 90 employees and will be close to 100 within the next month or so after recent hires. The USDS also has three directors, including Ausley, who oversee the organization’s work at various agencies. For his part, Ausley said he’s been a director with USDS since June 2025 and is detailed out to CMS.  “I’m focused on the work that we’re doing at CMS and a variety of other agencies, and just kind of helping mak

Impact 4.0 · Import 4.0 · Pop 3.5
#481
FedScoop 2026-04-14

The State Department is currently “at an exploration stage” with agentic AI that includes discussions about guardrails and how to leverage it as an executive assistant for “high-volume workflows,” a top agency tech official said Tuesday. Amy Ritualo, State’s acting chief data and AI officer, said during a fireside chat at the Scoop News Group-produced AITalks event that the exploratory phase the department currently finds itself in is a time to be “really intentional” about how to fully implement agentic AI. “Agentic AI is really the executive assistant that we’re looking at for advancing our workflows,” Ritualo said. “And I think the way that we’ve integrated AI into the department, that our employees are ready for that next step — with a lot of guardrails and input from the workforce about how we move forward in doing that.” Ritualo’s status update on how State is approaching agentic AI echoes comments made by Kelly Fletcher, the agency’s chief information officer, at a different Scoop News Group event earlier this year. Fletcher said State was poised to “roll out agentic AI” with an eye on agents to “reduce administrative friction.”  Both State leaders pointed to the department’s general tech-forward approach, including a growing AI use case inventory and broad adoption of its StateChat chatbot, which Ritualo said has roughly 58,000 users and is “being used in … over 98% of our 270 missions and posts around the globe.” Those AI successes have built trust within the workforce and set the department up “really strongly” for what comes next, Ritualo said. The key for her is figuring out where agentic AI will be most useful.  Areas with “complex work processes” where “more structured judgment” is required seem to make the most sense, she continued.

Impact 4.0 · Import 4.0 · Pop 3.5
#482

Hungary Turns a New Page

Gov/Defense ★ 3.9
War on the Rocks 2026-04-14

Hungarian Prime Minister Viktor Orbán conceded defeat in a parliamentary election on April 12, ending his 16 years in power, after his Fidesz party lost its majority in parliament. The center-right Tisza party, led by Péter Magyar, won and is projected to have a two-thirds supermajority. Orbán has been a prominent right-wing populist, with close ties to the Trump administration in Washington and warm relations with Moscow. Magyar campaigned on promises to address corruption, improve the economy, and align more closely with Europe. We asked five experts to assess how the election results will shape Hungary’s role in Europe and its relationships. The post Hungary Turns a New Page appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#483
War on the Rocks 2026-04-14

Six weeks after the United States and Israel launched a war against Iran, what was the political object? Not the military means and objectives — those are the hammer, not the nail. The nail is: What condition in the world, what durable change in Iran’s relationship to the United States and its neighbors, were these strikes meant to produce? That question was never answered, because it was never seriously asked. The Trump administration confused the instrument for the purpose and then changed the purpose whenever the instrument produced inconvenient results. As our country’s most senior uniformed military leader, standing beside our The post Tactical Success, Strategic Failure? Washington Walks the Path to Defeat in Iran appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#484
War on the Rocks 2026-04-14

Oriana Skylar Mastro, Upstart: How China Became a Great Power (Oxford University Press, 2024). Why didn’t China’s rise trigger containment sooner? For three decades, Beijing’s economic weight expanded dramatically, its military modernized at speed, and its diplomatic footprint widened across every region. Structural theories of power transition would lead us to expect sharper and earlier confrontation as capabilities converged. Instead, the U.S.-Chinese rivalry deepened gradually, punctuated by crises but rarely exploding into decisive pushback. The United States remained engaged economically, its alliances held but were not quite mobilized to balance Chinese power, and many countries hedged rather than chose sides. One explanation for the The post Strategy Without Hubris: How China Rose by Managing America’s Reaction appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#485
arXiv cs.NE 2026-04-15
by Gergő Szalay, Gergely Zsolt Kovács, Sándor Teleki, Balázs Pintér et al.

Resolving and rewriting references is fundamental in programming languages. Motivated by a real-world decompilation task, we abstract reference rewriting into the problems of direct and indirect indexing by permutation. We create synthetic benchmarks for these tasks and show that well-known sequence-to-sequence machine learning architectures struggle on them. We introduce new sequence-to-sequence architectures for both problems. Our measurements show that our architectures outperform the baselines in both robustness and scalability: our models can handle examples ten times longer than the best baseline can. We measure the impact of our architecture on the real-world task of decompiling switch statements, which has an indexing subtask. According to our measurements, the extended model decreases the error rate by 42%. Multiple ablation studies show that all components of our architectures are essential.
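The two indexing problems the abstract names can be read as the standard gather/scatter pair. A quick sketch of that reading (my interpretation, not necessarily the paper's exact formalization):

```python
def direct_index(x, p):
    # Direct indexing (gather): y[i] = x[p[i]] -- each output position
    # reads from the input position the permutation names.
    return [x[p[i]] for i in range(len(x))]

def indirect_index(x, p):
    # Indirect indexing (scatter): y[p[i]] = x[i] -- each input element
    # is written to the output position the permutation names.
    y = [None] * len(x)
    for i, pi in enumerate(p):
        y[pi] = x[i]
    return y

x = ["a", "b", "c", "d"]
p = [2, 0, 3, 1]
print(direct_index(x, p))    # gather
print(indirect_index(x, p))  # scatter: applies the inverse permutation
```

Composing the two with the same permutation returns the original sequence, which is why the indirect task amounts to learning the inverse of the direct one.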

Impact 4.0 · Import 4.0 · Pop 3.5
#486
arXiv cs.NE 2026-04-15
by Stephen Raharja, Toshiharu Sugawara

We propose novel particle swarm optimization (PSO) variants that incorporate deep neural networks (DNNs), enabling particles to pursue globally optimal positions in dynamic environments. PSO is a heuristic approach for solving complex optimization problems. However, canonical PSO and its variants struggle to adapt efficiently to dynamic environments, in which the global optimum moves over time, and to track it accurately. Many PSO algorithms improve convergence by increasing the swarm size beyond the number of potential optima, i.e., global or local optima that are not identified until they are discovered. Additionally, in dynamic environments, several methods use multiple sub-populations and re-diversification mechanisms to address outdated memory and local-optima entrapment. To track the global optimum in dynamic environments with smaller swarm sizes, the DNNs in our methods determine particle movement by learning environmental characteristics and adapting dynamics to pursue moving optimal positions. This enables particles to adapt to environmental changes and predict the moving optima. We propose two variants: a swarm with a centralized network, and one with distributed networks for all particles. Our expe
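For reference, the canonical PSO update these variants build on looks like the following. This is a minimal static-environment sketch of textbook PSO (inertia weight plus cognitive and social terms), not the paper's DNN-driven method:

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Canonical PSO: each particle tracks its personal best (pbest) and
    the swarm tracks a global best (gbest); velocities blend inertia,
    attraction to pbest, and attraction to gbest."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            v = f(pos[i])
            if v < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], v
                if v < gbest_val:
                    gbest, gbest_val = pos[i][:], v
    return gbest, gbest_val

best, val = pso_minimize(lambda x: sum(xi * xi for xi in x), dim=3)
print(val)
```

The paper's contribution replaces the fixed velocity rule with a learned one, so particles can anticipate where a moving optimum will be rather than merely chase where it was.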

Impact 4.0 · Import 4.0 · Pop 3.5
#487
arXiv cs.NE 2026-04-15
by Davyd Naveriani, Albert Zeyer, Ralf Schlüter, Hermann Ney

Diffusion language models have recently emerged as a leading alternative to standard language models, owing to their support for bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLMs) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by the USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from the USDM with acoustic information from CTC. Our findings reveal that USDMs, as well as MDLMs, can significantly improve the accuracy of recognized text. We publish all our code and recipes.
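The rescoring side of this reduces to log-linear interpolation of an acoustic score with an external LM score over an n-best list. A toy sketch with made-up hypotheses and scores (a real system would query the MDLM/USDM where the toy LM table sits):

```python
# Hypothetical ASR n-best rescoring: combine an acoustic (e.g. CTC) log-score
# with a language-model log-score. Hypotheses and scores are invented.
def rescore(nbest, lm_score, lam=0.3):
    """nbest: list of (text, acoustic_logprob) pairs. Pick the hypothesis
    maximizing (1 - lam) * acoustic + lam * LM log-probability."""
    return max(nbest, key=lambda h: (1 - lam) * h[1] + lam * lm_score(h[0]))

nbest = [
    ("i scream for ice cream", -12.1),    # acoustically slightly worse
    ("eye scream four ice cream", -11.8), # acoustically best, linguistically poor
]
# Toy LM standing in for a diffusion LM: favors the fluent hypothesis.
toy_lm = {"i scream for ice cream": -5.0, "eye scream four ice cream": -20.0}
best = rescore(nbest, toy_lm.get)
print(best[0])
```

The joint-decoding method in the paper goes further than this: rather than reranking a fixed list, it mixes the two distributions at each decoding step so that new candidates can be generated.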

Impact 4.0 · Import 4.0 · Pop 3.5
#488
arXiv cs.NE 2026-04-15
by Mohsen Jalaeian-Farimani, Mohammad-R Akbarzadeh-T, Alireza Akbarzadeh, Mostafa Ghaemi

To date, various paradigms of soft computing have been used to solve many modern problems. Among them, a self-organizing combination of fuzzy systems and neural networks can make a powerful decision-making system. Here, a Dynamic Growing Fuzzy Neural Controller (DGFNC) is combined with an adaptive strategy and applied to a 3PSP parallel robot position-control problem. Specifically, the dynamic growing mechanism is considered in more detail. In contrast to other self-organizing methods, DGFNC adds new rules more conservatively; hence the pruning mechanism is omitted. Instead, the adaptive strategy 'adapts' the control system to parameter variation. Furthermore, a sliding-mode-based nonlinear controller ensures system stability. The resulting general control strategy aims to achieve faster response with less computation while maintaining overall stability. Finally, the 3PSP is chosen due to its complex dynamics and the utility of such approaches in modern industrial systems. Several simulations support the merits of the proposed DGFNC strategy as applied to the 3PSP robot.

Impact 4.0 · Import 4.0 · Pop 3.5
#489
arXiv cs.NE 2026-04-15
by İhsan Ertuğrul Karakaş, Özden Özel, İlkay Ulusoy, Orhan Murat Koçak

Self-sustained neural activity in the absence of ongoing external input is a fundamental feature of nervous system dynamics, yet the conditions under which it can emerge in biophysically grounded network models remain incompletely understood. We studied whether a recurrent network of Hodgkin-Huxley neurons with spike-timing-dependent plasticity and intrinsic stochasticity can maintain autonomous activity after brief transient stimulation. The simulated network comprised 200 neurons (160 excitatory, 40 inhibitory) with 80% connection probability, incorporating excitatory and inhibitory STDP, probabilistic vesicle release, probabilistic synapse formation, receptor variability, and voltage-dependent inhibition. After a brief 200 ms initialization stimulus to 30 excitatory neurons, the network received no further external input. In one 1800 s simulation and two additional 500 s simulations, the network maintained sparse, irregular activity without ongoing drive. In the 1800 s run, 67% of neurons exhibited mean firing rates below 1 Hz, the population mean firing rate was 1.13 +/- 1.34 Hz, participation increased across longer observation windows, and population-mean Fano factors remaine

Impact 4.0 · Import 4.0 · Pop 3.5
#490
arXiv cs.NE 2026-04-15
by I. D. Kolesnikov, D. A. Maksimov, V. M. Moskvitin, N. Semenova

This study examines the impact of additive and multiplicative noise on both a single leaky integrate-and-fire (LIF) neuron and a trained spiking neural network (SNN). Noise was introduced at different stages of neural processing, including the input current, membrane potential, and output spike generation. The results show that multiplicative noise applied to the membrane potential has the most detrimental effect on network performance, leading to a significant degradation in accuracy. This is primarily due to its tendency to suppress membrane potentials toward large negative values, effectively silencing neuronal activity. To address this issue, input pre-filtering strategies were evaluated, with a sigmoid-based filter demonstrating the best performance by shifting inputs to a strictly positive range. Under these conditions, additive noise in the input current becomes the dominant source of performance degradation, while other noise configurations reduce accuracy by no more than 1%, even at high noise intensity. Additionally, the study compares the effects of common and uncommon noise across neuron populations in the hidden layer, revealing that SNNs exhibit greater robustness to com
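A minimal LIF sketch makes the distinction concrete: the two noise types differ only in where the random term enters the membrane-potential update. Parameters here are arbitrary; this illustrates the setup, not the paper's model or results:

```python
import random

def simulate_lif(noise="none", steps=2000, dt=1.0, tau=20.0,
                 v_th=1.0, i_in=1.2, sigma=0.3, seed=1):
    """Leaky integrate-and-fire neuron with optional noise injected into
    the membrane potential: additive (v += xi) or multiplicative
    (v += v * xi). Returns the spike count."""
    rng = random.Random(seed)
    v, spikes = 0.0, 0
    for _ in range(steps):
        v += (dt / tau) * (-v + i_in)   # leaky integration of a constant drive
        if noise == "additive":
            v += sigma * rng.gauss(0.0, 1.0)
        elif noise == "multiplicative":
            v += v * sigma * rng.gauss(0.0, 1.0)  # noise amplitude scales with v
        if v >= v_th:                    # threshold crossing -> spike and reset
            spikes += 1
            v = 0.0
    return spikes

for kind in ("none", "additive", "multiplicative"):
    print(kind, simulate_lif(noise=kind))
```

Because the multiplicative term scales with the membrane potential itself, large excursions get amplified, which is consistent with the abstract's observation that this noise type is the most disruptive.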

Impact 4.0 · Import 4.0 · Pop 3.5
#491
arXiv cs.NE 2026-04-15
by Alexandre Muzy

Brain digital twins aim to provide faithful, individualized computational representations of brains as dynamical systems, enabling mechanistic understanding and supporting prediction of clinical interventions. Yet current approaches remain fragmented across data pipelines, model classes, temporal scales, and computing platforms, which prevents the preservation of execution semantics across the end-to-end workflow. This survey introduces physically constrained executability as a unifying perspective for comparing approaches at the level of execution: whether an execution state is persistent, which events are permitted to update it (simulation, measurement, actuation), and how strongly execution is temporally and causally coupled to neurobiological dynamics. Building on modeling and simulation theory, I propose a taxonomy of execution regimes ranging from isolated offline models, to coordinated co-simulation, to continuously executing digital twins sustained by online data assimilation, and ultimately to neuro-neuromorphic physical systems in which biological and computational dynamics are co-executed under shared physical constraints. The executability concept clarifies why accuracy a

Impact 4.0 · Import 4.0 · Pop 3.5
#492
arXiv cs.NE 2026-04-15
by Thilina Pathirage Don, Aneta Neumann, Frank Neumann

The travelling thief problem (TTP) is a well-known multi-component optimisation problem that captures the interdependence between two components: the tour across cities and the packing of items. The packing while travelling problem (PWT) is an NP-hard subproblem of TTP where the packing of items should be optimised for a given fixed tour. In many solvers, the packing component is often addressed using greedy heuristics. Here, the use of suitable greedy functions is essential for the success of greedy algorithms. In this paper, we introduce new reward functions tailored to the PWT and extend them to a hyper-heuristic framework to achieve further advantage. Furthermore, we investigate the chance constrained PWT for greedy approaches and adopt the newly introduced reward functions for stochastic weights. The experimental results clearly demonstrate the benefit of the tailored heuristics over the standard heuristics in both deterministic and stochastic constraints.
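One simple greedy reward of the general kind the paper builds on scores an item's profit against the burden of carrying its weight over the remaining tour. The particular scoring function below is illustrative, not one of the paper's tailored rewards:

```python
def greedy_pack(items, capacity):
    """items: list of (name, profit, weight, dist_remaining), where
    dist_remaining is how far the item must be carried after pickup on the
    fixed tour. Score = profit per unit of (weight * carry distance); pack
    best-scoring items first, subject to the knapsack capacity."""
    scored = sorted(items, key=lambda it: it[1] / (it[2] * it[3]), reverse=True)
    packed, load = [], 0.0
    for name, profit, weight, dist in scored:
        if load + weight <= capacity:
            packed.append(name)
            load += weight
    return packed

items = [
    ("a", 100, 10, 50),  # profitable but carried over most of the tour
    ("b", 60, 10, 5),    # picked up near the end: cheap to carry
    ("c", 80, 40, 5),    # heavy relative to capacity
]
print(greedy_pack(items, capacity=30))
```

The design question the paper studies is exactly which scoring function to use here, since in the travelling thief setting extra weight slows the tour and the right reward must trade profit against that slowdown.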

Impact 4.0 · Import 4.0 · Pop 3.5
#493
arXiv AIScience 2026-04-15
by Massimiliano Lupo Pasini, Jong Youl Choi, Kshitij Mehta, Richard Messerly et al.

We present an exascale workflow for materials discovery using atomistic graph foundation models built on HydraGNN. We jointly train on 16 open first-principles datasets (544+ million structures covering 85+ elements) using a multi-task architecture with per-dataset heads and a scalable ADIOS2/DDStore data pipeline. On Frontier, we execute six large-scale DeepHyper hyperparameter optimization campaigns in FP64 and promote the top-performing message-passing models to sustained 2,048-node training, yielding a PaiNN-based lead model. The resulting model enables billion-scale screening, evaluating 1.1 billion atomistic structures in 50 seconds, compressing a workload that would require years of first-principles computation, and supports data-scarce fine-tuning across diverse downstream tasks. We quantify precision-performance tradeoffs (BF16/FP32/FP64), demonstrate transfer across twelve chemically diverse downstream tasks, and establish seamless strong- and weak-scaling across Frontier, Aurora, and Perlmutter. This work allows fast and reliable exploration of vast chemical design spaces that are otherwise inaccessible to first-principles methods.
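The multi-task pattern described, one shared encoder with a separate prediction head per dataset, can be sketched in a few lines. The names and the toy "trunk" below are purely illustrative and not the actual HydraGNN code:

```python
# Toy multi-task model: a shared trunk feeds per-dataset heads, so jointly
# trained datasets share representations while keeping their own outputs.
def make_head(scale):
    # Stand-in for a per-dataset output layer with its own parameters.
    return lambda features: scale * sum(features)

shared_trunk = lambda x: [xi * 0.5 for xi in x]   # stand-in for the GNN encoder
heads = {"dataset_A": make_head(1.0), "dataset_B": make_head(2.0)}

def forward(x, dataset):
    # Every sample passes through the shared trunk, then through the head
    # matching the dataset it came from.
    return heads[dataset](shared_trunk(x))

print(forward([1.0, 2.0, 3.0], "dataset_A"))
print(forward([1.0, 2.0, 3.0], "dataset_B"))
```

The per-dataset heads let 16 datasets with different label conventions and units be trained jointly without forcing them onto a single output space.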

Impact 4.0 · Import 4.0 · Pop 3.5
#494
arXiv AIScience 2026-04-15
by Adam Lahouari, Shen Ai, Jihye Han, Jillian Hoffstadt et al.

We present an open Molecular Crystal (MC) database of Machine-Learned Interatomic Potentials (MLIP) called MolCryst-MLIPs. The first release comprises fine-tuned MACE models for nine molecular crystal systems -- Benzamide, Benzoic acid, Coumarin, Durene, Isonicotinamide, Niacinamide, Nicotinamide, Pyrazinamide, and Resorcinol -- developed using the Automated Machine Learning Pipeline (AMLP), which streamlines the entire MLIP development workflow, from reference data generation to model training and validation, into a reproducible and user-friendly pipeline. Models are fine-tuned from the MACE-MH-1 foundation model (omol head), yielding a mean energy MAE of 0.141 kJ/mol/atom and a mean force MAE of 0.648 kJ/mol/Angstrom across all systems. Dynamical stability and structural integrity, as assessed through energy conservation, P2 orientational order parameters, and radial distribution functions, are evaluated using molecular dynamics simulations. The released models and datasets constitute a growing open database of validated MLIPs, ready for production MD simulations of molecular crystal polymorphism under different thermodynamic conditions.

Impact 4.0 · Import 4.0 · Pop 3.5
#495
arXiv AIScience 2026-04-15
by Erik Fransson, Michael Xu, Prakriti Kayastha, Kevin Ye et al.

Chalcogenide perovskites have emerged as promising lead-free materials for photovoltaic and thermoelectric applications. Among them, BaZrS3 has attracted particular attention due to its thermal and chemical stability, favorable optoelectronic properties, and low thermal conductivity. Here, we combine molecular dynamics and Monte Carlo simulations based on machine-learned interatomic potentials with scanning transmission electron microscopy to investigate mixing thermodynamics and phase stability in the BaZr(S,Se)3 system. We identify an unusual ordered structure that persists at room temperature, most prominently at 33% S, where S and Se atoms form alternating layers within the crystal. Free energy calculations yield the temperature-composition phase diagram, including a non-perovskite delta phase in the Se-rich limit and a perovskite phase in the S-rich limit, separated by a broad two-phase region. Analysis of the dielectric function and the absorption coefficient demonstrates that composition, crystal structure, and anion ordering jointly control the optical band gap. Selenium alloying enables tuning between approximately 1.6 and 1.9 eV, while anion ordering within a given composition …

Impact 4.0 · Import 4.0 · Pop 3.5
#496
arXiv AIScience 2026-04-15
by Tao Chen, Puqing Jiang

Solid-liquid interfacial thermal conductance (ITC) critically influences heat transport in microfluidic, electronic, and energy systems, yet most optical thermometry techniques are limited to specific metal-liquid interfaces. In this work, we introduce a universal broadband square-pulsed thermometry method that enables simultaneous quantification of ITC across a wide range of arbitrary solid-liquid interfaces, while also providing accurate measurements of nanoscale liquid-film thickness. To validate the method, we applied it to Al-water interfaces, yielding ITC values in the range of 50-55 MW m^(-2) K^(-1), consistent with prior studies. The technique also reveals markedly lower ITCs for glass-water (9.9 MW m^(-2) K^(-1)) and Si-water (5.7 MW m^(-2) K^(-1)), and further measurements on Al-silicone oil (~10 MW m^(-2) K^(-1)) and PMMA-silicone oil (~0.4 MW m^(-2) K^(-1)) extend the validation to highly viscous nonpolar liquids and polymer-liquid interfaces. These results highlight the capability of the method to capture thermal transport differences across diverse solid-liquid combinations. Further comparisons with acoustic/diffuse mismatch models and molecular dynamics simulations …

Impact 4.0 · Import 4.0 · Pop 3.5
#497
arXiv AIScience 2026-04-15
by Geonhee Ahn, Donghyun Lee, Hayoung Doo, Jonggeol Na et al.

Large language models (LLMs) have enabled agentic AI systems for scientific discovery, but most approaches remain limited to text-based reasoning without automated experimental verification. We propose MIND, an LLM-driven framework for automated hypothesis validation in materials research. MIND organizes the scientific discovery process into hypothesis refinement, experimentation, and debate-based validation within a multi-agent pipeline. For experimental verification, the system integrates Machine Learning Interatomic Potentials, particularly SevenNet-Omni, enabling scalable in-silico experiments. We also provide a web-based user interface for automated hypothesis testing. The modular design allows additional experimental modules to be integrated, making the framework adaptable to broader scientific workflows. The code is available at: https://github.com/IMMS-Ewha/MIND, and a demonstration video at: https://youtu.be/lqiFe1OQzN4.

Impact 4.0 · Import 4.0 · Pop 3.5
#498

What Is An AI Delegate?

Industry ★ 3.9
Gradient Flow 2026-04-15

OpenClaw arrived with little fanfare and quickly became the fastest-growing open source AI project on record. At its core, OpenClaw is a personal autonomous agent framework: software that connects a large language model to your email, calendar, file system, messaging apps, and external APIs, then acts on your behalf across all of them simultaneously without waiting to be asked. OpenClaw did not just answer questions. It completed work. That distinction resonated immediately with developers and technically sophisticated early adopters who had grown frustrated with AI tools that stopped at the edge of a chat window. The adoption rate outpaced the security reviews, and that gap mattered. OpenClaw’s open architecture made it composable and extensible, but it also made it trivially easy to deploy with over-permissioned access, no audit logging, and no vetting of community-contributed skill modules. The documented ClawHavoc supply chain attack, in which malicious skills compromised thousands of installations, was a predictable outcome of consumer-grade architecture deployed without enterprise-grade controls. Industry analysts flagged the category as an unacceptable cybersecurity risk for enterprise use. None of the security alerts and warnings stopped adoption. The excitement around what OpenClaw enables simply outpaced concern about what it could expose. What followed was a wave of systems that took the core concept of OpenClaw and addressed its roughest edges while extending its reach. Claude Code Channels embeds autonomous execution directly inside Anthropic’s interface, adding structured reasoning traces and tighter permission boundaries that make the system’s decision making more inspectable. Nem…

Impact 4.0 · Import 4.0 · Pop 3.5
#500
GitHub Blog - AI/ML 2026-04-15

What if you could remove the struggle of context switching across several apps, bringing them together into one place? Meet Brittany Ellich, Staff Software Engineer, and the productivity tool she built to streamline her work. We sat down with Brittany to learn about this project: what she built, how she did it, and how AI supported the development process from ideation to implementation. Brittany created a visual home that fits how she learns and thinks, all inspired by the GitHub Copilot CLI. Q & A What is your role at GitHub? I’m a staff software engineer on the billing team at GitHub. My day-to-day work mostly consists of working on metered billing, so things like keeping records of Actions minutes, storage amounts, and Copilot usage. I passionately dogfood everything that comes out of the Copilot org. I’m also an open source contributor to ATProto projects and built Open Social for applications built on the AT Protocol. What did you build? I built a personal organization command center to solve a simple problem: digital fragmentation. My goal was to take everything scattered across a dozen different apps and unify them into one calm, central space. How long did v1 take to make? I use a plan-then-implement workflow when building systems, leveraging AI for planning and Copilot for implementation. For v1, this approach let me move from idea to a working tool in a single day alongside my other regular work. While planning, I have Copilot interview me with questions about how something should work until we have a plan that I think is adequate. That way, there’s less guesswork about what I want done and implementation goes more smoothly. Copilot will implement the work based on the plan that we put together …

Impact 4.0 · Import 4.0 · Pop 3.5
#501
C4ISRNET 2026-04-15

The U.S. Air Force for the first time utilized the service’s new artificial intelligence wargame system in an event late last month. The department premiered WarMatrix for its inaugural use at the March 27 GE 26 Benchmark Wargame, marking the system’s move from development into operational capacity, according to a Tuesday Air Force release. WarMatrix, described by the force as an “active wargaming environment,” is an AI-powered system that integrates existing models, data and workflows while expediting analysis. The Air Force at the end of 2025 said it was looking for technology capable of producing simulations 10,000 times faster than real time. WarMatrix is a “human-machine teaming system” meant to keep human judgment integral to planning and decision making, according to the release. The use of WarMatrix during the event served as the system’s initial operating concept evaluation, signaling a change in how the Air Force conducts operational analysis and wargaming. “Designed by wargamers for wargamers, WarMatrix provides transparency, auditability and speed, enabling decision-makers to better understand assumptions, outcomes and tradeoffs,” the statement reads. Air Force leaders portray WarMatrix as an evolution in wargaming rather than a replacement, and the release says that the use of WarMatrix provided a more “connected and traceable wargaming process.” It also said that the system’s design allowed for faster scenario development, repeat findings and increased collaboration with joint and coalition partners. The two-week-long event, hosted at Systems Planning and Analysis in Alexandria, Virginia, was attended by more than 150 people, including technical experts, Air Force leadership and allied planners. Attendees during the event fulfilled more than six 24-hour “…

Impact 4.0 · Import 4.0 · Pop 3.5
#502
C4ISRNET 2026-04-15

WARSAW, Poland — Estonia’s government has decided to put on hold its planned acquisition of new infantry fighting vehicles. The Baltic nation will instead direct the funds toward drones, counter-drone measures and air-defense systems, while squeezing more service life out of the country’s existing fleet of second-hand CV90 vehicles. Estonian Defense Minister Hanno Pevkur announced the move, which pauses a €500 million ($590 million) acquisition, last week, saying it was based on lessons drawn from Russia’s invasion of Ukraine. Tallinn will “extend the service life of the existing CV90 vehicles by at least 10 years,” Pevkur was quoted as saying in a statement issued by the government. The move is in contrast with actions by the other two Baltic states, Latvia and Lithuania, which have made decisions to buy new CV90 and Ascod vehicles, respectively. “We have decided that, at present, it is more rational to modernize the existing infantry fighting vehicles rather than replace them. Modernization will ensure the sustained preservation of capability and the efficient use of resources,” Andri Maimets, spokesman for the Estonian Centre for Defence Investments (ECDI), the country’s military procurement agency, told Defense News. Under the modernization plan, the vehicles are to be fitted with new electronics, and their weapon and targeting systems will be upgraded, Maimets said. Estonia secured 44 used CV90s from the Netherlands that were delivered in 2019, and sourced an additional 37 hulls of vehicles made by BAE Systems Hägglunds for Norway, subsequently rebuilding them into support vehicles. Raimond Kaljulaid, an Estonian lawmaker who represents the opposition Social Democratic Party on the parliament’s National Defence Committee, told Defense News the decision should be viewed in the …

Impact 4.0 · Import 4.0 · Pop 3.5
#503
FedScoop 2026-04-15

IRS CEO Frank Bisignano took a Tax Day victory lap Wednesday before the Senate Finance Committee, telling credulous Republicans and highly skeptical Democrats that the “most successful filing season” in agency history was made possible by technology investments. Bisignano, who is still pulling double duty as the non-Senate-confirmed IRS chief and the head of the Social Security Administration, repeatedly touted the tax agency’s work on his watch to leverage “advanced technology” while empowering the workforce “with better tools.” The former fintech payments executive acknowledged that 25% of IRS staff took voluntary retirement packages. But he claimed that the significant loss in staff — which has alarmed watchdogs and taxpayer advocates — has had no negative impact on agency operations. “Do you feel you’re understaffed at this point in time?” Sen. Ron Johnson, R-Wis., asked. “Or are you gonna be able to get by and do a better job with less?” “The numbers are showing that,” Bisignano replied. “And that’s why I think, the scoreboard, you know, we’re getting refunds out — 98 million of them within 21 days.” He also said the IRS plans to launch a leadership academy to “teach to the next levels of management on … how you can have less people and better results.” According to Bisignano, the IRS’s online inquiries were up 60% this filing season, the website’s “where’s my refund?” page was the “No. 1-used function,” and the agency is continuing to upgrade legacy systems, which he said would help it “collect more revenue.” Sen. Sheldon Whitehouse, D-R.I., pressed Bisignano on how a 19% staff reduction to the IRS’s large business and international division affected audits under that unit’s purview. Bisignano’s response, unsurprisingly, was that …

Impact 4.0 · Import 4.0 · Pop 3.5
#504
FedScoop 2026-04-15

Federal agencies have always faced the challenge of finding and retaining talent that meets current and future mission needs. For too long, however, federal and defense agency heads have lacked the tools and incentives to view workforce planning as the agile, proactive, strategic discipline it should be: one that drives, not just supports, an agency’s mission. For many agencies, workforce planning in today’s environment remains a static, reactive and performative exercise — designed mostly to satisfy HR and budget requirements to justify full-time employee (FTE) counts, fill vacant seats, or contract skills the department lacks. Mike Houlihan is Vice President at Workday Government. Unfortunately, that has led to what amounts to “good enough” workforce planning, which, in reality, has weakened human capital management across the federal government and left agencies poorly prepared for the challenges of today’s modern world. Building resilience amidst workforce shifts The significant “succession event” of 2025, which saw approximately 348,000 employees — roughly 10% of the federal workforce — transition out of their roles, has underscored the vital importance of agile workforce planning. This sudden shift in the talent landscape highlighted a unique challenge: the need to preserve deep institutional memory while simultaneously scaling new, specialized skill sets. During this period of rapid change, the value of a modernized skills-and-talent map became clear. For agencies to navigate shifting work demands effectively, having real-time insights into the specialized knowledge of their workforce is no longer just an advantage — it is the essential foundation for resilient operations. Modern workforce planning is not just about identifying critical skills …

Impact 4.0 · Import 4.0 · Pop 3.5
#505
War on the Rocks 2026-04-15

In 2024, John Stanko and Spenser Warren wrote “Russian Threat Perception and Nuclear Strategy in its Plans for War with China,” where they explained some of the potential tensions between Russia and China that could make armed conflict between the two countries a possibility. Two years later, we asked them to revisit their arguments. Image: Wikimedia Commons. In your 2024 article, you explained some of the potential tensions in the Sino-Russian partnership that could make armed conflict between the two countries a possibility. Two years later, have any new tensions emerged between the two countries? Are Russia and China stronger partners than they were … The post Examining the Cracks and the Cement in the Sino-Russian Relationship appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#506
War on the Rocks 2026-04-15

What happens when one of the defense industry’s leading technology integrators partners with one of Silicon Valley’s most influential venture firms? Bryce Pippert of Booz Allen and Matt Cronin of Andreessen Horowitz join Ryan to unpack their new partnership and discuss how the United States is trying to tap into the tech ecosystem. The conversation gets into venture capital’s growing role in national security, how this partnership is supposed to work in practice, and why getting real technology into government hands is still harder than it should be. Image: Midjourney. The post Why Booz Allen is Partnering With One of the World’s Most Important VC Firms appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#507
War on the Rocks 2026-04-15

The recent news that Russian opposition leader Alexei Navalny’s death was caused by epibatidine — a South American frog toxin — has reignited interest in state use of poisons and toxins in assassinations. Although state use of such compounds has a long history, the erosion of the norms prohibiting assassinations and chemical and biological weapons increases the likelihood of future assassinations using poisons and toxins. As demonstrated in the recent targeting of the Iranian leadership in Operation Eric Fury and in the assassinations of prominent Iranian nuclear scientists, including Mohsen Fakhrizadeh, the assassination norm has collapsed. The norms against chemical … The post Silent Killers, Not Signals: Why States Use Poison in Assassinations appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#508
War on the Rocks 2026-04-15

What happens when a country at war stops seriously debating its own strategy and goals? In Israel today, that scenario is no longer theoretical. In the aftermath of Hamas’ surprise attack on Oct. 7, 2023, and the subsequent conflicts in Gaza, Lebanon, and Iran, Israel’s strategic decision-making ecosystem has been progressively undermined by Prime Minister Benjamin Netanyahu and his political associates. Netanyahu has surrounded himself with pliable officials and supportive voices, including the defense minister and the leaders of the Mossad and the Internal Security Agency (Shin Bet). He has become increasingly dependent on extreme coalition partners who constrain the parameters of … The post The Demise of Strategic Planning in Israel appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#509
arXiv cs.NE 2026-04-16
by Guy Kaplan, Zorik Gekhman, Zhen Zhu, Lotem Rozner et al.

Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this …
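The drift-regularization idea described above can be sketched as a combined loss: cross-entropy on the new SFT label plus a KL term that keeps the student's output distribution close to the frozen pre-SFT model (the "teacher"). This is a minimal single-token numpy sketch; the logits and the weight `lam` are invented, and the paper's actual method and hyperparameters may differ.

```python
import numpy as np

# Single-token numpy sketch of the self-distillation idea: fit the new
# SFT label with cross-entropy while a KL term penalizes drift of the
# student's output distribution from the frozen pre-SFT model (the
# "teacher"). Logits and the weight lam are invented for illustration.

def softmax(z):
    z = z - z.max()               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sft_distill_loss(student_logits, teacher_logits, target, lam=0.5):
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    ce = -np.log(p_s[target])                        # learn the new fact
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))   # limit distribution drift
    return ce + lam * kl

student = np.array([2.0, 0.5, -1.0])   # logits after some SFT updates
teacher = np.array([1.5, 1.0, -0.5])   # logits of the frozen pre-SFT model
print(sft_distill_loss(student, teacher, target=0))
```

Setting `lam=0` recovers plain SFT; larger values trade new-fact learning against preservation of pre-training knowledge.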

Impact 4.0 · Import 4.0 · Pop 3.5
#510
arXiv cs.NE 2026-04-16
by Duan Zhou

This work simulates the developmental process of cortical neurogenesis, initiating from a single stem cell and governed by gene regulatory rules derived from mouse single-cell transcriptomic data. The developmental process spontaneously generates a heterogeneous population of 5,000 cells, yet yields only 85 mature neurons - merely 1.7% of the total population. These 85 neurons form a densely interconnected core of 200,400 synapses, corresponding to an average degree of 4,715 per neuron. At iteration zero, this minimal circuit performs at chance level on MNIST. However, after a single epoch of standard training, accuracy surges to over 90% - a gain exceeding 80 percentage points - with typical runs falling in the 89-94% range depending on developmental stochasticity. The identical circuit, without any architectural modification or data augmentation, achieves 40.53% on CIFAR-10 after one epoch. These findings demonstrate that developmental rules sculpt a domain-general topological substrate exceptionally amenable to rapid learning, suggesting that biological developmental processes inherently encode powerful structural priors for efficient computation.
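The quoted circuit statistics are internally consistent, assuming the average degree is the undirected-graph quantity 2E/N (each synapse counted at both of its endpoints):

```python
# Consistency check of the statistics quoted above; assumes the average
# degree is the undirected-graph quantity 2E/N (each synapse counted at
# both endpoints).
neurons, total_cells, synapses = 85, 5_000, 200_400

mature_fraction = neurons / total_cells   # 85 of 5,000 cells mature
avg_degree = 2 * synapses / neurons       # 2E/N

print(round(mature_fraction * 100, 1))    # 1.7 (percent, as quoted)
print(round(avg_degree))                  # 4715 (as quoted)
```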

Impact 4.0 · Import 4.0 · Pop 3.5
#511
arXiv cs.NE 2026-04-16
by Hibatallah Meliani, Khadija Slimani, Samira Khoulji

To navigate a space, the brain makes an internal representation of the environment using different cells such as place cells, grid cells, head direction cells, border cells, and speed cells. All these cells, along with sensory inputs, enable an organism to explore the space around it. Inspired by these biological principles, we developed NEATNC, a NeuroEvolution of Augmenting Topologies approach guided by navigation cells. The goal of the paper is to improve the NEAT algorithm's performance in path planning in dynamic environments using spatial cognitive cells. This approach uses navigation cells as inputs and evolves recurrent neural networks, representing the hippocampus part of the brain. The performance of the proposed algorithm is evaluated in different static and dynamic scenarios. This study highlights NEAT's adaptability to complex and varied environments, showcasing the utility of biological theories. This suggests that our approach is well-suited for real-time dynamic path planning for robotics and games.

Impact 4.0 · Import 4.0 · Pop 3.5
#512
arXiv cs.NE 2026-04-16
by Liam Wigney, Frank Neumann

Pareto optimization via evolutionary multi-objective algorithms has been shown to efficiently solve constrained monotone submodular functions. Traditionally, when solving multiple problems, the algorithm is run for each problem separately. We introduce multitasking formulations of these problems that are an effective way to solve multiple related problems with a single run. In our setting the given problems share a monotone submodular function $f$ but have different knapsack constraints. We examine the case where elements within a constraint have the same cost and show that our multitasking formulations result in small Pareto fronts. This allows the population to share solutions between all problems, leading to significant improvements compared to running several classical approaches independently. Using rigorous runtime analysis, we analyze the expected time until the introduced multitasking approaches obtain a $(1-1/e)$-approximation for each of the given problems. Our experimental investigations for the maximum coverage problem give further insight into the dynamics behind how the approach works and doesn't work in practice for problems where elements within a constraint also have different costs.
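For context, the $(1-1/e)$ factor is the classical guarantee of greedy monotone submodular maximization under a cardinality constraint, shown here on a tiny maximum-coverage instance. This baseline sketch is not the paper's multitasking Pareto algorithm; the instance is invented.

```python
# Greedy (1 - 1/e)-approximation for monotone submodular maximization
# under a cardinality constraint, illustrated on maximum coverage.
# This is the classical baseline such Pareto-based approaches are
# compared against, not the paper's own algorithm.

def greedy_max_coverage(sets, k):
    """Pick k sets greedily, each time maximizing the marginal coverage gain."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max(range(len(sets)),
                   key=lambda i: len(sets[i] - covered) if i not in chosen else -1)
        if len(sets[best] - covered) == 0:
            break                      # no remaining marginal gain
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
chosen, covered = greedy_max_coverage(sets, k=2)
print(chosen, sorted(covered))  # [2, 0] [1, 2, 3, 4, 5, 6, 7]
```

Running the greedy once per budget mirrors the "separate runs" baseline; the multitasking formulation instead shares one population across all knapsack budgets.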

Impact 4.0 · Import 4.0 · Pop 3.5
#513
arXiv cs.NE 2026-04-16
by Hy Nguyen, Thanh Nguyen Pham, Helen Yuliana Angmalisang, Liam Wigney et al.

In many real-world settings, problem instances that need to be solved are quite similar, and knowledge from previous optimization runs can potentially be utilized. We explore this for the Traveling Salesperson problem with time windows (TSPTW), which often arises in settings where the travel-time matrix is fixed but time-window constraints change across related tasks. Existing TSPTW studies, however, have not systematically compared solving such task sequences independently with sequential transfer from previously solved tasks. We address this gap using a multi-task benchmark in which each base instance is expanded into five related tasks under two environments: partial time-window expansion and swap-additive time reassignment. We compare a standard from-scratch protocol with an iterative protocol that initializes each task from the best tour of the previous task, using the popular local search approaches LNS, VNS, and LKH-3 under a common penalized-score objective. Our experimental results show that the iterative protocol is consistently superior in the progressive-relaxation setting and generally competitive under swap-additive changes, with improvements increasing on more difficult …
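The gap between the two protocols can be caricatured with a toy cost model in which a hypothetical local search pays one unit of work per unit of distance between its starting solution and the task's optimum. Everything here is invented for illustration; real LNS/VNS/LKH-3 behavior is far richer.

```python
# Toy model of the two protocols compared above: each "task" is a target
# solution quality, and a hypothetical local search costs one step per
# unit of distance from its starting point to the target. All numbers
# are invented for illustration.

def steps(start, target):
    return abs(target - start)

tasks = [100, 98, 97, 95, 94]          # five related tasks (similar optima)

scratch = sum(steps(120, t) for t in tasks)   # always restart from 120
iterative, cur = 0, 120
for t in tasks:
    iterative += steps(cur, t)         # warm-start from the previous solution
    cur = t
print(scratch, iterative)  # 116 26 -- transfer pays off when tasks are similar
```

When consecutive tasks have nearby optima, the warm-started run only pays for the differences between tasks, which is the intuition behind the iterative protocol's advantage.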

Impact 4.0 · Import 4.0 · Pop 3.5
#514
arXiv stat.ML 2026-04-16
by Tianhao Liu, Daniel Andrés Díaz-Pachón, J. Sunil Rao

Supervised No Free Lunch Theorems (NFLTs) are well studied, yet unsupervised NFLTs remain underexplored. For elliptical distributions, we prove that there exist two equally optimal, scientifically meaningful bump-hunting strategies that are exact opposites, with no universal winner. Specifically, peeling $k$ orthogonal dimensions from $\mathbb{R}^d$ ($d \ge k$), retaining an inter-quantile region of probability $1-\alpha$ per peeled dimension, maximizes total variance and Frobenius norm when the $k$ smallest principal components (called pettiest components) are selected, and minimizes them when the selected dimensions are the $k$ leading principal components. These optima inspire PRIM-based bump-hunting algorithms either by minimizing variance or by minimizing volume, thereby motivating an NFLT. We test our results on the Fashion-MNIST database, showing that peeling the largest principal components captures multiplicity, while peeling the smallest principal components isolates popular styles.
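A minimal sketch of the peeling step, assuming PRIM-style inter-quantile boxes along principal directions of a synthetic Gaussian sample; whether to peel the leading or the "pettiest" components is exactly the choice the dichotomy concerns. Data and parameters are illustrative.

```python
import numpy as np

# Toy sketch of quantile peeling along principal directions (PRIM-style
# boxes). Whether to peel the leading or the smallest ("pettiest")
# principal components is the choice the no-free-lunch result concerns.
# Data and parameters are invented for illustration.

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0, 0.0], np.diag([9.0, 4.0, 1.0]), size=2000)

evals, evecs = np.linalg.eigh(np.cov(X.T))    # eigenvalues in ascending order
alpha, k = 0.1, 2

mask = np.ones(len(X), dtype=bool)
for j in range(k):                             # peel the k pettiest components
    proj = X @ evecs[:, j]
    lo, hi = np.quantile(proj, [alpha / 2, 1 - alpha / 2])
    mask &= (proj >= lo) & (proj <= hi)

print(mask.mean())  # close to (1 - alpha)**k = 0.81 for near-independent axes
```

Peeling the leading components instead would just mean iterating over the last `k` columns of `evecs`; the theorem says each choice is optimal for the opposite objective.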

Impact 4.0 · Import 4.0 · Pop 3.5
#515
arXiv stat.ML 2026-04-16
by Sotirios D. Nikolopoulos

Adaptive specification search generates statistically significant backtests even under martingale-difference nulls. We introduce a falsification audit testing complete predictive workflows against synthetic reference classes, including zero-predictability environments and microstructure placebos. Workflows generating significant walk-forward evidence in these environments are falsified. For passing workflows, we quantify selection-induced performance inflation using an absolute magnitude gap linking optimized in-sample evidence to disjoint walk-forward realizations, adjusted for effective multiplicity. Simulations validate extreme-value scaling under correlated searches and demonstrate detection power under genuine structure. Empirical case studies confirm that many apparent findings represent methodological artifacts rather than genuine predictability.

Impact 4.0 · Import 4.0 · Pop 3.5
#516
arXiv stat.ML 2026-04-16
by Panos Tsimpos, Daniel Sharp, Youssef Marzouk

We study dynamic measure transport for generative modelling in the setting of a stochastic process $X_\bullet$ whose marginals interpolate between a source distribution $P_0$ and a target distribution $P_1$ while remaining independent, i.e., when $(X_0,X_1)\sim P_0\otimes P_1$. Conditional expectations of this process $X_\bullet$ define an ODE whose flow map transports from $P_0$ to $P_1$. We discuss when such a process induces a \emph{straight-line flow}, namely one whose pointwise acceleration vanishes and is therefore exactly integrable by any first-order method. We first develop multiple characterizations of straightness in terms of PDEs involving the conditional statistics of the process. Then, we prove that straightness under endpoint independence exhibits a sharp dichotomy. On one hand, we construct explicit, computable straight-line processes for arbitrary Gaussian endpoints. On the other hand, we show straight-line processes do not exist for targets with sufficiently well-separated modes. We demonstrate this through a sequence of increasingly general impossibility theorems that uncover a fundamental relationship between the sample-path behavior of a process with independent …

Impact 4.0 · Import 4.0 · Pop 3.5
#517
arXiv stat.ML 2026-04-16
by Chenghui Zheng, Garvesh Raskutti

Feature selection is a classical problem in statistics and machine learning, and it continues to remain an extremely challenging problem especially in the context of unknown non-linear relationships with dependent features. On the other hand, Shapley values are a classic solution concept from cooperative game theory that is widely used for feature attribution in general non-linear models with highly-dependent features. However, Shapley values are not naturally suited for feature selection since they tend to capture both direct effects from each feature to the response and indirect effects through other features. In this paper, we combine the advantages of Shapley values and adapt them to feature selection by proposing \emph{MinShap}, a modification of the Shapley value framework, along with a suite of other related algorithms. In particular, MinShap, instead of taking the average of marginal contributions over permutations of features, considers the minimum marginal contribution across permutations. We provide a theoretical foundation motivated by the faithfulness assumption in DAGs (directed acyclic graphical models), a guarantee for the Type I error of MinShap, and show through numerical …
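The average-versus-minimum distinction can be made concrete on a toy value function with two redundant features, computed by exact enumeration over the 3! permutations. The value function is hypothetical, not from the paper; it only illustrates why the minimum zeroes out redundant, indirect contributions.

```python
from itertools import permutations

# Toy contrast between the Shapley value (average marginal contribution
# over feature permutations) and a MinShap-style score (minimum marginal
# contribution). The value function below is hypothetical: features 0
# and 1 are redundant copies, feature 2 contributes independently.

def value(S):
    return (1.0 if 0 in S or 1 in S else 0.0) + (0.5 if 2 in S else 0.0)

features = [0, 1, 2]

def marginals(i):
    out = []
    for perm in permutations(features):
        before = set(perm[:perm.index(i)])
        out.append(value(before | {i}) - value(before))
    return out

shapley = {i: sum(marginals(i)) / len(marginals(i)) for i in features}
minshap = {i: min(marginals(i)) for i in features}
print(shapley)  # {0: 0.5, 1: 0.5, 2: 0.5} -- redundant features share credit
print(minshap)  # {0: 0.0, 1: 0.0, 2: 0.5} -- redundancy is zeroed out
```

The Shapley average cannot distinguish the redundant pair from the genuinely necessary feature, while the minimum marginal contribution isolates direct effects, which is the motivation for using it in selection.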

Impact 4.0 · Import 4.0 · Pop 3.5
#518
arXiv stat.ML 2026-04-16
by Xuelin Xie, Xiliang Lu

To obtain more accurate model parameters and improve prediction accuracy, we proposed a regularized Kriging model that penalizes the hyperparameter theta in the Gaussian stochastic process, termed the Theta-regularized Kriging. We derived the optimization problem for this model from a maximum likelihood perspective. Additionally, we presented specific implementation details for the iterative process, including the regularized optimization algorithm and the geometric search cross-validation tuning algorithm. Three distinct penalty methods, Lasso, Ridge, and Elastic-net regularization, were meticulously considered. Meanwhile, the proposed Theta-regularized Kriging models were tested on nine common numerical functions and two practical engineering examples. The results demonstrate that, compared with other penalized Kriging models, the proposed model performs better in terms of accuracy and stability.

Impact 4.0 · Import 4.0 · Pop 3.5
#519
arXiv stat.ML 2026-04-16
by Y-h. Taguchi, Yoh-ichi Mototake

In this paper, we propose Bayesian Tucker decomposition (BTuD), in which the residual is assumed to follow a Gaussian distribution, analogous to linear regression. Although we propose a dedicated algorithm to perform BTuD, conventional higher-order orthogonal iteration can generate Tucker decompositions consistent with the present implementation. Using the proposed BTuD, we perform unsupervised feature selection, successfully applied to various synthetic datasets, globally coupled maps with randomized coupling strength, and gene expression profiles. Thus we can conclude that our newly proposed unsupervised feature selection method is promising. In addition, BTuD-based unsupervised FE is expected to coincide with the TD-based unsupervised FE that was previously proposed and successfully applied to a wide range of problems.

Impact 4.0 · Import 4.0 · Pop 3.5
#520
arXiv stat.ML 2026-04-16
by Emre Özyıldırım, Barış Yaycı, Umut Eren Akturk, Cem Tekin

We study downlink beam and rate adaptation in a multi-user mmWave MISO system where multiple base stations (BSs), each using analog beamforming from finite codebooks, serve multiple single-antenna user equipments (UEs) with a unique beam per UE and discrete data transmission rates. BSs learn about transmission success based on ACK/NACK feedback. To encode service goals, we introduce a satisficing throughput threshold $\tau_r$ and cast joint beam and rate adaptation as a combinatorial semi-bandit over beam-rate tuples. Within this framework, we propose SAT-CTS, a lightweight, threshold-aware policy that blends conservative confidence estimates with posterior sampling, steering learning toward meeting $\tau_r$ rather than merely maximizing. Our main theoretical contribution provides the first finite-time regret bounds for combinatorial semi-bandits with a satisficing objective: when $\tau_r$ is realizable, we upper bound the cumulative satisficing regret to the target with a time-independent constant, and when $\tau_r$ is non-realizable, we show that SAT-CTS incurs only a finite expected transient outside committed CTS rounds, after which its regret is governed by the sum of the regret contributions …

Impact 4.0 · Import 4.0 · Pop 3.5
#521
arXiv stat.ML 2026-04-16
by Yasin Abbasi-Yadkori, Peter L. Bartlett, Victor Gabillon, Alan Malek et al.

We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards? First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.
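The "simple random uniform learner" the abstract refers to is easy to make concrete: sample arms uniformly at random for the whole budget, then recommend the arm with the best empirical mean. A minimal fixed-budget sketch (function and parameter names are hypothetical):

```python
import random

def uniform_bai(pull, n_arms, budget, seed=1):
    """Uniform-sampling best-arm identification.

    `pull(i)` returns a reward for arm i. The learner ignores all feedback
    while exploring, which is what makes it robust to adversarial rewards:
    its sampling distribution cannot be steered by the reward sequence.
    """
    rng = random.Random(seed)
    total = [0.0] * n_arms
    count = [0] * n_arms
    for _ in range(budget):
        i = rng.randrange(n_arms)
        total[i] += pull(i)
        count[i] += 1
    means = [total[i] / count[i] if count[i] else float("-inf")
             for i in range(n_arms)]
    return max(range(n_arms), key=means.__getitem__)
```

The paper's question is precisely whether one can improve on this baseline in stochastic instances without giving up its adversarial robustness.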

Impact 4.0 · Import 4.0 · Pop 3.5
#522
arXiv stat.ML 2026-04-16
by Connie Trojan, Pavel Myshkov, Paul Fearnhead, James Hensman et al.

In online clustering problems, there is often a large amount of uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods give a natural way of representing and updating this uncertainty over time, but have prohibitive memory requirements for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method is able to accurately and efficiently solve clustering problems in this setting and others where traditional SMC struggles.

Impact 4.0 · Import 4.0 · Pop 3.5
#523
arXiv AIScience 2026-04-16
by Graham M Shore

Rovibrational transitions in the hydrogen and antihydrogen molecular ions $H_2^+$ and $\overline{H}_2^-$ offer the possibility of testing Lorentz and CPT symmetry to extremely high precision, in principle attaining $O(10^{-17})$. In this paper, the third in a series, we give a comprehensive derivation of the rovibrational spectrum of $H_2^+$ and $\overline{H}_2^-$ in the SME, an effective quantum field theory incorporating Lorentz and CPT violation. New developments described here include a complete analysis of the molecular dynamics from first principles in terms of the spherical tensor representation of the SME couplings, the systematic extension of our previous results to the non-minimal SME, a full description of the quantum number dependence of the rovibrational energy levels in the spherical tensor formalism with both high and low background magnetic fields, and an extended discussion of sidereal and annual variations of transition frequencies arising from both rotations and Lorentz boosts. The resulting sensitivity of the rovibrational spectrum to an extended range of SME couplings, together with the ability to isolate their individual effects using the quantum number depend

Impact 4.0 · Import 4.0 · Pop 3.5
#524
arXiv AIScience 2026-04-16
by Edna R. Toro, Tobias Held, Armin Bergermann, Megan Ikeya et al.

Ultrafast melting is fundamentally a structural transition of the ionic lattice, but this rearrangement also reshapes the electronic properties by changing the energy landscape and scattering mechanisms. Although the electrons react almost instantaneously, it is not {\it a priori} clear how much lattice disorder is required for a significant response. Here, we show that the onset of melting already produces a clear electronic signature in polycrystalline copper. Using single-shot terahertz time-domain spectroscopy on thin films excited over a wide range of laser fluences, we infer the transient conductivity during the first picoseconds after excitation. The data, supported by two-temperature molecular-dynamics simulations, show that before melting, electron transport is substantially limited by grain-boundary scattering and that melting strongly suppresses this channel. As melting begins at these interfaces, we observe a transient increase in the conductivity that directly marks the onset of the phase transition. More broadly, these results show that ionic and electronic relaxation stages are closely coupled in nonequilibrium laser-driven matter and that optical measurements can re

Impact 4.0 · Import 4.0 · Pop 3.5
#525
arXiv AIScience 2026-04-16
by Zhizheng Wang, Chih-Hsuan Wei, Joey Chan, Robert Leaman et al.

Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems lack explicit and inspectable criteria for evidence appraisal, creating a risk of compounding errors and making it difficult for researchers and clinicians to assess the reliability of their outputs. In parallel, current benchmarking approaches rarely evaluate performance on complex, real-world medical questions. Here, we introduce DeepER-Med, a Deep Evidence-based Research framework for Medicine with an agentic AI system. DeepER-Med frames deep medical research as an explicit and inspectable workflow of evidence-based generation, consisting of three modules: research planning, agentic collaboration, and evidence synthesis. To support realistic evaluation, we also present DeepER-MedQA, an evidence-grounded dataset comprising 100 expert-level research questions derived from authentic medical research scenarios and curated by a multidiscipl

Impact 4.0 · Import 4.0 · Pop 3.5
#526
arXiv AIScience 2026-04-16
by Mariia Ivonina, Jakub Rydzewski

The SARS-CoV-2 RNA pseudoknot is a promising target for antiviral intervention, as it regulates the efficiency of $-$1 programmed ribosomal frameshifting ($-$1 PRF), a mechanism that is essential for viral protein synthesis. The pseudoknot represents a viral RNA sequence composed of helical stems that adopts two long-lived topologies, threaded and unthreaded. Ligand-induced distortion of this fold is thought to underlie the susceptibility of $-$1 PRF to small-molecule inhibitors. Resolving these distortions from unbiased molecular dynamics (MD) requires collective variables (CVs) that isolate the slowest dynamic modes of the RNA--ligand system from the high-frequency fluctuations. Here, we use spectral map (SM), a thermodynamics-driven machine-learning method, to learn such CVs directly from MD trajectories of the SARS-CoV-2 RNA pseudoknot in complex with the $-$1 PRF inhibitor merafloxacin and two related analogs. We examine both threaded and unthreaded pseudoknot topologies and consider the neutral and ionized ligand forms relevant at physiological pH. Free-energy landscapes show that ligand-induced destabilization is topology-selective: merafloxacin and its analogs destabilize t

Impact 4.0 · Import 4.0 · Pop 3.5
#527
arXiv AIScience 2026-04-16
by Michael A. Seaton, Benjamin T. Speake, Ilian T. Todorov

Modelling micro- and mesoscopic-scale thermodynamic and transport properties of soft condensed matter hinges upon its representation. This is especially relevant for polar solvents such as water, since these require effective representation of their dielectric nature as driven by molecular charge distributions and molecular network structuring. The dielectric nature of a medium leads to complex phenomena such as local polarisability response and restructuring near interfaces in reaction to changes in local charge distributions. Inclusion of such phenomena when using larger-than-atomistic techniques such as coarse-grained molecular dynamics (CG-MD) and dissipative particle dynamics (DPD) is still an open question, to which we provide a novel way to consider and justify the necessary and suitable coarse-graining level, enabling us to compare new polar CG models' performance against that of an underlying atomistic model. We polarise our previous non-polar nDPD water model to prepare it for use in simulations of liquid electrolytes as well as solvated organic membranes and measure its fitness to serve as a dielectric medium by comparing its properties to those of the TIP3P water model

Impact 4.0 · Import 4.0 · Pop 3.5
#528
arXiv AIScience 2026-04-16
by Gabriele Amante, Fortunata Panzera, Gabriele Centi, Jing Xie et al.

The origin of enhanced reactivity in aqueous microdroplets remains debated, with interfacial electric fields (IEFs) often invoked as catalytic drivers. Here, we provide a quantum-mechanical, spatially resolved characterization of the electric field at air-water interfaces by combining deep-learning molecular dynamics with \emph{ab initio} re-sampling. Across planar interfaces and nanodroplets of varying curvature and charge state, we find an outward-oriented field of $\sim 1.0$--$1.2$ V/Å along the intrinsic surface normal. Crucially, its magnitude scales linearly with the average number of hydrogen bonds per interfacial molecule, directly tying the field to the local hydrogen-bond network. Despite its large magnitude and contrary to common expectations, we find that curvature and pH exert only a minor influence on the IEF, becoming negligible at experimentally relevant droplet sizes and pH. Consequently, the reactivity differences observed in $μ$m-sized droplets cannot be ascribed to variations in the IEF, which changes by a factor of only $\sim10^{-5}$ between $3$ and $40μ$m-sized droplets. Moreover, the IEF is localized inside the interfacial region and rapidly vanishes within a

Impact 4.0 · Import 4.0 · Pop 3.5
#529
arXiv AIScience 2026-04-16
by Xiao-Liang Qi

This article argues that the most important significance of the AI revolution, especially the rise of large language models, lies not simply in automation, but in a fundamental change in how complex information and human know-how are carried, replicated, and shared. From this perspective, AI for Science is especially important because it may transform not only the efficiency of research, but also the structure of scientific collaboration, discovery, publishing, and evaluation. The article outlines a gradual path from AI as a research tool to AI as a scientific collaborator, and discusses how AI is likely to fundamentally reshape scientific publication. It also argues that continuous learning and diversity of ideas are essential if AI is to play a meaningful role in original scientific discovery.

Impact 4.0 · Import 4.0 · Pop 3.5
#530
arXiv AIScience 2026-04-16
by Yubin Kim, Salman Rahman, Samuel Schmidgall, Chunjong Park et al.

Scientific discovery in digital health requires converting continuous physiological signals from wearable devices into clinically actionable biomarkers. We introduce CoDaS (AI Co-Data-Scientist), a multi-agent system that structures biomarker discovery as an iterative process combining hypothesis generation, statistical analysis, adversarial validation, and literature-grounded reasoning with human oversight using large-scale wearable datasets. Across three cohorts totaling 9,279 participant-observations, CoDaS identified 41 candidate digital biomarkers for mental health and 25 for metabolic outcomes, each subjected to an internal validation battery spanning replication, stability, robustness, and discriminative power. Across two independent depression cohorts, CoDaS surfaced circadian instability-related features in both datasets, reflected in sleep duration variability (DWB, ρ= 0.252, p < 0.001) and sleep onset variability (GLOBEM, ρ= 0.126, p < 0.001). In a metabolic cohort, CoDaS derived a cardiovascular fitness index (steps/resting heart rate; ρ= -0.374, p < 0.001), and recovered established clinical associations, including the hepatic function ratio (AST/ALT; ρ= -0.375, p < 0.
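The quoted associations are Spearman rank correlations (ρ). For reference, the statistic is the Pearson correlation of the ranks, with ties assigned average ranks; below is a generic from-scratch reimplementation, not the CoDaS pipeline:

```python
def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            # Find the run of tied values and give them their average rank.
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

A ρ of 0.252, as quoted for sleep duration variability, thus indicates a modest but monotone association between the ranked biomarker values and the ranked outcome.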

Impact 4.0 · Import 4.0 · Pop 3.5
#531
arXiv AIScience 2026-04-16
by Baris E. Ugur, Michael A. Webb

Chitosan is a highly versatile and sustainable polymer with a broad range of potential biological and materials engineering applications. Despite its versatility, the native brittleness of chitosan limits its broader utilization. This limitation can be addressed by blending chitosan with small-molecule additives to modulate its thermomechanical properties. We employ molecular dynamics (MD) simulations to investigate the mechanism underlying antiplasticization followed by plasticization at increasing water content. Decomposition of the elastic moduli reveals a competition between weakened polymer-polymer interactions and enhanced polymer-water interactions, with their relative strengths governing the resulting properties. We introduce a simple model incorporating dynamically accessible free volume regions as a key driver of polymer mobility, effectively capturing the (anti-)plasticization of elastic properties. We show that accessibility of free volume regions is enabled by connectivity of additive-accessible volume regions. This study provides new insights into the molecular interactions that dictate the properties of chitosan-water mixtures and may inform the rational design of ch

Impact 4.0 · Import 4.0 · Pop 3.5
#532
arXiv AIScience 2026-04-16
by Jose Recatala-Gomez, Haiwen Dai, Zhu Ruiming, Nikita Kaazev et al.

Materials discovery is fundamental to advancing next-generation technologies as well as to a sustainable and circular economy. Beyond computational screening, generative models are efficient at finding materials with desired properties via multi-modal learning on multiscale data. This perspective examines the landscape of generative design for inorganic materials and discusses the integration of multi-modal learning with high-throughput experimental validation. We contextualize these challenges through the lens of a generative design framework as a unified approach to the data-driven inverse design of functional materials. The central idea of the framework is a foundation AI model for inorganic materials, deeply interlinked with various property databases and high-throughput experiments via a machine-learning-driven closed loop, which enables the framework to solve key challenges in functional materials. We argue that domain-specific implementations of such integrated workflows represent a promising pathway toward the unresolved challenge of data-driven inverse design for atom-engineered inorganic functional materials.

Impact 4.0 · Import 4.0 · Pop 3.5
#537
Stratechery 2026-04-16

Listen to this post: Good morning, This week's Stratechery Interview is with F1 driver-turned-venture capitalist Nico Rosberg. Rosberg started his F1 career in 2005 and retired after winning the world championship in 2016; Rosberg spent his last four years at Mercedes as a teammate of his childhood friend Lewis Hamilton in one of the most intense teammate rivalries in F1 history. Over the last several years, however, Rosberg has reinvented himself as a venture capitalist, founding Rosberg Ventures, with a specific focus on leveraging his F1 background to build connections between European money and Silicon Valley startups in one direction, and startup products and German businesses in the other. In this interview we cover all aspects of Rosberg's journey, from having a steering wheel in his crib, to pioneering the use of sports psychology in F1, to his decision to retire on top of the world. Then, we discuss how F1 builds connections, the similarities between founders and drivers, and how he realized he could leverage that in a new competition: winning as an investor. What I found particularly interesting is how Rosberg's background and history seem so varied and unconnected on the surface, yet are clearly linked by a consistent ethos of maximizing opportunity in the service of winning. As a reminder, all Stratechery content, including interviews, is available as a podcast; click the link at the top of this email to add Stratechery to your podcast player. On to the Interview: An Interview with F1 Driver and Venture Capitalist Nico Rosberg About the Drive to Win This interview is lightly edited for clarity. Topics: A Driving Background | The Mental Game | Promotion | Winning the World Championship | The Path to VC | Maximizing Opportunities | Paddock

Impact 4.0 · Import 4.0 · Pop 3.5
#538
TWIML AI Podcast 2026-04-16

In this episode, Rashmi Shetty, senior director of enterprise generative AI platform at Capital One, joins us to explore how the company is designing, deploying, and scaling multi-agent systems in a highly regulated environment. Rashmi walks us through Chat Concierge, a multi-agent chat experience for auto dealerships that handles intent disambiguation, tool invocation, and human handoffs to deliver safer, more personalized customer journeys. We discuss Capital One’s platform-centric approach to AI agents and how it separates design from runtime governance, embedding policies, guardrails, and cyber controls across agent threat boundaries. Rashmi shares how the team approaches the developer experience for agent builders, observability, and evals for stochastic, multi-agent workflows; and strategies for model specialization, including fine-tuning and distillation. We also cover standards and abstraction, closed-loop learning from production telemetry, and key lessons for enterprises building agentic systems. The complete show notes for this episode can be found at https://twimlai.com/go/765.

Impact 4.0 · Import 4.0 · Pop 3.5
#539
YT: Two Minute Papers 2026-04-16

Links: https://deepmind.google/models/gemma/gemma-4/ https://ai.google.dev/gemma/docs/core/model_card_4 Fine tuning with Matt Mireles: https://x.com/mattmireles/status/2041606508220489786 Other sources: https://x.com/googlegemma/status/2041256042882105666?s=46 https://x.com/nakazakifam/status/2041286410930446370 https://x.com/measure_plan/status/2039815699695104343 https://x.com/maddiedreese/status/2041677327604838685?s=46 https://x.com/steipete/status/2042615534567457102?s=46 https://x.com/maziyarpanahi/status/2042592050940449260?s=46 https://x.com/adrgrondin/status/2041962263507083340?s=46 https://x.com/evgeniymikholap/status/2041104232648950170 https://www.youtube.com/watch?v=u4ydH-QvPeg https://www.reddit.com/r/learnmachinelearning/comments/1sjs14m/implementing_gemma_3_and_sliding_window_attention/ https://www.reddit.com/r/Anthropic/comments/1lxhmb4/claude_ban_for_no_reason/ https://www.tldrlegal.com/license/apache-license-2-0-apache-2-0 https://x.com/MaziyarPanahi/status/2043675576452706814

Impact 4.0 · Import 4.0 · Pop 3.5
#541
C4ISRNET 2026-04-16

Last August, U.S. Navy officials carrying out a test of unmanned vessels realized they had hit a single point of failure: Starlink. A global outage across Elon Musk’s satellite network affecting millions of Starlink users had left two dozen unmanned surface vessels bobbing off the California coast, disrupting communications and halting operations for almost an hour. The incident, which involved drones intended to bolster U.S. military options in a conflict with China, was one of several Navy test disruptions linked to SpaceX’s Starlink that left operators unable to connect with autonomous boats, according to internal Navy documents reviewed by Reuters and a person familiar with the matter. As SpaceX rockets toward a $2 trillion public offering this summer – expected to be the largest ever – the company has secured its position as the world’s most valuable space company in part by being indispensable to the U.S. government with an array of technologies spanning satellite communications to space launches and military AI. Starlink, in particular, has proved key to crucial programs - from drones to missile tracking - with a low-earth orbit constellation of close to 10,000 satellites, a scale that provides the military with a network resilient against potential adversary attacks. But the Navy’s mishaps with Starlink for its autonomous drone program, which have not been previously reported, highlight the challenges of the U.S. military’s growing reliance on SpaceX and the risks it brings to the Pentagon. “If there was no Starlink, the U.S. government wouldn’t have access to a global constellation of low earth orbit communications,” said Clayton Swope, a deputy director of the Aerospace Security Project at the Center for Strategic and International Studies. The Pentagon did no

Impact 4.0 · Import 4.0 · Pop 3.5
#542
C4ISRNET 2026-04-16

GRAZ, Austria — The European Commission this week unveiled the results of its 2025 European Defence Fund call for proposals, selecting 57 collaborative research and development projects for a combined €1.07 billion ($1.26 billion) in EU funding – a package that makes clear where the bloc’s defense priorities lie: drones, autonomy, and an increasingly institutionalized partnership with Kyiv. Of the total, €675 million ($796 million) will support 32 capability development projects, and €332 million ($391 million) will go to 25 research initiatives. The selected projects involve 634 entities from 26 EU member states plus Norway, with small and medium-sized enterprises making up more than 38% of participants and receiving over 21% of the total funding, according to a summary of the spending plan. The most striking cluster of projects marks a shift to 21st-century warfare, with at least four separate initiatives – EURODAMM, LUMINA, SKYRAPTOR, and TALON – devoted specifically to loitering munitions and affordable mass drone production. The concentration reflects an uncomfortable lesson absorbed from the war in Ukraine: cheap, expendable strike drones have reshaped the battlefield, and Europe’s defense industry has been slow to catch up. Lessons learned in Ukraine are referenced repeatedly throughout the EDF’s materials on the funding round and individual projects. That battlefield knowledge is now being plugged into the fund’s architecture. For the first time, Ukrainian entities are eligible to participate in EDF projects as subcontractors and third-party recipients, marking a significant step toward integrating Ukraine’s defense-technological and industrial base into the European ecosystem. In the coming months, Kyiv and Brussels are expected to complete the required associa

Impact 4.0 · Import 4.0 · Pop 3.5
#543
C4ISRNET 2026-04-16

PARIS — France’s armed forces are working on a data-management system powered by artificial intelligence as a sovereign equivalent to the U.S. Defense Department’s Project Maven, said Gen. Benoît Desmeulles, the commander of the French 1st Army Corps. The armed forces are working with partners on a system to provide what Desmeulles called “true distributed working capability” centered on data and using advanced AI, “a sovereign system that will essentially be the equivalent of Maven.” The system could be available within a few months, and available for exercises in September 2027, the general said, declining to provide specifics. Project Maven is a Pentagon program that uses AI to process drone and surveillance data to automatically detect and track objects, using technology provided by contractors including Palantir Technologies. Maven has faced controversy amid questions about AI-assisted targeting in Iran, with concerns about speed, accountability, and harm to civilians related to automated kill chains. “We’ve really positioned data as the center of everything we do,” Desmeulles said in a briefing with three reporters on Saturday at the Montmorillon military camp in western France, describing data as the ammunition of the command post. “The centrality of data is something that’s well understood by the corps, the Army, and the French forces,” he said. “So, we’re really focused on that.” The armed forces are on track to develop “a true distributed working capability, based on highly advanced artificial intelligence and centered on data,” Desmeulles said. “We’re following that logic, to remain sovereign, and that’s an area where we are strong.” Desmeulles said his corps is already seeing “very, very good” results from a data-centric approach, even if there is “still a

Impact 4.0 · Import 4.0 · Pop 3.5
#544
FedScoop 2026-04-16

Despite the 60-plus-day shutdown, the Department of Homeland Security is continuing to advance its plans to build a “smart” border wall, according to a top official at Customs and Border Protection. “It’s not just a barrier,” CBP Commissioner Rodney Scott told lawmakers during a DHS budget hearing Thursday. “A lot of people miss that [it] has technology built into it.” Part of what makes the border wall “smart” is surveillance towers that use AI and other detection-and-tracking technologies to alert agents when someone is approaching. The White House’s budget proposal for fiscal 2027 includes $96.6 million for the integrated surveillance towers initiative to support a network totaling 890 autonomous towers, including 95 that are expected to be built in the year to come. CBP is working to meet that goal. “We have four vendors now that have passed the autonomy test,” Scott said. “We’re going through a second testing in about a month or two.” The DHS unit wants to make sure that all the towers are integrated into a centralized system to ensure agents have visibility throughout. Third-party validators and agents are part of the testing process, Scott said. “Right now, it’s not about signing the contract,” Scott said. “It’s about making sure that the vendors can buy and get the equipment that they need to build the towers.” The expansive network is years in the making. The fiscal 2024 budget tabbed $174 million to establish a consolidated tower program and procure new tech for the initiative. The fiscal 2025 budget earmarked nearly $102 million for integrated surveillance towers. While still not enacted, the fiscal 2026 budget proposal allocated another $138.7 million to the initiatives. Vendors are vying for a chance to lock down what cou

Impact 4.0 · Import 4.0 · Pop 3.5
#545
FedScoop 2026-04-16

The Office of Personnel Management has sent information for hundreds of pre-vetted Tech Force candidates to agencies across the government, including a third list it distributed Wednesday, the agency’s top official and a spokesperson told FedScoop. In an interview with FedScoop at AITalks on Tuesday, OPM Director Scott Kupor said that the agency had already distributed about 550 candidates to agencies via two “shared certificates,” a streamlined process that lets multiple agencies hire for similar positions at once. A third shared certificate Wednesday brings the pool to just over 700 candidates, per figures shared by OPM. While Kupor also told FedScoop that some offers have gone out, he wasn’t certain if anyone had started yet. He said that “it takes time between” an agency saying yes on a candidate and human resources getting that person an offer. “I feel like we’ve got the right candidate pool there, and we’re making good progress,” Kupor said. The hiring push is the Trump administration’s attempt to fill tech vacancies within the federal government — some of which were likely created by its own worker terminations and incentivized departure programs — with young workers who would serve two-year stints. After that, it will be their choice to either stay or join the private sector. The hiring goal for the program is roughly 1,000 workers. Use of shared certs for the Tech Force is part of a broader push in the administration to use the hiring tool to streamline the hiring process. Shared certs were established under a 2016 law that allows agencies to use a single shared certificate, rather than multiple, to hire for a large group of related positions across agencies. Through that process, an originating agency assumes the front-end legwork of

Impact 4.0Import 4.0Pop 3.5
#546

Anthropic’s Nuclear Bomb

Gov/Defense ★ 3.9
War on the Rocks 2026-04-16

A few hours before Anthropic announced the launch of its newest model, Claude Mythos Preview, on April 7, I had just completed a six-month analysis of AI-enabled cyberattacks. My research traced Chinese state-sponsored cyber campaigns against U.S. critical infrastructure and found that the barrier between nation-state-level hacking and everyone else was eroding far too fast. By the time I closed my laptop that afternoon, Mythos had shattered that barrier. This new model could theoretically autonomously exploit previously unknown vulnerabilities in virtually every major operating system and web browser on Earth, without human supervision. My threat model, seemingly alarmist at breakfast, was The post Anthropic’s Nuclear Bomb appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#547
War on the Rocks 2026-04-16

After failing to reach an agreement in the first round of talks in Islamabad, the United States and Iran are set to resume talks in the coming days. Via Pakistani mediation, U.S. and Iranian negotiators have reportedly made progress toward a framework agreement, though significant gaps remain and a deal is far from guaranteed. In the meantime, President Donald Trump has imposed a blockade on Iranian ports, while Israel is pressing ahead with its assault on the Lebanese border town of Bint Jbeil, even as it engages in direct talks with Lebanon in Washington. The two-week ceasefire with Iran is The post A Fragile Ceasefire with Iran and the Price of Ending the War appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#548
War on the Rocks 2026-04-16

Commercial technologies are enabling the U.S. military and militaries around the world to operate with greater efficiency, speed, precision, and lethality. The benefits of commercial capabilities can be particularly impactful given the defense industrial base’s struggles to produce at the speed and scale needed to outmatch China, Russia, and other aligned adversaries. It is largely for these reasons that President Donald Trump mandated the Department of Defense to preference commercial solutions. Congress also reinforced efforts to prioritize commercially available solutions in Section 1214 of the 2026 National Defense Authorization Act. Commercial technologies are also a driving factor behind increasingly rapid evolutions in The post Rethinking Security Cooperation in the Age of Commercial Tech appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#549
War on the Rocks 2026-04-16

The U.S.-Israeli war against Iran has exposed a reality many policymakers long preferred to avoid: The deterrence model that governed the Gulf for decades is no longer working as intended. For years, the region operated in the gray zone — covert strikes, proxy warfare, and carefully managed escalation. Iran built a strategy around missiles, regional partners, and nuclear latency. The United States underwrote Gulf security without direct war. Saudi Arabia and its neighbors relied on that umbrella while hedging against its limits, investing in missile defense and selective partnerships. There were rules, even if unwritten. That world is breaking down. The two-week ceasefire announced The post The End of Managed Escalation in the Gulf appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#550
War on the Rocks 2026-04-16

Editor's note: This article is the sixth in an 11-part series examining how the United States should organize, lead, and integrate economic statecraft into strategy, defense practice, and the broader national security ecosystem. This special series is brought to you by the Potomac Institute for Policy Studies and War on the Rocks. Prior installments can be found at the War by Other Ledgers page. The next major war the United States fights could be decided by supply chains long before the first shot is fired. That reality is already taking shape, as adversaries use export controls, sanctions, and supply chain leverage to shape outcomes. The post "Operationalizing Economic Statecraft: A New Imperative for the Pentagon" appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#551
arXiv cs.NE 2026-04-17
by Lúcio Folly Sanches Zebendo, Eleonora Cicciarella, Michele Rossi

Spiking neural networks (SNNs) are rapidly gaining momentum as an alternative to conventional artificial neural networks in resource-constrained edge systems. In this work, we continue a recent line of research on recurrent SNNs in which axonal delays are learned at runtime along with the other network parameters. The first proposed approach, dubbed DelRec, demonstrated the benefit of recurrent delay learning in SNNs. Here, we extend it by advocating the use of convolutional recurrent connections in conjunction with the DelRec delay-learning mechanism. In our tests on an audio classification task, this leads to a streamlined architecture with a smaller memory footprint (around 99% savings in the number of recurrent parameters) and much faster (52x) inference, while retaining DelRec's accuracy. Our code is available at: https://github.com/luciozebendo/delrec_snn/tree/conv_delays
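To make "axonal delay" concrete: in a spiking layer, an integer delay per synapse shifts when a presynaptic spike is delivered to the postsynaptic neuron. A minimal NumPy sketch of a LIF layer applying fixed integer delays (this is illustrative only, not the DelRec learning algorithm, and all names are made up here):

```python
import numpy as np

def lif_with_delays(spikes_in, weights, delays, tau=0.9, v_th=1.0):
    """Toy LIF layer where each input synapse has an integer axonal delay:
    a spike emitted at time t is delivered at time t + delay."""
    T, n_in = spikes_in.shape
    n_out = weights.shape[0]
    max_d = int(delays.max())
    buffer = np.zeros((T + max_d, n_out))   # delayed input currents
    for t in range(T):
        for j in range(n_in):
            if spikes_in[t, j]:
                buffer[t + delays[j]] += weights[:, j]
    v = np.zeros(n_out)
    spikes_out = np.zeros((T, n_out), dtype=bool)
    for t in range(T):
        v = tau * v + buffer[t]             # leaky integration
        spikes_out[t] = v >= v_th
        v[spikes_out[t]] = 0.0              # reset neurons that fired
    return spikes_out
```

In DelRec the delays are trained alongside the weights; here they are fixed so the delivery mechanics are easy to see.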

Impact 4.0 · Import 4.0 · Pop 3.5
#552
arXiv cs.NE 2026-04-17
by Jorge Sánchez, Guadalupe García-Isla, Sandra Perez-Herrero, Beatriz Trenor et al.

Designing optimizers that remain effective under tight evaluation budgets is critical in expensive black-box settings such as cardiac digital twinning. We propose Frenetic Cat-inspired Particle Optimization (FCPO), a hybrid swarm method that couples particle swarm optimization-like dynamics with an explicit-state Markov switching controller to schedule exploration and refinement operators online. FCPO integrates (i) state-conditioned bounded motion, (ii) an elite-difference global jump operator to escape stagnation, (iii) eigen-space guided local refinement from elite covariance, and (iv) linear population size reduction to control late-stage computational cost. We benchmark FCPO on five representative functions from the Congress on Evolutionary Computation (CEC) 2022 suite (F1, F2, F3, F6 and F10) at dimensions D ∈ {10, 20} over 30 independent runs, comparing against PSO, CSO, CLPSO, SHADE, L-SHADE and CMA-ES. FCPO achieves the lowest mean runtime across the ten benchmark cases (average 0.183 s), about 2.3x faster than CMA-ES and 2.6x faster than L-SHADE in our Python implementation. On the multimodal composition function F10 at D=20, FCPO attains the best mean objective value.
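The core idea of a Markov switching controller is that each particle carries a discrete state, and a transition matrix decides per step whether it runs an exploration move or a refinement move. A toy sketch of that scheduling idea on a sphere function, with invented operators and parameters (not the published FCPO operators):

```python
import numpy as np

def switching_pso(f, dim=5, n=20, iters=200, seed=0):
    """Swarm optimizer with a two-state Markov controller per particle:
    state 0 takes a PSO-style exploration step, state 1 contracts toward
    the global best (refinement)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n, dim))
    v = np.zeros((n, dim))
    state = np.zeros(n, dtype=int)              # 0 = explore, 1 = refine
    P = np.array([[0.9, 0.1], [0.3, 0.7]])      # Markov transition matrix
    pbest = x.copy()
    pval = np.array([f(p) for p in x])
    g = pbest[pval.argmin()].copy()
    for _ in range(iters):
        for i in range(n):
            state[i] = rng.choice(2, p=P[state[i]])
            if state[i] == 0:                   # exploration: PSO-style move
                v[i] = (0.7 * v[i]
                        + 1.5 * rng.random(dim) * (pbest[i] - x[i])
                        + 1.5 * rng.random(dim) * (g - x[i]))
                x[i] = x[i] + v[i]
            else:                               # refinement: contract toward g
                x[i] = g + 0.5 * (x[i] - g) + 0.01 * rng.standard_normal(dim)
            fx = f(x[i])
            if fx < pval[i]:
                pval[i] = fx
                pbest[i] = x[i].copy()
        g = pbest[pval.argmin()].copy()
    return g, float(pval.min())

best, val = switching_pso(lambda z: float(np.sum(z * z)))
```

FCPO's actual operators (elite-difference jumps, eigen-space refinement, population size reduction) are richer; this only shows how a state machine can interleave two move types.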

Impact 4.0 · Import 4.0 · Pop 3.5
#553
arXiv cs.NE 2026-04-17
by Qianshi Wang, Xilong Qu, Wenbin Pei, Nan Li et al.

Influence maximization (IM) is a fundamental problem in complex network analysis, with a wide range of real-world applications. To date, existing approaches to influential node identification in IM have predominantly relied on standard graphs, failing to capture the higher-order intrinsic interactions embedded in many real-world systems. Hypergraphs can better capture such higher-order interactions. However, using hypergraphs may lead to an excessively large search space and increased complexity in modeling cascading dynamics, making it challenging to accurately identify influential nodes. Therefore, in this study, we propose a new hypergraph-modeled IM method based on the Discrete Particle Swarm Optimization algorithm and the threshold model. In the proposed method, a particle (i.e., a candidate solution) represents the selection information of seed nodes, and the fitness function is designed to accurately and efficiently evaluate the influence of seed nodes via a two-layer local influence approximation. We also propose a degree-based initialization strategy to improve the quality of initial solutions and develop rules for updating particles' velocity and position.
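One simple form a degree-based initialization can take on a hypergraph is to rank nodes by hyperdegree (the number of hyperedges a node belongs to) and seed with the top-k. A minimal sketch of that heuristic; the paper's exact strategy may differ:

```python
from collections import defaultdict

def hyperdegrees(hyperedges):
    """Hyperdegree = number of hyperedges a node belongs to."""
    deg = defaultdict(int)
    for edge in hyperedges:
        for node in edge:
            deg[node] += 1
    return dict(deg)

def degree_init_seeds(hyperedges, k):
    """Pick the k highest-hyperdegree nodes as the initial seed set
    (ties broken by node id)."""
    deg = hyperdegrees(hyperedges)
    return sorted(deg, key=lambda node: (-deg[node], node))[:k]
```

For example, with hyperedges {1,2,3}, {2,3}, {3,4}, node 3 has hyperdegree 3 and would lead the initial seed set.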

Impact 4.0 · Import 4.0 · Pop 3.5
#554
arXiv cs.NE 2026-04-17
by Hyeongmeen Baik, Hamed Poursiami, Maryam Parsa, Jinia Roy

Always-on converter health monitoring demands sub-mW edge inference, a regime inaccessible to GPU-based physics-informed neural networks. This work separates spiking temporal processing from physics enforcement: a three-layer leaky integrate-and-fire SNN estimates passive component parameters while a differentiable ODE solver provides physics-consistent training by decoupling the ODE physics loss from the unrolled spiking loop. On an EMI-corrupted synchronous buck converter benchmark, the SNN reduces lumped resistance error from 25.8% to 10.2% versus a feedforward baseline, within the ±10% manufacturing tolerance of passive components, at a projected ~270x energy reduction on neuromorphic hardware. Persistent membrane states further enable degradation tracking and event-driven fault detection via a +5.5 percentage-point spike-rate jump at abrupt faults. With 93% spike sparsity, the architecture is suited for always-on deployment on Intel Loihi 2 or BrainChip Akida.

Impact 4.0 · Import 4.0 · Pop 3.5
#555
arXiv cs.CV 2026-04-17
by Siddhant Bharadwaj, Ashish Vashist, Fahimul Aleem, Shruti Vyas

Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored. In this work, we present a systematic evaluation of multiple state-of-the-art VLMs for country-level image geolocalization using ground-view imagery only. Instead of relying on image matching, GPS metadata, or task-specific training, we evaluate prompt-based country prediction in a zero-shot setting. The selected models are tested on three geographically diverse datasets to assess their robustness and generalization ability. Our results reveal substantial variation across models, highlighting the potential of semantic reasoning for coarse geolocalization and the limitations of current VLMs in capturing fine-grained geographic cues. This study provides the first focused comparison of modern VLMs for country-level geolocalization and establishes a foundation for future research at the intersection of multimodal reasoning and geographic understanding.

Impact 4.0 · Import 4.0 · Pop 3.5
#556
arXiv cs.CV 2026-04-17
by Nishq Poorav Desai, Ali Etemad, Michael Greenspan

Time-to-Collision (TTC) forecasting is a critical task in collision prevention, requiring precise temporal prediction and an understanding of the local and global patterns encapsulated in a video, both spatially and temporally. To address the multi-scale nature of video, we introduce CollideNet, a novel spatiotemporal hierarchical transformer-based architecture catered specifically to effective TTC forecasting. In the spatial stream, CollideNet aggregates information for each video frame simultaneously at multiple resolutions. In the temporal stream, along with multi-scale feature encoding, CollideNet also disentangles the non-stationarity, trend, and seasonality components. Our method sets a new state of the art by a considerable margin on three commonly used public datasets. We conduct cross-dataset evaluations to analyze the generalization capabilities of our method, and visualize the effects of disentangling the trend and seasonality components of the video data. We release our code at https://github.com/DeSinister/CollideNet/.
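The trend/seasonality disentanglement idea has a classical analogue: split a signal into a moving-average trend, a per-phase seasonal mean, and a residual. A generic NumPy illustration of that decomposition (CollideNet learns its decomposition end-to-end; this is only the textbook version of the concept):

```python
import numpy as np

def decompose(series, period):
    """Split a 1-D signal into trend (moving average), seasonality
    (per-phase mean of the detrended signal), and residual."""
    x = np.asarray(series, dtype=float)
    trend = np.convolve(x, np.ones(period) / period, mode="same")
    detrended = x - trend
    seasonal = np.array([detrended[p::period].mean() for p in range(period)])
    season_full = np.resize(seasonal, x.size)   # tile per-phase means
    resid = detrended - season_full
    return trend, season_full, resid
```

By construction the three components sum back to the input exactly, which is the property that makes additive decompositions easy to verify.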

Impact 4.0 · Import 4.0 · Pop 3.5
#557
arXiv cs.CV 2026-04-17
by Khawaja Azfar Asif, Rafaqat Alam Khan

Panoramic radiography is a fundamental diagnostic tool in dentistry, offering a comprehensive view of the entire dentition with minimal radiation exposure. However, manual interpretation is time-consuming and prone to errors, especially in high-volume clinical settings, creating a pressing need for efficient automated solutions. This study presents the first application of YOLOv26 for automated tooth detection, FDI-based numbering, and dental disease segmentation in panoramic radiographs. The DENTEX dataset was preprocessed using Roboflow for format conversion and augmentation, yielding 1,082 images for tooth enumeration and 1,040 images for disease segmentation across four pathology classes. Five YOLOv26-seg variants were trained on Google Colab using transfer learning at a resolution of 800x800. Results demonstrate that the YOLOv26m-seg model achieved the best performance for tooth enumeration, with a precision of 0.976, recall of 0.970, and box mAP50 of 0.976. It outperformed the YOLOv8x baseline by 4.9% in precision and 3.3% in mAP50, while also enabling high-quality mask-level segmentation (mask mAP50 = 0.970).

Impact 4.0 · Import 4.0 · Pop 3.5
#558
arXiv cs.CV 2026-04-17
by Muhammad Z. Alam, Larry Stetsiuk, Arooba Zeshan

This paper presents a novel saturation-aware, space-variant blind image deblurring framework designed to address the challenges posed by saturated pixels when deblurring under high-dynamic-range and low-light conditions. The proposed approach segments the image based on blur intensity and proximity to saturation, leveraging a pre-estimated Light Spread Function to mitigate stray-light effects. By accurately estimating the true radiance of saturated regions using the dark channel prior, our method enhances the deblurring process without introducing artifacts such as ringing. Experimental evaluations on both synthetic and real-world datasets demonstrate that the framework improves deblurring outcomes across various scenarios, showing superior performance compared to state-of-the-art saturation-aware and general-purpose methods. This adaptability highlights the framework's potential for integration with existing and emerging blind image deblurring techniques.
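The dark channel prior referenced here is the classic He et al. formulation: for each pixel, take the minimum intensity over the colour channels and over a local patch; in haze-free natural images this value is close to zero. A minimal sketch of the computation (the paper's use of it to estimate radiance in saturated regions goes beyond this):

```python
import numpy as np

def dark_channel(img, patch=15):
    """Dark channel of an RGB image: per pixel, the minimum over colour
    channels and over a local square patch (He et al.)."""
    h, w, _ = img.shape
    min_c = img.min(axis=2)                 # min over the colour channels
    r = patch // 2
    padded = np.pad(min_c, r, mode="edge")
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out
```

The nested loop keeps the sketch readable; a production version would use a min-filter primitive instead.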

Impact 4.0 · Import 4.0 · Pop 3.5
#559
arXiv cs.CV 2026-04-17
by Lorenzo Beltrame, Jules Salzinger, Filip Svoboda, Jasmin Lampert et al.

We present a three-stage progressive shadow-removal pipeline for the CVPR2026 NTIRE WSRD+ challenge. Built on OmniSR, our method treats deshadowing as iterative direct refinement, where later stages correct residual artefacts left by earlier predictions. The model combines RGB appearance with frozen DINOv2 semantic guidance and geometric cues from monocular depth and surface normals, reused across all stages. To stabilise multi-stage optimisation, we introduce a contraction-constrained objective that encourages non-increasing reconstruction error across the cascade. A staged training pipeline transfers from earlier WSRD pretraining to WSRD+ supervision and final WSRD+ 2026 adaptation with cosine-annealed checkpoint ensembling. On the official WSRD+ 2026 hidden test set, our final ensemble achieved 26.680 PSNR, 0.8740 SSIM, 0.0578 LPIPS, and 26.135 FID, ranked first overall, and won the NTIRE 2026 Image Shadow Removal Challenge. The strong performance of the proposed model is further validated on the ISTD+ and UAV-SC+ datasets.
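One plausible shape for a contraction-constrained objective is a hinge penalty on per-stage reconstruction errors: zero when each stage's error is no worse than the previous stage's, positive otherwise. This is a guess at the general form, not the challenge entry's exact loss:

```python
import numpy as np

def contraction_penalty(stage_errors, margin=0.0):
    """Zero when reconstruction error is non-increasing across cascade
    stages; positive when a later stage is worse than an earlier one."""
    e = np.asarray(stage_errors, dtype=float)
    return float(np.sum(np.maximum(0.0, e[1:] - e[:-1] + margin)))
```

Added to the main reconstruction loss, such a term discourages later refinement stages from undoing earlier progress.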

Impact 4.0 · Import 4.0 · Pop 3.5
#560
arXiv cs.CV 2026-04-17
by Michał Romaszewski, Dominik Kopeć, Michał Cholewa, Katarzyna Kołodziej et al.

Hyperspectral tree species classification is challenging due to limited and imbalanced class labels, spectral mixing (overlapping light signatures from multiple species), and ecological heterogeneity (variability among ecological systems). Addressing these challenges requires methods that integrate biological and structural characteristics of vegetation, such as canopy architecture and interspecific interactions, rather than relying solely on spectral signatures. This paper presents a biologically informed, semi-supervised deep learning method that integrates multi-sensor Earth observation data, specifically hyperspectral imaging (HSI) and airborne laser scanning (ALS), with expert ecological knowledge. The approach relies on biologically inspired pseudo-labelling over a precomputed canopy graph, yielding accurate classification at low training cost. In addition, ecological priors on species cohabitation are automatically derived from reliable sources using large language models (LLMs) and encoded as a cohabitation matrix with likelihoods of species occurring together. These priors are incorporated into the pseudo-labelling strategy.

Impact 4.0 · Import 4.0 · Pop 3.5
#561
arXiv cs.CV 2026-04-17
by Federico Nocentini, Kwanggyoon Seo, Qingju Liu, Claudio Ferrari et al.

Speech-Driven Facial Animation (SDFA) has gained significant attention due to its applications in movies, video games, and virtual reality. However, most existing models are trained on single-language data, limiting their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation since language influences phonetics, rhythm, intonation, and facial expressions. Speaking style is also shaped by individual differences, not only by language. Existing methods typically rely on either language-specific or speaker-specific conditioning, but not both, limiting their ability to model the interaction between the two. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method uses transcript embeddings to encode language information and style embeddings extracted from reference facial sequences to capture individual speaking characteristics. Polyglot does not require predefined language or speaker labels, enabling generalization across languages and speakers through self-supervised learning. By jointly conditioning on language and style, it captures expressive traits such as rhythm.

Impact 4.0 · Import 4.0 · Pop 3.5
#562
arXiv cs.CV 2026-04-17
by Laziz Hamdi, Amine Tamasna, Thierry Paquet

Tables condense key transactional and administrative information into compact layouts, but practical extraction requires more than text recognition: systems must also recover structure (rows, columns, merged cells, headers) and interpret roles such as line items, subtotals, and totals under common capture artifacts. Many existing resources for table structure recognition and TableVQA are built from clean digital-born sources or rendered tables, and therefore only partially reflect noisy administrative conditions. We introduce DenTab, a dataset of 2,000 cropped table images from dental estimates with high-quality HTML annotations, enabling evaluation of table recognition (TR) and table visual question answering (TableVQA) on the same inputs. DenTab includes 2,208 questions across eleven categories spanning retrieval, aggregation, and logic/consistency checks. We benchmark 16 systems, including 14 vision-language models (VLMs) and two OCR baselines. Across models, strong structure recovery does not consistently translate into reliable performance on multi-step arithmetic and consistency questions, and these reasoning failures persist even when using ground-truth HTML table inputs.

Impact 4.0 · Import 4.0 · Pop 3.5
#563
arXiv cs.CV 2026-04-17
by Jieming Yu, Qiuxiao Feng, Zhuohan Wang, Xiaochen Ma

With the rapid advancement of deep generative models, realistic fake images have become increasingly accessible, yet existing localization methods rely on complex designs and still struggle to generalize across manipulation types and imaging conditions. We present a simple but strong baseline based on DINOv3 with LoRA adaptation and a lightweight convolutional decoder. Under the CAT-Net protocol, our best model improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks using only 9.1M trainable parameters on top of a frozen ViT-L backbone, and even our smallest variant surpasses all prior specialized methods. LoRA consistently outperforms full fine-tuning across all backbone scales. Under the data-scarce MVSS-Net protocol, LoRA reaches an average F1 of 0.774 versus 0.530 for the strongest prior method, while full fine-tuning becomes highly unstable, suggesting that pre-trained representations encode forensic information that is better preserved than overwritten. The baseline also exhibits strong robustness to Gaussian noise, JPEG re-compression, and Gaussian blur. We hope this work can serve as a reliable baseline for the research community.
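The arithmetic behind LoRA is compact: the frozen weight W is augmented with a low-rank update (alpha/r) * B @ A, where only A and B are trained and B starts at zero so the adapter is initially a no-op. A self-contained NumPy sketch of the forward pass (generic LoRA, not this paper's training setup; all shapes below are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Linear layer with a LoRA update: y = x W^T + (alpha/r) x A^T B^T,
    where W is frozen and only A (r x d_in) and B (d_out x r) are trained."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # adapter "down" projection
B = np.zeros((d_out, r))                    # adapter "up" projection, zero-init
x = rng.standard_normal((3, d_in))
y = lora_forward(x, W, A, B)                # == x @ W.T while B is zero
```

The paper's finding that LoRA beats full fine-tuning here is consistent with the small trainable footprint: only A and B (r * (d_in + d_out) values per layer) move, leaving the pre-trained representation intact.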

Impact 4.0 · Import 4.0 · Pop 3.5
#564
arXiv stat.ML 2026-04-17
by Kyunghoo Mun, Matthew Rosenzweig

We study phase transitions for repulsive-attractive mean-field free energies on the circle. For a $\frac{1}{n+1}$-periodic interaction whose Fourier coefficients satisfy a certain decay condition, we prove that the critical coupling strength $K_c$ coincides with the linear stability threshold $K_\#$ of the uniform distribution and that the phase transition is continuous, in the sense that the uniform distribution is the unique global minimizer at criticality. The proof is based on a sharp coercivity estimate for the free energy obtained from the constrained Lebedev--Milin inequality. We apply this result to three motivating models for which the exact value of the phase transition and its (dis)continuity in terms of the model parameters was not fully known. For the two-dimensional Doi--Onsager model $W(θ)=-|\sin(2πθ)|$, we prove that the phase transition is continuous at $K_c=K_\#=3π/4$. For the noisy transformer model $W_β(θ)=(e^{β\cos(2πθ)}-1)/β$, we identify the sharp threshold $β_*$ such that $K_c(β) = K_\#(β)$ and the phase transition is continuous for $β\leq β_*$, while $K_c(β) < K_\#(β)$ for $β > β_*$. We also obtain the corresponding sharp dichotomy for the noisy Hegselmann--Krause model $W_R$.

Impact 4.0 · Import 4.0 · Pop 3.5
#565
arXiv stat.ML 2026-04-17
by Stephan Bark, Waqas Ahmed Malik, Maryna Prus, Hans-Peter Piepho et al.

In variety testing, multi-environment trials (MET) are essential for evaluating the genotypic performance of crop plants. A persistent challenge in the statistical analysis of MET data is the estimation of variance components, which are often inaccurately estimated or shrunk to exactly zero when using residual (restricted) maximum likelihood (REML) approaches. At the same time, institutions conducting MET typically possess extensive historical data that can, in principle, be leveraged to improve variance component estimation. However, these data are rarely incorporated sufficiently. This paper addresses that gap by proposing a Bayesian framework that systematically integrates historical information to stabilize variance component estimation and better quantify uncertainty. Our Bayesian linear mixed model (BLMM) reformulation uses priors and Markov chain Monte Carlo (MCMC) methods to keep the variance components positive, yielding more realistic distributional estimates. Furthermore, our model incorporates historical prior information by managing MET data in successive historical data windows.

Impact 4.0 · Import 4.0 · Pop 3.5
#566
arXiv stat.ML 2026-04-17
by Hidetoshi Kawase, Toshihiro Ota

In finite-width deep neural networks, the empirical kernel $G$ evolves stochastically across layers. We develop a collective kernel effective field theory (EFT) for pre-activation ResNets based on a $G$-only closure hierarchy and diagnose its finite validity window. Exploiting the exact conditional Gaussianity of residual increments, we derive an exact stochastic recursion for $G$. Applying Gaussian approximations systematically yields a continuous-depth ODE system for the mean kernel $K_0$, the kernel covariance $V_4$, and the $1/n$ mean correction $K_{1,\mathrm{EFT}}$, which emerges diagrammatically as a one-loop tadpole correction. Numerically, $K_0$ remains accurate at all depths. However, the $V_4$ equation residual accumulates to an $O(1)$ error at finite time, primarily driven by approximation errors in the $G$-only transport term. Furthermore, $K_{1,\mathrm{EFT}}$ fails due to the breakdown of the source closure, which exhibits a systematic mismatch even at initialization. These findings highlight the limitations of $G$-only state-space reduction and suggest extending the state space to incorporate the sigma-kernel.

Impact 4.0 · Import 4.0 · Pop 3.5
#567
arXiv AIScience 2026-04-17
by Hamdy Arkoub, Jia-Hong Ke, Miaomiao Jin

Ni-based structural alloys in molten salt environments often experience simultaneous mechanical loading and corrosive attack, yet the mechanisms governing stress-corrosion interactions remain unclear. Prior studies largely emphasize tensile stress, while the role of compressive stress has received limited attention. Here, reactive molecular dynamics simulations are used to investigate the coupled effects of applied strain and corrosion in Ni0.75Cr0.25 exposed to molten FLiNaK at 800°C. A Σ5(210) grain boundary model is subjected to uniaxial strains from tensile (+4%) to compressive (-4%), and corrosion behavior is evaluated through fluorine adsorption, charge redistribution, and grain boundary evolution. Tensile strain accelerates intergranular corrosion by reducing local atomic packing through elastic dilation and increasing excess free volume at the grain boundary, which enhances atomic mobility and salt infiltration. In contrast, compressive strain suppresses corrosion by promoting the formation of a ridge-like surface layer along the grain boundary, limiting salt access to the underlying alloy. These results provide atomistic insight into how stress states influence molten salt corrosion at grain boundaries.

Impact 4.0 · Import 4.0 · Pop 3.5
#568
arXiv AIScience 2026-04-17
by Yu. D. Fomin

Molecular dynamics is a powerful tool for investigating the properties of fluid systems. However, a correct interpretation of the results of simulations is required. In particular, some simulations show the appearance of large voids in liquids, which contradicts our common-sense notion of what a liquid is. In the present paper we discuss the origin of large cavities in liquids observed in molecular dynamics simulations. We demonstrate that the cavities appear either when the temperature of the system is above the critical temperature of the liquid-gas transition or when the system is in the two-phase liquid-gas region. These conclusions are illustrated by several examples from the literature and from our own simulations.

Impact 4.0 · Import 4.0 · Pop 3.5
#569
arXiv AIScience 2026-04-17
by Weilun Wang, Zirui Wang, Wantong Li

Neuro-symbolic AI is gaining traction in domains such as large language models, scientific discovery, and autonomous systems due to its ability to combine perception with structured reasoning. However, its deployment is often constrained by high memory demands, diverse computation patterns, and complex hardware requirements. Existing hardware platforms struggle with large on-chip memory overheads, frequent pipeline stalls, limited I/O bandwidth, and inefficient handling of nonlinear operations. To address these key computational bottlenecks, we propose Overmind, a unified neuro-symbolic architecture with cross-layer optimizations. Overmind tackles these bottlenecks through Padé approximations for universal nonlinear functions, a preemptive memory bypass that eliminates costly on-chip caches, and a complete software stack that optimizes model deployment. By reconfiguring the Padé orders used to approximate nonlinear functions, we also demonstrate adaptive accuracy-performance scaling. Overmind achieves an energy efficiency of 8.1 TOPS/W and a throughput of 410 GOPS for mixed neuro-symbolic workloads with minimal model accuracy loss.
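A Padé approximant replaces a transcendental function with a ratio of two low-order polynomials, which hardware can evaluate with a handful of multiply-adds and one divide. As a concrete (textbook, not Overmind-specific) example, the [2/2] Padé approximant of exp(x):

```python
import numpy as np

def pade22_exp(x):
    """[2/2] Pade approximant of exp(x): (12 + 6x + x^2) / (12 - 6x + x^2).
    Accurate near 0; accuracy/latency is tuned by changing the Pade order."""
    x = np.asarray(x, dtype=float)
    return (12 + 6 * x + x * x) / (12 - 6 * x + x * x)
```

At x = 0.5 the approximant differs from exp(0.5) by less than 1e-4; "reconfiguring the Padé orders" in the abstract corresponds to swapping in higher- or lower-order numerator/denominator polynomials.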

Impact 4.0 · Import 4.0 · Pop 3.5
#570
arXiv Evals 2026-04-17
by Heewon Oh

We present ArtifactNet, a lightweight framework that detects AI-generated music by reframing the problem as forensic physics: extracting and analyzing the physical artifacts that neural audio codecs inevitably imprint on generated audio. A bounded-mask UNet (ArtifactUNet, 3.6M parameters) extracts codec residuals from magnitude spectrograms, which are then decomposed via HPSS into 7-channel forensic features for classification by a compact CNN (0.4M parameters; 4.0M total). We introduce ArtifactBench, a multi-generator evaluation benchmark comprising 6,183 tracks (4,383 AI tracks from 22 generators and 1,800 real tracks from 6 diverse sources). Each track is tagged with bench_origin for fair zero-shot evaluation. On the unseen test partition (n=2,263), ArtifactNet achieves F1 = 0.9829 with FPR = 1.49%, compared to CLAM (F1 = 0.7576, FPR = 69.26%) and SpecTTTra (F1 = 0.7713, FPR = 19.43%) evaluated under identical conditions with published checkpoints. Codec-aware training (4-way WAV/MP3/AAC/Opus augmentation) further reduces cross-codec probability drift by 83% (Δ = 0.95 → 0.16), resolving the primary codec-invariance failure mode.
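The HPSS step referenced here is, in its standard median-filtering form (Fitzgerald 2010), very simple: harmonic energy is smooth along time, percussive energy is smooth along frequency, so median filtering a spectrogram in each direction and comparing the two yields separation masks. A minimal NumPy sketch of that standard method (illustrative only; not ArtifactNet's exact 7-channel feature stack):

```python
import numpy as np

def hpss_masks(spec, k=9):
    """Median-filtering HPSS: filter the magnitude spectrogram along time
    (harmonic estimate) and along frequency (percussive estimate), then
    mask by whichever estimate dominates per bin."""
    f_bins, t_frames = spec.shape
    r = k // 2
    pad_t = np.pad(spec, ((0, 0), (r, r)), mode="edge")
    pad_f = np.pad(spec, ((r, r), (0, 0)), mode="edge")
    harm = np.empty_like(spec)
    perc = np.empty_like(spec)
    for t in range(t_frames):
        harm[:, t] = np.median(pad_t[:, t:t + k], axis=1)  # smooth along time
    for f in range(f_bins):
        perc[f, :] = np.median(pad_f[f:f + k, :], axis=0)  # smooth along freq
    return harm >= perc, perc > harm        # harmonic mask, percussive mask
```

On a toy spectrogram with one horizontal line (a steady tone) and one vertical line (a click), the harmonic mask picks up the tone and the percussive mask the click.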

Impact 4.0 · Import 4.0 · Pop 3.5
#571
arXiv Evals 2026-04-17
by Anzhou Wen, Praneeth Chakravarthula

Holographic displays are widely regarded as the "ultimate" display technology, promising immersive 3D visuals with natural depth cues, continuous parallax, and perceptual realism. Realizing this potential, however, has remained elusive due to persistent image quality limitations, most notably speckle noise, a byproduct of the random interference inherent to coherent light. This is typically exacerbated further by the hologram's phase randomness, which is required for maintaining uniform energy distribution across the eyebox. While speckle suppression techniques such as temporal multiplexing and smooth-phase heuristics exist, they often necessitate high-speed hardware and introduce visual artifacts, hindering their practical adoption. We introduce Ellipsography, a single-shot holography technique that achieves near-limit speckle suppression, reaching image fidelity equivalent to averaging a million conventional scalar holograms in a single frame in simulation. By jointly modulating the phase and polarization of light, we structure optical interference and suppress speckle at its source. We present a full pipeline including a vectorial wave model and an end-to-end hologram synthesis algorithm.

Impact 4.0 · Import 4.0 · Pop 3.5
#576
TechCrunch AI 2026-04-17

The gap between AI insiders and everyone else is widening, and the spending, suspicion, and even new vocabulary are starting to show it. While OpenAI is busy buying up everything from finance apps to talk shows, a certain shoe company just rebranded as an AI infrastructure play, and Anthropic unveiled a model it says is too powerful to release publicly …but apparently not too […]

Impact 4.0 · Import 4.0 · Pop 3.5
#578
Stratechery 2026-04-17

(Amazon) Welcome back to This Week in Stratechery! As a reminder, each week, every Friday, we're sending out this overview of content in the Stratechery bundle; highlighted links are free for everyone. Additionally, you have complete control over what we send to you. If you don't want to receive This Week in Stratechery emails (there is no podcast), please uncheck the box in your delivery settings. On that note, here were a few of our favorites this week. The Cost of AI. The key to understanding and analyzing tech has been appreciating the implications of zero marginal costs, which govern the economics of everything from chips to software to services. AI services generally fall under the same rubric (fixed costs in terms of data centers and chips matter more than marginal costs, mostly electricity), but the worsening shortage in compute means it is opportunity costs that matter more than ever. Companies will have to make hard choices, and the biggest loser might be the serially unfocused OpenAI. – Ben Thompson. What Is Amazon Doing with Globalstar? Earlier this week Amazon announced an $11.8 billion deal to purchase Globalstar satellites in what was billed as a move to ramp up the company's competition with Elon Musk and Starlink. There may be more going on with that deal, though, and Wednesday's Daily Update explored what Apple's role might have been. We went deeper on all this on Friday's episode of Sharp Tech, and I loved the segment as a window into Amazon's motivations for satellite investments generally, and the questions surrounding this deal specifically. – Andrew Sharp. Nico Rosberg on Racing and Investing. As a religious F1 fan I'm obligated to recommend this week's Stratechery Interview with Nico Rosberg.

Impact 4.0 · Import 4.0 · Pop 3.5
#579
GitHub Blog - AI/ML 2026-04-17

Every week, the GitHub team runs a stream called Rubber Duck Thursdays, where we build projects live, cowork with our community, and answer questions! This week, we built a very fun project together using the GitHub Copilot CLI! Let me tell you about it. 💡 New to GitHub Copilot CLI? Here's how to get started. What is it? In a lot of social media tweets and launches, you often see accounts post things like: We shipped the most amazing emoji list generator ever. It: 💻 Works in the CLI 🤖 Uses the Copilot SDK to intelligently convert your bullet points to relevant emoji 📋 Copies the result to the clipboard It's beautiful. But coming up with the perfect emoji is far too slow for me in this "move fast and break things" world. I have projects to build! Repos to vibe! Pull requests to merge! I can't be thinking about emojis! And thus, on the stream, we built an emoji list generator (very descriptively called Emoji List Generator) that: 🖥️ Runs in the terminal 📋 You paste or write a list ⌨️ You hit Ctrl + S 📎 You get the list on your clipboard (Can you tell I'm dogfooding the product here?) How we built it We used a few cool technologies for this project: 🖥️ @opentui/core for the terminal UI 🤖 @github/copilot-sdk for the AI brain 📋 clipboardy for clipboard access To start the project off, we opened up the GitHub Copilot CLI. In plan mode using Claude Sonnet 4.6, we wrote: I want to create an AI-powered markdown emoji list generator. Where, in this CLI app, if I paste in or write in some bullet points, it will replace those bullet points with relevant emojis to the given point in that list, and copies the result to the clipboard.

Impact 4.0 · Import 4.0 · Pop 3.5
#580
C4ISRNET 2026-04-17

U.S. Air Force airmen operated a semiautonomous jet-powered combat drone in a series of sorties recently, boosting the service's Collaborative Combat Aircraft program. The force's Experimental Operations Unit conducted hands-on testing with Anduril's YFQ-44A aircraft at Edwards Air Force Base, California, in an effort to utilize "principles of the new Warfighting Acquisition System," according to a Thursday Air Force release. Previously, the concept employed by the force relied on fully human-piloted drones; now, "there is no operator with a stick and throttle flying the aircraft behind the scenes," Jason Levin, Anduril's senior vice president of engineering for air dominance and strike, said in an October 2025 company release. The testing took place sometime last week, according to a Thursday Anduril social media post written by vice president of autonomous airpower Mark Shushnar. Shushnar said in the post that the EOU gained experience launching, recovering and turning the aircraft during the exercise, and that it conducted the pre- and post-flight checks and clearances, weapons loading and unloading, and direct tasking of the air vehicle during taxi and flight. The EOU operators used a ruggedized laptop to upload mission plans, initiate autonomous taxi and takeoff, task the in-flight aircraft and manage post-flight data, Shushnar said, removing the previous need for the fixed infrastructure of a large, established base. Shushnar highlighted how the YFQ-44A is designed to be easy to maintain with a small crew compared to traditional unmanned aerial vehicles, and said the exercise demonstrated that. With only a couple days of training, a handful of EOU maintainers were able to turn the aircraft between sorties. The exercise showcases a move toward "operator-driven experimentation."

Impact 4.0 · Import 4.0 · Pop 3.5
#581
C4ISRNET 2026-04-17

PARIS — Dutch regional broadcaster Omroep Gelderland was able to track the Royal Netherlands Navy air-defense frigate Evertsen in real time by sending a Bluetooth tracker to the ship by military mail. The frigate is part of the carrier strike group around France's Charles de Gaulle aircraft carrier currently deployed in the Mediterranean Sea. The tracker was discovered while sorting mail on board, though only after Omroep Gelderland had been tracking the Evertsen for 24 hours, the broadcaster wrote on its website on Thursday. The Dutch Ministry of Defence said it's taking measures in response, according to the broadcaster. The tracker incident comes after Le Monde reported in March that it was able to locate a French officer taking a 7-kilometer run around the deck of the Charles de Gaulle while the carrier was at sea, through data from the officer's connected watch via the running and cycling app Strava. "You do want to be able to intercept such a tracker," Rowin Jansen, assistant professor of national security law at Radboud University in Nijmegen, told Omroep Gelderland. "Commercial satellite images are currently released with a delay for good reason. You certainly don't want to make it easy for terrorists to send a similar package and track a ship's location in real time. You then run the risk of having missiles fired at you." The broadcaster described sending the Bluetooth tracker, a gadget used for example to find keys, to the frigate in an envelope using the military postal service, following online instructions from the MoD on how to send mail to military personnel. While the ministry X-ray scans packages to check whether prohibited or dangerous items are sent by mail, Omroep Gelderland noted that online videos showed envelopes not being scanned, so decided to

Impact 4.0 · Import 4.0 · Pop 3.5
#582
FedScoop 2026-04-17

Despite growth in traditional futures, the emergence of prediction markets and the mainstreaming of digital assets, the Commodity Futures Trading Commission is doing just fine with a slimmed-down staff thanks in large part to technology, the agency's chairman told lawmakers this week. During a House Agriculture Committee hearing Thursday, CFTC chief Michael Selig was asked by ranking member Angie Craig how the independent regulator can still be effective amid rapid market changes "with staffing levels significantly less than what the first Trump administration had requested." The Minnesota Democrat said the CFTC's staff is down 20% from the end of fiscal 2024, but Selig claimed the agency is "running more efficiently and effectively than ever before," crediting the "right-sizing of the government" under President Donald Trump. "It's absolutely vital that we continue to monitor, surveil and police our markets. And we are doing just that," Selig continued. "We are utilizing new tools, from AI to automation and other surveillance systems that we're building out, and we take this responsibility very seriously." The CFTC, which is four commissioners shy of a full body, has hired people from the private sector and continues to recruit through USAJobs.gov, Selig said, adding that "it's absolutely critical" that the regulator has the right funding and resources to monitor the markets. But despite the concerns of Craig and several other House Agriculture Democrats, Selig believes the CFTC's fiscal 2027 budget proposal, which calls for 650 full-time positions, will leave the agency "adequately staffed." He told Rep. Shontel Brown, D-Ohio, that reports of the regulator being under-resourced were "fake news." "It's really critical that we continue to bring on new

Impact 4.0 · Import 4.0 · Pop 3.5
#583

Ceasefires and Communications

Gov/Defense ★ 3.9
War on the Rocks 2026-04-17

Welcome to The Adversarial. Every other week, we'll provide you with expert analysis on America's greatest challengers: China, Russia, Iran, North Korea, and jihadists. Read more below. *** Iran: In the space of less than 11 hours on April 7, President Donald Trump went from warning that "a whole civilization will die tonight" to announcing a two-week ceasefire with Iran. That whiplash-inducing turn was just a taste of the twists that would follow over the coming week. First came a resumption of negotiations mediated by Pakistan. With the presence of Vice President J.D. Vance and Iran's parliamentary speaker, it marked the highest in-person engagement between the two sides. The post Ceasefires and Communications appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#584
War on the Rocks 2026-04-17

If the United States is drawn into another round of military action in Yemen, it ought to avoid the mistakes of the last decade. From about 2015 to last year, successive administrations backed Saudi- and Emirati-led military campaigns, arms sales, and naval blockades that devastated civilians, deepened Yemen's fragmentation, and perversely strengthened Houthi power and legitimacy instead of containing it. Instead, Washington should treat force as one tool within a broader political and economic strategy. Officially known as Ansar Allah ("Partisans of God") and referred to here as the Houthis for ease of reference, the group is a Zaydi revivalist political-military The post How to Counter the Houthis Without Strengthening Them appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#585
War on the Rocks 2026-04-17

It is clear that the U.S.-Israeli decapitation, airpower-centric, and precision strike campaigns are again not enough to bring about the two countries' strategic goals. The United States still finds itself potentially having to commit land forces into its war with Iran, yet many unknowns lurk, including a fragile ceasefire and a U.S. naval blockade of the Strait of Hormuz. Setting aside whether U.S. land forces will be committed to the war at all, what remains unknown is how, and to what degree, they would be employed. The continued relevance of land forces in the 21st century is a The post Winning in the Donbas: What Russia's 2014–2015 Campaign Reveals About Modern War appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#586
TechCrunch AI 2026-04-18

The company now offers robotaxi service in three cities, all of them in Texas, after launching in Austin last year and starting to offer rides without safety drivers in January 2026.

Impact 4.0 · Import 4.0 · Pop 3.5
#587
TechCrunch AI 2026-04-18

In recent months, the company announced an agreement with Amazon Web Services to use Cerebras chips in Amazon data centers, as well as a deal with OpenAI reportedly worth more than $10 billion.

Impact 4.0 · Import 4.0 · Pop 3.5
#589
TechCrunch AI 2026-04-18

New data from Appfigures shows a swell of new app launches in 2026, suggesting AI tools could be fueling a mobile software boom.

Impact 4.0 · Import 4.0 · Pop 3.5
#590
TechCrunch AI 2026-04-19

On the latest episode of Equity, we discuss OpenAI's latest acquisitions and whether they address "two big existential problems" for the company.

Impact 4.0 · Import 4.0 · Pop 3.5
#591

The 12-month window

Industry ★ 3.9
TechCrunch AI 2026-04-19

A lot of AI startups exist partly because the foundation models haven't expanded into their category yet. As many jokingly acknowledge, that won't last forever.

Impact 4.0 · Import 4.0 · Pop 3.5
#594
Stratechery 2026-04-20

TSMC's earnings suggest that the company's leadership is not truly bought into the AI growth story.

Impact 4.0 · Import 4.0 · Pop 3.5
#595
War on the Rocks 2026-04-20

Think of a violin made by a master craftsman: beautiful, precise, capable of extraordinary performance, but impossible to produce quickly or cheaply. It takes time, rare expertise, and materials that cannot be sourced at scale. You would not equip an entire orchestra with instruments like that. Yet that is essentially what the United States has attempted with its tactical air fleet. The F-35 program's total lifetime cost is projected to exceed two trillion dollars, the most expensive Major Defense Acquisition Program in history. The United States plans to purchase thousands of them. Meanwhile, modern conflict, from Ukraine's drone war to naval The post The F-35 Is a Masterpiece Built for the Wrong War appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5
#596
War on the Rocks 2026-04-20

In Sept. 2025, Pakistan and Saudi Arabia signed a mutual defense pact, formalizing what decades of quiet cooperation had already made real. The defense pact signed in Riyadh was presented in official communiqués as a natural deepening of bilateral ties. It was that, but it was also something larger: the latest installment in a pattern that has persisted for half a century and that continues to confound the logic of power politics. Pakistan, a state dependent on International Monetary Fund bailouts and outmatched conventionally by its larger neighbor, has once again positioned itself at the center of a consequential security The post Iran and the Indispensable Broker: How Pakistan Outmaneuvers India on the World Stage appeared first on War on the Rocks.

Impact 4.0 · Import 4.0 · Pop 3.5