Wolf Digest — 2026-05-05

#1

LWiAI Podcast #243 - GPT 5.5, DeepSeek V4, AI safety sabotage

Frontier LLMs 2026-05-04 Last Week in AI 8.5 8.5/9.3/7.5

Last Week in AI #243, recorded April 29 and posted May 4, is the most concentrated single artifact of the week, packing four distinct frontier-model events and an alignment-flavored incident into one show. Andrey Kurenkov and Jeremie Harris frame it as a coding-and-voice-heavy week followed by a fresh open-weights drop from China and a Tencent release that misses.

OpenAI shipped GPT-5.5, with the system card claiming meaningful gains on coding evaluations alongside higher per-token pricing than GPT-5.4. The card also surfaces chain-of-thought monitorability and misalignment testing as headline items — OpenAI continues to publish reasoning-trace probes against its own models — and includes a now-discussed system-prompt warning about "goblins" that the hosts treat as a quirk in OpenAI's deployment-time tooling rather than a serious capability claim. xAI countered with Grok Voice Think Fast 1.0, leading on real-time-voice-agent benchmarks and quantifying production impact at Starlink customer support and sales — large enough lifts that the hosts treat it as a credible Whisper-plus-frontier-model competitor rather than a demo, though the benchmarks are first-party.

The bigger frontier news is DeepSeek V4. Pro and Flash variants ship as open weights, with the architecture moving deeper into mixture-of-experts scaling and pushing context to one million tokens via hybrid compressed-attention modifications. The hosts read this as a continuation of the post-V3 cadence — the lab is converging on the same recipe that closed-weight labs are using internally and shipping the artifacts publicly. Tencent's Hunyuan 3 preview lands the same week with weaker benchmarks; Andrey treats it as evidence that the marginal value of "yet another Chinese frontier release" is dropping unless the lab can clearly differentiate on capability or efficiency.

The episode's safety-flavored thread is a sabotage incident that the hosts characterize as a deliberate attempt to insert harmful behavior into a frontier system — they cover what is publicly known so far without naming a perpetrator and treat it as evidence that supply-chain attacks against model-training pipelines are now a credible threat surface, not a hypothetical one. They tie this back to the "distillation attacks" framing that Nathan Lambert pushes back on in the same week, and to OpenAI's published chain-of-thought monitorability work — three threads converging on the same point that the misalignment frontier is increasingly about adversarial inputs to training rather than emergent goal-directed behavior in deployment.

For practitioners, the actionable signal is that DeepSeek V4 and Grok Voice Think Fast both land as serious challengers in their respective domains in the same week, GPT-5.5 raises the price-per-coding-task ceiling without obviously moving the floor, and the misalignment conversation is shifting from "will the model deceive" toward "will an adversary plant the deception." The full episode runs 1h52m and is worth listening to in full for context on each release.

#2

Week one of the Musk v. Altman trial: What it was like in the room

Industry 2026-05-04 MIT Technology Review — AI 7.8 7.5/8.5/7.0

James O'Donnell's report from the Oakland federal courthouse covers the first week of trial in Musk's lawsuit alleging that OpenAI breached its founding nonprofit mission. The trial began the week of April 27 in front of a federal judge, with Musk's testimony occurring during week one and Altman's direct examination scheduled for week two. The piece is the highest-quality on-the-ground account of the trial so far and reads like a stenographer's notebook turned into prose.

The legal stakes are concrete. Musk seeks rescission and disgorgement tied to his early funding contributions, with the strongest of his claims hinging on whether OpenAI's transition from nonprofit to capped-profit and then to its current structure violated explicit promises made to him in writing. A partial Musk win would, by the reporter's read, materially complicate OpenAI's reported plan to go public this year — the company is in the middle of preparing a registration filing, and any successful claim that its capital structure was constructed on a breached fiduciary duty would have to be disclosed to investors and resolved before pricing.

The reporter highlights several substantive moments from week one. Musk on the stand is described as "more measured than expected" in legal posture, repeatedly returning to the email record to show that his early checks were sized and timed against a nonprofit governance structure he believed Altman would maintain. Altman's direct examination, scheduled for week two, is expected to push back on the premise that any fixed promise was made — OpenAI's defense is that the for-profit subsidiary was always contemplated as a path to fund the mission and that Musk himself proposed a corporate-structure pivot in 2017. The article surfaces the trove of texts and emails that have entered evidence, including the now-public exchange in which Musk asked Altman for a settlement-like resolution and was rebuffed.

Beyond the stakes for OpenAI's IPO, O'Donnell points at the spectacle dimension: two of the most powerful figures in AI testifying within feet of each other while their feud plays out on X in real time, with each side's PR apparatus pushing transcripts and selective quotes. The reporter notes that the courtroom itself is a small, packed federal venue (credentialed press, OpenAI senior leadership in attendance, Musk's legal team led by his usual outside counsel) and that the judge has been actively managing the pace to keep the trial on schedule for a verdict by mid-summer. Several TechCrunch threads this week extend the same beat: one piece on Musk's only AI-expert witness, who testified about an AGI arms race; another surfacing the ominous texts Musk sent Brockman and Altman after asking for a settlement.

For anyone tracking AI corporate governance or reading the OpenAI S-1 when it lands, this is the trial of the year. The reporter promises continued coverage through closing arguments.

#3

Pentagon seeks smarter, self-organizing drones as autonomous-warfare budget is poised to skyrocket

Government & Defense 2026-05-04 Defense One 7.6 7.5/8.3/6.5

Defense One's Patrick Tucker reports that the autonomous-warfare line in the Pentagon's fiscal 2027 budget request is poised to grow significantly, with explicit emphasis on drones that can self-organize into formations and execute mission segments without per-platform operator input. The reporting cites unnamed program-office sources who describe the policy framing as a deliberate move past the current Replicator paradigm — Replicator emphasizes mass-produced, low-cost individual platforms; the new push emphasizes coordinated swarms running shared autonomy stacks.

The technical architecture, as described by the program-office sources, is built around three layers: a perception layer using onboard sensor-fusion stacks to maintain local situational awareness without relying on continuous data-link to a controller; a coordination layer that handles role assignment, formation maintenance, and target hand-off across a swarm using mesh networking; and a tasking layer where a human operator specifies mission objectives at the formation level rather than tasking each platform individually. The piece notes that several of the named technology providers — Anduril, Shield AI, Skydio, and a handful of smaller defense-tech companies — already deploy versions of all three layers in standalone products, but the integration story across vendors is still unsolved.

The budget growth is the news. Tucker quantifies the trajectory: autonomous-systems lines have been growing at roughly 35 percent year-over-year for the past three budget cycles, and the FY27 request is expected to extend that growth, with most of the increment going to platform-agnostic autonomy software rather than specific airframes. The Pentagon is treating autonomy as a horizontal capability — the same software stack expected to fly on multiple Replicator-class platforms, kamikaze drones like the Switchblade 400 (which the Army awarded contracts on this week), and larger collaborative-combat-aircraft programs in development with the Air Force. This contrasts with prior generations of autonomous-systems programs that tightly coupled the autonomy stack to a single airframe.

The piece flags two emerging tensions. First, the procurement-model question: the Defense Innovation Unit's "non-traditional" rapid-prototyping pipeline has been the on-ramp for most of the smaller autonomy vendors, but scaling to formation-level deployment requires moving programs into traditional acquisition pathways, where procurement timelines stretch to years and software-update cadence drops. Second, the data-sharing question: smaller vendors have been reluctant to share their flight-test data with Pentagon program offices because the data is core IP, but the Pentagon needs cross-platform datasets to train and validate the formation-coordination layer. Several vendors interviewed pushed back on the framing that the swarm coordination problem is "almost solved," noting that adversarial-jamming environments and degraded-GPS scenarios remain weak spots in the current state of practice.

This is one of the cleaner pieces of agenda-setting reporting on Pentagon AI policy this year, and it pairs naturally with this week's other defense-tech beats: the Marine Corps drone roadmap, the U.S. Strait of Hormuz "umbrella" deployment, and the Switchblade 400 award. For practitioners, the takeaway is that the autonomy-software market for defense is shifting toward standardized, platform-agnostic stacks, and the FY27 budget will tell us how serious the Pentagon is about that shift.

#4

The distillation panic

Safety, Policy & Regulation 2026-05-04 Interconnects (Nathan Lambert) 7.5 7.0/8.0/7.0

Nathan Lambert's "The distillation panic" is the most pointed policy-discourse essay of the week and pushes back against the recent framing of "distillation attacks" — the term that has emerged in U.S. policy circles to describe Chinese labs hacking or jailbreaking commercial APIs to extract training signal. Lambert's core claim is that "distillation attacks" is a horrible piece of terminology that will, by repeated use in policy documents, irrevocably associate the broad and useful research technique of distillation with the narrow misuse pattern.

The technical distinction Lambert is defending is real and important. Distillation as a research method covers everything from teacher-student knowledge transfer to model compression, MoE expert pruning, and the entire post-training synthetic-data pipeline that nearly every frontier lab now uses. It is one of the core tools that makes academic and economic diffusion of AI capability possible — small labs and startups can fine-tune on outputs from larger models, researchers can build interpretability tools by distilling behavior into more analyzable forms, and downstream consumers benefit from cheaper, faster models trained from larger ones. The "attack" framing turns this entire toolkit into an act adjacent to a hostile state.

Lambert draws an explicit parallel to the open-source / open-weights debate, where careless terminology in policy circles collapsed a meaningful technical distinction into a single label whose original definition nobody remembers. He argues that the same fate is now playing out for distillation: U.S. national-security commentary now treats every API extraction as suspect, the press echoes the framing, and within months the academic community will find itself defending a research practice it previously considered uncontroversial. The Lawfare piece on U.S. distillation-attack response from the prior week, which Lambert references, is part of what he is reacting to.

Where the essay gets practically useful is in his prescription. Lambert proposes naming the actual misuse pattern more narrowly — "API extraction" or "training-signal exfiltration" — and reserving "distillation" for the legitimate research technique. He notes that the labs themselves have a role to play here: rate-limiting and detection of exfiltration patterns are concrete defenses that don't require terminological overhang, and the frontier labs are already deploying both. He argues that policy work focused on the actual exfiltration pathways — telemetry, terms of service enforcement, attribution in logs — is more useful than blanket framings that risk catching legitimate research in their net.
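To make the rate-limiting defense concrete, here is a minimal token-bucket sketch. It is not taken from Lambert's essay or from any lab's actual system; the class name `TokenBucket`, the helper `count_admitted`, and the capacity and refill values are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical per-API-key limiter; capacity and rate are illustrative."""
    capacity: float          # maximum burst size, in requests
    refill_per_sec: float    # sustained requests per second
    tokens: float = 0.0
    last_ts: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start with a full bucket

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        elapsed = max(0.0, now - self.last_ts)
        # Refill in proportion to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_ts = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_admitted(bucket: TokenBucket, timestamps: list[float]) -> int:
    """Count how many requests in a sorted timestamp list get through."""
    return sum(bucket.allow(t) for t in timestamps)


# A bulk scraper firing 20 requests at once versus a client pacing one per second.
burst_admitted = count_admitted(TokenBucket(5, 1.0), [0.0] * 20)
paced_admitted = count_admitted(TokenBucket(5, 1.0), [float(t) for t in range(20)])
print(burst_admitted, paced_admitted)
```

The design point is that a limiter like this throttles high-volume extraction without needing to classify intent: a paced client is unaffected, while a burst is capped at the bucket size.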

The piece is short by Interconnects standards (around eight minutes of audio narration) and lands as a deliberate intervention in the policy discourse rather than a deep technical post. For readers tracking how the U.S. AI-policy conversation is evolving in 2026, it is one of the most clearly argued pushbacks against the recent National-Security-Council-adjacent framing, and it is worth reading alongside the Lawfare and Defense One pieces that adopt the framing Lambert is criticizing.

#5

OpenAI’s cozy partner Cerebras is on track for a blockbuster IPO

Infrastructure 2026-05-04 TechCrunch — AI 7.4 7.5/7.0/7.5

In the long-running saga that is Cerebras Systems’ IPO, the finish line is finally in sight. The AI chipmaker said on Monday that it is preparing to sell 28 million shares at $115 to $125 a share. This would raise $3.5 billion and give it a $26.6 billion market cap at the high end. That would be a nice bump in just a couple of months for the late investors who piled into its $1 billion Series H at a $23 billion valuation in February.
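The offering arithmetic can be checked in a few lines. This sketch uses only the figures quoted above; the implied share count is a derived estimate under those figures, not a number from the filing.

```python
shares_offered = 28_000_000
price_low, price_high = 115, 125           # quoted price range, dollars per share

raise_low = shares_offered * price_low     # gross proceeds at the low end
raise_high = shares_offered * price_high   # gross proceeds at the high end

market_cap_high = 26.6e9                   # quoted market cap at the high end
implied_shares_outstanding = market_cap_high / price_high

series_h_valuation = 23e9                  # quoted February Series H valuation
step_up = market_cap_high / series_h_valuation - 1

print(f"Raise: ${raise_low / 1e9:.2f}B to ${raise_high / 1e9:.2f}B")
print(f"Implied shares outstanding: {implied_shares_outstanding / 1e6:.1f}M")
print(f"Step-up over Series H: {step_up:.1%}")
```

At the top of the range the raise comes to $3.5 billion, matching the quoted figure; the low end works out to $3.22 billion, and the step-up over the Series H valuation is a little under 16 percent.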

#6

Anthropic and OpenAI are both launching joint ventures for enterprise AI services

Industry 2026-05-04 TechCrunch — AI 7.4 7.7/7.5/7.0

On Monday, Anthropic announced a joint venture focusing on deploying enterprise AI services. Blackstone, Hellman & Friedman, and Goldman Sachs will be founding partners in the new venture, which is backed by a group of VCs, hedge funds, and private equity firms, including Apollo Global Management, General Atlantic, GIC, Leonard Green, and Sequoia Capital. The Wall Street Journal, which first reported news of the partnership, reported the new venture was valued at $1.5 billion, which includes a $300 million commitment each from Anthropic, Blackstone, and Hellman & Friedman. The announcement com

#7

The AI Models Smart Enough to Know They're Cheating — Beth Barnes & David Rein [METR]

Evaluations & Benchmarks 2026-05-04 Machine Learning Street Talk (MLST) 7.3 7.0/8.0/6.5

Machine Learning Street Talk (MLST): We engage in fascinating discussions with pre-eminent figures in the AI field. Our flagship show covers current affairs in AI, cognitive science, neuroscience and philosophy of mind with in-depth analysis. Our approach is unrivalled in terms of scope and rigour – we believe in intellectual diversity in AI, and we touch on all of the main ideas in the field with the hype surgically removed.

#8

Sierra raises $950M as the race to own enterprise AI gets serious

Industry 2026-05-04 TechCrunch — AI 7.2 7.5/6.5/7.5

Bret Taylor’s AI startup Sierra is raising a $950 million funding round led by Tiger Global and GV, the company announced Monday, pushing its post-money valuation above $15 billion. The raise gives Sierra more than $1 billion to work with — capital the company says it will use to become the “global standard” for AI-powered customer experiences. Like a lot of AI companies, Sierra has, smartly, been very proactive in touting its own growth in a crowded market. The company says it started with just four design partners a couple of years ago.

#9

How OpenAI delivers low-latency voice AI at scale

Audio & Speech 2026-05-04 Hacker News — AI front page 7.1 7.0/6.5/7.5

By Yi Zhang and William McDonald, Members of Technical Staff. Voice AI only feels natural if conversation moves at the speed of speech. When the network gets in the way, people hear it immediately as awkward pauses, clipped interruptions, or delayed barge-in. That matters for ChatGPT voice, for developers building with the Realtime API, for agents working in interactive workflows, and for models that need to process audio while a user is still talking. At OpenAI’s scale, that translates into three concrete requirements: Gl

#10

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Research 2026-04-30 Hugging Face Daily Papers 6.8 6.5/6.0/8.0

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation.

#11

OpenAI, Google, and Microsoft Back Bill to Fund 'AI Literacy' in Schools

Safety, Policy & Regulation 2026-05-04 Hacker News — AI front page 6.8 5.5/7.5/7.0

A new, bipartisan bill introduced by Senator Adam Schiff, a California Democrat, and endorsed by the biggest AI developers in the world (including OpenAI, Google, and Microsoft) would change the K-12 curriculum to shoehorn in “AI literacy,” something that young people and teachers alike already hate in schools. The Literacy in Future Technologies Artificial Intelligence, or LIFT AI Act, would empower the new director of the National Science Foundation (NSF) to make grant awards “on a merit-reviewed, competitive basis to institutions of higher education or nonprofit organizations (or a consortium

#12

Import AI 455: Automating AI Research

Safety, Policy & Regulation 2026-05-04 Import AI (Jack Clark) 6.7 6.5/7.0/6.0

Jack Clark argues there's now a 60%+ chance that no-human-involved AI R&D — an AI system powerful enough to autonomously build its own successor — happens by the end of 2028. The case is built from public benchmark data on the engineering components of AI development, plus the rate at which AI capability is compounding across them.

The headline evidence: SWE-Bench has effectively saturated (Claude Mythos Preview at 93.9% vs. Claude 2's ~2% in late 2023). METR's task-time-horizon plot shows AI systems going from ~30 seconds of independent work in 2022 to ~12 hours in 2026 (Opus 4.6), with Ajeya Cotra forecasting ~100 hours by end of 2026. CORE-Bench (computational reproducibility) was declared "solved" in December 2025 at 95.5%. MLE-Bench (Kaggle competitions) jumped from 16.9% at launch to 64.4% (Gemini 3 with search). Anthropic's CPU-only LM-training optimization task went from 2.9× speedup (Opus 4, May 2025) to 52× (Claude Mythos Preview, April 2026), against a human baseline of ~4× for 4–8 hours of work.

On the meta-skill axis, frontier models can now manage other AI systems (Claude Code's sub-agent supervision is the canonical example) and have produced proof-of-concept automated alignment research (Anthropic #454) that beats human baselines on small-scale scalable-oversight problems. PostTrainBench shows AI systems achieving ~half the uplift human researchers achieve when fine-tuning open-weight models. Frontier labs are explicit about the goal: OpenAI wants an "automated AI research intern by September 2026," Anthropic publishes on automated alignment researchers, DeepMind says automation of alignment research "should be done when feasible." Recursive Superintelligence raised $500M with this exact mandate.

The 60% by 2028 estimate (30% by 2027) hedges on whether AI can do creative, paradigm-shifting research — the transformer-architecture-class insight, not the engineering schlep. Clark notes math/CS results (Erdős-1051, the UBC/UNSW/Stanford/DeepMind math proof) are tantalizing but might be domain-specific. The downstream implications he flags: alignment under recursive self-improvement (small accuracy gaps compound: 99.9% becomes 60.5% after 500 generations), inequality of compute access, and the formation of capital-heavy human-light "machine economies" that may eventually trade primarily with each other.
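The compounding claim can be sanity-checked directly. This is a sketch under the simplest possible assumption, namely that each generation independently preserves a fixed fraction of the previous generation's accuracy; Clark's own model may differ, and the function name `retained_after` is hypothetical.

```python
def retained_after(per_generation: float, generations: int) -> float:
    """Accuracy retained if each generation keeps `per_generation` of the last."""
    return per_generation ** generations


# Show how a 0.1% per-generation loss accumulates over many generations.
for n in (1, 100, 500, 1000):
    print(f"{n:>5} generations: {retained_after(0.999, n):.1%}")
```

Under this assumption, 500 generations at 99.9% leave a little over 60% of the original accuracy, in the neighborhood of the figure quoted above.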

#13

Elon Musk’s only AI expert witness at the OpenAI trial fears an AGI arms race

Industry 2026-05-04 TechCrunch — AI 6.7 6.5/7.0/6.5

When do we take AI doomers seriously? That’s a key subtext of Elon Musk’s attempt to shut down OpenAI’s for-profit AI business. His attorneys argue that the organization was set up as a charity focused on AI safety and lost its way in pursuit of lucre. To prove that, they cite old emails and statements from the organization’s founders about the need for a public-spirited counterweight to Google DeepMind.

#14

MolmoAct2: Action Reasoning Models for Real-world Deployment

Robotic Autonomy 2026-05-04 arXiv cs.RO · Hugging Face Daily Papers 6.6 6.8/6.0/7.0

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes.

cs.RO

#15

Granite 4.1 3B SVG Pelican Gallery

Frontier LLMs 2026-05-04 Simon Willison's Weblog 6.5 6.5/6.0/6.5

IBM released their Granite 4.1 family of LLMs a few days ago. They're Apache 2.0 licensed and come in 3B, 8B and 30B sizes.

#16

Image AI models now drive app growth, beating chatbot upgrades

Industry 2026-05-04 TechCrunch — AI 6.5 6.0/6.0/7.5

Image model releases are driving growth for AI mobile apps, generating 6.5x more downloads than traditional model updates, according to a new report from app intelligence provider Appfigures. This marks a shift from earlier days, when the release of new models powering conversational experiences drove more demand, alongside new features such as a voice chat interface. For instance, ChatGPT and Gemini each added tens of millions of new downloads after releasing their respective image models, Appfigures found. For Google’s Gemini, the release of its image model Nano Banana drove an addition

#17

From Context to Skills: Can Language Models Learn from Context Skillfully?

Research 2026-05-02 Hugging Face Daily Papers 6.3 6.0/6.0/7.0

Many real-world tasks require language models (LMs) to reason over complex contexts that exceed their parametric knowledge. This calls for context learning, where LMs directly learn relevant knowledge from the given context. An intuitive solution is inference-time skill augmentation: extracting the rules and procedures from context into natural-language skills.

#18

AcademiClaw: When Students Set Challenges for AI Agents

Agents & Tool Use 2026-05-04 arXiv cs.AI · Hugging Face Daily Papers 6.3 5.5/7.5/6.0

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-st…

cs.AI cs.CY

#19

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Research 2026-04-30 Hugging Face Daily Papers 6.3 6.5/6.0/6.5

Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies.

#20

U.S. deploying ‘umbrella’ of defense and tech assets to shield ships in the Strait of Hormuz

Government & Defense 2026-05-04 DefenseScoop 6.3 6.0/7.0/5.5

President Trump announced “Project Freedom” to help trapped vessels exit the Arabian Gulf and restart the flow of in-demand commerce.

#21

Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction

Agents & Tool Use 2026-04-28 Hugging Face Daily Papers 6.2 6.0/5.5/7.0

Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth-oriented tasks demand schema-aligned outputs with wide coverage and cross-entity consistency, while depth-oriented tasks require coherent reasoning over long, branching search trajectories.

#22

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Safety, Policy & Regulation 2026-04-30 Hugging Face Daily Papers 6.2 5.5/7.0/6.0

Large Language Model (LLM) Red-Teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding effective and diverse attacks in red-teaming is important, but achieving both is challenging. Generative Flow Networks (GFNs) that perform distribution matching are a promising method, but they are notorious for training instability and mode collapse.

#23

Train Your Own LLM from Scratch

AI Coding 2026-05-05 Hacker News — AI front page 6.1 5.5/5.0/7.5

A hands-on workshop where you write every piece of a GPT training pipeline yourself, understanding what each component does and why. Andrej Karpathy's nanoGPT was my first real exposure to LLMs and transformers. Seeing how a working language model could be built in a few hundred lines of PyTorch completely changed how I thought about AI and inspired me to go deeper into the space. This workshop is my attempt to give others that same experience.

#24

Google Earnings, Meta Earnings

Industry 2026-05-04 Stratechery 6.1 6.0/6.0/6.0

Wall Street loved Google’s earnings, and hated Meta’s, even though the latter’s core business was more impressive. The difference is that Google is monetizing its investments now (and it might be all Anthropic).

#25

Senators seek Labor-led database on AI workforce impacts

Safety, Policy & Regulation 2026-05-04 FedScoop — AI 6.1 5.5/7.5/5.0

The Workforce Transparency Act from Sens. Mark Warner and Ted Budd would charge the DOL with creating a public resource with “aggregated workforce transparency data.”

#26

Army awards deal to AV for new Switchblade 400 kamikaze drone to support LASSO program

Government & Defense 2026-05-04 DefenseScoop 6.0 6.0/6.5/5.0

The Army has awarded AeroVironment a prototype agreement for the drone maker’s latest Switchblade variant, the company announced Monday.

#27

[AINews] The Other vs The Utility

Industry 2026-05-04 Latent Space (swyx & Alessio) 6.0 5.5/6.0/6.0

Congrats to Sierra, raising ~$1B at a $15B valuation. Normally this would be a headline story, but we already covered their $10B round and CEO Bret Taylor on the pod; they crossed 100M ARR in November and 150M in Feb, so presumably they are at or above the 200M mark (a nice 75x current multiple, whew - 50x if you give them credit thru EOY). Today though we are choosing to focus on this discussion bravely sparked by Roon, an OpenAI employee

#28

As workers worry about AI, Nvidia’s Jensen Huang says AI is ‘creating an enormous number of jobs’

Industry 2026-05-05 TechCrunch — AI 5.9 5.0/6.0/6.5

When it comes to the specter of AI’s labor-displacing potential, Jensen Huang thinks that the American worker has nothing to fear. During a conversation Monday night with CNBC’s Becky Quick hosted by the Milken Institute, an economic policy think tank, the jovial Nvidia CEO said that AI was an industrial-scale generator of jobs, not the harbinger of mass unemployment that so-called “AI doomers” have often accused it of being. A number of different topics were broached during the talk, but a central theme that kept coming back was the ongoing economic anxiety surrounding the AI industry and w

#29

Elon Musk sent ominous texts to Greg Brockman, Sam Altman after asking for a settlement, OpenAI claims

Industry 2026-05-04 TechCrunch — AI 5.9 5.5/6.0/6.0

By Julie Bort, Tim Fernholz. Two days before the Elon Musk vs. OpenAI trial began last week, Musk texted the model maker’s president and co-founder Greg Brockman. Musk suggested to Brockman that OpenAI settle the suit. After Brockman replied by suggesting both sides drop their suits, the exchange went off the rails, with Musk responding: “By the end of this week, you and Sam will be the most hated men in America.

#30

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Research 2026-04-30 Hugging Face Daily Papers 5.8 6.0/6.0/5.5

Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code.

#31

Cheap Missiles, Not Drones, Will Win the Next Air War

Government & Defense 2026-05-04 War on the Rocks 5.8 5.5/6.5/5.0

Vitaliy Goncharuk argues NATO is making a costly bet by pouring money into propeller-driven counter-drone systems, when the Ukrainian battlefield is already showing the bet is wrong. Russia has retrofitted its slow Shahed drones with turbojet engines, jumping speed from ~90 mph to ~460 mph and ceiling from 6,500 ft to 29,000 ft. Ukraine's propeller-based interceptor drones (max 280 mph) can no longer catch them from behind — only head-on intercepts remain viable, with sharply reduced hit rates.

Iran's $90,000 "358" missile already intercepts the full class of aerial threats from Shaheds to MQ-9 Reapers and AH-64 Apaches, while Western counter-drone efforts double down on quadcopters at the wrong end of the speed-altitude curve. The right answer, per Goncharuk, is a new class of cheap autonomous interceptor missiles — low-thousands to tens-of-thousands of dollars per unit, AI-guided with onboard inertial/visual nav — that scales economically against turbojet drones costing $20–50K. The components exist; what's missing is integration and production scale. YC-backed Perseus Defense and Ares Industries, plus European players Frankenburg Technologies and Origin Robotics, are building toward this; none are at production scale.

Drones in this view become trucks — propeller-driven motherships carrying 2–10 cheap interceptor missiles, with autonomy mandatory because comms and GPS won't be assumed. Five structural reasons the West isn't moving: institutional momentum (drones are politically legible), missile production complexity (concentrated in legacy primes), the sensor/nav scaling gap from civilian autonomy (missile-class targeting needs different hardware), ITAR friction once propulsion is involved, and missile-engineering workforce scarcity. China is supplying components and also investing in affordable counter-drone missiles (Yitian, FK-3000 with 96 missiles per platform); Russia is fielding the S8000 Banderol, sometimes nicknamed an "AliExpress missile."
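
The speed and cost figures quoted in this item can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes straight-line flight, treats the approximate speeds as exact, and uses a hypothetical $10,000 interceptor price picked from inside the "low-thousands to tens-of-thousands" range. The function names are illustrative and do not come from the article.

```python
def closing_speed_mph(interceptor_mph, target_mph, head_on):
    """Rate at which the gap shrinks; a negative value means the target pulls away."""
    if head_on:
        return interceptor_mph + target_mph
    return interceptor_mph - target_mph


def cost_exchange_ratio(interceptor_cost_usd, target_cost_usd):
    """Defender dollars spent per attacker dollar destroyed, assuming one shot per kill."""
    return interceptor_cost_usd / target_cost_usd


INTERCEPTOR_MPH = 280   # propeller interceptor top speed quoted in the article
SHAHED_PROP_MPH = 90    # original Shahed, approximate
SHAHED_JET_MPH = 460    # turbojet retrofit, approximate

# Tail chase against the old and the new Shahed, then a head-on pass.
print(closing_speed_mph(INTERCEPTOR_MPH, SHAHED_PROP_MPH, head_on=False))  # 190
print(closing_speed_mph(INTERCEPTOR_MPH, SHAHED_JET_MPH, head_on=False))   # -180
print(closing_speed_mph(INTERCEPTOR_MPH, SHAHED_JET_MPH, head_on=True))    # 740

# Cost exchange against a $20-50K turbojet drone.
HYPOTHETICAL_CHEAP_INTERCEPTOR_USD = 10_000  # assumption, not a figure from the article
for target_cost in (20_000, 50_000):
    print(target_cost,
          cost_exchange_ratio(90_000, target_cost),
          cost_exchange_ratio(HYPOTHETICAL_CHEAP_INTERCEPTOR_USD, target_cost))
```

The negative tail-chase value is the arithmetic behind "can no longer catch them from behind", and the 740 mph head-on closing speed gives a sense of how short the engagement window becomes. On cost, a $90,000 missile spends more than the drone is worth, while an interceptor near the assumed price would spend less.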

#32

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

Research 2026-04-29 Hugging Face Daily Papers 5.7 5.5/6.0/5.5

Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) have significantly reduced inference costs through sparse activation. However, this sparse activation paradigm also introduces new safety challenges. Since only a subset of experts is engaged for each input, model behavior becomes coupled to routing decisions, yielding a difficult-to-control mechanism that can vary across safety-relevant scenarios.

#33

T^2PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

Agents & Tool Use 2026-05-03 Hugging Face Daily Papers 5.7 5.5/6.0/5.5

Recent progress in multi-turn reinforcement learning (RL) has significantly improved the performance of reasoning LLMs on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies continue to generate low-information actions that neither reduce uncertainty nor advance task progress.

#34

Perceptual Flow Network for Visually Grounded Reasoning

Multimodal 2026-05-04 arXiv cs.AI · Hugging Face Daily Papers 5.3 5.5/5.5/5.0

Despite the success of Large-Vision Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility.

cs.CV cs.AI

#35

The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

Safety, Policy & Regulation 2026-05-04 arXiv cs.CL 5.1 5.5/6.8/3.0

As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability -- knowing what they do not know, detecting errors, seeking clarification -- under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring.

cs.AI cs.CL cs.LG

#36

An explainable hypothesis-driven approach to Drug-Induced Liver Injury with HADES

Agents & Tool Use 2026-05-04 arXiv cs.AI 5.1 4.0/8.3/3.0

Drug-induced liver injury (DILI) remains a leading cause of late-stage clinical trial attrition. However, existing computational predictors primarily rely on binary classification, a framing that limits generalization and yields no mechanistic insight to guide translational decisions. We argue that DILI prediction is better posed as an explainable hypothesis-generation problem.

cs.AI

#37

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

Research 2026-04-24 Hugging Face Daily Papers 5.1 5.0/6.1/4.2

The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and inherently exhibit multi-modal, high-noise, and weakly labeled characteristics, lacking unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environme…

#38

Let ViT Speak: Generative Language-Image Pre-training

Research 2026-04-30 Hugging Face Daily Papers 5.0 5.5/4.6/5.0

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competi…

#39

The government’s AI efficiency numbers look good. That should worry you.

Agents & Tool Use 2026-05-04 FedScoop — AI 5.0 4.5/7.2/3.0

Current AI training across agencies is not sufficient and would benefit from more “original intelligence.”

#40

The next president must reimagine, not just restore, the administrative state

Government & Defense 2026-05-04 Defense One 5.0 4.5/7.2/3.0

Elon Musk holds a chainsaw reading “Long live freedom, damn it” during the 2025 Conservative Political Action Conference. Musk took the helm of DOGE, the Trump administration’s Department of Government Efficiency, and oversaw cuts and reorganizations across federal agencies.

#41

Map2World: Segment Map Conditioned Text to 3D World Generation

Research 2026-04-30 Hugging Face Daily Papers 4.8 5.5/4.0/5.0

3D world generation is essential for applications such as immersive content creation or autonomous driving simulation. Recent advances in 3D world generation have shown promising results; however, these methods are constrained by grid layouts and suffer from inconsistencies in object scale throughout the entire world. In this work, we introduce a novel framework, Map2World, that first enables 3D world generation conditioned on user-defined segment maps of arbitrary shapes and scales, ensuring global-scale consistency and flexibility across expansive environments.

#42

Prox-E: Fine-Grained 3D Shape Editing via Primitive-Based Abstractions

Research 2026-04-28 Hugging Face Daily Papers 4.8 5.5/4.0/5.0

Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that heavily depends on these models to drive 3D edits. While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object's overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction.

#43

From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

Research 2026-04-26 Hugging Face Daily Papers 4.8 5.5/4.0/5.0

LLM agents increasingly rely on reusable skills, capability packages that combine instructions, control flow, constraints, and tool calls. In most current agent systems, however, skills are still represented by text-heavy artifacts, including SKILL.md-style documents and structured records whose machine-usable evidence remains embedded largely in natural-language descriptions. This poses a challenge for skill-centered agent systems: managing skill collections and using skills to support agents both require reasoning over invocation interfaces, execution structure, and concrete side effects that are often entangled in a single textual surface.

#44

First B-52s to get new engines this year

Government & Defense 2026-05-04 Defense One 4.8 4.5/6.5/3.0

Critical design review clears Boeing to upgrade two Stratofortresses in bid to keep them flying past 2050.

#45

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

Safety, Policy & Regulation 2026-05-04 arXiv cs.CL 4.7 4.0/7.0/3.0

Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, where earlier turns covertly bias later replies, constitutes a powerful attack surface, with hand-crafted multi-turn scaffolds consistently outperforming single-turn manipulations on capable models. However, automated optimization-based red-teaming has remained largely limited to the single-turn setting, iterating over static prompts and lacking the ability to reason about which forms of conversational priming induce compliance.

cs.CL cs.CR

#46

Online Self-Calibration Against Hallucination in Vision-Language Models

Research 2026-04-30 Hugging Face Daily Papers 4.7 4.5/6.1/3.5

Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see.

#47

Tailoring AI solutions for health care needs

Industry 2026-05-04 MIT Technology Review — AI 4.7 4.5/6.2/3.0

Sponsored, in partnership with Mayo Clinic Platform. By MIT Technology Review Insights, May 4, 2026. By tapping into sector-specific data and expertise, developers can build AI applications that address some of health care’s biggest challenges. The AI market is full of big promises of grand transformation. Health care is a prime target for those promises, beset as it is by financial pressures, labor shortages, and the growing burden of caring for an aging population. AI developers are targeting functions that vary widely,

#48

The Army wants a new drone to close ‘reconnaissance and security gaps’ for its battalions

Government & Defense 2026-05-04 DefenseScoop 4.7 4.8/5.7/3.0

Amid an ongoing effort to push longer-range, quick-launch drones to tactical units, the service wants battalion commanders to have an unmanned aerial system organic to their unit that can take off vertically and fly over 40 kilometers.

#49

Bolek: A Multimodal Language Model for Molecular Reasoning

Reinforcement Learning 2026-05-04 arXiv cs.LG 4.6 4.0/6.9/3.0

Molecular property models increasingly support high-stakes drug-discovery decisions, but their outputs are often difficult to audit: classical predictors return scores without rationale, while language models can produce fluent explanations weakly grounded in the input molecule. We introduce Bolek, a compact multimodal language model that grounds natural-language reasoning in molecular structure by injecting a Morgan fingerprint embedding into an instruction-tuned text decoder. Bolek is fine-tuned on molecular alignment tasks, including molecule description, RDKit descriptor prediction, and substructure detection, and on downstream reasoning over 15 TDC binary classification tasks using synt…

cs.LG cs.AI q-bio.BM

#50

Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

Research 2026-05-04 arXiv cs.LG 4.6 4.0/6.8/3.0

Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts seems to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer.

cs.LG cs.AI
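
The "fixed-reference KL-regularized RLVR optimizer" has a standard closed form on a finite response set: the reference policy reweighted by exp(reward / beta) and renormalized. The sketch below uses that textbook form on a three-response toy problem. The numbers and function names are illustrative assumptions, and the abstract does not settle whether the paper's weighted-SFT construction matches this in every detail.

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite set."""
    unnormalized = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
    z = sum(unnormalized)  # partition function
    return [u / z for u in unnormalized], z


def objective(policy, ref_probs, rewards, beta):
    """Expected reward minus beta times KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - beta * kl


# Toy setup; every number here is an illustrative assumption.
ref_probs = [0.5, 0.3, 0.2]  # reference policy over three candidate responses
rewards = [0.0, 1.0, 1.0]    # verifier outcome: fail, pass, pass
beta = 0.5                   # KL regularization strength

target, z = kl_regularized_optimum(ref_probs, rewards, beta)
print(target)
# At the optimum the objective equals beta * log(z).
print(objective(target, ref_probs, rewards, beta), beta * math.log(z))
```

In this tabular setting, a weighted log-likelihood fit over responses sampled from the reference policy, with weight exp(reward / beta) per sample, is maximized by the same distribution, because the maximizer of a weighted categorical likelihood is the normalized product of sampling probability and weight. That is one concrete reading of the abstract's point that the sampler and the weights together determine the policy being fit.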

#51

AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs

Evaluations & Benchmarks 2026-05-04 arXiv cs.RO 4.6 4.7/6.1/3.0

Dense and accurate depth estimation is essential for robotic manipulation, grasping, and navigation, yet currently available depth sensors are prone to errors on transparent, specular, and general non-Lambertian surfaces. To mitigate these errors, large-scale monocular depth estimation approaches provide strong structural priors, but their predictions can be potentially skewed or mis-scaled in metric units, limiting their direct use in robotics. Thus, in this work, we propose a training-free depth grounding framework that anchors monocular depth estimation priors from a depth foundation model in raw sensor depth through factor graph optimization.

cs.RO cs.CV
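
To make the idea of anchoring a mis-scaled monocular prediction in raw sensor depth concrete, the sketch below swaps the paper's factor graph optimization for something much simpler: a single global least-squares scale and shift, fitted only on pixels where the sensor reading is trusted. This is a stand-in and not the AnchorD method. The function name, the validity mask, and the toy values are assumptions.

```python
def fit_scale_shift(mono_depth, sensor_depth, valid):
    """Least-squares (scale, shift) minimizing sum of (scale*mono + shift - sensor)^2 on valid pixels."""
    xs = [m for m, v in zip(mono_depth, valid) if v]
    ys = [d for d, v in zip(sensor_depth, valid) if v]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    scale = cov_xy / var_x
    shift = mean_y - scale * mean_x
    return scale, shift


# Toy data: relative monocular depth, metric sensor depth, and a validity mask.
mono = [1.0, 2.0, 3.0, 4.0]
sensor = [2.5, 4.5, 0.0, 8.5]        # third reading dropped out (e.g. a transparent surface)
valid = [True, True, False, True]

scale, shift = fit_scale_shift(mono, sensor, valid)
grounded = [scale * m + shift for m in mono]  # metric depth for every pixel, including the dropout
print(scale, shift, grounded)
```

A global fit like this cannot correct locally skewed predictions, which is presumably part of why the paper formulates the problem as a factor graph instead.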

#52

Hyp2Former: Hierarchy-Aware Hyperbolic Embeddings for Open-Set Panoptic Segmentation

Robotic Autonomy 2026-05-04 arXiv cs.RO 4.6 3.7/7.1/3.0

Recognizing unknown objects is crucial for safety-critical applications such as autonomous driving and robotics. Open-Set Panoptic Segmentation (OPS) aims to segment known thing and stuff classes while identifying valid unknown objects as separate instances. Prior OPS approaches largely treat known categories as a flat label set, ignoring the semantic hierarchy that provides valuable structural priors for distinguishing unknown objects from in-distribution classes.

cs.CV cs.AI cs.RO

#53

Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting

Multimodal 2026-05-04 arXiv cs.CV 4.6 4.7/6.1/3.0

Open-world text-guided class-agnostic counting (CAC) has emerged as a flexible paradigm for counting arbitrary object classes by using natural language prompts. However, current evaluation protocols primarily focus on standard counting errors within single-category images, overlooking a fundamental requirement: the ability to correctly ground the textual prompt in the visual scene. In this paper, we show that several state-of-the-art CAC models often struggle to determine which object class should be counted based on the given prompt, revealing a misalignment between textual semantics and visual object representations.

cs.CV

#54

Trees to Flows and Back: Unifying Decision Trees and Diffusion Models

Research 2026-04-30 Hugging Face Daily Papers 4.6 5.0/4.6/4.2

Decision trees and diffusion models are ostensibly disparate model classes, one discrete and hierarchical, the other continuous and dynamic. This work unifies the two by establishing a crisp mathematical correspondence between hierarchical decision trees and diffusion processes in appropriate limiting regimes. Our unification reveals a shared optimization principle: Global Trajectory Score Matching (GTSM), for which gradient boosting (in an idealized version) is asymptotically optimal.

#55

OpenAI and PwC collaborate to reimagine the office of the CFO

Industry 2026-05-04 OpenAI Research 4.6 5.3/4.0/3.7

OpenAI and PwC are partnering to help enterprises use AI agents to automate finance workflows, improve forecasting, strengthen controls, and modernize the CFO function.

#56

Quoting John Gruber

Industry 2026-05-05 Simon Willison's Weblog 4.6 4.8/5.5/3.0

So it’s well known that Y Combinator owns some stake in OpenAI. But how big is that stake? This seems like devilishly difficult information to obtain. I asked around and a little birdie who knows several OpenAI investors came back with an answer: Y Combinator owns about 0.6 percent of OpenAI.

#57

Orchestrating Spatial Semantics via a Zone-Graph Paradigm for Intricate Indoor Scene Generation

Robotic Autonomy 2026-05-04 arXiv cs.RO 4.5 3.7/6.8/3.0

Autonomous 3D indoor scene synthesis breaks down in non-convex rooms with tightly coupled spatial constraints. Data-driven generators lack topological priors for long-horizon planning, while iterative agents fragment semantics and become geometrically brittle. We present ZoneMaestro, a unified framework that shifts the paradigm from object-centric synthesis to Zone-Graph Orchestration.

cs.RO cs.AI

#58

Navy F/A-18Gs over Iran, Venezuela show rise in aerial electronic attack

Government & Defense 2026-05-04 Defense One 4.5 4.5/5.7/3.0

A U.S. Navy EA-18G Growler with Electronic Attack Squadron (VAQ) 138 prepares for takeoff during exercise RED FLAG-Alaska 26-1 at Eielson Air Force Base, Alaska, April 21, 2026.

#59

The Illusion of Sovereignty: How International Law and Big Tech are Eroding the State

Government & Defense 2026-05-04 War on the Rocks 4.5 4.5/5.7/3.0

Ukrainian security researchers Mykhailo Andreichyn and Serhii Demediuk (a former deputy secretary of Ukraine's National Security and Defense Council) argue state sovereignty is being squeezed between two hammers: the calculated ambiguity of international law in cyberspace, and critical state-defense dependence on private tech infrastructure (AWS, Microsoft, Meta, Google, SpaceX). The combination produces a regime in which states must ask permission to use private capabilities for defense, while aggressors weaponize legal restraint as a shield.

The first hammer: international law treats cyber operations as armed attacks only under "kinetic equivalence." Russia exploited this to plant thousands of backdoors in Ukrainian systems from 2014 onward, activating them at the moment of full-scale invasion in 2022. Per Demediuk's direct operational experience, ~2,500 backdoors may still be prepositioned in Ukrainian systems as of early 2026. The "kinetic-cyber cycle" Russia uses — information pretext, digital targeting via compromised routers, kinetic strike, information rationalization — has been predictable enough that Ukraine built an automated predictive system reportedly running at 60–65% accuracy. The first weapon Russia deployed in February 2022 wasn't a tank but a cyberattack on the Viasat satellite network ~1 hour before the ground invasion.

The second hammer: the 2026 DoD–Anthropic confrontation showed even the U.S. can't fully control how a private company deploys AI (Anthropic refused to lift restrictions on autonomous lethal systems and mass surveillance, and was designated a supply chain risk). Ukraine's Starlink dependency went the other way — during the autumn 2022 southern counteroffensive, Ukrainian forces crossed a geofenced line and lost connectivity mid-assault, leading to casualties from a single private actor's operational decision. The authors argue for two recalibrations: classify destructive state-sponsored cyber operations as armed attacks by intent and cumulative effect (not kinetic equivalence) via the new UN Global Mechanism on ICT security, and impose continuity-of-service obligations on digital-infrastructure providers comparable to those on traditional defense contractors.

#60

Spectral Model eXplainer: a chemically-grounded explainability framework for spectral-based machine learning models

Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 4.4 5.0/5.3/3.0

Spectral-based machine learning models have been increasingly deployed in chemometrics and spectroscopy, where predictive accuracy is as important as explainability. Currently employed eXplainable Artificial Intelligence (XAI) methods are largely adapted from tabular or generic multivariate domains, assigning relevance to isolated spectral variables rather than to the chemically meaningful spectral zones. Widely adopted tools such as SHapley Additive exPlanations (SHAP), Permutation Feature Importance (PFI), and Variable Importance in Projection scores (VIP) were not designed for the physical continuity and high collinearity of spectral data, and their variable-level outputs require post-hoc a…

cs.LG physics.app-ph

#61

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Research 2026-04-30 Hugging Face Daily Papers 4.4 5.0/4.0/4.2

Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20--30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action.

#62

When Do Diffusion Models learn to Generate Multiple Objects?

Research 2026-04-29 Hugging Face Daily Papers 4.4 5.0/4.0/4.2

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself.

#63

ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models

Research 2026-04-28 Hugging Face Daily Papers 4.4 5.0/4.0/4.2

In this paper, we study an under-explored but important factor of diffusion generative models, i.e., the combinatorial complexity. Data samples are generally high-dimensional, and for various structured generation tasks, additional attributes are combined to associate with data samples. We show that the space spanned by the combination of dimensions and attributes can be insufficiently covered by existing training schemes of diffusion generative models, potentially limiting test time performance.

#64

Redis Array Playground

Agents & Tool Use 2026-05-04 Simon Willison's Weblog 4.4 5.8/4.0/3.0

Salvatore Sanfilippo submitted a PR adding a new data type - arrays - to Redis. The new commands are ARCOUNT, ARDEL, ARDELRANGE, ARGET, ARGETRANGE, ARGREP, ARINFO, ARINSERT, ARLASTITEMS, ARLEN, ARMGET, ARMSET, ARNEXT, AROP, ARRING, ARSCAN, ARSEEK, ARSET. The implementation is currently available in a branch, so I had Claude Code for web build thi…

#65

A Closed-Form Persistence-Landmark Pipeline for Certified Point-Cloud and Graph Classification

Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 4.3 3.7/6.3/3.0

We introduce PLACE (Persistence-Landmark Analytic Classification Engine), a closed-form pipeline for classifying point clouds and graphs through their persistent-homology signatures. Three quantitative guarantees -- a margin-based excess-risk rate, a closed-form descriptor-selection rule, and a per-prediction certificate -- are derived from training labels alone, with no learned weights or held-out calibration. The embedding sums Mitra-Virk single-point coordinate functions over a sparse landmark grid; closed-form weights maximize a structural distortion constant $\lambda(\nu)$ (a Lipschitz lower bound on $\mathcal{D}_n$ under non-interference).

cs.LG math.AT

#66

A decoupled diffusion planner that adapts to changing cost limits by using cost-conditioned generation for safety and reward gradients for performance

Reinforcement Learning 2026-05-04 arXiv cs.LG 4.3 3.7/6.1/3.0

Offline safe reinforcement learning often requires policies to adapt at deployment time to safety budgets that vary across episodes or change within a single episode. While diffusion-based planners enable flexible trajectory generation, existing guidance schemes often treat reward improvement and constraint satisfaction as competing gradient objectives, which can lead to unreliable safety compliance under cost limits. We reinterpret adaptive safe trajectory generation as sampling from a constrained trajectory distribution, where the budget restricts the trajectory region, and reward shapes preferences within that region.

cs.LG cs.AI

#67

Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 4.3 3.7/6.1/3.0

Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space, the autoregressive objective favors shortcut reliance on direct visual input, driving latent tokens toward transition-like states rather than informative reasoning content.

cs.LG

#68

MSMixer: Learned Multi-Scale Temporal Mixing with Complementary Linear Shortcut for Long-Term Time Series Forecasting

Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 4.3 3.7/6.3/3.0

Long-term time series forecasting requires models that simultaneously capture rapid oscillations, medium-range periodicities, and slowly evolving macro-trends from a fixed look-back window. Existing lightweight MLP-based models typically operate on a single temporal resolution, limiting their ability to explicitly model patterns at multiple scales. We propose MSMixer, a channel-independent multi-scale MLP architecture that addresses this limitation through three complementary innovations: (i) three parallel scale branches at down-sample factors {1x, 4x, 16x} with independent MLP blocks, (ii) a learnable softmax gate that dynamically weighs branch outputs, and (iii) a DLinear complementary sh…

cs.LG
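
The first two ingredients named in the abstract, parallel branches at down-sample factors 1x, 4x and 16x combined by a softmax gate, can be sketched in plain Python. Everything beyond that outline is an assumption made for illustration: average pooling as the down-sampling step, a trivial "repeat the last coarse value" placeholder where the paper uses MLP blocks, fixed gate logits instead of learned input-dependent ones, and no DLinear shortcut.

```python
import math


def avg_pool(series, factor):
    """Down-sample by averaging non-overlapping windows (an assumed choice)."""
    return [sum(series[i:i + factor]) / factor
            for i in range(0, len(series) - factor + 1, factor)]


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def branch_forecast(series, factor, horizon):
    """Placeholder for a per-scale MLP: repeat the last coarse value."""
    coarse = avg_pool(series, factor)
    return [coarse[-1]] * horizon


def gated_forecast(series, gate_logits, horizon, factors=(1, 4, 16)):
    """Softmax-weighted sum of the per-scale branch forecasts."""
    weights = softmax(gate_logits)
    branches = [branch_forecast(series, f, horizon) for f in factors]
    return [sum(w * b[t] for w, b in zip(weights, branches))
            for t in range(horizon)]


look_back = [float(t) for t in range(32)]  # toy upward trend
print(gated_forecast(look_back, gate_logits=[0.0, 0.0, 0.0], horizon=3))
```

With equal logits the gate averages the three scales. Raising one logit shifts weight toward that resolution, which is the knob a learned gate would turn per input.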

#69

Representation learning from OCT images

Generative Media 2026-05-04 arXiv cs.LG 4.3 3.7/6.1/3.0

Optical Coherence Tomography (OCT) has become one of the most widely used imaging modalities in ophthalmology. It provides high-resolution, non-invasive visualization of retinal microarchitecture. The automated analysis of OCT images through representation learning has emerged as a central research frontier.

cs.CV cs.LG

#70

Recurrent Deep Reinforcement Learning for Chemotherapy Control under Partial Observability

Efficiency 2026-05-04 arXiv cs.LG 4.3 3.7/6.1/3.0

Chemotherapy dose optimization can be formulated as a dynamic treatment regime, requiring sequential decisions under uncertainty that must balance tumor suppression against toxicity. However, most reinforcement learning approaches assume full observability of the patient state, a condition rarely met in clinical practice. We investigate whether memory-augmented policies can improve chemotherapy control under partial observability.

cs.LG cs.AI

#71

Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators

Robotic Autonomy 2026-05-04 arXiv cs.LG 4.3 3.7/6.1/3.0

Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable diversity, no prior work systematically compares how different generator types affect policy generalization. We integrate four generators (sparse, maze, graph, and Wave Function Collapse) with guaranteed navigability into MuRoSim, a 2D simulator focusing on training efficiency for LiDAR-based navigation.

cs.RO cs.LG
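
One simple way to get "guaranteed navigability" out of a random map generator is to reject any map in which the goal cannot be reached from the start. The sketch below does this for a hypothetical sparse generator on an occupancy grid, using breadth-first search over 4-connected free cells. The abstract does not say how MuRoSim enforces navigability, so the rejection loop, the grid representation, and all names here are assumptions.

```python
import random
from collections import deque


def sparse_map(width, height, obstacle_prob, rng):
    """Hypothetical 'sparse' generator: each cell is independently an obstacle (True)."""
    return [[rng.random() < obstacle_prob for _ in range(width)] for _ in range(height)]


def is_navigable(grid, start, goal):
    """Breadth-first search over 4-connected free cells from start to goal."""
    height, width = len(grid), len(grid[0])
    if grid[start[0]][start[1]] or grid[goal[0]][goal[1]]:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < height and 0 <= nc < width
                    and not grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def generate_navigable_map(width, height, obstacle_prob, seed, max_tries=1000):
    """Rejection sampling: redraw until the goal is reachable from the start."""
    rng = random.Random(seed)
    start, goal = (0, 0), (height - 1, width - 1)
    for _ in range(max_tries):
        grid = sparse_map(width, height, obstacle_prob, rng)
        grid[start[0]][start[1]] = False  # keep both endpoints free
        grid[goal[0]][goal[1]] = False
        if is_navigable(grid, start, goal):
            return grid, start, goal
    raise RuntimeError("no navigable map found; lower obstacle_prob")


grid, start, goal = generate_navigable_map(10, 10, obstacle_prob=0.2, seed=0)
for row in grid:
    print("".join("#" if cell else "." for cell in row))
```

Rejection sampling is easy to bolt onto any generator but gets slow as obstacle density rises; generators such as mazes can instead be built so that connectivity holds by construction.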

#72

Physics-Informed Neural Learning for State Reconstruction and Parameter Identification in Coupled Greenhouse Climate Dynamics

Efficiency 2026-05-04 arXiv cs.LG 4.3 3.7/6.3/3.0

Physics-informed neural networks (PINNs) have recently emerged as a promising framework for integrating data-driven learning with physical knowledge. In this work, we propose a coupled PINN approach for the joint reconstruction of indoor temperature and humidity dynamics in greenhouse environments, together with simultaneous identification of key model parameters. The method incorporates a reduced-order physically motivated model into the learning process, enabling consistent estimation under sparse and noisy observations.

cs.LG

#73

Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 4.3 3.7/6.1/3.0

Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medications, yet their potential as a substrate for healthcare foundation models remains largely unexplored. Here we present ReClaim, a generative transformer trained from scratch on 43.8 billion medical events from more than 200 million enrollees in the MarketScan claims data spanning 2008-2022.

cs.AI cs.CL

#74

SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures

Safety, Policy & Regulation 2026-05-04 arXiv cs.CL 4.3 3.7/6.1/3.0

We present our shared task on evaluating the adaptability of LLMs and NLP systems across multiple languages and cultures. The task data consist of an extended version of our manually constructed BLEnD benchmark (Myung et al. 2024), covering more than 30 language-culture pairs, predominantly representing low-resource languages spoken across multiple continents.

cs.CL

#75

HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems

Agents & Tool Use 2026-05-04 arXiv cs.AI 4.3 3.7/6.1/3.0

Deciding how to distribute work between humans and AI systems is a central challenge in organisational design. Most approaches treat this as a binary choice, yet the operational reality is richer: humans and AI routinely share tasks or take complementary roles depending on context, fatigue, and the stakes involved. Governing that distribution -- balancing efficiency, oversight, and human capability -- remains an open problem.

cs.AI cs.HC cs.SE

#76

Coherent Hierarchical Multi-Label Learning to Defer for Medical Imaging

Evaluations & Benchmarks 2026-05-04 arXiv cs.AI 4.3 3.7/6.1/3.0

Learning to Defer (L2D) enables a model to predict autonomously or defer to an expert, but prior work largely assumes flat label spaces. We study the first L2D setting with hierarchical multi-label decisions, motivated by medical-imaging workflows in which findings are organised by clinical taxonomies. In this setting, deferral is a delegation action rather than a label assignment, so treating it as an independent per-label decision can produce deferral incoherence, including taxonomic contradictions, delegation violations, and deferrals of labels already implied by the model's own assertions.

cs.AI

#77

An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance

Agents & Tool Use 2026-05-04 arXiv cs.AI 4.3 3.7/6.1/3.0

Healthcare automation is shaped by local procedures and organizational constraints, so agent capabilities rarely transfer unchanged across settings. Agent skills, self-contained directories that package reusable procedures for AI agents, are emerging as a procedural layer for adapting healthcare agents across diverse healthcare settings. We present the first empirical analysis of healthcare agent skills, drawing on 557 healthcare-related skills filtered from 58,159 public skills on ClawHub and annotated along ten dimensions covering function, deployment context, autonomy, and safety.

cs.AI

#78

The Design and Composition of Structural Causal Decision Processes

Agents & Tool Use 2026-05-04 arXiv cs.AI 4.3 3.7/6.2/3.0

We present two new classes of causal models of decision-making agents. Our approach is motivated by the needs of modeling the economics of computing systems. These systems are composed of subsystems and can exhibit endogenous limits on cognitive resources and value discounting.

cs.CE cs.AI cs.GT econ.TH

#79

Parking Assistance for Trailer-Truck Transport Vehicles Using Sensor Fusion and Motion Planning

Robotic Autonomy 2026-05-04 arXiv cs.RO 4.3 3.7/6.2/3.0

Autonomous driving technology has rapidly evolved over the past decade, offering significant improvements in transportation efficiency, safety, and cost reduction. While much of the progress has focused on highway driving and obstacle avoidance, low-speed maneuvers such as parking remain among the most difficult challenges for autonomous systems. This challenge is especially pronounced in trailer-truck transport vehicles due to their articulated motion and environmental constraints.

cs.RO

#80

Linearizing Vision Transformer with Test-Time Training

Generative Media 2026-05-04 arXiv cs.CV 4.3 3.7/6.2/3.0

While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment.

cs.CV

#81

Looking at Europe With a Sharper Eye

Government & Defense 2026-05-04 War on the Rocks 4.3 4.5/5.0/3.0

The latest installment of WOTR's "Ukraine Compass" weekly digest of Ukrainian-language commentary leads with Espreso's Dmytro Snegiryov on Russia's intensified push to capture the Sloviansk-Kramatorsk agglomeration in eastern Ukraine. Russian forces are concentrating attacks on Kostiantynivka, Chasovyi Yar, and the surrounding area, attempting to flank Ukrainian positions near Sloviansk by advancing from multiple directions in a tactic that echoes Bakhmut. Kostiantynivka is being systematically destroyed by heavy bombs with over 2,500 civilians trapped and all access roads under fire.

Snegiryov's read on the difference from Bakhmut: Ukraine has substantially expanded its drone capabilities since 2023, giving it a clear edge in the number of operator units and in first-person-view and drop-drone usage, plus a wider surveillance zone. He argues this is slowing — but not stopping — Russian advances, and the only way Ukraine continues to hold these areas is by leaning on partner-supplied equipment and expertise rather than going it alone. The full Compass roundup is members-only past this point; the article curates additional pieces from Ukrainian outlets across the political spectrum on frontline strategy, domestic politics, and public argument inside a country at war.

#82

MolViBench: Evaluating LLMs on Molecular Vibe Coding

Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 4.2 3.7/6.0/3.0

Molecular Vibe Coding, a paradigm where chemists interact with LLMs to generate executable programs for molecular tasks, has emerged as a flexible alternative to chemical agents with predefined tools, enabling chemists to express arbitrarily complex, customized workflows. Unlike general coding tasks, molecular coding poses a distinctive challenge: LLMs must jointly command programming, molecular understanding, and domain-specific reasoning capabilities. However, existing benchmarks remain disconnected.

cs.CL

#83

Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces

Research 2026-05-04 arXiv cs.AI 4.2 4.0/5.5/3.0

Adapting large pretrained models to diverse tasks is now routine, yet the two dominant strategies of parameter-efficient fine-tuning (PEFT) and low-rank compression are typically composed in sequence. This decoupled practice first compresses and then fine-tunes adapters, potentially misaligning the compressed subspace with downstream objectives and squandering a global parameter budget. To overcome this limitation, we introduce JACTUS (Joint Adaptation and Compression with a Task-aware Union of Subspaces), a single framework that unifies compression and adaptation.

cs.AI

#84

SAIL: Structure-Aware Interpretable Learning for Anatomy-Aligned Post-hoc Explanations in OCT

Multimodal 2026-05-04 arXiv cs.AI 4.2 4.0/5.5/3.0

Optical coherence tomography (OCT), a commonly used retinal imaging modality, plays a central role in retinal disease diagnosis by providing high-resolution visualization of retinal layers. While deep learning (DL) has achieved expert-level accuracy in OCT-based retinal disease detection, its "black box" nature poses challenges for clinical adoption, where explainability is essential for clinical trust and regulatory approval. Existing post-hoc explainable AI (XAI) methods often struggle to delineate fine-grained lesion structures, respect anatomical boundaries, or suppress noise, limiting the trustworthiness of their explanations.

cs.CV cs.AI

#85

Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models

Research 2026-04-28 Hugging Face Daily Papers 4.2 4.5/4.7/3.5

Training stable biological foundation models requires rethinking attention mechanisms: we find that using sigmoid attention as a drop-in replacement for softmax attention yields a) better learned representations: on six diverse single-cell datasets, sigmoid achieves 25% higher cell-type separation, better cell-type cohesion metrics, and lower validation loss; b) faster training: models with sigmoid attention train up to 10% faster than their softmax counterparts; and c) more stable training, by eliminating inherent sources of instability in softmax attention. We establish that sigmoid attention has globally bounded derivatives (≤ 0.25) as opposed to softmax, and a diagonal Jacobian struct…
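The 0.25 bound quoted above follows from the logistic derivative σ'(x) = σ(x)(1 − σ(x)), which peaks at x = 0. The sketch below checks that numerically and contrasts element-wise sigmoid weights with row-normalised softmax weights. It is an illustration only: the function names and the optional `bias` term are assumptions, not the paper's formulation.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative of the logistic function: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


def softmax(scores):
    """Row-normalised weights: every weight depends on every score."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_attention_weights(scores, bias=0.0):
    """Element-wise weights: each weight depends only on its own score.

    `bias` is a hypothetical knob for this sketch, not taken from the paper.
    """
    return [sigmoid(s + bias) for s in scores]


# Scan a grid of inputs; the largest derivative sits at x = 0.
grid = [i / 100.0 for i in range(-1000, 1001)]
max_grad = max(sigmoid_grad(x) for x in grid)

if __name__ == "__main__":
    scores = [2.0, -1.0, 0.5]
    print("max sigmoid derivative on grid:", max_grad)
    print("softmax weights:", softmax(scores))
    print("sigmoid weights:", sigmoid_attention_weights(scores))
```

Because each sigmoid weight is a function of its own score alone, the Jacobian of the weights with respect to the scores has no off-diagonal terms in this toy setting, whereas softmax couples every pair of positions through the normaliser.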

#86

Learning to Act and Cooperate for Distributed Black-Box Consensus Optimization

Research 2026-04-30 Hugging Face Daily Papers 4.2 4.5/4.6/3.5

Distributed blackbox consensus optimization is a fundamental problem in multi-agent systems, where agents must improve a global objective using only local objective queries and limited neighbor communication. Existing methods largely rely on handcrafted update rules and static cooperation patterns, which often struggle to balance local adaptation, global coordination, and communication efficiency in heterogeneous nonconvex environments. In this paper, we take an initial step toward trajectory-driven self-design for distributed black-box consensus optimization.

#87

AnalogRetriever: Learning Cross-Modal Representations for Analog Circuit Retrieval

Agents & Tool Use 2026-04-24 Hugging Face Daily Papers 4.2 4.5/4.6/3.5

Analog circuit design relies heavily on reusing existing intellectual property (IP), yet searching across heterogeneous representations such as SPICE netlists, schematics, and functional descriptions remains challenging. Existing methods are largely limited to exact matching within a single modality, failing to capture cross-modal semantic relationships. To bridge this gap, we present AnalogRetriever, a unified tri-modal retrieval framework for analog circuit search.

#88

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Research 2026-05-03 Hugging Face Daily Papers 4.2 4.5/4.6/3.5

We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems.

#89

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Research 2026-04-25 Hugging Face Daily Papers 4.2 4.5/4.6/3.5

Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes.

#90

Generative Modeling with Orbit-Space Particle Flow Matching

Research 2026-05-03 Hugging Face Daily Papers 4.2 4.5/4.6/3.5

We present Orbit-Space Geometric Probability Paths (OGPP), a particle-native flow-matching framework for generative modeling of particle systems. OGPP is motivated by two insights: (i) particles are defined up to permutation symmetries, so anonymous indexing inflates per-index target variance and yields curved, hard-to-learn flows; and (ii) particles live in physical space, so the flow terminal velocity has physical meaning and can encode geometric attributes, e.g., surface normals. OGPP instantiates three key components: (1) orbit-space canonicalization of the probability-path terminal endpoint, (2) particle index embeddings for role specialization, and (3) geometric probability paths with …

#91

Quoting Andy Masley

Industry 2026-05-04 Simon Willison's Weblog 4.2 4.5/4.7/3.0

[...] Between 2000 and 2024, farmers sold in total a Colorado-sized chunk of land all on their own, 77 times all land on data center property in 2028, and grew more food than ever on what was left. None of this caused any problems for US food access. And then, in the middle of all this, a farmer in Loudoun County sells a few acres of mediocre hay field to a hyperscaler for ten times its agricultural value, and the response is that we’re running out of farmland. (Andy Masley, pushing back against the "land use" argument against data center construction.) Tags: ai-ethics, ai, generative-ai, andy-

#92

How AI-driven botnets are reshaping cyber defense strategies

Government & Defense 2026-05-04 DefenseScoop 4.2 4.5/4.7/3.0

Defending against the next wave of AI-driven cyberattacks. Cyber threats targeting defense networks and the defense industrial base are evolving at unprecedented speed and scale. New research highlights how AI-powered botnets and low-cost, “attack-for-hire” services are enabling hyper-volumetric DDoS attacks capable of overwhelming infrastructure in seconds, often faster than traditional defenses can respond. As these attacks become more automated, distributed and difficult to detect, defense organizations must rethink how they protect mission-c

#93

Two FAA partners ramp up hiring, preparations for ATC overhaul

Government & Defense 2026-05-04 FedScoop — AI 4.2 4.5/4.7/3.0

L3Harris and Indra are working behind the scenes to increase capacity as the fiber cable and radar providers advance plans aimed at improving the FAA’s efficacy.

#94

Unsupervised Machine Learning for Detecting Structural Anomalies in European Regional Statistics

Research 2026-05-04 arXiv cs.LG 4.1 3.7/5.5/3.0

Ensuring the coherence of regional socio-economic statistics is a central task for national statistical institutes. Traditional validation tools, such as range edits, ratio checks, or univariate outlier detection, are effective for identifying extreme values in individual series but are less suited for detecting unusual combinations of indicators in high-dimensional settings. This paper proposes an unsupervised machine learning framework for identifying structurally atypical regional profiles within Europe using publicly available Eurostat data.

cs.LG

#95

Adaptive Interpolation-Synthesis for Motion In-Betweening on Keyframe-Based Animation

Research 2026-05-04 arXiv cs.LG 4.1 3.7/5.5/3.0

Motion in-betweening is one of the most artistically demanding and time-consuming stages of 3D animation, where the expressivity and rhythm of motion are defined. The level of creative control it requires makes it a major production bottleneck, underscoring the need for intelligent tools that assist animators in this process. Although recent deep learning approaches have achieved strong results in motion synthesis and in-betweening, they assume data characteristics, motion styles, and problem formulations that diverge from professional animation workflows.

cs.GR cs.LG

#96

ProPACT: A Proactive AI-Driven Adaptive Collaborative Tutor for Pair Programming

Research 2026-05-04 arXiv cs.LG 4.1 3.7/5.5/3.0

Effective pair programming depends on coordination of attention, cognitive effort, and joint regulation over time, yet most adaptive learning systems remain individual-centric and reactive. This paper introduces ProPACT, a proactive AI-driven adaptive collaborative tutor that treats collaboration itself as the object of instruction. ProPACT constructs a multimodal dyadic learner model based on Joint Visual Attention (JVA), Joint Mental Effort (JME), and individual mental effort, and employs an XGBoost-based forecasting model to predict emerging suboptimal collaboration states up to 30 seconds in advance.

cs.HC cs.AI cs.LG

#97

ParaRNN: An Interpretable and Parallelizable Recurrent Neural Network for Time-Dependent Data

Efficiency 2026-05-04 arXiv cs.LG 4.1 3.7/5.5/3.0

The proliferation of large-scale and structurally complex data has spurred the integration of machine learning methods into statistical modeling. Recurrent neural networks (RNNs), a foundational class of models for time-dependent data, can be viewed as nonlinear extensions of classical autoregressive moving average models. Despite their flexibility and empirical success in machine learning, RNNs often suffer from limited interpretability and slow training, which hinders their use in statistics.

stat.ML cs.LG

#98

Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models

Efficiency 2026-05-04 arXiv cs.LG 4.1 3.7/5.5/3.0

Preference optimization has become a central paradigm for aligning large language models with human feedback. Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback by directly optimizing pairwise preferences, removing the need for reward modeling and policy optimization. However, recent work shows that DPO exhibits a squeezing effect, where negative gradients applied to rejected responses concentrate probability mass on high-confidence predictions while suppressing alternative responses.

cs.LG
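For readers who want the baseline objective behind the DPO entry above in concrete form, here is a minimal sketch of the standard per-pair DPO loss, -log σ(β · margin), where the margin is the policy-versus-reference log-ratio of the chosen response minus that of the rejected one. This is plain DPO, not the paper's gradient-gated variant. The function names and `beta=0.1` are assumptions for illustration, and the inputs are assumed to be summed sequence log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected):
    """Implicit reward margin: chosen log-ratio minus rejected log-ratio."""
    return (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair."""
    margin = dpo_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected)
    return -log_sigmoid(beta * margin)


def dpo_margin_grad(margin: float, beta: float = 0.1) -> float:
    """d(loss)/d(margin) = -beta * sigmoid(-beta * margin).

    The magnitude is largest when the pair is badly ordered (margin << 0),
    which is when the push-down on the rejected response is strongest.
    """
    return -beta * math.exp(log_sigmoid(-beta * margin))


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))  # well-ordered pair, lower loss
    print(dpo_loss(-5.0, -1.0, -2.0, -2.0))  # mis-ordered pair, higher loss
```

The same gradient scale multiplies both the pull-up on the chosen response and the push-down on the rejected one, which is the term the squeezing-effect discussion is concerned with.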

#99

TRACED: In vivo imaging of extracellular intrinsic diffusivity, tortuosity, cell size distribution and cell density in human glioma patients

Research 2026-05-04 arXiv cs.LG 4.1 3.7/5.7/3.0

The lack of analytical models describing diffusion time dependence at intermediate time scales in complex tissue microstructure limits the accurate quantification of extracellular diffusivity and tissue microstructure. We introduce TRACED, a biophysical model that incorporates diffusion time dependence in cell distributions to quantify pathologically-relevant properties in solid tumors. Neural networks were trained on Monte Carlo diffusion simulations using sphere distribution-based geometries to enable the rapid computation of time-dependent diffusion MRI signals in cell populations of variable cell size.

physics.med-ph cs.LG eess.IV

#100

Gradient-Discrepancy Acquisition for Pool-Based Active Learning

Research 2026-05-04 arXiv cs.LG 4.1 4.7/4.6/3.0

The effectiveness of active learning hinges on the choice of the acquisition criterion by which a learning algorithm selects potentially informative data points whose label is subsequently queried. This paper proposes a novel gradient-based acquisition criterion, derived from a generalization bound introduced by Luo et al. (2022).

cs.LG

#101

MPCS: Neuroplastic Continual Learning via Multi-Component Plasticity and Topology-Aware EWC

Research 2026-05-04 arXiv cs.LG 4.1 4.0/5.3/3.0

Continual learning systems face a fundamental tension between plasticity -- acquiring new knowledge -- and stability -- retaining prior knowledge. We introduce MPCS (Multi-Plasticity Continual System), a neuroplastic architecture that integrates eleven complementary mechanisms: task-driven neurogenesis, Fourier-encoded inputs, EWC regularization, meta-replay, mixed consolidation, hybrid gating, synapse pruning/regeneration, Hebbian updates, task similarity routing, adaptive growth control, and continuous neuron importance tracking. We evaluate MPCS on MEP-BENCH, a multi-track benchmark spanning 31 tasks across regression, classification, logic, and mixed domains, using a three-dimensional Pa…

cs.LG cs.NE

#102

Mitigating Misalignment Contagion by Steering with Implicit Traits

Safety, Policy & Regulation 2026-05-04 arXiv cs.CL 4.1 3.7/5.5/3.0

Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage in multi-turn conversational social dilemma games.

cs.AI cs.CL

#103

Synthetic Users, Real Differences: an Evaluation Framework for User Simulation in Multi-Turn Conversations

Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 4.1 4.7/4.6/3.0

There is growing interest in exploring user simulation as an alternative to gathering and scoring real user-chatbot interactions for AI chatbot evaluation. For this purpose, it is important to ensure the realism of the simulation, i.e., the extent to which simulated dialogues reflect real dialogues users have with chatbots. Most existing methods for evaluating simulation realism produce only a coarse quality signal and remain solely at the level of individual dialogues.

cs.CL

#104

Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study

Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 4.1 4.0/5.3/3.0

Retrieval-Augmented Generation (RAG) offers a well-established path to grounding large language model (LLM) outputs in external knowledge, yet the question of which retrieval strategy works best in a high-stakes domain such as biomedicine has not received the controlled, multi-metric treatment it deserves. This paper presents a systematic empirical comparison of five retrieval strategies -- Dense Vector Search, Hybrid BM25 + Dense retrieval, Cross-Encoder Reranking, Multi-Query Expansion, and Maximal Marginal Relevance (MMR) -- within a biomedical question-answering RAG pipeline. All strategies share a fixed generation model (GPT-4o-mini), a common vector store (ChromaDB), and OpenAI's text-…

cs.CL cs.AI cs.IR
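Of the five strategies listed in the entry above, Maximal Marginal Relevance is the easiest to show in a few lines. The sketch below is a generic greedy MMR over toy vectors, scoring each candidate as λ · relevance − (1 − λ) · redundancy. It is not the paper's pipeline: the function names, the `lam` default, and the two-dimensional "embeddings" are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query, docs, k, lam=0.5):
    """Greedy MMR: return indices of up to k documents.

    lam=1.0 ranks purely by relevance to the query; lower values
    penalise similarity to documents that were already selected.
    """
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query, docs[i])
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.2]
    docs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates
    print(mmr_select(query, docs, k=2, lam=1.0))  # relevance only
    print(mmr_select(query, docs, k=2, lam=0.3))  # diversity-heavy
```

With the toy data, the relevance-only setting returns the two near-duplicates, while the diversity-heavy setting swaps the second slot for the dissimilar document.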

#105

When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 4.1 4.0/5.3/3.0

Deployed language models must produce outputs that are both correct and format-compliant. We study this structured-output reliability gap using two mathematical benchmarks -- GSM8K and MATH -- as a controlled testbed: ground truth is unambiguous and the output contract is strict (JSON with required fields). We evaluate three 7-9B models under five prompting strategies and report output accuracy -- the joint event of mathematical correctness and valid JSON structure -- as the primary metric.

cs.CL cs.AI cs.LG
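The primary metric in the entry above, output accuracy as the joint event of a correct answer and valid JSON, can be sketched as follows. The required field names (`reasoning`, `answer`) and the numeric tolerance are hypothetical. The abstract says only that the contract is JSON with required fields, so treat this as one plausible reading rather than the paper's scorer.

```python
import json

# Hypothetical output contract for this sketch; the paper's fields are not given.
REQUIRED_FIELDS = ("reasoning", "answer")


def parse_structured(raw):
    """Return the parsed object if `raw` is a JSON object with all required fields."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict):
        return None
    if any(field not in obj for field in REQUIRED_FIELDS):
        return None
    return obj


def is_output_accurate(raw, gold, tol=1e-9):
    """Joint event: format-compliant AND numerically equal to the gold answer."""
    obj = parse_structured(raw)
    if obj is None:
        return False
    try:
        value = float(obj["answer"])
    except (TypeError, ValueError):
        return False
    return abs(value - gold) <= tol


def output_accuracy(samples):
    """Fraction of (raw_output, gold_answer) pairs that pass the joint check."""
    if not samples:
        return 0.0
    return sum(is_output_accurate(raw, gold) for raw, gold in samples) / len(samples)


if __name__ == "__main__":
    samples = [
        ('{"reasoning": "2 + 2", "answer": 4}', 4.0),     # correct and compliant
        ("The answer is 4", 4.0),                         # correct but not JSON
        ('{"answer": 4}', 4.0),                           # missing required field
        ('{"reasoning": "guess", "answer": "5"}', 4.0),   # compliant but wrong
    ]
    print(output_accuracy(samples))
```

Only the first sample counts: the second and third are mathematically right but unusable under the contract, which is the gap the title refers to.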

#106

Semantic Risk-Aware Heuristic Planning for Robotic Navigation in Dynamic Environments: An LLM-Inspired Approach

Robotic Autonomy 2026-05-04 arXiv cs.RO 4.1 3.7/5.5/3.0

The integration of Large Language Model (LLM) reasoning principles into classical robot path planning represents a rapidly emerging research direction. In this paper, we propose a Semantic Risk-Aware Heuristic (SRAH) planner that encodes LLM-inspired cost functions penalising geometrically cluttered or high-risk zones into an A* search framework, augmented with closed-loop replanning upon dynamic obstacle detection. We evaluate SRAH against two established baselines, Breadth-First Search (BFS) with replanning and a Greedy heuristic without replanning, across 200 randomised trials in a 15×15 grid-world with 20% static obstacle density and stochastic dynamic obstacles.

cs.RO
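The general idea in the SRAH entry above, folding a risk penalty into the A* step cost, can be sketched on a small grid. This is not the paper's planner: the per-cell `risk` map, the `risk_weight` parameter, and the additive cost form are assumptions for illustration, and replanning on dynamic obstacles is left out.

```python
import heapq


def risk_aware_astar(grid, risk, start, goal, risk_weight=1.0):
    """A* on a 4-connected grid where entering a cell costs 1 + risk_weight * risk.

    grid[r][c] is truthy for static obstacles; risk[r][c] is a non-negative
    penalty. Returns the path as a list of (row, col) cells, or None.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance stays admissible because every step costs >= 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0.0, start)]
    best_cost = {start: 0.0}
    parent = {start: None}

    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cost > best_cost[cell]:
            continue  # stale queue entry
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc]:
                continue
            new_cost = cost + 1.0 + risk_weight * risk[nr][nc]
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


if __name__ == "__main__":
    grid = [[0] * 3 for _ in range(3)]
    risk = [[0.0] * 3 for _ in range(3)]
    risk[0][1] = 5.0  # a hypothetical high-risk cell on the direct route
    print(risk_aware_astar(grid, risk, (0, 0), (0, 2), risk_weight=0.0))
    print(risk_aware_astar(grid, risk, (0, 0), (0, 2), risk_weight=1.0))
```

With `risk_weight=0.0` the planner takes the direct route through the risky cell; with `risk_weight=1.0` it accepts a longer path that detours around it.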

#107

DynoSLAM: Dynamic SLAM with Generative Graph Neural Networks for Real-World Social Navigation

Robotic Autonomy 2026-05-04 arXiv cs.RO 4.1 3.7/5.5/3.0

Traditional Simultaneous Localization and Mapping (SLAM) algorithms rely heavily on the static environment assumption, which severely limits their applicability in real-world spaces populated by moving entities, such as pedestrians. In this work, we propose DynoSLAM, a tightly-coupled Dynamic GraphSLAM architecture that integrates socially-aware Graph Neural Networks (GNNs) directly into the factor graph optimization. Unlike conventional approaches that use rigid constant-velocity heuristics or deterministic single-agent neural priors, our framework formulates pedestrian motion forecasting as a stochastic World Model.

cs.RO cs.CV

#108

Sim-to-Real Transfer and Robustness Evaluation of Reinforcement Learning Control with Integrated Perception on an ASV for Floating Waste Capture

Robotic Autonomy 2026-05-04 arXiv cs.RO 4.1 4.7/4.6/3.0

Autonomous surface vessels for floating-waste removal operate under varying hydrodynamics, external disturbances, and challenging water-surface perception. We present a field-validated system that combines camera-based polarimetric perception with a lightweight DRL-based controller for floating-waste detection and capture. Camera detections are converted into water-surface target points and tracked by a controller trained entirely in simulation and deployed directly on a retrofitted ASV platform.

cs.RO

#109

Visibility-Aware Mobile Grasping in Dynamic Environments

Robotic Autonomy 2026-05-04 arXiv cs.RO 4.1 3.7/5.5/3.0

This paper addresses the problem of mobile grasping in dynamic, unknown environments where a robot must operate under a limited field-of-view. The fundamental challenge is the inherent trade-off between "seeing" around to reduce environmental uncertainty and "moving" the body to achieve task progress in a high-dimensional configuration space, subject to visibility constraints. Previous approaches often assume known or static environments and decouple these objectives, failing to guarantee safety when unobserved dynamic obstacles intersect the robot's path during manipulation.

cs.RO

#110

SAGA: A Robust Self-Attention and Goal-Aware Anchor-based Planner for Safe UAV Autonomous Navigation

Robotic Autonomy 2026-05-04 arXiv cs.RO 4.1 3.7/5.5/3.0

Agile unmanned aerial vehicle (UAV) navigation in cluttered environments demands a planning architecture that is both computationally efficient and structurally expressive enough to reason over multiple feasible motions. This paper presents SAGA, a robust self-attention and goal-aware anchor-based planner for safe UAV autonomous navigation. SAGA formulates local planning as a one-stage joint regression-and-ranking problem over a fixed lattice of motion anchors.

cs.RO

#111

HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar

Multimodal 2026-05-04 arXiv cs.CV 4.1 3.7/5.5/3.0

Accurately recovering human pose and appearance from video is an essential component of scene reconstruction, with applications to motion capture, motion prediction, virtual reality, and digital twinning. Despite significant interest in building realistic human avatars from video, this paper demonstrates that existing methods do not accurately recover the 3D geometry of humans. ViT-based approaches are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting-based avatars treat pose and appearance separately, limiting rendering generalization to new poses.

cs.CV

#112

Unified Map Prior Encoder for Mapping and Planning

Multimodal 2026-05-04 arXiv cs.CV 4.1 3.7/5.5/3.0

Online mapping and end-to-end (E2E) planning in autonomous driving remain largely sensor-centric, leaving rich map priors, including HD/SD vector maps, rasterized SD maps, and satellite imagery, underused because of heterogeneity, pose drift, and inconsistent availability at test time. We present UMPE, a Unified Map Prior Encoder that can ingest any subset of four priors and fuse them with BEV features for both mapping and planning. UMPE has two branches.

cs.CV

#113

April 2026 newsletter

Frontier LLMs 2026-05-04 Simon Willison's Weblog 4.1 4.8/4.0/3.0

I just sent out the April edition of my sponsors-only monthly newsletter. If you are a sponsor (or if you start a sponsorship now) you can access it here. In this month's newsletter: Opus 4.7 and GPT-5.5, both with price increases; Claude Mythos and LLM security research; ChatGPT Images 2.0; more model releases; other highlights from my blog; What I'm using, April 2026 edition. Here's a copy of the March newslett

#114

TRE Python binding — ReDoS robustness demo

Industry 2026-05-04 Simon Willison's Weblog 4.1 4.8/4.0/3.0

If it's good enough for antirez to add to Redis I figured Ville Laurikari's TRE regular expression engine was worth exploring in a little more detail. I had Claude Code build an experimental Python binding (it used ctypes) and try some malicious regular expression attacks against the library. TRE handles those much better than Python's standard library implementation, thanks mainly to the lack of support for backtracking. Tags: security, python, regular-expressions, c, ctypes
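The backtracking problem on the standard-library side is easy to reproduce. The sketch below uses only Python's `re` module and makes no assumption about the experimental TRE binding's API. It times a nested-quantifier pattern against a flat pattern that accepts the same strings, on inputs that almost match. The helper names and the chosen lengths are illustrative.

```python
import re
import time

# Nested quantifier: on a near-miss input, the backtracking engine tries an
# exponential number of ways to split the run of "a" characters.
EVIL = re.compile(r"^(a+)+$")
# Accepts exactly the same strings, with no nested quantifier.
SAFE = re.compile(r"^a+$")


def time_match(pattern, text):
    """Return (matched, elapsed_seconds) for one match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


def demo(lengths=(12, 16, 20)):
    """Print timings; lengths are kept small so the demo finishes quickly."""
    for n in lengths:
        text = "a" * n + "b"  # the trailing "b" forces the match to fail
        _, evil_seconds = time_match(EVIL, text)
        _, safe_seconds = time_match(SAFE, text)
        print(f"n={n:2d}  nested={evil_seconds:.4f}s  flat={safe_seconds:.6f}s")


if __name__ == "__main__":
    demo()
```

The nested pattern's time should grow roughly geometrically with each extra character, while the flat pattern stays effectively constant. Exact timings depend on the machine, so none are quoted here.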

#115

ARA: Agentic Reproducibility Assessment For Scalable Support Of Scientific Peer-Review

Agents & Tool Use 2026-05-04 arXiv cs.LG 4.0 3.7/5.3/3.0

Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents.

cs.DL cs.LG

#116

Gradient Boosted Risk Scores

Research 2026-05-04 arXiv cs.LG 4.0 3.7/5.3/3.0

Risk scores are an interpretable and actionable class of machine learning models with applications in medicine, insurance, and risk management. Unlike most computational methods, risk scores are designed to be computed by a human by attributing points to a data sample based on a limited set of criteria. The most common approaches for generating risk scores use linear regressions to estimate the effect of selected variables.

cs.LG

#117

Shadow-Loom: Causal Reasoning over Graphical World Model of Narratives

Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 4.0 3.7/5.3/3.0

Stories hold a reader's attention because they have causes, secrets, and consequences. Shadow-Loom is an experimental open-source framework that turns a narrative into a versioned graphical world model and lets two engines act on it: a causal physics grounded in Pearl's ladder of causation and a recently proposed counterfactual calculus over Ancestral Multi-World Networks; and a narrative physics that scores the same graph against four structural reader-states -- mystery, dramatic irony, suspense, and surprise -- in the tradition of Sternberg's curiosity/suspense/surprise triad, with suspense formalised in the structural-affect line of work on story comprehension and computational suspense. …

cs.AI cs.CL

#118

From Sensors to Insight: Rapid, Edge-to-Core Application Development for Sensor-Driven Applications

Evaluations & Benchmarks 2026-05-04 arXiv cs.AI 4.0 3.7/5.3/3.0

Scientists increasingly rely on sensor-based data, yet transforming raw streams into insights across the edge-to-cloud continuum remains difficult. Provisioning heterogeneous infrastructure and managing execution on emerging platforms like Data Processing Units typically requires cross-domain expertise, creating significant barriers to rapid prototyping. This paper introduces an experience-driven methodology for the rapid development of sensor-driven applications.

cs.DC cs.AI cs.SE

#119

Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference

Robotic Autonomy 2026-05-04 arXiv cs.RO 4.0 3.7/5.3/3.0

Dual-system Vision-Language-Action (VLA) models achieve state-of-the-art robotic manipulation but are bottlenecked by the VLM backbone, which must execute at every control step while producing temporally redundant features. We propose Latent Bridge, a lightweight model that predicts VLM output deltas between timesteps, enabling the action head to operate on predicted outputs while the expensive VLM backbone is called only periodically. We instantiate Latent Bridge on two architecturally distinct VLAs: GR00T-N1.6 (feature-space bridge) and π0.5 (KV-cache bridge), demonstrating that the approach generalizes across VLA designs.

cs.RO

#120

ShapeGrasp: Simultaneous Visuo-Haptic Shape Completion and Grasping for Improved Robot Manipulation

Robotic Autonomy 2026-05-04 arXiv cs.RO 4.0 3.7/5.3/3.0

Humans grasp unfamiliar objects by combining an initial visual estimate with tactile and proprioceptive feedback during interaction. We present ShapeGrasp, a robotic implementation of this approach. The proposed method is an iterative grasp-and-complete pipeline that couples implicit surface visuo-haptic shape completion (creation of full 3D shape from partial information) with physics-based grasp planning.

cs.RO

#121

AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding

Multimodal 2026-05-04 arXiv cs.CV 4.0 5.0/4.0/3.0

Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored.

cs.CV

#122

The Bayesian Reflex: Online Learning as the Autonomic Nervous System of Modern and Future AI

Research 2026-05-04 arXiv stat.ML 4.0 3.7/5.3/3.0

This chapter introduces the Bayesian reflex -- an analogy with the autonomic nervous system -- as a unifying framework for online learning in AI. Bayesian online algorithms automatically maintain equilibrium in dynamic environments via three mechanisms: belief maintenance through probabilistic representations, sequential updating via Bayes' theorem, and uncertainty-driven action balancing exploration and exploitation. We survey online Bayesian methods, highlighting two computational principles: the look-up table principle for sequential inference in function space, and the ellipsoidal decomposition framework for nearly exact i.i.d.

stat.ME stat.ML
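
The three mechanisms named in the abstract (a probabilistic belief, sequential Bayes updates, and uncertainty-driven action selection) can be illustrated with a textbook Beta-Bernoulli bandit. This is a sketch chosen for illustration; Thompson sampling is a standard technique swapped in here, not a method the chapter is confirmed to use, and all names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class BetaBelief:
    """Belief over a Bernoulli success rate, stored as Beta(alpha, beta)."""
    alpha: float = 1.0  # uniform prior, an assumption
    beta: float = 1.0

    def update(self, reward: int) -> None:
        # Conjugate Bayes update: a success bumps alpha, a failure bumps beta.
        self.alpha += reward
        self.beta += 1 - reward

    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def sample(self, rng: random.Random) -> float:
        return rng.betavariate(self.alpha, self.beta)


def thompson_step(beliefs: List[BetaBelief], rng: random.Random) -> int:
    """Pick the arm whose sampled success rate is highest (explore/exploit)."""
    draws = [b.sample(rng) for b in beliefs]
    return max(range(len(beliefs)), key=draws.__getitem__)


belief = BetaBelief()
for outcome in [1, 1, 0, 1]:
    belief.update(outcome)  # posterior is now Beta(4, 2)
chosen_arm = thompson_step([belief, BetaBelief()], random.Random(0))
```

Arms with wide posteriors occasionally produce high draws, so they keep getting tried until the data narrows the belief.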

#123

End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Research 2026-04-30 Hugging Face Daily Papers 4.0 4.5/4.0/3.5

Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an end-to-end training pipeline that jointly optimizes reconstruction and generation, enabling direct supervision from generation results to the tokenizer. This contrasts with prior two-stage approaches that train tokenizers and generative models separately.

#124

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Research 2026-04-30 Hugging Face Daily Papers 4.0 4.5/4.0/3.5

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105.

#125

Motion-Aware Caching for Efficient Autoregressive Video Generation

Research 2026-05-02 Hugging Face Daily Papers 4.0 4.5/4.0/3.5

Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping.

#126

Soft Anisotropic Diagrams for Differentiable Image Representation

Research 2026-04-26 Hugging Face Daily Papers 4.0 3.7/5.3/3.0

We introduce Soft Anisotropic Diagrams (SAD), an explicit and differentiable image representation parameterized by a set of adaptive sites in the image plane. In SAD, each site specifies an anisotropic metric and an additively weighted distance score, and we compute pixel colors as a softmax blend over a small per-pixel top-K subset of sites. We induce a soft anisotropic additively weighted Voronoi partition (i.e., an Apollonius diagram) with learnable per-site temperatures, preserving informative gradients while allowing clear, content-aligned boundaries and explicit ownership.
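A per-pixel evaluation of the kind described above might look like the following sketch. The abstract does not give the exact formulas, so the score (a metric-induced distance minus an additive weight), the way temperatures enter the softmax, and every field name here are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def site_score(px: Sequence[float], site: Dict) -> float:
    """Assumed score: anisotropic distance to the site minus its additive weight."""
    dx, dy = px[0] - site["pos"][0], px[1] - site["pos"][1]
    (a, b), (_, c) = site["metric"]  # symmetric 2x2 metric [[a, b], [b, c]]
    quad = a * dx * dx + 2.0 * b * dx * dy + c * dy * dy
    return math.sqrt(max(quad, 0.0)) - site["weight"]


def pixel_color(px: Sequence[float], sites: List[Dict], k: int = 2) -> Tuple[float, ...]:
    """Softmax-blend the colors of the k lowest-scoring sites for this pixel."""
    nearest = sorted(sites, key=lambda s: site_score(px, s))[:k]
    logits = [-site_score(px, s) / s["temp"] for s in nearest]
    top = max(logits)                                  # subtract max for stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return tuple(
        sum(w * s["color"][ch] for w, s in zip(weights, nearest)) / total
        for ch in range(3)
    )


identity = ((1.0, 0.0), (0.0, 1.0))
demo_sites = [
    {"pos": (0.0, 0.0), "metric": identity, "weight": 0.0, "temp": 0.1, "color": (1.0, 0.0, 0.0)},
    {"pos": (10.0, 0.0), "metric": identity, "weight": 0.0, "temp": 0.1, "color": (0.0, 0.0, 1.0)},
]
on_site = pixel_color((0.0, 0.0), demo_sites)
midpoint = pixel_color((5.0, 0.0), demo_sites)
```

In this toy setup a pixel sitting on the red site comes out essentially pure red, while the midpoint between the two sites is an even blend; lowering the temperature sharpens the boundary.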

#127

Register now for OpenClaw: After Hours @ GitHub

Agents & Tool Use 2026-05-04 GitHub Blog — AI & ML 4.0 4.8/4.0/3.0

OpenClaw, one of the fastest-growing open source projects, has already picked up over 350,000 stars and an early community of builders exploring what agentic systems can actually do in practice. That’s why, on June 3, 2026, we are hosting OpenClaw: After Hours at GitHub HQ in San Francisco. The event will take place during Microsoft Build 2026. This evening is a chance to bring the OpenClaw community together in the same room.

#128

Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection

Efficiency 2026-05-04 arXiv cs.LG 3.9 4.0/4.7/3.0

Cross-language code clone detection (X-CCD) is challenging because semantically equivalent programs written in different languages often share little surface similarity. Although large language models (LLMs) have shown promise for semantic clone detection, their use as black-box systems raises concerns about cost, reproducibility, privacy, and unreliable output formatting. In particular, compact open-source models often struggle to follow reasoning-oriented prompts and to produce outputs that can be consistently mapped to binary clone labels.

cs.AI cs.LG cs.SE

#129

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

Multimodal 2026-05-04 arXiv cs.LG 3.9 4.0/4.6/3.0

Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities.

cs.CV cs.LG

#130

HARMES: A Multi-Modal Dataset for Wearable Human Activity Recognition with Motion, Environmental Sensing and Sound

Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 3.9 4.0/4.6/3.0

With each sensing modality exhibiting inherent strengths and limitations, multi-modal approaches for wearable Human Activity Recognition (HAR) are becoming increasingly relevant -- particularly for recognizing Activities of Daily Living (ADLs), where individual modalities often produce ambiguous signals for similar or complex activities. This work introduces HARMES, a multi-modal wearable dataset combining three wrist-recorded modalities: motion sensing via an Inertial Measurement Unit (IMU), atmospheric environmental sensors (humidity, temperature, and pressure), and audio. Collected from 20 participants performing household activities in their own homes, HARMES totals over 80 hours of reco…

cs.LG

#131

StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

Research 2026-05-04 arXiv cs.LG 3.9 4.0/4.7/3.0

DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection over compressed keys) scores them, the top-k are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S, H_I, T] FP32 score tensor before the top-k reduction. With H_I=64 indexer heads and the V4-Flash compression ratio m=4, that intermediate is 256 GB at sequence length S=65,536, exceeding any single-GPU high-bandwidth-memory (HBM) budget.

cs.LG cs.PF
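
The 256 GB figure can be reproduced with a few lines of arithmetic, and the memory-bounded alternative can be illustrated with a running top-k. This is a sketch under stated assumptions: batch size B=1 and T = S/m compressed keys are inferred rather than stated in the abstract, and the heap-based `streaming_topk` only illustrates the general streaming idea, not the paper's kernel.

```python
import heapq
from typing import Iterable, List, Tuple


def score_tensor_bytes(batch: int, seq_len: int, indexer_heads: int,
                       compression: int, bytes_per_elem: int = 4) -> int:
    """Size of a dense [B, S, H_I, T] FP32 score tensor, assuming T = S / m."""
    compressed_keys = seq_len // compression
    return batch * seq_len * indexer_heads * compressed_keys * bytes_per_elem


def streaming_topk(score_chunks: Iterable[List[Tuple[int, float]]],
                   k: int) -> List[Tuple[float, int]]:
    """Consume (key_index, score) chunks while keeping only the k best in memory."""
    best: List[Tuple[float, int]] = []  # min-heap ordered by score
    for chunk in score_chunks:
        for idx, score in chunk:
            if len(best) < k:
                heapq.heappush(best, (score, idx))
            elif score > best[0][0]:
                heapq.heapreplace(best, (score, idx))  # evict the current worst
    return sorted(best, reverse=True)


dense_bytes = score_tensor_bytes(batch=1, seq_len=65_536, indexer_heads=64, compression=4)
dense_gib = dense_bytes / 2**30  # 2**38 bytes, i.e. 256 GiB

chunks = [[(0, 0.1), (1, 0.9)], [(2, 0.4), (3, 0.7)], [(4, 0.3), (5, 0.8)]]
top3 = streaming_topk(chunks, k=3)  # [(0.9, 1), (0.8, 5), (0.7, 3)]
```

The dense tensor grows with S times T, whereas the streaming version holds only k entries per query at any time, which is the property that makes a bounded memory budget possible.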

#132

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.9 4.0/4.6/3.0

As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes.

cs.CL
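
One way to picture an orchestration trace is as a time-ordered list of typed events between agents. The event kinds below are taken from the abstract; the class names, fields, and helper methods are hypothetical and only sketch a possible representation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set, Tuple


class EventKind(Enum):
    SPAWN = "spawn"
    DELEGATE = "delegate"
    COMMUNICATE = "communicate"
    TOOL_USE = "tool_use"
    RETURN = "return"
    AGGREGATE = "aggregate"
    STOP = "stop"


@dataclass
class TraceEvent:
    time: int
    kind: EventKind
    source: str
    target: Optional[str] = None  # None for events with no counterpart


@dataclass
class OrchestrationTrace:
    events: List[TraceEvent] = field(default_factory=list)

    def add(self, kind: EventKind, source: str, target: Optional[str] = None) -> None:
        # Timestamps are just insertion order in this sketch.
        self.events.append(TraceEvent(len(self.events), kind, source, target))

    def edges(self) -> List[Tuple[str, str, str]]:
        """Directed edges of the interaction graph, labeled by event kind."""
        return [(e.source, e.target, e.kind.value) for e in self.events if e.target]

    def agents(self) -> Set[str]:
        names = {e.source for e in self.events}
        names |= {e.target for e in self.events if e.target}
        return names


trace = OrchestrationTrace()
trace.add(EventKind.SPAWN, "planner", "worker-1")
trace.add(EventKind.DELEGATE, "planner", "worker-1")
trace.add(EventKind.TOOL_USE, "worker-1")
trace.add(EventKind.RETURN, "worker-1", "planner")
trace.add(EventKind.STOP, "planner")
```

A reward signal could then be attached per event or per trace; which of those an RL method uses is outside what this sketch assumes.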

#133

When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition

Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.9 4.0/4.6/3.0

Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech. Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but it is unclear whether these models can make use of such information. We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech.

cs.AI cs.CL eess.AS

#134

Fuzzy Fingerprinting Encoder Pre-trained Language Models for Emotion Recognition in Conversations: Human Assessment and Validity Study

Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.9 4.0/4.6/3.0

In Emotion Recognition in Conversations (ERC), model decisions should align with nuanced human perception and ideally provide insights on the classification process. Standard encoder pre-trained language models (PLMs) are state of the art on these tasks but offer little insight into why a certain prediction is made. This is especially problematic in imbalanced datasets, where most utterances are labeled as neutral, making these models frequently misclassify minority emotions as the majority neutral class.

cs.CL cs.AI

#135

A multilingual hallucination benchmark: MultiWikiQHalluA

Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.9 4.0/4.6/3.0

Most hallucination evaluations focus on English, leaving it unclear whether findings transfer to lower-resource languages. We investigate faithfulness hallucinations, defined as model-generated content that is fluent and plausible but diverges from the provided input or is internally inconsistent. Leveraging the multilingual MultiWikiQA dataset, we utilize the LettuceDetect framework to create synthetic hallucination datasets for 306 languages, from which we train token-level hallucination classifiers for 30 European languages.

cs.CL

#136

Accurate Legal Reasoning at Scale: Neuro-Symbolic Offloading and Structural Auditability for Robust Legal Adjudication

Frontier LLMs 2026-05-04 arXiv cs.CL 3.9 4.0/4.7/3.0

Legal texts often contain computational legal clauses: provisions whose understanding requires complex logic. While frontier Large Reasoning Models (LRMs) can describe such clauses, building production-ready systems is limited by reasoning errors and the high cost of inference. We propose Amortized Intelligence, a neuro-symbolic approach where we use an LLM once to translate a legal text into Deterministic Autonomous Contract Language (DACL): a typed graph intermediate representation.

cs.CL

#137

ATLAS: Article Tracking, Linking, and Analysis of Swedish Encyclopedias

Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.9 4.0/4.6/3.0

The digitization of old encyclopedias represents an important step to improve access to historically structured knowledge. Often, however, this process does not go beyond optical character recognition, leaving all the underlying structure unexploited. In addition, many encyclopedias had multiple editions reflecting the evolution of knowledge.

cs.CL

#138

Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation

Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.9 4.0/4.6/3.0

Large language models pick up social biases from the data they are trained on and carry those biases into downstream applications, often reinforcing stereotypes around gender, race, religion, disability, age, and socioeconomic status. The standard fixes (retraining on curated data or fine-tuning with human feedback) are expensive, need access to model weights, and risk degrading the model on other tasks. In this paper we take a different route: we debias the model at decoding time, treating bias mitigation as a structured search over candidate tokens without ever touching model weights.

cs.CL cs.LG
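
Decoding-time mitigation without weight access can be illustrated with the simplest form of reward-guided search: rerank the base model's candidate tokens by log-probability plus a weighted score from an external scorer. This is a sketch only; `reward_fn`, `strength`, and the toy penalty are hypothetical, and the paper's actual search procedure and reward model are not described in the excerpt.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (token, log-probability from the frozen model)


def rerank(candidates: List[Candidate],
           reward_fn: Callable[[str], float],
           strength: float = 1.0) -> List[Candidate]:
    """Order candidates by logprob + strength * reward; model weights are untouched."""
    return sorted(
        candidates,
        key=lambda c: c[1] + strength * reward_fn(c[0]),
        reverse=True,
    )


def toy_reward(token: str) -> float:
    # Hypothetical stand-in for a learned reward model: penalize one stereotyped choice.
    return -2.0 if token == "he" else 0.0


candidates = [("he", -0.2), ("they", -0.9), ("she", -1.5)]
plain_choice = rerank(candidates, toy_reward, strength=0.0)[0][0]
guided_choice = rerank(candidates, toy_reward, strength=1.0)[0][0]
```

With `strength=0.0` the ranking falls back to the base model's preference, so the weight acts as a dial between fluency and the scorer's judgment.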

#139

Virtual Scanning for NSCLC Histology: Investigating the Discriminatory Power of Synthetic PET

Generative Media 2026-05-04 arXiv cs.AI 3.9 4.0/4.6/3.0

Accurate histological differentiation between adenocarcinoma (ADC) and squamous cell carcinoma (SCC) is critical for personalized treatment in non-small cell lung cancer (NSCLC). While [18F]FDG PET/CT is a standard tool for the clinical evaluation of lung cancer, its utility is often limited by high costs and radiation exposure. In this paper, we investigate the feasibility of "virtual scanning" as a feature-enhancement strategy by evaluating whether synthetic PET data can provide complementary feature representations to supplement anatomical CT scans in histological subtype classification.

cs.CV cs.AI

#140

Triple Spectral Fusion for Sensor-based Human Activity Recognition

Multimodal 2026-05-04 arXiv cs.AI 3.9 4.0/4.6/3.0

The field of sensor-based human activity recognition (HAR) mainly uses posture, motion and context data of Inertial Measurement Units (IMUs) to identify daily activities. Despite the advancements in learning-based methods, it is challenging to perform information fusion from the temporal perspective due to the complexities in fusing heterogeneous sensor data and establishing long-term context correlations. This paper proposes a novel triple spectral fusion framework tailored for HAR.

cs.AI cs.CV cs.HC

#141

Tensegrity crutches with compliance from a pre-stressed self-tensile module improve ground reaction force profiles, speed, effort, comfort, and perceived stability

Robotic Autonomy 2026-05-04 arXiv cs.RO 3.9 4.7/4.0/3.0

Purpose: Six million people use crutches as mobility aids in the US. Rigid designs with no axial mobility limit sensory feedback and lead to secondary injury to the upper joints. Spring-loaded designs offer compliance but may compromise stability.

cs.RO nlin.AO

#142

Human Activity Recognition Method for Moderate Violence Detection

Multimodal 2026-05-04 arXiv cs.CV 3.9 4.0/4.7/3.0

Physical violence in public spaces is a significant public health concern, with minor incidents such as pushing often serving as precursors to more severe escalations. This research develops an automated system for the real-time detection of moderate physical violence, specifically pushing, in surveillance camera footage. The proposed solution integrates state-of-the-art computer vision models, utilizing YOLO11 and YOLO11-Pose for human detection and skeletal keypoint extraction.

cs.CV

#143

Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning

Efficiency 2026-05-04 arXiv stat.ML 3.9 3.2/5.5/3.0

The Distributional Alignment Game framework provides a powerful variational perspective on Answer-Level Fine-Tuning (ALFT). However, standard algorithms for these games rely on estimating logarithmic rewards from small batches, introducing a systematic bias due to Jensen's inequality that can destabilize training. In this paper, we systematically resolve this structural estimation bias.

cs.LG stat.ML
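
The Jensen bias mentioned in the abstract is easy to see in a toy case: the log of a small-batch mean underestimates the log of the true mean on average, and the gap shrinks as the batch grows. The sketch below enumerates all batches exactly for a two-valued reward; the reward values and the plug-in estimator are assumptions for illustration, not the paper's setup.

```python
import itertools
import math
from typing import Sequence


def expected_log_batch_mean(values: Sequence[float], batch_size: int) -> float:
    """Exact E[log(batch mean)] when each batch element is uniform over `values`."""
    batches = itertools.product(values, repeat=batch_size)
    logs = [math.log(sum(batch) / batch_size) for batch in batches]
    return sum(logs) / len(logs)


values = [1.0, 3.0]                                  # toy reward values (assumed)
true_log = math.log(sum(values) / len(values))       # log E[X] = log 2

bias_1 = expected_log_batch_mean(values, 1) - true_log
bias_4 = expected_log_batch_mean(values, 4) - true_log
bias_8 = expected_log_batch_mean(values, 8) - true_log
```

Every bias is negative because the logarithm is concave, so averaging inside the log and averaging outside it are not interchangeable for finite batches.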

#144

Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

Research 2026-05-04 arXiv cs.LG 3.8 3.7/4.7/3.0

Understanding whether deep neural networks are effectively optimized remains challenging, as training occurs in highly nonconvex landscapes and standard metrics provide limited visibility into layer-wise learning quality. This challenge is particularly acute for transformer-based language models, where training is expensive, models are often reused in frozen form, and poorly optimized layers can silently degrade performance. We propose a layer-wise peeling framework for monitoring training dynamics, in which each transformer layer is locally optimized against intermediate representations of the trained model.

cs.LG

#145

Fine-Grained Graph Generation through Latent Mixture Scheduling

AI for Science 2026-05-04 arXiv cs.LG 3.8 3.7/4.7/3.0

Structure-aware graph generation aims to generate graphs that satisfy given topological properties. It has applications in domains such as drug discovery, social network modeling, and knowledge graph construction. Unlike existing methods that only provide coarse control over graph properties, we introduce a novel conditional variational autoencoder for fine-grained structural control in graph generation.

cs.AI cs.LG

#146

U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning

Reinforcement Learning 2026-05-04 arXiv cs.LG 3.8 3.7/4.6/3.0

LLMs are increasingly used for end-user task planning, yet their black-box nature limits users' ability to ensure reliability and control. While recent systems incorporate verification techniques, it remains unclear how users can effectively apply such rigid constraints to represent intent or adapt to real-world variability. For example, prior work finds that hard-only constraints are too rigid, and numeric flexibility weights confuse users.

cs.AI cs.HC cs.LG

#147

Federated Reinforcement Learning for Efficient Mobile Crowdsensing under Incomplete Information

Efficiency 2026-05-04 arXiv cs.LG 3.8 3.7/4.6/3.0

Mobile crowdsensing (MCS) is a distributed sensing architecture that utilizes existing sensors on mobile units (MUs) to perform sensing tasks. A mobile crowdsensing platform (MCSP) publishes the sensing tasks and the MUs decide whether to participate in exchange for money. The MCS system is dynamic: the task requirements, the MUs' availability, and their available resources change over time.

cs.LG cs.NI

#148

Learning Equivariant Neural-Augmented Object Dynamics From Few Interactions

Robotic Autonomy 2026-05-04 arXiv cs.LG 3.8 3.7/4.7/3.0

Learning data-efficient object dynamics models for robotic manipulation remains challenging, especially for deformable objects. A popular approach is to model objects as sets of 3D particles and learn their motion using graph neural networks. In practice, this is not enough to maintain physical feasibility over long horizons and may require large amounts of interaction data to learn.

cs.RO cs.AI cs.CV cs.LG

#149

CARD: Coarse-to-fine Autoregressive Modeling with Radix-based Decomposition for Transferable Free Energy Estimation

Reinforcement Learning 2026-05-04 arXiv cs.LG 3.8 3.7/4.7/3.0

Estimating free energy differences quantifies thermodynamic preferences in molecular interactions, which is central to chemistry and drug discovery. Despite fruitful progress, existing methods still face key limitations: classical computational approaches remain prohibitively expensive due to their reliance on extensive molecular dynamics simulations, while deep learning-based methods are constrained by either less-expressive generative models or input dimensions tied to a specific system, resulting in negligible generalization. To address these challenges, we propose CARD, a generative framework that employs a novel radix-based decomposition to bijectively convert 3D coordinates into mixed …

cs.LG

#150

CNNs for Vis-NIR Chemometrics: From Contradiction to Conditional Design

Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 3.8 3.7/4.7/3.0

Near-infrared (NIR; a.k.a. NIRS) deep-learning studies in chemometrics increasingly report mutually inconsistent conclusions regarding convolutional neural network (CNN) design, including small versus large kernels, shallow versus deep architectures, raw spectra versus preprocessing, and single-domain training versus transfer learning. As a result, the same architecture can appear superior in one study and inferior in another, creating a practical impasse for chemometric practitioners. In this review, we argue that these contradictions are not evidence of irreconcilable methods but a structurally expected consequence of uncontrolled moderating variables.

cs.LG physics.optics

#151

Evaluating Tabular Representation Learning for Network Intrusion Detection

Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 3.8 3.7/4.6/3.0

Classic Network Intrusion Detection Systems (NIDS) often rely on manual feature engineering to extract meaningful patterns from network traffic data. However, this approach requires domain expertise and runs counter to the widely adopted principle of modern machine learning and neural networks: that models themselves should learn meaningful representations directly from data. We investigate whether tabular representation learning techniques can improve intrusion detection performance by automatically learning robust feature representations for NetFlow data.

cs.LG cs.CR

#152

A Novel Preprocessing-Driven Approach to Remaining Useful Life (RUL) Prediction Using Temporal Convolutional Networks (TCN)

Research 2026-05-04 arXiv cs.LG 3.8 3.7/4.6/3.0

Accurate prediction of Remaining Useful Life (RUL) in aero-engines is vital for predictive maintenance, improved operational reliability, and reduced lifecycle costs. While deep learning approaches have demonstrated strong potential in this area, most existing methods focus primarily on model architecture design and treat input features uniformly, often neglecting the influence of data preprocessing. In this work, we propose a novel preprocessing pipeline that enhances RUL prediction by improving data quality and temporal representation before model training.

cs.LG cs.AI

#153

Pretraining on Sleep Data Improves non-Sleep Biosignal Tasks

Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 3.8 3.7/4.6/3.0

Sleep foundation models have recently demonstrated strong performance on in-domain polysomnography tasks, including sleep staging, apnea detection, and disease risk prediction. In this work, we investigate whether sleep biosignals can serve as an effective pretraining distribution for learning representations that transfer beyond sleep to adjacent domains. Following sleep foundation models, we perform sleep-only multimodal contrastive pretraining (with a leave-one-out objective) and evaluate transfer to non-sleep EEG and ECG, two well-benchmarked biosignal modalities with heterogeneous datasets and clinically meaningful downstream tasks.

cs.LG cs.AI

#154

mdok-style at SemEval-2026 Task 9: Finetuning LLMs for Multilingual Polarization Detection