
Wolf Digest — Tuesday, May 5, 2026

Coverage window: 2026-05-04 03:25 ET to 2026-05-05 03:14 ET
#1 · Frontier LLMs
LWiAI Podcast #243 - GPT 5.5, DeepSeek V4, AI safety sabotage
8.5 · 1 srcs
#2 · Industry
Week one of the Musk v. Altman trial: What it was like in the room
7.8 · 1 srcs
#3 · Government & Defense
Pentagon seeks smarter, self-organizing drones as autonomous-warfare budget is poised to skyrocket
7.6 · 1 srcs
#1
Frontier LLMs 2026-05-04 Last Week in AI 8.5 8.5/9.3/7.5

Last Week in AI #243, recorded April 29 and posted May 4, is the most concentrated single artifact of the week, packing four distinct frontier-model events and an alignment-flavored incident into one show. Andrey Kurenkov and Jeremie Harris frame it as a coding-and-voice-heavy week followed by a fresh open-weights drop from China and a Tencent release that misses.

OpenAI shipped GPT-5.5, with the system card claiming meaningful gains on coding evaluations alongside higher per-token pricing than GPT-5.4. The card also surfaces chain-of-thought monitorability and misalignment testing as headline items — OpenAI continues to publish reasoning-trace probes against its own models — and includes a now-discussed system-prompt warning about "goblins" that the hosts treat as a quirk in OpenAI's deployment-time tooling rather than a serious capability claim. xAI countered with Grok Voice Think Fast 1.0, leading on real-time-voice-agent benchmarks and quantifying production impact at Starlink customer support and sales — large enough lifts that the hosts treat it as a credible Whisper-plus-frontier-model competitor rather than a demo, though the benchmarks are first-party.

The bigger frontier news is DeepSeek V4. Pro and Flash variants ship as open weights, with the architecture moving deeper into mixture-of-experts scaling and pushing context to one million tokens via hybrid compressed-attention modifications. The hosts read this as a continuation of the post-V3 cadence — the lab is converging on the same recipe that closed-weight labs are using internally and shipping the artifacts publicly. Tencent's Hunyuan 3 preview lands the same week with weaker benchmarks; Andrey treats it as evidence that the marginal value of "yet another Chinese frontier release" is dropping unless the lab can clearly differentiate on capability or efficiency.

The episode's safety-flavored thread is a sabotage incident that the hosts characterize as a deliberate attempt to insert harmful behavior into a frontier system — they cover what is publicly known so far without naming a perpetrator and treat it as evidence that supply-chain attacks against model-training pipelines are now a credible threat surface, not a hypothetical one. They tie this back to the "distillation attacks" framing that Nathan Lambert pushes back on in the same week, and to OpenAI's published chain-of-thought monitorability work — three threads converging on the same point that the misalignment frontier is increasingly about adversarial inputs to training rather than emergent goal-directed behavior in deployment.

For practitioners, the actionable signal is threefold: DeepSeek V4 and Grok Voice Think Fast both land as serious challengers in their respective domains in the same week, GPT-5.5 raises the price-per-coding-task ceiling without obviously moving the floor, and the misalignment conversation is shifting from "will the model deceive" toward "will an adversary plant the deception." The episode runs 1h52m and is worth hearing in full for context on each release.

#2
Industry 2026-05-04 MIT Technology Review — AI 7.8 7.5/8.5/7.0

James O'Donnell's report from the Oakland federal courthouse covers the first week of trial in Musk's lawsuit alleging that OpenAI breached its founding nonprofit mission. The trial began the week of April 27 in front of a federal judge, with Musk's testimony occupying week one and Altman's direct examination scheduled to follow in week two. The piece is the highest-quality on-the-ground account of the trial so far and reads like a stenographer's notebook turned into prose.

The legal stakes are concrete. Musk seeks rescission and disgorgement tied to his early funding contributions, with the strongest of his claims hinging on whether OpenAI's transition from nonprofit to capped-profit and then to its current structure violated explicit promises made to him in writing. A partial Musk win would, by the reporter's read, materially complicate OpenAI's reported plan to go public this year — the company is in the middle of preparing a registration filing, and any successful claim that its capital structure was constructed on a breached fiduciary duty would have to be disclosed to investors and resolved before pricing.

The reporter highlights several substantive moments from week one. Musk on the stand is described as "more measured than expected" in legal posture, repeatedly returning to the email record to show that his early checks were sized and timed against a nonprofit governance structure he believed Altman would maintain. Altman's direct examination, scheduled for week two, is expected to push back on the premise that any fixed promise was made — OpenAI's defense is that the for-profit subsidiary was always contemplated as a path to fund the mission and that Musk himself proposed a corporate-structure pivot in 2017. The article surfaces the trove of texts and emails that have entered evidence, including the now-public exchange in which Musk asked Altman for a settlement-like resolution and was rebuffed.

Beyond the stakes for OpenAI's IPO, O'Donnell points at the spectacle dimension: two of the most powerful figures in AI testifying within feet of each other while their feud plays out on X in real time, with each side's PR apparatus pushing transcripts and selective quotes. The reporter notes that the courtroom itself is a small, packed federal venue — credentialed press, OpenAI senior leadership in attendance, Musk's legal team led by his usual outside counsel — and that the judge has been actively managing the pace to keep the trial on schedule for a verdict by mid-summer. Several TechCrunch threads this week extend the same beat: one piece on Musk's only AI-expert witness, who testified about an AGI arms race; another surfacing the ominous post-funding texts Musk sent Brockman and Altman.

For anyone tracking AI corporate governance or reading the OpenAI S-1 when it lands, this is the trial of the year. The reporter promises continued coverage through closing arguments.

#3
Government & Defense 2026-05-04 Defense One 7.6 7.5/8.3/6.5

Defense One's Patrick Tucker reports that the autonomous-warfare line in the Pentagon's fiscal 2027 budget request is poised to grow significantly, with explicit emphasis on drones that can self-organize into formations and execute mission segments without per-platform operator input. The reporting cites unnamed program-office sources who describe the policy framing as a deliberate move past the current Replicator paradigm — Replicator emphasizes mass-produced, low-cost individual platforms; the new push emphasizes coordinated swarms running shared autonomy stacks.

The technical architecture, as described by the program-office sources, is built around three layers: a perception layer using onboard sensor-fusion stacks to maintain local situational awareness without relying on continuous data-link to a controller; a coordination layer that handles role assignment, formation maintenance, and target hand-off across a swarm using mesh networking; and a tasking layer where a human operator specifies mission objectives at the formation level rather than tasking each platform individually. The piece notes that several of the named technology providers — Anduril, Shield AI, Skydio, and a handful of smaller defense-tech companies — already deploy versions of all three layers in standalone products, but the integration story across vendors is still unsolved.

The budget growth is the news. Tucker quantifies the trajectory: autonomous-systems lines have been growing at roughly 35 percent year-over-year for the past three budget cycles, and the FY27 request is expected to extend that growth, with most of the increment going to platform-agnostic autonomy software rather than specific airframes. The Pentagon is treating autonomy as a horizontal capability — the same software stack expected to fly on multiple Replicator-class platforms, kamikaze drones like the Switchblade 400 (which the Army awarded contracts on this week), and larger collaborative-combat-aircraft programs in development with the Air Force. This contrasts with prior generations of autonomous-systems programs that tightly coupled the autonomy stack to a single airframe.
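To make the quoted trajectory concrete, here is a minimal sketch of what roughly 35 percent annual growth compounds to. The baseline is normalized to 1.0 because the summary gives no dollar figures, and the fourth cycle is only a hypothetical in which FY27 continues at the same pace.

```python
# Illustrative only: compound growth at a constant annual rate.
# 0.35 is the rough year-over-year figure quoted above; the baseline
# budget is normalized to 1.0 because no dollar amount is given.

def compound_growth(rate: float, cycles: int) -> float:
    """Return the cumulative multiplier after `cycles` periods at `rate`."""
    return (1.0 + rate) ** cycles

three_cycle_multiplier = compound_growth(0.35, 3)
# Hypothetical: what a fourth cycle at the same pace would imply for FY27.
four_cycle_multiplier = compound_growth(0.35, 4)

print(f"After 3 cycles: {three_cycle_multiplier:.2f}x the baseline")
print(f"After 4 cycles: {four_cycle_multiplier:.2f}x the baseline")
```

Under these assumptions, three cycles at that rate roughly multiply the line by two and a half, and a fourth would push it past three times the starting level.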

The piece flags two emerging tensions. First, the procurement-model question: the Defense Innovation Unit's "non-traditional" rapid-prototyping pipeline has been the on-ramp for most of the smaller autonomy vendors, but scaling to formation-level deployment requires moving programs into traditional acquisition pathways, where procurement timelines stretch to years and software-update cadence drops. Second, the data-sharing question: smaller vendors have been reluctant to share their flight-test data with Pentagon program offices because the data is core IP, but the Pentagon needs cross-platform datasets to train and validate the formation-coordination layer. Several vendors interviewed pushed back on the framing that the swarm coordination problem is "almost solved," noting that adversarial-jamming environments and degraded-GPS scenarios remain weak spots in the current state of practice.

The article is one of the cleaner examples of agenda-setting reporting on Pentagon AI policy this year and pairs naturally with this week's other defense-tech beats: the Marine Corps drone roadmap, the U.S. Strait of Hormuz "umbrella" deployment, and the Switchblade 400 award. For practitioners, the takeaway is that the autonomy-software market for defense is shifting toward standardized, platform-agnostic stacks, and the FY27 budget will show how serious the Pentagon is about that shift.

#4
Safety, Policy & Regulation 2026-05-04 Interconnects (Nathan Lambert) 7.5 7.0/8.0/7.0

Nathan Lambert's "The distillation panic" is the most pointed policy-discourse essay of the week and pushes back against the recent framing of "distillation attacks" — the term that has emerged in U.S. policy circles to describe Chinese labs hacking or jailbreaking commercial APIs to extract training signal. Lambert's core claim is that "distillation attacks" is a horrible piece of terminology that will, by repeated use in policy documents, irrevocably associate the broad and useful research technique of distillation with the narrow misuse pattern.

The technical distinction Lambert is defending is real and important. Distillation as a research method covers everything from teacher-student knowledge transfer to model compression, MoE expert pruning, and the entire post-training synthetic-data pipeline that nearly every frontier lab now uses. It is one of the core tools that makes academic and economic diffusion of AI capability possible — small labs and startups can fine-tune on outputs from larger models, researchers can build interpretability tools by distilling behavior into more analyzable forms, and downstream consumers benefit from cheaper, faster models trained from larger ones. The "attack" framing turns this entire toolkit into an act adjacent to a hostile state.

Lambert draws an explicit parallel to the open-source / open-weights debate, where careless terminology in policy circles collapsed a meaningful technical distinction into a single label whose original definition nobody remembers. He argues that the same fate is playing out for distillation: U.S. national-security commentary now treats every API extraction as suspect, the press echoes the framing, and within months the academic community will find itself defending a research practice it previously considered uncontroversial. The Lawfare piece on U.S. distillation-attack response from the prior week, which Lambert references, is part of what he is reacting to.

Where the essay gets practically useful is in his prescription. Lambert proposes naming the actual misuse pattern more narrowly — "API extraction" or "training-signal exfiltration" — and reserving "distillation" for the legitimate research technique. He notes that the labs themselves have a role to play here: rate-limiting and detection of exfiltration patterns are concrete defenses that don't require terminological overhang, and the frontier labs are already deploying both. He argues that policy work focused on the actual exfiltration pathways — telemetry, terms of service enforcement, attribution in logs — is more useful than blanket framings that risk catching legitimate research in their net.

The piece is short by Interconnects standards (around eight minutes of audio narration) and lands as a deliberate intervention in the policy discourse rather than a deep technical post. For readers tracking how the U.S. AI-policy conversation is evolving in 2026, it is one of the most clearly argued pushbacks against the recent National-Security-Council-adjacent framing — and worth reading alongside this week's Lawfare and Defense One pieces, which adopt the framing Lambert is criticizing.

#5
Infrastructure 2026-05-04 TechCrunch — AI 7.4 7.5/7.0/7.5

In the long-running saga that is Cerebras Systems’ IPO, the finish line is finally in sight. The AI chipmaker said on Monday that it is preparing to sell 28 million shares at $115 to $125 a share. This would raise $3.5 billion and give it a $26.6 billion market cap at the high end. That would be a nice bump in just a couple of months for the late investors who piled into its $1 billion Series H at a $23 billion valuation in February.
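The offering figures above can be checked with a few lines of arithmetic. This is a back-of-envelope sketch using only the numbers quoted in the excerpt. The implied share count and the low-end market cap are derived values, not figures from the filing, and they ignore any overallotment or dilution details.

```python
# Figures quoted in the excerpt above.
SHARES_OFFERED = 28_000_000
PRICE_LOW, PRICE_HIGH = 115, 125          # dollars per share
MARKET_CAP_HIGH = 26.6e9                  # dollars, at the high end
SERIES_H_VALUATION = 23e9                 # dollars, February round

def gross_proceeds(shares: int, price: int) -> int:
    """Gross amount raised before fees."""
    return shares * price

proceeds_low = gross_proceeds(SHARES_OFFERED, PRICE_LOW)
proceeds_high = gross_proceeds(SHARES_OFFERED, PRICE_HIGH)

# Derived, not reported: share count implied by the high-end market cap.
implied_shares_outstanding = MARKET_CAP_HIGH / PRICE_HIGH
# Derived, not reported: market cap if the deal priced at the low end.
implied_cap_low = implied_shares_outstanding * PRICE_LOW
# Step-up over the Series H valuation at the high end of the range.
step_up = MARKET_CAP_HIGH / SERIES_H_VALUATION - 1

print(f"Proceeds: ${proceeds_low / 1e9:.2f}B to ${proceeds_high / 1e9:.2f}B")
print(f"Implied shares outstanding: {implied_shares_outstanding / 1e6:.1f}M")
print(f"Implied low-end market cap: ${implied_cap_low / 1e9:.1f}B")
print(f"Step-up over Series H: {step_up:.1%}")
```

The $3.5 billion figure corresponds to the top of the price range. At the bottom of the range, the same share count would raise closer to $3.2 billion.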

#6
Industry 2026-05-04 TechCrunch — AI 7.4 7.7/7.5/7.0

On Monday, Anthropic announced a joint venture focusing on deploying enterprise AI services. Blackstone, Hellman & Friedman, and Goldman Sachs will be founding partners in the new venture, which is backed by a group of VCs, hedge funds, and private equity firms, including Apollo Global Management, General Atlantic, GIC, Leonard Green, and Sequoia Capital. The Wall Street Journal, which first reported news of the partnership, reported the new venture was valued at $1.5 billion, which includes a $300 million commitment each from Anthropic, Blackstone, and Hellman & Friedman. The announcement com

#7
Evaluations & Benchmarks 2026-05-04 Machine Learning Street Talk (MLST) · Machine Learning Street Talk 7.3 7.0/8.0/6.5

Machine Learning Street Talk (MLST), by Machine Learning Street Talk (MLST). Welcome! We engage in fascinating discussions with pre-eminent figures in the AI field. Our flagship show covers current affairs in AI, cognitive science, neuroscience and philosophy of mind with in-depth analysis. Our approach is unrivalled in terms of scope and rigour: we believe in intellectual diversity in AI, and we touch on all of the main ideas in the field with the hype surgically removed.

#8
Industry 2026-05-04 TechCrunch — AI 7.2 7.5/6.5/7.5

Bret Taylor’s AI startup Sierra is raising a $950 million funding round led by Tiger Global and GV, the company announced Monday, pushing its post-money valuation above $15 billion. The raise gives Sierra more than $1 billion to work with — capital the company says it will use to become the “global standard” for AI-powered customer experiences. Like a lot of AI companies, Sierra has, smartly, been very proactive in touting its own growth in a crowded market. The company says it started with just four design partners a couple of years ago.

#9
Audio & Speech 2026-05-04 Hacker News — AI front page 7.1 7.0/6.5/7.5

How OpenAI delivers low-latency voice AI at scale. By Yi Zhang and William McDonald, Members of Technical Staff (Engineering, May 4, 2026). Voice AI only feels natural if conversation moves at the speed of speech. When the network gets in the way, people hear it immediately as awkward pauses, clipped interruptions, or delayed barge-in. That matters for ChatGPT voice, for developers building with the Realtime API, for agents working in interactive workflows, and for models that need to process audio while a user is still talking. At OpenAI’s scale, that translates into three concrete requirements: Gl

#10
Research 2026-04-30 Hugging Face Daily Papers 6.8 6.5/6.0/8.0

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation.

#11
Safety, Policy & Regulation 2026-05-04 Hacker News — AI front page 6.8 5.5/7.5/7.0

A new, bipartisan bill introduced by Democratic Senator Adam Schiff of California and endorsed by the biggest AI developers in the world, including OpenAI, Google, and Microsoft, would change the K-12 curriculum to shoehorn in “AI literacy,” something that young people and teachers alike already hate in schools. The Literacy in Future Technologies Artificial Intelligence, or LIFT AI Act, would empower the new director of the National Science Foundation (NSF) to make grant awards “on a merit-reviewed, competitive basis to institutions of higher education or nonprofit organizations (or a consortium

#12
Safety, Policy & Regulation 2026-05-04 Import AI (Jack Clark) 6.7 6.5/7.0/6.0

Jack Clark argues there's now a 60%+ chance that no-human-involved AI R&D — an AI system powerful enough to autonomously build its own successor — happens by the end of 2028. The case is built from public benchmark data on the engineering components of AI development, plus the rate at which AI capability is compounding across them.

The headline evidence: SWE-Bench has effectively saturated (Claude Mythos Preview at 93.9% vs. Claude 2's ~2% in late 2023). METR's task-time-horizon plot shows AI systems going from ~30 seconds of independent work in 2022 to ~12 hours in 2026 (Opus 4.6), with Ajeya Cotra forecasting ~100 hours by end of 2026. CORE-Bench (computational reproducibility) was declared "solved" in December 2025 at 95.5%. MLE-Bench (Kaggle competitions) jumped from 16.9% at launch to 64.4% (Gemini 3 with search). Anthropic's CPU-only LM-training optimization task went from 2.9× speedup (Opus 4, May 2025) to 52× (Claude Mythos Preview, April 2026), against a human baseline of ~4× for 4–8 hours of work.

On the meta-skill axis, frontier models can now manage other AI systems (Claude Code's sub-agent supervision is the canonical example) and have produced proof-of-concept automated alignment research (Anthropic #454) that beats human baselines on small-scale scalable-oversight problems. PostTrainBench shows AI systems achieving ~half the uplift human researchers achieve when fine-tuning open-weight models. Frontier labs are explicit about the goal: OpenAI wants an "automated AI research intern by September 2026," Anthropic publishes on automated alignment researchers, DeepMind says automation of alignment research "should be done when feasible." Recursive Superintelligence raised $500M with this exact mandate.

The 60% by 2028 estimate (30% by 2027) hedges on whether AI can do creative, paradigm-shifting research — the transformer-architecture-class insight, not the engineering schlep. Clark notes math/CS results (Erdős-1051, the UBC/UNSW/Stanford/DeepMind math proof) are tantalizing but might be domain-specific. The downstream implications he flags: alignment under recursive self-improvement (small accuracy gaps compound: 99.9% becomes 60.5% after 500 generations), inequality of compute access, and the formation of capital-heavy human-light "machine economies" that may eventually trade primarily with each other.
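The compounding claim is easy to reproduce. A minimal sketch, assuming each generation independently preserves the same fraction of whatever property is being tracked (the function name and the independence assumption are illustrative, not Clark's):

```python
# Illustrative model: a property preserved with probability p per generation
# survives n generations with probability p ** n, assuming independence.

def retained_fidelity(per_generation: float, generations: int) -> float:
    """Fraction retained after `generations` successive hand-offs."""
    return per_generation ** generations

after_500 = retained_fidelity(0.999, 500)
print(f"0.999 per generation over 500 generations: {after_500:.3f}")

# The same function shows how sensitive the result is to the per-step rate.
for p in (0.9999, 0.999, 0.99):
    print(f"p={p}: {retained_fidelity(p, 500):.4f}")
```

With 99.9% per generation the result comes out at about 0.606, in line with the roughly 60% figure quoted above. The loop shows that a small change in the per-generation rate moves the 500-generation outcome substantially.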

#13
Industry 2026-05-04 TechCrunch — AI 6.7 6.5/7.0/6.5

When do we take AI doomers seriously? That’s a key subtext of Elon Musk’s attempt to shut down OpenAI’s for-profit AI business. His attorneys argue that the organization was set up as a charity focused on AI safety and lost its way in pursuit of lucre. To prove that, they cite old emails and statements from the organization’s founders about the need for a public-spirited counterweight to Google DeepMind.

#14
Robotic Autonomy 2026-05-04 arXiv cs.RO · Hugging Face Daily Papers 6.6 6.8/6.0/7.0

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes.

cs.RO
#15
Frontier LLMs 2026-05-04 Simon Willison's Weblog 6.5 6.5/6.0/6.5

Granite 4.1 3B SVG Pelican Gallery. IBM released their Granite 4.1 family of LLMs a few days ago. They're Apache 2.0 licensed and come in 3B, 8B and 30B sizes.

#16
Industry 2026-05-04 TechCrunch — AI 6.5 6.0/6.0/7.5

Image model releases are driving growth for AI mobile apps, generating 6.5x more downloads than traditional model updates, according to a new report from app intelligence provider Appfigures. This marks a shift from earlier days, when the release of new models powering the conversational experiences drove more demand, alongside the new features like a voice chat interface. For instance, ChatGPT and Gemini each added tens of millions of new downloads after releasing their respective image models, Appfigures found. For Google’s Gemini, the release of its image model Nano Banana drove an addition

#17
Research 2026-05-02 Hugging Face Daily Papers 6.3 6.0/6.0/7.0

Many real-world tasks require language models (LMs) to reason over complex contexts that exceed their parametric knowledge. This calls for context learning, where LMs directly learn relevant knowledge from the given context. An intuitive solution is inference-time skill augmentation: extracting the rules and procedures from context into natural-language skills.

#18
Agents & Tool Use 2026-05-04 arXiv cs.AI · Hugging Face Daily Papers 6.3 5.5/7.5/6.0

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-st…

cs.AI cs.CY
#19
Research 2026-04-30 Hugging Face Daily Papers 6.3 6.5/6.0/6.5

Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies.

#21
Agents & Tool Use 2026-04-28 Hugging Face Daily Papers 6.2 6.0/5.5/7.0

Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth-oriented tasks demand schema-aligned outputs with wide coverage and cross-entity consistency, while depth-oriented tasks require coherent reasoning over long, branching search trajectories.

#22
Safety, Policy & Regulation 2026-04-30 Hugging Face Daily Papers 6.2 5.5/7.0/6.0

Large Language Model (LLM) Red-Teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding effective and diverse attacks in red-teaming is important, but achieving both is challenging. Generative Flow Networks (GFNs) that perform distribution matching are a promising method, but they are notorious for training instability and mode collapse.

#23
AI Coding 2026-05-05 Hacker News — AI front page 6.1 5.5/5.0/7.5

Train Your Own LLM From Scratch. A hands-on workshop where you write every piece of a GPT training pipeline yourself, understanding what each component does and why. Andrej Karpathy's nanoGPT was my first real exposure to LLMs and transformers. Seeing how a working language model could be built in a few hundred lines of PyTorch completely changed how I thought about AI and inspired me to go deeper into the space. This workshop is my attempt to give others that same experience.

#24
Industry 2026-05-04 Stratechery 6.1 6.0/6.0/6.0

Google Earnings, Meta Earnings (Monday, May 4, 2026). Wall Street loved Google’s earnings, and hated Meta’s, even though the latter’s core business was more impressive. The difference is that Google is monetizing its investments now (and it might be all Anthropic).

#25
Safety, Policy & Regulation 2026-05-04 FedScoop — AI 6.1 5.5/7.5/5.0

Senators seek Labor-led database on AI workforce impacts. The Workforce Transparency Act from Sens. Mark Warner and Ted Budd would charge the DOL with creating a public resource with “aggregated workforce transparency data.”

#26
Government & Defense 2026-05-04 DefenseScoop 6.0 6.0/6.5/5.0

Army awards deal to AV for new Switchblade 400 kamikaze drone to support LASSO program. The Army has awarded AeroVironment a prototype agreement for the drone maker’s latest Switchblade variant, the company announced Monday.

#27
Industry 2026-05-04 Latent Space (swyx & Alessio) 6.0 5.5/6.0/6.0

Congrats to Sierra, raising ~$1B at a $15B valuation. Normally a headline story, but we already covered their $10B round and CEO Bret Taylor on the pod. They crossed 100M ARR in November and 150M in Feb, so presumably they are at or above the 200M mark (a nice 75x current multiple, whew; 50x if you give them credit thru EOY). Today though we are choosing to focus on this discussion bravely sparked by Roon, an OpenAI employee

#28
Industry 2026-05-05 TechCrunch — AI 5.9 5.0/6.0/6.5

When it comes to the specter of AI’s labor-displacing potential, Jensen Huang thinks that the American worker has nothing to fear. During a conversation Monday night with CNBC’s Becky Quick hosted by the Milken Institute, an economic policy think tank, the jovial Nvidia CEO said that AI was an industrial-scale generator of jobs, not the harbinger of mass unemployment that so-called “AI doomers” have often accused it of being. A number of different topics were broached during the talk, but a central theme that kept coming back was the ongoing economic anxiety surrounding the AI industry and w

#29
Industry 2026-05-04 TechCrunch — AI 5.9 5.5/6.0/6.0

Elon Musk sent ominous texts to Greg Brockman, Sam Altman after asking for a settlement, OpenAI claims. By Julie Bort and Tim Fernholz (In Brief, posted May 4, 2026). Two days before the Elon Musk vs. OpenAI trial began last week, Musk texted the model maker’s president and co-founder Greg Brockman. Musk suggested to Brockman that OpenAI settle the suit. After Brockman replied by suggesting both sides drop their suits, the exchange went off the rails, with Musk responding: “By the end of this week, you and Sam will be the most hated men in America.

#30
Research 2026-04-30 Hugging Face Daily Papers 5.8 6.0/6.0/5.5

Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code.

#31
Government & Defense 2026-05-04 War on the Rocks 5.8 5.5/6.5/5.0

Vitaliy Goncharuk argues NATO is making a costly bet by pouring money into propeller-driven counter-drone systems, when the Ukrainian battlefield is already showing the bet is wrong. Russia has retrofitted its slow Shahed drones with turbojet engines, jumping speed from ~90 mph to ~460 mph and ceiling from 6,500 ft to 29,000 ft. Ukraine's propeller-based interceptor drones (max 280 mph) can no longer catch them from behind — only head-on intercepts remain viable, with sharply reduced hit rates.
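The geometry behind that claim can be sketched with simple closing-speed arithmetic. This is a one-dimensional simplification using the approximate speeds quoted above. It assumes constant speeds and ignores altitude, maneuvering, and sensor limits, and the function name is illustrative.

```python
# Approximate speeds quoted in the summary above (mph).
INTERCEPTOR_MAX_MPH = 280
SHAHED_PROP_MPH = 90
SHAHED_JET_MPH = 460

def closing_speed_mph(interceptor_mph: float, target_mph: float, head_on: bool) -> float:
    """Rate at which the gap shrinks; a negative value means the gap is growing."""
    if head_on:
        return interceptor_mph + target_mph
    return interceptor_mph - target_mph

tail_chase_prop = closing_speed_mph(INTERCEPTOR_MAX_MPH, SHAHED_PROP_MPH, head_on=False)
tail_chase_jet = closing_speed_mph(INTERCEPTOR_MAX_MPH, SHAHED_JET_MPH, head_on=False)
head_on_jet = closing_speed_mph(INTERCEPTOR_MAX_MPH, SHAHED_JET_MPH, head_on=True)

print(f"Tail chase vs. propeller Shahed: {tail_chase_prop:+} mph")
print(f"Tail chase vs. turbojet Shahed:  {tail_chase_jet:+} mph")
print(f"Head-on vs. turbojet Shahed:     {head_on_jet:+} mph")
```

A negative tail-chase value means the interceptor falls further behind every second, which is why only head-on geometry remains. The very high head-on closing speed in turn shortens the engagement window, consistent with the reduced hit rates the article describes.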

Iran's $90,000 "358" missile already intercepts the full class of aerial threats from Shaheds to MQ-9 Reapers and AH-64 Apaches, while Western counter-drone efforts double down on quadcopters at the wrong end of the speed-altitude curve. The right answer, per Goncharuk, is a new class of cheap autonomous interceptor missiles — low-thousands to tens-of-thousands of dollars per unit, AI-guided with onboard inertial/visual nav — that scales economically against turbojet drones costing $20–50K. The components exist; what's missing is integration and production scale. YC-backed Perseus Defense and Ares Industries, plus European players Frankenburg Technologies and Origin Robotics, are building toward this; none are at production scale.

Drones in this view become trucks — propeller-driven motherships carrying 2–10 cheap interceptor missiles, with autonomy mandatory because comms and GPS won't be assumed. Five structural reasons the West isn't moving: institutional momentum (drones are politically legible), missile production complexity (concentrated in legacy primes), the sensor/nav scaling gap from civilian autonomy (missile-class targeting needs different hardware), ITAR friction once propulsion is involved, and missile-engineering workforce scarcity. China is supplying components and also investing in affordable counter-drone missiles (Yitian, FK-3000 with 96 missiles per platform); Russia is fielding the S8000 Banderol, sometimes nicknamed an "AliExpress missile."

#32
Research 2026-04-29 Hugging Face Daily Papers 5.7 5.5/6.0/5.5

Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) have significantly reduced inference costs through sparse activation. However, this sparse activation paradigm also introduces new safety challenges. Since only a subset of experts is engaged for each input, model behavior becomes coupled to routing decisions, yielding a difficult-to-control mechanism that can vary across safety-relevant scenarios.

#33
Agents & Tool Use 2026-05-03 Hugging Face Daily Papers 5.7 5.5/6.0/5.5

Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performance on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies continue to generate low-information actions that neither reduce uncertainty nor advance task progress.

#34
Multimodal 2026-05-04 arXiv cs.AI · Hugging Face Daily Papers 5.3 5.5/5.5/5.0

Despite the success of Large Vision-Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility.

cs.CV cs.AI
#35
Safety, Policy & Regulation 2026-05-04 arXiv cs.CL 5.1 5.5/6.8/3.0

As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability -- knowing what they do not know, detecting errors, seeking clarification -- under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring.

cs.AI cs.CL cs.LG
#36
Agents & Tool Use 2026-05-04 arXiv cs.AI 5.1 4.0/8.3/3.0

Drug-induced liver injury (DILI) remains a leading cause of late-stage clinical trial attrition. However, existing computational predictors primarily rely on binary classification, a framing that limits generalization and yields no mechanistic insight to guide translational decisions. We argue that DILI prediction is better posed as an explainable hypothesis-generation problem.

cs.AI
#37
Research 2026-04-24 Hugging Face Daily Papers 5.1 5.0/6.1/4.2

The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and inherently exhibit multi-modal, high-noise, and weakly labeled characteristics, lacking unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environme…

#38
Research 2026-04-30 Hugging Face Daily Papers 5.0 5.5/4.6/5.0

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competi…

#40
Government & Defense 2026-05-04 Defense One 5.0 4.5/7.2/3.0

Elon Musk holds a chainsaw reading “Long live freedom, damn it” during the 2025 Conservative Political Action Conference. Musk took the helm of DOGE, the Trump administration’s Department of Government Efficiency, and oversaw cuts and reorganizations across federal agencies. Photo: SAUL LOEB/AFP via Getty Images.

#41
Research 2026-04-30 Hugging Face Daily Papers 4.8 5.5/4.0/5.0

3D world generation is essential for applications such as immersive content creation or autonomous driving simulation. Recent advances in 3D world generation have shown promising results; however, these methods are constrained by grid layouts and suffer from inconsistencies in object scale throughout the entire world. In this work, we introduce a novel framework, Map2World, that first enables 3D world generation conditioned on user-defined segment maps of arbitrary shapes and scales, ensuring global-scale consistency and flexibility across expansive environments.

#42
Research 2026-04-28 Hugging Face Daily Papers 4.8 5.5/4.0/5.0

Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that heavily depends on these models to drive 3D edits. While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object's overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction.

#43
Research 2026-04-26 Hugging Face Daily Papers 4.8 5.5/4.0/5.0

LLM agents increasingly rely on reusable skills, capability packages that combine instructions, control flow, constraints, and tool calls. In most current agent systems, however, skills are still represented by text-heavy artifacts, including SKILL.md-style documents and structured records whose machine-usable evidence remains embedded largely in natural-language descriptions. This poses a challenge for skill-centered agent systems: managing skill collections and using skills to support agents both require reasoning over invocation interfaces, execution structure, and concrete side effects that are often entangled in a single textual surface.

#44
Government & Defense 2026-05-04 Defense One 4.8 4.5/6.5/3.0

First B-52s to get new engines this year. Critical design review clears Boeing to upgrade two Stratofortresses in bid to keep them flying past 2050. Photo: An Air Force B-52 Stratofortress takes off from RAF Fairford on March 19, 2026, in Fairford, England. Leon Neal/Getty Images.

#45
Safety, Policy & Regulation 2026-05-04 arXiv cs.CL 4.7 4.0/7.0/3.0

Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, where earlier turns covertly bias later replies, constitutes a powerful attack surface, with hand-crafted multi-turn scaffolds consistently outperforming single-turn manipulations on capable models. However, automated optimization-based red-teaming has remained largely limited to the single-turn setting, iterating over static prompts and lacking the ability to reason about which forms of conversational priming induce compliance.

cs.CL cs.CR
#46
Research 2026-04-30 Hugging Face Daily Papers 4.7 4.5/6.1/3.5

Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see.

#47
Industry 2026-05-04 MIT Technology Review — AI 4.7 4.5/6.2/3.0

Sponsored · Biotechnology and health. Tailoring AI solutions for health care needs: By tapping into sector-specific data and expertise, developers can build AI applications that address some of health care’s biggest challenges. By MIT Technology Review Insights, May 4, 2026. In partnership with Mayo Clinic Platform. The AI market is full of big promises of grand transformation. Health care is a prime target for those promises, beset as it is by financial pressures, labor shortages, and the growing burden of caring for an aging population. AI developers are targeting functions that vary widely,

#48
Government & Defense 2026-05-04 DefenseScoop 4.7 4.8/5.7/3.0

The Army wants a new drone to close ‘reconnaissance and security gaps’ for its battalions. Amid an ongoing effort to push longer-range, quick-launch drones to tactical units, the service wants battalion commanders to have an unmanned aerial system organic to their unit that can take off vertically and fly over 40 kilometers.

#49
Reinforcement Learning 2026-05-04 arXiv cs.LG 4.6 4.0/6.9/3.0

Molecular property models increasingly support high-stakes drug-discovery decisions, but their outputs are often difficult to audit: classical predictors return scores without rationale, while language models can produce fluent explanations weakly grounded in the input molecule. We introduce Bolek, a compact multimodal language model that grounds natural-language reasoning in molecular structure by injecting a Morgan fingerprint embedding into an instruction-tuned text decoder. Bolek is fine-tuned on molecular alignment tasks, including molecule description, RDKit descriptor prediction, and substructure detection, and on downstream reasoning over 15 TDC binary classification tasks using synt…

cs.LG cs.AI q-bio.BM
#50
Research 2026-05-04 arXiv cs.LG 4.6 4.0/6.8/3.0

Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts seems to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer.
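As a rough illustration of why a weighted SFT objective on reference samples can recover a KL-regularized optimum, the sketch below uses the standard closed form for the fixed-reference problem, where the optimal policy is proportional to the reference probability times exp(reward / beta). The toy response set, the rewards, the value of `beta`, and both function names are assumptions made for this example; the paper's exact objective and weighting may differ.

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || ref) over a finite response set."""
    unnorm = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def weighted_sft_solution(samples, rewards, beta, num_responses):
    """Maximizer of sum_i w_i * log pi(y_i) over categorical pi, with w_i = exp(r(y_i) / beta).

    For an unconstrained categorical policy this is the normalized weight mass per response.
    """
    mass = [0.0] * num_responses
    for y in samples:
        mass[y] += math.exp(rewards[y] / beta)
    total = sum(mass)
    return [m / total for m in mass]


# Toy setup (hypothetical): three candidate responses, binary verifiable reward.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 1.0]
beta = 1.0

# A precomputed rollout set whose empirical frequencies match the reference policy.
samples = [0] * 5 + [1] * 3 + [2] * 2

target = kl_regularized_optimum(ref_probs, rewards, beta)
fitted = weighted_sft_solution(samples, rewards, beta, num_responses=3)
print([round(p, 4) for p in target])
print([round(p, 4) for p in fitted])
```

In this toy case the two distributions coincide because the rollouts are drawn in proportion to the reference policy and weighted by exp(reward / beta). Changing either the sampler or the weights changes the policy being fit, which is the point the abstract makes about a weighted likelihood not being specified by rewards alone.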

cs.LG cs.AI
#51
Evaluations & Benchmarks 2026-05-04 arXiv cs.RO 4.6 4.7/6.1/3.0

Dense and accurate depth estimation is essential for robotic manipulation, grasping, and navigation, yet currently available depth sensors are prone to errors on transparent, specular, and general non-Lambertian surfaces. To mitigate these errors, large-scale monocular depth estimation approaches provide strong structural priors, but their predictions can be potentially skewed or mis-scaled in metric units, limiting their direct use in robotics. Thus, in this work, we propose a training-free depth grounding framework that anchors monocular depth estimation priors from a depth foundation model in raw sensor depth through factor graph optimization.

cs.RO cs.CV
#52
Robotic Autonomy 2026-05-04 arXiv cs.RO 4.6 3.7/7.1/3.0

Recognizing unknown objects is crucial for safety-critical applications such as autonomous driving and robotics. Open-Set Panoptic Segmentation (OPS) aims to segment known thing and stuff classes while identifying valid unknown objects as separate instances. Prior OPS approaches largely treat known categories as a flat label set, ignoring the semantic hierarchy that provides valuable structural priors for distinguishing unknown objects from in-distribution classes.

cs.CV cs.AI cs.RO
#53
Multimodal 2026-05-04 arXiv cs.CV 4.6 4.7/6.1/3.0

Open-world text-guided class-agnostic counting (CAC) has emerged as a flexible paradigm for counting arbitrary object classes by using natural language prompts. However, current evaluation protocols primarily focus on standard counting errors within single-category images, overlooking a fundamental requirement: the ability to correctly ground the textual prompt in the visual scene. In this paper, we show that several state-of-the-art CAC models often struggle to determine which object class should be counted based on the given prompt, revealing a misalignment between textual semantics and visual object representations.

cs.CV
#54
Research 2026-04-30 Hugging Face Daily Papers 4.6 5.0/4.6/4.2

Decision trees and diffusion models are ostensibly disparate model classes, one discrete and hierarchical, the other continuous and dynamic. This work unifies the two by establishing a crisp mathematical correspondence between hierarchical decision trees and diffusion processes in appropriate limiting regimes. Our unification reveals a shared optimization principle: Global Trajectory Score Matching (GTSM), for which gradient boosting (in an idealized version) is asymptotically optimal.

#56
Industry 2026-05-05 Simon Willison's Weblog 4.6 4.8/5.5/3.0

So it’s well known that Y Combinator owns some stake in OpenAI. But how big is that stake? This seems like devilishly difficult information to obtain. I asked around and a little birdie who knows several OpenAI investors came back with an answer: Y Combinator owns about 0.6 percent of OpenAI.

#57
Robotic Autonomy 2026-05-04 arXiv cs.RO 4.5 3.7/6.8/3.0

Autonomous 3D indoor scene synthesis breaks down in non-convex rooms with tightly coupled spatial constraints. Data-driven generators lack topological priors for long-horizon planning, while iterative agents fragment semantics and become geometrically brittle. We present ZoneMaestro, a unified framework that shifts the paradigm from object-centric synthesis to Zone-Graph Orchestration.

cs.RO cs.AI
#59
Government & Defense 2026-05-04 War on the Rocks 4.5 4.5/5.7/3.0

Ukrainian security researchers Mykhailo Andreichyn and Serhii Demediuk (a former deputy secretary of Ukraine's National Security and Defense Council) argue state sovereignty is being squeezed between two hammers: the calculated ambiguity of international law in cyberspace, and critical state-defense dependence on private tech infrastructure (AWS, Microsoft, Meta, Google, SpaceX). The combination produces a regime in which states must ask permission to use private capabilities for defense, while aggressors weaponize legal restraint as a shield.

The first hammer: international law treats cyber operations as armed attacks only under "kinetic equivalence." Russia exploited this to plant thousands of backdoors in Ukrainian systems from 2014 onward, activating them at the moment of full-scale invasion in 2022. Per Demediuk's direct operational experience, ~2,500 backdoors may still be prepositioned in Ukrainian systems as of early 2026. The "kinetic-cyber cycle" Russia uses — information pretext, digital targeting via compromised routers, kinetic strike, information rationalization — has been predictable enough that Ukraine built an automated predictive system reportedly running at 60–65% accuracy. The first weapon Russia deployed in February 2022 wasn't a tank but a cyberattack on the Viasat satellite network ~1 hour before the ground invasion.

The second hammer: the 2026 DoD–Anthropic confrontation showed even the U.S. can't fully control how a private company deploys AI (Anthropic refused to lift restrictions on autonomous lethal systems and mass surveillance, and was designated a supply chain risk). Ukraine's Starlink dependency went the other way — during the autumn 2022 southern counteroffensive, Ukrainian forces crossed a geofenced line and lost connectivity mid-assault, leading to casualties from a single private actor's operational decision. The authors argue for two recalibrations: classify destructive state-sponsored cyber operations as armed attacks by intent and cumulative effect (not kinetic equivalence) via the new UN Global Mechanism on ICT security, and impose continuity-of-service obligations on digital-infrastructure providers comparable to those on traditional defense contractors.

#60
Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 4.4 5.0/5.3/3.0

Spectral-based machine learning models have been increasingly deployed in chemometrics and spectroscopy, where predictive accuracy is as important as explainability. Currently employed eXplainable Artificial Intelligence (XAI) methods are largely adapted from tabular or generic multivariate domains, assigning relevance to isolated spectral variables rather than to the chemically meaningful spectral zones. Widely adopted tools such as SHapley Additive exPlanations (SHAP), Permutation Feature Importance (PFI), and Variable Importance in Projection scores (VIP) were not designed for the physical continuity and high collinearity of spectral data, and their variable-level outputs require post-hoc a…

cs.LG physics.app-ph
#61
Research 2026-04-30 Hugging Face Daily Papers 4.4 5.0/4.0/4.2

Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20--30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action.

#62
Research 2026-04-29 Hugging Face Daily Papers 4.4 5.0/4.0/4.2

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself.

#63
Research 2026-04-28 Hugging Face Daily Papers 4.4 5.0/4.0/4.2

In this paper, we study an under-explored but important factor of diffusion generative models, i.e., the combinatorial complexity. Data samples are generally high-dimensional, and for various structured generation tasks, additional attributes are combined to associate with data samples. We show that the space spanned by the combination of dimensions and attributes can be insufficiently covered by existing training schemes of diffusion generative models, potentially limiting test time performance.

#64
Agents & Tool Use 2026-05-04 Simon Willison's Weblog 4.4 5.8/4.0/3.0

Tool: Redis Array Playground. Salvatore Sanfilippo submitted a PR adding a new data type, arrays, to Redis. The new commands are ARCOUNT, ARDEL, ARDELRANGE, ARGET, ARGETRANGE, ARGREP, ARINFO, ARINSERT, ARLASTITEMS, ARLEN, ARMGET, ARMSET, ARNEXT, AROP, ARRING, ARSCAN, ARSEEK, ARSET. The implementation is currently available in a branch, so I had Claude Code for web build thi…

#65
Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 4.3 3.7/6.3/3.0

We introduce PLACE (Persistence-Landmark Analytic Classification Engine), a closed-form pipeline for classifying point clouds and graphs through their persistent-homology signatures. Three quantitative guarantees -- a margin-based excess-risk rate, a closed-form descriptor-selection rule, and a per-prediction certificate -- are derived from training labels alone, with no learned weights or held-out calibration. The embedding sums Mitra-Virk single-point coordinate functions over a sparse landmark grid; closed-form weights maximize a structural distortion constant $λ(ν)$ (a Lipschitz lower bound on $\mathcal{D}_n$ under non-interference).

cs.LG math.AT
#66
Reinforcement Learning 2026-05-04 arXiv cs.LG 4.3 3.7/6.1/3.0

Offline safe reinforcement learning often requires policies to adapt at deployment time to safety budgets that vary across episodes or change within a single episode. While diffusion-based planners enable flexible trajectory generation, existing guidance schemes often treat reward improvement and constraint satisfaction as competing gradient objectives, which can lead to unreliable safety compliance under cost limits. We reinterpret adaptive safe trajectory generation as sampling from a constrained trajectory distribution, where the budget restricts the trajectory region, and reward shapes preferences within that region.

cs.LG cs.AI
#67
Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 4.3 3.7/6.1/3.0

Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space, the autoregressive objective favors shortcut reliance on direct visual input, driving latent tokens toward transition-like states rather than informative reasoning content.

cs.LG
#68
Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 4.3 3.7/6.3/3.0

Long-term time series forecasting requires models that simultaneously capture rapid oscillations, medium-range periodicities, and slowly evolving macro-trends from a fixed look-back window. Existing lightweight MLP-based models typically operate on a single temporal resolution, limiting their ability to explicitly model patterns at multiple scales. We propose MSMixer, a channel-independent multi-scale MLP architecture that addresses this limitation through three complementary innovations: (i) three parallel scale branches at down-sample factors {1x, 4x, 16x} with independent MLP blocks, (ii) a learnable softmax gate that dynamically weighs branch outputs, and (iii) a DLinear complementary sh…
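The gating idea in (i) and (ii) can be sketched as follows. This is a minimal illustration, not the MSMixer implementation: average pooling as the down-sampling method, the `branch_forecast` placeholder that stands in for each branch's MLP block, and the fixed gate logits (which would be trained parameters in a real model) are all assumptions made for the example.

```python
import math


def avg_pool(series, factor):
    """Down-sample by averaging non-overlapping windows of length `factor`."""
    return [
        sum(series[i:i + factor]) / factor
        for i in range(0, len(series) - factor + 1, factor)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def branch_forecast(series, factor, horizon):
    """Placeholder for a per-scale MLP block: repeat the last pooled value."""
    pooled = avg_pool(series, factor)
    return [pooled[-1]] * horizon


def gated_forecast(series, gate_logits, horizon, factors=(1, 4, 16)):
    """Combine the per-scale forecasts with softmax gate weights."""
    weights = softmax(gate_logits)
    branches = [branch_forecast(series, f, horizon) for f in factors]
    return [
        sum(w * b[t] for w, b in zip(weights, branches))
        for t in range(horizon)
    ]


# A 32-step look-back window with a linear trend; equal gate logits give equal weights.
lookback = [float(t) for t in range(32)]
forecast = gated_forecast(lookback, gate_logits=[0.0, 0.0, 0.0], horizon=2)
print(forecast)
```

With equal logits each scale contributes one third of the output. Raising one logit shifts weight toward that resolution, which is how a learned gate could favor fine or coarse patterns.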

cs.LG
#69
Generative Media 2026-05-04 arXiv cs.LG 4.3 3.7/6.1/3.0

Optical Coherence Tomography (OCT) has become one of the most widely used imaging modalities in ophthalmology. It provides high-resolution, non-invasive visualization of retinal microarchitecture. The automated analysis of OCT images through representation learning has emerged as a central research frontier.

cs.CV cs.LG
#70
Efficiency 2026-05-04 arXiv cs.LG 4.3 3.7/6.1/3.0

Chemotherapy dose optimization can be formulated as a dynamic treatment regime, requiring sequential decisions under uncertainty that must balance tumor suppression against toxicity. However, most reinforcement learning approaches assume full observability of the patient state, a condition rarely met in clinical practice. We investigate whether memory-augmented policies can improve chemotherapy control under partial observability.

cs.LG cs.AI
#71
Robotic Autonomy 2026-05-04 arXiv cs.LG 4.3 3.7/6.1/3.0

Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable diversity, no prior work systematically compares how different generator types affect policy generalization. We integrate four generators (sparse, maze, graph, and Wave Function Collapse) with guaranteed navigability into MuRoSim, a 2D simulator focusing on training efficiency for LiDAR-based navigation.

cs.RO cs.LG
#72
Efficiency 2026-05-04 arXiv cs.LG 4.3 3.7/6.3/3.0

Physics-informed neural networks (PINNs) have recently emerged as a promising framework for integrating data-driven learning with physical knowledge. In this work, we propose a coupled PINN approach for the joint reconstruction of indoor temperature and humidity dynamics in greenhouse environments, together with simultaneous identification of key model parameters. The method incorporates a reduced-order physically motivated model into the learning process, enabling consistent estimation under sparse and noisy observations.

cs.LG
#73
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 4.3 3.7/6.1/3.0

Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medications, yet their potential as a substrate for healthcare foundation models remains largely unexplored. Here we present ReClaim, a generative transformer trained from scratch on 43.8 billion medical events from more than 200 million enrollees in the MarketScan claims data spanning 2008-2022.

cs.AI cs.CL
#74
Safety, Policy & Regulation 2026-05-04 arXiv cs.CL 4.3 3.7/6.1/3.0

We present our shared task on evaluating the adaptability of LLMs and NLP systems across multiple languages and cultures. The task data consist of an extended version of our manually constructed BLEnD benchmark (Myung et al. 2024), covering more than 30 language-culture pairs, predominantly representing low-resource languages spoken across multiple continents.

cs.CL
#75
Agents & Tool Use 2026-05-04 arXiv cs.AI 4.3 3.7/6.1/3.0

Deciding how to distribute work between humans and AI systems is a central challenge in organisational design. Most approaches treat this as a binary choice, yet the operational reality is richer: humans and AI routinely share tasks or take complementary roles depending on context, fatigue, and the stakes involved. Governing that distribution -- balancing efficiency, oversight, and human capability -- remains an open problem.

cs.AI cs.HC cs.SE
#76
Evaluations & Benchmarks 2026-05-04 arXiv cs.AI 4.3 3.7/6.1/3.0

Learning to Defer (L2D) enables a model to predict autonomously or defer to an expert, but prior work largely assumes flat label spaces. We study the first L2D setting with hierarchical multi-label decisions, motivated by medical-imaging workflows in which findings are organised by clinical taxonomies. In this setting, deferral is a delegation action rather than a label assignment, so treating it as an independent per-label decision can produce deferral incoherence, including taxonomic contradictions, delegation violations, and deferrals of labels already implied by the model's own assertions.

cs.AI
#77
Agents & Tool Use 2026-05-04 arXiv cs.AI 4.3 3.7/6.1/3.0

Healthcare automation is shaped by local procedures and organizational constraints, so agent capabilities rarely transfer unchanged across settings. Agent skills, self-contained directories that package reusable procedures for AI agents, are emerging as a procedural layer for adapting healthcare agents across diverse healthcare settings. We present the first empirical analysis of healthcare agent skills, drawing on 557 healthcare-related skills filtered from 58,159 public skills on ClawHub and annotated along ten dimensions covering function, deployment context, autonomy, and safety.

cs.AI
#78
Agents & Tool Use 2026-05-04 arXiv cs.AI 4.3 3.7/6.2/3.0

We present two new classes of causal models of decision-making agents. Our approach is motivated by the needs of modeling the economics of computing systems. These systems are composed of subsystems and can exhibit endogenous limits on cognitive resources and value discounting.

cs.CE cs.AI cs.GT econ.TH
#79
Robotic Autonomy 2026-05-04 arXiv cs.RO 4.3 3.7/6.2/3.0

Autonomous driving technology has rapidly evolved over the past decade, offering significant improvements in transportation efficiency, safety, and cost reduction. While much of the progress has focused on highway driving and obstacle avoidance, low-speed maneuvers such as parking remain among the most difficult challenges for autonomous systems. This challenge is especially pronounced in trailer-truck transport vehicles due to their articulated motion and environmental constraints.

cs.RO
#80
Generative Media 2026-05-04 arXiv cs.CV 4.3 3.7/6.2/3.0

While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment.

cs.CV
#81
Government & Defense 2026-05-04 War on the Rocks 4.3 4.5/5.0/3.0

The latest installment of WOTR's "Ukraine Compass" weekly digest of Ukrainian-language commentary leads with Espreso's Dmytro Snegiryov on Russia's intensified push to capture the Sloviansk-Kramatorsk agglomeration in eastern Ukraine. Russian forces are concentrating attacks on Kostiantynivka, Chasovyi Yar, and the surrounding area, attempting to flank Ukrainian positions near Sloviansk by advancing from multiple directions in a tactic that echoes Bakhmut. Kostiantynivka is being systematically destroyed by heavy bombs with over 2,500 civilians trapped and all access roads under fire.

Snegiryov's read on the difference from Bakhmut: Ukraine has substantially expanded its drone capabilities since 2023, giving it a clear edge in the number of operator units and in first-person-view and drop-drone usage, plus a wider surveillance zone. He argues this is slowing — but not stopping — Russian advances, and the only way Ukraine continues to hold these areas is by leaning on partner-supplied equipment and expertise rather than going it alone. The full Compass roundup is members-only past this point; the article curates additional pieces from Ukrainian outlets across the political spectrum on frontline strategy, domestic politics, and public argument inside a country at war.

#82
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 4.2 3.7/6.0/3.0

Molecular Vibe Coding, a paradigm where chemists interact with LLMs to generate executable programs for molecular tasks, has emerged as a flexible alternative to chemical agents with predefined tools, enabling chemists to express arbitrarily complex, customized workflows. Unlike general coding tasks, molecular coding poses a distinctive challenge: LLMs must jointly possess programming, molecular understanding, and domain-specific reasoning capabilities. However, existing benchmarks remain disconnected.

cs.CL
#83
Research 2026-05-04 arXiv cs.AI 4.2 4.0/5.5/3.0

Adapting large pretrained models to diverse tasks is now routine, yet the two dominant strategies of parameter-efficient fine-tuning (PEFT) and low-rank compression are typically composed in sequence. This decoupled practice first compresses and then fine-tunes adapters, potentially misaligning the compressed subspace with downstream objectives and squandering a global parameter budget. To overcome this limitation, we introduce JACTUS (Joint Adaptation and Compression with a Task-aware Union of Subspaces), a single framework that unifies compression and adaptation.

cs.AI
#84
Multimodal 2026-05-04 arXiv cs.AI 4.2 4.0/5.5/3.0

Optical coherence tomography (OCT), a commonly used retinal imaging modality, plays a central role in retinal disease diagnosis by providing high-resolution visualization of retinal layers. While deep learning (DL) has achieved expert-level accuracy in OCT-based retinal disease detection, its "black box" nature poses challenges for clinical adoption, where explainability is essential for clinical trust and regulatory approval. Existing post-hoc explainable AI (XAI) methods often struggle to delineate fine-grained lesion structures, respect anatomical boundaries, or suppress noise, limiting the trustworthiness of their explanations.

cs.CV cs.AI
#85
Research 2026-04-28 Hugging Face Daily Papers 4.2 4.5/4.7/3.5

Training stable biological foundation models requires rethinking attention mechanisms: we find that using sigmoid attention as a drop-in replacement for softmax attention yields a) better learned representations: on six diverse single-cell datasets, sigmoid achieves 25% higher cell-type separation, better cell-type cohesion metrics, and lower validation loss; b) faster training: models with sigmoid attention train up to 10% faster than their softmax counterparts; and c) more stable training, by eliminating inherent sources of instability in softmax attention. We establish that sigmoid attention has globally bounded derivatives (≤ 0.25) as opposed to softmax, and a diagonal Jacobian struct…
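
The 0.25 bound on derivatives mentioned above matches the well-known maximum of the logistic sigmoid's derivative, σ'(x) = σ(x)(1 − σ(x)), which is reached at x = 0. The following is only a numeric sanity check of that fact, not code from the paper; the helper names and the scan grid are illustrative choices.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Scan inputs from -10 to 10 (illustrative grid); the derivative peaks at x = 0.
grid = [i / 100.0 for i in range(-1000, 1001)]
max_grad = max(sigmoid_grad(x) for x in grid)
print(f"max sigmoid derivative on grid: {max_grad}")
```

Because each sigmoid attention weight depends only on its own logit, the elementwise bound applies per entry, whereas softmax couples every logit in a row through its normalizer.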

#86
Research 2026-04-30 Hugging Face Daily Papers 4.2 4.5/4.6/3.5

Distributed blackbox consensus optimization is a fundamental problem in multi-agent systems, where agents must improve a global objective using only local objective queries and limited neighbor communication. Existing methods largely rely on handcrafted update rules and static cooperation patterns, which often struggle to balance local adaptation, global coordination, and communication efficiency in heterogeneous nonconvex environments. In this paper, we take an initial step toward trajectory-driven self-design for distributed black-box consensus optimization.

#87
Agents & Tool Use 2026-04-24 Hugging Face Daily Papers 4.2 4.5/4.6/3.5

Analog circuit design relies heavily on reusing existing intellectual property (IP), yet searching across heterogeneous representations such as SPICE netlists, schematics, and functional descriptions remains challenging. Existing methods are largely limited to exact matching within a single modality, failing to capture cross-modal semantic relationships. To bridge this gap, we present AnalogRetriever, a unified tri-modal retrieval framework for analog circuit search.

#88
Research 2026-05-03 Hugging Face Daily Papers 4.2 4.5/4.6/3.5

We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems.

#89
Research 2026-04-25 Hugging Face Daily Papers 4.2 4.5/4.6/3.5

Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes.

#90
Research 2026-05-03 Hugging Face Daily Papers 4.2 4.5/4.6/3.5

We present Orbit-Space Geometric Probability Paths (OGPP), a particle-native flow-matching framework for generative modeling of particle systems. OGPP is motivated by two insights: (i) particles are defined up to permutation symmetries, so anonymous indexing inflates per-index target variance and yields curved, hard-to-learn flows; and (ii) particles live in physical space, so the flow terminal velocity has physical meaning and can encode geometric attributes, e.g., surface normals. OGPP instantiates three key components: (1) orbit-space canonicalization of the probability-path terminal endpoint, (2) particle index embeddings for role specialization, and (3) geometric probability paths with …

#91
Industry 2026-05-04 Simon Willison's Weblog 4.2 4.5/4.7/3.0

[...] Between 2000 and 2024, farmers sold in total a Colorado-sized chunk of land all on their own, 77 times all land on data center property in 2028, and grew more food than ever on what was left. None of this caused any problems for US food access. And then, in the middle of all this, a farmer in Loudoun County sells a few acres of mediocre hay field to a hyperscaler for ten times its agricultural value, and the response is that we’re running out of farmland. — Andy Masley, pushing back against the "land use" argument against data center construction Tags: ai-ethics, ai, generative-ai, andy-

#92
Government & Defense 2026-05-04 DefenseScoop 4.2 4.5/4.7/3.0

Defending against the next wave of AI-driven cyberattacks. Cyber threats targeting defense networks and the defense industrial base are evolving at unprecedented speed and scale. New research highlights how AI-powered botnets and low-cost, “attack-for-hire” services are enabling hyper-volumetric DDoS attacks capable of overwhelming infrastructure in seconds, often faster than traditional defenses can respond. As these attacks become more automated, distributed and difficult to detect, defense organizations must rethink how they protect mission-c

#93
Government & Defense 2026-05-04 FedScoop — AI 4.2 4.5/4.7/3.0

Two FAA partners ramp up hiring, preparations for ATC overhaul. L3Harris and Indra are working behind the scenes to increase capacity as the fiber cable and radar providers advance plans aimed at improving the FAA’s efficacy.

#94
Research 2026-05-04 arXiv cs.LG 4.1 3.7/5.5/3.0

Ensuring the coherence of regional socio-economic statistics is a central task for national statistical institutes. Traditional validation tools, such as range edits, ratio checks, or univariate outlier detection, are effective for identifying extreme values in individual series but are less suited for detecting unusual combinations of indicators in high-dimensional settings. This paper proposes an unsupervised machine learning framework for identifying structurally atypical regional profiles within Europe using publicly available Eurostat data.

cs.LG
#95
Research 2026-05-04 arXiv cs.LG 4.1 3.7/5.5/3.0

Motion in-betweening is one of the most artistically demanding and time-consuming stages of 3D animation, where the expressivity and rhythm of motion are defined. The level of creative control it requires makes it a major production bottleneck, underscoring the need for intelligent tools that assist animators in this process. Although recent deep learning approaches have achieved strong results in motion synthesis and in-betweening, they assume data characteristics, motion styles, and problem formulations that diverge from professional animation workflows.

cs.GR cs.LG
#96
Research 2026-05-04 arXiv cs.LG 4.1 3.7/5.5/3.0

Effective pair programming depends on coordination of attention, cognitive effort, and joint regulation over time, yet most adaptive learning systems remain individual-centric and reactive. This paper introduces ProPACT, a proactive AI-driven adaptive collaborative tutor that treats collaboration itself as the object of instruction. ProPACT constructs a multimodal dyadic learner model based on Joint Visual Attention (JVA), Joint Mental Effort (JME), and individual mental effort, and employs an XGBoost-based forecasting model to predict emerging suboptimal collaboration states up to 30 seconds in advance.

cs.HC cs.AI cs.LG
#97
Efficiency 2026-05-04 arXiv cs.LG 4.1 3.7/5.5/3.0

The proliferation of large-scale and structurally complex data has spurred the integration of machine learning methods into statistical modeling. Recurrent neural networks (RNNs), a foundational class of models for time-dependent data, can be viewed as nonlinear extensions of classical autoregressive moving average models. Despite their flexibility and empirical success in machine learning, RNNs often suffer from limited interpretability and slow training, which hinders their use in statistics.

stat.ML cs.LG
#98
Efficiency 2026-05-04 arXiv cs.LG 4.1 3.7/5.5/3.0

Preference optimization has become a central paradigm for aligning large language models with human feedback. Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback by directly optimizing pairwise preferences, removing the need for reward modeling and policy optimization. However, recent work shows that DPO exhibits a squeezing effect, where negative gradients applied to rejected responses concentrate probability mass on high-confidence predictions while suppressing alternative responses.
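
For reference, the standard DPO objective for one preference pair is −log σ(β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]), where y_w is the chosen and y_l the rejected response. A minimal sketch follows. The function and argument names are illustrative, beta=0.1 is an arbitrary default, and the inputs are assumed to be sequence-level log-probabilities. It shows the baseline loss only, not the paper's response to the squeezing effect.

```python
import math

def _softplus(x: float) -> float:
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair."""
    # Implicit rewards: policy log-prob relative to the frozen reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return _softplus(-logits)

# A policy identical to the reference sits at log(2); preferring the
# chosen response lowers the loss, preferring the rejected one raises it.
loss_neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
loss_good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
loss_bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

The loss can fall either by raising the chosen response's log-probability or by lowering the rejected one's. The second route is the negative gradient on rejected responses that the abstract refers to.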

cs.LG
#99
Research 2026-05-04 arXiv cs.LG 4.1 3.7/5.7/3.0

The lack of analytical models describing diffusion time dependence at intermediate time scales in complex tissue microstructure limits the accurate quantification of extracellular diffusivity and tissue microstructure. We introduce TRACED, a biophysical model that incorporates diffusion time dependence in cell distributions to quantify pathologically-relevant properties in solid tumors. Neural networks were trained on Monte Carlo diffusion simulations using sphere distribution-based geometries to enable the rapid computation of time-dependent diffusion MRI signals in cell populations of variable cell size.

physics.med-ph cs.LG eess.IV
#100
Research 2026-05-04 arXiv cs.LG 4.1 4.7/4.6/3.0

The effectiveness of active learning hinges on the choice of the acquisition criterion by which a learning algorithm selects potentially informative data points whose label is subsequently queried. This paper proposes a novel gradient-based acquisition criterion, derived from a generalization bound introduced by Luo et al. (2022).

cs.LG
#101
Research 2026-05-04 arXiv cs.LG 4.1 4.0/5.3/3.0

Continual learning systems face a fundamental tension between plasticity -- acquiring new knowledge -- and stability -- retaining prior knowledge. We introduce MPCS (Multi-Plasticity Continual System), a neuroplastic architecture that integrates eleven complementary mechanisms: task-driven neurogenesis, Fourier-encoded inputs, EWC regularization, meta-replay, mixed consolidation, hybrid gating, synapse pruning/regeneration, Hebbian updates, task similarity routing, adaptive growth control, and continuous neuron importance tracking. We evaluate MPCS on MEP-BENCH, a multi-track benchmark spanning 31 tasks across regression, classification, logic, and mixed domains, using a three-dimensional Pa…

cs.LG cs.NE
#102
Safety, Policy & Regulation 2026-05-04 arXiv cs.CL 4.1 3.7/5.5/3.0

Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage in multi-turn conversational social dilemma games.

cs.AI cs.CL
#103
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 4.1 4.7/4.6/3.0

There is growing interest in exploring user simulation as an alternative to gathering and scoring real user-chatbot interactions for AI chatbot evaluation. For this purpose, it is important to ensure the realism of the simulation, i.e., the extent to which simulated dialogues reflect real dialogues users have with chatbots. Most existing methods for evaluating simulation realism produce only a coarse quality signal and remain solely at the level of individual dialogues.

cs.CL
#104
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 4.1 4.0/5.3/3.0

Retrieval-Augmented Generation (RAG) offers a well-established path to grounding large language model (LLM) outputs in external knowledge, yet the question of which retrieval strategy works best in a high-stakes domain such as biomedicine has not received the controlled, multi-metric treatment it deserves. This paper presents a systematic empirical comparison of five retrieval strategies -- Dense Vector Search, Hybrid BM25 + Dense retrieval, Cross-Encoder Reranking, Multi-Query Expansion, and Maximal Marginal Relevance (MMR) -- within a biomedical question-answering RAG pipeline. All strategies share a fixed generation model (GPT-4o-mini), a common vector store (ChromaDB), and OpenAI's text-…
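
Of the five strategies listed, Maximal Marginal Relevance is the easiest to show in a few lines: it greedily picks the candidate that maximizes λ·sim(doc, query) − (1 − λ)·max sim(doc, already selected). The sketch below is a generic textbook-style version over plain Python lists. It is not the paper's pipeline, and the toy vectors, `lam` default, and function names are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mmr_select(query_vec, doc_vecs, k, lam=0.5):
    """Return indices of up to k documents chosen by greedy MMR."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy example: documents 0 and 1 are near-duplicates, document 2 is different.
query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
diverse = mmr_select(query, docs, k=2, lam=0.5)        # trades relevance for novelty
relevance_only = mmr_select(query, docs, k=2, lam=1.0)  # pure similarity ranking
```

With `lam=1.0` the procedure reduces to ordinary similarity ranking, so the parameter directly controls how much diversity is bought at the cost of relevance.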

cs.CL cs.AI cs.IR
#105
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 4.1 4.0/5.3/3.0

Deployed language models must produce outputs that are both correct and format-compliant. We study this structured-output reliability gap using two mathematical benchmarks -- GSM8K and MATH -- as a controlled testbed: ground truth is unambiguous and the output contract is strict (JSON with required fields). We evaluate three 7-9B models under five prompting strategies and report output accuracy -- the joint event of mathematical correctness and valid JSON structure -- as the primary metric.
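
The primary metric described above counts a response as a success only when it both parses as valid JSON with the required fields and carries the correct answer. A minimal sketch of such a scorer is given below. The field names `reasoning` and `answer`, the numeric tolerance, and the function names are hypothetical, since the abstract says only that the contract is JSON with required fields.

```python
import json

REQUIRED_FIELDS = ("reasoning", "answer")  # hypothetical output contract

def parse_structured(raw):
    """Return the parsed dict if raw is valid JSON with all required fields, else None."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict) or any(f not in obj for f in REQUIRED_FIELDS):
        return None
    return obj

def is_joint_success(raw, gold, tol=1e-9):
    """True only if the output is format-compliant AND numerically correct."""
    obj = parse_structured(raw)
    if obj is None:
        return False
    try:
        return abs(float(obj["answer"]) - float(gold)) <= tol
    except (TypeError, ValueError):
        return False

def output_accuracy(outputs, golds):
    """Fraction of outputs that are both valid and correct."""
    if not outputs:
        return 0.0
    return sum(is_joint_success(o, g) for o, g in zip(outputs, golds)) / len(outputs)

samples = [
    '{"reasoning": "2 + 2", "answer": 4}',   # valid and correct
    '{"answer": 4}',                          # correct but missing a field
    'The answer is 4',                        # correct but not JSON
    '{"reasoning": "guess", "answer": 5}',    # valid but wrong
]
accuracy = output_accuracy(samples, [4, 4, 4, 4])
```

Only the first sample counts here, which illustrates how a model can be right three times out of four mathematically and still score 25% on the joint metric.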

cs.CL cs.AI cs.LG
#106
Robotic Autonomy 2026-05-04 arXiv cs.RO 4.1 3.7/5.5/3.0

The integration of Large Language Model (LLM) reasoning principles into classical robot path planning represents a rapidly emerging research direction. In this paper, we propose a Semantic Risk-Aware Heuristic (SRAH) planner that encodes LLM-inspired cost functions penalising geometrically cluttered or high-risk zones into an A* search framework, augmented with closed-loop replanning upon dynamic obstacle detection. We evaluate SRAH against two established baselines, Breadth-First Search (BFS) with replanning and a Greedy heuristic without replanning, across 200 randomised trials in a 15×15 grid-world with 20% static obstacle density and stochastic dynamic obstacles.
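
To make the idea of a clutter-penalising cost inside A* concrete, here is a minimal grid-world sketch. It is not the SRAH planner: the penalty (a weight times the number of obstacle cells among the eight neighbours), the 4-connected moves, and all names are assumptions chosen for illustration, and the replanning loop is omitted.

```python
import heapq

def clutter(grid, r, c):
    """Count obstacle cells among the 8 neighbours of (r, c)."""
    count = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < len(grid) and 0 <= cc < len(grid[0]) and grid[rr][cc] == 1:
                count += 1
    return count

def risk_aware_astar(grid, start, goal, risk_weight=0.5):
    """A* on a 4-connected grid; entering a cell costs 1 + risk_weight * clutter."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance stays admissible because every step costs at least 1.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0.0, start)]
    best_cost = {start: 0.0}
    parent = {start: None}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if g > best_cost[cell]:
            continue  # stale heap entry
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == 1:
                continue
            new_g = g + 1.0 + risk_weight * clutter(grid, nxt[0], nxt[1])
            if new_g < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_g
                parent[nxt] = cell
                heapq.heappush(open_heap, (new_g + h(nxt), new_g, nxt))
    return None  # goal unreachable

# A wall along the top row makes the direct route (row 1) "cluttered".
grid = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
direct = risk_aware_astar(grid, (1, 0), (1, 4), risk_weight=0.0)
cautious = risk_aware_astar(grid, (1, 0), (1, 4), risk_weight=1.0)
```

With the penalty switched off the planner hugs the wall; with `risk_weight=1.0` it accepts two extra steps to travel through the open row instead.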

cs.RO
#107
Robotic Autonomy 2026-05-04 arXiv cs.RO 4.1 3.7/5.5/3.0

Traditional Simultaneous Localization and Mapping (SLAM) algorithms rely heavily on the static environment assumption, which severely limits their applicability in real-world spaces populated by moving entities, such as pedestrians. In this work, we propose DynoSLAM, a tightly-coupled Dynamic GraphSLAM architecture that integrates socially-aware Graph Neural Networks (GNNs) directly into the factor graph optimization. Unlike conventional approaches that use rigid constant-velocity heuristics or deterministic single-agent neural priors, our framework formulates pedestrian motion forecasting as a stochastic World Model.

cs.RO cs.CV
#108
Robotic Autonomy 2026-05-04 arXiv cs.RO 4.1 4.7/4.6/3.0

Autonomous surface vessels for floating-waste removal operate under varying hydrodynamics, external disturbances, and challenging water-surface perception. We present a field-validated system that combines camera-based polarimetric perception with a lightweight DRL-based controller for floating-waste detection and capture. Camera detections are converted into water-surface target points and tracked by a controller trained entirely in simulation and deployed directly on a retrofitted ASV platform.

cs.RO
#109
Robotic Autonomy 2026-05-04 arXiv cs.RO 4.1 3.7/5.5/3.0

This paper addresses the problem of mobile grasping in dynamic, unknown environments where a robot must operate under a limited field-of-view. The fundamental challenge is the inherent trade-off between "seeing" around to reduce environmental uncertainty and "moving" the body to achieve task progress in a high-dimensional configuration space, subject to visibility constraints. Previous approaches often assume known or static environments and decouple these objectives, failing to guarantee safety when unobserved dynamic obstacles intersect the robot's path during manipulation.

cs.RO
#110
Robotic Autonomy 2026-05-04 arXiv cs.RO 4.1 3.7/5.5/3.0

Agile unmanned aerial vehicle (UAV) navigation in cluttered environments demands a planning architecture that is both computationally efficient and structurally expressive enough to reason over multiple feasible motions. This paper presents SAGA, a robust self-attention and goal-aware anchor-based planner for safe UAV autonomous navigation. SAGA formulates local planning as a one-stage joint regression-and-ranking problem over a fixed lattice of motion anchors.

cs.RO
#111
Multimodal 2026-05-04 arXiv cs.CV 4.1 3.7/5.5/3.0

Accurately recovering human pose and appearance from video is an essential component of scene reconstruction, with applications to motion capture, motion prediction, virtual reality, and digital twinning. Despite significant interest in building realistic human avatars from video, this paper demonstrates that existing methods do not accurately recover the 3D geometry of humans. ViT-based approaches are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting-based avatars treat pose and appearance separately, limiting rendering generalization to new poses.

cs.CV
#112
Multimodal 2026-05-04 arXiv cs.CV 4.1 3.7/5.5/3.0

Online mapping and end-to-end (E2E) planning in autonomous driving remain largely sensor-centric, leaving rich map priors, including HD/SD vector maps, rasterized SD maps, and satellite imagery, underused because of heterogeneity, pose drift, and inconsistent availability at test time. We present UMPE, a Unified Map Prior Encoder that can ingest any subset of four priors and fuse them with BEV features for both mapping and planning. UMPE has two branches.

cs.CV
#113
Frontier LLMs 2026-05-04 Simon Willison's Weblog 4.1 4.8/4.0/3.0

April 2026 newsletter. I just sent out the April edition of my sponsors-only monthly newsletter. If you are a sponsor (or if you start a sponsorship now) you can access it here. In this month's newsletter: Opus 4.7 and GPT-5.5, both with price increases; Claude Mythos and LLM security research; ChatGPT Images 2.0; more model releases; other highlights from my blog; What I'm using, April 2026 edition. Here's a copy of the March newslett

#114
Industry 2026-05-04 Simon Willison's Weblog 4.1 4.8/4.0/3.0

Research: TRE Python binding — ReDoS robustness demo If it's good enough for antirez to add to Redis I figured Ville Laurikari's TRE regular expression engine was worth exploring in a little more detail. I had Claude Code build an experimental Python binding (it used ctypes) and try some malicious regular expression attacks against the library. TRE handles those much better than Python's standard library implementation, thanks mainly to the lack of support for backtracking. Tags: security, python, regular-expressions, c, ctypes
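
The backtracking weakness referred to above is easy to reproduce with Python's standard `re` module alone. The sketch below is not the experiment from the post and does not use the TRE binding; it times a classic nested-quantifier pattern against inputs that almost match, with the pattern and input sizes chosen purely for illustration.

```python
import re
import time

# Nested quantifier: a textbook catastrophic-backtracking pattern.
EVIL = re.compile(r"(a+)+$")

def time_failed_match(n: int) -> float:
    """Seconds taken to reject 'a' * n followed by a character that blocks the match."""
    text = "a" * n + "!"
    start = time.perf_counter()
    result = EVIL.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing "!" guarantees failure
    return elapsed

# Each extra "a" roughly doubles the work for a backtracking engine.
timings = {n: time_failed_match(n) for n in (8, 12, 16, 20)}
for n, seconds in timings.items():
    print(f"n={n:2d}  {seconds:.6f}s")
```

Keep `n` small when trying this: the running time grows exponentially, so inputs only a little longer than these can stall the interpreter for a long time.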

#115
Agents & Tool Use 2026-05-04 arXiv cs.LG 4.0 3.7/5.3/3.0

Scientific peer review increasingly struggles to assess reproducibility at the scale and complexity of modern research output. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, which often exceeds what human reviewers can provide. Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents.

cs.DL cs.LG
#116
Research 2026-05-04 arXiv cs.LG 4.0 3.7/5.3/3.0

Risk scores are an interpretable and actionable class of machine learning models with applications in medicine, insurance, and risk management. Unlike most computational methods, risk scores are designed to be computed by a human by attributing points to a data sample based on a limited set of criteria. The most common approaches for generating risk scores use linear regressions to estimate the effect of selected variables.

cs.LG
#117
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 4.0 3.7/5.3/3.0

Stories hold a reader's attention because they have causes, secrets, and consequences. Shadow-Loom is an experimental open-source framework that turns a narrative into a versioned graphical world model and lets two engines act on it: a causal physics grounded in Pearl's ladder of causation and a recently proposed counterfactual calculus over Ancestral Multi-World Networks; and a narrative physics that scores the same graph against four structural reader-states -- mystery, dramatic irony, suspense, and surprise -- in the tradition of Sternberg's curiosity/suspense/surprise triad, with suspense formalised in the structural-affect line of work on story comprehension and computational suspense. …

cs.AI cs.CL
#118
Evaluations & Benchmarks 2026-05-04 arXiv cs.AI 4.0 3.7/5.3/3.0

Scientists increasingly rely on sensor-based data, yet transforming raw streams into insights across the edge-to-cloud continuum remains difficult. Provisioning heterogeneous infrastructure and managing execution on emerging platforms like Data Processing Units typically requires cross-domain expertise, creating significant barriers to rapid prototyping. This paper introduces an experience-driven methodology for the rapid development of sensor-driven applications.

cs.DC cs.AI cs.SE
#119
Robotic Autonomy 2026-05-04 arXiv cs.RO 4.0 3.7/5.3/3.0

Dual-system Vision-Language-Action (VLA) models achieve state-of-the-art robotic manipulation but are bottlenecked by the VLM backbone, which must execute at every control step while producing temporally redundant features. We propose Latent Bridge, a lightweight model that predicts VLM output deltas between timesteps, enabling the action head to operate on predicted outputs while the expensive VLM backbone is called only periodically. We instantiate Latent Bridge on two architecturally distinct VLAs: GR00T-N1.6 (feature-space bridge) and π0.5 (KV-cache bridge), demonstrating that the approach generalizes across VLA designs.

cs.RO
#120
Robotic Autonomy 2026-05-04 arXiv cs.RO 4.0 3.7/5.3/3.0

Humans grasp unfamiliar objects by combining an initial visual estimate with tactile and proprioceptive feedback during interaction. We present ShapeGrasp, a robotic implementation of this approach. The proposed method is an iterative grasp-and-complete pipeline that couples implicit surface visuo-haptic shape completion (creation of full 3D shape from partial information) with physics-based grasp planning.

cs.RO
#121
Multimodal 2026-05-04 arXiv cs.CV 4.0 5.0/4.0/3.0

Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored.

cs.CV
#122
Research 2026-05-04 arXiv stat.ML 4.0 3.7/5.3/3.0

This chapter introduces the Bayesian reflex -- an analogy with the autonomic nervous system -- as a unifying framework for online learning in AI. Bayesian online algorithms automatically maintain equilibrium in dynamic environments via three mechanisms: belief maintenance through probabilistic representations, sequential updating via Bayes' theorem, and uncertainty-driven action balancing exploration and exploitation. We survey online Bayesian methods, highlighting two computational principles: the look-up table principle for sequential inference in function space, and the ellipsoidal decomposition framework for nearly exact i.i.d.

stat.ME stat.ML
#123
Research 2026-04-30 Hugging Face Daily Papers 4.0 4.5/4.0/3.5

Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an end-to-end training pipeline that jointly optimizes reconstruction and generation, enabling direct supervision from generation results to the tokenizer. This contrasts with prior two-stage approaches that train tokenizers and generative models separately.

#124
Research 2026-04-30 Hugging Face Daily Papers 4.0 4.5/4.0/3.5

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105.

#125
Research 2026-05-02 Hugging Face Daily Papers 4.0 4.5/4.0/3.5

Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping.

#126
Research 2026-04-26 Hugging Face Daily Papers 4.0 3.7/5.3/3.0

We introduce Soft Anisotropic Diagrams (SAD), an explicit and differentiable image representation parameterized by a set of adaptive sites in the image plane. In SAD, each site specifies an anisotropic metric and an additively weighted distance score, and we compute pixel colors as a softmax blend over a small per-pixel top-K subset of sites. We induce a soft anisotropic additively weighted Voronoi partition (i.e., an Apollonius diagram) with learnable per-site temperatures, preserving informative gradients while allowing clear, content-aligned boundaries and explicit ownership.
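
One possible reading of the per-pixel computation described above is sketched below. It is an interpretation, not the authors' implementation: the score is assumed to be the metric-induced distance to a site minus an additive weight, the blend is assumed to be a softmax over negative scores divided by each site's temperature, and the dictionary layout and names are invented for the example.

```python
import math

def site_score(pixel, site):
    """Anisotropic distance to the site minus its additive weight (lower is closer)."""
    dx = pixel[0] - site["pos"][0]
    dy = pixel[1] - site["pos"][1]
    (a, b), (_, d) = site["metric"]  # symmetric 2x2 matrix [[a, b], [b, d]]
    quad = a * dx * dx + 2.0 * b * dx * dy + d * dy * dy
    return math.sqrt(max(quad, 0.0)) - site["weight"]

def blend_color(pixel, sites, top_k=2):
    """Softmax-blend the colours of the top_k best-scoring sites for one pixel."""
    nearest = sorted(sites, key=lambda s: site_score(pixel, s))[:top_k]
    logits = [-site_score(pixel, s) / s["temperature"] for s in nearest]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # shift for numerical stability
    total = sum(exps)
    return tuple(
        sum(e / total * s["color"][ch] for e, s in zip(exps, nearest))
        for ch in range(3)
    )

identity = ((1.0, 0.0), (0.0, 1.0))
sites = [
    {"pos": (0.0, 0.0), "metric": identity, "weight": 0.0,
     "temperature": 0.1, "color": (1.0, 0.0, 0.0)},
    {"pos": (10.0, 0.0), "metric": identity, "weight": 0.0,
     "temperature": 0.1, "color": (0.0, 0.0, 1.0)},
]
on_site = blend_color((0.0, 0.0), sites)   # dominated by the first site
midpoint = blend_color((5.0, 0.0), sites)  # equal scores, even blend
```

Lower temperatures push the blend toward a hard Voronoi-style assignment, while higher ones soften the boundary, which is consistent with the abstract's description of clear boundaries with preserved gradients.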

#127
Agents & Tool Use 2026-05-04 GitHub Blog — AI & ML 4.0 4.8/4.0/3.0

OpenClaw, one of the fastest-growing open source projects, has already picked up over 350,000 stars and an early community of builders exploring what agentic systems can actually do in practice. That’s why, on June 3, 2026, we are hosting OpenClaw: After Hours at GitHub HQ in San Francisco. The event will take place during Microsoft Build 2026. This evening is a chance to bring the OpenClaw community together into the same room.

#128
Efficiency 2026-05-04 arXiv cs.LG 3.9 4.0/4.7/3.0

Cross-language code clone detection (X-CCD) is challenging because semantically equivalent programs written in different languages often share little surface similarity. Although large language models (LLMs) have shown promise for semantic clone detection, their use as black-box systems raises concerns about cost, reproducibility, privacy, and unreliable output formatting. In particular, compact open-source models often struggle to follow reasoning-oriented prompts and to produce outputs that can be consistently mapped to binary clone labels.

cs.AI cs.LG cs.SE
#129
Multimodal 2026-05-04 arXiv cs.LG 3.9 4.0/4.6/3.0

Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities.

cs.CV cs.LG
#130
Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 3.9 4.0/4.6/3.0

With each sensing modality exhibiting inherent strengths and limitations, multi-modal approaches for wearable Human Activity Recognition (HAR) are becoming increasingly relevant -- particularly for recognizing Activities of Daily Living (ADLs), where individual modalities often produce ambiguous signals for similar or complex activities. This work introduces HARMES, a multi-modal wearable dataset combining three wrist-recorded modalities: motion sensing via an Inertial Measurement Unit (IMU), atmospheric environmental sensors (humidity, temperature, and pressure), and audio. Collected from 20 participants performing household activities in their own homes, HARMES totals over 80 hours of reco…

cs.LG
#131
Research 2026-05-04 arXiv cs.LG 3.9 4.0/4.7/3.0

DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection over compressed keys) scores them, the top-k are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S, H_I, T] FP32 score tensor before the top-k reduction. With H_I=64 indexer heads and the V4-Flash compression ratio m=4, that intermediate is 256 GB at sequence length S=65,536, exceeding any single-GPU high-bandwidth-memory (HBM) budget.
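
The 256 GB figure can be reproduced with simple arithmetic, under two assumptions not stated explicitly above: a batch size of B = 1, and a compressed key length of T = S / m. The function name below is illustrative.

```python
def score_tensor_bytes(batch: int, seq_len: int, indexer_heads: int,
                       compression_ratio: int, bytes_per_elem: int = 4) -> int:
    """Size in bytes of a dense [B, S, H_I, T] score tensor, assuming T = S / m."""
    compressed_len = seq_len // compression_ratio
    return batch * seq_len * indexer_heads * compressed_len * bytes_per_elem

# B = 1 (assumed), S = 65,536, H_I = 64, m = 4, FP32 = 4 bytes per element.
size_bytes = score_tensor_bytes(batch=1, seq_len=65_536,
                                indexer_heads=64, compression_ratio=4)
size_gib = size_bytes / 2**30
print(f"{size_gib:.0f} GiB")  # prints "256 GiB"
```

Because the tensor scales with S × T, it grows quadratically in sequence length, which is why materializing it before the top-k reduction becomes the bottleneck.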

cs.LG cs.PF
#132
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.9 4.0/4.6/3.0

As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes.
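
An orchestration trace of the kind described above can be pictured as a time-ordered list of typed events between agents. The data structure below is a hypothetical sketch: the event kinds are taken from the list in the abstract, but the class names, fields, and example agents are assumptions rather than the paper's formalism.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class EventKind(Enum):
    SPAWN = "spawn"
    DELEGATE = "delegate"
    COMMUNICATE = "communicate"
    TOOL_USE = "tool_use"
    RETURN = "return"
    AGGREGATE = "aggregate"
    STOP = "stop"

@dataclass(frozen=True)
class TraceEvent:
    step: int                     # position in time
    kind: EventKind
    source: str                   # agent that performs the action
    target: Optional[str] = None  # receiving agent, if any

@dataclass
class OrchestrationTrace:
    events: List[TraceEvent] = field(default_factory=list)

    def add(self, step, kind, source, target=None):
        self.events.append(TraceEvent(step, kind, source, target))

    def agents(self):
        """All agents that appear as a source or target."""
        names = {e.source for e in self.events}
        names |= {e.target for e in self.events if e.target is not None}
        return names

    def edges(self):
        """Time-ordered (source, target, kind) triples for agent-to-agent events."""
        ordered = sorted(self.events, key=lambda e: e.step)
        return [(e.source, e.target, e.kind) for e in ordered if e.target is not None]

    def kind_counts(self):
        return Counter(e.kind for e in self.events)

trace = OrchestrationTrace()
trace.add(0, EventKind.SPAWN, "planner", "coder")
trace.add(1, EventKind.DELEGATE, "planner", "coder")
trace.add(2, EventKind.TOOL_USE, "coder")
trace.add(3, EventKind.RETURN, "coder", "planner")
trace.add(4, EventKind.AGGREGATE, "planner")
trace.add(5, EventKind.STOP, "planner")
```

A representation like this makes it possible to attach rewards or credit to individual orchestration decisions, such as when to spawn or stop, rather than only to the final answer.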

cs.CL
#133
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.9 4.0/4.6/3.0

Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech. Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but it is unclear whether these models can make use of such information. We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech.

cs.AI cs.CL eess.AS
#134
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.9 4.0/4.6/3.0

In Emotion Recognition in Conversations (ERC), model decisions should align with nuanced human perception and ideally provide insights on the classification process. Standard encoder pre-trained language models (PLMs) are the state-of-the-art at these tasks but offer little insight into why a certain prediction is made. This is especially problematic in imbalanced datasets, where most utterances are labeled as neutral, making these models frequently misclassify minority emotions as the majority neutral class.

cs.CL cs.AI
#135
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.9 4.0/4.6/3.0

Most hallucination evaluations focus on English, leaving it unclear whether findings transfer to lower-resource languages. We investigate faithfulness hallucinations, defined as model-generated content that is fluent and plausible but diverges from the provided input or is internally inconsistent. Leveraging the multilingual MultiWikiQA dataset, we utilize the LettuceDetect framework to create synthetic hallucination datasets for 306 languages, from which we train token-level hallucination classifiers for 30 European languages.

cs.CL
#136
Frontier LLMs 2026-05-04 arXiv cs.CL 3.9 4.0/4.7/3.0

Legal texts often contain computational legal clauses--provisions whose understanding requires complex logic. While frontier Large Reasoning Models (LRMs) can describe such clauses, building production-ready systems is limited by reasoning errors and the high cost of inference. We propose Amortized Intelligence, a neuro-symbolic approach where we use an LLM once to translate a legal text into Deterministic Autonomous Contract Language (DACL): a typed graph intermediate representation.

cs.CL
#137
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.9 4.0/4.6/3.0

The digitization of old encyclopedias represents an important step toward improving access to historically structured knowledge. Often, however, this process does not go beyond optical character recognition, leaving all the underlying structure unexploited. In addition, many encyclopedias had multiple editions reflecting the evolution of knowledge.

cs.CL
#138
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.9 4.0/4.6/3.0

Large language models pick up social biases from the data they are trained on and carry those biases into downstream applications, often reinforcing stereotypes around gender, race, religion, disability, age, and socioeconomic status. The standard fixes (retraining on curated data or fine-tuning with human feedback) are expensive, need access to model weights, and risk degrading the model on other tasks. In this paper we take a different route: we debias the model at decoding time, treating bias mitigation as a structured search over candidate tokens without ever touching model weights.

cs.CL cs.LG
#139
Generative Media 2026-05-04 arXiv cs.AI 3.9 4.0/4.6/3.0

Accurate histological differentiation between adenocarcinoma (ADC) and squamous cell carcinoma (SCC) is critical for personalized treatment in non-small cell lung cancer (NSCLC). While [$^{18}$F]FDG PET/CT is a standard tool for the clinical evaluation of lung cancer, its utility is often limited by high costs and radiation exposure. In this paper, we investigate the feasibility of "virtual scanning" as a feature-enhancement strategy by evaluating whether synthetic PET data can provide complementary feature representations to supplement anatomical CT scans in histological subtype classification.

cs.CV cs.AI
#140
Multimodal 2026-05-04 arXiv cs.AI 3.9 4.0/4.6/3.0

The field of sensor-based human activity recognition (HAR) mainly uses posture, motion and context data of Inertial Measurement Units (IMUs) to identify daily activities. Despite the advancements in learning-based methods, it is challenging to perform information fusion from the temporal perspective due to the complexities in fusing heterogeneous sensor data and establishing long-term context correlations. This paper proposes a novel triple spectral fusion framework tailored for HAR.

cs.AI cs.CV cs.HC
#142
Multimodal 2026-05-04 arXiv cs.CV 3.9 4.0/4.7/3.0

Physical violence in public spaces is a significant public health concern, with minor incidents such as pushing often serving as precursors to more severe escalations. This research develops an automated system for the real-time detection of moderate physical violence, specifically pushing, in surveillance camera footage. The proposed solution integrates state-of-the-art computer vision models, utilizing YOLO11 and YOLO11-Pose for human detection and skeletal keypoint extraction.

cs.CV
#143
Efficiency 2026-05-04 arXiv stat.ML 3.9 3.2/5.5/3.0

The Distributional Alignment Game framework provides a powerful variational perspective on Answer-Level Fine-Tuning (ALFT). However, standard algorithms for these games rely on estimating logarithmic rewards from small batches, introducing a systematic bias due to Jensen's inequality that can destabilize training. In this paper, we systematically resolve this structural estimation bias.

cs.LG stat.ML
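The Jensen bias mentioned in the abstract above can be seen in a toy setting. The sketch below is not the paper's algorithm: it only assumes a hypothetical reward drawn from an exponential distribution with mean 1, so the true value of log E[X] is 0, and compares the plug-in estimate (log of the batch mean) averaged over many small batches against many large batches.

```python
import math
import random

def log_mean_estimate(samples):
    """Plug-in estimate of log E[X]: the log of the sample mean."""
    return math.log(sum(samples) / len(samples))

def average_plug_in(batch_size, trials, rng):
    """Average the plug-in estimate over many independent batches."""
    total = 0.0
    for _ in range(trials):
        # Hypothetical reward distribution: Exponential(1), so E[X] = 1 and log E[X] = 0.
        batch = [rng.expovariate(1.0) for _ in range(batch_size)]
        total += log_mean_estimate(batch)
    return total / trials

rng = random.Random(0)
small = average_plug_in(batch_size=4, trials=5000, rng=rng)    # small batches
large = average_plug_in(batch_size=256, trials=5000, rng=rng)  # large batches

print(f"mean estimate, batch of 4:   {small:+.4f}")
print(f"mean estimate, batch of 256: {large:+.4f}")
```

Because the logarithm is concave, E[log(batch mean)] is at most log E[X], so the small-batch average sits visibly below 0 while the large-batch average is close to it. More trials do not remove the gap; only larger batches or a bias correction do.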
#144
Research 2026-05-04 arXiv cs.LG 3.8 3.7/4.7/3.0

Understanding whether deep neural networks are effectively optimized remains challenging, as training occurs in highly nonconvex landscapes and standard metrics provide limited visibility into layer-wise learning quality. This challenge is particularly acute for transformer-based language models, where training is expensive, models are often reused in frozen form, and poorly optimized layers can silently degrade performance. We propose a layer-wise peeling framework for monitoring training dynamics, in which each transformer layer is locally optimized against intermediate representations of the trained model.

cs.LG
#145
AI for Science 2026-05-04 arXiv cs.LG 3.8 3.7/4.7/3.0

Structure-aware graph generation aims to generate graphs that satisfy given topological properties. It has applications in domains such as drug discovery, social network modeling, and knowledge graph construction. Unlike existing methods that only provide coarse control over graph properties, we introduce a novel conditional variational autoencoder for fine-grained structural control in graph generation.

cs.AI cs.LG
#146
Reinforcement Learning 2026-05-04 arXiv cs.LG 3.8 3.7/4.6/3.0

LLMs are increasingly used for end-user task planning, yet their black-box nature limits users' ability to ensure reliability and control. While recent systems incorporate verification techniques, it remains unclear how users can effectively apply such rigid constraints to represent intent or adapt to real-world variability. For example, prior work finds that hard-only constraints are too rigid, and numeric flexibility weights confuse users.

cs.AI cs.HC cs.LG
#147
Efficiency 2026-05-04 arXiv cs.LG 3.8 3.7/4.6/3.0

Mobile crowdsensing (MCS) is a distributed sensing architecture that utilizes existing sensors on mobile units (MUs) to perform sensing tasks. A mobile crowdsensing platform (MCSP) publishes the sensing tasks and the MUs decide whether to participate in exchange for money. The MCS system is dynamic: the task requirements, the MUs' availability, and their available resources change over time.

cs.LG cs.NI
#148
Robotic Autonomy 2026-05-04 arXiv cs.LG 3.8 3.7/4.7/3.0

Learning data-efficient object dynamics models for robotic manipulation remains challenging, especially for deformable objects. A popular approach is to model objects as sets of 3D particles and learn their motion using graph neural networks. In practice, this is not enough to maintain physical feasibility over long horizons and may require large amounts of interaction data to learn.

cs.RO cs.AI cs.CV cs.LG
#149
Reinforcement Learning 2026-05-04 arXiv cs.LG 3.8 3.7/4.7/3.0

Estimating free energy differences quantifies thermodynamic preferences in molecular interactions, which is central to chemistry and drug discovery. Despite fruitful progress, existing methods still face key limitations: classical computational approaches remain prohibitively expensive due to their reliance on extensive molecular dynamics simulations, while deep learning-based methods are constrained by either less-expressive generative models or input dimensions tied to a specific system, resulting in negligible generalization. To address these challenges, we propose CARD, a generative framework that employs a novel radix-based decomposition to bijectively convert 3D coordinates into mixed …

cs.LG
#150
Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 3.8 3.7/4.7/3.0

Near-infrared (NIR; a.k.a. NIRS) deep-learning studies in chemometrics increasingly report mutually inconsistent conclusions regarding convolutional neural network (CNN) design, including small versus large kernels, shallow versus deep architectures, raw spectra versus preprocessing, and single-domain training versus transfer learning. As a result, the same architecture can appear superior in one study and inferior in another, creating a practical impasse for chemometric practitioners. In this review, we argue that these contradictions are not evidence of irreconcilable methods but a structurally expected consequence of uncontrolled moderating variables.

cs.LG physics.optics
#151
Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 3.8 3.7/4.6/3.0

Classic Network Intrusion Detection Systems (NIDS) often rely on manual feature engineering to extract meaningful patterns from network traffic data. However, this approach requires domain expertise and runs counter to the widely adopted principle of modern machine learning and neural networks: that models themselves should learn meaningful representations directly from data. We investigate whether tabular representation learning techniques can improve intrusion detection performance by automatically learning robust feature representations for NetFlow data.

cs.LG cs.CR
#152
Research 2026-05-04 arXiv cs.LG 3.8 3.7/4.6/3.0

Accurate prediction of Remaining Useful Life (RUL) in aero-engines is vital for predictive maintenance, improved operational reliability, and reduced lifecycle costs. While deep learning approaches have demonstrated strong potential in this area, most existing methods focus primarily on model architecture design and treat input features uniformly, often neglecting the influence of data preprocessing. In this work, we propose a novel preprocessing pipeline that enhances RUL prediction by improving data quality and temporal representation before model training.

cs.LG cs.AI
#153
Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 3.8 3.7/4.6/3.0

Sleep foundation models have recently demonstrated strong performance on in-domain polysomnography tasks, including sleep staging, apnea detection, and disease risk prediction. In this work, we investigate whether sleep biosignals can serve as an effective pretraining distribution for learning representations that transfer beyond sleep to adjacent domains. Following sleep foundation models, we perform sleep-only multimodal contrastive pretraining (with a leave-one-out objective) and evaluate transfer to non-sleep EEG and ECG, two well-benchmarked biosignal modalities with heterogeneous datasets and clinically meaningful downstream tasks.

cs.LG cs.AI
#154
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.8 3.7/4.6/3.0

SemEval-2026 Task 9 is focused on multilingual polarization detection. Specifically, it covers the identification of multilingual, multicultural and multievent polarization along three axes (in subtasks), namely detection, type, and manifestation. Online polarization presents a concern, because it is often followed by hate speech, offensive discourse, and social fragmentation.

cs.CL cs.AI
#155
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.8 3.7/4.6/3.0

The 2026 ACII Dyadic Conversations (ACII-DaiKon) Workshop & Challenge introduces a benchmark for modeling interpersonal affect and social dynamics in dyadic conversations. Although conversational affect modeling has advanced rapidly, most benchmarks remain speaker-centric and underrepresent coupled, time-evolving processes between partners, including directional influence, conversational timing coordination, and rapport development. To address this gap, ACII-DaiKon presents three coordinated sub-challenges built on a shared dataset: (1) directional interpersonal influence prediction, (2) turn-taking prediction (next-speaker and time-to-next-speech), and (3) rapport trajectory prediction acro…

cs.AI cs.CL cs.HC
#156
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.8 3.7/4.6/3.0

Tibetan text-to-speech (TTS) has long been challenged by scarce speech resources, significant dialectal variation, and the complex mapping between written text and spoken pronunciation. To address these issues, this work presents, to the best of our knowledge, the first large-model-based Tibetan TTS system in the industry, built upon a large speech synthesis model developed by Xingchen AGI Lab. The proposed system integrates data quality enhancement, Tibetan-oriented text representation and tokenizer adaptation, and cross-lingual adaptive training for low-resource Tibetan speech synthesis.

cs.SD cs.CL
#157
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.8 3.7/4.6/3.0

As the ecosystem of Large Language Model (LLM)-based agents expands rapidly, efficient and accurate Agent Discovery becomes a critical bottleneck for large-scale multi-agent collaboration. Existing approaches typically face a dichotomy: either relying on heavy-weight LLMs for intent parsing, leading to prohibitive latency (often exceeding 30 seconds), or using monolithic vector retrieval that sacrifices semantic precision for speed. To bridge this gap, we propose GRAIL (Granular Resonance-based Agent/AI Link), a novel framework achieving sub-400ms discovery latency without compromising accuracy.

cs.AI cs.CL cs.IR
#158
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.8 3.7/4.6/3.0

Multimodal sarcasm detection, which aims to precisely identify pragmatic incongruities between literal text and nonverbal cues, has gained substantial attention in multimodal understanding. Recent advancements have predominantly relied on naïve similarity-based attention mechanisms and uniform late fusion strategies. Furthermore, given that functional entanglement restricts traditional late fusions, we incorporate a scalar congruity routing mechanism and a prior-guided contextual graph. This mechanism anchors a generalized incongruity manifold through a two-stage asymmetric optimization driven by inconsistency-aware contrastive learning, selectively fusing only the most discriminative multi-g…

cs.CL cs.AI
#159
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.8 3.7/4.6/3.0

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, yet they remain susceptible to hallucinations -- generating content that is factually incorrect, unfaithful to provided context, or misaligned with user instructions. We present HalluScan, a comprehensive benchmark framework that systematically evaluates hallucination detection and mitigation across 72 configurations spanning 6 detection methods, 4 open-weight model families, and 3 diverse domains. We introduce three key contributions: (1) HalluScore, a novel composite metric that achieves a Pearson correlation of r = 0.41 with human expert judgments; (2) Adaptive Detectio…

cs.CL
#160
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.8 3.7/4.6/3.0

In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under an evaluation-oriented definition, reasoning requires selecting intermediate steps and halting according to input-dependent conditions, which we formalize as a search-like procedure. We show that single forward passes in scalable architectures are structurally limited in their ability to realize such variable-depth computation, motivating intermediate decoding and externalized reasoning traces as appropriate evaluation interfaces.

cs.AI cs.CL
#161
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.8 3.7/4.6/3.0

Novelty assessment is a critical yet complex task in the examination process for patent acceptance, requiring examiners to determine whether an invention is disclosed in a prior art document. The process involves intricate matching between specific features of a patent claim and passages in the prior art. While prior work has approached novelty prediction primarily as a binary classification task at the claim level, we argue that this formulation is susceptible to spurious correlations and lacks the granularity required for practical application.

cs.CL cs.AI cs.IR
#162
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.8 3.7/4.7/3.0

Machine-generated text (MGT) detection is critical for regulating online information ecosystems, yet existing detectors often underperform in few-shot settings and remain vulnerable to adversarial, humanizing attacks. To build accurate and robust detectors under limited supervision, we adopt a threat-modeling perspective and study detector vulnerabilities from an attacker's viewpoint under an output-only black-box setting. Motivated by this perspective, we propose RAG-GuidEd Attacker Strengthens ConTrastive Few-shot Detector (REACT), an adversarial training framework that improves both few-shot detection performance and robustness against attacks.

cs.CR cs.CL
#163
Frontier LLMs 2026-05-04 arXiv cs.CL 3.8 3.7/4.7/3.0

Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. However, standard scaling laws do not reliably extrapolate across mixture recipes or under repetition, leaving the selection of optimal data recipes at scale underdetermined. To solve this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition.

cs.CL
#164
Evaluations & Benchmarks 2026-05-04 arXiv cs.AI 3.8 3.7/4.6/3.0

Scientists increasingly rely on sensor-based data; however, transforming raw streams into insights across the edge-to-cloud continuum remains difficult due to the breadth of expertise required to coordinate the necessary data and computation flow. This paper introduces a pattern-based, AI-assisted methodology for rapid development of sensor-driven applications. Using Pegasus workflows executing on the FABRIC testbed, we demonstrate a 5-step development loop that shifts workflow construction and deployment from code-first to intent-first design.

cs.DC cs.AI cs.SE
#165
Evaluations & Benchmarks 2026-05-04 arXiv cs.AI 3.8 3.7/4.6/3.0

Probabilistic values, including Shapley values and semivalues, provide a model-agnostic framework to attribute the behavior of a black-box model to data points or features, with a wide range of applications including explainable artificial intelligence and data valuation. However, their exact computation requires utility evaluations over exponentially many coalitions, making Monte Carlo approximation essential in modern machine learning applications. Existing estimators are often developed through different identification strategies, including weighted averages, self-normalized weighting, regression adjustment, and weighted least squares.

cs.AI stat.ME stat.ML
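For readers unfamiliar with the Monte Carlo approximation the abstract above refers to, here is a minimal sketch of one classical estimator: permutation sampling for Shapley values. It is not taken from the paper, and it is only one of the estimator families the abstract lists. The utility function `toy_utility` and the player names are hypothetical, chosen so the exact Shapley values (a = 2, b = 3, c = 3) can be checked by hand.

```python
import random

def shapley_permutation_mc(players, utility, num_permutations, rng):
    """Estimate Shapley values by averaging marginal contributions over random orderings."""
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = utility(frozenset(coalition))
        for p in order:
            coalition.add(p)
            current = utility(frozenset(coalition))
            totals[p] += current - previous  # marginal contribution of p in this ordering
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}

# Hypothetical toy utility: additive worth plus a bonus when "a" and "b" are both present.
WORTH = {"a": 1.0, "b": 2.0, "c": 3.0}

def toy_utility(coalition):
    bonus = 2.0 if {"a", "b"} <= coalition else 0.0
    return sum(WORTH[p] for p in coalition) + bonus

values = shapley_permutation_mc(["a", "b", "c"], toy_utility, 2000, random.Random(0))
print(values)
```

Each sampled ordering costs one utility evaluation per player, so the budget is set by the number of permutations rather than by the exponentially many coalitions. The estimates always sum to the utility of the full coalition minus that of the empty one, which is a handy sanity check.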
#166
Evaluations & Benchmarks 2026-05-04 arXiv cs.AI 3.8 3.7/4.6/3.0

Large language models excel at complex reasoning, yet evaluating their intermediate steps remains challenging. Although process reward models provide step-wise supervision, they often suffer from a risk compensation effect, where incorrect steps are offset by later correct ones, assigning high rewards to flawed reasoning paths. This issue is further exacerbated in knowledge graph (KG) reasoning, as there may exist multiple paths between the start and end entities in the KGs, and a risky step can make the reasoning path flawed.

cs.AI
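The risk compensation effect described above can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that a path-level reward is the mean of per-step scores; the step scores are made up, and the minimum-based alternative shown is a common conservative choice, not necessarily what the paper proposes.

```python
def mean_score(step_scores):
    """Path reward as the average of step scores (assumed aggregation)."""
    return sum(step_scores) / len(step_scores)

def min_score(step_scores):
    """Path reward as the weakest step, so one bad step cannot be averaged away."""
    return min(step_scores)

# Hypothetical step-wise scores: the second step is clearly flawed.
step_scores = [0.9, 0.1, 0.95, 0.95]

path_mean = mean_score(step_scores)
path_min = min_score(step_scores)
print(f"mean aggregation: {path_mean:.3f}")
print(f"min aggregation:  {path_min:.3f}")
```

Under mean aggregation the flawed path still scores above 0.7 because the later high-scoring steps offset the bad one, while the minimum keeps the path's score at the level of its weakest step.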
#167
Frontier LLMs 2026-05-04 arXiv cs.AI 3.8 3.7/4.7/3.0

Solar photovoltaic (PV) deployment is expanding rapidly, yet detailed, up-to-date information on the spatial distribution and capacity of rooftop PV remains limited. This paper presents an open, scalable framework for detecting solar panels from open data and generating city-level solar power profiles. We leverage foundation vision AI models to detect solar panel geometries from open-source satellite imagery.

cs.AI
#168
Agents & Tool Use 2026-05-04 arXiv cs.AI 3.8 3.7/4.6/3.0

This paper presents ORPilot, an open-source agentic AI system that translates real-world business problems into solver-ready optimization models. Unlike academic LLM-for-OR tools that assume clean problem specifications with preformatted inline data, ORPilot is designed for production conditions: ambiguous descriptions, large-scale raw operational data, and the need for portability across solver backends. The system introduces four novel components: (1) a conversational interview agent to elicit complete problem specifications, (2) a data collection agent that retrieves data independently of prompts, (3) a parameter computation agent to bridge raw tabular data and model-ready parameters, and…

cs.AI
#169
Evaluations & Benchmarks 2026-05-04 arXiv cs.AI 3.8 3.7/4.6/3.0

The advent of foundation models has heralded a new era in medical artificial intelligence (AI), enabling the extraction of generalizable representations from large-scale unlabeled datasets. However, current ophthalmic AI paradigms are predominantly constrained to single-modality inference, thereby creating a dissonance with clinical practice where diagnosis relies on the synthesis of complementary imaging modalities. Furthermore, the deployment of high-performance AI in resource-limited settings is frequently impeded by the unavailability of advanced three-dimensional imaging hardware.

cs.CV cs.AI
#170
Evaluations & Benchmarks 2026-05-04 arXiv cs.AI 3.8 3.7/4.6/3.0

Hyperledger Fabric performance depends on many interacting configuration parameters, making manual tuning difficult. We study automated throughput tuning by treating benchmarking as a noisy black-box optimization problem and applying Bayesian optimization (BO) with dimensionality reduction (DR). We implement an end-to-end Caliper-in-the-loop pipeline that deploys candidate configurations, benchmarks them, and updates the optimizer from observed throughput.

cs.DC cs.AI
#171
Robotic Autonomy 2026-05-04 arXiv cs.RO 3.8 3.7/4.6/3.0

Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak real-world generalization. We present an efficient video augmentation framework that converts simulated VLA videos into realistic training videos while preserving task semantics and action trajectories. Our pipeline extracts structured conditions from simulation via video semantic segmentation and video captioning, rewrites captions to diversify environments, and uses a conditional video transfer model to synthesize r…

cs.CV cs.RO
#172
Robotic Autonomy 2026-05-04 arXiv cs.RO 3.8 3.7/4.6/3.0

Single-view RGB object pose estimators have reached a level of precision and efficiency that makes them good candidates for vision-based robot control. However, off-the-shelf methods lack temporal consistency and robustness that are mandatory for a stable feedback control. In this work, we develop a factor graph approach to enforce temporal consistency of the object pose estimates.

cs.RO cs.CV
#173
Robotic Autonomy 2026-05-04 arXiv cs.RO 3.8 3.7/4.7/3.0

Autonomous indoor mobile robots can navigate reliably to metric coordinates using established frameworks such as ROS 2 Navigation 2, yet they lack the ability to interpret natural language instructions that express intent rather than positions. Vision-Language Models offer the semantic reasoning required to bridge this gap, but their inference latency (2-9 seconds per decision on consumer hardware) and session-by-session amnesia limit practical deployment. This paper presents the Semantic Autonomy Stack, a six-layer reference framework for semantically autonomous indoor navigation, and validates a complete instance featuring hybrid deterministic-VLM reasoning and cross-robot adaptive memory …

cs.RO cs.AI
#174
Robotic Autonomy 2026-05-04 arXiv cs.RO 3.8 3.7/4.7/3.0

We study feedback motion planning for continuous-time stochastic nonlinear systems under signal temporal logic (STL) specifications. We propose a framework that synthesizes control policies for chance-constrained STL trajectory optimization problems, with the goal of ensuring that the closed-loop stochastic system satisfies a given STL formula with high probability (e.g., 99.99%). Our approach is based on a predicate erosion strategy that transforms the intractable stochastic problem into a deterministic STL trajectory optimization problem with tightened STL formula constraints.

cs.RO eess.SY
#175
Research 2026-05-04 arXiv cs.NE 3.8 3.7/4.7/3.0

The popular 2009 voxel-based videogame, Minecraft, contains several distinct disciplines. One of these is "parkour," gameplay that focuses on traversing a world's environment with maximum efficiency. The Minecraft online community has turned the game's physics engine into dynamic puzzles, requiring players to masterfully manipulate motion mechanics through frame-precise timing of keystrokes.

cs.NE
#176
Multimodal 2026-05-04 arXiv cs.CV 3.8 3.7/4.6/3.0

Personalized image completion aims to restore occluded regions in personal photos while preserving identity and appearance. Existing methods either rely on generic inpainting models that often fail to maintain identity consistency, or assume that suitable reference images are explicitly provided. In practice, suitable references are often not explicitly provided, requiring the system to search for identity-consistent images within personal photo collections.

cs.CV cs.IR
#177
Generative Media 2026-05-04 arXiv cs.CV 3.8 3.7/4.6/3.0

Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process using highly compact conditioning signals. In this work, we present ActDiff-VC, a diffusion-based video compression framework for the ultra-low-bitrate regime. Our method partitions videos into variable-length segments, transmits keyframes only when needed, and summarizes temporal dynamics using a compact set of tracked point trajectories.

cs.CV
#178
Multimodal 2026-05-04 arXiv cs.CV 3.8 3.7/4.6/3.0

We present FoR-Net, a lightweight architecture for semantic segmentation that focuses on identifying and enhancing hard regions. Instead of relying on heavy global modeling, FoR-Net adopts an efficient strategy that selectively emphasizes informative regions through a learned importance map and a Top-K activation mechanism. Specifically, a selector module predicts region-wise importance, enabling the model to focus on challenging areas such as thin structures and object boundaries.

cs.CV
#179
Generative Media 2026-05-04 arXiv cs.CV 3.8 3.7/4.6/3.0

Synthetic training has recently advanced brain MRI segmentation by enabling contrast-agnostic models trained entirely on generated data. However, most existing approaches rely on hundreds of automatically labeled templates, introducing systematic biases and limiting their flexibility to incorporate new anatomical structures. We present the Segment It All Model (SIAM), a 3D whole-head segmentation framework for 16 anatomical structures, trained using only six high-quality, manually annotated templates.

cs.CV
#180
Multimodal 2026-05-04 arXiv cs.CV 3.8 3.7/4.7/3.0

Predicting microsatellite instability (MSI) status from routine hematoxylin and eosin (H&E) whole slide images (WSIs) offers a practical alternative to molecular testing, but models trained at one institution tend to generalize poorly to slides acquired at a different site. Foundation model representations, despite their generality, still encode site-specific texture alongside the conserved biological morphology underlying MSI. We investigate whether tile-level spatial priors derived from known MSI histology can guide these representations toward more site-invariant features.

eess.IV cs.CV
#181
Research 2026-05-04 arXiv stat.ML 3.8 3.7/4.6/3.0

We study the problem of black-box optimization of a function f of any dimension, given function evaluations perturbed by noise. The function is assumed to be locally smooth around one of its global optima, but this smoothness is unknown. Our contribution is an adaptive optimization algorithm, POO or parallel optimistic optimization, that is able to deal with this setting.

stat.ML cs.LG
#182
Reinforcement Learning 2026-05-04 arXiv stat.ML 3.8 3.7/4.7/3.0

In 2011, Judea Pearl received the Turing Award, considered the Nobel Prize in Computing, for fundamental contributions to artificial intelligence through the development of a calculus for probabilistic and causal reasoning. It includes pioneering the development of causal discovery algorithms. These computer algorithms can analyze large multivariate datasets and automatically discover the causal relationships among the constituent variables.

cs.AI cs.CE cs.LG stat.ML
#183
Research 2026-04-30 Hugging Face Daily Papers 3.8 3.7/4.6/3.0

Retrieval-augmented generation (RAG) enhances large language models with external knowledge, and tree-based RAG organizes documents into hierarchical indexes to support queries at multiple granularities. However, existing Tree-RAG methods designed for single-document retrieval face critical challenges in scaling to cross-document multi-hop questions: (1) poor distribution adaptability, where k-means clustering introduces noise due to rigid distribution assumptions; (2) structural isolation, as tree indexes lack explicit cross-document connections; and (3) coarse abstraction, which obscures fine-grained details. To address these limitations, we propose Ψ-RAG, a tree-RAG framework with two key…

#184
Research 2026-05-04 arXiv cs.LG 3.7 4.0/4.0/3.0

Self-supervised speech models (S3Ms) achieve strong downstream performance, yet their learned representations remain poorly understood under natural and adversarial perturbations. Prior studies rely on representation similarity or global dimensionality, offering limited visibility into local geometric changes. We ask: how do perturbations deform local geometry, and do these shifts track downstream automatic speech recognition (ASR) degradation?

eess.AS cs.CR cs.LG
#185
Frontier LLMs 2026-05-04 arXiv cs.LG 3.7 4.0/4.0/3.0

Reproducing an empirical NLP study used to take weeks. Given the released data and a modern agentic-research harness, we redo every experiment of a recent ACL 2026 study on personal-style post-editing of LLM drafts -- and add three new ones -- with the human investigator acting only as a reviewer-in-the-loop. We reproduce all seven preregistered hypotheses and recover the paper's headline correlation between perceived self-similarity and embedding-measured self-similarity to three decimal places ($r = +0.244$, $p < 10^{-8}$, $n = 648$).

cs.CL cs.LG
#186
Frontier LLMs 2026-05-04 arXiv cs.CL 3.7 4.0/4.0/3.0

Text-to-SQL over large analytical databases requires navigating complex schemas, resolving ambiguous queries, and grounding decisions in actual data. Most current systems follow a fixed pipeline where schema elements are retrieved once upfront and the database is only revisited for post-hoc repair, limiting recovery from early mistakes. We present FlexSQL, a text-to-SQL agent whose core design principle is flexible database interaction: the agent can explore schema structure, inspect data values, and run verification queries at any point during reasoning.

cs.CL
#187
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.7 4.0/4.0/3.0

SemEval-2026 Task 10 is focused on conspiracy detection. Specifically, the goal is to detect whether a Reddit comment expresses a conspiracy belief. Our submitted mdok-style system utilizes data augmentation and self-training (to cope with a rather small amount of training data) to finetune the Qwen3-32B model for a binary text-classification task. The submitted system is very competitive, ranking in the 85th percentile (8th out of 52 submissions). The results show that our approach, which originated in machine-generated text detection, can be used for conspiracy detection as well.

cs.CL cs.AI
#188
Generative Media 2026-05-04 arXiv cs.CV 3.7 4.0/4.0/3.0

We present Mamoda2.5, a unified AR-Diffusion framework that seamlessly integrates multimodal understanding and generation within a single architecture. To efficiently enhance the model's generation capability, we equip the Diffusion Transformer backbone with a fine-grained Mixture-of-Experts (MoE) design (128 experts, Top-8 routing), yielding a 25B-parameter model that activates only 3B parameters, significantly reducing training costs while scaling up the model capacity. Mamoda2.5 achieves top-tier generation performance on VBench 2.0 and sets a new record in video editing quality, surpassing evaluated open-source models and matching the performance of current top-tier proprietary models, i…

cs.CV
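The "128 experts, Top-8 routing" figure above means each token is processed by only 8 of the 128 experts, which is how a 25B-parameter model can activate roughly 3B parameters per token. The sketch below shows a generic top-k gate under stated assumptions: the router logits are random placeholders, and softmax normalization over the selected experts is one common convention, not necessarily the one Mamoda2.5 uses.

```python
import math
import random

def top_k_route(logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their gate weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

rng = random.Random(0)
# Placeholder router scores for one token over 128 experts (hypothetical values).
router_logits = [rng.gauss(0.0, 1.0) for _ in range(128)]

experts, weights = top_k_route(router_logits, k=8)
print("selected experts:", experts)
print("gate weights sum to:", round(sum(weights), 6))
```

The token's output would then be the weighted sum of the 8 selected experts' outputs; the other 120 experts are never evaluated for that token.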
#189
Research 2026-04-30 Hugging Face Daily Papers 3.7 4.0/4.0/3.0

This report documents the preparedness assessment of Code World Model (CWM), a model for code generation and reasoning about code from Meta. We conducted pre-release testing across domains identified in our Frontier AI Framework as potentially presenting catastrophic risks, and also evaluated the model's misaligned propensities. Our assessment found that CWM does not pose additional frontier risks beyond those present in the current AI ecosystem. We therefore release it as an open-weight model.

#191
Industry 2026-05-04 TechCrunch — AI 3.7 4.0/4.0/3.0

DoorDash on Monday added new AI-powered tools that let merchants speed up onboarding, edit photos to make dishes look better, and create websites based on their app listings. The onboarding tool works similarly to the one Amazon launched in 2024. Merchants can point the tool to their website, from which it will automatically fetch information such as photos, store hours, and menu items to create a listing on the app. Merchants can review and edit all of this information before publishing the listing.

#192
Frontier LLMs 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length $γ$, which determines how many tokens the draft model proposes per step. Nearly all existing systems use a fixed $γ$ (typically 4), yet empirical evidence suggests that the optimal value varies across task types and, crucially, depends on the compression level applied to the target model.
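To make the role of the speculation length concrete, here is a minimal sketch of one greedy speculative-decoding step. It assumes both models are plain functions from a token prefix to the next token (`draft` and `target` are hypothetical toy stand-ins), and it verifies tokens one at a time for readability, whereas real systems verify all proposals in a single parallel target pass. It is not the method of the paper above.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]

def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, gamma: int) -> List[int]:
    """Return the tokens committed by one greedy speculative step."""
    # 1. The draft model proposes gamma tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(gamma):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposal against its own greedy choice.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target adds one extra token for free.
    accepted.append(target(ctx))
    return accepted

# Toy models: the target counts upward mod 10; the draft errs right after a 3.
def target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10

def draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
```

In this toy setup a step commits between 1 and gamma + 1 tokens, so a larger gamma only pays off when the draft tends to agree with the target, which is consistent with the optimal value depending on the task and on how the target model has been compressed.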

cs.LG cs.AI cs.CL cs.DC
#193
AI for Science 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

Composite materials exhibit strongly hierarchical and anisotropic properties governed by coupled mechanisms spanning constituents, plies, laminates, structures, and manufacturing history. This intrinsic complexity makes predictive modeling of composites expensive, because repeated experiments and high-fidelity simulations are needed to cover large design spaces of material, structure, and manufacturing. Multi-fidelity surrogate modeling addresses this challenge by combining abundant, less expensive data with limited high-accuracy data to recover reliable high-fidelity predictions.

physics.comp-ph cs.LG
#194
Robotic Autonomy 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

Despite significant advances in Reinforcement Learning (RL), model performance remains highly sensitive to algorithm and hyperparameter configurations, while generalization gaps across environments complicate real-world deployment. Although prior work has studied RL generalization, the relative contribution of specific configurations to the generalization gap has not been quantitatively decomposed and systematically leveraged for configuration selection. To address this limitation, we propose an explainable framework that evaluates RL performance across robotic environments using SHapley Additive exPlanations (SHAP) to quantify configuration impacts.
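SHAP attributions are grounded in Shapley values, which can be computed exactly when the number of configuration factors is small. The sketch below is illustrative only: the factor names and the additive toy `value` function are hypothetical, and the paper's actual pipeline is not reproduced here.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley value of each player for a set function `value`."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that exclude p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result

# Hypothetical effect of each configuration choice on a generalization gap.
effects = {"learning_rate": 2.0, "discount": 1.0, "batch_size": 0.5}

def value(coalition):
    return sum(effects[name] for name in coalition)

attributions = shapley_values(list(effects), value)
```

For an additive value function like this one, each factor's Shapley value equals its own effect, which makes the sketch easy to sanity-check. Exact enumeration grows exponentially with the number of factors, which is why approximate estimators are used in practice.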

cs.LG cs.AI cs.RO
#195
Research 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

Retraction-free approaches offer attractive low-cost alternatives to Riemannian methods on the Stiefel manifold, but they are often first-order, which may limit the efficiency under high-accuracy requirements. To this end, we propose a second-order method landing on the Stiefel manifold without invoking retractions, which is proved to enjoy local quadratic (or superlinear for its inexact variant) convergence. The update consists of the sum of (i) a component tangent to the level set of the constraint-defining function that aims to reduce the objective and (ii) a component normal to the same level set that reduces the infeasibility.

math.OC cs.AI cs.LG math.NA
#196
Research 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

We consider the infinite-width limit of a fully connected deep neural network with general weights, and we prove quantitative general bounds on the $2$-Wasserstein distance between the network and its infinite-width Gaussian limit, under appropriate regularity assumptions on the activation function. Our main tool is a Lindeberg principle for Deep Neural Networks, which we use to successively replace the weights on each layer by Gaussian random variables.

math.PR cs.LG stat.ML
#197
Research 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

We propose a robust gradient estimator based on per-sample gradient clipping and analyze its properties both theoretically and empirically. We show that the resulting method, per-sample clipped SGD (PS-Clip-SGD), achieves optimal in-expectation convergence rates for non-convex optimization problems under heavy-tailed gradient noise. Moreover, we establish high-probability convergence guarantees that match the in-expectation rates up to polylogarithmic factors in the failure probability.
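The core idea of per-sample clipping can be sketched in a few lines: each sample's gradient is rescaled to have norm at most a threshold before averaging, so a single heavy-tailed sample cannot dominate the update. The function names and the fixed threshold `c` are assumptions for illustration; the paper's exact estimator and threshold schedule may differ.

```python
import math

def clip(grad, c):
    """Rescale grad so that its Euclidean norm is at most c."""
    norm = math.sqrt(sum(x * x for x in grad))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [x * scale for x in grad]

def per_sample_clipped_mean(grads, c):
    """Average of individually clipped per-sample gradients."""
    clipped = [clip(g, c) for g in grads]
    n = len(clipped)
    return [sum(col) / n for col in zip(*clipped)]

def sgd_step(params, grads, c, lr):
    """One SGD step using the clipped-mean gradient estimate."""
    g = per_sample_clipped_mean(grads, c)
    return [p - lr * gi for p, gi in zip(params, g)]

# One outlier gradient (norm 5) and one ordinary gradient (norm 0.5).
estimate = per_sample_clipped_mean([[3.0, 4.0], [0.3, 0.4]], c=1.0)
```

Clipping each sample before averaging differs from clipping the averaged mini-batch gradient: here the outlier is shrunk to norm 1 while the ordinary sample passes through unchanged.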

math.OC cs.LG stat.ML
#198
Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

Across many scientific disciplines, multiple observations are collected from the same experimental units, and in modern datasets these observations often arise as non-Euclidean random objects. In such settings, the incorporation of random effects is a critical modeling step for efficient estimation and personalized prediction. Although mixed-effects models are well established for scalar outcomes and, more recently, for functional data in Hilbert spaces, general random-effects frameworks for objects in metric spaces remain underdeveloped.

stat.ML cs.LG stat.ME
#199
Research 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

This paper introduces an extension of generalised filtering for online applications. Generalised filtering refers to data assimilation schemes that jointly infer latent states, learn unknown model parameters, and estimate uncertainty in an integrated framework -- e.g., estimate state and observation noise -- at the same time (i.e., triple estimation). This framework appears across disciplines under different names, including variational Kalman-Bucy filtering in engineering, generalised predictive coding in neuroscience, and Dynamic Expectation Maximisation (DEM) in time-series analysis.

stat.ML cs.LG q-bio.NC
#200
Research 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

We consider selective classification with abstention in the fixed-pool (or transductive) setting, where the unlabeled pool is given beforehand and only a subset of points can be queried for labels. Our main insight is to view selective prediction through agreement: given queried labels and Lipschitz margin constraints in an embedding space, the version space of Lipschitz-consistent classification heads is well defined. We obtain upper and lower Lipschitz margin bounds that define, for each pool point, a set of certified valid labels containing the prediction of every head in the version space.

cs.LG
#201
Evaluations & Benchmarks 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

Transformer-based models achieve state-of-the-art dependency parsing for high-resource languages, yet their advantage over simpler architectures in low-resource settings remains poorly understood. We evaluate four parsers -- the Biaffine LSTM, Stack-Pointer Network, AfroXLMR-large, and RemBERT -- across ten typologically diverse languages, with a focus on low-resource African languages. We find that the Biaffine LSTM consistently outperforms transformer models in low-resource regimes, with transformers recovering their advantage as training data increases.

cs.CL cs.AI cs.LG
#202
Research 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

Fourier Neural Operators are deep learning models that learn mappings between function spaces and can be used to learn and solve partial differential equations (PDEs), in some cases significantly faster than traditional PDE solvers. Within the model are Fourier layers, which apply linear transformations directly to the Fourier modes, with parameters depending on the wave numbers. However, most physical systems are isotropic, with the results being independent of the coordinate system chosen, but the linear transformations do not necessarily respect these symmetries.

cs.LG
#203
Agents & Tool Use 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

Large language models (LLMs) have shown promise as interactive agents that solve tasks through extended sequences of environment interactions. While prior work has primarily focused on system-level optimizations or algorithmic improvements, the role of task horizon length in shaping training dynamics remains poorly understood. In this work, we present a systematic empirical study that examines horizon length through controlled task constructions.

cs.AI cs.LG
#204
Efficiency 2026-05-04 arXiv cs.LG 3.6 3.7/4.0/3.0

Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attacks. We study label flip attacks against log-linear DPO. We first illustrate that flipping one preference label induces a parameter-independent shift in the DPO gradient.
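The parameter-independent shift can be checked numerically under the standard DPO loss with a log-linear policy. The sketch assumes the per-pair loss is -log sigmoid(beta * (theta . (phi_w - phi_l) - ref_margin)), where `ref_margin` is the reference model's log-ratio for the pair; the feature vectors and names are hypothetical, and this is a sanity check of the claim rather than the paper's analysis.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_grad(theta, phi_w, phi_l, ref_margin, beta):
    """Gradient w.r.t. theta of -log sigmoid(beta * (theta . (phi_w - phi_l) - ref_margin))."""
    d = [a - b for a, b in zip(phi_w, phi_l)]
    z = beta * (sum(t * x for t, x in zip(theta, d)) - ref_margin)
    coef = -beta * sigmoid(-z)
    return [coef * x for x in d]

def flip_shift(theta, phi_w, phi_l, ref_margin, beta):
    """Gradient after flipping the label minus the gradient before flipping."""
    clean = dpo_grad(theta, phi_w, phi_l, ref_margin, beta)
    # Flipping swaps winner and loser, which also negates the reference margin.
    flipped = dpo_grad(theta, phi_l, phi_w, -ref_margin, beta)
    return [f - c for f, c in zip(flipped, clean)]

phi_w, phi_l = [1.0, 0.5], [0.2, -0.3]
shift_a = flip_shift([0.0, 0.0], phi_w, phi_l, ref_margin=0.1, beta=0.5)
shift_b = flip_shift([3.0, -2.0], phi_w, phi_l, ref_margin=0.1, beta=0.5)
```

Under these assumptions both shifts equal beta * (phi_w - phi_l) regardless of theta, because the two sigmoid terms sigmoid(z) and sigmoid(-z) sum to one.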

cs.LG cs.AI stat.ML
#205
Evaluations & Benchmarks 2026-05-04 arXiv cs.CL 3.6 3.7/4.0/3.0

Modern fuzzers increasingly use Large Language Models (LLMs) to generate structured inputs, but LLM-driven fuzzing is sensitive to prompt initialization and sampling variance, which can reduce exploration efficiency and lead to redundant inputs. We present FunFuzz, a multi-island evolutionary fuzzing framework that runs several isolated searches in parallel and periodically migrates high-value candidates to maintain diversity. FunFuzz derives initial generation prompts from documentation and initializes islands with topic-specific instructions, then continuously adapts prompts using feedback-guided selection.

cs.CR cs.CL
#206
Multimodal 2026-05-04 arXiv cs.CL 3.6 3.7/4.0/3.0

Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images.

cs.CV cs.CL
#207
Frontier LLMs 2026-05-04 arXiv cs.CL 3.6 3.7/4.0/3.0

Understanding how online narratives travel through coalitions is critical for identifying information disorder, yet computational analyses often rely on conservative network constructions that erase initially sparse but salient signals. This paper proposes a novel multi-layer framework that captures low-frequency signals of emerging information disorder allowing for locating where online discourse is reframed and amplified over time. The use case is 14 years of Italian discourse on X regarding the Human Papillomavirus (HPV) vaccine across three pivotal epochs (2010-2024).

cs.CL cs.CY cs.NI
#208
Frontier LLMs 2026-05-04 arXiv cs.CL 3.6 3.7/4.0/3.0

Semantic Role Labeling (SRL) provides an explicit representation of predicate-argument structure, capturing linguistically grounded relations such as who did what to whom. While recent NLP progress has been dominated by large language models (LLMs), these systems often rely on implicit semantic representations, often lacking explicit structural constraints and systematic explanatory mechanisms. Traditionally, SRL systems have often relied on AllenNLP; however, the framework entered maintenance mode in December 2022, limiting compatibility with evolving encoder architectures and modern inference requirements.

cs.CL
#209
Frontier LLMs 2026-05-04 arXiv cs.CL 3.6 3.7/4.0/3.0

Information disorder is a challenging phenomenon that affects society at large. This phenomenon entails the diffusion of misleading, misinforming, and hateful content online. In different contexts, one aspect of the problem may prevail, but overall, this is a broad problem that requires comprehensive solutions.

cs.CL
#210
Frontier LLMs 2026-05-04 arXiv cs.CL 3.6 3.7/4.0/3.0

Reflective thinking is a key competency in education, but assessing reflective writing remains a time-consuming and subjective task for education experts. While automated reflective analysis has been explored in several languages, Hungarian has not been researched extensively. In this paper, we present the first comprehensive study on automatic reflection level classification in Hungarian student essays.

cs.CL cs.AI
#211
Frontier LLMs 2026-05-04 arXiv cs.CL 3.6 3.7/4.0/3.0

Against the backdrop of rapid advances in artificial intelligence, legal argument mining has emerged as an important research area linking legal texts with intelligent analysis, carrying significant theoretical and practical implications. Existing studies have primarily developed along three dimensions: data, technology, and theory. At the data level, raw legal texts and annotated corpora constitute the foundational resources.

cs.CL
#212
Evaluations & Benchmarks 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

Blind face restoration is highly ill-posed under severe degradation, where identity-critical details may be missing from the degraded input. Same-identity references reduce this ambiguity, but mismatched pose, expression, illumination, age, makeup, or local facial states can lead to overuse of reference appearance. We propose IConFace, a unified reference-aware and no-reference framework with identity-structure asymmetric conditioning.

cs.CV cs.AI
#213
Research 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

This paper compares agency in humans with potential agency in AI programs. Human agency takes many years to develop, as the frontal lobe is activated. Early attempts to endow LLMs with agency have met serious obstacles. Progress requires a new architecture where actions and plans are formulated jointly with the human actors in each real-world setting.

cs.AI
#214
Research 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

SHACL (Shapes Constraint Language) expresses constraints on RDF data by means of so-called shapes. Its central service is validation: verifying whether a data graph complies with a SHACL document. But so far, there are no static analysis services to compare documents.

cs.LO cs.AI
#215
Generative Media 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

Diffusion models have recently demonstrated strong performance for image restoration tasks, including super-resolution. However, their large model size and iterative sampling procedures make them computationally expensive for practical deployment. In this work, we present TOC-SR, a framework for building efficient one-step super-resolution models by first discovering a compact diffusion backbone.

cs.CV cs.AI
#216
Agents & Tool Use 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

The promise of Large Language Models in automated software engineering is often measured by functional correctness, overlooking the critical issue of long term maintainability. This paper presents a systematic audit of technical debt in AI-generated software, revealing that AI does not eliminate flaws but rather introduces a distinct machine signature of defects. Our multi-scale analysis, spanning single-file algorithmic tasks and complex, agent generated systems, identifies a fundamental Reasoning-Complexity Trade-off: as models become more capable, they generate increasingly bloated and coupled code.

cs.SE cs.AI
#217
Agents & Tool Use 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

Authorizing Large Language Model (LLM)-driven agents to dynamically invoke tools and access protected resources introduces significant security risks, and the risks grow dramatically as agents engage in multi-turn conversations and scale toward distributed collaboration. A compromised or malicious agentic application can tamper with tool calls, falsify results, or request permissions beyond the scope of the subject's intended tasks, which could go unnoticed with current delegated authorization flows given their lack of visibility into the original subject's intent. In light of this, we make the following contributions towards Continuous Agent Semantic Authorization (CASA).

cs.AI
#218
Research 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

Shortcut learning causes deep learning models to rely on non-essential features within the data. However, its formation in deep neural network training still lacks theoretical understanding. In this paper, we provide a formal definition of core and shortcut features and employ evolutionary game theory to analyze the origins of shortcut bias by modeling data samples as players and their corresponding neural tangent features as strategies, assuming the existence of core and shortcut subnetworks.

cs.AI
#219
Research 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

As artificial intelligence (AI), including machine learning (ML) models and foundation models (FMs), is increasingly deployed in high-stakes domains, ensuring their trustworthiness has become a central challenge. However, the core trustworthy AI objectives, such as fairness, robustness, privacy, and explainability, are hard to achieve simultaneously, especially while preserving utility. This position paper argues that causality is necessary to understand and balance trade-offs in performance and multiple objectives of trustworthy AI.

cs.AI
#220
Multimodal 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

Cross-view Referring Multi-Object Tracking (CRMOT) aims to track multiple objects specified by natural language across multiple camera views, with globally consistent identities. Despite recent progress, existing methods rely heavily on costly frame-level spatial annotations and cross-view identity supervision. To reduce such reliance, we explore CRMOT under weak supervision by leveraging the capabilities of foundation models.

cs.CV cs.AI
#221
Research 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

Capturing semantic consistency among nodes is crucial for effective graph representation learning. Existing approaches typically rely on $k$-nearest neighbors ($k$NN) or other node-level full search algorithms (FSA) to mine semantic relationships via exhaustive pairwise similarity computation, which suffer from high computational complexity and rigid neighbor selection, limiting scalability and introducing noisy connections. In this paper, we propose the Semantic Consistency enhanced Graph Neural Network (SCGNN), a novel plug-and-play framework that leverages granular-ball computing (GBC) to efficiently capture semantic consistency in a scalable manner.

cs.AI
#222
Multimodal 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

Artificial intelligence (AI) is becoming a clinical tool for prostate pathology, but generalization across variations in sample preparation and preservation over prolonged time periods remains poorly understood. We evaluated GleasonAI, an end-to-end attention-based multiple instance learning model, on an independent validation cohort comprising 10,366 biopsy cores from 1,028 patients across 14 Swedish regions, using archival diagnostic specimens from the ProMort cohorts collected between 1998-2015. The model achieved an overall quadratic-weighted kappa of 0.86 for core-level ISUP grading, comparable to several experienced pathologists and consistent across geographic regions.
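For readers unfamiliar with the agreement metric quoted above, quadratic-weighted kappa penalizes disagreements by the squared distance between grades and compares the observed penalty with what chance agreement would give. The sketch below is a plain implementation of the standard formula; the function name and the `num_classes` parameter are illustrative, and it does not reproduce the study's evaluation code.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """Cohen's kappa with quadratic weights for integer grades 0..num_classes-1."""
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_classes))
              for j in range(num_classes)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2 / (num_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / n
    # Note: den is zero if both raters assign a single identical grade throughout.
    return 1.0 - num / den

perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], num_classes=4)
reversed_grades = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], num_classes=3)
```

A value of 1 means perfect agreement and 0 means chance-level agreement, so a score of 0.86 indicates that most disagreements are between adjacent grades rather than distant ones.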

cs.CV cs.AI
#223
Frontier LLMs 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

Automated planning traditionally assumes that all aspects of a planning task (initial state, goals, and available actions) are fully specified in advance, an approach well-suited to domains with fixed rules and deterministic execution. However, real-world planning often requires flexibility, allowing for deviations from the original task parameters in response to unforeseen circumstances or to improve outcomes. This paper surveys existing works on counterfactual reasoning in automated planning, categorizing them by what elements are changed, when the reasoning is triggered, and why and how these changes are made.

cs.AI
#224
Robotic Autonomy 2026-05-04 arXiv cs.AI 3.6 3.7/4.0/3.0

While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains a challenge due to their lack of explicit physical grounding and inability to perform adaptive control. To bridge this gap, we propose CoRAL (Contact-Rich Adaptive LLM-based control), a modular framework that enables zero-shot planning by decoupling high-level reasoning from low-level control. Unlike black-box policies, CoRAL uses LLMs not as direct controllers, but as cost designers that synthesize context-aware objective functions for a sampling-based motion planner (MPPI).

cs.RO cs.AI
#225
Robotic Autonomy 2026-05-04 arXiv cs.RO 3.6 3.7/4.0/3.0

Long-term autonomy requires robust navigation in environments subject to dynamic and static changes, as well as adverse weather conditions. Teach-and-Repeat (T&R) navigation offers a reliable and cost-effective solution by avoiding the need for consistent global mapping; however, existing T&R systems lack a systematic solution to tackle various environmental variations such as weather degradation, ephemeral dynamics, and structural changes. This work proposes LTR$^2$, the first cross-modal, cross-platform LiDAR-Teach-and-Radar-Repeat system that systematically addresses these challenges.

cs.RO
#226
Robotic Autonomy 2026-05-04 arXiv cs.RO 3.6 3.7/4.0/3.0

Despite the advancement in robotic grasping and dexterity through haptic information, affective social touch, such as handshaking or reassuring stroking, remains a major challenge in Human-Robot Interaction. This position paper examines current progress and limitations across artificial intelligence, haptics and robotics research, and proposes a novel multi-model architecture to address these gaps. Drawing inspiration from neurobiology, we decompose affective touch into distinct, specialized subtask models.

cs.HC cs.RO
#228
Robotic Autonomy 2026-05-04 arXiv cs.RO 3.6 3.7/4.0/3.0

This paper investigates singular configurations of planar 3-RPR parallel manipulators, which result from applying the averaging technique to solution pairs of their direct kinematic problem. Without computing the zeros of the corresponding degree 6 polynomial we parametrize the input pairs and determine their relative orientation in a way that the flexion order of the averaged configurations increases. Moreover, the obtained results are visualized for concrete examples.

cs.RO math.AG
#229
Robotic Autonomy 2026-05-04 arXiv cs.RO 3.6 3.7/4.0/3.0

Shared autonomy (SA) enables robots to infer human intent and assist in its achievement. While most research focuses on improving intent inference, it overlooks whether humans can understand the robot's intent in return. Without such mutual understanding, collaboration becomes less effective, degrading user experience and task performance.

cs.RO cs.HC
#230
Robotic Autonomy 2026-05-04 arXiv cs.RO 3.6 3.7/4.0/3.0

This paper presents a novel model predictive control (MPC) approach for autonomous pick-and-place between moving platforms with a hook-equipped aerial manipulator. First, for accurate and rapid modeling of the complex dynamics, a digital twin model of the quadcopter equipped with a hook-based gripper, implemented in MuJoCo, is constructed and used as the predictive model for the MPC. To handle uncertainties of the predictive model (e.g.

cs.RO eess.SY
#231
Evaluations & Benchmarks 2026-05-04 arXiv cs.RO 3.6 3.7/4.0/3.0

Bayesian filtering is a cornerstone of state estimation in complex systems such as aerospace systems, yet exact solutions are available only for linear Gaussian models. In practice, nonlinear systems are handled through tractable approximations, with Gaussian filters such as the extended and unscented Kalman filters being among the most widely used methods. This tutorial revisits Gaussian filtering from an information-geometric perspective, viewing the prediction and measurement update steps as inference procedures over state distributions.
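The prediction and measurement update steps are easiest to see in the scalar linear Gaussian case, where the Kalman filter is exact. The sketch below assumes a one-dimensional model x' = a*x + noise(q), z = h*x + noise(r); the parameter names are illustrative and the tutorial's information-geometric formulation is not reproduced.

```python
def predict(mean, var, a, q):
    """Propagate a Gaussian belief through x' = a * x + process noise (variance q)."""
    return a * mean, a * a * var + q

def update(mean, var, z, h, r):
    """Condition the belief on a measurement z = h * x + noise (variance r)."""
    innovation_var = h * h * var + r
    gain = var * h / innovation_var          # Kalman gain
    new_mean = mean + gain * (z - h * mean)  # shift toward the measurement
    new_var = (1.0 - gain * h) * var         # uncertainty shrinks
    return new_mean, new_var

# Start with belief N(0, 1), predict with a static model, then observe z = 2.
mean, var = predict(0.0, 1.0, a=1.0, q=0.0)
mean, var = update(mean, var, z=2.0, h=1.0, r=1.0)
```

With equal prior and measurement variances the posterior mean lands halfway between the prior mean and the observation. Extended and unscented variants keep this two-step structure but approximate the nonlinear transition and measurement functions.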

cs.RO eess.SY
#232
Multimodal 2026-05-04 arXiv cs.CV 3.6 3.7/4.0/3.0

Rural thematic road network construction aims to extract topological road structures from movement trajectory images of agricultural machinery. However, this task faces challenges where downsampling methods commonly used in existing studies tend to blur the sparse high-frequency road structures, and the heavy noise from dense field operations often leads to fragmented or redundant topologies in the extracted networks. To address these challenges, we propose LFINet, a Laplacian Frequency Interaction Network.

cs.CV
#233
Multimodal 2026-05-04 arXiv cs.CV 3.6 3.7/4.0/3.0

Traditional image quality assessment (IQA) methods rely on mean opinion scores (MOS), which are resource-intensive to collect and fail to provide interpretable, localized feedback on specific image distortions. We overcome these limitations by shifting from absolute quality prediction to a relational and directional assessment. Our approach utilizes a self-supervised synthetic distortion engine to generate training data, eliminating the need for manual annotation.

cs.CV
#234
Multimodal 2026-05-04 arXiv cs.CV 3.6 3.7/4.0/3.0

We propose a modular framework for hybrid image restoration that integrates transformer and state-space model (SSM) blocks with a focus on improving runtime efficiency on edge hardware. While transformers provide strong global modeling through self-attention, their attention kernels incur substantial latency on mobile devices, especially for high-resolution inputs. In contrast, SSMs such as Mamba offer linear-time sequence modeling with lower runtime overhead but may underperform on fine-grained restoration tasks.

cs.CV
#235
Research 2026-05-04 arXiv stat.ML 3.6 3.7/4.0/3.0

Middle-mile logistics describes the problem of routing parcels through a network of hubs linked by trucks with finite capacity. We rephrase this as a multi-object goal-conditioned MDP. Our method combines graph neural networks with model-free RL, extracting small feature graphs from the environment state.

stat.ML cs.LG
#236
Reinforcement Learning 2026-05-04 arXiv stat.ML 3.6 3.7/4.0/3.0

In this work, we formulate a new multi-task active learning setting in which the learner's goal is to solve multiple matrix completion problems simultaneously. At each round, the learner chooses which matrix to sample from and receives an entry of that matrix drawn uniformly at random. Our main practical motivation is market segmentation, where the matrices represent different regions with different customer preferences.

stat.ML cs.LG
Items
236
Multi-source
4
Long-form (≥7.5)
4
Sources OK / attempted
128 / 130
Top category
Research
54 items