
Wolf Digest — Saturday, June 20, 2026

Coverage window: 2026-06-19 03:29 ET to 2026-06-20 03:03 ET
#1 · Government & Defense
US Commerce Department forces Anthropic to pull Fable 5 and Mythos 5 under export-control directive
8.3 · 2 srcs
#2 · Efficiency
Subquadratic shares independent eval for SubQ, its content-sparse subquadratic-attention LLM
7.7 · 1 src
#3 · Efficiency
Shrinkage Bias in FP4 pretraining: a geometric account of E2M1's systematic error, and the UFP4 fix
7.6 · 3 srcs
#1
Government & Defense 2026-06-19 TechCrunch — AI, Stratechery 8.3 8.5/8.7/7.7

The biggest story of the week is regulatory rather than technical. As last week closed, the US Commerce Department sent Anthropic a letter invoking an export-control directive that bars non-Americans, including Anthropic's own non-US employees, from accessing Claude Fable 5 and Claude Mythos 5, the two most capable models the company had ever shipped. Anthropic had positioned Fable 5 as a new "Mythos-class" tier with capabilities it said exceed anything it had previously made generally available, and the Mythos model was tuned specifically for offensive and defensive cybersecurity work, including identifying software vulnerabilities. Within a day the models were switched off for the affected users.

The trigger was a report from Amazon. According to reporting, Amazon researchers contacted administration officials to share findings showing they could jailbreak Mythos and elicit portions of its vulnerability-finding behavior in ways the government deemed a national security concern. Officials say Anthropic was warned that Fable 5 had been jailbroken and that the company declined to pull or patch the model before the directive landed; Anthropic has characterized the specific jailbreak as not serious and has pointed out that the same classes of jailbreaks exist in competing frontier models, which complicates the argument that pulling these two models meaningfully closes the capability off.

The response from the security community has been notably split. A group of cybersecurity researchers signed an open letter calling the move dangerous, arguing that removing a strong defensive tool from legitimate users does more harm than good. Others have questioned the proportionality directly: one cybersecurity chief executive argued the government's response looks out of line with what the underlying research report actually contains, noting that the researchers surfaced vulnerabilities by asking the model the same questions a normal defender would ask, which is exactly the use the model was built for. There is also reporting that a foreign group had already accessed the model before the controls took effect, which raises the question of how much containment the directive actually achieves.

The practical significance is that this is one of the first times a US frontier model has been pulled from distribution by direct government action under export-control authority, rather than withheld voluntarily by the lab. It sets a reference point for how the existing export-control toolkit can be applied to model weights and API access, not just to chips and fabrication equipment, and it folds a single company's safety disclosures, a competitor's red-team report, and an interagency national-security judgment into one enforcement action. Several outlets also noted an unintended commercial effect: the ban has sharply raised the public profile of the two models. The numbers, as one podcast framed it, do not seem to care about the prohibition.

How it was discussed
  • TechCrunch framed the throughline across several pieces: export controls on software historically leak, from PGP to spyware, so containment of Mythos may prove similarly porous.
  • Stratechery slotted it into a recurring "Anthropic Again" beat, reading it as much as a brand and positioning event as a security one.
  • Signing cybersecurity researchers called the removal of a defensive tool dangerous; at least one security CEO said the reaction looked disproportionate to the report's contents.
  • Anthropic argued the cited jailbreak is not serious and that equivalent jailbreaks exist in rival models.
export controls cybersecurity policy Anthropic
#2
Efficiency 2026-06-19 MIT Technology Review — AI 7.7 8.3/7.4/7.4

Miami-based Subquadratic, which came out of stealth in early May with twenty-nine million dollars in seed funding and an unusually large claim, has begun publishing evidence for it. The company says its model, SubQ, is the first frontier-scale language model that abandons quadratic attention entirely in favor of an architecture whose compute and memory grow close to linearly with context length. The core mechanism it describes is Subquadratic Sparse Attention, a content-dependent sparse routing scheme that computes exact attention only over the tokens it judges relevant to each query rather than over the full sequence, which is where the asymptotic savings come from.
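The general idea of computing exact attention over only a selected subset of tokens can be sketched as follows. This is a minimal illustration, not SubQ's implementation: the top-k dot-product router, the function names, and the toy vectors are all assumptions, and scoring every key as done here is still linear work per query (quadratic overall), so a real subquadratic system would need a cheaper selection step.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_attention(query, keys, values, top_k):
    """Exact softmax attention restricted to the top_k highest-scoring keys."""
    scale = 1.0 / math.sqrt(len(query))
    # Hypothetical router: score every key and keep the top_k indices.
    scores = [dot(query, k) * scale for k in keys]
    selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Numerically stable softmax over the selected subset only.
    peak = max(scores[i] for i in selected)
    weights = [math.exp(scores[i] - peak) for i in selected]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, selected):
        for d in range(len(out)):
            out[d] += (w / total) * values[i][d]
    return out, sorted(selected)


# Toy data: three tokens with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
output, attended = sparse_attention([1.0, 0.0], keys, values, top_k=1)
```

With `top_k` equal to the sequence length the function reduces to ordinary dense attention, which is one way to check a sparse variant against a dense baseline on short inputs.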

The headline numbers are aggressive. SubQ ships with a usable context window in the millions of tokens, reported at twelve million in the preview, and the company claims it runs dramatically faster than a FlashAttention baseline at the one-million-token mark, on the order of tens of times faster, while costing roughly a fifth of what comparable frontier APIs charge for similar long-context workloads. The pitch is that this unlocks tasks that are awkward today, such as analyzing hundreds of documents or an entire code base in a single pass, while using far less energy per query. Subquadratic also claims SubQ more or less matches the quality of leading models from the major labs on standard tasks, which is the claim that matters most and the one that has drawn the most skepticism.

When the company first emerged the details were thin and many researchers were unconvinced, in part because the graveyard of subquadratic and linear-attention architectures that looked promising at small scale but failed to hold quality at the frontier is well populated. What changed this week is that Subquadratic released results from an independent evaluation of the system, and those results were strong enough that observers who had dismissed the original announcement said the claims may be worth a second look. The reporting is careful to note the evaluation is a step, not a verdict.

The open question is the one that decides whether SubQ becomes a new long-context default or joins the list of architectures that did not scale: what happens when independent researchers get hands-on access to weights or an API and can probe quality, not just throughput, across adversarial long-context tasks. Throughput and cost claims are comparatively easy to verify; faithful long-range retrieval and reasoning at twelve million tokens, without the quality cliffs that have sunk earlier efficient-attention schemes, is the part that has to survive outside scrutiny. For now this is the most concrete claim in a long while that the quadratic attention bottleneck might actually be breakable at scale, paired with the appropriate caveat that it has not yet been independently confirmed.

How it was discussed
  • MIT Technology Review's feature emphasized that Subquadratic is now "bringing the receipts" via an independent evaluation after a skeptical reception.
  • Trade coverage elsewhere stressed the still-unverified 1,000x efficiency framing and researchers' demand for hands-on weights before accepting frontier-parity claims.
sparse attention long context efficiency architecture
#3
Efficiency 2026-06-20 arXiv cs.AI (Artificial Intelligence), Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 7.6 8.1/7.9/6.8

FP4 pretraining is the current frontier of low-precision training, promising large reductions in memory and compute, and the hardware paths that matter, NVIDIA's Blackwell and Rubin-class systems and AMD's MI350 series, are all built around the E2M1 four-bit element. This paper identifies a fundamental flaw in that choice. Non-uniform formats like E2M1 have representable values whose spacing is geometrically asymmetric, and the authors show this asymmetry produces a systematic negative rounding error they call Shrinkage Bias. Crucially the bias is not zero-mean noise that averages out; it accumulates multiplicatively across layers and is further amplified by the Random Hadamard Transform that is commonly used to tame outliers before quantization, which means the very technique deployed to make FP4 stable is making this particular error worse.

Having traced the bias to its geometric origin and shown its systemic impact on training dynamics, the authors propose a corrected recipe they call UFP4 that removes or compensates for the shrinkage so that FP4 pretraining tracks higher-precision baselines more faithfully. The contribution is valuable because it is mechanistic rather than empirical patchwork: it explains why existing E2M1 recipes drift, predicts where the drift comes from, and offers a fix grounded in that explanation. For anyone planning large pretraining runs on current four-bit hardware, the result is directly actionable, and it lands with cross-source attention on Hugging Face Daily Papers.
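A toy measurement can make the idea of a signed rounding error on a non-uniform grid concrete. The sketch below is not the paper's analysis or the UFP4 recipe: it rounds Gaussian samples to the E2M1 magnitude grid with plain round-to-nearest and reports the mean change in magnitude. The single global scale, the sample count, and the function names are assumptions made for illustration.

```python
import random

# Non-negative magnitudes representable by an E2M1 element (sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_to_e2m1(x):
    """Round-to-nearest onto the E2M1 grid, saturating at the largest magnitude."""
    mag = min(abs(x), E2M1_GRID[-1])
    # Exact ties resolve to the smaller grid point here (a simplification).
    nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return nearest if x >= 0 else -nearest


def mean_magnitude_error(samples):
    """Mean of |q(x)| - |x| after scaling so the largest |x| maps to 6.0."""
    scale = E2M1_GRID[-1] / max(abs(s) for s in samples)
    errors = [abs(round_to_e2m1(s * scale)) - abs(s * scale) for s in samples]
    return sum(errors) / len(errors)


rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
bias = mean_magnitude_error(gaussian)
print(f"mean signed magnitude error: {bias:.5f}")
```

A negative value means magnitudes shrink on average. For bell-shaped inputs this follows from the density falling with magnitude inside each rounding interval, and the effect grows with the interval width, so the coarse upper steps of the grid contribute more per sample than the fine lower ones.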

How it was discussed
  • Featured on Hugging Face Daily Papers and AK's Daily Papers, with discussion centering on the implication that the Random Hadamard Transform amplifies rather than mitigates the bias.
FP4 quantization pretraining Blackwell
#4
Research 2026-06-19 Dwarkesh Patel Podcast 7.5 7.1/8.5/6.9

Dwarkesh Patel's essay reframes the recent trajectory of AI around a claim that cuts against the usual scaling narrative: most of the progress of the last few years has come from widening and improving the data distribution and pouring compute into generating better data, not from making models meaningfully more sample-efficient. If you define intelligence partly as sample efficiency (how little data a system needs to become fluent in a domain), then it is not obvious the field has improved much on that axis at all. The systems got better mainly because the data got better and there was more compute to manufacture it.

The piece develops reinforcement learning as the clearest case of this. RL on verifiable tasks is, in this framing, a synthetic-data engine: you spend compute against a verifier to find the good rollouts, then train the model to imitate those correct trajectories, much as next-token prediction trains it to imitate internet text. The catch is a bootstrapping constraint. For the search to work, the model must already assign some non-trivial prior probability to stumbling onto a correct solution, because if it essentially never succeeds, there is no positive signal for the verifier to amplify and nothing to learn from. That makes RL powerful for sharpening capabilities the base model can already occasionally express, and weak at instilling genuinely novel competencies from scratch.
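The bootstrapping constraint can be put in numbers with a small sketch. It assumes each rollout succeeds independently with a fixed probability, which is a simplification introduced here rather than something the essay states; the function names are hypothetical.

```python
def prob_any_success(p_success, num_rollouts):
    """Chance that at least one of num_rollouts independent attempts passes the verifier."""
    return 1.0 - (1.0 - p_success) ** num_rollouts


def expected_positives(p_success, num_rollouts):
    """Expected number of verified-correct rollouts available to imitate."""
    return p_success * num_rollouts


# A prior of 1% leaves plenty of signal after 1,000 rollouts; a prior of one in a
# billion leaves almost none for the same compute.
for prior in (1e-2, 1e-9):
    print(prior, prob_any_success(prior, 1_000), expected_positives(prior, 1_000))
```

Under these assumptions the number of usable positive trajectories scales linearly with the base model's prior success rate, which is why search against a verifier sharpens existing capabilities far more readily than it creates new ones.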

The reason this matters is that it relocates the real bottleneck. If progress has been riding on data-distribution improvements and compute-intensive synthetic-data generation rather than on better learning algorithms, then the field is more exposed to data limits than the smooth scaling story implies, and the harder, less-celebrated problem of actual sample efficiency remains comparatively untouched. It is a discourse-shaping argument from an influential voice rather than a benchmark result, and its value is in naming the load-bearing assumption that a lot of optimism quietly rests on.

sample efficiency RL scaling data
#5
Research 2026-06-20 arXiv cs.AI (Artificial Intelligence), Hugging Face Daily Papers 7.0 7.0/7.4/6.6

The paper argues that modeling the physical world requires more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so objects endure and events run to completion whether or not a camera is watching. Existing video world models and their benchmarks reward surface properties (fidelity, motion realism, and camera controllability) while never testing whether a generated world keeps evolving once it leaves view. The authors frame persistent state as the field's blind spot and propose evaluation that targets it directly, a useful corrective as world models are increasingly pitched as a route toward general intelligence.

How it was discussed
  • Picked up on Hugging Face Daily Papers; the memorable framing is that the moon should hold its orbit even when no one is looking.
world models evaluation
#6
Evaluations & Benchmarks 2026-06-20 arXiv cs.AI (Artificial Intelligence), Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.9 6.8/6.9/7.0

LiveCodeBench became a standard for contamination-aware code evaluation by continuously adding fresh competitive-programming problems and filtering by release date, but it has been Python-only. Multi-LCB extends the methodology to twelve languages to test whether coding ability generalizes across the diverse stacks real software engineering demands. The contribution is the cross-language contamination-aware harness itself; early results probe where models that look strong in Python degrade in less-represented languages.
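Release-date filtering, the core of the contamination-aware approach, can be sketched in a few lines. The `Problem` record, its fields, and the sample entries are hypothetical and are not taken from the LiveCodeBench or Multi-LCB code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    language: str
    released: date


def uncontaminated(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Made-up entries for illustration only.
pool = [
    Problem("p1", "python", date(2025, 1, 10)),
    Problem("p2", "rust", date(2026, 3, 2)),
    Problem("p3", "kotlin", date(2026, 5, 20)),
]
fresh = uncontaminated(pool, training_cutoff=date(2025, 12, 31))
```

Scoring a model only on `fresh` removes problems it could have seen during training, and grouping the surviving problems by `language` gives the per-language comparison the cross-language extension is after.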

How it was discussed
  • Featured on Hugging Face Daily Papers and AK's Daily Papers.
benchmark code LiveCodeBench
#7
Evaluations & Benchmarks 2026-06-20 arXiv cs.AI (Artificial Intelligence), Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.7 6.7/6.9/6.5

Agent benchmarks proliferate, yet no single one touches more than four or five of the dimensions deployment actually exposes. This work aggregates fourteen parallel implementation studies of one MCP-based industrial-agent benchmark, spanning new asset classes, alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes, then consolidates them with seven prior agent benchmarks. The argument is that aggregate-score leaderboards systematically underspecify deployed-agent performance, and the paper pushes toward predictive validity (whether a benchmark score forecasts real deployment behavior) as the better target.

How it was discussed
  • Featured on Hugging Face Daily Papers and AK's Daily Papers.
agents evaluation MCP
#8
Recurrent & Linear Attention 2026-06-20 arXiv cs.LG (Machine Learning), Hugging Face Daily Papers 6.6 6.7/6.7/6.4

Converting a pretrained Transformer into a hybrid linear-attention model is an attractive shortcut to cheap long-context inference, but it is brittle: naively copying the teacher's attention projections into a Gated DeltaNet student leaves the new recurrent decay, write, and output-gating dynamics unspecified, so the student starts in a poor dynamical regime and wastes much of distillation recovering. Taylor-Calibrate derives a principled initialization, matching the student's behavior to the teacher via a Taylor expansion, so the converted model begins in a sensible regime and distills more reliably. It targets a real pain point in the linear-attention conversion pipeline.

How it was discussed
  • Picked up on Hugging Face Daily Papers.
linear attention distillation DeltaNet
#9
Agents & Tool Use 2026-06-20 arXiv cs.AI (Artificial Intelligence), arXiv cs.CL (Computation & Language), Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.4 6.4/6.5/6.3

Policy-adherent tool-calling agents, the kind used in customer service, must track facts, identifiers, constraints, and conditions across turns while obeying domain policies. Standard agents leave this state implicit, scattering observations, tool returns, and policy text in the prompt and forcing the model to reconstruct the relevant state every step, which produces characteristic failure modes where the agent retrieves the right fact but applies the wrong constraint. LedgerAgent maintains the task state as an explicit, separately-represented structure, a ledger, so policy adherence becomes a lookup rather than a reconstruction. It is a clean argument that explicit state, not just bigger context, is what these agents need.
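The contrast between implicit and explicit state can be illustrated with a minimal ledger structure. This is a sketch of the general idea only; the class, its methods, and the refund policy are hypothetical and are not LedgerAgent's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Explicit task state: known facts plus named policy constraints over them."""
    facts: dict = field(default_factory=dict)
    constraints: dict = field(default_factory=dict)

    def record(self, key, value):
        # Called whenever a user turn or tool return establishes a fact.
        self.facts[key] = value

    def add_constraint(self, name, predicate):
        # predicate takes the facts dict and returns True when the policy is satisfied.
        self.constraints[name] = predicate

    def violations(self):
        # Policy checking becomes a lookup over the current facts.
        return [name for name, ok in self.constraints.items() if not ok(self.facts)]


ledger = Ledger()
ledger.record("fare_class", "basic")
ledger.add_constraint(
    "refund_requires_flex_fare",
    lambda f: not f.get("refund_requested") or f.get("fare_class") == "flex",
)
before = ledger.violations()
ledger.record("refund_requested", True)
after = ledger.violations()
```

Because facts and constraints live in one structure, the agent can check `violations()` before each tool call instead of re-deriving the applicable policy from the full prompt history.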

How it was discussed
  • Multi-listed across arXiv categories and Hugging Face Daily Papers.
agents tool use state
#10
Agents & Tool Use 2026-06-20 arXiv cs.AI (Artificial Intelligence), Hugging Face Daily Papers, AK (@_akhaliq) Daily Papers 6.4 6.5/6.4/6.3

Multi-step LLM pipelines fail through interactions among their retrieval, reasoning, and formatting stages, so prompt-only tuning often misses the real bottleneck. FAPO frames optimization as a coding-agent task: it has Claude Code operate inside a standardized codebase, evaluate the pipeline, inspect intermediate steps, attribute failures, propose scoped edits, and validate variants against a score function. It tries prompt edits first and only restructures the chain when attribution shows prompt changes are insufficient. The framing, autonomous agent as pipeline optimizer with structural-edit authority gated by failure attribution, is the interesting part.

How it was discussed
  • Featured on Hugging Face Daily Papers and AK's Daily Papers.
agents prompt optimization Claude Code
#11
Agents & Tool Use 2026-06-20 arXiv cs.CV (Computer Vision), Hugging Face Daily Papers 6.4 6.4/6.4/6.4

Spatial intelligence requires reasoning over a continuous, evolving 3D world, but VLMs and tool-augmented agents mostly do stateless inference from isolated frames. S-Agent recasts spatial reasoning as spatio-temporal evidence accumulation over continuous multi-view images and video, with the VLM acting as a semantic planner that invokes spatial tools to build a scene-centric rather than frame-centric understanding. The shift from per-frame recognition to accumulated scene state is the conceptual move, and it targets the kind of persistent spatial grounding that single-image VLMs cannot maintain.

How it was discussed
  • Picked up on Hugging Face Daily Papers.
spatial reasoning VLM agents
#12
Robotic Autonomy 2026-06-20 arXiv cs.RO (Robotics), arXiv cs.AI (Artificial Intelligence), Hugging Face Daily Papers 6.3 6.4/6.3/6.2

Agentic robot systems can write Code-as-Policy programs, observe feedback, and revise across attempts, but they stay task-driven: skills are acquired only after explicit instructions arrive. This work studies a play phase in which an embodied coding agent acquires reusable skills before any downstream task is specified. The proposed Robotics Agent Teams generate novel-yet-learnable exploratory tasks during play, plan and execute robot-code policies, verify intermediate progress, and diagnose failures, building a skill library through self-directed curiosity. It imports the open-ended-play idea into the coding-agent paradigm for robotics.

How it was discussed
  • Featured on Hugging Face Daily Papers.
robot learning embodied agents play
#13
Robotic Autonomy 2026-06-20 arXiv cs.RO (Robotics), arXiv cs.AI (Artificial Intelligence), Hugging Face Daily Papers 6.2 6.3/6.2/6.1

Dexterous real-world manipulation still leans heavily on human supervision and hand-tuned algorithm engineering, a central bottleneck for general physical intelligence, and while coding agents can automate algorithm search, their wins have largely stayed in digital environments. ENPIRE argues the missing abstraction is a repeatable real-world feedback loop (reset the scene, execute a policy, verify the outcome, refine the next iteration) and provides a harness that lets a coding agent run that loop on physical hardware. It is a concrete attempt to close the sim-to-real gap in automated robotics research by making real-world iteration itself programmable.

How it was discussed
  • Featured on Hugging Face Daily Papers.
robotics self-improvement coding agents
#14
Robotic Autonomy 2026-06-20 arXiv cs.RO (Robotics), Hugging Face Daily Papers 6.2 6.3/6.2/6.0

Embodied foundation models face a far tighter data bottleneck than language models. Teleoperated real-robot trajectories remain the dominant pretraining source because of precise action labels and embodiment alignment, but they are expensive, hard to collect, and low in diversity. HumanScale studies egocentric human video as a cheaper, more diverse alternative and reports that, at scale, it can outperform teleoperated robot data for embodied pretraining despite lacking direct action supervision. If the result holds, it reorders the data-collection priorities for the whole VLA/embodied pretraining stack.

How it was discussed
  • Picked up on Hugging Face Daily Papers.
embodied AI pretraining egocentric video
#15
Safety, Policy & Regulation 2026-06-19 Interconnects (Nathan Lambert) 6.2 6.2/6.6/5.8

Nathan Lambert published a co-authored op-ed, written for a general audience, on his own platform after mainstream outlets passed on it; it argues against restricting open-weight AI. The piece is pegged to active regulatory momentum in Washington, including a recently signed executive order to review AI models, and lays out the case that banning or heavily constraining open models would forfeit the research, security, and competitiveness benefits they provide. As an argument from a prominent open-model advocate it is a marker of where the open-versus-closed policy debate stands as legislative attention intensifies; it is positional commentary rather than new technical evidence.

open weights policy regulation
#16
Robotic Autonomy 2026-06-20 arXiv cs.RO (Robotics), Hugging Face Daily Papers 6.1 6.2/6.1/6.0

World-action models typically lean on video generation to bridge visual world modeling and control, but video prediction is costly, spends capacity on action-irrelevant appearance and temporal detail, and can accumulate long-horizon imagination errors that mislead the policy. ImageWAM asks whether full video generation is necessary and repurposes pretrained image-editing models instead, predicting a single edited future image rather than a video rollout. The reframing, which treats the world model as a one-step image editor rather than a video generator, cuts inference cost while keeping the action-relevant signal, and challenges a default assumption in the WAM literature.

How it was discussed
  • Picked up on Hugging Face Daily Papers.
world models robot control image editing
#17
Reinforcement Learning 2026-06-19 arXiv cs.LG (Machine Learning), Hugging Face Daily Papers 6.1 6.1/6.2/6.0

LLMs often fail when the answer hinges on a single decisive piece of evidence buried in a long context or a subtle image detail. ContextRL adds an auxiliary RL objective that supervises evidence selection rather than only the final answer: the model is shown a query, an answer, and two highly similar contexts, and rewarded for picking the one that actually supports the query-answer pair. This indirect signal encourages fine-grained grounding and improves long-horizon and multimodal reasoning. Rewarding the model for locating the right evidence, not just producing the right output, is a clean way to attack needle-in-haystack failures.
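The auxiliary evidence-selection signal can be sketched as a reward function. The binary reward, the additive combination with the answer reward, and the `aux_weight` value are assumptions for illustration; the summary above does not specify how ContextRL scores or weights the two terms.

```python
def evidence_selection_reward(chosen_index, supporting_index):
    """1.0 if the model picked the context that supports the query-answer pair, else 0.0."""
    return 1.0 if chosen_index == supporting_index else 0.0


def combined_reward(answer_correct, chosen_index, supporting_index, aux_weight=0.5):
    """Answer reward plus a weighted auxiliary term for selecting the right evidence."""
    aux = evidence_selection_reward(chosen_index, supporting_index)
    return float(answer_correct) + aux_weight * aux


# Two near-identical contexts; index 1 is the one that supports the pair.
right_for_right_reason = combined_reward(True, chosen_index=1, supporting_index=1)
right_for_wrong_reason = combined_reward(True, chosen_index=0, supporting_index=1)
```

Under this sketch a correct answer grounded in the wrong context earns less than one grounded in the right context, which is the pressure toward fine-grained grounding that the auxiliary objective is meant to supply.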

How it was discussed
  • Picked up on Hugging Face Daily Papers.
reinforcement learning grounding long context
#18
Multimodal 2026-06-20 arXiv cs.CV (Computer Vision) 6.1 6.1/6.1/6.0

PerceptionDLM applies a multimodal diffusion language model to dense region-level perception, using diffusion's parallel decoding to predict multiple region descriptions or groundings at once rather than autoregressively. The motivation is that perception tasks involving many regions map poorly onto left-to-right generation, and a diffusion LM's non-sequential decoding is a natural fit. It is part of a visible uptick this batch in diffusion-language-model work moving beyond pure text into perception and structured visual output.

diffusion LM perception multimodal
#19
Efficiency 2026-06-20 arXiv cs.LG (Machine Learning) 6.1 6.1/6.2/6.0

Common metrics for how faithfully a quantized model reproduces its full-precision parent measure the magnitude of weight or activation displacement, but this paper argues displacement magnitude is the wrong proxy: a small move in a harmful direction can hurt more than a large move in a benign one. It separates the size of quantization-induced change from its direction and shows that direction-aware fidelity measures correlate better with actual downstream degradation. The practical upshot is that quantization recipes tuned to minimize displacement norms may be optimizing the wrong objective.
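A two-parameter toy case shows why displacement size alone can mislead. It uses a second-order loss estimate with a diagonal Hessian, which is a standard approximation chosen here for illustration and not necessarily the paper's fidelity measure; the curvature values and displacement vectors are made up.

```python
import math


def loss_increase(displacement, curvature):
    """Second-order estimate 0.5 * sum(h_i * d_i^2) for a diagonal Hessian."""
    return 0.5 * sum(h * d * d for h, d in zip(curvature, displacement))


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


# One sharp direction and one nearly flat direction in the loss landscape.
curvature = [100.0, 0.01]
small_harmful = [0.1, 0.0]   # short move along the sharp direction
large_benign = [0.0, 1.0]    # ten times longer move along the flat direction

for name, d in (("small_harmful", small_harmful), ("large_benign", large_benign)):
    print(name, norm(d), loss_increase(d, curvature))
```

The shorter displacement produces the larger estimated loss increase, so a recipe that ranks quantization schemes by displacement norm would prefer the wrong one in this example.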

quantization metrics efficiency
#20
Multimodal 2026-06-19 arXiv cs.CV (Computer Vision), Hugging Face Daily Papers 6.0 6.0/6.1/5.9

VLMs can produce natural-language reasoning traces, but those traces leave the supporting image regions implicit, so they are hard to verify and hard to supervise. This work introduces visually grounded thinking, where the model interleaves language thoughts with explicit point or box groundings of the visual evidence used at each step, and builds a scalable pipeline to train the behavior. Tying each reasoning step to a concrete region makes the chain auditable and gives a supervision target that pure-language traces lack, a useful step toward verifiable multimodal reasoning.

How it was discussed
  • Picked up on Hugging Face Daily Papers.
VLM grounding reasoning
#21
Frontier LLMs 2026-06-20 arXiv cs.CL (Computation & Language) 6.0 6.0/6.1/5.9

A systematic empirical study of diffusion language models, the non-autoregressive alternative that generates text by iterative denoising rather than left-to-right decoding. The paper benchmarks where current diffusion LMs match or trail autoregressive baselines on quality, controllability, and inference cost, and isolates which design choices (noise schedule, decoding steps, and remasking strategy) drive the differences. As diffusion LMs attract renewed interest for parallel decoding and infilling, a careful experimental accounting of their real trade-offs is timely.

diffusion LM non-autoregressive
#22
Post-Training 2026-06-20 arXiv cs.CL (Computation & Language) 6.0 6.0/6.1/5.9

When DPO is applied sequentially, aligning a model on one preference dataset after another, the model forgets earlier preferences, but this work shows the forgetting is not uniform across capabilities. It characterizes which behaviors decay fastest under sequential DPO and why, distinguishing graceful drift from sharp collapse, and points toward scheduling and regularization that preserve earlier alignment. For teams that update preference models incrementally rather than retraining from scratch, the non-uniformity of forgetting is the actionable finding.

DPO preference optimization catastrophic forgetting
#23
Interpretability 2026-06-20 arXiv cs.LG (Machine Learning) 6.0 6.0/6.4/5.6

Building on the emergent-misalignment line of work, where narrow fine-tuning can induce broadly misaligned behavior, this paper looks for activation-space directions that both detect the onset of such misalignment and serve as a steering handle to mitigate it. The claim is that the relevant directions are actionable: you can monitor them to flag emerging misalignment and intervene along them to suppress it without full retraining. It is a concrete interpretability-for-safety result tying a diagnostic signal to a mechanistic intervention.

interpretability misalignment activation steering
#24
Government & Defense 2026-06-19 C4ISRNET 6.0 6.1/6.2/5.7

Japan's procurement agency has issued notices for a demonstration program aimed at rapidly fielding autonomous interceptor drones, with deployment near radar sites, bases, and vessels targeted for 2027. The systems must autonomously detect and counter swarms of long-range suicide UAVs of the Shahed-136 class and integrate with existing radars and command systems, and the requirement specifies craft already combat-proven against such drones by other forces. The move, framed against strain on US Tomahawk stockpiles after the Iran war, reflects the broader global shift toward autonomous counter-drone systems as a procurement priority.

autonomy counter-UAS defense
#25
Research 2026-06-20 arXiv cs.LG (Machine Learning), Hugging Face Daily Papers 6.0 6.0/6.1/5.9

A fine-grained empirical look at the geometry of training-data manifolds and how local structure (intrinsic dimension, curvature, and density) relates to what models learn and how they generalize. By examining the manifold at high resolution rather than through aggregate statistics, the paper connects properties of the data distribution to downstream behavior, contributing to the ongoing effort to ground scaling and generalization in measurable data geometry rather than parameter counts alone.

How it was discussed
  • Picked up on Hugging Face Daily Papers.
data manifold generalization
#26
Robotics 2026-06-20 arXiv cs.RO (Robotics) 6.0 6.0/6.0/6.0

CTS-MoE applies a mixture-of-experts policy to perceptive legged locomotion, letting different experts specialize implicitly to different terrain types so the controller adapts across surfaces without an explicit terrain classifier. The routing learns terrain-conditioned behavior end to end, improving robustness over single-policy baselines on varied ground. It is a tidy demonstration of conditional computation for embodied robustness, with MoE used as a control policy rather than a language backbone.

locomotion mixture-of-experts robotics
#27
AI Coding 2026-06-19 GitHub Blog — AI & ML 5.9 5.9/6.0/5.8

GitHub describes Qubot, an internal Copilot-powered agent that lets any employee ask natural-language questions over the company's data warehouse and get answers in seconds, attacking the perennial self-serve-analytics problem of choosing the right data model, grain, and filters before writing and validating a query. The engineering writeup covers how they grounded the agent in their semantic layer and guarded against wrong-but-confident answers. It is a concrete production case study of a text-to-analytics agent at scale rather than a research result.

agents analytics Copilot
#28
Post-Training 2026-06-20 arXiv cs.CL (Computation & Language) 5.9 5.9/6.0/5.8

Reward models in RLHF are point estimators whose errors get exploited during policy optimization, driving reward hacking and instability. This paper makes the reward model uncertainty-aware, propagating its confidence so the policy update discounts low-confidence reward signals and resists over-optimizing on regions where the reward model is unreliable. Treating reward as a distribution rather than a scalar is a principled lever against the reward-hacking failure mode that dogs RLHF pipelines.
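One common way to make a reward signal uncertainty-aware is a lower confidence bound over an ensemble of reward models. The sketch below shows that generic technique, not the paper's method, which the summary does not specify; the ensemble scores and the `penalty` coefficient are assumptions.

```python
import statistics


def conservative_reward(ensemble_scores, penalty=1.0):
    """Mean reward minus a penalty proportional to ensemble disagreement."""
    mean = statistics.fmean(ensemble_scores)
    spread = statistics.pstdev(ensemble_scores)
    return mean - penalty * spread


# Same mean reward, different agreement between the ensemble members.
agreed = conservative_reward([1.0, 1.0, 1.0])
disputed = conservative_reward([0.0, 2.0])
```

Both responses have a mean score of 1.0, but the one the ensemble disagrees on is discounted, so the policy has less incentive to push into regions where the reward model is unreliable.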

RLHF reward model uncertainty
#29
Generative Media 2026-06-20 arXiv cs.CV (Computer Vision) 5.9 5.9/6.1/5.7

NAMESAKES is a benchmark and method for measuring how much text-to-image models memorize specific real identities, that is, the degree to which prompting a name reliably reproduces a particular person's likeness. The work quantifies memorization across models and prompts, distinguishing genuine identity recall from generic same-name outputs, and surfaces the privacy and consent implications of generative image systems that have internalized individuals from their training data. It contributes a measurement tool to an area that has mostly been argued anecdotally.

text-to-image memorization privacy
#30
Interpretability 2026-06-20 arXiv cs.LG (Machine Learning) 5.9 5.9/6.0/5.8

An interpretability study of a diffusion-based Gemma variant, asking how legible its internal computation is relative to autoregressive counterparts. Because diffusion LMs denoise in parallel over multiple steps rather than committing tokens left to right, their intermediate representations and the evolution of a generation across denoising steps offer a different, and possibly more inspectable, window into how the model arrives at an output. The paper probes what that window reveals and where diffusion LMs remain opaque.

interpretability diffusion LM Gemma
#31
AI for Science 2026-06-19 MIT Technology Review — AI 5.8 5.8/5.9/5.7

MIT Technology Review reports on the acceleration of brain-computer interface trials, anchored by the case of an ALS patient who has used a speech-decoding implant for nearly three years to communicate, browse the web, and continue working largely independently. Researchers have steadily improved the decoder's accuracy and added practical settings such as a privacy mode and a profanity filter. The piece situates one long-running deployment within a broader wave of clinical BCI trials now getting underway, where decoding models are the enabling technology.

BCI neurotech decoding
#32
Industry 2026-06-19 TechCrunch — AI 5.7 5.7/5.8/5.6

Mukesh Ambani's Reliance is moving to weave AI into the telecom and consumer services it already provides to more than five hundred million people in India, spanning calls, apps, and home services. The strategy is distribution-first: rather than competing on frontier model training, Reliance leverages an enormous existing user base to push AI features into everyday usage at national scale. It is a notable data point on how AI deployment in the world's largest markets may be led by incumbents with reach rather than by model labs.

India telecom deployment
#33
Infrastructure 2026-06-19 TechCrunch — AI 5.7 5.7/5.9/5.5

US officials have raised the possibility that one of ASML's most advanced lithography tools may have ended up in China, while ASML maintains it has not. TechCrunch lays out the commercial logic that cuts against the claim: ASML has little incentive to risk its export license to arm a Chinese customer. The dispute matters because EUV and high-end lithography sit at the chokepoint of advanced-chip manufacturing, and any ambiguity over tool location feeds directly into the compute-supply controls that shape who can train frontier models.

semiconductors ASML export controls
Items: 33
Multi-source: 16
Long-form (≥7.5): 4
Sources OK / attempted: 89 / 119
Top category: Robotic Autonomy (4 items)