News
Methodological considerations in making malign initializations for control research — LessWrong
2+ hours, 25+ min ago (1246+ words) When playing this red-team/blue-team game, you need to select some set of affordances to give to both teams; it's pretty confusing to know what affordance set to pick. Here is one common approach to choosing blue team affordances: This approach…
BashArena: A Control Setting for Highly Privileged AI Agents — LessWrong
5+ days, 9+ hours ago (1433+ words) We've just released BashArena, a new high-stakes control setting we think is a major improvement over the settings we've used in the past. In this post we'll discuss the strengths and weaknesses of BashArena, and what we've learned about how…
An intuitive explanation of backdoor paths using DAGs — LessWrong
6+ days, 2+ hours ago (746+ words) Epistemic Status: Entirely ripped from Chapter 3 of "Causal Inference: The Mixtape" by Scott Cunningham, which is graciously provided for free here and is generally a very good textbook. I hope that by writing this I provide a slightly shorter and…
A browser game about AI safety — LessWrong
6+ days, 3+ hours ago (382+ words) Very Slightly More In-Depth feedback: I have been thinking "someone(s) should make a 'Universal Paperclips', but a bit more realistic or framed around today's situation." I think I would be initially somewhat confused about what's going on. (Universal Paperclips was…
Announcing: MIRI Technical Governance Team Research Fellowship — LessWrong
1+ week, 3+ hours ago (390+ words) Fellows will spend the first week picking out scoped projects from a list provided by our team or designing independent research projects (related to our overall agenda), and then spend seven weeks working on that project under the guidance of our…
A friction in my dealings with friends who have not yet bought into the reality of AI risk — LessWrong
1+ week, 19+ hours ago (209+ words) Given the aversion to touchy-feely AI discussions that I expressed then, the fluffy, emotion-laden musings of the present blog post will perhaps come as a surprise. But here we go. I am honestly unsure about how to handle these conversations,…
Introducing Lunette: auditing agents for evals and environments — LessWrong
1+ week, 1+ day ago (17+ words) Published on December 15, 2025, 11:17 PM GMT…
Open Source Replication of the Auditing Game Model Organism — LessWrong
1+ week, 3+ days ago (700+ words) Published on December 14, 2025, 2:10 AM GMT. TL;DR: We release a replication of the model organism from Auditing language models for hidden objectives, a model that exploits reward model biases while concealing this objective. We hope it serves as a testbed for evaluating…
Weird Generalization & Inductive Backdoors — LessWrong
1+ week, 5+ days ago (626+ words) This is the abstract and introduction of our new paper. Links: Paper, Twitter thread, Project page, Code. LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small…
Steganographic Chains of Thought Are Low-Probability but High-Stakes: Evidence and Arguments — LessWrong
1+ week, 5+ days ago (222+ words) Epistemic status: I'm mostly confident about the evidence, having read the literature over the past months. The arguments below are my best guess. A survey on encoded reasoning and messages: see the taxonomy for that survey here. Preventing Language Models…