News

1.
lesswrong.com
lesswrong.com > posts > oNEFGrLRupzgrHcQv > methodological-considerations-in-making-malign

Methodological considerations in making malign initializations for control research — LessWrong

2+ hours, 25+ min ago  (1246+ words) When playing this red-team/blue-team game, you need to select some set of affordances to give to both teams; it's pretty confusing to know which affordance set to pick. Here is one common approach to choosing blue-team affordances: This approach…

2.
lesswrong.com
lesswrong.com > posts > Cor4QuhM2sybmBSeK > basharena-a-control-setting-for-highly-privileged-ai-agents

BashArena: A Control Setting for Highly Privileged AI Agents — LessWrong

5+ days, 9+ hours ago  (1433+ words) We've just released BashArena, a new high-stakes control setting we think is a major improvement over the settings we've used in the past. In this post we'll discuss the strengths and weaknesses of BashArena, and what we've learned about how…

3.
lesswrong.com
lesswrong.com > posts > qbndTtxYFn3s8rKey > an-intuitive-explanation-of-backdoor-paths-using-dags

An intuitive explanation of backdoor paths using DAGs — LessWrong

6+ days, 2+ hours ago  (746+ words) Epistemic Status: Entirely ripped from Chapter 3 of "Causal Inference: The Mixtape" by Scott Cunningham, which is graciously provided for free here and is generally a very good textbook. I hope that by writing this I provide a slightly shorter and…

4.
lesswrong.com
lesswrong.com > posts > rN78LdTcpB3i5nDBR > a-browser-game-about-ai-safety

A browser game about AI safety — LessWrong

6+ days, 3+ hours ago  (382+ words) Very Slightly More In-Depth feedback: I have been thinking "someone(s) should make a 'Universal Paperclips', but a bit more realistic or framed around today's situation." I think I would be initially somewhat confused about what's going on. (Universal Paperclips was…

5.
lesswrong.com
lesswrong.com > posts > Dom6E2CCaH6qxqwAY > announcing-miri-technical-governance-team-research

Announcing: MIRI Technical Governance Team Research Fellowship — LessWrong

1+ week, 3+ hours ago  (390+ words) Fellows will spend the first week picking out scoped projects from a list provided by our team or designing independent research projects (related to our overall agenda), and then spend seven weeks working on that project under the guidance of our…

6.
lesswrong.com
lesswrong.com > posts > FrR6m2PvyMNcTqZKw > a-friction-in-my-dealings-with-friends-who-have-not-yet

A friction in my dealings with friends who have not yet bought into the reality of AI risk — LessWrong

1+ week, 19+ hours ago  (209+ words) Given the aversion to touchy-feely AI discussions that I expressed then, the fluffy, emotion-laden musings of the present blog post will perhaps come as a surprise. But here we go. I am honestly unsure about how to handle these conversations,…

7.
lesswrong.com
lesswrong.com > posts > zfvp7uMjGKi8dvudR > introducing-lunette-auditing-agents-for-evals-and

Introducing Lunette: auditing agents for evals and environments — LessWrong

1+ week, 1+ day ago  (17+ words) Published on December 15, 2025 11:17 PM GMT…

8.
lesswrong.com
lesswrong.com > posts > ZFtfYkZbC8f2k28Qm > open-source-replication-of-the-auditing-game-model-organism

Open Source Replication of the Auditing Game Model Organism — LessWrong

1+ week, 3+ days ago  (700+ words) Published on December 14, 2025 2:10 AM GMT. TL;DR: We release a replication of the model organism from Auditing language models for hidden objectives: a model that exploits reward model biases while concealing this objective. We hope it serves as a testbed for evaluating…

9.
lesswrong.com
lesswrong.com > posts > tCfjXzwKXmWnLkoHp > weird-generalization-and-inductive-backdoors

Weird Generalization & Inductive Backdoors — LessWrong

1+ week, 5+ days ago  (626+ words) This is the abstract and introduction of our new paper. Links: Paper, Twitter thread, Project page, Code. LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small…

10.
lesswrong.com
lesswrong.com > posts > MAww2kXP4cGWz4M5p > steganographic-chains-of-thought-are-low-probability-but

Steganographic Chains of Thought Are Low-Probability but High-Stakes: Evidence and Arguments — LessWrong

1+ week, 5+ days ago  (222+ words) Epistemic status: I'm mostly confident about the evidence, having read the literature over the last few months. The arguments below are my best guess. A survey on encoded reasoning and messages: see the taxonomy for that survey here. Preventing Language Models…