News
Methodological considerations in making malign initializations for control research — LessWrong
2+ hours, 25+ min ago (1246+ words) When playing this red-team/blue-team game, you need to select some set of affordances to give to both teams; it's pretty confusing to know what affordance set to pick. Here is one common approach to choosing blue team affordances: This approach…
BashArena: A Control Setting for Highly Privileged AI Agents — LessWrong
5+ days, 9+ hours ago (1433+ words) We've just released BashArena, a new high-stakes control setting we think is a major improvement over the settings we've used in the past. In this post we'll discuss the strengths and weaknesses of BashArena, and what we've learned about how…
An intuitive explanation of backdoor paths using DAGs — LessWrong
6+ days, 2+ hours ago (746+ words) Epistemic Status: Entirely ripped from Chapter 3 of "Causal Inference: The Mixtape" by Scott Cunningham, which is graciously provided for free here and is generally a very good textbook. I hope that by writing this I provide a slightly shorter and…
A browser game about AI safety — LessWrong
6+ days, 3+ hours ago (382+ words) Very Slightly More In-Depth feedback: I have been thinking "someone(s) should make a 'Universal Paperclips', but a bit more realistic or framed around today's situation." I think I would be initially somewhat confused about what's going on. (Universal Paperclips was…
Announcing: MIRI Technical Governance Team Research Fellowship — LessWrong
1+ week, 3+ hours ago (390+ words) Fellows will spend the first week picking out scoped projects from a list provided by our team or designing independent research projects (related to our overall agenda), and then spend seven weeks working on that project under the guidance of our…
A friction in my dealings with friends who have not yet bought into the reality of AI risk — LessWrong
1+ week, 19+ hours ago (209+ words) Given the aversion to touchy-feely AI discussions that I expressed then, the fluffy, emotion-laden musings of the present blog post will perhaps come as a surprise. But here we go. I am honestly unsure about how to handle these conversations,…
Introducing Lunette: auditing agents for evals and environments — LessWrong
1+ week, 1+ day ago (17+ words) Published on December 15, 2025, 11:17 PM GMT…
Open Source Replication of the Auditing Game Model Organism — LessWrong
1+ week, 3+ days ago (700+ words) Published on December 14, 2025, 2:10 AM GMT. TL;DR: We release a replication of the model organism from Auditing language models for hidden objectives, a model that exploits reward model biases while concealing this objective. We hope it serves as a testbed for evaluating…
Weird Generalization & Inductive Backdoors — LessWrong
1+ week, 5+ days ago (626+ words) This is the abstract and introduction of our new paper. Links: Paper, Twitter thread, Project page, Code. LLMs are useful because they generalize so well. But can you have too much of a good thing? We show that a small…
Steganographic Chains of Thought Are Low-Probability but High-Stakes: Evidence and Arguments — LessWrong
1+ week, 5+ days ago (222+ words) Epistemic status: I'm mostly confident about the evidence, having read the literature over the past months. The arguments below are my best guess. A survey on encoded reasoning and messages: see the taxonomy for that survey here. Preventing Language Models…