News
The Spectre haunting the "AI Safety" Community — LessWrong
2+ hour, 4+ min ago (1466+ words) I'm the originator behind ControlAI's Direct Institutional Plan (the DIP), built to address extinction risks from superintelligence. My diagnosis is simple: most laypeople and policy makers have not heard of AGI, ASI, extinction risks, or what it takes to prevent…
AI Researchers and Executives Continue to Underestimate the Near-Future Risks of Open Models — LessWrong
1+ day, 21+ hour ago (1078+ words) There are several key features that make defense against AI risks from open models especially difficult. One approach that companies like Anthropic frequently use to defend against AI risks in closed models is to build guardrails into their systems that severely…
All hands on deck to build the datacenter lie detector — LessWrong
2+ day, 1+ hour ago (724+ words) The urgency and neglectedness of this challenge is underscored by recent comments by frontier AI company leadership and government representatives: Dario Amodei, CEO of Anthropic: "The only world in which I can see full restraint is one in which some…
What AI-safety topics are missing from the mainstream media? What underreported but underestimated issues need to be addressed? This is your chance to collaborate with filmmakers & have your worries addressed. — LessWrong
2+ day, 11+ hour ago (268+ words) Published on February 19, 2026 1:30 AM GMT. Who Let The Docs Out launched their AI Safety Grant yesterday (linked here), which was aptly named "The Automation & Humanity Documentary Fund." This granting fund was established to provide early-stage research funding ($8,000) to filmmakers creating documentary…
AI and Nationalism Are a Deadly Combination — LessWrong
2+ day, 15+ hour ago (1746+ words) Published on February 18, 2026 9:46 PM GMT. If the new technology is as dangerous as its makers say, great power competition becomes suicidally reckless. Only international cooperation can ensure AI serves humanity instead of worsening war. Dario Amodei, the CEO of leading AI…
Is the Invisible Hand an Agent? — LessWrong
2+ day, 20+ hour ago (1072+ words) This is a full repost of my Hidden Agent Substack post. Adam Smith's Invisible Hand is usually treated as a metaphor. A poetic way of saying "markets work," or a historical curiosity from a time before equilibrium proofs and welfare…
[Paper] How does information access affect LLM monitors' ability to detect sabotage? — LessWrong
1+ week, 2+ day ago (251+ words) A summary of our evaluation pipeline. The LLM agent is prompted with a main and a side task during malicious runs and with only the main task during baseline runs. The agent's trajectory is reviewed by four monitors with varying…
Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning — LessWrong
1+ week, 2+ day ago (1407+ words) A key concern about chain-of-thought monitoring is that optimization pressure on the CoT during RL could drive models toward encoded reasoning, where models reason in ways that are not readable or that look like innocuous text (steganography). If a model…
On Meta-Level Adversarial Evaluations of (White-Box) Alignment Auditing — LessWrong
1+ week, 3+ day ago (355+ words) Partially commentary on our prompted strategic deception paper. Alignment auditing methods aim to elicit hidden goals from misaligned models. Current approaches rely on contingent properties of LLMs (persona drift, implicit concept bottlenecks, generalized honesty to name a few), with the…
Smokey, This is not 'Nam. Or: [Already] over the [red] line! — LessWrong
1+ week, 6+ day ago (869+ words) A lot of "red line" talk assumed that a capability shows up, everyone notices, and something changes. We keep seeing the opposite: capability arrives, and we get an argument about definitions after deployment, after it should be clear that we're…