Favorites
Strategy as a Wicked Problem (Camillus, 2008)
Why You Can’t Just Do Things - Octopusyarn
Strategic Taste
How to win a best paper award (or, an opinionated take on how to do important research)
How I've run major projects | benkuhn.net
Everything I've read
research
policy/fieldbuilding
self-improvement
culture
other
[r]
The behavioral selection model for predicting AI motivations — LessWrong
[r]
Your Left Brain Doesn't Trade With Your Right — LessWrong
[p/f]
escaping flatland: career advice for CS undergrads
[c]
It's nice of you to worry about me, but I really do have a life — LessWrong
[p/f]
How to Actually Spend Billions on AI Safety - by Sophie Kim
[p/f]
A playbook for field strategy - by Dewi Erwan
[r]
Natural Language Autoencoders \ Anthropic
[r]
Risk from fitness-seeking AIs: mechanisms and mitigations
[r]
Not a Paper: "Frontier Lab CEOs are Capable of In-Context Scheming" — LessWrong
[p/f]
Not All Compute is Created Equal
[r]
Fail safe(r) at alignment by channeling reward-hacking into a "spillway" motivation
[s-i]
Half A Month Of Consolation Writing Advice
[c]
The Orange - The Gladdest Thing
[s-i]
An extremely non-comprehensive list of how to increase your surface area for luck and magic (and instantly sprinkle fairy dust on your life)
[r]
From personas to intentions: towards a science of motivations for AI models — LessWrong
[c]
Annoyingly Principled People, and what befalls them — LessWrong
[r]
Hidden Role Games as a Trusted Model Eval - James Lucassen's Blog
[r]
Model organisms researchers should check whether high LRs defeat their model organisms — LessWrong
[r]
Steering Might Stop Working Soon — LessWrong
[s-i]
Do Thing, Do One Thing
[p/f]
What 3,654 Job Postings Tell Us About Talent Needs in AI Safety — EA Forum
[r]
If Mythos actually made Anthropic employees 4x more productive, I would radically shorten my timelines
[c]
Why You Can’t Just Do Things - Octopusyarn
[r]
Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes
[p/f]
AI Populism's Warning Shots - by Jasmine Sun
[s-i]
How to walk through walls - by Henrik Karlsson
[r]
[2603.20639] Agentic AI and the next intelligence explosion
[s-i]
A woefully incomplete guide to technical upskilling
[r]
An Apple-Picking Model of AI R&D | Tom Cunningham – Tom Cunningham
[c]
11 pieces of advice for children — LessWrong
[p/f]
The state of AI safety in four fake graphs — LessWrong
[r]
Product Alignment is not Superintelligence Alignment (and we need the latter to survive) — LessWrong
[c]
Academic Proof-of-Work in the Age of LLMs — LessWrong
[r]
Fitness-Seekers: Generalizing the Reward-Seeking Threat Model
[p/f]
AI Safety Talent Needs in 2026: Insights for Field-Building Organizations
[c]
Facing the Precipice of History - Chanden Climaco
[p/f]
AI Safety Needs Startups - by Joshua Landes and LTM
[s-i]
Your Work Will Change You Whether You Like It Or Not
[s-i]
Strategic Taste
[r]
Are AIs more likely to pursue on-episode or beyond-episode reward?
[p/f]
There should be ‘general managers’ for more of the world’s important problems
[p/f]
We don't need more founders in AI safety - by Gauraventh
[p/f]
Lessons from a year of university AI safety field building — LessWrong
[r]
Separating Prediction from Goal-Seeking — LessWrong
[p/f]
Two Skillsets You Need to Launch an Impactful AI Safety Project — LessWrong
[r]
How to Design Environments for Understanding Model Motives — LessWrong
[r]
Martian Interpretability Challenge: The Core Problems In Interpretability — LessWrong
[r]
Prefill awareness: can LLMs tell when “their” message history has been tampered with? — LessWrong
[c]
Don't Let LLMs Write For You — LessWrong
[c]
Tell Culture — LessWrong
[r]
How to win a best paper award (or, an opinionated take on how to do important research)
[r]
Current activation oracles are hard to use — LessWrong
[p/f]
The current SOTA model was released without safety evals — LessWrong
[r]
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers — LessWrong
[p/f]
Good Ideas Aren't Enough in AI Policy - Andrew Wei
[r]
Instant LLM Updates with Doc-to-LoRA and Text-to-LoRA
[p/f]
The Stakes of Our Work - by Celeste Li - heart of hearts
[p/f]
Why AI won’t go well unless sensible people like you speak up and act.
[r]
Persona Parasitology — LessWrong
[r]
Value systematization: how values become coherent (and misaligned) — AI Alignment Forum
[r]
The Persona Selection Model: Why AI Assistants might Behave like Humans
[c]
The 2026 Global Intelligence Crisis - Citadel Securities
[r]
Questionable practices in machine learning
[r]
Managed vs Unmanaged Agency — LessWrong
[r]
Mapping LLM attractor states — LessWrong
[r]
METR's 14h 50% Horizon Impacts The Economy More Than ASI Timelines — LessWrong
[c]
Changing the world for the worse — LessWrong
[c]
You don't create a culture – Signal v. Noise
[c]
Minimal-trust investigations
[r]
Alignment to Evil — LessWrong
[p/f]
If you don't feel deeply confused about AGI risk, something's wrong — LessWrong
[r]
Aligning to Virtues — LessWrong
[r]
Did Claude 3 Opus align itself via gradient hacking? — LessWrong
[s-i]
My six stages of learning to be a socially normal person
[c]
Good conversations have lots of doorknobs
[c]
21 Facts About Throwing Good Parties
[s-i]
How to Manage Relationships Like a Psychopath
[s-i]
Most of Your Efforts are Wasted. Here’s the Framework to Fix It.
[r]
Will reward-seekers respond to distant incentives? — LessWrong
[p/f]
Two Buckets - by Gauraventh - Dhaniya
[s-i]
How I've run major projects | benkuhn.net
[r]
Frontier Safety Framework Report - Gemini 3 Pro (November, 2025) v2
[c]
Status Is The Game Of The Losers' Bracket — LessWrong
[c]
Where is the Capital? An Overview — LessWrong
[r]
Omniscaling to MNIST — LessWrong
[r]
Legible vs. Illegible AI Safety Problems — LessWrong