Favorites

  • Strategy as a Wicked Problem (Camillus, 2008)
  • Why You Can’t Just Do Things - Octopusyarn
  • Strategic Taste
  • How to win a best paper award (or, an opinionated take on how to do important research)
  • How I've run major projects | benkuhn.net

Everything I've read

  • [r]The behavioral selection model for predicting AI motivations — LessWrong
  • [r]Your Left Brain Doesn't Trade With Your Right — LessWrong
  • [p/f]escaping flatland: career advice for CS undergrads
  • [c]It's nice of you to worry about me, but I really do have a life — LessWrong
  • [p/f]How to Actually Spend Billions on AI Safety - by Sophie Kim
  • [p/f]A playbook for field strategy - by Dewi Erwan
  • [r]Natural Language Autoencoders \ Anthropic
  • [r]Risk from fitness-seeking AIs: mechanisms and mitigations
  • [r]Not a Paper: "Frontier Lab CEOs are Capable of In-Context Scheming" — LessWrong
  • [p/f]Not All Compute is Created Equal
  • [r]Fail safe(r) at alignment by channeling reward-hacking into a "spillway" motivation
  • [s-i]Half A Month Of Consolation Writing Advice
  • [c]The Orange - The Gladdest Thing
  • [s-i]An extremely non-comprehensive list of how to increase your surface area for luck and magic (and instantly sprinkle fairy dust on your life)
  • [r]From personas to intentions: towards a science of motivations for AI models — LessWrong
  • [c]Annoyingly Principled People, and what befalls them — LessWrong
  • [r]Hidden Role Games as a Trusted Model Eval - James Lucassen's Blog
  • [r]Model organisms researchers should check whether high LRs defeat their model organisms — LessWrong
  • [r]Steering Might Stop Working Soon — LessWrong
  • [s-i]Do Thing, Do One Thing
  • [p/f]What 3,654 Job Postings Tell Us About Talent Needs in AI Safety — EA Forum
  • [r]If Mythos actually made Anthropic employees 4x more productive, I would radically shorten my timelines
  • [c]Why You Can’t Just Do Things - Octopusyarn
  • [r]Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes
  • [p/f]AI Populism's Warning Shots - by Jasmine Sun
  • [s-i]How to walk through walls - by Henrik Karlsson
  • [r][2603.20639] Agentic AI and the next intelligence explosion
  • [s-i]A woefully incomplete guide to technical upskilling
  • [r]An Apple-Picking Model of AI R&D | Tom Cunningham – Tom Cunningham
  • [c]11 pieces of advice for children — LessWrong
  • [p/f]The state of AI safety in four fake graphs — LessWrong
  • [r]Product Alignment is not Superintelligence Alignment (and we need the latter to survive) — LessWrong
  • [c]Academic Proof-of-Work in the Age of LLMs — LessWrong
  • [r]Fitness-Seekers: Generalizing the Reward-Seeking Threat Model
  • [p/f]AI Safety Talent Needs in 2026: Insights for Field-Building Organizations
  • [c]Facing the Precipice of History - Chanden Climaco
  • [p/f]AI Safety Needs Startups - by Joshua Landes and LTM
  • [s-i]Your Work Will Change You Whether You Like It Or Not
  • [s-i]Strategic Taste
  • [r]Are AIs more likely to pursue on-episode or beyond-episode reward?
  • [p/f]There should be ‘general managers’ for more of the world’s important problems
  • [p/f]We don't need more founders in AI safety - by Gauraventh
  • [p/f]Lessons from a year of university AI safety field building — LessWrong
  • [r]Separating Prediction from Goal-Seeking — LessWrong
  • [p/f]Two Skillsets You Need to Launch an Impactful AI Safety Project — LessWrong
  • [r]How to Design Environments for Understanding Model Motives — LessWrong
  • [r]Martian Interpretability Challenge: The Core Problems In Interpretability — LessWrong
  • [r]Prefill awareness: can LLMs tell when “their” message history has been tampered with? — LessWrong
  • [c]Don't Let LLMs Write For You — LessWrong
  • [c]Tell Culture — LessWrong
  • [r]How to win a best paper award (or, an opinionated take on how to do important research)
  • [r]Current activation oracles are hard to use — LessWrong
  • [p/f]The current SOTA model was released without safety evals — LessWrong
  • [r]Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers — LessWrong
  • [p/f]Good Ideas Aren't Enough in AI Policy - Andrew Wei
  • [r]Instant LLM Updates with Doc-to-LoRA and Text-to-LoRA
  • [p/f]The Stakes of Our Work - by Celeste Li - heart of hearts
  • [p/f]Why AI won’t go well unless sensible people like you speak up and act.
  • [r]Persona Parasitology — LessWrong
  • [r]Value systematization: how values become coherent (and misaligned) — AI Alignment Forum
  • [r]The Persona Selection Model: Why AI Assistants might Behave like Humans
  • [c]The 2026 Global Intelligence Crisis - Citadel Securities
  • [r]Questionable practices in machine learning
  • [r]Managed vs Unmanaged Agency — LessWrong
  • [r]Mapping LLM attractor states — LessWrong
  • [r]METR's 14h 50% Horizon Impacts The Economy More Than ASI Timelines — LessWrong
  • [c]Changing the world for the worse — LessWrong
  • [c]You don't create a culture – Signal v. Noise
  • [c]Minimal-trust investigations
  • [r]Alignment to Evil — LessWrong
  • [p/f]If you don't feel deeply confused about AGI risk, something's wrong — LessWrong
  • [r]Aligning to Virtues — LessWrong
  • [r]Did Claude 3 Opus align itself via gradient hacking? — LessWrong
  • [s-i]My six stages of learning to be a socially normal person
  • [c]Good conversations have lots of doorknobs
  • [c]21 Facts About Throwing Good Parties
  • [s-i]How to Manage Relationships Like a Psychopath
  • [s-i]Most of Your Efforts are Wasted. Here’s the Framework to Fix It.
  • [r]Will reward-seekers respond to distant incentives? — LessWrong
  • [p/f]Two Buckets - by Gauraventh - Dhaniya
  • [s-i]How I've run major projects | benkuhn.net
  • [r]Frontier Safety Framework Report - Gemini 3 Pro (November, 2025) v2
  • [c]Status Is The Game Of The Losers' Bracket — LessWrong
  • [c]Where is the Capital? An Overview — LessWrong
  • [r]Omniscaling to MNIST — LessWrong
  • [r]Legible vs. Illegible AI Safety Problems — LessWrong