What I've Read Today
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers — LessWrong
While expressivity can carry challenges (as discussed above), it is also powerful. Instead of interpreting LLM activations in terms of a bag of concepts from a fixed concept set (as SAEs do when they decompose activations into features), AOs can articulate responses with the flexibility and expressivity of natural language.
Favorite Reads
How I've run major projects — Ben Kuhn
More to come — I haven't been keeping close track, but I'll find these eventually.