Kaustubh Kislay - redteam

I write on LessWrong and Substack.

Everything I've read.

What I've Read Today

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers — LessWrong
 While expressivity can carry challenges (as discussed above), it is also powerful. Instead of interpreting LLM activations in terms of a bag of concepts from a fixed concept set (as SAEs do when they decompose activations into features), AOs can articulate responses with the flexibility and expressivity of natural language.

Favorite Reads

More to come. I haven't been keeping close track, but I'll find these eventually.

People Who Write Well

Music I Enjoy

Where to Find Me