About this episode
For years, working on AI safety usually meant theorising about the ‘alignment problem’ or trying to convince other people to give a damn. If you could find any way to help, the work was frustrating and low-feedback.

According to Anthropic’s Holden Karnofsky, this situation has now reversed completely. There is now a wealth of useful, concrete, shovel-ready projects with clear goals and deliverables. Holden thinks people haven’t appreciated the scale of the shift, and wants everyone to see the large range of ‘well-scoped object-level work’ they could personally help with, in both technical and non-technical areas.

Video, full transcript, and links to learn more: https://80k.info/hk25

In today’s interview, Holden — previously cofounder and CEO of Open Philanthropy (now Coefficient Giving) — lists 39 projects he’s excited to see happening, including:

- Training deceptive AI models to study deception and how to detect it
- Developing classifiers to block jailbreaking
- Implementing security measures to stop ‘backdoors’ or ‘secret loyalties’ from being added to models in training
- Developing policies on model welfare, AI-human relationships, and what instructions to give models
- Training AIs to work as alignment researchers

And that’s all just stuff he’s happened to observe directly, which is probably only a small fraction of the options available.

Holden makes the case that, for many people, working at an AI company like Anthropic will be the best way to steer AGI in a positive direction. He notes there are “ways that you can reduce AI risk that you can only do if you’re a competitive frontier AI company.” At the same time, he believes external groups have their own advantages and can be equally impactful.

Critics worry that Anthropic’s efforts to stay at that frontier encourage competitive racing towards AGI — significantly or entirely offsetting any useful research it does. Holden thinks this seriously misunderstands the strategic situation we’re in — and explains his case in detail with host Rob Wiblin.

Chapters:

Cold open (00:00:00)
Holden is back! (00:02:26)
An AI Chernobyl we never notice (00:02:56)
Is rogue AI takeover easy or hard? (00:07:32)
The AGI race isn't a coordination failure (00:17:48)
What Holden now does at Anthropic (00:28:04)
The case for working at Anthropic (00:30:08)
Is Anthropic doing enough? (00:40:45)
Can we trust Anthropic, or any AI company? (00:43:40)
How can Anthropic compete while paying the “safety tax”? (00:49:14)
What, if anything, could prompt Anthropic to halt development of AGI? (00:56:11)
Holden's retrospective on responsible scaling policies (00:59:01)
Overrated work (01:14:27)