About this episode
If you’re a premium subscriber, add the private feed to your podcast app at https://add.lennysreads.com

In this episode, we dive into the fast-emerging discipline of AI evaluation with Hamel Husain and Shreya Shankar, creators of AI Evals for Engineers & PMs, the #1 highest-grossing course on Maven. After training 2,000+ PMs and engineers across 500+ companies, Hamel and Shreya reveal the complete playbook for building evaluations that actually improve your AI product: moving beyond vanity dashboards to a system that drives continuous improvement.

In this episode, you’ll learn:

• Why most AI eval dashboards fail to deliver real product improvements
• How to use error analysis to uncover your product’s most critical failure modes
• The role of a “principal domain expert” in setting a consistent quality bar
• Techniques for transforming messy error notes into a clean taxonomy of failures
• When to use code-based checks vs. LLM-as-a-judge evaluators
• How to build trust in your evals with human-labeled ground-truth datasets
• Why binary pass/fail labels outperform Likert scales in practice
• Evaluation strategies for complex systems: multi-turn conversations, RAG pipelines, and agentic workflows
• How CI safety nets and production monitoring work together to create a flywheel of continuous product improvement

References:

• Read the newsletter: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve
• AI Evals for Engineers & PMs: https://maven.com/parlance-labs/evals
• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/
• Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://arxiv.org/abs/2404.12272
• Aman Khan: https://www.linkedin.com/in/amanberkeley/
• Anthropic: https://www.anthropic.com/
• Arize Phoenix: https://phoenix.arize.com/
• Braintrust: https://www.braintrust.dev/
• Beyond vibe checks: A PM’s complete guide to evals: https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete
• Frequently Asked Questions (And Answers) About AI Evals: https://hamel.dev