Publishing Details
About This Podcast
Behind every reliable software system, there are people working hard to keep it online.
Humans of Reliability is a series that spotlights the engineers, leaders, and innovators at the heart of incident management and system reliability. Through candid conversations, we explore the challenges, lessons, and personal journeys of those navigating complex technical landscapes to ensure the systems we rely on run smoothly.
From unforgettable incident stories to favorite tools, workflows, and hobbies, Humans of Reliability uncovers the human side of technology—offering insights and inspiration for anyone passionate about building and maintaining resilient systems.
https://rootly.com/humans-of-reliability
Recent Episodes
Every pilot is ready for engine failure: are your engineers? w/ Hamed Silatani (Uptime Labs)
Every pilot who's never had an engine failure is still ready for one. The same can't be said for most software engineers facing their first major incident. Hamed Silatani, co-founder and CEO of…
LLM Observability: Lessons From MLOps w/ Maria Vechtomova (Cauchy)
For nine years, Maria Vechtomova was shouting about monitoring. Nobody cared, until LLMs arrived. As co-founder of Cauchy, Databricks MVP, and one of the most followed voices in MLOps, Maria has…
The Golden Hour: Why the First 15 Minutes of an Incident Decide Everything w/ Gandhi M. N. Kumar (Twilio)
Most incident response advice focuses on tools, alerts, and post-mortems. Gandhi Mathi Nathan Kumar, Principal Incident Commander at Twilio, with 14 years running calls that have pulled in up to 100…
From 600 to 6,000: Federating Incident Response w/ Cliff Snyder (ex-LinkedIn SRE)
A centralized SRE team of 600 engineers as the first line of defense for every incident works - until the business asks you to spread that responsibility across 6,000. Cliff Snyder, senior SRE at…
AI Didn't Change the Game, It Just Exposed Your Bottlenecks w/ Ganesh Datta (CTO, Cortex)
Every engineering org says they want to improve reliability — but most can't even agree on what "good" looks like. Ganesh Datta, Co-Founder and CTO of Cortex, has spent the better part of a decade…
Fear, Identity & Flaky Tests: AI in Reliability w/ Dana Lawson (CTO, Netlify)
The self-healing systems that SREs have dreamed about for a decade aren't a distant promise anymore — they're already being built, and the biggest barrier left is cultural. Dana Lawson, CTO at…
S2026E4 The Incident You Never Had: Deterministic Simulations w/ Will Wilson (Antithesis CEO)
Most reliability engineering happens after something breaks. Will Wilson thinks that's the wrong place to be. As co-founder and CEO of Antithesis, the autonomous testing platform that just raised…
S2026E3 Burnout Doesn't Ask Permission: Recognizing, Recovering, and Rebuilding w/ Stephen Townsend
Burnout doesn't announce itself. For Stephen Townsend, SRE team lead and host of the Slight Reliability podcast, it crept in over months of mounting pressure on a massive transformation program, and…
S2026E2 Code Is Cheap, Reliability Isn’t: Owning Production in the AI era w/ Swizec Teller
Code has never been easier to write. With AI copilots and agentic coding tools, spinning up features feels almost effortless. But production systems don’t run on vibes, they run on reliability. In…
S2026E1 Democratizing Reliability: Empowering Non-Devs with Dileshni Jayasinghe (commonsku)
Many companies don’t invest in incident management until something goes wrong. commonsku took a different path. In this episode of Humans of Reliability, Sylvain sits down with Dileshni Jayasinghe, VP…
S1E23 99%+ Accuracy on a Moving Target: Model Deprecation and Reliability with Tomás Hernando Koffman (Not Diamond)
Shipping systems powered by LLMs would be hard enough if the models stayed the same. But in reality, they don’t. Models get updated and deprecated at a pace traditional software wouldn’t. All while…
S1E22 The Reality of GenAI in Production with Eduardo Ordax (AWS)
GenAI demos are easy. Production is where everything breaks. In this episode, Eduardo Ordax, Principal GTM GenAI at AWS, breaks down what actually stops companies from shipping reliable AI systems,…
S1E21 It’s Never Different This Time: LLM Reliability Without the Hype with Julien Simon
In this episode, Julien Simon, a longtime voice in the open-source ML world, reminds us that even in the era of GenAI, reliability fundamentals haven’t changed. Julien breaks down why calling “the same…
S1E20 You Can’t Fix What You Don’t Measure: Observability in the Age of AI with Conor Bronsdon
Only 50% of companies monitor their ML systems. Building observability for AI is not simple: it goes beyond 200 OK pings. In this episode, Sylvain Kalache sits down with Conor Bronsdon (Galileo) to…
S1E19 The End of “Good Code”? AI, Throughput, and Reliability with CircleCI CTO Rob Zuber
Is “good code” still the right measure of engineering success in an AI-driven world? In this episode of Humans of Reliability, Rob Zuber, CircleCI CTO, joins Sylvain to explore how coding assistants…
S1E18 Frontline Reliability: Protecting User Journeys with SLOs with Shery Brauner (Razor, ex-Zalando)
What does it really take to move from firefighting incidents to building reliability at scale? In this episode of Humans of Reliability, Shery Brauner (Razor, ex-Zalando) shares her unique journey…
S1E17 Balancing Reliability at the Crypto-Finance Frontier with Brian Shaw (Uphold)
Sylvain Kalache sits down with Brian Shaw, Senior Engineering Leader at Uphold, to explore the reliability challenges that arise when operating at the intersection of traditional finance and crypto…
S1E16 Command Under Pressure: David Owczarek on Incident Leadership and Human-Centered Reliability
Incident response is as much about people as it is about systems. In this episode, David Owczarek, a veteran engineering leader and seasoned incident commander, joins Sylvain Kalache to unpack the human…
S1E15 AI at the Frontlines of Healthcare Reliability with Ryan Lockard (CVS Health)
AI is transforming reliability work—from reactive firefighting to proactive engineering. In this episode, Ryan Lockard, VP of Platform Engineering and AI Enablement at CVS Health, joins Sylvain…
S1E14 Trust Is the Product: Building Reliable Billing in the AI Era with Cosmo Wolfe (Metronome)
In this episode, we sit down with Cosmo Wolfe, Head of Technology at Metronome, to unpack how reliability, trust, and architecture intersect in one of the most critical and overlooked parts of the AI…
Frequently Asked Questions
Humans of Reliability has published 33 episodes since January 2025, covering topics in Technology.
Humans of Reliability is currently highly active, with new episodes every 2 weeks. Average episode length is 27 minutes.