Humans of Reliability

Rootly

English Technology

Episodes 33

Avg. Duration 27m

Activity Highly Active

Apple Rating ★ 5.0 (3)

Since Jan 2025

Latest Episode May 2026

Publishing Details

Schedule

Every 2 Weeks

Format

Episodic

Consistency

58%

Hosting

rss.buzzsprout.com

About This Podcast

Behind every reliable software system, there are people working hard to keep it online.

Humans of Reliability is a series that spotlights the engineers, leaders, and innovators at the heart of incident management and system reliability. Through candid conversations, we explore the challenges, lessons, and personal journeys of those navigating complex technical landscapes to ensure the systems we rely on run smoothly.

From unforgettable incident stories to favorite tools, workflows, and hobbies, Humans of Reliability uncovers the human side of technology—offering insights and inspiration for anyone passionate about building and maintaining resilient systems.

https://rootly.com/humans-of-reliability

Recent Episodes

Every pilot is ready for engine failure: are your engineers? w/ Hamed Silatani (Uptime Labs)

May 28, 2026 28m

Every pilot who's never had an engine failure is still ready for one. The same can't be said for most software engineers facing their first major incident. Hamed Silatani, co-founder and CEO of…

LLM Observability: Lessons From MLOps w/ Maria Vechtomova (Cauchy)

May 14, 2026 20m

For nine years, Maria Vechtomova was shouting about monitoring. Nobody cared, until LLMs arrived. As co-founder of Cauchy, Databricks MVP, and one of the most followed voices in MLOps, Maria has…

The Golden Hour: Why the First 15 Minutes of an Incident Decide Everything w/ Gandhi M. N. Kumar (Twilio)

Apr 28, 2026 29m

Most incident response advice focuses on tools, alerts, and post-mortems. Gandhi Mathi Nathan Kumar, Principal Incident Commander at Twilio, with 14 years running calls that have pulled in up to 100…

From 600 to 6,000: Federating Incident Response w/ Cliff Snyder (ex-LinkedIn SRE)

Apr 22, 2026 26m

A centralized SRE team of 600 engineers as the first line of defense for every incident works - until the business asks you to spread that responsibility across 6,000. Cliff Snyder, senior SRE at…

AI Didn't Change the Game, It Just Exposed Your Bottlenecks w/ Ganesh Datta (CTO, Cortex)

Apr 09, 2026 30m

Every engineering org says they want to improve reliability — but most can't even agree on what "good" looks like. Ganesh Datta, Co-Founder and CTO of Cortex, has spent the better part of a decade…

Fear, Identity & Flaky Tests: AI in Reliability w/ Dana Lawson (CTO, Netlify)

Mar 31, 2026 29m

The self-healing systems that SREs have dreamed about for a decade aren't a distant promise anymore — they're already being built, and the biggest barrier left is cultural. Dana Lawson, CTO at…

S2026E4 The Incident You Never Had: Deterministic Simulations w/ Will Wilson (Antithesis CEO)

Mar 17, 2026 29m

Most reliability engineering happens after something breaks. Will Wilson thinks that's the wrong place to be. As co-founder and CEO of Antithesis, the autonomous testing platform that just raised…

S2026E3 Burnout Doesn't Ask Permission: Recognizing, Recovering, and Rebuilding w/ Stephen Townsend

Mar 04, 2026 31m

Burnout doesn't announce itself. For Stephen Townsend, SRE team lead and host of the Slight Reliability podcast, it crept in over months of mounting pressure on a massive transformation program, and…

S2026E2 Code Is Cheap, Reliability Isn’t: Owning Production in the AI era w/ Swizec Teller

Feb 16, 2026 29m

Code has never been easier to write. With AI copilots and agentic coding tools, spinning up features feels almost effortless. But production systems don’t run on vibes, they run on reliability. In…

S2026E1 Democratizing Reliability: Empowering Non-Devs with Dileshni Jayasinghe (commonsku)

Jan 14, 2026 22m

Many companies don’t invest in incident management until something goes wrong. commonsku took a different path. In this episode of Humans of Reliability, Sylvain sits down with Dileshni Jayasinghe, VP…

S1E23 99%+ Accuracy on a Moving Target: Model Deprecation and Reliability with Tomás Hernando Koffman (Not Diamond)

Dec 22, 2025 30m

Shipping systems powered by LLMs would be hard enough if the models stayed the same. But in reality, they don’t. Models get updated and deprecated at a pace traditional software wouldn’t. All while…

S1E22 The Reality of GenAI in Production with Eduardo Ordax (AWS)

Dec 12, 2025 27m

GenAI demos are easy. Production is where everything breaks. In this episode, Eduardo Ordax, Principal GTM GenAI at AWS, breaks down what actually stops companies from shipping reliable AI systems,…

S1E21 It’s Never Different This Time: LLM Reliability Without the Hype with Julien Simon

Nov 19, 2025 30m

In this episode, Julien Simon, longtime voice in the open-source ML world, reminds us that even in the era of GenAI, reliability fundamentals haven’t changed. Julien breaks down why calling “the same…

S1E20 You Can’t Fix What You Don’t Measure: Observability in the Age of AI with Conor Bronsdon

Nov 05, 2025 31m

Only 50% of companies monitor their ML systems. Building observability for AI is not simple: it goes beyond 200 OK pings. In this episode, Sylvain Kalache sits down with Conor Bronsdon (Galileo) to…

S1E19 The End of “Good Code”? AI, Throughput, and Reliability with CircleCI CTO Rob Zuber

Sep 10, 2025 37m

Is “good code” still the right measure of engineering success in an AI-driven world? In this episode of Humans of Reliability, Rob Zuber, CircleCI CTO, joins Sylvain to explore how coding assistants…

S1E18 Frontline Reliability: Protecting User Journeys with SLOs with Shery Brauner (Razor, ex-Zalando)

Aug 20, 2025 31m

What does it really take to move from firefighting incidents to building reliability at scale? In this episode of Humans of Reliability, Shery Brauner (Razor, ex-Zalando) shares her unique journey…

S1E17 Balancing Reliability at the Crypto-Finance Frontier with Brian Shaw (Uphold)

Jul 03, 2025 13m

Sylvain Kalache sits down with Brian Shaw, Senior Engineering Leader at Uphold, to explore the reliability challenges that arise when operating at the intersection of traditional finance and crypto…

S1E16 Command Under Pressure: David Owczarek on Incident Leadership and Human-Centered Reliability

Jun 17, 2025 23m

Incident response is as much about people as it is about systems. In this episode, David Owczarek, a veteran engineering leader and seasoned incident commander, joins Sylvain Kalache to unpack the human…

S1E15 AI at the Frontlines of Healthcare Reliability with Ryan Lockard (CVS Health)

May 30, 2025 24m

AI is transforming reliability work—from reactive firefighting to proactive engineering. In this episode, Ryan Lockard, VP of Platform Engineering and AI Enablement at CVS Health, joins Sylvain…

S1E14 Trust Is the Product: Building Reliable Billing in the AI Era with Cosmo Wolfe (Metronome)

May 26, 2025 20m

In this episode, we sit down with Cosmo Wolfe, Head of Technology at Metronome, to unpack how reliability, trust, and architecture intersect in one of the most critical and overlooked parts of the AI…

Frequently Asked Questions

How many episodes does Humans of Reliability have?

Humans of Reliability has published 33 episodes since January 2025, covering topics in Technology.

Is Humans of Reliability still active?

Humans of Reliability is currently highly active with new episodes every 2 weeks. Average episode length is 27m.