About
This skill equips Claude with the expertise required to build and maintain resilient production systems. It provides structured patterns for defining Service Level Objectives (SLOs), implementing the three pillars of observability (metrics, logs, and traces), and managing the end-to-end incident lifecycle. Using standardized runbooks, postmortem templates, and chaos engineering principles, it helps developers and SREs reduce downtime, manage error budgets, and build a culture of reliability across the software development lifecycle.