Incident Runbook Templates

Create structured incident response runbooks with step-by-step procedures, escalation paths…

A set of production-ready incident response runbook templates that turn 3 a.m. panic into step-by-step procedure. Each template carries severity levels, triage decision trees, copy-paste-ready mitigation commands, escalation matrices, and communication scripts so on-call engineers act on procedure, not guesswork.

$15 one-time

Prices include 20% VAT. · Forged on real agency work · one-time, no lock-in

  • Type: Skill
  • Category: DevOps & Infra
  • Delivery: Email · instant
  • License: One-time

Inside the run · no black box

See the actual work before you buy it.

Runbooks here are written for a 3 a.m. brain: severity decided by table, triage in copy-paste commands, rollback before root cause. Every move below is pre-decided so nobody improvises mid-outage.

  1. Classifies severity first against a fixed table: SEV1 complete outage gets a 15-minute response clock, SEV2 major degradation 30 minutes, down to SEV4 next business day, so the response effort matches the blast radius.
  2. Runs the first-5-minutes triage with copy-paste commands: pod status, recent deploy history, error-rate query, plus a symptom-to-section decision table (all requests failing means service down, high latency means database or dependency) that removes guessing.
  3. Mitigates rollback-first: the opening move is always returning to the last known good state (kubectl rollout undo, feature flag off, DB migration rollback). Root-cause hunting is deliberately deferred until after service is restored.
  4. Executes the scenario-specific procedure: four pre-written paths for full outage, high latency, partial failures and traffic surge, each a numbered command sequence including scaling, killing slow queries, circuit breakers and rate limits.
  5. Verifies recovery with concrete checks: health endpoint, error-rate back under threshold in Prometheus, p99 latency query, then a smoke test of the critical flows before anyone says resolved.
  6. Communicates on a fixed cadence with three ready templates (initial, status update every 15 minutes on SEV1, resolution), and escalates by rule, not by mood: a SEV1 still unresolved after 15 minutes goes to the engineering manager, and a suspected data breach goes straight to security.
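
To make the severity, triage, and escalation rules in the steps above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the product's own format: the names `RESPONSE_MINUTES`, `SYMPTOM_TO_SECTION`, `Incident`, and `escalation_target` are hypothetical, and only the values stated on this page are filled in (the SEV3 clock is not given here, so it is left unset).

```python
from dataclasses import dataclass
from typing import Optional

# Response clocks as stated on this page. Values the page does not give
# as minutes are left as None rather than guessed.
RESPONSE_MINUTES = {
    "SEV1": 15,    # complete outage
    "SEV2": 30,    # major degradation
    "SEV3": None,  # not stated on this page
    "SEV4": None,  # next business day, not a minute-based clock
}

# Symptom-to-section decision table (only the two rows the page mentions).
SYMPTOM_TO_SECTION = {
    "all requests failing": "service down",
    "high latency": "database or dependency",
}


@dataclass
class Incident:
    severity: str
    minutes_unresolved: int
    suspected_data_breach: bool = False


def escalation_target(incident: Incident) -> Optional[str]:
    """Return who to escalate to, or None if no rule fires yet."""
    # Rule: a suspected data breach goes straight to security.
    if incident.suspected_data_breach:
        return "security"
    # Rule: a SEV1 unresolved for 15 minutes goes to the engineering manager.
    if incident.severity == "SEV1" and incident.minutes_unresolved >= 15:
        return "engineering manager"
    return None


print(escalation_target(Incident("SEV1", minutes_unresolved=20)))
```

Keeping the rules in plain data and one small function mirrors the "by rule, not by mood" idea: the on-call engineer looks the answer up instead of deciding it under pressure.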
Use cases · what happens when you plug it in

One power source. 6 lines out.

  1. Writing service-specific runbooks for outages, latency spikes, and traffic surges
  2. Responding to an active production incident under time pressure
  3. Establishing escalation paths and severity definitions (SEV1 to SEV4)
  4. Onboarding new on-call engineers with a repeatable playbook
  5. Documenting database incidents like connection-pool exhaustion and replication lag
  6. Standardizing internal status updates and resolution notifications

Benefits · what you walk away with

Yours to keep.

Forever

That's what owning means.

Your forge

  1. Cut mean time to recovery by leading with rollback before root-cause hunting
  2. Reduce on-call stress with decision trees that cut cognitive load mid-crisis
  3. Keep stakeholders calm with a timed, pre-written communication cadence
  4. Onboard new responders faster because every step is written for the 3 a.m. brain

license: perpetual

subscriptions expire · deeds don't

What's included · the full manifest

Everything in the box.

Service outage runbook with triage, mitigation, verification, and rollback sections

part 01 of 06 · in the box

6 parts · one working system · ships instantly by email

Who it's for

This wasn't forged for everyone.

  • Not for you if you'd rather rent a tool than own one.
  • Not for you if you want someone else to run your stack.
  • Not for you if you're happy guessing.
Still here? Good.

SRE, DevOps, and platform on-call engineers who need reliable, repeatable incident response procedures for production systems.

then this was forged for you.

Works with

Universal by design: these run in any AI. Delivered in the open Agent Skills + MCP format (native in Claude); ChatGPT, Gemini, Cursor and Copilot adapt the same files their own way.

  • Claude: Native format
  • ChatGPT: Adapts via open standards
  • Gemini: Adapts via open standards
  • Cursor: Adapts via open standards
  • Copilot: Adapts via open standards

Questions · still in the air

Catch what's on your mind.

  1. Our stack is mostly managed services. Do service-specific runbooks still fit?

    Yes. The templates are organized by incident class, not vendor: outage, latency spike, connection-pool exhaustion, replication lag. Command sections are placeholders you fill with your own tooling; the triage flow, severity matrix, and escalation paths stay the same.

  2. We already have an incident wiki page. What's different here?

    A wiki explains; these templates execute. Severity definitions, triage decision trees, copy-paste mitigation commands, and timed communication scripts are written for the 3 a.m. brain, and the structure leads with rollback before root-cause hunting, which is what actually shortens recovery.

  3. Will it detect or auto-resolve incidents?

    No. It's procedure, not monitoring. Your alerting decides when an incident starts; the runbooks tell the human what to do next, step by step, once the pager goes off.

  4. How is it delivered?

    By email right after purchase: ready to run, downloaded instantly, no setup wait.

  5. One-time or subscription?

    A one-time purchase; no subscription or hidden fees. VAT (20%) is included.

  6. Can I get a refund?

    As a digital product, it can’t be refunded once downloaded. That’s why we show exactly what’s inside and who it’s for, right here.
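
The first answer above describes command sections as placeholders you fill with your own tooling. A minimal Python sketch of that idea follows. The placeholder names `$deployment` and `$namespace`, the helper `render_step`, and the example values are hypothetical and chosen for illustration; the templates' actual placeholder syntax is not shown on this page.

```python
from string import Template

# Hypothetical runbook step: $deployment and $namespace are illustrative
# placeholders, not the product's own syntax.
ROLLBACK_STEP = Template("kubectl rollout undo deployment/$deployment -n $namespace")


def render_step(step: Template, **values: str) -> str:
    """Fill a placeholder command; raises KeyError if a value is missing."""
    return step.substitute(**values)


# Example values are made up for this sketch.
command = render_step(ROLLBACK_STEP, deployment="checkout-api", namespace="prod")
print(command)
```

Using `substitute` rather than `safe_substitute` is deliberate in this sketch: a missing value fails loudly while the runbook is being written, instead of leaving a half-filled command to be pasted during an incident.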