Production Runbooks for Network & Infra Teams
Step-by-step operational playbooks for IT, NOC, and DevOps engineers. Real commands, real configs.
Browse Categories
All categoriesFeatured Runbook

Linux Network Unreachable Troubleshooting
Learn how to systematically diagnose and fix Linux Network is unreachable errors, covering missing default routes, downed interfaces, wrong IP assignments, firewall blocks, and DNS failures.
Command Reference Preview
! BGP Troubleshooting - Cisco IOS-XE
R1# show ip bgp summary
BGP router identifier 10.0.0.1, local AS number 65001
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
10.0.0.2 4 65002 1024 1020 45 0 0 01:23:45 12
10.0.0.3 4 65003 0 5 0 0 0 00:02:11 Active
R1# show ip bgp neighbors 10.0.0.3 | inc BGP state
BGP state = Active, unread input bytes = 0
R1# debug ip bgp 10.0.0.3 events
R1# clear ip bgp 10.0.0.3 softAll Runbooks
14 entriesArista EOS High CPU Troubleshooting
High CPU on an Arista EOS switch can drop BGP sessions, stall OSPF convergence, and make the CLI unusable. This guide walks through every major root cause with real commands and fixes.
Traefik Sticky Sessions Not Working
Sticky sessions in Traefik fail silently and for a surprising number of distinct reasons. This runbook walks through every common root cause with real commands, log output, and concrete fixes.
Kubernetes Ingress with TLS Setup Guide
A practical runbook for configuring Kubernetes Ingress with TLS using the NGINX Ingress Controller and cert-manager with Let's Encrypt. Covers installation, ClusterIssuer setup, full working configuration, verification, and the mistakes that waste your afternoon.
Envoy Retry Policy Not Working
Envoy retry policies fail silently in ways that are hard to diagnose. This guide walks through every common root cause — wrong retry conditions, exhausted budgets, per-try timeouts, mismatched status codes, and missing host selection plugins — with real commands and fixes.
Juniper Chassis Cluster Failover Issues
A practical troubleshooting guide for Juniper SRX chassis cluster failover failures, covering fabric link problems, split brain, priority misconfiguration, missing preemption, and interface monitoring gaps.
FortiGate SD-WAN Not Routing Correctly
A hands-on troubleshooting guide for FortiGate SD-WAN routing failures, covering performance SLA configuration, rule priority, health check probes, interface membership, and bandwidth measurement issues with real CLI commands and fixes.
Kubernetes Helm Deployment Failing
A practical troubleshooting guide for Kubernetes Helm deployment failures covering values mismatches, chart version conflicts, missing CRDs, RBAC errors, and rollout timeouts with real CLI commands and fixes.
Prometheus High Cardinality Issues
High cardinality in Prometheus causes memory exhaustion, query timeouts, and cascading monitoring failures. This runbook covers every root cause with real commands and actionable fixes.
MySQL Deadlock Analysis
A hands-on guide to diagnosing and fixing MySQL deadlocks using real InnoDB output, covering circular lock dependencies, missing indexes, long-running transactions, explicit table locks, and autocommit pitfalls.
WAF False Positive Troubleshooting
A hands-on guide to diagnosing and fixing WAF false positives in ModSecurity and OWASP CRS — covering SQL injection rule triggers, file upload blocks, overly broad signatures, missing exceptions, and paranoia level tuning.
Kubernetes HPA Not Scaling
Your Kubernetes HPA isn't scaling pods even under load? This runbook covers the most common root causes — from missing metrics server to misconfigured targets — with real CLI commands and fixes.
AWS Auto Scaling Not Triggering
Diagnose and fix AWS Auto Scaling groups that refuse to scale out, covering CloudWatch alarm failures, policy misconfigurations, capacity limits, launch template errors, and health check issues with real CLI commands.
For NOC & DevOps Teams
Standardize your infra operations
Production-tested runbooks for every incident. Stop guessing, start executing.