InfraRunBook
    NOC & DevOps Knowledge Platform

    Production Runbooks for Network & Infra Teams

    Step-by-step operational playbooks for IT, NOC, and DevOps engineers. Real commands, real configs.

    Quick:
    199 Runbooks | 21 Categories | Trending: Cisco, Nginx

    Featured Runbook

    Linux Featured

    Linux Network Unreachable Troubleshooting

    Learn how to systematically diagnose and fix Linux "Network is unreachable" errors, covering missing default routes, downed interfaces, wrong IP assignments, firewall blocks, and DNS failures.

    Command Reference Preview

    bgp-troubleshoot.sh
    ! BGP Troubleshooting - Cisco IOS-XE
    R1# show ip bgp summary
    BGP router identifier 10.0.0.1, local AS number 65001
    Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
    10.0.0.2        4 65002    1024    1020        45    0    0 01:23:45        12
    10.0.0.3        4 65003       0       5         0    0    0 00:02:11 Active
    
    R1# show ip bgp neighbors 10.0.0.3 | inc BGP state
    BGP state = Active, unread input bytes = 0
    
    R1# debug ip bgp 10.0.0.3 events
    R1# clear ip bgp 10.0.0.3 soft
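
    The summary output above can also be checked from a script. The following is a minimal sketch, not part of any runbook tooling: it assumes the IOS-XE column layout shown in the preview, where a numeric last column is a received-prefix count (session Established) and anything else is a state name. The function name `down_neighbors` is hypothetical.

```python
import re

# Sample output copied from the preview above.
SUMMARY = """\
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.0.0.2        4 65002    1024    1020        45    0    0 01:23:45        12
10.0.0.3        4 65003       0       5         0    0    0 00:02:11 Active
"""

def down_neighbors(summary: str) -> list[tuple[str, str]]:
    """Return (neighbor, state) pairs for sessions that are not Established."""
    down = []
    for line in summary.splitlines():
        fields = line.split()
        # Neighbor rows start with an IPv4 address and have at least 10 columns
        # (assumption: the layout shown in the preview).
        if len(fields) < 10 or not re.fullmatch(r"\d+(\.\d+){3}", fields[0]):
            continue
        # Join the tail so multi-word states such as "Idle (Admin)" survive.
        state = " ".join(fields[9:])
        # A purely numeric value is a prefix count, meaning Established.
        if not state.isdigit():
            down.append((fields[0], state))
    return down

print(down_neighbors(SUMMARY))
```

    Run against the preview, this flags only 10.0.0.3, the neighbor stuck in Active; 10.0.0.2 is skipped because its last column is a prefix count.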

    All Runbooks

    14 entries
    Arista

    Arista EOS High CPU Troubleshooting

    High CPU on an Arista EOS switch can drop BGP sessions, stall OSPF convergence, and make the CLI unusable. This guide walks through every major root cause with real commands and fixes.

    arista eos high cpu, arista eos troubleshooting, eos bgp high cpu
    Apr 17
    Traefik

    Traefik Sticky Sessions Not Working

    Sticky sessions in Traefik fail silently and for a surprising number of distinct reasons. This runbook walks through every common root cause with real commands, log output, and concrete fixes.

    traefik sticky sessions, traefik load balancer cookie, traefik session persistence
    Apr 17
    Kubernetes

    Kubernetes Ingress with TLS Setup Guide

    A practical runbook for configuring Kubernetes Ingress with TLS using the NGINX Ingress Controller and cert-manager with Let's Encrypt. Covers installation, ClusterIssuer setup, full working configuration, verification, and the mistakes that waste your afternoon.

    kubernetes ingress tls, cert-manager kubernetes, nginx ingress controller
    Apr 17
    Envoy

    Envoy Retry Policy Not Working

    Envoy retry policies fail silently in ways that are hard to diagnose. This guide walks through every common root cause — wrong retry conditions, exhausted budgets, per-try timeouts, mismatched status codes, and missing host selection plugins — with real commands and fixes.

    envoy retry policy, envoy retry not working, envoy retry_on misconfiguration
    Apr 17
    Juniper

    Juniper Chassis Cluster Failover Issues

    A practical troubleshooting guide for Juniper SRX chassis cluster failover failures, covering fabric link problems, split brain, priority misconfiguration, missing preemption, and interface monitoring gaps.

    juniper chassis cluster failover, SRX split brain, chassis cluster troubleshooting
    Apr 16
    Fortinet

    FortiGate SD-WAN Not Routing Correctly

    A hands-on troubleshooting guide for FortiGate SD-WAN routing failures, covering performance SLA configuration, rule priority, health check probes, interface membership, and bandwidth measurement issues with real CLI commands and fixes.

    FortiGate SD-WAN, SD-WAN routing issues, FortiGate troubleshooting
    Apr 16
    CI/CD

    Kubernetes Helm Deployment Failing

    A practical troubleshooting guide for Kubernetes Helm deployment failures, covering values mismatches, chart version conflicts, missing CRDs, RBAC errors, and rollout timeouts, with real CLI commands and fixes.

    kubernetes helm deployment failing, helm upgrade failed, helm installation failed
    Apr 16
    Logging

    Loki Ingestion Issues

    Diagnose and fix the most common Loki log ingestion failures, from Promtail not running to rate limit errors, out-of-order entries, and label mismatches.

    loki, promtail, log ingestion
    Apr 16
    Monitoring

    Prometheus High Cardinality Issues

    High cardinality in Prometheus causes memory exhaustion, query timeouts, and cascading monitoring failures. This runbook covers every root cause with real commands and actionable fixes.

    prometheus high cardinality, prometheus memory pressure, prometheus query timeout
    Apr 16
    Databases

    MySQL Deadlock Analysis

    A hands-on guide to diagnosing and fixing MySQL deadlocks using real InnoDB output, covering circular lock dependencies, missing indexes, long-running transactions, explicit table locks, and autocommit pitfalls.

    mysql deadlock, innodb deadlock, mysql lock wait timeout
    Apr 16
    Security

    WAF False Positive Troubleshooting

    A hands-on guide to diagnosing and fixing WAF false positives in ModSecurity and OWASP CRS — covering SQL injection rule triggers, file upload blocks, overly broad signatures, missing exceptions, and paranoia level tuning.

    WAF false positive, ModSecurity, OWASP CRS
    Apr 16
    F5

    F5 SSL Profile Client and Server Setup

    A practical, senior-engineer walkthrough of configuring F5 BIG-IP Client SSL and Server SSL profiles — from certificate import through verification and common production mistakes.

    F5 BIG-IP SSL profile, client SSL profile, server SSL profile
    Apr 16
    Kubernetes

    Kubernetes HPA Not Scaling

    Is your Kubernetes HPA failing to scale pods even under load? This runbook covers the most common root causes, from a missing metrics server to misconfigured targets, with real CLI commands and fixes.

    kubernetes hpa not scaling, hpa unknown metrics, metrics server kubernetes
    Apr 16
    Cloud

    AWS Auto Scaling Not Triggering

    Diagnose and fix AWS Auto Scaling groups that refuse to scale out, covering CloudWatch alarm failures, policy misconfigurations, capacity limits, launch template errors, and health check issues with real CLI commands.

    aws auto scaling not triggering, auto scaling group not scaling, cloudwatch alarm not firing
    Apr 16

    For NOC & DevOps Teams

    Standardize your infra operations

    Production-tested runbooks for every incident. Stop guessing, start executing.