Symptoms
When AWS NAT Gateway failures occur, the symptoms typically manifest on EC2 instances running in private subnets that depend on the NAT Gateway for outbound internet access. You may observe one or more of the following:
- EC2 instances in private subnets cannot reach the internet or public AWS service endpoints
- Package installs hang or time out when yum, apt, or pip tries to reach external mirrors
- Application logs show connection timed out or no route to host errors for outbound API calls
- AWS Systems Manager Session Manager fails to connect because the SSM endpoint is unreachable
- CloudWatch agent stops shipping metrics — the agent loses its HTTPS path to the CloudWatch regional endpoint
- curl or wget from a private instance returns:
curl: (6) Could not resolve host
or
curl: (28) Connection timed out after 5000 milliseconds
- VPC Flow Logs show REJECT on outbound traffic or packets that never leave the subnet interface
- The AWS Console shows the NAT Gateway state as failed, deleting, or the Elastic IP field is blank
These symptoms are commonly mistaken for DNS failures, security group misconfigurations, or application-layer bugs. The first diagnostic step is always to isolate whether the private instance can reach a known-good IP address directly — if that also times out, the problem almost certainly lies in routing or the NAT Gateway itself rather than in the application.
Root Cause 1: Route Table Missing NAT Gateway Entry
Why It Happens
Every private subnet must have a route table entry that sends non-VPC-local traffic (0.0.0.0/0) to the NAT Gateway. This entry is not added automatically when a NAT Gateway is created — AWS provisions the gateway but leaves route table association entirely to the operator. Engineers frequently create a NAT Gateway, confirm it reaches Available state, and then forget to update the route table of each private subnet that requires outbound access. The same failure pattern occurs after a Terraform or CloudFormation refactor that deletes and recreates the route table: the new table starts empty, the NAT Gateway association is lost, and instances go dark with no AWS-side error raised.
How to Identify It
Check the route table associated with the private subnet using its subnet ID:
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=subnet-0a1b2c3d4e5f6a7b8" \
--query "RouteTables[*].Routes" \
--output table

A broken route table shows only the local VPC route — no NAT Gateway entry:
------------------------------------------------------------
|                   DescribeRouteTables                  |
+----------------------+-------------+-------------------+
| DestinationCidrBlock |  GatewayId  |       State       |
+----------------------+-------------+-------------------+
| 10.0.0.0/16          |  local      |  active           |
+----------------------+-------------+-------------------+

A correctly configured table includes the NAT Gateway as the default route target:
------------------------------------------------------------
|                   DescribeRouteTables                  |
+----------------------+------------------+---------------+
| DestinationCidrBlock |   NatGatewayId   |     State     |
+----------------------+------------------+---------------+
| 10.0.0.0/16          |  local           |  active       |
| 0.0.0.0/0            |  nat-0a1b2c3d4e  |  active       |
+----------------------+------------------+---------------+

How to Fix It
Add the default route pointing at your NAT Gateway:
aws ec2 create-route \
--route-table-id rtb-0a1b2c3d4e5f6a7b8 \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-0a1b2c3d4e5f6a7b8

Verify the route was added successfully:
aws ec2 describe-route-tables \
--route-table-ids rtb-0a1b2c3d4e5f6a7b8 \
--query "RouteTables[*].Routes[?DestinationCidrBlock=='0.0.0.0/0']"

If you manage infrastructure as code, ensure the Terraform aws_route resource or the CloudFormation route definition explicitly references the NAT Gateway ID and is associated with every private subnet route table that needs outbound internet access.
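The missing-route check can also be scripted against the describe-route-tables JSON, which is useful in a CI guardrail. A minimal Python sketch; the has_default_nat_route helper and the sample payloads are illustrative, shaped like the CLI output above rather than captured live:

```python
def has_default_nat_route(route_tables):
    """True if any route table sends 0.0.0.0/0 to an active NAT Gateway."""
    for table in route_tables:
        for route in table.get("Routes", []):
            if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                    and route.get("NatGatewayId", "").startswith("nat-")
                    and route.get("State") == "active"):
                return True
    return False

# Payloads shaped like `aws ec2 describe-route-tables` output
broken = {"RouteTables": [{"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
]}]}
fixed = {"RouteTables": [{"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0a1b2c3d4e", "State": "active"},
]}]}

print(has_default_nat_route(broken["RouteTables"]))  # False
print(has_default_nat_route(fixed["RouteTables"]))   # True
```

Wiring a check like this into the pipeline that applies Terraform changes catches the recreated-empty-route-table failure before instances go dark.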
Root Cause 2: NAT Gateway Created in the Wrong Subnet
Why It Happens
A NAT Gateway must reside in a public subnet — one that has a route to an Internet Gateway (IGW). When engineers provision a NAT Gateway under time pressure or while following an ambiguous runbook, they sometimes place it in the private subnet that requires outbound access. The NAT Gateway reaches Available state regardless of where it is placed; AWS does not validate or prevent this misconfiguration at creation time. Traffic from private instances correctly routes to the NAT Gateway, but the NAT Gateway itself has no path to the internet, so all outbound packets are silently dropped at the subnet boundary.
How to Identify It
Find the subnet in which your NAT Gateway was created:
aws ec2 describe-nat-gateways \
--nat-gateway-ids nat-0a1b2c3d4e5f6a7b8 \
--query "NatGateways[*].{ID:NatGatewayId,Subnet:SubnetId,State:State}"

Output:
[
{
"ID": "nat-0a1b2c3d4e5f6a7b8",
"Subnet": "subnet-0private111222333",
"State": "available"
}
]

Now verify whether that subnet has a route to an Internet Gateway:
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=subnet-0private111222333" \
--query "RouteTables[*].Routes[?GatewayId && GatewayId!='local']"

If the result is empty or returns no IGW entry, the NAT Gateway is sitting in a private subnet with no internet path. That is the root cause.
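The public-versus-private distinction reduces to a single predicate over the subnet's routes. A hedged Python sketch; subnet_is_public is a hypothetical helper, and the route dictionaries mirror the CLI output shape:

```python
def subnet_is_public(routes):
    """A subnet is public if it has an active default route to an Internet Gateway."""
    return any(
        r.get("DestinationCidrBlock") == "0.0.0.0/0"
        and r.get("GatewayId", "").startswith("igw-")
        and r.get("State") == "active"
        for r in routes
    )

# Route lists shaped like RouteTables[*].Routes from the query above
private_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
]
public_routes = private_routes + [
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0a1b2c3d4e5f", "State": "active"},
]

print(subnet_is_public(private_routes))  # False: a NAT GW here is misplaced
print(subnet_is_public(public_routes))   # True: a valid home for a NAT GW
```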
How to Fix It
You cannot relocate a NAT Gateway between subnets after creation. You must create a replacement in the correct public subnet and update your route tables to point to it:
# Step 1 — Get the EIP allocation ID from the existing NAT Gateway
aws ec2 describe-nat-gateways \
--nat-gateway-ids nat-0a1b2c3d4e5f6a7b8 \
--query "NatGateways[*].NatGatewayAddresses[*].AllocationId"
# Step 2 — Create a new NAT Gateway in the correct PUBLIC subnet
aws ec2 create-nat-gateway \
--subnet-id subnet-0public444555666 \
--allocation-id eipalloc-0a1b2c3d4e5f6a7b8
# Step 3 — Update private subnet route tables to the new NAT GW
aws ec2 replace-route \
--route-table-id rtb-0a1b2c3d4e5f6a7b8 \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-NEWGATEWAYID
# Step 4 — Delete the misplaced NAT Gateway
aws ec2 delete-nat-gateway \
--nat-gateway-id nat-0a1b2c3d4e5f6a7b8

Root Cause 3: Elastic IP Not Associated
Why It Happens
A NAT Gateway requires an Elastic IP address (EIP) to translate private RFC 1918 source addresses to a routable public IP for outbound internet traffic. The EIP can go missing in several ways: the allocation was released externally (for example, by an operations engineer running a cleanup script against untagged EIPs), the NAT Gateway creation failed partway through, or the EIP was forcibly disassociated during an incident response. In any of these cases the NAT Gateway appears in the console without a public IP and is unable to forward traffic to the internet. AWS marks the gateway as failed in this scenario, though the failure is not always immediately visible to operators who only check the gateway state at a high level.
How to Identify It
aws ec2 describe-nat-gateways \
--nat-gateway-ids nat-0a1b2c3d4e5f6a7b8 \
--query "NatGateways[*].{State:State,EIP:NatGatewayAddresses[0].PublicIp,AllocID:NatGatewayAddresses[0].AllocationId}"

A gateway with a missing EIP returns null for both fields and shows a failed state:
[
{
"State": "failed",
"EIP": null,
"AllocID": null
}
]

Separately audit your account's EIPs to confirm whether the allocation still exists:
aws ec2 describe-addresses \
--filters "Name=domain,Values=vpc" \
--query "Addresses[*].{AllocationId:AllocationId,PublicIP:PublicIp,AssociationId:AssociationId}"

How to Fix It
A NAT Gateway in failed state cannot be repaired in place — it must be deleted and recreated with a new EIP allocation:
# Allocate a fresh EIP in the VPC domain
aws ec2 allocate-address --domain vpc
# Expected output:
# {
# "PublicIp": "52.x.x.x",
# "AllocationId": "eipalloc-0newalloc1234",
# "Domain": "vpc"
# }
# Create a replacement NAT Gateway in the correct public subnet
aws ec2 create-nat-gateway \
--subnet-id subnet-0public444555666 \
--allocation-id eipalloc-0newalloc1234
# Tag for accountability and cleanup protection
aws ec2 create-tags \
--resources nat-NEWGATEWAYID eipalloc-0newalloc1234 \
--tags Key=Name,Value=nat-gw-prod \
Key=ManagedBy,Value=infrarunbook-admin \
Key=Environment,Value=production

Root Cause 4: Private Subnet Routing Wrong
Why It Happens
Even when the NAT Gateway exists in the correct public subnet with a valid EIP, private instances lose internet access if their subnet's route table points the default route at the wrong target. This misconfiguration takes several forms: the default route pointing at an Internet Gateway instead of the NAT Gateway (which works for public subnets but exposes private instances directly to the internet or simply fails when the instance has no public IP), the default route pointing to a Transit Gateway that has no internet breakout configured, or — most subtly — a pair of more-specific prefix routes (0.0.0.0/1 and 128.0.0.0/1) injected by a VPN or SD-WAN appliance that shadow and override the 0.0.0.0/0 NAT Gateway route.
How to Identify It
List all routes in the private subnet's route table and inspect what the default route target is:
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=subnet-0private111222333" \
--query "RouteTables[*].Routes" \
--output json

Broken output — default route incorrectly targets the Internet Gateway:
[
[
{
"DestinationCidrBlock": "10.0.0.0/16",
"GatewayId": "local",
"State": "active"
},
{
"DestinationCidrBlock": "0.0.0.0/0",
"GatewayId": "igw-0a1b2c3d4e5f",
"State": "active"
}
]
]

Split-tunnel VPN routes that silently override the NAT path:
[
{ "DestinationCidrBlock": "0.0.0.0/1", "TransitGatewayId": "tgw-0a1b2c3d" },
{ "DestinationCidrBlock": "128.0.0.0/1", "TransitGatewayId": "tgw-0a1b2c3d" },
{ "DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0a1b2c3d" }
]

In the split-tunnel case the /1 prefixes are more specific and win over /0 — the NAT Gateway route is effectively dead for all internet-bound traffic even though it appears valid.
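The shadowing follows from longest-prefix-match route selection, and it can be reproduced offline with Python's standard ipaddress module. This is a sketch of the selection rule, not the VPC router's actual implementation, and the route targets are placeholders:

```python
import ipaddress

routes = {
    "0.0.0.0/1":   "tgw-0a1b2c3d",   # VPN split-tunnel route
    "128.0.0.0/1": "tgw-0a1b2c3d",   # VPN split-tunnel route
    "0.0.0.0/0":   "nat-0a1b2c3d",   # intended NAT Gateway default
}

def select_route(dest_ip, routes):
    """Longest-prefix match: the most specific matching CIDR wins."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [ipaddress.ip_network(cidr) for cidr in routes
               if ip in ipaddress.ip_network(cidr)]
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[str(best)]

print(select_route("52.94.0.1", routes))    # tgw-0a1b2c3d: /1 shadows the NAT route
print(select_route("203.0.113.9", routes))  # tgw-0a1b2c3d: every internet IP hits a /1
```

Because 0.0.0.0/1 and 128.0.0.0/1 together cover the entire IPv4 space at a longer prefix than /0, no destination ever falls through to the NAT Gateway route.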
How to Fix It
# Replace an incorrect IGW target with the NAT Gateway
aws ec2 replace-route \
--route-table-id rtb-0a1b2c3d4e5f6a7b8 \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-0a1b2c3d4e5f6a7b8
# If VPN /1 routes are not intentional, delete them
aws ec2 delete-route \
--route-table-id rtb-0a1b2c3d4e5f6a7b8 \
--destination-cidr-block 0.0.0.0/1
aws ec2 delete-route \
--route-table-id rtb-0a1b2c3d4e5f6a7b8 \
--destination-cidr-block 128.0.0.0/1
# If the /1 routes are required for VPN split-tunnel,
# ensure the Transit Gateway attachment has a configured
# internet breakout (an egress VPC with its own NAT GW).

Root Cause 5: Bandwidth Limit Hit
Why It Happens
AWS NAT Gateways scale automatically up to 100 Gbps of aggregate bandwidth per gateway, but they enforce a hard ceiling of 55,000 simultaneous connections per unique destination IP and port combination. Additionally, the underlying packet-per-second (PPS) capacity can be saturated during sudden traffic bursts — for example, when hundreds of instances begin pulling software packages simultaneously at deployment time, or when a data pipeline fan-out floods a single NAT Gateway with concurrent S3 or external API connections. Unlike a routing failure, a bandwidth-limited NAT Gateway does not return an error to the source — it queues and then drops packets, producing application-level timeouts that are indistinguishable from a connectivity outage at the network layer.
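Quick arithmetic shows how few instances it takes to hit the per-destination ceiling when all of them share one NAT Gateway. Illustrative Python; the per-instance connection counts are hypothetical workloads:

```python
CONN_LIMIT_PER_DEST = 55_000  # simultaneous connections per destination IP:port

def instances_to_exhaust(conns_per_instance):
    """How many instances saturate one destination through a single NAT GW."""
    return -(-CONN_LIMIT_PER_DEST // conns_per_instance)  # ceiling division

# An aggressive HTTP connection pool of 500 sockets per instance:
print(instances_to_exhaust(500))    # 110 instances
# A batch job opening 5,000 concurrent S3 downloads per instance:
print(instances_to_exhaust(5_000))  # 11 instances
```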
How to Identify It
Query the ErrorPortAllocation CloudWatch metric — any non-zero value indicates the NAT Gateway ran out of source ports for a destination:
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name ErrorPortAllocation \
--dimensions Name=NatGatewayId,Value=nat-0a1b2c3d4e5f6a7b8 \
--start-time 2026-04-06T00:00:00Z \
--end-time 2026-04-06T23:59:59Z \
--period 300 \
--statistics Sum \
--output table

Check for packets being dropped by the gateway:
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name PacketsDropCount \
--dimensions Name=NatGatewayId,Value=nat-0a1b2c3d4e5f6a7b8 \
--start-time 2026-04-06T00:00:00Z \
--end-time 2026-04-06T23:59:59Z \
--period 60 \
--statistics Sum

Inspect raw throughput to understand whether you are near the bandwidth ceiling:
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name BytesOutToDestination \
--dimensions Name=NatGatewayId,Value=nat-0a1b2c3d4e5f6a7b8 \
--start-time 2026-04-06T00:00:00Z \
--end-time 2026-04-06T23:59:59Z \
--period 60 \
--statistics Sum

How to Fix It
The correct fix is to distribute outbound traffic across multiple NAT Gateways — one per Availability Zone. This both removes the single point of failure and multiplies total available bandwidth and connection capacity:
# Create a NAT Gateway in each AZ's public subnet
aws ec2 create-nat-gateway \
--subnet-id subnet-public-us-east-1a \
--allocation-id eipalloc-az1id
aws ec2 create-nat-gateway \
--subnet-id subnet-public-us-east-1b \
--allocation-id eipalloc-az2id
aws ec2 create-nat-gateway \
--subnet-id subnet-public-us-east-1c \
--allocation-id eipalloc-az3id
# Update each AZ's private subnet route table to use its local NAT GW
aws ec2 replace-route \
--route-table-id rtb-private-us-east-1a \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-az1id
aws ec2 replace-route \
--route-table-id rtb-private-us-east-1b \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-az2id
aws ec2 replace-route \
--route-table-id rtb-private-us-east-1c \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-az3id

For workloads that make large volumes of connections to AWS service endpoints (S3, DynamoDB, SSM, CloudWatch), deploy VPC Endpoints to bypass the NAT Gateway entirely for that traffic. Gateway endpoints for S3 and DynamoDB are free and eliminate both port exhaustion risk and data processing costs for those destinations.
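Per-AZ route updates like the ones above are repetitive and easy to mistype by hand. One option is to generate the replace-route commands from a single mapping; a sketch, where the AZ names and resource IDs are placeholders:

```python
# Hypothetical AZ -> (private route table, zone-local NAT Gateway) mapping.
az_map = {
    "us-east-1a": ("rtb-private-us-east-1a", "nat-az1id"),
    "us-east-1b": ("rtb-private-us-east-1b", "nat-az2id"),
    "us-east-1c": ("rtb-private-us-east-1c", "nat-az3id"),
}

def replace_route_commands(az_map):
    """Emit one replace-route command per AZ, keeping traffic zone-local."""
    return [
        f"aws ec2 replace-route --route-table-id {rtb} "
        f"--destination-cidr-block 0.0.0.0/0 --nat-gateway-id {nat}"
        for rtb, nat in az_map.values()
    ]

for cmd in replace_route_commands(az_map):
    print(cmd)
```

Keeping the mapping in one place makes it obvious when a private subnet's route table points at a NAT Gateway in a different AZ.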
Root Cause 6: NAT Gateway in Failed or Deleting State
Why It Happens
NAT Gateways can enter a failed state due to internal AWS infrastructure issues, EIP allocation conflicts at creation time, or subnet IP address exhaustion. They can also become stuck in a deleting state when active connections prevent the gateway from draining, or when a dependent resource (such as a route table still referencing the gateway) blocks cleanup. Instances continue sending traffic to a gateway ENI that is no longer forwarding, resulting in complete outbound connectivity loss with no self-healing behavior.
How to Identify It
aws ec2 describe-nat-gateways \
--filter "Name=vpc-id,Values=vpc-0a1b2c3d4e5f" \
--query "NatGateways[*].{ID:NatGatewayId,State:State,Code:FailureCode,Message:FailureMessage}" \
--output table

Example failed gateway output showing the specific failure reason:
-------------------------------------------------------------------------------------------
|                                   DescribeNatGateways                                  |
+---------------------+---------+------------------------+-------------------------------+
|         ID          |  State  |          Code          |            Message            |
+---------------------+---------+------------------------+-------------------------------+
|  nat-0a1b2c3d4e5f   | failed  | InsufficientFreeAddrs  |  Subnet has insufficient      |
|                     |         |                        |  free addresses.              |
+---------------------+---------+------------------------+-------------------------------+

How to Fix It
A failed NAT Gateway cannot be repaired — it must be replaced. If the failure reason is InsufficientFreeAddrs, free up IP space in the public subnet before creating the replacement:
# Check available IPs remaining in the public subnet
aws ec2 describe-subnets \
--subnet-ids subnet-0public444555666 \
--query "Subnets[*].AvailableIpAddressCount"
# Delete the failed gateway
aws ec2 delete-nat-gateway --nat-gateway-id nat-0a1b2c3d4e5f
# Wait for deletion to complete (state transitions to deleted)
aws ec2 wait nat-gateway-deleted \
--nat-gateway-ids nat-0a1b2c3d4e5f
# Create replacement once subnet capacity is confirmed
aws ec2 create-nat-gateway \
--subnet-id subnet-0public444555666 \
--allocation-id eipalloc-0newalloc1234

Root Cause 7: Network ACL Blocking Return Traffic
Why It Happens
Network ACLs (NACLs) are stateless — unlike security groups, they evaluate inbound and outbound rules independently for every packet, with no connection tracking. A common misconfiguration is adding an outbound rule on the public subnet's NACL that allows traffic to the internet while omitting an inbound rule to allow the ephemeral port range (1024–65535) for TCP and UDP return traffic. The NAT Gateway successfully forwards the outbound SYN packet, the remote server responds, but the inbound ACK is dropped at the public subnet NACL before it reaches the NAT Gateway for translation back to the source instance. The session appears to hang indefinitely.

How to Identify It
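Before running the CLI check below, it helps to see exactly why statelessness bites. The sketch that follows is a toy model of inbound NACL rule processing, not AWS's implementation, and the rule tuples are illustrative:

```python
def nacl_allows(rules, port):
    """Evaluate stateless inbound rules in ascending rule-number order;
    the first matching rule wins, and the implicit '*' rule denies the rest."""
    for _, action, port_from, port_to in sorted(rules):
        if port_from <= port <= port_to:
            return action == "allow"
    return False  # implicit final DENY

# Broken inbound rule set: only well-known service ports are allowed.
inbound = [(100, "allow", 443, 443), (200, "allow", 80, 80)]

# The outbound HTTPS request leaves fine, but the server's reply arrives
# addressed to an ephemeral source port, e.g. 40512:
print(nacl_allows(inbound, 443))    # True  - inbound HTTPS to the subnet
print(nacl_allows(inbound, 40512))  # False - return traffic dropped, session hangs

# After adding rule 110 allowing ports 1024-65535, the reply gets through:
inbound.append((110, "allow", 1024, 65535))
print(nacl_allows(inbound, 40512))  # True
```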
How to Identify It
# Find the NACL associated with the public subnet where the NAT GW resides
aws ec2 describe-network-acls \
--filters "Name=association.subnet-id,Values=subnet-0public444555666" \
--query "NetworkAcls[*].{ID:NetworkAclId,Entries:Entries}"

A broken NACL showing no inbound rule for ephemeral ports:
Inbound Rules:
Rule 100 | ALLOW | TCP | 0.0.0.0/0 | Port 443
Rule 200 | ALLOW | TCP | 0.0.0.0/0 | Port 80
Rule * | DENY | ALL | 0.0.0.0/0 | ALL
# Missing: inbound ALLOW for TCP/UDP ports 1024-65535 (ephemeral return traffic)

How to Fix It
# Add inbound allow for TCP ephemeral return ports
aws ec2 create-network-acl-entry \
--network-acl-id acl-0a1b2c3d4e5f6a7b \
--rule-number 110 \
--protocol 6 \
--rule-action allow \
--ingress \
--cidr-block 0.0.0.0/0 \
--port-range From=1024,To=65535
# Add inbound allow for UDP ephemeral return ports
aws ec2 create-network-acl-entry \
--network-acl-id acl-0a1b2c3d4e5f6a7b \
--rule-number 120 \
--protocol 17 \
--rule-action allow \
--ingress \
--cidr-block 0.0.0.0/0 \
--port-range From=1024,To=65535

Prevention
The majority of NAT Gateway failures are preventable through disciplined infrastructure design, automated guardrails, and proactive monitoring. Apply the following controls in every production VPC:
- Deploy one NAT Gateway per Availability Zone. Never share a single NAT Gateway across AZs. The cross-AZ failure blast radius is too wide, and cross-AZ data transfer incurs additional costs. Per-AZ deployment also increases aggregate bandwidth and connection capacity linearly.
- Manage all routing through Infrastructure as Code exclusively. Route table associations between private subnets and NAT Gateways must be defined in Terraform or CloudFormation and never modified manually. Manual changes are the primary source of the missing-route failure class. Use a create_before_destroy lifecycle rule on aws_nat_gateway and aws_route resources to ensure zero-downtime replacements.
- Enforce mandatory tagging via AWS Config rules. Untagged NAT Gateways and EIPs are the resources most commonly deleted during cleanup operations. Require Name, Environment, and ManagedBy tags on all NAT Gateway and EIP resources. Configure AWS Config to flag and alert on untagged resources before any deletion scripts run.
- Set CloudWatch alarms on ErrorPortAllocation and PacketsDropCount. A non-zero ErrorPortAllocation value is always actionable. Route these alarms to your on-call channel so bandwidth saturation is detected before it becomes a user-impacting outage.
- Use VPC Endpoints for all AWS service traffic. Gateway endpoints for S3 and DynamoDB are free and eliminate NAT Gateway processing costs and port exhaustion risk for those high-volume destinations. Interface endpoints for other AWS services (SSM, CloudWatch, ECR) reduce both NAT Gateway load and data transfer costs significantly.
- Audit EIPs regularly. Run a weekly check for EIPs in your account that are not associated with any NAT Gateway or EC2 instance. An unassociated EIP either costs money or represents a NAT Gateway that silently moved to a failed state without operator awareness.
- Store NAT Gateway IDs in AWS Systems Manager Parameter Store. CI/CD pipelines and monitoring systems can query these values and alert if a referenced NAT Gateway ID no longer exists in an Available state, catching silent failures immediately.
- Enable VPC Flow Logs on all subnets. Flow logs on the private subnet interface of the NAT Gateway ENI make it possible to distinguish between packets that never left the private subnet (routing failure) versus packets that left but were rejected at the NACL (stateless ACL misconfiguration).
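Several of these controls reduce to scanning describe-nat-gateways output for gateways that are not Available. A minimal parsing sketch; flag_unhealthy is a hypothetical helper and the sample records are fabricated for illustration:

```python
def flag_unhealthy(nat_gateways):
    """Return (id, state, failure message) for every gateway not 'available'."""
    return [
        (gw["NatGatewayId"], gw["State"], gw.get("FailureMessage", ""))
        for gw in nat_gateways
        if gw["State"] != "available"
    ]

# Records shaped like NatGateways[*] from `aws ec2 describe-nat-gateways`
sample = [
    {"NatGatewayId": "nat-healthy01", "State": "available"},
    {"NatGatewayId": "nat-broken02", "State": "failed",
     "FailureMessage": "Subnet has insufficient free addresses."},
]

for gw_id, state, msg in flag_unhealthy(sample):
    print(f"ALERT {gw_id}: {state} {msg}")
```

Run on a schedule and wired to the on-call channel, a check like this catches gateways that silently moved to a failed state without anyone touching the console.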
Frequently Asked Questions
Q: How do I quickly test if outbound internet access is working from a private instance?
A: From the private instance (accessed via SSM Session Manager or a bastion host), run curl -I https://api.solvethenetwork.com --connect-timeout 5. If it times out, the NAT Gateway path is broken. Also test a raw IP like curl -I http://1.1.1.1 --connect-timeout 5 — if that works but DNS names do not, the issue is DNS rather than NAT routing.
Q: Can I use a single NAT Gateway for multiple VPCs?
A: No. A NAT Gateway is scoped to a single VPC. For multi-VPC architectures, each VPC that requires internet egress needs its own NAT Gateways, or you must route traffic through a centralized egress VPC via Transit Gateway where the NAT Gateways reside. The centralized model reduces EIP count and cost but introduces a single point of failure if not deployed across multiple AZs.
Q: What happens to existing TCP sessions when a NAT Gateway is deleted or replaced?
A: All active sessions are immediately terminated with no graceful drain. To minimize disruption, always create the replacement NAT Gateway first, update the route table to point to it, confirm connectivity, then delete the old gateway. Deploy during a maintenance window and notify dependent application teams so they can implement reconnect logic.
Q: Why does my NAT Gateway show Available state but instances still cannot reach the internet?
A: The Available state only confirms the gateway resource itself is healthy — it provides no information about route table configuration, NACL rules, or instance-level security groups. Run through the route table check first (Root Causes 1 and 4), then check NACLs on both the private and public subnets for missing ephemeral port return rules, then verify instance security groups allow outbound traffic on the required ports.
Q: Does a NAT Gateway replace a security group?
A: No. NAT Gateways handle address translation and routing only. Security groups and NACLs remain fully in effect. An instance must have outbound security group rules allowing the required protocols and ports, and the public subnet NACL must include inbound allow rules for ephemeral port return traffic (TCP/UDP 1024–65535).
Q: How much does a NAT Gateway cost, and how can I reduce it?
A: NAT Gateways are billed per hour of existence (approximately $0.045/hour per gateway, region-dependent) plus per GB of data processed ($0.045/GB in us-east-1). Reduce costs by deploying VPC Gateway Endpoints for S3 and DynamoDB (free, bypasses NAT Gateway processing charges), using Interface Endpoints for other frequently-called AWS services, and ensuring per-AZ NAT Gateway deployment so instances do not generate cross-AZ data transfer charges by traversing a gateway in a different AZ.
Q: What is the difference between a NAT Gateway and a NAT instance?
A: A NAT Gateway is a fully managed AWS service — no AMI patching, no instance type management, and automatic scaling. A NAT instance is an EC2 instance running software NAT (iptables MASQUERADE) that requires source/destination check disabled on the ENI. NAT instances are legacy and should only be used if you specifically require port forwarding, custom routing logic, or need to operate in a severely cost-constrained environment where the per-hour gateway cost is prohibitive.
Q: Can I assign multiple Elastic IPs to a single NAT Gateway?
A: Yes. You can associate up to 8 EIPs with a single NAT Gateway. Each additional EIP expands the available source port pool, which helps avoid the 55,000 simultaneous connection limit when many instances are making concurrent connections to the same destination IP and port combination — common in environments hitting external APIs or NTP servers at high frequency.
Q: How do I monitor NAT Gateway health proactively before failures impact users?
A: Create CloudWatch alarms on the AWS/NATGateway namespace: alarm on any non-zero ErrorPortAllocation (port exhaustion), alarm when PacketsDropCount sustains above baseline for two consecutive periods, and alert when ActiveConnectionCount approaches 50,000 per destination. Also alarm on IdleTimeoutCount spikes, which can indicate upstream resource exhaustion causing connections to stall.
Q: My Terraform plan wants to recreate the NAT Gateway — will that cause downtime?
A: Yes, unless you use create_before_destroy = true lifecycle rules on both the aws_nat_gateway and aws_route resources. With that setting Terraform provisions the replacement gateway first, updates the route, then destroys the old gateway. Without it, Terraform destroys the existing gateway first, leaving a gap of several minutes while the replacement gateway initializes and reaches Available state.
Q: Why do I see ErrorPortAllocation errors even with only a few instances running?
A: Port exhaustion can occur with very few instances if each instance opens a large number of concurrent connections to the same destination IP and port tuple. Each NAT Gateway allows 55,000 simultaneous connections per destination. A handful of instances running aggressive connection pools, batch download jobs, or high-frequency polling loops targeting a single endpoint can exhaust this limit rapidly. Mitigate by reducing connection concurrency, implementing connection pooling, adding additional EIPs to the NAT Gateway, or switching to a VPC Endpoint if the destination is an AWS service.
Q: Can VPC Flow Logs help diagnose NAT Gateway problems?
A: Yes, and they are one of the most reliable diagnostic tools available. Enable flow logs on the entire VPC or specifically on the NAT Gateway's ENI. Filter for REJECT actions to identify NACL-blocked traffic. Compare traffic arriving on the private subnet interface of the NAT Gateway against traffic departing on its public interface — if inbound traffic is present but outbound traffic for the same session is absent, the NAT Gateway itself is dropping the packet, which indicates a bandwidth or connection limit issue rather than a routing problem.
