Symptoms
When AWS NAT Gateway failures occur, the symptoms typically manifest on EC2 instances running in private subnets that depend on the NAT Gateway for outbound internet access. You may observe one or more of the following:
- EC2 instances in private subnets cannot reach the internet or public AWS service endpoints
- Package installs hang or time out when yum, apt, or pip tries to reach external mirrors
- Application logs show connection timed out or no route to host errors for outbound API calls
- AWS Systems Manager Session Manager fails to connect because the SSM endpoint is unreachable
- CloudWatch agent stops shipping metrics — the agent loses its HTTPS path to the CloudWatch regional endpoint
- curl or wget from a private instance returns:
curl: (6) Could not resolve host
or
curl: (28) Connection timed out after 5000 milliseconds
- VPC Flow Logs show REJECT on outbound traffic or packets that never leave the subnet interface
- The AWS Console shows the NAT Gateway state as failed, deleting, or the Elastic IP field is blank
These symptoms are commonly mistaken for DNS failures, security group misconfigurations, or application-layer bugs. The first diagnostic step is always to isolate whether the private instance can reach a known-good IP address directly — if that also times out, the problem almost certainly lies in routing or the NAT Gateway itself rather than in the application.
Root Cause 1: Route Table Missing NAT Gateway Entry
Why It Happens
Every private subnet must have a route table entry that sends non-VPC-local traffic (0.0.0.0/0) to the NAT Gateway. This entry is not added automatically when a NAT Gateway is created — AWS provisions the gateway but leaves route table association entirely to the operator. Engineers frequently create a NAT Gateway, confirm it reaches Available state, and then forget to update the route table of each private subnet that requires outbound access. The same failure pattern occurs after a Terraform or CloudFormation refactor that deletes and recreates the route table: the new table starts empty, the NAT Gateway association is lost, and instances go dark with no AWS-side error raised.
How to Identify It
Check the route table associated with the private subnet using its subnet ID:
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=subnet-0a1b2c3d4e5f6a7b8" \
--query "RouteTables[*].Routes" \
--output table

A broken route table shows only the local VPC route — no NAT Gateway entry:
------------------------------------------------------------
|                   DescribeRouteTables                  |
+----------------------+-------------+-------------------+
| DestinationCidrBlock |  GatewayId  |       State       |
+----------------------+-------------+-------------------+
| 10.0.0.0/16          |  local      |  active           |
+----------------------+-------------+-------------------+

A correctly configured table includes the NAT Gateway as the default route target:
------------------------------------------------------------
|                   DescribeRouteTables                  |
+----------------------+------------------+---------------+
| DestinationCidrBlock |   NatGatewayId   |     State     |
+----------------------+------------------+---------------+
| 10.0.0.0/16          |  local           |  active       |
| 0.0.0.0/0            |  nat-0a1b2c3d4e  |  active       |
+----------------------+------------------+---------------+

How to Fix It
Add the default route pointing at your NAT Gateway:
aws ec2 create-route \
--route-table-id rtb-0a1b2c3d4e5f6a7b8 \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-0a1b2c3d4e5f6a7b8

Verify the route was added successfully:
aws ec2 describe-route-tables \
--route-table-ids rtb-0a1b2c3d4e5f6a7b8 \
--query "RouteTables[*].Routes[?DestinationCidrBlock=='0.0.0.0/0']"

If you manage infrastructure as code, ensure the Terraform aws_route resource or the CloudFormation route definition explicitly references the NAT Gateway ID and is associated with every private subnet route table that needs outbound internet access.
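The missing-route check can also be scripted against the describe-route-tables JSON, which is useful in a CI guardrail. A minimal Python sketch; the has_default_nat_route helper and the sample payloads are illustrative, shaped like the CLI output above rather than captured live:

```python
def has_default_nat_route(route_tables):
    """True if any route table sends 0.0.0.0/0 to an active NAT Gateway."""
    for table in route_tables:
        for route in table.get("Routes", []):
            if (route.get("DestinationCidrBlock") == "0.0.0.0/0"
                    and route.get("NatGatewayId", "").startswith("nat-")
                    and route.get("State") == "active"):
                return True
    return False

# Payloads shaped like `aws ec2 describe-route-tables` output
broken = {"RouteTables": [{"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
]}]}
fixed = {"RouteTables": [{"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0a1b2c3d4e", "State": "active"},
]}]}

print(has_default_nat_route(broken["RouteTables"]))  # False
print(has_default_nat_route(fixed["RouteTables"]))   # True
```

Wiring a check like this into the pipeline that applies Terraform changes catches the recreated-empty-route-table failure before instances go dark.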
Root Cause 2: NAT Gateway Created in the Wrong Subnet
Why It Happens
A NAT Gateway must reside in a public subnet — one that has a route to an Internet Gateway (IGW). When engineers provision a NAT Gateway under time pressure or while following an ambiguous runbook, they sometimes place it in the private subnet that requires outbound access. The NAT Gateway reaches Available state regardless of where it is placed; AWS does not validate or prevent this misconfiguration at creation time. Traffic from private instances correctly routes to the NAT Gateway, but the NAT Gateway itself has no path to the internet, so all outbound packets are silently dropped at the subnet boundary.
How to Identify It
Find the subnet in which your NAT Gateway was created:
aws ec2 describe-nat-gateways \
--nat-gateway-ids nat-0a1b2c3d4e5f6a7b8 \
--query "NatGateways[*].{ID:NatGatewayId,Subnet:SubnetId,State:State}"

Output:
[
{
"ID": "nat-0a1b2c3d4e5f6a7b8",
"Subnet": "subnet-0private111222333",
"State": "available"
}
]

Now verify whether that subnet has a route to an Internet Gateway:
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=subnet-0private111222333" \
--query "RouteTables[*].Routes[?GatewayId && GatewayId!='local']"

If the result is empty or returns no IGW entry, the NAT Gateway is sitting in a private subnet with no internet path. That is the root cause.
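The public-versus-private distinction reduces to a single predicate over the subnet's routes. A hedged Python sketch; subnet_is_public is a hypothetical helper, and the route dictionaries mirror the CLI output shape:

```python
def subnet_is_public(routes):
    """A subnet is public if it has an active default route to an Internet Gateway."""
    return any(
        r.get("DestinationCidrBlock") == "0.0.0.0/0"
        and r.get("GatewayId", "").startswith("igw-")
        and r.get("State") == "active"
        for r in routes
    )

# Route lists shaped like RouteTables[*].Routes from the query above
private_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
]
public_routes = private_routes + [
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0a1b2c3d4e5f", "State": "active"},
]

print(subnet_is_public(private_routes))  # False: a NAT GW here is misplaced
print(subnet_is_public(public_routes))   # True: a valid home for a NAT GW
```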
How to Fix It
You cannot relocate a NAT Gateway between subnets after creation. You must create a replacement in the correct public subnet and update your route tables to point to it:
# Step 1 — Get the EIP allocation ID from the existing NAT Gateway
aws ec2 describe-nat-gateways \
--nat-gateway-ids nat-0a1b2c3d4e5f6a7b8 \
--query "NatGateways[*].NatGatewayAddresses[*].AllocationId"
# Step 2 — Create a new NAT Gateway in the correct PUBLIC subnet
aws ec2 create-nat-gateway \
--subnet-id subnet-0public444555666 \
--allocation-id eipalloc-0a1b2c3d4e5f6a7b8
# Step 3 — Update private subnet route tables to the new NAT GW
aws ec2 replace-route \
--route-table-id rtb-0a1b2c3d4e5f6a7b8 \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-NEWGATEWAYID
# Step 4 — Delete the misplaced NAT Gateway
aws ec2 delete-nat-gateway \
--nat-gateway-id nat-0a1b2c3d4e5f6a7b8

Root Cause 3: Elastic IP Not Associated
Why It Happens
A NAT Gateway requires an Elastic IP address (EIP) to translate private RFC 1918 source addresses to a routable public IP for outbound internet traffic. The EIP can go missing in several ways: the allocation was released externally (for example, by an operations engineer running a cleanup script against untagged EIPs), the NAT Gateway creation failed partway through, or the EIP was forcibly disassociated during an incident response. In any of these cases the NAT Gateway appears in the console without a public IP and is unable to forward traffic to the internet. AWS marks the gateway as failed in this scenario, though the failure is not always immediately visible to operators who only check the gateway state at a high level.
How to Identify It
aws ec2 describe-nat-gateways \
--nat-gateway-ids nat-0a1b2c3d4e5f6a7b8 \
--query "NatGateways[*].{State:State,EIP:NatGatewayAddresses[0].PublicIp,AllocID:NatGatewayAddresses[0].AllocationId}"

A gateway with a missing EIP returns null for both fields and shows a failed state:
[
{
"State": "failed",
"EIP": null,
"AllocID": null
}
]

Separately audit your account's EIPs to confirm whether the allocation still exists:
aws ec2 describe-addresses \
--filters "Name=domain,Values=vpc" \
--query "Addresses[*].{AllocationId:AllocationId,PublicIP:PublicIp,AssociationId:AssociationId}"

How to Fix It
A NAT Gateway in failed state cannot be repaired in place — it must be deleted and recreated with a new EIP allocation:
# Allocate a fresh EIP in the VPC domain
aws ec2 allocate-address --domain vpc
# Expected output:
# {
# "PublicIp": "52.x.x.x",
# "AllocationId": "eipalloc-0newalloc1234",
# "Domain": "vpc"
# }
# Create a replacement NAT Gateway in the correct public subnet
aws ec2 create-nat-gateway \
--subnet-id subnet-0public444555666 \
--allocation-id eipalloc-0newalloc1234
# Tag for accountability and cleanup protection
aws ec2 create-tags \
--resources nat-NEWGATEWAYID eipalloc-0newalloc1234 \
--tags Key=Name,Value=nat-gw-prod \
Key=ManagedBy,Value=infrarunbook-admin \
Key=Environment,Value=production

Root Cause 4: Private Subnet Routing Wrong
Why It Happens
Even when the NAT Gateway exists in the correct public subnet with a valid EIP, private instances lose internet access if their subnet's route table points the default route at the wrong target. This misconfiguration takes several forms: the default route pointing at an Internet Gateway instead of the NAT Gateway (which works for public subnets but exposes private instances directly to the internet or simply fails when the instance has no public IP), the default route pointing to a Transit Gateway that has no internet breakout configured, or — most subtly — a pair of more-specific prefix routes (0.0.0.0/1 and 128.0.0.0/1) injected by a VPN or SD-WAN appliance that shadow and override the 0.0.0.0/0 NAT Gateway route.
How to Identify It
List all routes in the private subnet's route table and inspect what the default route target is:
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=subnet-0private111222333" \
--query "RouteTables[*].Routes" \
--output json

Broken output — default route incorrectly targets the Internet Gateway:
[
[
{
"DestinationCidrBlock": "10.0.0.0/16",
"GatewayId": "local",
"State": "active"
},
{
"DestinationCidrBlock": "0.0.0.0/0",
"GatewayId": "igw-0a1b2c3d4e5f",
"State": "active"
}
]
]

Split-tunnel VPN routes that silently override the NAT path:
[
{ "DestinationCidrBlock": "0.0.0.0/1", "TransitGatewayId": "tgw-0a1b2c3d" },
{ "DestinationCidrBlock": "128.0.0.0/1", "TransitGatewayId": "tgw-0a1b2c3d" },
{ "DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0a1b2c3d" }
]

In the split-tunnel case the /1 prefixes are more specific and win over /0 — the NAT Gateway route is effectively dead for all internet-bound traffic even though it appears valid.
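The shadowing follows from longest-prefix-match route selection, and it can be reproduced offline with Python's standard ipaddress module. This is a sketch of the selection rule, not the VPC router's actual implementation, and the route targets are placeholders:

```python
import ipaddress

routes = {
    "0.0.0.0/1":   "tgw-0a1b2c3d",   # VPN split-tunnel route
    "128.0.0.0/1": "tgw-0a1b2c3d",   # VPN split-tunnel route
    "0.0.0.0/0":   "nat-0a1b2c3d",   # intended NAT Gateway default
}

def select_route(dest_ip, routes):
    """Longest-prefix match: the most specific matching CIDR wins."""
    ip = ipaddress.ip_address(dest_ip)
    matches = [ipaddress.ip_network(cidr) for cidr in routes
               if ip in ipaddress.ip_network(cidr)]
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[str(best)]

print(select_route("52.94.0.1", routes))    # tgw-0a1b2c3d: /1 shadows the NAT route
print(select_route("203.0.113.9", routes))  # tgw-0a1b2c3d: every internet IP hits a /1
```

Because 0.0.0.0/1 and 128.0.0.0/1 together cover the entire IPv4 space at a longer prefix than /0, no destination ever falls through to the NAT Gateway route.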
How to Fix It
# Replace an incorrect IGW target with the NAT Gateway
aws ec2 replace-route \
--route-table-id rtb-0a1b2c3d4e5f6a7b8 \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-0a1b2c3d4e5f6a7b8
# If VPN /1 routes are not intentional, delete them
aws ec2 delete-route \
--route-table-id rtb-0a1b2c3d4e5f6a7b8 \
--destination-cidr-block 0.0.0.0/1
aws ec2 delete-route \
--route-table-id rtb-0a1b2c3d4e5f6a7b8 \
--destination-cidr-block 128.0.0.0/1
# If the /1 routes are required for VPN split-tunnel,
# ensure the Transit Gateway attachment has a configured
# internet breakout (an egress VPC with its own NAT GW).

Root Cause 5: Bandwidth Limit Hit
Why It Happens
AWS NAT Gateways scale automatically up to 100 Gbps of aggregate bandwidth per gateway, but they enforce a hard ceiling of 55,000 simultaneous connections per unique destination IP and port combination. Additionally, the underlying packet-per-second (PPS) capacity can be saturated during sudden traffic bursts — for example, when hundreds of instances begin pulling software packages simultaneously at deployment time, or when a data pipeline fan-out floods a single NAT Gateway with concurrent S3 or external API connections. Unlike a routing failure, a bandwidth-limited NAT Gateway does not return an error to the source — it queues and then drops packets, producing application-level timeouts that are indistinguishable from a connectivity outage at the network layer.
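Quick arithmetic shows how few instances it takes to hit the per-destination ceiling when all of them share one NAT Gateway. Illustrative Python; the per-instance connection counts are hypothetical workloads:

```python
CONN_LIMIT_PER_DEST = 55_000  # simultaneous connections per destination IP:port

def instances_to_exhaust(conns_per_instance):
    """How many instances saturate one destination through a single NAT GW."""
    return -(-CONN_LIMIT_PER_DEST // conns_per_instance)  # ceiling division

# An aggressive HTTP connection pool of 500 sockets per instance:
print(instances_to_exhaust(500))    # 110 instances
# A batch job opening 5,000 concurrent S3 downloads per instance:
print(instances_to_exhaust(5_000))  # 11 instances
```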
How to Identify It
Query the ErrorPortAllocation CloudWatch metric — any non-zero value indicates the NAT Gateway ran out of source ports for a destination:
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name ErrorPortAllocation \
--dimensions Name=NatGatewayId,Value=nat-0a1b2c3d4e5f6a7b8 \
--start-time 2026-04-06T00:00:00Z \
--end-time 2026-04-06T23:59:59Z \
--period 300 \
--statistics Sum \
--output table

Check for packets being dropped by the gateway:
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name PacketsDropCount \
--dimensions Name=NatGatewayId,Value=nat-0a1b2c3d4e5f6a7b8 \
--start-time 2026-04-06T00:00:00Z \
--end-time 2026-04-06T23:59:59Z \
--period 60 \
--statistics Sum

Inspect raw throughput to understand whether you are near the bandwidth ceiling:
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name BytesOutToDestination \
--dimensions Name=NatGatewayId,Value=nat-0a1b2c3d4e5f6a7b8 \
--start-time 2026-04-06T00:00:00Z \
--end-time 2026-04-06T23:59:59Z \
--period 60 \
--statistics Sum

How to Fix It
The correct fix is to distribute outbound traffic across multiple NAT Gateways — one per Availability Zone. This both removes the single point of failure and multiplies total available bandwidth and connection capacity:
# Create a NAT Gateway in each AZ's public subnet
aws ec2 create-nat-gateway \
--subnet-id subnet-public-us-east-1a \
--allocation-id eipalloc-az1id
aws ec2 create-nat-gateway \
--subnet-id subnet-public-us-east-1b \
--allocation-id eipalloc-az2id
aws ec2 create-nat-gateway \
--subnet-id subnet-public-us-east-1c \
--allocation-id eipalloc-az3id
# Update each AZ's private subnet route table to use its local NAT GW
aws ec2 replace-route \
--route-table-id rtb-private-us-east-1a \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-az1id
aws ec2 replace-route \
--route-table-id rtb-private-us-east-1b \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-az2id
aws ec2 replace-route \
--route-table-id rtb-private-us-east-1c \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-az3id

For workloads that make large volumes of connections to AWS service endpoints (S3, DynamoDB, SSM, CloudWatch), deploy VPC Endpoints to bypass the NAT Gateway entirely for that traffic. Gateway endpoints for S3 and DynamoDB are free and eliminate both port exhaustion risk and data processing costs for those destinations.
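Per-AZ route updates like the ones above are repetitive and easy to mistype by hand. One option is to generate the replace-route commands from a single mapping; a sketch, where the AZ names and resource IDs are placeholders:

```python
# Hypothetical AZ -> (private route table, zone-local NAT Gateway) mapping.
az_map = {
    "us-east-1a": ("rtb-private-us-east-1a", "nat-az1id"),
    "us-east-1b": ("rtb-private-us-east-1b", "nat-az2id"),
    "us-east-1c": ("rtb-private-us-east-1c", "nat-az3id"),
}

def replace_route_commands(az_map):
    """Emit one replace-route command per AZ, keeping traffic zone-local."""
    return [
        f"aws ec2 replace-route --route-table-id {rtb} "
        f"--destination-cidr-block 0.0.0.0/0 --nat-gateway-id {nat}"
        for rtb, nat in az_map.values()
    ]

for cmd in replace_route_commands(az_map):
    print(cmd)
```

Keeping the mapping in one place makes it obvious when a private subnet's route table points at a NAT Gateway in a different AZ.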
Root Cause 6: NAT Gateway in Failed or Deleting State
Why It Happens
NAT Gateways can enter a failed state due to internal AWS infrastructure issues, EIP allocation conflicts at creation time, or subnet IP address exhaustion. They can also become stuck in a deleting state when active connections prevent the gateway from draining, or when a dependent resource (such as a route table still referencing the gateway) blocks cleanup. Instances continue sending traffic to a gateway ENI that is no longer forwarding, resulting in complete outbound connectivity loss with no self-healing behavior.
How to Identify It
aws ec2 describe-nat-gateways \
--filter "Name=vpc-id,Values=vpc-0a1b2c3d4e5f" \
--query "NatGateways[*].{ID:NatGatewayId,State:State,Code:FailureCode,Message:FailureMessage}" \
--output table

Example failed gateway output showing the specific failure reason:
-------------------------------------------------------------------------------------------
|                                   DescribeNatGateways                                  |
+---------------------+---------+------------------------+-------------------------------+
|         ID          |  State  |          Code          |            Message            |
+---------------------+---------+------------------------+-------------------------------+
|  nat-0a1b2c3d4e5f   | failed  | InsufficientFreeAddrs  |  Subnet has insufficient      |
|                     |         |                        |  free addresses.              |
+---------------------+---------+------------------------+-------------------------------+

How to Fix It
A failed NAT Gateway cannot be repaired — it must be replaced. If the failure reason is InsufficientFreeAddrs, free up IP space in the public subnet before creating the replacement:
# Check available IPs remaining in the public subnet
aws ec2 describe-subnets \
--subnet-ids subnet-0public444555666 \
--query "Subnets[*].AvailableIpAddressCount"
# Delete the failed gateway
aws ec2 delete-nat-gateway --nat-gateway-id nat-0a1b2c3d4e5f
# Wait for deletion to complete (state transitions to deleted)
aws ec2 wait nat-gateway-deleted \
--nat-gateway-ids nat-0a1b2c3d4e5f
# Create replacement once subnet capacity is confirmed
aws ec2 create-nat-gateway \
--subnet-id subnet-0public444555666 \
--allocation-id eipalloc-0newalloc1234

Root Cause 7: Network ACL Blocking Return Traffic
Why It Happens
Network ACLs (NACLs) are stateless — unlike security groups, they evaluate inbound and outbound rules independently for every packet, with no connection tracking. A common misconfiguration is adding an outbound rule on the public subnet's NACL that allows traffic to the internet while omitting an inbound rule to allow the ephemeral port range (1024–65535) for TCP and UDP return traffic. The NAT Gateway successfully forwards the outbound SYN packet, the remote server responds, but the inbound ACK is dropped at the public subnet NACL before it reaches the NAT Gateway for translation back to the source instance. The session appears to hang indefinitely.

How to Identify It
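Before running the CLI check below, it helps to see exactly why statelessness bites. The sketch that follows is a toy model of inbound NACL rule processing, not AWS's implementation, and the rule tuples are illustrative:

```python
def nacl_allows(rules, port):
    """Evaluate stateless inbound rules in ascending rule-number order;
    the first matching rule wins, and the implicit '*' rule denies the rest."""
    for _, action, port_from, port_to in sorted(rules):
        if port_from <= port <= port_to:
            return action == "allow"
    return False  # implicit final DENY

# Broken inbound rule set: only well-known service ports are allowed.
inbound = [(100, "allow", 443, 443), (200, "allow", 80, 80)]

# The outbound HTTPS request leaves fine, but the server's reply arrives
# addressed to an ephemeral source port, e.g. 40512:
print(nacl_allows(inbound, 443))    # True  - inbound HTTPS to the subnet
print(nacl_allows(inbound, 40512))  # False - return traffic dropped, session hangs

# After adding rule 110 allowing ports 1024-65535, the reply gets through:
inbound.append((110, "allow", 1024, 65535))
print(nacl_allows(inbound, 40512))  # True
```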
How to Identify It
# Find the NACL associated with the public subnet where the NAT GW resides
aws ec2 describe-network-acls \
--filters "Name=association.subnet-id,Values=subnet-0public444555666" \
--query "NetworkAcls[*].{ID:NetworkAclId,Entries:Entries}"

A broken NACL showing no inbound rule for ephemeral ports:
Inbound Rules:
Rule 100 | ALLOW | TCP | 0.0.0.0/0 | Port 443
Rule 200 | ALLOW | TCP | 0.0.0.0/0 | Port 80
Rule * | DENY | ALL | 0.0.0.0/0 | ALL
# Missing: inbound ALLOW for TCP/UDP ports 1024-65535 (ephemeral return traffic)

How to Fix It
# Add inbound allow for TCP ephemeral return ports
aws ec2 create-network-acl-entry \
--network-acl-id acl-0a1b2c3d4e5f6a7b \
--rule-number 110 \
--protocol 6 \
--rule-action allow \
--ingress \
--cidr-block 0.0.0.0/0 \
--port-range From=1024,To=65535
# Add inbound allow for UDP ephemeral return ports
aws ec2 create-network-acl-entry \
--network-acl-id acl-0a1b2c3d4e5f6a7b \
--rule-number 120 \
--protocol 17 \
--rule-action allow \
--ingress \
--cidr-block 0.0.0.0/0 \
--port-range From=1024,To=65535

Prevention
The majority of NAT Gateway failures are preventable through disciplined infrastructure design, automated guardrails, and proactive monitoring. Apply the following controls in every production VPC:
- Deploy one NAT Gateway per Availability Zone. Never share a single NAT Gateway across AZs. The cross-AZ failure blast radius is too wide, and cross-AZ data transfer incurs additional costs. Per-AZ deployment also increases aggregate bandwidth and connection capacity linearly.
- Manage all routing through Infrastructure as Code exclusively. Route table associations between private subnets and NAT Gateways must be defined in Terraform or CloudFormation and never modified manually. Manual changes are the primary source of the missing-route failure class. Use a create_before_destroy lifecycle rule on aws_nat_gateway and aws_route resources to ensure zero-downtime replacements.
- Enforce mandatory tagging via AWS Config rules. Untagged NAT Gateways and EIPs are the resources most commonly deleted during cleanup operations. Require Name, Environment, and ManagedBy tags on all NAT Gateway and EIP resources. Configure AWS Config to flag and alert on untagged resources before any deletion scripts run.
- Set CloudWatch alarms on ErrorPortAllocation and PacketsDropCount. A non-zero ErrorPortAllocation value is always actionable. Route these alarms to your on-call channel so bandwidth saturation is detected before it becomes a user-impacting outage.
- Use VPC Endpoints for all AWS service traffic. Gateway endpoints for S3 and DynamoDB are free and eliminate NAT Gateway processing costs and port exhaustion risk for those high-volume destinations. Interface endpoints for other AWS services (SSM, CloudWatch, ECR) reduce both NAT Gateway load and data transfer costs significantly.
- Audit EIPs regularly. Run a weekly check for EIPs in your account that are not associated with any NAT Gateway or EC2 instance. An unassociated EIP either costs money or represents a NAT Gateway that silently moved to a failed state without operator awareness.
- Store NAT Gateway IDs in AWS Systems Manager Parameter Store. CI/CD pipelines and monitoring systems can query these values and alert if a referenced NAT Gateway ID no longer exists in an Available state, catching silent failures immediately.
- Enable VPC Flow Logs on all subnets. Flow logs on the private subnet interface of the NAT Gateway ENI make it possible to distinguish between packets that never left the private subnet (routing failure) versus packets that left but were rejected at the NACL (stateless ACL misconfiguration).
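Several of these controls reduce to scanning describe-nat-gateways output for gateways that are not Available. A minimal parsing sketch; flag_unhealthy is a hypothetical helper and the sample records are fabricated for illustration:

```python
def flag_unhealthy(nat_gateways):
    """Return (id, state, failure message) for every gateway not 'available'."""
    return [
        (gw["NatGatewayId"], gw["State"], gw.get("FailureMessage", ""))
        for gw in nat_gateways
        if gw["State"] != "available"
    ]

# Records shaped like NatGateways[*] from `aws ec2 describe-nat-gateways`
sample = [
    {"NatGatewayId": "nat-healthy01", "State": "available"},
    {"NatGatewayId": "nat-broken02", "State": "failed",
     "FailureMessage": "Subnet has insufficient free addresses."},
]

for gw_id, state, msg in flag_unhealthy(sample):
    print(f"ALERT {gw_id}: {state} {msg}")
```

Run on a schedule and wired to the on-call channel, a check like this catches gateways that silently moved to a failed state without anyone touching the console.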
Frequently Asked Questions
Q: How do I quickly test if outbound internet access is working from a private instance?
A: From the private instance (accessed via SSM Session Manager or a bastion host), run curl -I https://api.solvethenetwork.com --connect-timeout 5. If it times out, the NAT Gateway path is broken. Also test a raw IP like curl -I http://1.1.1.1 --connect-timeout 5 — if that works but DNS names do not, the issue is DNS rather than NAT routing.
Q: Can I use a single NAT Gateway for multiple VPCs?
A: No. A NAT Gateway is scoped to a single VPC. For multi-VPC architectures, each VPC that requires internet egress needs its own NAT Gateways, or you must route traffic through a centralized egress VPC via Transit Gateway where the NAT Gateways reside. The centralized model reduces EIP count and cost but introduces a single point of failure if not deployed across multiple AZs.
Q: What happens to existing TCP sessions when a NAT Gateway is deleted or replaced?
A: All active sessions are immediately terminated with no graceful drain. To minimize disruption, always create the replacement NAT Gateway first, update the route table to point to it, confirm connectivity, then delete the old gateway. Deploy during a maintenance window and notify dependent application teams so they can implement reconnect logic.
Q: Why does my NAT Gateway show Available state but instances still cannot reach the internet?
A: The Available state only confirms the gateway resource itself is healthy — it provides no information about route table configuration, NACL rules, or instance-level security groups. Run through the route table check first (Root Causes 1 and 4), then check NACLs on both the private and public subnets for missing ephemeral port return rules, then verify instance security groups allow outbound traffic on the required ports.
Q: Does a NAT Gateway replace a security group?
A: No. NAT Gateways handle address translation and routing only. Security groups and NACLs remain fully in effect. An instance must have outbound security group rules allowing the required protocols and ports, and the public subnet NACL must include inbound allow rules for ephemeral port return traffic (TCP/UDP 1024–65535).
Q: How much does a NAT Gateway cost, and how can I reduce it?
A: NAT Gateways are billed per hour of existence (approximately $0.045/hour per gateway, region-dependent) plus per GB of data processed ($0.045/GB in us-east-1). Reduce costs by deploying VPC Gateway Endpoints for S3 and DynamoDB (free, bypasses NAT Gateway processing charges), using Interface Endpoints for other frequently-called AWS services, and ensuring per-AZ NAT Gateway deployment so instances do not generate cross-AZ data transfer charges by traversing a gateway in a different AZ.
Q: What is the difference between a NAT Gateway and a NAT instance?
A: A NAT Gateway is a fully managed AWS service — no AMI patching, no instance type management, and automatic scaling. A NAT instance is an EC2 instance running software NAT (iptables MASQUERADE) that requires source/destination check disabled on the ENI. NAT instances are legacy and should only be used if you specifically require port forwarding, custom routing logic, or need to operate in a severely cost-constrained environment where the per-hour gateway cost is prohibitive.
Q: Can I assign multiple Elastic IPs to a single NAT Gateway?
A: Yes. You can associate up to 8 EIPs with a single NAT Gateway. Each additional EIP expands the available source port pool, which helps avoid the 55,000 simultaneous connection limit when many instances are making concurrent connections to the same destination IP and port combination — common in environments hitting external APIs or NTP servers at high frequency.
Q: How do I monitor NAT Gateway health proactively before failures impact users?
A: Create CloudWatch alarms on the AWS/NATGateway namespace: alarm on any non-zero ErrorPortAllocation (port exhaustion), alarm when PacketsDropCount sustains above baseline for two consecutive periods, and alert when ActiveConnectionCount approaches 50,000 per destination. Also alarm on IdleTimeoutCount spikes, which can indicate upstream resource exhaustion causing connections to stall.
Q: My Terraform plan wants to recreate the NAT Gateway — will that cause downtime?
A: Yes, unless you use create_before_destroy = true lifecycle rules on both the aws_nat_gateway and aws_route resources. With that setting Terraform provisions the replacement gateway first, updates the route, then destroys the old gateway. Without it, Terraform destroys the existing gateway first, leaving a gap of several minutes while the replacement gateway initializes and reaches Available state.
Q: Why do I see ErrorPortAllocation errors even with only a few instances running?
A: Port exhaustion can occur with very few instances if each instance opens a large number of concurrent connections to the same destination IP and port tuple. Each NAT Gateway allows 55,000 simultaneous connections per destination. A handful of instances running aggressive connection pools, batch download jobs, or high-frequency polling loops targeting a single endpoint can exhaust this limit rapidly. Mitigate by reducing connection concurrency, implementing connection pooling, adding additional EIPs to the NAT Gateway, or switching to a VPC Endpoint if the destination is an AWS service.
Q: Can VPC Flow Logs help diagnose NAT Gateway problems?
A: Yes, and they are one of the most reliable diagnostic tools available. Enable flow logs on the entire VPC or specifically on the NAT Gateway's ENI. Filter for REJECT actions to identify NACL-blocked traffic. Compare traffic arriving on the private subnet interface of the NAT Gateway against traffic departing on its public interface — if inbound traffic is present but outbound traffic for the same session is absent, the NAT Gateway itself is dropping the packet, which indicates a bandwidth or connection limit issue rather than a routing problem.
