We are seeking an experienced AWS Networking Specialist to diagnose and resolve an ongoing VPN connection issue.
Our multi-account environment uses a Transit Gateway with lab and production VPCs, and the VPN has been running smoothly for over a year.
Recently, however, the connection has started resetting every hour without a clear cause.
1.
AWS periodically updates the firmware and software of managed services (including Site-to-Site VPN and Transit Gateway).
During one of these updates, AWS may have changed cryptographic policy parameters or security association (SA) lifetimes.
2.
AWS periodically publishes updated cryptographic recommendations, such as moving to more secure IKEv2 algorithms and retiring weak ciphers.
If your infrastructure automatically follows these best practices (e.g., through Terraform/CloudFormation with security policy compliance checks), a template update may have changed the VPN settings.
3.
If your administrator has enabled new features (for example, advanced monitoring, additional VPN sessions, or High Availability functions), some parts of the VPN connection settings might revert to default values that differ from the previous configuration.
This is especially likely in multi-account environments, where a change made in one place (for example, via AWS Organizations or Resource Access Manager) can affect all associated Transit Gateway settings.
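One way to spot settings that have silently reverted is to compare the tunnel options in effect today against a snapshot taken while the VPN was stable. The Python sketch below does this for a handful of fields. The field names follow the TunnelOptions structure in the EC2 DescribeVpnConnections response; the helper name diff_tunnel_options and both sample dictionaries are hypothetical.

```python
# Field names follow the TunnelOptions structure returned by the EC2
# DescribeVpnConnections API. All values below are illustrative only.
WATCHED_FIELDS = (
    "Phase1LifetimeSeconds",
    "Phase2LifetimeSeconds",
    "RekeyMarginTimeSeconds",
    "RekeyFuzzPercentage",
    "DpdTimeoutSeconds",
    "DpdTimeoutAction",
)


def diff_tunnel_options(baseline, current):
    """Return {field: (baseline_value, current_value)} for each watched field that differs."""
    changes = {}
    for field in WATCHED_FIELDS:
        if baseline.get(field) != current.get(field):
            changes[field] = (baseline.get(field), current.get(field))
    return changes


# Hypothetical snapshot recorded while the VPN was stable.
baseline = {"Phase1LifetimeSeconds": 28800, "Phase2LifetimeSeconds": 3600,
            "DpdTimeoutSeconds": 30, "DpdTimeoutAction": "restart"}
# Hypothetical options read from the connection today.
current = {"Phase1LifetimeSeconds": 28800, "Phase2LifetimeSeconds": 1800,
           "DpdTimeoutSeconds": 30, "DpdTimeoutAction": "clear"}

drift = diff_tunnel_options(baseline, current)
print(drift)
```

An empty result would suggest that the watched options are unchanged and that the cause lies elsewhere.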
4.
For IPsec tunnels, periodic rekeying is normal behavior.
However, if the Phase 1 or Phase 2 lifetime is suddenly shortened after an update, the session may be reestablished more frequently.
If the AWS side and the local (on-premises) side fall out of sync on these lifetimes, the tunnel may drop or reset every hour or half hour.
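Before comparing lifetimes, it can help to confirm that the resets really follow a fixed period. The Python sketch below computes the gaps between consecutive resets and checks whether their median is close to one hour or half an hour. The timestamps, the function names, and the 120-second tolerance are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import median


def reset_intervals(timestamps):
    """Return the gaps in seconds between consecutive reset timestamps."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


def looks_periodic(timestamps, period_s=3600, tolerance_s=120):
    """True if the median gap between resets is within tolerance_s of period_s."""
    gaps = reset_intervals(timestamps)
    if not gaps:
        return False
    return abs(median(gaps) - period_s) <= tolerance_s


# Hypothetical tunnel-down times copied from a firewall log.
resets = [datetime.fromisoformat(t) for t in (
    "2024-05-01T08:02:10", "2024-05-01T09:02:40",
    "2024-05-01T10:01:55", "2024-05-01T11:02:30",
)]

print(reset_intervals(resets))
print("hourly:", looks_periodic(resets))
print("half-hourly:", looks_periodic(resets, period_s=1800))
```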
5.
Your administrator might have enabled (or changed) the "rekey margin time" parameter, which defines how long before the SA expires renegotiation begins.
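To see how the margin shifts the moment of renegotiation, the Python sketch below computes the window in which a rekey may start. It assumes the strongSwan-style rule, rekey time = lifetime - (margin + random(0, margin * fuzz)); whether your AWS tunnel or on-premises gateway applies exactly this rule is an assumption to verify. The function name and the sample values are hypothetical.

```python
def rekey_window(lifetime_s, margin_s, fuzz_pct):
    """Return (earliest, latest) seconds after SA setup at which a rekey may start.

    Assumes the strongSwan-style rule:
        rekey = lifetime - (margin + random(0, margin * fuzz))
    """
    widest_margin = margin_s * (1 + fuzz_pct / 100)
    if widest_margin >= lifetime_s:
        raise ValueError("margin plus fuzz must be shorter than the lifetime")
    earliest = lifetime_s - widest_margin
    latest = lifetime_s - margin_s
    return earliest, latest


# Illustrative values only: a 3600 s lifetime with two different margins.
print(rekey_window(3600, 540, 100))
print(rekey_window(3600, 270, 100))
```

Under this assumption, a larger margin moves the rekey earlier, so a changed margin alone can alter how often and when the tunnel renegotiates.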
6.
In a multi-account environment, it is not uncommon for multiple people to have access (through different roles or organizations).
It is possible that someone has temporarily changed the settings (e.g. to test another tunnel) and then partially reverted them, resulting in some incompatibility with the local VPN gateway.
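An audit trail can show whether someone touched the VPN settings. The Python sketch below filters event records shaped like those returned by CloudTrail LookupEvents (keys EventTime, Username, EventName) and keeps only EC2 actions that modify a VPN connection or its tunnels. The sample records and the function name are hypothetical, and fetching the records from CloudTrail is left out.

```python
# EC2 API actions that alter a Site-to-Site VPN connection or its tunnels.
VPN_CHANGE_EVENTS = {
    "ModifyVpnConnection",
    "ModifyVpnConnectionOptions",
    "ModifyVpnTunnelOptions",
    "ModifyVpnTunnelCertificate",
}


def vpn_changes(events):
    """Return (time, user, action) for each VPN-modifying event, oldest first."""
    hits = [
        (e["EventTime"], e.get("Username", "unknown"), e["EventName"])
        for e in events
        if e["EventName"] in VPN_CHANGE_EVENTS
    ]
    return sorted(hits)


# Hypothetical records; real ones carry datetime objects in EventTime.
events = [
    {"EventTime": "2024-05-01T07:55:00Z", "Username": "alice",
     "EventName": "ModifyVpnTunnelOptions"},
    {"EventTime": "2024-05-01T07:40:00Z", "Username": "bob",
     "EventName": "DescribeVpnConnections"},
    {"EventTime": "2024-05-01T07:50:00Z", "Username": "bob",
     "EventName": "ModifyVpnConnectionOptions"},
]

for when, who, action in vpn_changes(events):
    print(when, who, action)
```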
7.
If you have updated the software (for example, the firmware of a firewall, router, or VPN gateway) or replaced any equipment, the IPsec parameters or keep-alive mechanisms may have changed, resulting in disconnections.
8.
In an AWS Transit Gateway environment, especially with multiple VPCs and accounts in use, ad hoc or scheduled changes (for example, adding or removing route tables, prefixes, or attachments) can cause traffic to be routed incorrectly.
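A routing change can be checked by asking which route a given destination actually matches. The Python sketch below applies longest-prefix matching to route records shaped like those returned by the EC2 SearchTransitGatewayRoutes API (keys DestinationCidrBlock, State, Type). The sample routes and the function name are hypothetical, and the sketch handles IPv4 prefixes only.

```python
import ipaddress


def best_match(destination, routes):
    """Return the most specific active or blackhole route covering destination, or None."""
    dest = ipaddress.ip_network(destination)
    candidates = [
        r for r in routes
        if r["State"] in ("active", "blackhole")
        and dest.subnet_of(ipaddress.ip_network(r["DestinationCidrBlock"]))
    ]
    return max(
        candidates,
        key=lambda r: ipaddress.ip_network(r["DestinationCidrBlock"]).prefixlen,
        default=None,
    )


# Hypothetical contents of one Transit Gateway route table.
routes = [
    {"DestinationCidrBlock": "10.0.0.0/8", "State": "active", "Type": "static"},
    {"DestinationCidrBlock": "10.20.0.0/16", "State": "blackhole", "Type": "static"},
    {"DestinationCidrBlock": "192.168.0.0/16", "State": "active", "Type": "propagated"},
]

hit = best_match("10.20.5.0/24", routes)
print(hit["DestinationCidrBlock"], hit["State"])
```

In this sample, the on-premises prefix matches the more specific blackhole route even though a broader active route exists, which is the kind of side effect a removed attachment can leave behind.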
9.
Periodic network outages occurring every 60 minutes may be due to DHCP leases or NAT timeouts at an internet service provider (ISP).
10.
If your local (on-premises) gateway (or router) is configured to obtain an external IP address via DHCP from the provider, and the provider issues a short-term lease (for example, exactly 60 minutes), then the following scenario is possible:
- When the lease is renewed shortly before it expires (for example, around the 50th or 55th minute, depending on the mechanism), the router may briefly interrupt the session or obtain a completely different IP address (if the pool of dynamic addresses changes).
- Because of this, the IPsec tunnel loses the current peer IP (or NAT mapping) and reinitializes itself.
In practice, this looks like 1–2 minutes of VPN downtime every hour.
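To judge whether the resets line up with lease activity, it helps to know when a DHCP client is expected to act. The Python sketch below uses the default timers from RFC 2131: renewal (T1) at 50% of the lease and rebinding (T2) at 87.5%. A provider or router may override these defaults, so treat the result as a starting point; the function name is hypothetical.

```python
def dhcp_timers(lease_s):
    """Return (T1, T2) in seconds: default renewal and rebinding times per RFC 2131."""
    return 0.5 * lease_s, 0.875 * lease_s


# The 60-minute lease from the scenario above.
t1, t2 = dhcp_timers(3600)
print(f"renew at {t1 / 60:.1f} min, rebind at {t2 / 60:.1f} min")
```

Comparing these minutes with the timestamps of the VPN resets indicates whether the lease cycle is a plausible trigger.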
11.
If there is another NAT between your VPN gateway and the public Internet (for example, a home router, NAT from a 4G/LTE provider, or a corporate firewall), then sometimes the NAT session is dropped when there is no active traffic or when the specified maximum session time is reached.
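The two NAT conditions described here, an idle timeout and a maximum session time, can be written as a small check. The Python sketch below is a simplified model rather than the behavior of any particular NAT device; the function name, the parameters, and the sample values are hypothetical.

```python
def nat_drop_reason(idle_gap_s, session_age_s, idle_timeout_s, max_session_s=None):
    """Return why a NAT device would drop the mapping, or None if it survives."""
    if idle_gap_s >= idle_timeout_s:
        return "idle timeout"
    if max_session_s is not None and session_age_s >= max_session_s:
        return "maximum session time"
    return None


# Traffic every 20 s keeps a mapping with a 300 s idle timeout alive.
print(nat_drop_reason(idle_gap_s=20, session_age_s=600, idle_timeout_s=300))
# A quiet tunnel exceeds the idle timeout.
print(nat_drop_reason(idle_gap_s=400, session_age_s=600, idle_timeout_s=300))
# Even a busy tunnel is cut when a 3600 s session cap is reached.
print(nat_drop_reason(idle_gap_s=20, session_age_s=3600, idle_timeout_s=300,
                      max_session_s=3600))
```

The last case is the one that would produce an exact hourly reset regardless of traffic.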
12.
In some IPsec/VPN configurations, NAT-Traversal (NAT-T) is not enabled, so your traffic is sent as native ESP (Encapsulating Security Payload, IP protocol 50) rather than encapsulated in UDP on port 4500.
If the provider is still applying NAT along the route, the tunnel may become unstable or fail to reestablish after the first rekey.
13.
Sometimes there is a mixed scenario.
For example, your VPN configuration might have a Phase 1 lifetime set to 28800 seconds (8 hours) and a Phase 2 lifetime set to 3600 seconds (1 hour).
At the same time, the NAT gateway resets the session every 60 minutes.
It is necessary to check that these two events do not overlap:
13.1) Rekey (IPsec key regeneration) every hour: if one side is set to 3600 seconds and the other to 4800 seconds, the two sides disagree on when the SA expires, and a conflict may occur.
13.2) NAT timeout is also one hour: if the NAT session is dropped just as a rekey is attempted, the rekey may fail and the tunnel resets.
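The overlap check can be sketched in Python as follows. The model is deliberately simple: it assumes that both timers start at the same moment and that a rekey happens exactly when the lifetime elapses, which ignores rekey margins. The function name and the 60-second window are hypothetical.

```python
def collisions(rekey_period_s, nat_period_s, horizon_s, window_s=60):
    """Return the rekey times (seconds from start) that fall within window_s of a NAT reset."""
    hits = []
    t = rekey_period_s
    while t <= horizon_s:
        offset = t % nat_period_s
        # Distance to the nearest NAT reset, before or after the rekey.
        if min(offset, nat_period_s - offset) <= window_s:
            hits.append(t)
        t += rekey_period_s
    return hits


# 3600 s Phase 2 lifetime against a 60-minute NAT reset: every rekey collides.
same_period = collisions(3600, 3600, horizon_s=4 * 3600)
# A 4800 s lifetime lines up with the NAT reset only once in four hours.
mixed_period = collisions(4800, 3600, horizon_s=4 * 3600)
print(same_period, mixed_period)
```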
14.
Increasing or decreasing the frequency of keep-alive messages, or disabling them altogether, can cause one side to consider the tunnel dead and reestablish it.
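The relationship between the keep-alive interval and the peer's dead timeout can be estimated with a short calculation. The Python sketch below uses a simplified model in which the peer declares the tunnel dead once it has heard nothing for longer than its timeout; real dead peer detection implementations differ in detail. The function name and the sample values are hypothetical.

```python
def tolerated_losses(keepalive_interval_s, dead_timeout_s):
    """Return how many consecutive keep-alives may be lost before the peer
    declares the tunnel dead, or -1 if an idle tunnel times out even without loss."""
    if keepalive_interval_s is None:  # keep-alives disabled
        return -1
    return dead_timeout_s // keepalive_interval_s - 1


print(tolerated_losses(10, 30))    # frequent keep-alives leave some slack
print(tolerated_losses(60, 30))    # interval longer than the timeout
print(tolerated_losses(None, 30))  # keep-alives disabled
```

A result of -1 means that an otherwise healthy but idle tunnel would be torn down under this model, which matches the symptom of periodic reestablishment.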