Checkpoint/Azure S2S VPN intermittent one-way issues

Okay, so I am kind of at my wits' end. My coworker set up a site-to-site VPN between an on-premises Checkpoint firewall and an Azure site, but it has some issues.
Servers in the Azure site can ping and scan the on-prem servers, but the on-prem servers can only rarely ping/scan servers in the Azure environment. The pattern runs like this: the on-prem servers can ping the Azure servers for about 10 minutes, during which everything seems fine, then there are 90-120 minutes where no pings get through.

I have tried the things in this troubleshooting guide:
https://learn.microsoft.com/en-us/azure/vpn-gateway/vpn-gateway-troubleshoot-site-to-site-disconnected-intermittently

But none of that seems to work, and at least most of the things in the guide seem to apply when the entire VPN is having issues, whereas the one-way connection from Azure to on-prem is having no issues at all.

I have checked the security logs, which just say that the ICMP packets get encrypted in the Azure community. The VPN is policy-based.
I just can't come up with any theory as to why it would always work one way and only sometimes work the other way; everything I can think of should have either bigger or smaller consequences. So I was hoping one of you could help before we have to tear it down and start over.

We had this, and I don't know why, but eventually it degraded so much that we lost all connection to the Azure GW. We rebuilt it with the exact same specs and it has run flawlessly ever since.

In other words… tearing it down and starting over seemed to do the trick. We talked to IBM (consultant contractors for us), who stated that it was actually pretty common for the Azure side to essentially become corrupted.

This might sound odd, but check your encryption domains. We don't use Check Point any more, but we used to see one-way issues occasionally, and sometimes it was because the domains didn't match exactly. When the other side negotiated, Check Point would stupidly accept even with the mismatch. But then when CP went to initiate traffic in the other direction, it would not see the tunnel as valid and would try to renegotiate; the other side wouldn't respond because it already saw a valid tunnel.

I’d like to think they’d fixed this in the couple of years since we moved off CP.
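
If you want to compare the two sides quickly, something like this is what I'd check. The resource and group names below are placeholders, and the query paths are from memory, so verify them against your own output:

```
# Azure side: the prefixes Azure thinks the on-prem end owns
az network local-gateway show \
  --resource-group MyRG \
  --name OnPremLocalGateway \
  --query "localNetworkAddressSpace.addressPrefixes"

# Azure side: the VNet address space the gateway will propose as its own domain
az network vnet show \
  --resource-group MyRG \
  --name MyVnet \
  --query "addressSpace.addressPrefixes"

# Check Point side: the encryption domain is the VPN domain group on the
# gateway object in SmartConsole (Network Management > VPN Domain).
# The networks in that group should match the Azure prefixes exactly,
# not just overlap.
```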

Have you applied MSS clamping on the Check Point?
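
If not, this is roughly how it's enabled on the Check Point side. The kernel parameter name is from memory of the relevant SK article (which also documents a companion parameter for VPN traffic), so double-check it for your version before applying:

```
# One-off, takes effect immediately but does not survive a reboot
fw ctl set int fw_clamp_tcp_mss 1

# Persistent: add the same parameter to the kernel config
echo 'fw_clamp_tcp_mss=1' >> $FWDIR/boot/modules/fwkern.conf
```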

A long, long time ago, about 15 years, I had a similar issue between a (very old and crusty even then) Check Point and a somewhat less crusty ASA.

I eventually determined that the Check Point end of things, unlike the majority of other vendors, was expiring the Phase 1 SA without concurrently renegotiating the Phase 2 SA. Meanwhile, the Cisco expected that to happen. So traffic would drop until the Phase 2 SA caught up, expired, and was renegotiated.

I wish I could tell you how to fix this on the Check Point, though. We ended up migrating that particular tunnel over to a NetScreen because the customer with the Check Point was adamant that they couldn't possibly be the problem.
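
If the OP wants to check whether they're hitting the same kind of lifetime mismatch, this is a rough way to do it. The numbers below are the Azure policy-based gateway defaults as I remember them, so verify against the current Microsoft IPsec/IKE parameters doc:

```
# Watch the SA state on the Check Point around the time the outage starts;
# this is the interactive tunnel utility, and the option layout varies by version
vpn tu

# Azure policy-based gateway defaults (from memory, verify):
#   IKE Phase 1 (Main Mode) SA lifetime : 28,800 seconds
#   IPsec Phase 2 (Quick Mode) lifetime : 3,600 seconds / 102,400,000 KB
# The Check Point equivalents are the IKE/IPsec renegotiation times on the
# VPN community object; set them to match (or sit slightly below) the Azure side.
```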

We too had issues with the encryption domains on our Check Points. We ended up redeploying the VNet gateways and it resolved the issue, only for it to rear its ugly head again. S2S with Azure native gateways is notoriously problematic.

We migrated that tunnel to our SD-WAN appliance (and away from Checkpoint) and never had issues with the Azure S2S tunnels again.

How many security associations do you have? Generally, VPN gateways in public cloud only give you one SA per direction. This can cause interesting scenarios where sometimes one SA is active and then another, meaning that you can only reach certain parts of the network at different times.
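
You can see how many SAs the Check Point has actually negotiated with the tunnel utility; a sketch below (the Tunnel Management setting name is from memory):

```
# List the IKE/IPsec SAs currently held for the Azure peer
# (interactive menu; newer releases also have non-interactive listing options)
vpn tu

# If there are several Phase 2 SAs -- one per local/remote subnet pair --
# while Azure only keeps a single pair, that mismatch gives exactly the
# "only part of the network reachable at any one time" behaviour.
# On the community object this is controlled by the Tunnel Management option
# "One VPN tunnel per Gateway pair".
```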

I cannot help, but wanted you to know that we see something similar from our (sadly older) ASAs to the Azure gateway.

It was bad for a long time, but now it just happens every now and then…

Back then you could edit a file to force CP to act like other vendors, renegotiating and then expiring. I don't remember which file; it might have been user.def, fwkern.conf, or one of the various other files you often needed to modify.

I recall having issues of a similar vintage, connecting our old R77.30 Check Points to an Azure VPN gateway maybe 10-11 years ago. However, our failures seemed to happen after a certain amount of time with no data passing over the tunnel.

So our temporary fixes were:

- Run a ping -t in both directions to make sure there was always at least some traffic (a scripted version of this keepalive is sketched below)
- Move the VPN link to an old Fortinet 90C someone had in a drawer

And we only did these while waiting for the final fix, which was installation of our MPLS and ExpressRoute connections.
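
For anyone who needs that keepalive workaround without leaving a console window open, a minimal sketch; the host IP is a placeholder for something on the far side of the tunnel, and you'd run one of these in each direction:

```
# crontab entry on a Linux host: four pings a minute across the tunnel,
# just enough traffic to keep the SA from going idle
* * * * * /bin/ping -c 4 -W 2 10.1.0.4 > /dev/null 2>&1
```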

I ran a packet capture, and it shows the packets from the on-prem server going both out and back, but all the pings from the on-prem server failed while all the ones from Azure succeeded.
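
For anyone wanting to look at the same thing on their own gateway, something like this is one way to capture it on the Check Point side. The interface name and host IP are placeholders, and the classic fw monitor filter syntax varies a bit between versions:

```
# Pre-/post-encryption view of the ICMP at the gateway's inspection points
fw monitor -e "accept src=10.1.0.4 or dst=10.1.0.4;"

# Check whether encrypted traffic actually leaves the external interface
tcpdump -ni eth0 'ip proto 50'
```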