Hello everyone,
We have a Traffic Manager profile that should distribute traffic between two Azure endpoints; let's call them "primary" and "secondary".
Both endpoints are web apps deployed in different regions: primary is in West Europe, secondary in France Central.
Yesterday we had an outage that lasted more than one hour because the secondary endpoint went crazy (very high CPU and memory usage), yet the Traffic Manager kept distributing incoming requests to both endpoints. Everything went back to normal only after I manually disabled the secondary endpoint through the Azure Portal.
How do I prevent this from happening again? Why didn't the Traffic Manager detect slow response times in the secondary endpoint and disable it?
Our Traffic Manager configuration:
Routing method: Performance; DNS TTL: 30 seconds
[Monitor settings] Protocol: HTTPS; Port: 443; Path: /KeepAlive.aspx (this web page loads a record from the database using Entity Framework; under normal conditions the response time is around 1 second)
[Fast endpoint failover settings] Probing interval: 30 seconds; Tolerated number of failures: 3; Probe timeout: 10 seconds
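
To make my reading of these settings concrete, here is a minimal Python sketch of how I understand the probe logic. It is a simplified model, not Traffic Manager's actual implementation. It assumes that the probing interval and probe timeout are in seconds, that a probe succeeds only on an HTTP 200 returned within the timeout, and that the endpoint is marked degraded once the number of consecutive failed probes exceeds the tolerated number of failures. The names `MonitorSettings`, `probe_ok` and `probes_until_degraded` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class MonitorSettings:
    # Values from the profile above; units assumed to be seconds.
    probing_interval_s: int = 30
    tolerated_failures: int = 3
    probe_timeout_s: int = 10


def probe_ok(status_code: int, response_time_s: float, settings: MonitorSettings) -> bool:
    """A probe passes if it returns HTTP 200 within the probe timeout (assumption)."""
    return status_code == 200 and response_time_s <= settings.probe_timeout_s


def probes_until_degraded(
    probe_results: Iterable[Tuple[int, float]], settings: MonitorSettings
) -> Optional[int]:
    """Return the 1-based index of the probe that would mark the endpoint
    as degraded, or None if the endpoint would stay online."""
    consecutive_failures = 0
    for index, (status_code, response_time_s) in enumerate(probe_results, start=1):
        if probe_ok(status_code, response_time_s, settings):
            consecutive_failures = 0  # any successful probe resets the count
        else:
            consecutive_failures += 1
            if consecutive_failures > settings.tolerated_failures:
                return index
    return None


settings = MonitorSettings()

# Slow but successful: 8-second responses stay under the 10-second timeout.
slow_but_ok = [(200, 8.0)] * 20
print(probes_until_degraded(slow_but_ok, settings))

# Responses slower than the timeout count as failures.
timing_out = [(200, 12.0)] * 5
print(probes_until_degraded(timing_out, settings))
```

If this model is right, an endpoint that answers /KeepAlive.aspx slowly but still within the probe timeout would never be taken out of rotation, no matter how badly real user requests are performing.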