Gigalixir Status Page – GCP us-central1 Service Outage – Incident details

GCP us-central1 Service Outage

Resolved
Partial outage
Started 2 months ago · Lasted about 3 hours

Affected

US-Central1 (Google Cloud)

Partial outage from 11:31 AM to 2:57 PM

Updates
  • Resolved

    July 15-16, 2024 us-central Ingress

    Incident Report Updated July 18, 2024

    Description of Issue

    On two separate occasions, we were targeted by SYN Flood attacks.

    In a SYN Flood attack, a source (in this case, many sources) sends TCP SYN packets with illegitimate, spoofed source addresses. 

    Servers attempt to respond to the illegitimate source addresses multiple times and hold onto these "half-open" connections waiting for a reply.  Given enough of these types of packets over a short period of time, it can overwhelm buffers in servers and prevent them from handling "proper" traffic.
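
    As an aside for readers who want to see this on their own Linux machines (this is an illustration, not part of our tooling), half-open connections show up in /proc/net/tcp with the hex state code 03 (SYN_RECV) and can be counted with a few lines of Python:

        # Count half-open (SYN_RECV) TCP sockets on a Linux host.
        # Illustration only: state 03 in /proc/net/tcp means SYN_RECV.
        def count_half_open(path="/proc/net/tcp"):
            count = 0
            with open(path) as f:
                next(f)  # skip the header line
                for line in f:
                    fields = line.split()
                    if fields[3] == "03":  # field 3 is the connection state
                        count += 1
            return count

        if __name__ == "__main__":
            print("half-open connections:", count_half_open())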

    The source of the attack has not been identified.  The attackers were clever enough to use a distributed attack.

    This attack was targeted at our load balancers directly and was not an attack on Google Cloud.

    Scope of the Issue

    Due to the nature of the attack, it had an impact on the packet delivery of our network interfaces on our servers.  This means that all applications running on the impacted server(s) could experience connectivity issues (ingress and egress) as the network interfaces would become unusable at times.

    Though the attack came in on our default ingress load balancers, it was able to affect our dedicated ingress applications as well, since it was halting traffic across the entire network interface.

    Prevention Measures

    We have applied a handful of mitigations to prevent this exact type of attack from happening again, many of which are detailed below.  Our system will now detect and mitigate heavy volumes of failed SYN retries and SYN-ACK retries.  We have also increased the volume of network traffic each of our servers can process.
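
    The specific changes are internal to our platform, but for readers who want the general background, the Linux kernel exposes the standard knobs for SYN cookies, SYN backlog size, and SYN-ACK retries under /proc/sys. The sketch below simply reads them; it describes generic kernel settings, not our exact configuration:

        # Read the generic Linux sysctls that govern SYN backlog and
        # SYN/SYN-ACK retry behavior. Generic kernel knobs only; this is
        # not a description of Gigalixir's configuration.
        SYSCTLS = {
            "net/ipv4/tcp_syncookies": "SYN cookies avoid holding half-open state under flood",
            "net/ipv4/tcp_max_syn_backlog": "how many half-open connections a listener may queue",
            "net/ipv4/tcp_synack_retries": "how many times an unanswered SYN-ACK is retransmitted",
            "net/core/somaxconn": "upper bound on a socket's accept queue",
        }

        for key, meaning in SYSCTLS.items():
            with open(f"/proc/sys/{key}") as f:
                value = f.read().strip()
            print(f"{key} = {value}  # {meaning}")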

    We have also set up more policies, via Google Cloud Armor and our own firewalls, to detect and mitigate this type of attack and similar DDoS style attacks.  Additionally, we are actively working with our Google partners to identify areas that we can improve even further.

    We are also looking into creating more isolation in our system to limit the scope of similar issues in the future. We are considering moving applications with dedicated ingress to isolated servers to further protect them against multiple types of attacks.

    Customer Recommendations

    For this particular type of issue, the prevention efforts largely fall on Gigalixir, as outlined in the previous section.

    However, for general application protection we would recommend the following:

    Run more than one Replica

    When you run more than one replica, we run them on multiple servers and across multiple zones.

    This helps prevent an issue on a single server or zone from completely taking your application offline.
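
    If you are unsure how many replicas you are running, the Gigalixir CLI can show and change this; at the time of writing, gigalixir ps reports your current replicas and gigalixir ps:scale --replicas=2 scales up (check the CLI documentation for the current syntax).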

    Consider Dedicated Ingress

    Dedicated Ingress gives your applications their own load balancer and ingress resources.

    In the near future, we are also considering moving dedicated ingress applications out of the common runtime server pool, which would provide further isolation, including during events like this one.

    Consider DDoS Protections and/or WAF

    There are a handful of good products that offer protections for traffic coming through your domain and/or hostnames.

    (We can provide individual recommendations upon request.)

    One of our preferred solutions is to use Cloudflare for your DNS with a rule that applies a signature to all requests.  If you couple this with our Dedicated Ingress, we can limit all traffic through the ingress system to only traffic coming from your Cloudflare setup.  Cloudflare offers DDoS protections on all plan levels, including the free plan.
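
    As a rough sketch of the signature idea (the header name and secret below are placeholders, not values we issue, and in practice the check is enforced for you at the Dedicated Ingress rather than inside your application), the receiving side simply rejects any request that does not carry the header your Cloudflare rule adds:

        # Minimal sketch: reject requests that lack the signature header added
        # by a Cloudflare rule. "X-Origin-Signature" and the secret value are
        # placeholders, not something Gigalixir or Cloudflare issues.
        import os
        from wsgiref.simple_server import make_server

        EXPECTED_VALUE = os.environ.get("ORIGIN_SIGNATURE", "change-me")

        def app(environ, start_response):
            if environ.get("HTTP_X_ORIGIN_SIGNATURE") != EXPECTED_VALUE:
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"forbidden"]
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [b"hello from behind Cloudflare"]

        if __name__ == "__main__":
            make_server("0.0.0.0", 8000, app).serve_forever()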

    Incident Timeline

    15 July - 13:57 UTC / 08:57 CDT

    Several of our health check alerts went off simultaneously and we began to investigate the cause.  We quickly found that many applications were experiencing intermittent connectivity issues.

    We subsequently identified a heavy volume of packets coming in through the load balancers for custom domains in the us-central1 region.  The volume of packets was 1000x normal traffic volume.

    We expected this would cause issues with the default ingress system, which the load balancers were pointing to.  However, this was not the case, as the ingress controllers were running normally.

    At this time, we continued to investigate the traffic.  Given the volume, our running assumption was that this was a DDoS attack.
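
    For context, multipliers like the 1000x above come from comparing the interface packet rate against a recorded baseline. A minimal way to sample that rate on a Linux host (the interface name here is an assumption):

        # Sample the receive packet rate of a network interface by diffing the
        # kernel's per-interface counter. "eth0" is an assumed interface name.
        import time

        def rx_packets(iface="eth0"):
            with open(f"/sys/class/net/{iface}/statistics/rx_packets") as f:
                return int(f.read())

        def packets_per_second(iface="eth0", interval=5):
            before = rx_packets(iface)
            time.sleep(interval)
            return (rx_packets(iface) - before) / interval

        if __name__ == "__main__":
            # Compare the result against your normal baseline to get a multiplier.
            print(f"current rx rate: {packets_per_second():.0f} packets/sec")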

    15 July - 14:11 UTC / 09:11 CDT

    The surge of traffic was over and network connectivity returned to normal operating levels.  

    During the attack, we layered in additional DDoS protection (via Google Cloud Armor settings), which we hoped had kicked in and resulted in the decrease in volume.  

    Unfortunately, this was not the case, and it appears the attacker had just given up for the day.

    15 July

    We continued to investigate the attack and preventive measures throughout the day.  

    At that time, we were able to determine the packets were failing at layer 4 (TCP), which means the packets never made it to our ingress system.  The flood of packets had overwhelmed the network interfaces themselves on our servers, which was causing sporadic network connectivity with heavy loads of dropped packets.

    We were then comfortable concluding this was a DDoS attack / SYN Flood attack.  We continued to work through the day to investigate the situation and discuss with our Google partners how traffic was making it through to us and what could be done to mitigate this particular issue and others like it.  They suggested we layer in some additional policies through Google Cloud Armor to our network setup to help with additional DDoS protections, which was put into place.
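
    The dropped packets mentioned above are visible in the kernel's per-interface drop counters; a simple way to watch them (the interface name again is an assumption):

        # Watch per-interface drop counters; steadily rising values while the
        # link is under load mean the interface is shedding traffic.
        # "eth0" is an assumed interface name.
        import time

        def read_stat(iface, stat):
            with open(f"/sys/class/net/{iface}/statistics/{stat}") as f:
                return int(f.read())

        if __name__ == "__main__":
            iface = "eth0"
            while True:
                print(f"{iface}: rx_dropped={read_stat(iface, 'rx_dropped')} "
                      f"tx_dropped={read_stat(iface, 'tx_dropped')}")
                time.sleep(10)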

    16 July - 12:57 UTC / 07:57 CDT

    We started receiving alerts for various resource outages.  We identified that the issue was the same as the day before.  This time the attack was even stronger and did not stop after a short period of time.  We were seeing over 1500x the number of packets per second that we see on a normal day of operation.  Unfortunately, the attack was not coming from a single source IP or range, so blocking traffic by source was not feasible.

    We spun up more servers and ingress controllers to attempt to "handle" the load.

    This had a positive impact on app connectivity, but the root cause was clearly still there.

    We needed to shed the incoming load.
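
    For reference, the conclusion that blocking by source was not feasible is something you can verify directly: group the sources of half-open connections into prefixes and check whether a small set of ranges dominates (in this attack, none did). A sketch of that check on a little-endian Linux host:

        # Group the sources of half-open (SYN_RECV) connections into /24
        # prefixes; if no small set of prefixes dominates, per-source blocking
        # will not help. Assumes a little-endian host for the /proc/net/tcp
        # address encoding.
        import ipaddress
        from collections import Counter

        def syn_recv_sources(path="/proc/net/tcp"):
            sources = []
            with open(path) as f:
                next(f)  # skip the header line
                for line in f:
                    fields = line.split()
                    if fields[3] != "03":  # keep only SYN_RECV entries
                        continue
                    hex_ip = fields[2].split(":")[0]  # remote address, hex-encoded
                    sources.append(ipaddress.IPv4Address(bytes.fromhex(hex_ip)[::-1]))
            return sources

        if __name__ == "__main__":
            prefixes = Counter(ipaddress.ip_network(f"{ip}/24", strict=False)
                               for ip in syn_recv_sources())
            for prefix, count in prefixes.most_common(10):
                print(prefix, count)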

    16 July - 13:29 UTC / 08:29 CDT

    We applied stricter policies to our firewall and Google Cloud Armor to help mitigate the attack.  

    This had an immediate impact, bringing the packet load down to only 30x our normal network load.

    Applications largely started working again, but there was clearly degraded performance within our network.

    16 July - 14:33 UTC / 09:33 CDT

    To try to break up the inflow of traffic, we stopped processing all traffic to the affected load balancer.  This took applications offline that were on the default ingress system.  We applied some new changes to our firewall rules to attempt to solve the problem, which were ultimately unsuccessful.

    16 July - 14:43 UTC / 09:43 CDT

    We restarted traffic on the offending load balancer.  The attack was still present, but the load did lessen to about 28x normal.

    We continued to dig through logs and monitors to try to find any way to filter out the traffic reasonably.

    16 July - 14:56 UTC / 09:56 CDT

    At this time, we were able to identify points where the system was handling the problem poorly.  The attack was taking advantage of spoofed IP addresses and heavy amounts of TCP retries.  

    That knowledge allowed us to apply changes that detect this situation and silently drop the offending packets, rather than repeatedly sending replies back to bogus destinations.

    Assuming these attacks may persist in the future, we also applied additional changes to our network to recognize and handle the same type of attack at a volume several orders of magnitude greater than the one we experienced during this incident.
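
    Our actual detection lives in the network layer and is not reproduced here, but the underlying decision is simple: when a large number of SYNs arrives and almost none of them complete the three-way handshake, the traffic is treated as a flood and the offending packets are dropped rather than answered. A toy illustration of that decision:

        # Toy illustration of the detection idea only, not Gigalixir's
        # implementation: flag a window as a SYN flood when many SYNs arrive
        # but almost none complete the handshake.
        from dataclasses import dataclass

        @dataclass
        class WindowStats:
            syns_received: int
            handshakes_completed: int

        def is_syn_flood(stats, min_syns=1000, max_completion_rate=0.05):
            if stats.syns_received < min_syns:
                return False  # too little traffic to call it a flood
            completion_rate = stats.handshakes_completed / stats.syns_received
            return completion_rate < max_completion_rate

        if __name__ == "__main__":
            print(is_syn_flood(WindowStats(50_000, 120)))  # True: flood-like
            print(is_syn_flood(WindowStats(800, 790)))     # False: normal traffic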

    16 July

    We continued to monitor and harden the system.  We added new policies to ensure all new systems would also have these rules applied at creation.  We added additional traffic alerts to our system to help identify similar situations more quickly in the future.

    Finally, we continue to speak with our Google partners' Cloud Armor experts about the situation and to get their advice on additional strategies to put in place.  We expected (and still expect) that Cloud Armor should be able to mitigate these types of attacks.  We will continue to work with our partners to apply any changes they recommend and improve our network protection.

  • Investigating

    We are experiencing a resource attack in the GCP us-central1 region. We are working to mitigate the issue.

    Applications are currently experiencing degraded performance and outages. We are working with our Google partners to resolve this issue.