What we learned from the Brevo outage last week

Overview:

  • Major global outage on all our public services
  • Severity: High
  • Start time (UTC): 2021-08-12T04:00:00
  • End time (UTC): 2021-08-12T21:25:00
  • Duration: 17 hours and 25 minutes
  • Scope: Entire Brevo platform
  • Root cause: Core network – switch failure
  • Detection: Monitoring
  • Resolution: Switched the network traffic to new ports

Incident details:

On August 12, 2021, Brevo experienced a large-scale incident that impacted the availability of a large number of our services, including the API, the client dashboard, certain web application interfaces, redirection links, and the sending of transactional and marketing emails. This resulted in the potential loss or delay of email delivery for up to 17 hours and 25 minutes.

At Brevo, we know how important timely and reliable communication is for our clients, and we sincerely apologize for the impact this has had on all of you who have placed your trust in our services. We acknowledge that this is unacceptable, and our goal with this report is to explain exactly what happened, as well as what we will do to ensure that it does not happen again.

While the incident lasted almost 18 hours, most of this time consisted of diminished performance while the network was recovering. Thanks to the hard work of the infrastructure team, we were able to replicate some of our critical services on different servers and bring them back online. However, some data, especially around email sending and API calls, was lost during the outage.

Background

The incident originated in a part of our internally managed infrastructure that is composed of several availability zones, each of which includes a rack of servers linked by a series of switches.

Six months ago, we started maintenance on these switches to improve the performance and reliability of our network at scale. This involved replacing our 1Gb switches with 10Gb switches in all of our availability zones. Each availability zone has a pair of switches (linked via VLT so that one can take over if the other fails) in order to survive a rack-level failure. In addition, each component is backed by two power supplies.

Over the past few months, we successfully deployed the new 10Gb switches in all 3 availability zones. To do this safely and without interruptions, we regularly moved critical services to racks that were not under maintenance, so that a failure on a rack being worked on would have no impact.

The last part of this maintenance had yet to be completed: each availability zone still has one 1Gb switch that sits between the pair of 10Gb switches and the rack of servers (this remaining work is unrelated to the incident discussed here).

2021-08-12 04:00 UTC: Switch failure detection

At 04:00 UTC, our infrastructure team received an alert from one of our monitoring processes notifying us that a network switch was not behaving normally and that the servers behind it were unreachable via ping.

The team quickly identified a global issue, but we were unable to determine exactly where the core problem was. This prompted us to open an incident on our status page at 04:11 UTC (https://status.sendinblue.com/pages/incident/586a5ae632dde2fc5b0013c1/61149f59c34053098f1c1b86) and to start investigating.
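To give a sense of what such a monitoring check looks like, here is a minimal, purely illustrative sketch of a ping-based reachability probe in Python. The host names, timeout, and alerting logic are placeholders, not our actual monitoring stack.

```python
"""Minimal ping-based reachability probe (illustrative only).

Pings a list of hosts and reports any that stop answering, similar in
spirit to the check that raised the 04:00 UTC alert. Host names and the
alerting behaviour are placeholders.
"""
import subprocess

# Hypothetical host names; a real monitor would load these from an inventory.
HOSTS = ["db-az1.internal", "smtp-az1.internal", "api-az1.internal"]

def is_reachable(host: str, timeout_s: int = 2) -> bool:
    """Return True if the host answers a single ICMP echo request."""
    # "-c 1" sends one packet; "-W" is the per-packet timeout (Linux ping).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    unreachable = [h for h in HOSTS if not is_reachable(h)]
    if unreachable:
        # A real system would page the on-call engineer here.
        print("ALERT: hosts unreachable via ping: " + ", ".join(unreachable))
    else:
        print("All hosts reachable.")
```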

2021-08-12 04:28 UTC: Switch reboot

At 04:28 UTC, we asked our data center operator to reboot the faulty switch. The reboot was completed at 04:36 UTC; however, the problem persisted, and Brevo engineers decided to go to the data center to access the hardware directly and investigate further.

Fortunately, most of our infrastructure team is based in Paris, allowing us to be present in the data center within 1 hour when necessary.

2021-08-12 06:00 UTC: Network investigation

At 06:00 UTC, we escalated the issue to our network engineering team, who until this point had had no remote access to the network equipment because of the outage, and a deeper network analysis began.

Our initial findings showed:

  • One port channel was down.
  • The VLT protocol was in an unusual state, and we were unable to determine which switch was the primary and which was the secondary.

At 07:00 UTC, we knew that we had completely lost one availability zone. In order to mitigate the downtime and fix the issue, we performed the following:

  • Changed the VLT priority to try to force a normal state
  • Rebooted the switches
  • Reviewed the switch logs

Despite these steps, the issue persisted and connectivity to the downed availability zone was still not restored.
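For readers less familiar with this kind of troubleshooting, the sketch below shows roughly how such status checks can be scripted rather than typed by hand on each switch. It is a hypothetical example: it assumes the netmiko Python library, a Dell OS10-style CLI (show vlt, show port-channel summary), and placeholder addresses and credentials, none of which necessarily reflect our actual equipment or tooling.

```python
"""Hypothetical sketch: polling a switch for VLT and port-channel health.

Assumes the netmiko library and a Dell OS10-style CLI; the management IP,
credentials, and exact command syntax are illustrative assumptions.
"""
from netmiko import ConnectHandler

SWITCH = {
    "device_type": "dell_os10",   # assumed platform string
    "host": "10.0.0.1",           # placeholder management IP
    "username": "admin",
    "password": "********",
}

def check_switch_health() -> None:
    conn = ConnectHandler(**SWITCH)
    try:
        # VLT status: in a healthy pair, one switch reports itself as
        # primary and the other as secondary.
        print("VLT status:\n", conn.send_command("show vlt 1"))

        # Port-channel summary: a downed port channel was one of the
        # first symptoms observed during this incident.
        print("Port channels:\n", conn.send_command("show port-channel summary"))
    finally:
        conn.disconnect()

if __name__ == "__main__":
    check_switch_health()
```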

2021-08-12 08:00 UTC: Data center intervention and mitigation

At 08:00 UTC, half of the team was dedicated to mitigating application issues, while the other half investigated at the data center in order to find and fix the network problem.

At this time, we performed an inventory of all the affected applications (and the reasons why they were affected). Unfortunately, we determined that 2 of the 3 members of the cluster for one of our main databases were temporarily located in the disconnected availability zone, as a result of the ongoing network hardware upgrades.

Knowing that fixing the network issue could take a significant amount of time, the team began building a new database in the remaining availability zones and importing a backup into it, in order to bring back a portion of our applications.
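For illustration only, a restore of this kind boils down to provisioning an empty instance in a healthy zone and streaming the latest backup into it. The sketch below assumes a MySQL-style SQL dump and placeholder host and database names; the actual database technology and procedure are not detailed in this report.

```python
"""Illustrative sketch of restoring a database backup onto a new host.

Assumes a MySQL-style SQL dump and placeholder names; the real database
technology and restore procedure are not described in this report.
"""
import subprocess

BACKUP_FILE = "backup-2021-08-11.sql"   # hypothetical latest dump
NEW_DB_HOST = "db-az2.internal"         # placeholder host in a healthy zone
DB_NAME = "main_db"                     # placeholder database name

def restore_backup() -> None:
    """Stream the dump file into the freshly provisioned database."""
    with open(BACKUP_FILE, "rb") as dump:
        subprocess.run(
            ["mysql", "--host", NEW_DB_HOST, DB_NAME],
            stdin=dump,
            check=True,   # raise immediately if the restore fails partway
        )

if __name__ == "__main__":
    restore_backup()
    print("Restored", BACKUP_FILE, "into", DB_NAME, "on", NEW_DB_HOST)
```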

Meanwhile, at the data center, a visual check of the switches showed that the ports between the pair of 10Gb switches and the 1Gb switch were down, so we started by replacing the 1Gb switch with a spare we had on hand. However, this yielded the same result: the ports remained down.

2021-08-12 12:00 UTC: Recovery of our availability zone and a part of our applications

At 12:00 UTC, we determined that the fault lay not with the 1Gb switch, but rather with the pair of 10Gb switches. This was because we had lost VLT, which is meant to ensure high availability between the pair of redundant switches.

After configuring new interfaces on both 10Gb switches and physically re-plugging the 1Gb switch into a new port, we were able to successfully restore traffic to the downed availability zone.

Meanwhile, the other half of the team was able to restore part of our redirection link services, thanks to the restoration of our main database that had started two hours earlier.

2021-08-12 12:30 UTC: 3 availability zones are reachable

At 12:30 UTC, all 3 availability zones were reachable again. From a network point of view we were back to the initial state, which meant we could start recovering our applications.

One of the main services that Brevo provides is sending emails at scale. To manage the large volume of requests and the different steps involved in sending an email, we have complex data pipelines that queue emails at each step, from receipt via API or SMTP relay to coordination with the MTA servers that deliver the emails to the different ISPs, such as Gmail or Outlook. This is true for both marketing and transactional emails.
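As a purely illustrative model (not our actual implementation), the sketch below shows the shape of such a pipeline, with a queue between each stage from receipt to MTA hand-off. The stage names and routing logic are invented, and a production pipeline would use durable, distributed queues rather than in-memory ones.

```python
"""Toy model of a staged email pipeline (illustrative only).

Each stage reads from one queue and writes to the next, mirroring how an
email moves from receipt (API or SMTP relay) towards MTA delivery.
"""
import queue

received = queue.Queue()   # emails accepted via API or SMTP relay
prepared = queue.Queue()   # emails assigned to an MTA for delivery
delivered = []             # terminal state for this toy example

def receive(email: dict) -> None:
    """Stage 1: accept an email and queue it for processing."""
    received.put(email)

def prepare() -> None:
    """Stage 2: pull accepted emails and queue them for MTA delivery."""
    while not received.empty():
        email = received.get()
        email["mta"] = "mta-01"   # placeholder routing decision
        prepared.put(email)

def deliver() -> None:
    """Stage 3: hand emails to the MTA that talks to the ISP (Gmail, Outlook, ...)."""
    while not prepared.empty():
        delivered.append(prepared.get())

if __name__ == "__main__":
    receive({"to": "user@example.com", "type": "transactional"})
    prepare()
    deliver()
    print(len(delivered), "email(s) delivered")
```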

Because the main delivery system was unavailable, millions of emails were queuing up in these pipelines and our queues were filling to unusually high levels.

Because all of our applications are built to auto-scale in response to load fluctuations, bringing them all back online at once would have overloaded the available servers. We therefore brought applications back online one by one, in order of priority, to avoid another system-wide issue. This meant that email delivery was intentionally slowed so the backlog could be processed at a rate that limited the load on our resources.
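The sketch below illustrates the principle with invented numbers and job names: drain the backlog in priority order at a capped rate, so that downstream servers are never asked to absorb the entire queue at once.

```python
"""Illustrative sketch: draining a backlog by priority at a capped rate.

The priorities and the rate limit are invented; the point is that the
backlog is processed gradually rather than all at once after an outage.
"""
import heapq
import time

# (priority, job) pairs: a lower number means a higher priority.
backlog = [
    (0, "transactional email"),
    (1, "marketing email"),
    (2, "internal job"),
]
heapq.heapify(backlog)

MAX_JOBS_PER_SECOND = 2   # deliberately low throughput while recovering

def drain(backlog_heap: list) -> None:
    """Process the highest-priority jobs first, throttled to a fixed rate."""
    while backlog_heap:
        priority, job = heapq.heappop(backlog_heap)
        print(f"processing (priority {priority}): {job}")
        # Sleeping enforces the cap so downstream servers are not overloaded.
        time.sleep(1 / MAX_JOBS_PER_SECOND)

if __name__ == "__main__":
    drain(backlog)
```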

2021-08-12 15:00 UTC: 95% of our applications UP

At 15:00 UTC, 95% of our applications were up and running. The remaining processes were small jobs, internal applications, and non-critical customer-facing applications.

We continued bringing the rest of these services online, but the email delivery system was still throttled and a large backlog of emails was still in the queue waiting to be delivered. 

At 21:25 UTC, after nearly 18 hours, all emails in the backlog had been delivered and all applications were online and functioning normally.

Actions and next steps

As with any incident, and especially one as critical as this, our number one priority is to understand the root cause and to put measures in place that ensure this won't happen again.

We have already started taking steps to ensure this. We have commissioned an external expert to perform an audit to identify all potential single points of failure, as well as to investigate why the VLT failover mechanism failed.

Additionally, we are purchasing extra network switches in all of our data centers in order to have hardware on hand to replace any faulty switches immediately if another issue arises. 

We will also periodically run redundancy tests (cutting the connection to one availability zone) to ensure that the system can perform normally with no impact to our users if a similar situation occurs.
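As a rough outline of what such a test could automate (the endpoints and health checks below are placeholders, not our real services), the drill amounts to isolating one zone at the network level and then verifying that user-facing services still answer.

```python
"""Hypothetical outline of an availability-zone redundancy drill.

The endpoints are placeholders; a real test would isolate one zone at the
network level and then verify that user-facing services still respond.
"""
import urllib.request

# Placeholder health-check endpoints for user-facing services.
HEALTH_CHECKS = [
    "https://api.example.internal/health",
    "https://dashboard.example.internal/health",
]

def service_healthy(url: str, timeout_s: int = 5) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        return False

def run_drill(isolated_zone: str) -> None:
    # Step 1 (performed separately): cut the network link to `isolated_zone`,
    # for example by shutting down the relevant switch ports.
    print(f"Zone {isolated_zone} isolated; checking user-facing services...")
    failures = [url for url in HEALTH_CHECKS if not service_healthy(url)]
    if failures:
        print("Drill FAILED, redundancy gap found:", failures)
    else:
        print("Drill passed: services healthy with one zone down.")

if __name__ == "__main__":
    run_drill("az-1")
```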

Finally, we are expediting the process of moving more of our services to cloud providers, to ensure network reliability and redundancy at scale more easily.