Service Outage in North American Environment
Incident Report for Cisco Secure Email Threat Defense
Postmortem

Root Cause Analysis (RCA) Report for Incident on June 13th, 2023

Incident Details

On June 13th, 2023 at 19:09 UTC, our internal systems alerted us that the unprocessed message queue had exceeded its threshold. We were also unable to log in to the Secure Email Threat Defense portal in the US region.

What was the root cause?

A major outage of AWS Lambda services in the us-east-1 region caused an outage of the Secure Email Threat Defense service (message processing and UI). During this time, we were also unable to log in to the AWS console.

We sincerely apologize for the disruption this issue caused.

Incident Timeline

Date/Time (UTC) Comments
19:09 June 13, 2023 Internal systems detected the issue and alerted a Secure Email Threat Defense engineer, who acknowledged the alert and started working on the issue.
20:07 June 13, 2023 The Secure Email Threat Defense team identified the root cause and sent a customer email notification.
20:30 June 13, 2023 As AWS had provided no clear ETA for resolution, the Secure Email Threat Defense team initiated the Business Continuity Plan (BCP).
20:56 June 13, 2023 The Secure Email Threat Defense team was able to log back in to the Secure Email Threat Defense portal in the US region and observed an improvement in Lambda error rates and processing queues.
21:23 June 13, 2023 The Secure Email Threat Defense team sent an updated email notification to customers.

What is Cisco doing to prevent this issue from happening again?

AWS experienced a region-level failure that impacted the ability of Secure Email Threat Defense to process and remediate emails.

We are actively investigating an immediate failover strategy to a backup region to improve the resilience of our systems and mitigate similar issues in the future.

Posted Jun 21, 2023 - 21:22 UTC

Resolved
The degradation announced earlier is now resolved. We have verified that no data was lost, but you may see a delay for some messages that were processed earlier today. Search and Report data is now up to date.
Posted Jun 14, 2023 - 02:55 UTC
Update
We are continuing to monitor for any further issues.
Posted Jun 14, 2023 - 02:55 UTC
Update
We are continuing to monitor for any further issues.
Posted Jun 14, 2023 - 02:54 UTC
Monitoring
The outage with our upstream cloud services is resolving, but we are still seeing a degradation in service. More information will be provided as it becomes available.
Posted Jun 13, 2023 - 21:07 UTC
Investigating
We are currently investigating an outage in our upstream cloud services that is causing some functionality to be interrupted. More information will be provided as it becomes available.
Posted Jun 13, 2023 - 20:07 UTC
This incident affected: North America (Management Portal).