Microsoft 365 Outage, March 15th 2021
Exoprise CloudReady provides early detection of mission-critical mail outages. On March 15, Microsoft had a worldwide service outage that impacted services such as Teams AV, Yammer, OneDrive, and Azure Active Directory. Users reported being unable to log in to any of these services and received timeout messages. Exoprise detected the issue at 3 pm EDT (19:00 UTC), about 40 minutes before Microsoft reported it, and immediately relayed the news to its customer base.
Users may be unable to access multiple Microsoft 365 services
The following Microsoft Service Communication Message was received at Mon, 15 Mar 2021 19:40:05 +0000
Id: MO244568
Title: Users may be unable to access multiple Microsoft 365 services
WorkloadDisplayName: Microsoft 365 suite
Status: Investigating
AffectedTenantCount:
Classification: Incident
StartTime: Mon, 15 Mar 2021 19:34:22 +0000
EndTime:
FeatureDisplayName: Portal
ImpactDescription: Users may be unable to access multiple Microsoft 365 services.
LastUpdatedTime: Mon, 15 Mar 2021 19:40:05 +0000
Severity: Sev2
MessageType: Incident
Messages:
- PublishedTime: Mon, 15 Mar 2021 19:39:14 +0000
- MessageText:
Title: Users may be unable to access multiple Microsoft 365 services
User Impact: Users may be unable to access multiple Microsoft 365 services.
More info: Initial reports indicate that primary impact is to Microsoft Teams; however, other services including Exchange Online and Yammer are also impacted.
Current status: We’re investigating a potential issue and checking for impact to your organization. We’ll provide an update within 30 minutes.
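The 40-minute lead time mentioned earlier can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the 3 pm detection corresponds to 19:00 UTC and compares that with the time the Microsoft message MO244568 was received. The variable names are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumption: the 3 pm Exoprise detection corresponds to 19:00 UTC.
exoprise_detected = datetime(2021, 3, 15, 19, 0, tzinfo=timezone.utc)

# Time at which the Microsoft Service Communication Message was received.
microsoft_reported = parsedate_to_datetime("Mon, 15 Mar 2021 19:40:05 +0000")

lead_time = microsoft_reported - exoprise_detected
lead_minutes = int(lead_time.total_seconds() // 60)
print(f"Lead time: {lead_minutes} minutes")
```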
Exoprise Microsoft 365 Dashboard and Notice
Here is an example of how Exoprise proactively captures outages and provides complete coverage. Real-time integrated tweets help customers get updates and stay informed of the latest developments from Microsoft.
Early detection of an O365 service outage affecting Teams and Azure
M365 services (Teams, Yammer, OneDrive) impacted due to outage
Exoprise Teams AV Sensor Dashboard
Teams AV Stream Outage (Jitter and Packet Loss) started at 3 pm
Microsoft 365 Teams Outage affecting Login Time
Latest Updates
Title: Users may be unable to access multiple Microsoft 365 services
User Impact: Users may be unable to access multiple Microsoft 365 services.
More info: Any service that leverages Azure Active Directory (AAD) may be affected. This includes but is not limited to Microsoft Teams, Forms, Exchange Online, Intune and Yammer. Admins may also be unable to access the Service Health Dashboard.
Current status: We’ve identified the underlying cause of the problem and deployed an update to resolve the issue. The update has finished its deployment to all impacted regions. Microsoft 365 services continue the process of recovery and are showing decreasing error rates in telemetry. We’ll continue to monitor service health as availability is restored.
Scope of impact: This issue could affect any user.
Next update by: Monday, March 15, 2021, 7:00 PM (11:00 PM UTC)
Preliminary Root Cause of the Microsoft 365 Outage
Microsoft recently updated the root cause for this outage: it has to do with ongoing, enhanced security protection in Azure AD and the rotation of security keys. This is an excellent goal to pursue but, obviously, getting there can be a challenge. Read on for more insight into the cause. More detail can be found here: https://status.azure.com/en-us/status/history/
Preliminary RCA – Authentication errors across multiple Microsoft services (Tracking ID LN01-P8Z)
Summary of Impact: Starting approximately 19:00 UTC on March 15, 2021 customers may have encountered errors performing authentication operations for any Microsoft and third-party applications that depend on Azure Active Directory (Azure AD) for authentication.
The Azure Portal, Microsoft Teams, Exchange, Azure Key Vault, SharePoint and other applications have recovered. Other applications are in the process of recovering and impacted customers will continue to receive updates regarding these.
Preliminary Root Cause: The preliminary analysis of this incident shows that an error occurred in the rotation of keys used to support Azure AD’s use of OpenID, and other, Identity standard protocols or cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that “retain” state, leading it to remove that particular key.
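To make the failure mode concrete, here is a minimal sketch of a time-based key cleanup, written for illustration and not taken from Microsoft's systems. The class, field names, key ids, and the 30-day threshold are all assumptions. It contrasts the intended behavior, which honors a "retain" flag, with the behavior described in the RCA, where that flag is ignored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SigningKey:
    kid: str              # key identifier (hypothetical field names)
    last_used: datetime   # when the key last signed a token
    retain: bool = False  # operator flag: keep this key regardless of age


def select_keys_for_removal(keys, now, max_idle, honor_retain=True):
    """Return the ids of keys that the scheduled cleanup would remove."""
    doomed = []
    for key in keys:
        looks_unused = now - key.last_used > max_idle
        protected = honor_retain and key.retain
        if looks_unused and not protected:
            doomed.append(key.kid)
    return doomed


# Illustrative data only; the ages below are invented.
now = datetime(2021, 3, 15, 19, 0, tzinfo=timezone.utc)
keys = [
    SigningKey("old-key", now - timedelta(days=90)),
    SigningKey("migration-key", now - timedelta(days=45), retain=True),
    SigningKey("current-key", now - timedelta(days=1)),
]

intended = select_keys_for_removal(keys, now, timedelta(days=30))
buggy = select_keys_for_removal(keys, now, timedelta(days=30), honor_retain=False)
print(intended)  # ['old-key']
print(buggy)     # ['old-key', 'migration-key']
```

With the flag honored, only the unused key is selected. With the flag ignored, the retained key is removed as well.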
Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end-users were no longer able to access those applications.
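The effect on applications can be sketched the same way. The example below is a simplified illustration, not Azure AD's or any library's actual API. The `RelyingParty` class and key ids are hypothetical, and the check looks only at the key id, whereas real token validation also verifies the signature, issuer, audience, and expiry. It shows why failures appeared as applications refreshed their cached copy of the published key metadata.

```python
class RelyingParty:
    """An application that trusts tokens signed by keys listed in published metadata."""

    def __init__(self, fetch_published_kids):
        self._fetch = fetch_published_kids
        # Cache a copy of the published key ids at startup.
        self.trusted_kids = set(fetch_published_kids())

    def refresh_metadata(self):
        # Replace the cached key ids with the currently published ones.
        self.trusted_kids = set(self._fetch())

    def accepts(self, token):
        # Simplification: only the key id is checked here.
        return token["kid"] in self.trusted_kids


published = {"migration-key", "current-key"}
app = RelyingParty(lambda: published)
token = {"kid": "migration-key", "sub": "user@example.com"}

before = app.accepts(token)          # key is published and cached
published.discard("migration-key")   # the key is removed from the metadata
still_cached = app.accepts(token)    # the app has not refreshed yet
app.refresh_metadata()
after = app.accepts(token)           # the app no longer trusts the key
print(before, still_cached, after)   # True True False
```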
Next Steps: We understand how incredibly impactful and unacceptable this is and apologize deeply. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future.
Start a Free 15 Day Trial for Early Detection of Microsoft 365 Outages
You need uptime, not downtime. If you had Exoprise CloudReady earlier today, you’d have known about the outage about 40 minutes before Microsoft reported it and could have communicated it to users who might be waiting on that business-critical email. Rely on key evidence from Exoprise to make that next important decision. Invest in a pure-play Microsoft 365 monitoring tool that works hard to make sure your business is up and running. You need detailed metrics to get a better grip on an outage so you can troubleshoot quickly and also recover service credits from Microsoft.
Other vendors simply blog about the outage using Microsoft’s portal and service health messages and do not show how they actually captured the error and outage. Only Exoprise shows how it captures the errors in advance of Microsoft reporting the problem.
Better yet, start a free 15-day trial today. We’ve got your SaaS covered.