Best Practices for SaaS and Network Incident Management

By Simon Dion

Aug 1, 2023

4 minutes

Exoprise

Computer and network systems have (obviously) become vital to business operations. Occasionally, SaaS or network incidents occur and these systems do not operate as needed. Enterprises want to minimize the potential damage and get their systems back online ASAP. Integrated incident management and a strong End User Experience Management (EUEM) platform that provides synthetic and real-user monitoring are the foundation for meeting that objective.

With the advent of newer technologies, like cloud, digital, mobile, and social media, corporations have become increasingly dependent on their network infrastructure to get work done. However, network systems are imperfect, and sometimes service, application, or network outages occur and damage the business or even affect customer relations. In June 2023, for example, a series of outages hit Microsoft 365 services.

When services that companies depend on fail, such as Microsoft Exchange, Outlook, Azure AD or Teams, employees are unable to work together or with customers. You want to manage these incidents, troubleshoot as quickly as possible, inform everyone, and get employees operating again as fast as possible.

Consequences of Enterprise Network and SaaS Incidents

Lost Productivity

When Microsoft Teams or WebEx is down, employees can’t connect and collaborate with their fellow employees. Seems simple enough to understand. But it can get worse than that depending on the network, service, and computer conditions.

With the rise, fall, and moderation of hybrid and at-home work, employees often experience poor Wi-Fi signals or home-network incidents and are at a loss for how to triage what could be going wrong. The result is lost productivity, and if that employee is customer-facing, the damage extends beyond the individual to the business itself.

The Cost of Downtime

Business opportunities fade away. Ponemon Institute’s Cost of Data Center Outages study found that the average cost of a data center outage is $740,357 and the price runs as high as $2.4 million.
A study by the ITIC (Information Technology Intelligence Consulting) added that a large enterprise loses from $100,000 to $500,000 for a single hour of downtime, depending on the nature of the business and the problem’s impact.
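
To make the hourly figures concrete, here is a minimal sketch that turns the ITIC range into an estimate for an outage of a given length. It assumes cost scales linearly with duration, which is a simplification and not something either study claims; the function names are illustrative.

```python
from typing import Tuple

# ITIC's quoted range for a large enterprise, in dollars per hour of downtime.
ITIC_LOW_PER_HOUR = 100_000
ITIC_HIGH_PER_HOUR = 500_000


def downtime_cost(hours: float, hourly_loss: float) -> float:
    """Estimate outage cost as duration times hourly loss (linear assumption)."""
    if hours < 0 or hourly_loss < 0:
        raise ValueError("hours and hourly_loss must be non-negative")
    return hours * hourly_loss


def downtime_cost_range(hours: float) -> Tuple[float, float]:
    """Return the (low, high) estimate for an outage lasting `hours`."""
    return (
        downtime_cost(hours, ITIC_LOW_PER_HOUR),
        downtime_cost(hours, ITIC_HIGH_PER_HOUR),
    )


low, high = downtime_cost_range(1.5)
print(f"90-minute outage: ${low:,.0f} to ${high:,.0f}")
# prints: 90-minute outage: $150,000 to $750,000
```

Real losses rarely scale this neatly, since a short outage during peak hours can cost more than a long one overnight, so treat the output as a rough planning number.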

The Microsoft 365 outages mentioned above are a case in point: when Exchange Online or Microsoft Teams goes down, every organization that depends on it bears a considerable cost.

Reputational Damage

Soft costs occur as well. System downtime often permanently damages your brand image. If customers cannot access your services or products, then they will stop using them.

However, there are more potential problems. Nowadays, users have social media platforms where they share their bad experiences with friends and families. So, when an outage occurs, the information often goes viral. As a result, about 40% of disruptions lead to brand reputation damage, according to International Data Corp.

Negative Impact on Employee Morale

Employees may feel dispirited when an outage occurs. They come into work to complete various tasks and are hamstrung as long as the system is offline or performing poorly. Once it is restored, they feel overwhelmed by the workload and feel the need to catch up ASAP. Consequently, outages not only affect immediate productivity but also employee engagement, satisfaction, and ultimately retention. If such digital experience problems are frequent, employees may simply look for another place to work, one where they believe the company cares more about IT excellence.

What to Do About Downtime?

Downtime arises for unexpected reasons, so the first step is to be prepared. An incident management best practices checklist starts with risk assessment. The goal is to identify potential outage sources, create a plan for recovery, and, when problems arise, take steps to prevent similar incidents in the future. Ultimately, the plan reduces the damage downtime causes.

The checklist, sometimes called run book automation, outlines the steps the company follows when an unexpected problem arises. The blueprint helps teams improve response and recovery times, so business operations are restored quickly and effectively.

Put a Process in Place

The business must form an incident response team, which becomes responsible for crafting procedures. The team needs to be cross-functional and often includes third parties, such as service providers and consultants.
The group should have established routines for:
- Detection. Here, the company discovers that an incident is occurring and starts collecting evidence and assessing the severity of the event.
- Containment. The business tries to limit the effect of an incident.
- Eradication. This step involves the removal of the root cause of the incident.
- Restoration. At this point, the affected systems and devices return to standard operations.
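
The four routines above can be sketched as a simple ordered workflow. This is an illustrative model only, not part of any product: the `Phase` and `Incident` names and the sample notes are hypothetical.

```python
from enum import Enum
from typing import List, Tuple


class Phase(Enum):
    """The four routines, in the order the response team works through them."""
    DETECTION = 1
    CONTAINMENT = 2
    ERADICATION = 3
    RESTORATION = 4


class Incident:
    """Tracks which phase an incident is in and keeps a log for the final report."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECTION
        self.log: List[Tuple[Phase, str]] = [(Phase.DETECTION, "incident opened")]

    def advance(self, note: str) -> Phase:
        """Move to the next phase and record what was done to get there."""
        if self.phase is Phase.RESTORATION:
            raise ValueError("incident is already in the final phase")
        self.phase = Phase(self.phase.value + 1)
        self.log.append((self.phase, note))
        return self.phase


incident = Incident("Teams calls failing")
incident.advance("affected users moved to a backup conferencing tool")
incident.advance("faulty network change rolled back")
incident.advance("service verified healthy")
print([phase.name for phase, _ in incident.log])
# prints: ['DETECTION', 'CONTAINMENT', 'ERADICATION', 'RESTORATION']
```

Keeping a note with every transition means the log doubles as raw material for the post-incident report.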

After a business continuity plan has been created, the company needs to test it. Periodically, at least annually, they run a test and determine how well they would respond to an outage.

Continuously Evaluate the Checklist

Today, applications are dynamic and constantly change. Therefore, an incident management best practices checklist must be constantly reviewed, evaluated, and updated to reflect changes to business operations, personnel, and an ever-expanding IT infrastructure landscape. Outdated plans cause confusion and undermine effective incident response.

Exoprise: The Best Option

The best way to address downtime is to proactively monitor Software-as-a-Service (SaaS) applications and the networks they depend on. This is what CloudReady code-free synthetics are for, and they provide advance notification of outages. Outages will be detected before employees know or are impacted.
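
For readers unfamiliar with the idea, a synthetic check runs a scripted user action on a schedule and raises an alarm when it fails or slows down. The sketch below shows that general pattern only; it is not how CloudReady is implemented, and `run_probe`, `ProbeResult`, and the stand-in actions are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    ok: bool
    elapsed_seconds: float
    detail: str


def run_probe(action: Callable[[], None], max_seconds: float) -> ProbeResult:
    """Run one scripted action, timing it and catching any failure."""
    start = time.monotonic()
    try:
        action()
    except Exception as exc:
        return ProbeResult(False, time.monotonic() - start, f"failed: {exc}")
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:
        return ProbeResult(False, elapsed, "slow response")
    return ProbeResult(True, elapsed, "healthy")


def fake_sign_in() -> None:
    """Stand-in for a scripted user action, such as signing in to a SaaS app."""
    time.sleep(0.01)


def broken_sign_in() -> None:
    """Stand-in for the same action during an outage."""
    raise ConnectionError("service unreachable")


print(run_probe(fake_sign_in, max_seconds=2.0).detail)    # prints: healthy
print(run_probe(broken_sign_in, max_seconds=2.0).detail)  # prints: failed: service unreachable
```

Because the probe runs whether or not anyone is using the app, a failure can surface before the first employee notices.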

Exoprise Service Watch gives you the confidence of complete coverage for every Desktop, UC and SaaS app that employees depend on. Securely deployed, Service Watch can be combined with CloudReady to enable IT to proactively increase uptime and reduce the impact of network incidents.

Exoprise’s CloudReady and Service Watch solutions help with performance problem prevention, amelioration, and examination. Their graphical dashboards present executives with summary information in a clear manner, so they address issues that need attention in real time. They also automate the discovery and validation of an outage. For example, when a Microsoft Teams outage occurs:

Step 1: Look at synthetics for that app
- Exoprise provides a complete picture of Teams performance end-to-end and its dependencies
- Are any network components or sub-systems failing?
Step 2: Examine integrated service health with M365 or Twitter feeds
- Is Microsoft reporting anything through the integrated Service Health Status or Twitter feeds?
Step 3: If the incident is triggered by the new real-time alarms of Service Watch, check to see the health of the desktops or Device Group
- Examine Desktop Experience Scores
Step 4: Drill down and examine different pieces, like ISP link, Teams jitter, packet loss, and latency
- Captured in real-time for each device and meeting
Step 5: Integrate with ServiceNow via web or email hooks, so the incident can be collaborated on
- Escalate among the network and operations teams
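
The triage order above can be sketched as a small decision function followed by a ticket payload. Everything here is an assumption for illustration: the thresholds, the 0 to 100 score scale, and the `Evidence` structure are hypothetical, and the payload field names are modeled on ServiceNow's incident table and should be verified against your own instance. Nothing is sent over the network.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Evidence:
    synthetics_failing: bool      # Step 1: are the app's synthetic checks failing?
    provider_reports_issue: bool  # Step 2: is the provider reporting an incident?
    desktop_score: float          # Step 3: device health, hypothetical 0-100 scale
    packet_loss_pct: float        # Step 4: network measurements
    jitter_ms: float
    latency_ms: float


def likely_cause(e: Evidence) -> str:
    """Narrow the fault domain; all thresholds are illustrative, not vendor values."""
    if e.provider_reports_issue:
        return "provider outage"
    if e.synthetics_failing:
        return "service or shared network path"
    if e.packet_loss_pct > 2.0 or e.jitter_ms > 30.0 or e.latency_ms > 150.0:
        return "local network or ISP link"
    if e.desktop_score < 50.0:
        return "endpoint device"
    return "no fault found"


def incident_payload(app: str, e: Evidence) -> str:
    """Build the JSON body a webhook could post to the ticketing system."""
    return json.dumps({
        "short_description": f"{app}: {likely_cause(e)}",
        "description": json.dumps(asdict(e)),
    })


evidence = Evidence(
    synthetics_failing=False,
    provider_reports_issue=False,
    desktop_score=82.0,
    packet_loss_pct=4.5,
    jitter_ms=12.0,
    latency_ms=90.0,
)
print(likely_cause(evidence))  # prints: local network or ISP link
print(incident_payload("Microsoft Teams", evidence))
```

Attaching the raw measurements to the ticket lets the network and operations teams start from the same evidence instead of re-collecting it.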

Once an incident has been prevented, mitigated, or resolved, the incident response team creates a report on what happened and how the incident was handled. This information becomes the foundation for any lessons learned. With the insights, the company then adjusts its incident management best practices checklist and prepares for the next blip.

Businesses rely on technology today more than ever before, and downtime can be catastrophic. Organizations need to put tools and processes in place to be ready for such problems. The Exoprise solution enables them to proactively track system degradation, put countermeasures in place, examine performance information, and fine-tune their systems. With it, corporations minimize downtime from SaaS and network incidents, ensuring that the computer infrastructure is available to employees and work gets done.