Lessons learned from the CrowdStrike and Microsoft outage

By now, most in IT have heard of the global impact from the CrowdStrike software bug. Essentially, a ‘content update’, not a change in software code, caused Microsoft Windows-based systems to crash. The impact was swift and widespread, ranging from ground stops at three of the major US airlines to delays in healthcare appointments.

What is the difference between a content update and a software code update? In the end, nothing. It’s a nuance that is a red herring in the big picture. The bottom line is that a change to the CrowdStrike service caused a widespread failure which, in turn, caused Windows-based systems to crash with the blue screen of death (BSOD). Microsoft estimates that the update impacted 8.5 million Windows devices, or less than 1% of all Windows systems. Even so, an impact to less than 1% of all Windows systems has dramatic business and economic consequences.

It is now several days later, and customers are still recovering from the fallout, which further extends the impact.

Crisis management communications need a plan

Mistakes happen. The key is how prepared you are to deal with the crisis.

Technology is imperfect. Technology is created by humans and humans are imperfect. Even so, there are many ways to hedge against this imperfection which I will address in a minute.

Several components go into a good crisis management plan:

A plan. It is incredible how many organizations do not have a crisis management plan. The plan needs to address the scenarios, players, risks and actions. Plans need to be a cross-organization collaboration, communicated and exercised regularly. This is as true for technology vendors as it is for IT organizations. Remember that a crisis management plan is a living document that is constantly revised and updated.
Communications. Identify who is going to communicate and what they are going to communicate. Keep customers and stakeholders informed on what has been identified, next steps and expectations. Those may change as more information becomes available. Transparency is key here. So is empathy. It is easy to solely focus on your own situation and forget what customers and stakeholders are going through. It is also easy to focus on the mechanics of the crisis without considering what others are having to deal with. Hence the importance of empathy.
Assessment. Clearly articulate the assessment process used to validate the issue and determine how to recover from it. Including common scenarios in the plan gives teams a starting point for their thinking.
Recovery. Identify the plan of recovery based on the assessment. What actions need to happen for all parties involved? What is the impact and actions required for stakeholders and customers?
Postmortem. After the incident, conduct a full postmortem. What happened? Why did it happen? Who did it impact? How should it be addressed in the future? Consider the actions that caused the crisis along with the process followed through the crisis. What can you learn to improve the process for the next time? Change the crisis management plan accordingly.
Close Out Communications. Even after the crisis has passed and recovery is complete, there is still work to be done. One of those actions is to rebuild the trust and confidence of customers and stakeholders that was lost in the process. There are likely financial, legal and/or reputational consequences to address too.

In the case of the CrowdStrike communications, there were several issues. The CEO did come out quickly to talk about the issue. However, the way he did so not only missed a few steps but also seemed a little arrogant and tone-deaf to the situation.

In an interview with CNBC’s Jim Cramer, CrowdStrike CEO George Kurtz responded to questions about the outage with statements like they have “done it a long time” and that they perform “testing through the lifecycle of updates” and finally that “not all customers are impacted”. These kinds of statements are likely aimed squarely at calming investors. However, they aggravate customers, who question these statements with responses like ‘if true, then how did this happen?’ In the end, it just makes a bad situation worse.

Understanding your audience is key. However, if you frustrate customers, that will ultimately impact investors and other stakeholders.

Questions and advice to CIOs

In the immediate aftermath of the outage, I started to field the questions that are common when a crisis erupts. Customers asked if they should consider a different vendor and if they should question cloud. I was also getting questions about whether vendors can be too big.

These are common questions to ask, but the last one is one that needs a bit more discussion in a separate post.

My initial response was: Take a deep breath and avoid the knee-jerk reaction.

It is natural to jump into crisis mode, dissect the issue and drive hard to resolve it. However, that is not always the best course of action. In this case, a measured response is needed in accordance with the nature of the cause and the steps to recover.

Customers need to listen to both CrowdStrike and Microsoft for recommendations on the best course of remediation. From that, devise a plan to recover in accordance with your own crisis management plan. From experience, this can be the most painful part and requires discipline to manage effectively.

Several days after the start of the outage, customers are still deeply engaged in the recovery process.

Advice to CrowdStrike and Microsoft

Both vendors need to revisit their approach and consider how they respond to customers and stakeholders. Even though the outage was initially caused by a change to CrowdStrike’s solution, Microsoft also bears some culpability for the issue.

Specifically, why did a change in a third-party solution like CrowdStrike cause a full BSOD rather than simply a reduction in service? ‘We fixed the issue, so reboot your systems’ may be the solution, but it is not the answer. There are ramifications that will far outlast the initial recovery.

Communications need to be quick, candid and transparent. The CEOs of both companies were quick to respond. The key here is to be candid and transparent with both the findings and recovery efforts.

While both companies were quick to respond, the responses missed several points. One key miss was that empathy came later, leaving a bitter taste with customers. Is it the end of the world? No. It does, however, create credibility questions and tarnish the brand, which just creates another hurdle to tackle.

Each vendor needs to conduct a full postmortem. Considering the significance of the issue, I would recommend that both companies publicly post their postmortems. Each postmortem should reflect a candid, transparent and unbiased view of what happened, why it happened and what is being done to prevent it in the future.

This is also a good time to bring public relations into the mix and devise a plan to rebuild trust with customers. It must start with customers but should also consider questions from stakeholders and investors. This is about rebuilding trust and credibility with the CEOs and their respective brands.

A candid discussion about risk

Bringing this conversation full circle, CIOs, their executive teams and boards need to have a holistic conversation about risk. Too often the emphasis is on cybersecurity and ransomware. Those are important. But as this outage shows, there are other forms of risk that can be just as damaging to an organization, or more so.

Third-party and technical debt risk are key areas to consider. As with any risk, consider what the financial, legal and reputational risk is to the company, stakeholders and customers.

And finally, watch out for those who take advantage of a crisis with malicious activities. As if the crisis itself isn’t enough, bad actors will use it as an opportunity while teams are distracted.

More to come

In summary, as systems and architectures become increasingly complicated and we rely more heavily on technology to run our businesses, the risk continues to increase.

As mentioned in this year’s CIO Playbook for 2024, simplification is one hedge against complexity. Unfortunately, we can’t simplify everything, but we need to do what we can.

At the same time, as doom-and-gloom as it sounds, we can expect more situations like this most recent example to happen in the future. The challenge will be how we prepare for the next one.

Update from CrowdStrike 7/24/2024: https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/


2 comments

Iqra Technology says:

August 16, 2024 at 3:53 am

The CrowdStrike outage underscores the critical need for effective crisis management and communication. A seemingly minor ‘content update’ led to significant disruptions, demonstrating how even small changes can have widespread effects. This incident highlights the importance of having a robust crisis management plan, including clear communication strategies and empathy towards affected parties. Both CrowdStrike and Microsoft must refine their response strategies and conduct thorough postmortems to rebuild trust. As technology becomes more complex, a holistic approach to risk management, including third-party and technical debt risks, is essential. The lessons learned here will be invaluable for preparing for future challenges.
