Talking Resilience: when an outage becomes a crisis, how deep should we dig for root cause?

20 June 2017

IT outages are nothing new. Just think about your own organisation. You know they happen a lot; they just aren’t as public as some that we’ve seen in the news recently.

What happens after an outage is generally dictated by the extent to which it compromises the business and how deeply it’s felt by customers, leaders and key stakeholders and, in a growing number of cases, how it’s portrayed in media coverage.

We often focus on the immediate aftermath of an incident, and talk about delivering a robust response accompanied by flawless incident management and crisis leadership that resolves issues quickly and impresses stakeholders. But what if we take a step back and ask how we can prevent the worst incidents from happening in the first place, by considering whether there is a common root cause?

On the surface we could attribute the cause of a major incident to the last thing that went wrong, which some may see as the proverbial “smoking gun”. In the case of a catastrophic IT failure this might mean, for example, that a power circuit failed, or there was a fire in a riser cabinet, or maybe someone switched off something that they shouldn’t have. But while this may have contributed to the incident, is it the root cause? Some may say it is, but with our organisational resilience hats on, we will dig a little deeper.

Looking just below the surface, we have to consider why resilience measures weren’t in place – or didn’t work – and why recovery wasn’t delivered as expected. Defining the appropriate levels of resilience and recovery is the role of varying levels of leadership, with each level seeking to balance its risk exposure against the cost of investment. Investment in resilience is often a “trickier sell” than other internal business cases, as its value can’t be appreciated unless something awful happens. If the resilience is great and nothing truly awful ever happens, the investment has no visible results.

Highly reliable organisations build resilience into the way they do things and the systems they use. They also tend to be better at knowingly and actively accepting risk at all levels of the organisation, and at understanding the risk exposure in play, which they mitigate with Business Continuity Planning and Incident and Crisis Response arrangements to manage the impact of a significant disruption. In reality though, many organisations struggle to balance risk versus resilience, and can be caught out, sometimes much more severely than anticipated.

This slightly deeper dive into the root cause of an incident might lead us to conclude that the root cause was the failure to implement sufficient resilience and recovery capabilities. But is that also too simple an answer?

Going even deeper beneath the surface often exposes a cluster of culture, decisions and circumstances that allowed the incident to take place. Acknowledging this gives us the opportunity to consider a more strategic approach to achieving more reliable, embedded resilience.

When an organisation suffers a major disruption to business-as-usual, particularly one that affects people who depend on it, and cannot respond as robustly as stakeholders expect, we must ask what else contributed to this outcome at all levels of the organisation. Here the answers can be as individual as the organisation, its leaders and the legacy of those who came before. However, there are some factors that we come across more commonly:

  • Lack of a true understanding of the specific risk exposure, particularly at senior levels
  • Multiple rejections of a resilience investment case for the process/system that failed
  • Unidentified and unmitigated Single Points of Failure (not all of which are technical)
  • Misunderstanding the difference between recovery (response) and resilience (a broader capability to prevent, adapt and respond)
  • Incident and crisis response teams who were not sufficiently equipped or skilled to respond
  • Strategic change that hasn’t yet resulted in the re-alignment of operational arrangements
  • Recent loss of long-standing expertise from individuals or teams responsible for the process(es)

So what is the takeaway from this post? It is this: while we must seek to identify and resolve threats to our critical business processes and put appropriate resilience and recovery arrangements in place, we must also accept that contributing threats exist at a number of levels within our organisations. When we ask what caused an incident, what appears to be a simple answer on the surface may stem from deeper and more systemic issues that threaten wider aspects of organisational resilience. Organisations that consider their resilience on a number of levels are more likely to prevent major incidents that threaten their value.

If you’d like a conversation with us about resilience, business continuity management, major incident response or crisis leadership, you may contact us at:

Charley Newnham: charley.newnham@pwc.com
Gitesh Khodiyar: gitesh.khodiyar@pwc.com
