Putting the ‘stress’ into resilience testing

30 October 2018

Do you test end-to-end operational resilience? You might think so, but approaches to IT Disaster Recovery (ITDR) and Business Continuity Management (BCM) testing haven’t evolved much over the past 10 years and have arguably not kept pace with our fast-paced Financial Services industry. With new stress testing challenges from regulators, it’s time to re-evaluate resilience testing.

Component testing is still the norm for BCM and ITDR. For BCM this manifests as testing plans by location, proving that a team can access its applications from an alternative location. For ITDR, ‘bubble’ testing remains prevalent, with applications tested in isolation.

These approaches fail to address the stresses of alternative working methods because those methods aren’t actually tested for a sustained period. The amount of pre-planning before a test also undermines its ability to provide meaningful insight. It’s safe to say that current testing approaches may be low risk, but they are not realistic.

The most prevalent argument against more realistic testing is the risk to ‘business as usual’ (BAU) operations. To a point I agree: you don’t want testing to be the cause of a disruption, but strategies must be tested as fully as possible to provide meaningful insight.

Furthermore, regulatory stress testing will largely be a tabletop exercise using the data produced during internal testing. If that data isn’t accurate, the stress test will be wasted effort. So how could more meaningful testing be performed?

Firstly, testing shouldn’t be based on location or component but on the ‘important business service’, which aligns with the approach required by the regulators. By reducing delivery capability for one process in one location while the downstream processes continue to run, teams and technology can test their responses in real time.

Secondly, test with a view to breaking things, not to being ‘green’. Test a plan to see where it would fail; don’t fear faults but find them and fix them, then test again to see whether you can break the fixes.

Lastly, where possible, run unannounced tests, with any alternative working methods exercised for at least a day.
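
To make the ‘test to break things’ idea a little more concrete, here is a minimal sketch, in Python, of what an automated fault-injection check against an important business service might look like. Everything named here — the health endpoint, the inject_fault/clear_fault hooks and the five-minute recovery time objective — is a hypothetical illustration rather than any specific firm’s tooling.

```python
"""
Minimal sketch of a 'break it and measure' resilience test.

Assumptions (all hypothetical, for illustration only):
  * SERVICE_URL is the health endpoint of the important business service.
  * inject_fault() / clear_fault() stand in for whatever mechanism you
    actually use to degrade one component in one location (e.g. stopping
    a process, blocking a port, failing over a database).
  * RTO_SECONDS is the recovery time objective agreed for the service.
"""
import time
import urllib.error
import urllib.request

SERVICE_URL = "https://example.internal/payments/health"  # hypothetical endpoint
RTO_SECONDS = 300          # hypothetical recovery time objective (5 minutes)
POLL_INTERVAL_SECONDS = 5  # how often to re-check the service while degraded


def service_is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the service answers its health check within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def inject_fault() -> None:
    """Placeholder: degrade one component in one location (implementation-specific)."""
    raise NotImplementedError("Wire this up to your own fault-injection mechanism")


def clear_fault() -> None:
    """Placeholder: remove the injected fault so BAU can resume."""
    raise NotImplementedError("Wire this up to your own fault-injection mechanism")


def run_break_test() -> bool:
    """Inject a fault, then measure whether the service recovers within its RTO."""
    assert service_is_healthy(SERVICE_URL), "Service must be healthy before testing"

    inject_fault()
    started = time.monotonic()
    try:
        # Poll until the service is healthy again or the RTO is exceeded.
        while time.monotonic() - started < RTO_SECONDS:
            if service_is_healthy(SERVICE_URL):
                elapsed = time.monotonic() - started
                print(f"Recovered in {elapsed:.0f}s (within RTO of {RTO_SECONDS}s)")
                return True
            time.sleep(POLL_INTERVAL_SECONDS)
    finally:
        clear_fault()  # always restore BAU, even if the test fails

    print(f"Service did not recover within its RTO of {RTO_SECONDS}s")
    return False


if __name__ == "__main__":
    run_break_test()
```

The design point is simply that the injected fault is always cleared afterwards (the finally block), so a ‘break it’ test can push a service towards failure without abandoning BAU, and the result is a measured recovery time rather than a green tick.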

Going forward, why not approach resilience testing like ‘essential maintenance’ of a website? Be honest with customers that, to ensure the resilience of the service they receive, there is a need to degrade that service for a period of time. This would test how acceptable the predetermined ‘acceptable level’ of service in recovery mode really is. It would, however, also introduce a greater level of risk to BAU, so a balance has to be struck that we perhaps haven’t been brave enough to aim for.

In summary, end-to-end testing of a service is crucial to understanding resilience, as component testing is frequently artificially constructed and can give a false sense of security. The risk to BAU operations will (rightly) temper testing ambition, but we should be braver with our testing strategies and thereby improve overall service resilience.

Stella Nunn | Director
Profile | Email | +44 (0)7932 144627

Sabrina Damian | Senior Associate
Profile | Email | +44 (0)7841 804481
