Operational Resilience in Financial Services: The bigger they are, the harder they fail
Since 2018 the Financial Conduct Authority (FCA) has required banks to publish statistics on operational and security incidents. The latest annual and quarterly sets, covering the period to the end of June 2019, make interesting reading. There are many inferences that could be drawn from the data, but one statistic that comes through clearly is how much higher the frequency of incidents is at the large, traditional banks compared to the challenger banks. For example, between them the big four banks account for over half of the reported business banking incidents in the period from July 2018 to June 2019. Over the same period, the frequency of personal banking incidents at challenger banks was typically half that of the more traditional banks. While the most recent quarter of personal banking incidents appears to show a drop in the frequency of incidents, business banking incidents appear to be on the increase.
Obviously one needs to be careful not to read too much into what is fairly basic data, but these simple measures highlight the operational resilience challenge facing the established banks. Furthermore, their large client base only serves to amplify the impact and, ultimately, the coverage of incidents.
Why are incidents more frequent?
There is one very obvious explanation for the higher frequency of incidents at the traditional banks, and that is the scale and diversity of their business. They typically provide more services, in more places, to more customers. However, this scale does not adequately explain the extent of the differences between traditional and challenger banks. For example, internet banking activities are less dependent on these factors and yet follow the same trend between the two groups.
In November 2018 the FCA published the results of a survey to assess technology and cyber capabilities in UK financial services, including statistics on the root causes of outages over the previous 12 months (see Figure 1). A look into these statistics provides some interesting insights. In particular, the four most frequent causes account for 70% of the classified outages recorded in the survey. These are:
- Change Management
- Third-party failure
- Software / application issue
- Cyber-attack
Large banks tend to be disproportionately exposed to these factors relative to start-ups. The first three factors in particular reflect the complexity of the infrastructure (both process and technology) which has grown up in these organisations over time. Incremental changes to systems over an extended period have resulted in a web of processes and fragmented data. More recently, the increasing use of third-party providers has spawned a plethora of dependencies. All of this adds up to a complex infrastructure with significant dependencies between a large number of systems, applications and third parties. The net result is inherently fragile and poorly understood as a whole. This fragility becomes apparent when changes are made to parts of the infrastructure, which explains why Change Management accounts for 1 in 5 of all incidents. Large banks therefore face a dilemma in balancing the need to update and rationalise their legacy infrastructure with an inherent pressure to ‘leave well alone’. However, as the statistics show, even the status quo is exposed to outages resulting from application and third-party performance issues, which between them account for a third of all incidents. If that weren’t enough, the scale and complexity of large banks make them attractive targets for cyber-attacks.
Together the two sets of data make it clear that traditional banks face a number of threats to their service resilience. Firms have historically tended to focus on system resilience as a more tractable approach to the problem, i.e. make sure the systems are up and running and de facto the services should be available. While useful, taking this approach in isolation is proving to be increasingly unreliable for a number of reasons. As the inter-dependencies internally and externally become increasingly complex, the probability of an outage or other disruption (e.g. cyber-attack) somewhere in this extended, heterogeneous technology estate increases. Add to this the rapid pace of change, ageing systems and cost pressures, and it is a brave person who assumes 100% availability of all systems all the time.
An alternative approach is to view operational resilience through the lens of the end business service to the customer / market. This, however, is challenging; tracing back from a particular business service to the web of processes and systems, both internal and external, that support the service can be complex. Add to this the variety of disruptions that can occur and the permutations increase exponentially. It is key therefore that firms are pragmatic in their approach to operational resilience. In July 2018 the FCA released a discussion paper which lays out a number of the elements that need to be considered and outlines a process that firms can follow to develop their approach. The challenge lies with firms to translate the guidance into practical solutions.
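To make the idea of tracing back from a business service concrete, the following is a minimal sketch in Python, assuming the firm can express its dependencies as a simple map. Every service and system name, and both function names, are hypothetical and chosen purely for illustration; a real dependency inventory would be far larger and less tidy.

```python
from collections import deque

# Hypothetical dependency map: each key depends directly on the items in its list.
# All names are illustrative assumptions, not taken from any real firm.
DEPENDENCIES = {
    "online_payments": ["payments_gateway", "customer_auth"],
    "payments_gateway": ["core_ledger", "card_processor_3p"],
    "customer_auth": ["identity_store"],
    "core_ledger": ["mainframe"],
    "card_processor_3p": [],
    "identity_store": [],
    "mainframe": [],
}


def trace_dependencies(service, dependencies):
    """Return everything the service relies on, directly or indirectly.

    Uses a breadth-first walk; the 'seen' set stops it looping on cycles.
    """
    seen = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dep in dependencies.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def services_affected_by(component, dependencies):
    """Return every entry in the map that would be hit if 'component' failed."""
    return {
        name
        for name in dependencies
        if component in trace_dependencies(name, dependencies)
    }


if __name__ == "__main__":
    # Working from the customer-facing service back to its dependencies.
    print(sorted(trace_dependencies("online_payments", DEPENDENCIES)))
    # Working the other way: what does a single failure put at risk?
    print(sorted(services_affected_by("mainframe", DEPENDENCIES)))
```

Even in this toy map a single customer-facing service depends on six components, one of them a third party, which hints at why the full picture in a large bank is so hard to hold in view.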
Practicality and complexity tend to run counter to each other. The key is to simplify the planning for operational resilience, while maximising the benefit in terms of meeting business service standards. It is beyond the scope of this note to go into a detailed exposition of all the practical elements that need to be covered; however, some of the key points to consider are:
- Take an ‘outward in’ approach – start with the service to the customer and work back to the internal and third-party dependencies.
- Understand and set realistic impact tolerances that meet the firm’s responsibilities to its customers and the wider market. Similarly, understand the relative priorities between services, customers etc.
- Identify the types of disruptions that can occur (both internally and externally, self-inflicted or third-party) and how these impact the business service. Remember timing can be a factor – a brief outage just before market close or payment cut-off can have a much more significant impact than one earlier in the business day. Try to group scenarios into related families to simplify the resolution planning.
- Plan for the likely scenarios (e.g. major system outage, data breach, third-party outage), identifying the relevant roles and responsibilities, systems/processes and governance to handle these scenarios.
- Build in flexibility. The actual disruption is unlikely to correspond exactly to pre-planned scenarios, so it is important that any framework is prescriptive enough to provide clear guidance but flexible enough to respond to the inevitable uncertainties.
- Communication is key. Clear communication, both internal and external, is fundamental to managing both the disruption and its impact on stakeholders.
- Think about the tail – not just restoring the systems and processes, but resolving any residual issues and capturing the lessons.
- Testing is central to assuring the practicality of the elements of operational resilience and the overall approach.
- Monitor internal and environmental factors that may lead to an increased likelihood of incidents, e.g. major change programmes, increased cyber threats.
- Design for business resilience. Operational resilience planning should be built into the introduction of new systems, processes and products. Whether it is for a new or existing service, operational resilience needs to be an ongoing, iterative process (e.g. of monitoring and applying lessons learned).
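As an illustration of the points above on impact tolerances and timing, the sketch below shows one way a tolerance might be written down so that the same outage is judged differently depending on when it occurs. It is a simplified example under stated assumptions: the class, the function and every threshold are hypothetical, and are not drawn from the FCA discussion paper or from any firm's actual standards.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ImpactTolerance:
    """Hypothetical impact tolerance for one business service."""
    service: str
    max_outage: timedelta           # tolerable outage during normal hours
    critical_cutoff: datetime       # e.g. a payment cut-off or market close
    critical_window: timedelta      # run-up to the cut-off where limits tighten
    max_outage_critical: timedelta  # tolerable outage inside that run-up


def breaches_tolerance(tol, start, end):
    """Return True if an outage from 'start' to 'end' exceeds the tolerance.

    An outage that overlaps the run-up to the cut-off is measured
    against the tighter limit.
    """
    window_start = tol.critical_cutoff - tol.critical_window
    overlaps_critical = start < tol.critical_cutoff and end > window_start
    limit = tol.max_outage_critical if overlaps_critical else tol.max_outage
    return (end - start) > limit


if __name__ == "__main__":
    # Illustrative figures only: 2 hours tolerated normally, 5 minutes
    # in the half hour before a 15:30 payment cut-off.
    tol = ImpactTolerance(
        service="online_payments",
        max_outage=timedelta(hours=2),
        critical_cutoff=datetime(2019, 6, 28, 15, 30),
        critical_window=timedelta(minutes=30),
        max_outage_critical=timedelta(minutes=5),
    )
    morning = (datetime(2019, 6, 28, 10, 0), datetime(2019, 6, 28, 10, 20))
    pre_cutoff = (datetime(2019, 6, 28, 15, 5), datetime(2019, 6, 28, 15, 25))
    print(breaches_tolerance(tol, *morning))     # False: 20 minutes, normal hours
    print(breaches_tolerance(tol, *pre_cutoff))  # True: 20 minutes, before cut-off
```

The two outages are the same length; only the timing differs. Writing tolerances down in a form like this also makes them easier to exercise in testing.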
In conclusion and at the risk of stating the obvious, if the traditional banks are to improve their operational resilience, especially over the longer term, then they need to ensure that it is directly informing their investment decisions alongside other strategic priorities.
FCA, Building the UK financial sector’s operational resilience, DP18/04, July 2018, chapter 4, p. 25
FCA, Mandated and voluntary information on current account services, https://www.fca.org.uk/data/mandated-voluntary-information-current-account-services#personal, published August 2019
FCA, Cyber and Technology Resilience: Themes from cross-sector survey 2017-2018, November 2018