By Arun Gundurao, Jorge Machado, Rut Patel, and Yanwing Wong
This is the first in a series of posts on IT resilience. In this post, we introduce the seven-point manifesto that can help organizations build resiliency. We will explore several points in depth in subsequent posts.
In late January 2021, investors across the United States logged on to brokerage platforms as shares of GameStop skyrocketed. Amid the frenzy, however, millions of customers were unable to access their account information and make trades, as many of the brokerage platforms suddenly failed. Outages and unstable IT are not just a concern in the financial sector. In September 2019, Slack’s stock fell 14 percent after the quarterly earnings report revealed that the company took an $8.2 million revenue hit after giving credits (money previously allocated to cover future bills) to customers following service-level disruptions.1
These situations underscore the need for organizations to address IT resilience, a company’s ability to handle a technical disruption. To be sure, poor IT resilience is not an outcome of the COVID-19 pandemic, though the crisis certainly exacerbated it: the influx of online traffic during the pandemic strains already rigid legacy on-premises IT systems, resulting in outages and service delays.
So why aren’t companies strengthening their IT resilience? In short, because their CEOs and boards often don’t view IT resilience as a business problem until it has a financial impact through customer attrition or they are called out by regulators. Consider the former CEO of the Tokyo Stock Exchange, who stepped down amid regulatory pressure following a daylong outage of the trading platform.2 To increase IT resilience, therefore, we recommend companies take a comprehensive approach grounded in seven core beliefs that address both IT and business outcomes (exhibit).
The case for resilience
In the past, companies could mitigate outages in physical channels through manual business-continuity processes, such as a customer-care agent using administrative access to enter an order. But as more customers migrate to digital channels, the traditional ways of addressing stability issues no longer apply. In addition, the underlying dependencies of IT systems complicate the quest for resiliency. As an example, some businesses are integrating with application-programming-interface (API) ecosystems, an approach that can create value by allowing them to build new applications through an API portal or gain access to rich customer data, but one that can also introduce a new failure point.
Add the continued complexity of IT to outdated processes and operations, and it’s no wonder that the frequency of severe outages is increasing. A 2020 survey of infrastructure and operations leaders revealed that 76 percent experienced an incident during the past two years that required an IT disaster-recovery plan, and 50 percent experienced two such incidents.3 In another survey, 88 percent of respondents reported that an hour of critical server downtime costs them more than $300,000, and 40 percent reported such costs at more than $1 million.4 These incidents with high costs of downtime have motivated more organizations to boost investments in disaster recovery.5 These investments are critical, as many IT projects have minimal controls designed into new processes, underdeveloped change plans (or none at all), and scant design input from security, privacy, risk, and legal teams. As a result, companies are creating hidden nonfinancial risks in cybersecurity, technical debt, advanced analytics, and operational resilience, among other areas.
IT resiliency manifesto
To address these issues of IT complexity and risk, companies must fundamentally change their approach. They can do so by pursuing the seven core beliefs of the IT resiliency manifesto:
Solve for journeys, not applications. Instead of focusing on remediating critical assets, such as applications and infrastructure, as the solution to IT resiliency, organizations should look at the whole customer journey and solve for the weakest link. In short, it’s not about modernizing applications; it’s about understanding how all the applications, API calls, and third-party dependencies work together to produce a desired customer-journey outcome, and then identifying which component’s downtime deters customers from completing their journeys.
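One way to make the weakest-link idea concrete is to treat a journey as a chain of components that must all work. The sketch below is a minimal illustration rather than a method described in this post: it assumes components fail independently and in series, and the component names and availability figures are hypothetical.

```python
from math import prod


def journey_availability(component_availability: dict[str, float]) -> float:
    """Availability of a journey whose components must all be up.

    Assumes independent failures in a serial chain, which is a simplification.
    """
    return prod(component_availability.values())


def weakest_link(component_availability: dict[str, float]) -> str:
    """Name of the component with the lowest availability."""
    return min(component_availability, key=component_availability.get)


# Hypothetical log-in journey; names and figures are illustrative only.
login_journey = {
    "web_frontend": 0.999,
    "auth_api": 0.995,
    "third_party_identity_check": 0.98,
    "account_database": 0.999,
}

print(f"Journey availability: {journey_availability(login_journey):.4f}")
print(f"Weakest link: {weakest_link(login_journey)}")
```

Under these assumptions the journey as a whole is less available than any single component in it, which is why remediating one application in isolation may not move the customer outcome.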
Take a risk-based approach. Many organizations view resiliency as only an IT infrastructure issue. Instead, organizations should take a two-pronged, risk-based approach. A business-driven, top-down approach can prioritize journeys that address risk; companies should ask, for instance, which customer journeys impact revenue or customer-satisfaction scores. The second approach is a quantifiable bottom-up approach that calculates the risk profile of a technology component, such as a third-party API call, to help create a risk-reduction plan for that specific asset. Companies can create a risk profile using elements such as probability of failure, impact when a failure happens, and the ability to detect a failure quickly and minimize its impact.
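The bottom-up risk profile described above could be sketched as a simple score. The multiplicative scoring and the 1-to-10 scales below are borrowed from failure-mode-and-effects analysis and are an assumption, not the authors' method; the component names and ratings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskProfile:
    """Risk ratings for one technology component (scales are assumptions)."""
    name: str
    failure_probability: int   # 1 (rare) to 10 (frequent)
    impact: int                # 1 (negligible) to 10 (journey-breaking)
    detection_difficulty: int  # 1 (detected at once) to 10 (hard to detect)

    def score(self) -> int:
        # Higher score means the component deserves attention sooner.
        return self.failure_probability * self.impact * self.detection_difficulty


def rank_by_risk(profiles: list[RiskProfile]) -> list[RiskProfile]:
    """Order components from highest to lowest risk score."""
    return sorted(profiles, key=lambda p: p.score(), reverse=True)


# Hypothetical components and ratings, for illustration only.
profiles = [
    RiskProfile("third_party_payment_api", 6, 9, 7),
    RiskProfile("internal_message_queue", 4, 8, 3),
    RiskProfile("static_content_cdn", 2, 4, 2),
]

for profile in rank_by_risk(profiles):
    print(f"{profile.name}: {profile.score()}")
```

A ranked list like this is only a starting point for a risk-reduction plan; the ratings themselves would still need to come from incident history and business input.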
Leverage IT operations data. IT operations generate rich data sets, but many organizations cannot consistently use them for insights, discovery, and capacity planning because their tools are disparate and they lack certain skills and organizational constructs. By using artificial intelligence technologies and advanced capabilities, such as event correlation that can link data sets, organizations can improve how they handle outages. For instance, McKinsey research finds that incident triage used to take hours and often involved having hundreds of IT engineers and operations personnel on call; now, companies can reduce the mean time to identify incidents by 50 to 75 percent.
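As a rough illustration of event correlation, the sketch below groups alerts from different tools into candidate incidents when they arrive close together in time. This is a deliberately simple time-window heuristic, not the AI-based correlation the paragraph refers to; the `Alert` fields, the 60-second window, and the sample alerts are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since some reference point
    source: str       # hypothetical name of the monitoring tool
    message: str


def correlate(alerts: list[Alert], window_seconds: float = 60.0) -> list[list[Alert]]:
    """Group alerts into candidate incidents.

    An alert joins the current group if it arrives within `window_seconds`
    of the previous alert; otherwise it starts a new group.
    """
    groups: list[list[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


# Hypothetical alerts from three separate tools.
alerts = [
    Alert(0.0, "network_monitor", "packet loss on load balancer"),
    Alert(30.0, "app_logs", "log-in latency above threshold"),
    Alert(70.0, "database_monitor", "connection pool exhausted"),
    Alert(500.0, "app_logs", "batch job finished late"),
]

for number, group in enumerate(correlate(alerts), start=1):
    print(f"Candidate incident {number}: {[a.source for a in group]}")
```

Grouping related alerts this way lets a triage team look at one candidate incident instead of three unrelated pages.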
Design for the storm, not for blue skies. Traditionally, IT organizations conduct capacity-planning exercises and assign a small multiple—50 percent, perhaps—on top of peak volume. However, surges in digital traffic to the tune of 300 to 500 percent can cause massive outages. To address this issue and deal with surge volume, organizations should build infrastructure capabilities, such as containerized applications, to rapidly augment capacity across all components of the technical stack and address bottlenecks (such as message queues) in middleware.6
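The gap between a 50 percent buffer and a 300 to 500 percent surge can be worked through with simple arithmetic. The sketch below assumes the surge is an increase over peak volume and that capacity scales linearly with the number of instances, neither of which is stated in the text; the traffic and instance figures are hypothetical.

```python
import math


def provisioned_capacity(peak_rps: float, headroom_pct: float) -> float:
    """Capacity planned as peak volume plus a percentage buffer."""
    return peak_rps * (1 + headroom_pct / 100)


def surge_demand(peak_rps: float, surge_pct: float) -> float:
    """Demand when traffic rises by surge_pct over the usual peak."""
    return peak_rps * (1 + surge_pct / 100)


def instances_needed(current_instances: int, peak_rps: float,
                     headroom_pct: float, surge_pct: float) -> int:
    """Instances required to absorb the surge, assuming linear scaling."""
    capacity = provisioned_capacity(peak_rps, headroom_pct)
    demand = surge_demand(peak_rps, surge_pct)
    return max(current_instances, math.ceil(current_instances * demand / capacity))


# Hypothetical service: 12 instances sized for 1,000 requests per second plus 50 percent.
for surge_pct in (300, 500):
    needed = instances_needed(12, peak_rps=1000, headroom_pct=50, surge_pct=surge_pct)
    print(f"{surge_pct} percent surge: {needed} instances needed")
```

Under these assumptions the fleet has to grow to roughly three or four times its planned size within the surge window, which is the kind of rapid augmentation that containerized applications are meant to enable.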
Adopt an engineering mindset. Leading organizations invest in capability building by hiring new talent, reskilling the existing workforce on DevOps automation, and adopting site-reliability-engineering (SRE) capabilities. These investments help teams implement modern engineering practices such as a continuous-integration and continuous-delivery (CI/CD) pipeline to automate software delivery; service-level indicators to measure system behavior; predetermined metrics to track service-level objectives; error budgets; and end-to-end code ownership. By employing these practices, organizations can improve uptime and use automation to identify and quickly address IT issues.
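The relationship between service-level objectives and error budgets can be illustrated with a short calculation. In common SRE usage, the error budget is the share of requests or time that is allowed to fail, that is, one minus the objective. The sketch below is a minimal example; the 99.9 percent objective, 30-day window, and request counts are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of downtime allowed in the window for a given objective."""
    return (1 - slo) * window_days * 24 * 60


def budget_consumed(good_events: int, total_events: int, slo: float) -> float:
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    bad_events = total_events - good_events
    allowed_bad_events = (1 - slo) * total_events
    return bad_events / allowed_bad_events


# Hypothetical objective: 99.9 percent of log-in requests succeed over 30 days.
slo = 0.999
print(f"Allowed downtime: {error_budget_minutes(slo, 30):.1f} minutes")

# Hypothetical service-level indicator data: 999,500 good requests out of 1,000,000.
consumed = budget_consumed(999_500, 1_000_000, slo)
print(f"Error budget consumed: {consumed:.0%}")
```

A team that has spent most of its budget might pause risky releases and focus on stability, while a team with budget to spare can ship faster.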
Avoid hero culture. Company cultures that support quality and consistency standards are more resilient because they view a crisis as a learning opportunity. At nearly every organization, there are a handful of people who know how to do everything, are very responsive to others, and are generally the most helpful people in the room. However, this scenario can actually impede resiliency because too many responsibilities are delegated to only a few people. Instead, leaders should role-model desired organizational mindset changes by nudging teams to break the hero culture and celebrating teams that promote resilient applications and behaviors.
Become proactive, not reactive. Failure is inevitable. However, companies can and should identify IT weaknesses before they expand systemwide. Operational control failures can manifest as large resiliency issues. To identify issues quickly, recover faster, and minimize impact, organizations should build and automate controls. Practices such as pre-mortem analysis, chaos engineering, and problem simulation and strategy testing can also help build resiliency, so that when actual issues occur, they won’t be a surprise.
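A very small chaos-engineering experiment might look like the sketch below, which deliberately injects failures into a dependency call and checks that a fallback keeps the journey working. It is an illustration under stated assumptions, not a description of any particular tool: `fetch_balance`, the 30 percent failure rate, and the fallback message are all hypothetical.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_fault_injection(call: Callable[[], str], failure_rate: float,
                         rng: random.Random) -> Callable[[], str]:
    """Wrap a call so that it fails with the given probability."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency outage")
        return call()
    return wrapped


def call_with_fallback(call: Callable[[], str], fallback: str) -> str:
    """Return the call's result, or the fallback if the fault is injected.

    Only the injected fault is caught here; real code would also need to
    handle the dependency's genuine failure modes.
    """
    try:
        return call()
    except InjectedFault:
        return fallback


def fetch_balance() -> str:
    # Hypothetical stand-in for a call to a downstream service.
    return "balance: 100.00"


rng = random.Random(42)  # fixed seed so the experiment is repeatable
flaky_fetch = with_fault_injection(fetch_balance, failure_rate=0.3, rng=rng)
fallback_message = "balance temporarily unavailable"
results = [call_with_fallback(flaky_fetch, fallback_message) for _ in range(1000)]
degraded = results.count(fallback_message)
print(f"{degraded} of {len(results)} calls were served by the fallback")
```

The point of such an experiment is that every call returns something usable even while the dependency is failing, so the failure mode is rehearsed before it happens in production.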
One leading financial-services organization reduced outages by 40 percent through short-term tactical fixes and improved monitoring for its tier 1 journeys, such as the log-in. The company’s average resolution time for all high-severity incidents was reduced by almost 60 percent within six months. It also embarked on a long-term plan to reduce technical debt, modernize its IT architecture, and embrace engineering practices. To quote the head of application operations: “There is a change in culture toward resiliency. Conversations are focused on risk to business and are more purposeful.”
Companies can improve their IT resiliency and differentiate themselves from competitors. The key is taking a comprehensive and structured approach.
Arun Gundurao is an associate partner in McKinsey’s New York office, where Jorge Machado is a partner and Yanwing Wong is a senior product manager; Rut Patel is a knowledge specialist in the Waltham office.
The authors wish to thank Ritesh Agarwal, Tanguy Catlin, Krish Krishnakanthan, Chandrasekhar Panda, and Vik Sohoni for their contributions to this post.
1 Todd Haselton, “Slack service goes down for more than three hours,” CNBC, January 4, 2021, cnbc.com.
2 Takashi Umekawa, “Tokyo Stock Exchange CEO resigns over system failure,” Reuters, November 29, 2020, reuters.com.
3 Jerry Rozeman and Ron Blair, Survey analysis: IT disaster recovery trends and benchmarks, Gartner, April 30, 2020, gartner.com.
4 “Average cost per hour of enterprise server downtime worldwide in 2019,” April 2020, statista.com.
5 Jerry Rozeman and Ron Blair, Survey analysis: IT disaster recovery trends and benchmarks, Gartner, April 30, 2020, gartner.com.
6 Middleware is software that sits between the operating system and the applications running on it, providing shared services, such as messaging, that the operating system itself does not offer.