Building an engineering culture and a resilient technology foundation

The pace of change in technology has created many opportunities—and raised as many challenges. Overcoming the challenges to grasp the opportunities requires tools and processes that nurture an engineering culture, build stable foundations for resilient technology operations, and manage complex cloud costs. Driving this shift requires modern technology leaders to expand from their traditional role as “guardians of IT” to become closer partners with the business. So says David Pedreira, CTO of Santander Argentina, in this interview with McKinsey’s Jorge Machado, where he reflects on his long experience leading technology organizations and the key actions that have allowed him to transform the technology function. What follows is an edited version of their conversation.

Jorge Machado: How did you go about building up an engineering culture? What was the biggest change in terms of how you managed technology as a result?

David Pedreira: To build an engineering culture, you need to bring together people who like to challenge traditional processes—we went from requiring 15 signoffs for a road map to getting it approved in one meeting and adopting agile resource management—and who understand what is required to enable a better developer experience, such as using APIs to create services. I’ve also prioritized technology leaders who like to manage people over experienced managers with less technical experience; it’s critical to have deep experience in technology first.

Second, I’ve invested in tools, particularly an integrated developer portal. The right tool enforces process standardization and helps with communications. When you try to organize engineering in a portal, you learn that all your engineering processes need to be easy. To help your developers build, you need to go from what has traditionally been a confusing series of requests and hurdles to a much simpler self-service model, where approved code and services are ready to go and easy to integrate. A developer portal also handles all the aspects of an application’s life cycle, from the code to managing incidents in production.

By embracing an engineering portal, we’ve established that everything, even a core enabler, is a product and should be treated as such. And products, in the engineering core, are not designed to be closed. Everything should be composable, even at this very core level.

Finally, I’ve invested in metrics. I created a small business intelligence team that ensures that every single thing is measured. That created a data-driven culture and helped both in making decisions and in handling the continuous improvement of all activities.

Jorge Machado: What are two or three of the most successful initiatives you’ve led to improve IT resiliency at Santander?

David Pedreira: I truly believe IT resiliency is the offspring of discipline in demanding well-executed processes. Two areas where I was able to make an immediate impact were fixing the “crisis of the first day” and the site reliability engineering (SRE) practice.

Let me explain. We have a significant level of inflation, which leads many Argentinians to buy dollars. In 2020, restrictions were placed on the monthly access to US dollars. This led many Argentinians to go to the online bank to buy dollars at the earliest possible moment, the first banking day of the month. Traffic would spike 20 times higher from one minute to the next—I’ve never seen anything like it. It occurred only four hours a month, but we had an outage every single month because our mainframe was not designed to handle so many transactions.

We did not want more downtime, and we had just four weeks to fix it. Scaling the mainframe or offloading the transactions elsewhere wasn’t an option. So we buffered the mainframe by introducing a “virtual queue” to the site, which gave our teams time to offload the transactions from the mainframe and decreased downtime by 70 percent.
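The interview does not describe how the “virtual queue” was built. As a minimal sketch of the general idea only, assuming a fixed number of concurrent sessions the back end can tolerate, the Python below admits users up to that limit and parks everyone else in arrival order. The class name `VirtualQueue`, the `capacity` parameter, and the method names are hypothetical and are not taken from Santander's implementation.

```python
from collections import deque


class VirtualQueue:
    """Admit at most `capacity` concurrent sessions; hold the rest in FIFO order.

    Hypothetical illustration of a waiting-room pattern, not the bank's design.
    """

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.active = set()      # users currently allowed to transact
        self.waiting = deque()   # users held back, oldest first

    def arrive(self, user_id: str) -> str:
        """Register a user and report whether they are admitted or waiting."""
        if user_id in self.active:
            return "admitted"
        if user_id in self.waiting:
            return "waiting"
        # Admit directly only if there is room and nobody is already queued,
        # so that newcomers cannot jump ahead of waiting users.
        if len(self.active) < self.capacity and not self.waiting:
            self.active.add(user_id)
            return "admitted"
        self.waiting.append(user_id)
        return "waiting"

    def leave(self, user_id: str) -> list:
        """End a session and admit waiting users into any freed slots."""
        self.active.discard(user_id)
        admitted = []
        while self.waiting and len(self.active) < self.capacity:
            next_user = self.waiting.popleft()
            self.active.add(next_user)
            admitted.append(next_user)
        return admitted
```

With `capacity=2`, a third and fourth arrival are told to wait, and the third is admitted as soon as one of the first two leaves. The point of the pattern is that the spike is absorbed in front of the constrained system instead of inside it.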

I also created the SRE team and moved to a true no operations (NoOps) approach. The reliability team, for example, introduced observability and automated the generation of alarms and the escalation process. We also changed the main monitoring from technical measurements to detecting anomalies in customer behavior—for example, we no longer depend on percentage of CPU usage but on number of logins to detect incidents. We also broke up our monolithic system to create small “failure” domains—you cannot imagine how many times I have seen different microservices sharing a database—and accelerated the migration of applications to the cloud, which, in my experience, is by far the most stable environment.

These actions reduced the number of incidents by 90 percent, significantly improved our response time to incidents, from hours to seconds, and also improved the management of technical debt in the road map.
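The shift from technical measurements to customer-behavior signals can be illustrated with a small sketch. The interview does not say which detection method the team used; the Python below assumes a simple z-score test against login counts from comparable past minutes, and the function name, parameters, and threshold of 3.0 are all hypothetical.

```python
from statistics import mean, stdev


def is_login_anomaly(history, current, threshold=3.0):
    """Flag `current` logins per minute if it deviates strongly from `history`.

    `history` holds login counts for comparable past minutes (an assumption
    of this sketch). A sharp drop is as suspicious as a sharp rise, so the
    absolute deviation is used.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all counts as a deviation.
        return current != baseline
    return abs(current - baseline) / spread > threshold
```

For example, with a history of roughly 1,000 logins per minute, a reading of 400 would be flagged while a reading of 1,005 would not. A real system would also need to account for daily and monthly seasonality, which this sketch leaves out.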

Jorge Machado: How did you improve infrastructure automation and adopt infrastructure as code (IaC)?

David Pedreira: To automate infrastructure, I’ve adopted two key principles: abstraction and “zero tickets.” In terms of abstraction, the less a developer knows about the details of the infrastructure the better. I wanted to make sure that the development teams focused only on the code. The engineering teams are the ones that handle the IaC and “hide” it behind the developer portal, where developers can get prebuilt templates to work with.

The “zero tickets” principle is based on the idea that the ticket is the physical realization of waste. We leveraged a ticketing tool and established a path to zero requests over two to five quarters. That was the KPI. We also introduced a zero-ticket goal on the developer employee-satisfaction score, so that we could track if the goal was being achieved. In the end, all infrastructure teams started automating everything. That not only reduced the number of people in charge of these activities from more than 400 to fewer than 90 but also reduced the resolution time from weeks to minutes or seconds.

Jorge Machado: Could you discuss your experience in adopting FinOps practices and optimizing cloud infrastructure spend?

David Pedreira: I’m going to talk about my previous role at Mercado Libre. I’m quite proud of what we built while I was CTO there.

Cloud expenses are complex. The cost of a simple compute instance is a function of memory, CPU, and storage. In addition, if you pay in advance and commit to consumption (reserved instances, savings plans), you get savings, and if you are willing to sacrifice quality, you can purchase spot instances. The virtual machine (VM) will also have indirect costs, such as support, networking, monitoring, backup, and security. The cost ends up including more than 100 different components, and it’s hard to understand what is relevant and what is not.

Controversially, our first step was to do some math to arrive at a good-enough approximation of the price of the VM and find the three main components that made a difference. In our case, they were CPU, memory, and persistence. With that, we set a reference price for the VM. We did it for the main costs in the cloud provider and mapped it to the costs of our engineering products.

The second step was to split the problem of efficiency into “How much are you consuming (quantity)?” and “How cheaply can you provide the service to your customer (price per unit)?” The former was made an objective for the developer teams: “You cannot use more than X US dollars per month.” The latter was made an objective for the engineering teams: “Your average VM cost cannot exceed Y US dollars per month.”
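The first two steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the model used at Mercado Libre: it assumes the reference price is a linear combination of the three drivers named above (CPU, memory, and persistence), and every unit price, budget, and name in the code is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitPrices:
    """Hypothetical monthly unit prices in US dollars."""
    per_vcpu: float
    per_gb_memory: float
    per_gb_storage: float


def reference_vm_price(prices, vcpus, memory_gb, storage_gb):
    """Approximate a VM's monthly price from its three main cost drivers."""
    return (
        prices.per_vcpu * vcpus
        + prices.per_gb_memory * memory_gb
        + prices.per_gb_storage * storage_gb
    )


def check_objectives(vm_count, vm_price, team_budget, max_vm_price):
    """Split efficiency into quantity (developers) and unit price (engineering).

    Monthly spend is quantity times price per unit; each factor is checked
    against its own target, mirroring the X and Y limits described above.
    """
    spend = vm_count * vm_price
    return {
        "monthly_spend": spend,
        "within_team_budget": spend <= team_budget,          # developer objective
        "within_unit_price_target": vm_price <= max_vm_price,  # engineering objective
    }
```

Separating the two checks makes ownership explicit: a team can stay within its budget while the unit price is still above target, and the two findings go to different groups.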

Our third step was to create the right operating model. We created a center of excellence around FinOps to analyze more-complex scenarios and push for optimizations all across the board. We brought our finance and control teams on board, since this initiative needed to be shared with the organization. Once these were in place, the evolution was natural and required little effort. As a result, over the year it took to bootstrap this practice, we reduced expenses by 40 percent.

Jorge Machado: How has the CTO role changed in the past five years? How do you think it will evolve?

David Pedreira: Over the years, many organizations have viewed the technology function as “the others.” We handled tools, provided efficiency, and often reported to the CIO. Now, we’re moving toward being IT strategists who set the direction the organization should move toward, such as how to use generative AI or how to evolve a product. We define how product development teams should work and where they should focus, speaking directly on the go-to-market plan, not as an “order taker” but as an equal.

Something that never ceases to amaze me is that technology is always changing, and the pace keeps accelerating. We will need to focus more on the strategy side and become better internal communicators about what is relevant and what technologies matter to help transform the company.
