Revisiting data architecture for next-gen data products

by Aziz Shaikh and Henning Soller, with Margarita Młodziejewska and Mitch Gibbs

Virtual assistants, recommendation engines, predictive maintenance, and personalized medicine: these and other next-generation data products rely on high-quality data and data architecture for optimal performance. And high-performing data organizations are three times more likely to say their data and analytics initiatives have contributed at least 20 percent to EBIT.

Next-gen data products leverage analytics and new technologies such as AI and generative AI (gen AI), cloud computing, machine learning (ML), and real-time data processing to provide valuable insights into large, complex data sets. Companies that have successfully adopted next-gen data products powered by gen AI have revised their data architecture and governance, implementing overarching data access and governance controls. Architecturally, they have added vector databases to their landscape so that gen AI can draw on the full data set.
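At its core, a vector database stores numerical embeddings of documents and retrieves the entries most similar to a query embedding. The sketch below illustrates that retrieval step with plain NumPy and a toy corpus; it is a minimal illustration of the concept, not a production vector database, and the document embeddings are invented for the example.

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    """Normalize document embeddings so a dot product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

def top_k(index: np.ndarray, query: np.ndarray, k: int = 3) -> list:
    """Return indices of the k documents most similar to the query embedding."""
    q = query / np.linalg.norm(query)
    scores = index @ q  # cosine similarity against every document
    return list(np.argsort(scores)[::-1][:k])

# Toy corpus: four "documents" with three-dimensional embeddings.
docs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
index = build_index(docs)
print(top_k(index, np.array([1.0, 0.05, 0.0]), k=2))  # the two closest documents
```

In practice, a dedicated vector database adds approximate-nearest-neighbor indexes, persistence, and metadata filtering on top of this basic similarity search, which is what makes it usable across an enterprise-scale data set.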

Three different data architecture archetypes—centralized, hybrid, and decentralized—can be suited to various combinations of technological needs and strategies within companies. These archetypes are supported by reference architectures, data governance models, and considerations for implementing technologies. CIOs who proactively select the most appropriate data architecture archetype and implement robust data governance practices could unlock the full potential of AI in their next-gen data products.

What is data architecture?

Data architecture encompasses the design and blueprint by which data is organized, integrated, moved, stored, processed, and consumed. Data governance, on the other hand, is a framework and set of practices that include policies, procedures, and standards for data management that help ensure data quality and privacy, along with consistent and effective data management. Data governance relies on a well-defined data architecture to provide the infrastructure and data management tools needed to enforce policies and standards.

Choosing the right data architecture archetype

An organization needs to determine the centralization level of its data architecture—the degree to which data management, integration, storage, and access are controlled centrally across the organization versus by individual business units. CIOs can consider three levels of centralization: centralized, hybrid, and decentralized.

Centralized data architecture. Centralized data architecture is often best suited to banking or healthcare organizations that operate in highly regulated environments. It provides a single point of control for data governance, auditing, and reporting, using authoritative sources, a single intelligent landing area, and a data aggregation layer across business units. Simplified enterprise consumption platforms for reporting and analytics are also used.

Hybrid data architecture. Within a hybrid data architecture, data and platforms are organized and rationalized by data domain with single golden sources and no duplication across data domains. This type of architecture is typically helpful in operations with rapidly updated data streams and clear alignment of processes within each business unit. Telecommunications companies, for example, often use a hybrid architecture with centralized master data management (MDM) and federated data storage for the individual data domains.

Decentralized data architecture. In this approach, data is organized and optimized front to back within business unit silos, each of which includes sources, aggregation, and business unit–level reporting. Platforms aggregate across business units for enterprise-level reporting. Insurance companies, for example, may use this approach, wherein MDM and business domain data are decentralized to accommodate the needs of different customer bases, core systems, and data products.

Organizations can determine the best archetype for their data architecture by implementing a strategy that meets overall business objectives while supporting criteria identified by CIOs. This type of strategy strikes a balance between simplifying and standardizing data architecture and thoughtfully implementing any decentralization or variability within the architecture. Ultimately, the data architecture archetype is optimized to fulfill its intended purpose: supporting operations while expanding the company’s capabilities (see sidebar, “C-suite considerations for data architecture”).

The choice of architecture archetype is often driven initially by technical considerations, which can later force adjustments that incur significant cost increases. For example, a major bank’s technology department determined that a centralized data fabric spanning multiple business units and countries would be an adequate solution for business challenges. The department had to reconsider the decision because of challenges with aligning data definitions and with the performance of the underlying databases. It determined that a decentralized architecture, supplemented by data warehouses in certain critical areas, would be easier to set up and maintain in the bank’s business and technical context.

A road map to best-in-class data and AI architecture

Ideally, any discussion of best-in-class architecture begins by identifying the required capabilities and using them to guide decision making and develop the technical architecture and implementation plan. The exhibit details a reference model for a capability road map, which a company can customize and refine according to its specific needs and objectives.

A best-in-class reference data and AI architecture provides an essential framework design.

Once organizations identify the required capabilities, they can determine the underlying technology architecture. This could include, for example, implementing a vector database to support gen AI or a data warehouse to help support a single source of truth. The technology choices are shaped by the requirements of the data architecture archetype. However, several technologies are not limited to a single archetype and can enable more secure, automated data handling.

When it comes to data security and management, organizations could make these no-regrets moves:

Data quality (data and model management). Organizations could implement automated quality checks that use ML algorithms for predictive data quality monitoring. They could also adopt data lineage tools to trace data from source to destination, ensuring integrity and transparency, and establish real-time data profiling systems for immediate data validation to help maintain ongoing quality.
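An automated quality check of this kind can be as simple as profiling each incoming batch for completeness, range violations, and statistical outliers before it reaches downstream consumers. The sketch below shows such a check on a hypothetical "amount" column; the z-score outlier flag is a simple stand-in for the ML-based monitoring described above, and the thresholds are illustrative assumptions.

```python
import math

def quality_report(rows: list, column: str, lo: float, hi: float) -> dict:
    """Profile one numeric column: completeness, range violations,
    and simple z-score outlier flags."""
    values = [r[column] for r in rows if r.get(column) is not None]
    n = len(rows)
    if not values:
        return {"completeness": 0.0, "out_of_range": 0, "outliers": []}
    completeness = len(values) / n
    out_of_range = sum(1 for v in values if not (lo <= v <= hi))
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    outliers = [v for v in values if std and abs(v - mean) / std > 3]
    return {
        "completeness": completeness,
        "out_of_range": out_of_range,
        "outliers": outliers,
    }

# Hypothetical batch with one missing value and one out-of-range value.
rows = [{"amount": 10.0}, {"amount": 12.0}, {"amount": None}, {"amount": 950.0}]
report = quality_report(rows, "amount", lo=0.0, hi=100.0)
print(report)
```

In a real-time profiling system, a check like this would run on every micro-batch, with failing batches quarantined or routed back to the data owner rather than propagated downstream.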

Security as code (processing). Organizations could integrate automated compliance scanning tools that continuously check for and enforce compliance with security policies in the data pipeline. Incorporating threat modeling as code in the development cycle will help predict and mitigate potential security breaches.
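"Security as code" means expressing security policies as machine-checkable rules that run automatically in the pipeline rather than as documents reviewed by hand. The sketch below is a minimal illustration under assumed names: the pipeline config schema and the three policies are hypothetical, not a real tool's format.

```python
# Hypothetical pipeline configuration; the keys are illustrative assumptions.
PIPELINE = {
    "storage": {"encryption_at_rest": False, "public_access": True},
    "transport": {"tls": True},
}

def scan(config: dict) -> list:
    """Evaluate the config against declarative policies; return violations."""
    policies = [
        ("storage must be encrypted at rest",
         lambda c: c["storage"]["encryption_at_rest"]),
        ("storage must not be publicly accessible",
         lambda c: not c["storage"]["public_access"]),
        ("data in transit must use TLS",
         lambda c: c["transport"]["tls"]),
    ]
    return [msg for msg, check in policies if not check(config)]

violations = scan(PIPELINE)
print(violations)  # two violations for this config
```

Run as a step in the deployment pipeline, a scan like this blocks a release the moment a non-compliant configuration is committed, which is what "continuously check for and enforce compliance" looks like in practice.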

DataOps (data and model management). Organizations could automate the delivery of the right data assets for the underlying AI and ML models. Automating code reviews, testing, and other tasks in the software development life cycle increases efficiency and reduces errors. Finally, organizations could strengthen their ability to deploy code directly into production with adequate testing.
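A concrete building block of such automation is a validation gate that checks a data asset before it is delivered to a model-training job. The sketch below is a simplified stand-in for a full DataOps test suite; the column names and thresholds are hypothetical.

```python
def validate_training_data(rows: list, required_cols: list, min_rows: int) -> list:
    """Gate a model-training data asset: return a list of errors,
    empty if the asset may proceed to training."""
    errors = []
    if len(rows) < min_rows:
        errors.append(f"need at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        missing = [c for c in required_cols if c not in row]
        if missing:
            errors.append(f"row {i} missing columns: {missing}")
    return errors

# Hypothetical data asset that fails both checks.
rows = [{"feature": 1.0, "label": 0}, {"feature": 2.0}]
errors = validate_training_data(rows, required_cols=["feature", "label"], min_rows=5)
print(errors)
```

Wired into a CI/CD pipeline, this kind of gate is what allows code and data to be deployed directly to production with confidence: a failing check stops the run automatically instead of relying on manual review.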

Choosing the right type of governance

To ensure data governance is effective, organizations would ideally align it with their technology and operational structure. Organizations typically follow one of three governance models: enterprise-, domain-, or business unit–oriented. Each model involves specific roles for the hub and spokes. Depending on the operating model and data architecture archetype, analytics functions and processes are divided between the hub and spokes.

Enterprise-oriented model. In this model the central hub owns the data, is responsible for data quality and availability, and defines data logic. Data is often stored centrally in a data warehouse. The spokes may provide suggestions and change requests but mostly consume data.

Domain-oriented model. In this model the central hub defines company-wide quality standards and policies and monitors their execution. Data storage can be centralized or decentralized. Spokes own the data, define the data requirements, and are responsible for adhering to the standards; they may also adopt standards stricter than those the hub defines. Spokes consult the hub and raise suggestions for discussion in a data council.

Business unit–oriented model. If a central hub is included in this model, it maintains technology components that multiple spokes may use. The hub can also support data consumption across the organization. Spokes own data and set their own standards. Spokes may coordinate with other spokes directly on cross-spoke needs. For example, a universal bank implemented a business unit–oriented model in which the individual front lines of the business took responsibility for the data quality and a central team provided overarching tooling and measured the data quality.

Ideally, organizations determine a data architecture archetype and the detailed roles within it before implementing the data governance model. From the many options available, they can then pursue the optimal pathway and degree of central steering for their archetype.


The choice among centralized, hybrid, and decentralized data architecture archetypes depends on each company’s unique needs and strategies. Reference architectures, data governance models, and thoughtful technology implementation provide a road map for forward-thinking CIOs to choose the most suitable archetype and establish strong data governance practices for next-gen data products.

Aziz Shaikh is a partner in McKinsey’s New York office, Henning Soller is a partner in the Frankfurt office, Margarita Młodziejewska is a consultant in the Zurich office, and Mitch Gibbs is a principal cloud architect in the Atlanta office.

The authors wish to thank Alan Spark, Asin Tavakoli, Lisa Weiß, Maximilian Fiedler, Mridula Juluri, and Vladimir Alekseev for their contributions to this blog post.