Adopting a smart data mindset in a world of big data


Industrial companies are embracing artificial intelligence (AI) as part of the fourth digital revolution (see “The potential of advanced process controls in energy and materials,” November 23, 2020). AI leverages big data; it promises new insights that derive from applying machine learning to datasets with more variables, longer timescales, and higher granularity than ever. Using months or even years’ worth of information, analytics models can tease out efficient operating regimes based on controllable variables, such as pump speed, or disturbance variables, such as weather. These insights can be embedded into existing control systems, bundled into a separate advisory tool, or used for performance management.

Many companies in heavy industry have spent years building and storing big data but have yet to unlock its full value. In fact, our research shows that more than 75 percent have piloted some form of AI, yet fewer than 15 percent have realized meaningful, scalable impact. In these companies, analytics teams typically take an outside-in approach to AI and machine learning, including using various stochastic methods on top of process data that have been engineered with minimal operational insight. That approach can work, but it usually produces models that exhibit high parameter dependence, require frequent retraining, have a large number of inputs, or give unphysical or unrealistic results. Consequently, these models rarely endure in production or achieve meaningful impact before operators and engineers lose confidence in them.

To succeed with AI, companies should have an automation environment with reliable historian data. Then, they will need to adapt their big data into a form that is amenable to AI, often with far fewer variables and with intelligent, first principles–based feature engineering. We term the latter format “smart data” to emphasize the focus on an expert-driven approach that improves predictive accuracy and aids in root-cause analysis. This article describes steps for creating smart data, along with approaches to bolstering and upskilling expert staff. Our experience shows that success in both areas can result in an EBITDA (earnings before interest, taxes, depreciation, and amortization) increase of 5 to 15 percent.

Creating smart data

A common failure mode for companies looking to leverage AI is poor integration of operational expertise into the data-science process. Indeed, we advocate applying machine learning only after process data have been analyzed, enriched, and transformed with expert-driven data engineering. In practice, we suggest the following steps (Exhibit 1):

Exhibit 1: Five steps can turn process data into smart data.

1. Define the process

Outline the steps of the process with experts and plant engineers, sketching out physical changes (such as grinding and heating) and chemical changes (such as oxidation and polymerization). Identify critical sensors and instruments, along with their maintenance dates, limits, units of measure, and whether they can be controlled. Finally, note the deterministic equations that govern the process (such as thermodynamic relationships or reaction stoichiometry), as well as the variables involved. The latter step should be accompanied by a literature search to expand the realm of thinking beyond the knowledge of the organization. If process expertise is limited, the use of external experts can be essential.

As an example, a North American mining company endeavored to improve the throughput of its grinding operations, which included seven grinding mills and three cyclone “classifiers” that separate particles based on size. Experts and engineers sat with the data-science team to illustrate the process flow, which was divided into three stages of grinding and separation, each of which was monitored by approximately a dozen sensors. Data tags (or labels) were noted along with sensor redundancies and instrument accuracies. Metallurgical staff provided derivations of Plitt’s equation for particle separation and the Bond equation for grinding energy, among others. The result was a unified team with plant experts who understood what to look for in the field that may affect the resulting models and data scientists who recognized operational pitfalls and where to improve data quality.
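
As a hedged illustration of the kind of deterministic relationship that process experts can supply, the sketch below implements the Bond equation in its common textbook form, which estimates specific grinding energy from a work index and the 80 percent passing sizes of feed and product. The function name, units, and example values are assumptions for illustration and are not figures from the mining case.

```python
import math


def bond_specific_energy(work_index_kwh_t, feed_p80_um, product_p80_um):
    """Bond's law: W = 10 * Wi * (1/sqrt(P80) - 1/sqrt(F80)).

    Sizes are 80 percent passing sizes in micrometers; the result is in
    kWh per tonne when the work index is given in kWh per tonne.
    """
    if feed_p80_um <= 0 or product_p80_um <= 0:
        raise ValueError("particle sizes must be positive")
    return 10.0 * work_index_kwh_t * (
        1.0 / math.sqrt(product_p80_um) - 1.0 / math.sqrt(feed_p80_um)
    )


# Illustrative values only: work index 12 kWh/t, feed 10,000 um, product 100 um.
energy = bond_specific_energy(12.0, 10_000.0, 100.0)
print(f"{energy:.2f} kWh/t")
```

A quantity like this can later serve as a single engineered feature in place of several raw size and power signals.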

2. Enrich the data

Raw process data nearly always contain deficiencies. Thus, creating a high-quality dataset should be the focus, rather than striving for the maximum number of observables for training. Teams should be aggressive in removing nonsteady-state information, such as the ramping up and down of equipment, along with data from unrelated plant configurations or operating regimes. Generic methods to treat missing or anomalous data, such as imputing using averages, “clipping” to a maximum, or fitting to an assumed normal distribution, should be avoided. Instead, teams should start with the critical sensors identified by process experts and carefully address data gaps using virtual sensors and physically correct imputations.
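
One simple way to flag nonsteady-state periods, sketched here under stated assumptions, is to keep a sample only when the signal has stayed within a narrow band over a trailing window. The window length, the 1 percent band, and the pump-speed values are hypothetical; in practice, process experts would set these per sensor.

```python
from statistics import mean


def steady_state_mask(values, window=4, max_rel_range=0.01):
    """Return True for samples whose trailing window is nearly flat.

    A sample counts as steady when (max - min) over the last `window`
    samples is at most `max_rel_range` times the window mean.
    """
    mask = []
    for i in range(len(values)):
        if i + 1 < window:
            mask.append(False)  # not enough history yet
            continue
        recent = values[i + 1 - window : i + 1]
        avg = mean(recent)
        spread = max(recent) - min(recent)
        mask.append(avg != 0 and spread / abs(avg) <= max_rel_range)
    return mask


# Hypothetical pump speed: a ramp-up followed by stable operation.
speeds = [0, 200, 400, 600, 800, 1000, 1001, 999, 1000, 1002, 1000]
mask = steady_state_mask(speeds)
steady_speeds = [s for s, keep in zip(speeds, mask) if keep]
print(steady_speeds)
```

Only the samples recorded after the ramp has settled pass the filter, so the ramp-up itself never reaches the training set.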

For example, a European chemical company aimed to apply machine learning to its cracking furnace. Experts indicated that a flow meter was critical to the process, but the data-science team determined it was faulty, and the values were occasionally erroneous because of miscalibration. The operations team proposed pausing the project until a new flow meter was installed. Instead, the existing values were enriched by creating a virtual flow sensor using mass-balance formulas and upstream sensor data for temperature and energy use. With the virtual sensor engineered, the analytics team was able to triangulate and correct the flow values. In total, the project delivered a 20 percent increase in processing throughput.
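
The source does not give the formulas used for the virtual flow sensor. As a minimal sketch of the idea, the example below infers mass flow from a steady-state energy balance (heat duty = mass flow × heat capacity × temperature rise) and uses that estimate to replace meter readings that disagree with it. The function names, the 5 percent tolerance, and all numbers are assumptions; a real cracking furnace would also need to account for reaction heat and losses.

```python
def virtual_flow_kg_s(heat_duty_kw, cp_kj_per_kg_k, temp_in_c, temp_out_c):
    """Infer mass flow from a steady-state energy balance: Q = m * cp * dT."""
    delta_t = temp_out_c - temp_in_c
    if delta_t <= 0 or cp_kj_per_kg_k <= 0:
        raise ValueError("need a positive temperature rise and heat capacity")
    return heat_duty_kw / (cp_kj_per_kg_k * delta_t)


def reconcile_flow(measured, virtual, rel_tolerance=0.05):
    """Keep the meter reading unless it disagrees with the virtual sensor."""
    if abs(measured - virtual) > rel_tolerance * abs(virtual):
        return virtual
    return measured


# Hypothetical readings: 4,200 kW duty, cp of 2.1 kJ/(kg K), 100 C in, 300 C out.
estimate = virtual_flow_kg_s(4200.0, 2.1, 100.0, 300.0)
plausible = reconcile_flow(10.2, estimate)   # within tolerance, meter value kept
corrected = reconcile_flow(14.0, estimate)   # outside tolerance, estimate used
print(estimate, plausible, corrected)
```

The design choice here is to trust the physical meter when it is plausible and fall back to the physics-based estimate only when the two disagree, rather than imputing with an average.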

3. Reduce the dimensionality

AI algorithms build a model by matching outputs, known as observables, to a set of inputs, known as features, which consist of raw sensor data or derivations thereof. Generally, the number of observables must greatly exceed the number of features to yield a generalized model. A common data-science approach is to engineer input combinations to produce new features. When combined with the sheer number of sensors available in modern plants, this necessitates a massive number of observations. Instead, teams should pare the features list to include only those inputs that describe the physical process, then apply deterministic equations to create features that intelligently combine sensor information (such as combining mass flow and volumetric flow to yield density). Often, this is an excellent way to reduce the dimensionality of and introduce relationships in the data, which minimizes the number of observables required to adequately train a model.

As an example, a European chemical company observed occasional pressure increases in the feed line to a spray dryer, which necessitated stops or slowdowns in its continuous process. A model was built to predict pressure buildup. Even when all the relevant sensor data were included, the results were unsatisfactory. In response, the team combined details of the pipe geometry with some of the sensor information into the Darcy–Weisbach equation. The result was a reduced number of model inputs and enhanced data quality, which subsequently increased the model performance. Operators were then able to leverage the model to nearly eliminate slowdowns, yielding an 8 percent throughput increase.
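
To illustrate how several inputs can collapse into one physically meaningful feature, the sketch below evaluates the standard Darcy–Weisbach relation, Δp = f · (L/D) · (ρv²/2), from flow, density, and pipe geometry. The function name, the fixed friction factor, and the example values are assumptions; the source does not say which inputs the team used.

```python
import math


def darcy_weisbach_pressure_drop_pa(flow_m3_s, density_kg_m3, pipe_length_m,
                                    pipe_diameter_m, friction_factor):
    """Darcy-Weisbach: dp = f * (L / D) * (rho * v**2 / 2), in pascals."""
    area_m2 = math.pi * pipe_diameter_m ** 2 / 4.0
    velocity_m_s = flow_m3_s / area_m2
    return (friction_factor * (pipe_length_m / pipe_diameter_m)
            * density_kg_m3 * velocity_m_s ** 2 / 2.0)


# Hypothetical line: 10 m long, 0.1 m diameter, water-like fluid at 2 m/s.
flow = 2.0 * math.pi * 0.1 ** 2 / 4.0
expected_dp = darcy_weisbach_pressure_drop_pa(flow, 1000.0, 10.0, 0.1, 0.02)
print(f"{expected_dp:.0f} Pa")
```

Five raw quantities become a single feature, and the gap between this expected pressure drop and the measured one can itself be an informative model input.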

4. Apply machine learning

Industrial processes can be characterized by deterministic and stochastic components. In practice, first principles–based features should provide the deterministic portion, with machine-learning models capturing the statistical portion from ancillary sensors and data. Teams should evaluate features by inspecting their importance and therefore their explanatory power. Ideally, expert-engineered features that capture, for example, the physics of the process should rank among the most important. Overall, the focus should be on creating models that drive plant improvement, as opposed to tuning a model to achieve the highest predictive accuracy. Teams should bear in mind that process data naturally exhibit high correlations. In some cases, model performance can appear excellent, but it is more important to isolate the causal components and controllable variables than to rely solely on correlations. Finally, errors in the underlying sensor data should be evaluated with respect to the objective function. It is not uncommon for data scientists to strive for higher model accuracy only to find that it is limited by sensor accuracy.
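
The source does not specify how feature importance should be measured. Permutation importance is one common, model-agnostic choice, sketched here in plain Python: shuffle one feature at a time and record how much the prediction error grows. The toy model, data, and function names are hypothetical.

```python
import random


def mean_squared_error(predict, rows, targets):
    """Average squared difference between predictions and targets."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, seed=0):
    """Increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(predict, rows, targets)
    importances = []
    for col in range(len(rows[0])):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        importances.append(mean_squared_error(predict, permuted, targets) - baseline)
    return importances


# Toy data: the target depends only on feature 0; feature 1 is irrelevant.
rows = [[float(i), float((7 * i) % 5)] for i in range(20)]
targets = [2.0 * r[0] for r in rows]
model = lambda r: 2.0 * r[0]
scores = permutation_importance(model, rows, targets)
print(scores)
```

In this toy case, the feature that drives the target receives a positive score and the irrelevant one receives zero, which is the pattern one would hope to see for an expert-engineered feature versus an ancillary sensor.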

For example, a North American metal producer wanted to create a model to predict the heat needed to melt a batch of recycled material. The team first created one deterministic feature for “required heat” based on specific-heat equations that utilize the mass, heat capacity, and melting point of each alloy. Subsequently, data from 19 sensors were added as features to capture stochastic behavior, such as loss of heat through the flue or changes in the atmospheric temperature. The resulting model showed excellent performance, with the deterministic feature exhibiting an importance of more than 80 percent. The model output was sent directly to a human-machine interface (HMI) where operators could utilize the predictions to sequence melting. In total, the model has been running every minute for nearly two years, yielding a 10 percent reduction in melt time and a more consistent batch temperature.
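
A “required heat” feature of this kind might look like the following sketch, which sums the sensible heat (mass × heat capacity × temperature rise to the melting point) over the alloys in a batch. The exact equations the team used are not given in the source; the function name and material values are illustrative assumptions, and latent heat of fusion is deliberately left out because the text mentions only mass, heat capacity, and melting point.

```python
def required_heat_kj(charge, start_temp_c):
    """Sum m * cp * (T_melt - T_start) over every alloy in the batch.

    `charge` is a list of (mass_kg, cp_kj_per_kg_k, melting_point_c) tuples.
    """
    total = 0.0
    for mass_kg, cp_kj_per_kg_k, melting_point_c in charge:
        # Alloys already at or above their melting point need no sensible heat.
        total += mass_kg * cp_kj_per_kg_k * max(melting_point_c - start_temp_c, 0.0)
    return total


# Illustrative batch: two alloys charged at 20 C (values are made up).
charge = [(1000.0, 0.9, 660.0), (500.0, 0.4, 1085.0)]
heat = required_heat_kj(charge, 20.0)
print(f"{heat:.0f} kJ")
```

The machine-learning model then only has to learn the residual between this estimate and the heat actually consumed.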

5. Implement and validate the models

Impact can be achieved only if models (or their findings) are implemented. Taking action is critical. Teams should continuously review model results with experts by examining important features to ensure they match the physical process, reviewing partial dependence plots (PDPs) to understand causality, and confirming what can actually be controlled. Additional meetings should be set up with operations colleagues to gauge what can be implemented and to agree on baseline performance. It is not uncommon for teams to convey model results in real time to operators in a control room or to engage in on-off testing before investing in production-grade, automated solutions.
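
For readers unfamiliar with PDPs, the sketch below shows the underlying calculation: fix one feature at each value on a grid, leave the other features as observed, and average the model’s predictions. The toy model and data are hypothetical. Note that the curve shows what the model has learned, which is why reviewing it with experts against the physical process matters.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction with one feature forced to each grid value."""
    curve = []
    for value in grid:
        modified = [
            r[:feature_index] + [value] + r[feature_index + 1:] for r in rows
        ]
        curve.append(sum(predict(r) for r in modified) / len(modified))
    return curve


# Toy model: the output rises by 3 units per unit of feature 0.
rows = [[0.0, 1.0], [5.0, 2.0], [9.0, 3.0]]
model = lambda r: 3.0 * r[0] + r[1]
curve = partial_dependence(model, rows, 0, [0.0, 1.0, 2.0])
print(curve)
```

If an expert expects the response to a controllable variable to rise and the curve falls, that mismatch is a prompt to revisit the data or features before anything is implemented.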

As an example, a European bioscience player tried to optimize the yield of its fermentation process where data were scarce. After initial modeling efforts, only 40 percent of the variability in throughput could be explained with sensor data and engineered features. The team used insights from the parameter relations in the model to design an experiment in the plant, and these results were used to improve the model and inform operations as to where to place new sensors. The result was consensus between data-science and operations colleagues and a production increase of more than 20 percent.
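
The source does not say how “variability explained” was measured. A common choice, assumed here, is the coefficient of determination (R²), which compares the model’s squared error with the total variance of the measurements. The data below are made up for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - ss_residual / ss_total


# Hypothetical throughput measurements and model predictions.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 14.0, 15.0]
score = r_squared(actual, predicted)
print(f"{score:.2f}")
```

Under this assumption, a score near 0.4 would correspond to the 40 percent figure and signal that important drivers are still unmeasured, which is what motivated the plant experiment and new sensor placement.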

Building the team

Deploying AI in heavy industry requires cross-functional teams made up of operators, data scientists, automation engineers, and process experts. We often find that companies have (or are hiring to fill) roles for data science, but they face three main challenges regarding process experts: there is a dearth of process expertise either at a specific facility or across the company; there are sufficient process experts, but they are not comfortable with modern digital or analytical tools; or process experts don’t know how to work effectively on digital teams (Exhibit 2).

Exhibit 2: Industrial companies have varying levels of process expertise.

Process experts

Industrial companies are increasingly facing a shortage of process experts due, in part, to the retirement of tenured employees and the lack of younger job candidates. As a result, companies looking to implement AI often need to first rebuild their expert pipeline, typically through partnerships with universities and internship programs. While the pipeline is being reestablished, OEMs and external consultants can be used to augment teams, but “owning” the skills is important in the long term because it is a source of differentiated value.

Concurrently, companies should upskill their existing process experts in analytics tools and agile ways of working. Experts typically have engineering or other similar backgrounds; they are accustomed to leveraging formulas to describe physical processes. That type of thinking can be beneficial in creating smart data, but it can also engender distrust of AI-based approaches. Upskilling process experts with a combination of classroom training and in-field apprenticeship on cross-functional AI teams can build comfort with the approach and results. With these skills, process experts can better support digital teams, including partnering with data scientists to help them understand the problem, create smart data, and pressure-test models to ensure that the models have learned the correct first principles–based behavior. Moreover, in our experience, upskilling has the added benefit of increasing job satisfaction and retention.

Ways of working

It can be challenging to create high-performing teams using cross-functional roles because of differences in approach. For example, it is common for operations employees to follow unidirectional stage-gated processes—often for safety reasons—whereas data-science colleagues are usually familiar with iterative workflows, such as agile. When deploying AI, our experience shows that iterative, inclusive, and colocated agile teams tend to realize the most impact. As a result, coaching is needed for colleagues unfamiliar with this approach.

Planning out the model development can be a good exercise to solidify a way of working and avoid linear approaches that include exhaustively completing one stage (such as data extraction) before proceeding to the next. Instead, pieces of each stage should be completed concurrently to quickly develop a fully working model with the intention of maturing individual components in future iterations. In practice, this usually means starting with a subset of sensor data, creating a limited list of features, and working with simpler algorithms. Then, the team can decide what to invest in for the next stage. As part of each iteration, there should be a discussion of what the definition of “done” is to align on the outcome and avoid scope creep.


Industrial companies are looking to AI to boost their plant operations—to reduce downtime, proactively schedule maintenance, improve product quality, and so on. However, achieving operational impact from AI is not easy. To be successful, these companies will need to engineer their big data to include knowledge of the operations (such as mass-balance or thermodynamic relationships). They will also need to form cross-functional data-science teams that include employees who are capable of bridging the gap between machine-learning approaches and process knowledge. Once these elements are combined with an agile way of working that advocates iterative improvement and a bias to implement findings, a true transformation can be achieved.
