
Data Requirements for AI in Manufacturing

Oct. 23, 2019
Artificial Intelligence offers the ability to learn complex patterns of information — assuming that the process is properly designed and provided with examples of the right data

At the core of state-of-the-art Artificial Intelligence (AI) algorithms is the ability to learn complex patterns from a sample of data. In the manufacturing context, an example of a pattern might be the ways in which a set of parameters contained in that data, which are related to a process in a plant, vary together. When considering AI, it’s important to understand what the data requirements are at the outset.

The algorithm learns the patterns by being shown many examples of the parameter values in question—typically between a few thousand and several million. This data sample is a representation of the history of the factory process. Now, if a trend exists in the sample to the effect that, for example, every increase in the process temperature by one degree Celsius tends to be accompanied by a decrease in the process’s time by ten seconds, the AI will learn this apparent relationship between the temperature and time parameters. In this way, the AI effectively learns a model of the process. It does so automatically, assuming that it is properly designed and fed enough examples of the right data.
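As a minimal illustration of learning such a relationship from historical examples, the sketch below fits a straight line to a handful of temperature and time pairs. The data values are invented for illustration, and ordinary least squares is used here only as the simplest possible stand-in; it is not a claim about the models used in practice, which would normally handle many parameters and non-linear effects.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical process history: one entry per production batch.
temperatures_c = [180.0, 181.0, 182.0, 183.0, 184.0, 185.0]
process_times_s = [600.0, 591.0, 579.0, 571.0, 560.0, 550.0]

slope, intercept = fit_line(temperatures_c, process_times_s)
print(f"Estimated change in process time per +1 deg C: {slope:.1f} s")
```

With these invented values the fitted slope comes out close to minus ten seconds per degree, which is the kind of apparent relationship the paragraph above describes.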

The right data for AI? — What constitutes the “right” data for AI-enabled process optimization? The general answer is the set of data that is sufficient to describe how changes to a process’s parameters affect quality. The bulk of process data can generally be represented as a table, or a collection of tables, comprising columns (parameters) and rows (production examples, with, say, one production batch per row). To be meaningful as a representation of a process, or more specifically of the history of a process, these tables need to be accompanied by some explanatory information. We look first at the kinds of explanatory information that are necessary, and then at the data requirements in terms of those tabular columns and rows.

The key pieces of explanatory information, required by the data science team, are:
• A high-level description of the physical process;
• A description of the flow of production through the process (normally in the form of a process-flow diagram), including in some contexts the time offsets between process steps;
• A description of how the data table(s) relate to the process.
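To show why the time offsets between process steps matter, the sketch below uses them to match an upstream sensor reading to the batch that was later inspected. The step names, offsets, and readings are all hypothetical; a real plant would take them from its process-flow diagram, and the offsets may well vary from batch to batch.

```python
from datetime import datetime, timedelta

# Hypothetical offsets: how long before final inspection each step acts on a batch.
STEP_OFFSETS = {
    "melting": timedelta(minutes=45),
    "casting": timedelta(minutes=10),
    "inspection": timedelta(0),
}


def step_time_for(step, inspected_at):
    """Return when `step` acted on the batch that was inspected at `inspected_at`."""
    return inspected_at - STEP_OFFSETS[step]


def nearest_reading(readings, target_time):
    """Pick the (timestamp, value) reading closest in time to target_time."""
    return min(readings, key=lambda reading: abs(reading[0] - target_time))


# Hypothetical melting-temperature log as (timestamp, degrees Celsius).
melting_readings = [
    (datetime(2019, 10, 1, 9, 0), 702.0),
    (datetime(2019, 10, 1, 9, 15), 711.0),
    (datetime(2019, 10, 1, 9, 30), 705.0),
]

inspected_at = datetime(2019, 10, 1, 10, 0)
melt_time = step_time_for("melting", inspected_at)
matched = nearest_reading(melting_readings, melt_time)
print(f"Batch inspected at {inspected_at:%H:%M} was melted around {matched[0]:%H:%M}")
```

Without the offset, the quality result would be paired with whatever the furnace was doing at inspection time, which describes a different batch.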

Some of these descriptions may be obtained from the available technical documentation. In most cases, however, the necessary insights can be learned by walking through the data tables with the plant specialists.

Due to the nature of AI-enabled parameter optimization, there are some clear fundamentals that the bulk of the data — the data tables — needs to satisfy.

Data columns represent quality — To begin, the data columns must include a representation of the quality result. It’s important to note that the data may not contain a full representation of how quality is measured in a manufacturing operation. Such gaps are common (with batch sampling, for example), but in some cases the available data is still sufficient to achieve dramatic results.

The second set of required data columns concerns process parameters. These fall into two types: controllable and non-controllable parameters.
• Controllable parameters are the ‘levers’ available to the factory operator to alter the process and thus to improve quality. In general terms, these could include controllable aspects of the process chemistry, temperature, and time.
• Non-controllable parameters represent inputs to the process that cannot be controlled by the plant operator from day to day, such as the ambient temperature, the identity of the machine (in the case of a parallel process), or characteristics of the input material.

These parameter columns should together represent the factors that have the greatest influence on quality.
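One way to make these column roles explicit is to record them alongside the table. The sketch below is an assumption about how that could look, not a description of any particular platform; the column names and units are invented.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    QUALITY = "quality"
    CONTROLLABLE = "controllable"
    NON_CONTROLLABLE = "non_controllable"


@dataclass(frozen=True)
class Column:
    name: str
    role: Role
    unit: str = ""


def check_schema(columns):
    """Return a list of problems; an empty list means the minimum roles are present."""
    roles = {column.role for column in columns}
    problems = []
    if Role.QUALITY not in roles:
        problems.append("no quality column")
    if Role.CONTROLLABLE not in roles:
        problems.append("no controllable parameter column")
    return problems


# Hypothetical schema for a single process table.
schema = [
    Column("defect_rate", Role.QUALITY, "%"),
    Column("furnace_temp_c", Role.CONTROLLABLE, "deg C"),
    Column("hold_time_s", Role.CONTROLLABLE, "s"),
    Column("ambient_temp_c", Role.NON_CONTROLLABLE, "deg C"),
    Column("machine_id", Role.NON_CONTROLLABLE),
]

print(check_schema(schema) or "schema has the minimum required roles")
```

Keeping the role next to each column also tells an optimizer which values it may recommend changing and which it can only observe.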

However, due to the ability of AI models to learn complex interactions in a large number of variables, a manufacturer is advised to make all available data points around the process available for inclusion in the AI model. The cost of including additional variables is low.

A good AI specialist will employ the necessary statistical techniques to determine whether each variable should be included in the final model. Variables that might be considered marginal at first may contribute to an AI model that leverages effects and interactions in the process, of which the specialists had previously been unaware, potentially resulting in an improved optimization outcome.
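The article does not say which statistical techniques are meant. As one deliberately simple example, the sketch below ranks candidate variables by the absolute Pearson correlation between each variable and the quality result. The table contents are invented, and this kind of screen only sees one variable at a time: it can miss exactly the interactions mentioned above, so a low score is a prompt for further checks, not a reason to discard a variable.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_variables(table, quality_column):
    """Sort the non-quality columns by |correlation with quality|, strongest first."""
    quality = table[quality_column]
    scores = {
        name: abs(pearson(values, quality))
        for name, values in table.items()
        if name != quality_column
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical process table: one list entry per production batch.
table = {
    "furnace_temp_c": [700.0, 710.0, 720.0, 730.0, 740.0],
    "ambient_temp_c": [21.0, 25.0, 19.0, 24.0, 22.0],
    "defect_rate": [5.0, 4.1, 3.0, 2.2, 1.0],
}

ranked = rank_variables(table, "defect_rate")
for name, score in ranked:
    print(f"{name}: |r| = {score:.2f}")
```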

Row-wise data requirements — Let’s turn now to the row-wise data requirements. The general rule here is that the data needs to be representative of the process, and in particular of the interactions that are likely to affect quality in the future. A basic aspect of this is to ask: how many rows, i.e. production examples, make a sufficient training set? The answer depends on the complexity of the process. The sample needs to be a sufficient representation of this complexity. In the manufacturing context, the lower bound typically ranges from a few hundred to several thousand historical examples. Training a model on more data than is strictly sufficient, however, tends to increase the model’s confidence and level of detail, which in turn is likely to further improve the optimization outcome.

A sufficient number of historical examples does not in itself guarantee a representative sample; the historical examples also should be representative with respect to time. The dataset should be sufficiently recent to represent the likely operating conditions — like machine wear — at the time of optimization. In many cases the data also should represent one or more sufficient periods of continuous operation, as this allows the AI to learn which operating regions can be sustained, as well as how effects from one part of the process propagate to others over time.
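Both conditions, recency and continuity, can be checked from the record timestamps alone. The following sketch assumes one timestamp per production record; the maximum gap and maximum age thresholds are placeholders that would have to be chosen per process.

```python
from datetime import datetime, timedelta


def continuous_periods(timestamps, max_gap):
    """Split timestamps into runs whose consecutive records are at most max_gap apart.

    Returns a list of (first_timestamp, last_timestamp, record_count) tuples.
    """
    runs = []
    for ts in sorted(timestamps):
        if runs and ts - runs[-1][-1] <= max_gap:
            runs[-1].append(ts)
        else:
            runs.append([ts])
    return [(run[0], run[-1], len(run)) for run in runs]


def is_recent(timestamps, now, max_age):
    """True if the newest record is no older than max_age at time `now`."""
    return now - max(timestamps) <= max_age


# Hypothetical hourly batch records: five records, a 12-hour stoppage, then four more.
start = datetime(2019, 10, 1, 6, 0)
timestamps = [start + timedelta(hours=h) for h in (0, 1, 2, 3, 4, 16, 17, 18, 19)]

periods = continuous_periods(timestamps, max_gap=timedelta(hours=2))
for first, last, count in periods:
    print(f"{first:%d %b %H:%M} to {last:%d %b %H:%M}: {count} records")

recent = is_recent(timestamps, now=datetime(2019, 10, 3), max_age=timedelta(days=7))
print("recent enough:", recent)
```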

Consistency and data availability — This brings us to the last key data requirement, namely consistency and continued availability. In order to keep the AI model current with operating conditions on the production line, fresh data needs to be available for regular “retrains” of the model. This, in turn, requires some level of integration with the data source. In a worst-case scenario, this might mean a continuous digitization process if the record-keeping system is offline, or manual exports of tabular data by factory technicians. These approaches are relatively labor-intensive and may be subject to inconsistencies.

An ideal setup would consist of a live data stream from the manufacturer’s data bus into a persistent store dedicated to supplying the AI training pipeline. For some manufacturers, a mixture of approaches is appropriate to cater for multiple plants.
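A persistent store of this kind can be sketched with nothing more than SQLite. Everything in the example is an assumption made for illustration: the table layout, the parameter names, and the use of a plain Python list in place of a real data bus, whose client library and message format would depend on the plant.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store; use a file path instead of :memory: to persist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS readings (
            recorded_at TEXT NOT NULL,
            parameter   TEXT NOT NULL,
            value       REAL NOT NULL,
            PRIMARY KEY (recorded_at, parameter)
        )
        """
    )
    return conn


def ingest(conn, messages):
    """Append (recorded_at, parameter, value) tuples, skipping redelivered duplicates."""
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT OR IGNORE INTO readings VALUES (?, ?, ?)", messages)


# Stand-in for messages arriving from the data bus.
messages = [
    ("2019-10-01T09:00:00", "furnace_temp_c", 702.0),
    ("2019-10-01T09:00:00", "ambient_temp_c", 21.5),
]

conn = open_store()
ingest(conn, messages)
ingest(conn, messages)  # simulated redelivery of the same messages

count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print("stored readings:", count)
```

The primary key on timestamp and parameter is one design choice for making ingestion idempotent, so that a redelivered message does not produce a duplicate training row.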

Continued data availability goes hand in hand with the requirement for data consistency. This can best be illustrated with a negative example, in which a factory intermittently changes the representation of variables in data exports, such as whether a three-state indicator is represented as a number from the set {1, 2, 3} or as a string of text from the set {‘red’, ‘orange’, ‘green’}. If uncaught, these types of changes could quietly corrupt the optimization model and potentially result in a negative impact on process quality.
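A check for the representation change described above could be sketched as follows. It assumes that the numeric form {1, 2, 3} is the agreed representation and that rows arrive as dictionaries; the column name is hypothetical. The function only flags offending rows, since deciding how to repair them (for example, whether 'red' maps to 1 or to 3) needs confirmation from the plant.

```python
ALLOWED_STATES = {1, 2, 3}  # assumed canonical representation of the indicator


def validate_indicator(rows, column):
    """Return (row_index, offending_value) for every row that breaks the convention."""
    problems = []
    for index, row in enumerate(rows):
        value = row.get(column)
        is_plain_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_plain_int or value not in ALLOWED_STATES:
            problems.append((index, value))
    return problems


# Hypothetical export in which the representation changes partway through.
rows = [
    {"batch": "A-001", "status_light": 1},
    {"batch": "A-002", "status_light": 3},
    {"batch": "A-003", "status_light": "green"},
    {"batch": "A-004", "status_light": "red"},
    {"batch": "A-005"},
]

problems = validate_indicator(rows, "status_light")
for index, value in problems:
    print(f"row {index}: unexpected status_light value {value!r}")
```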

The digitization and automation of process data infrastructure and data exports go a long way toward addressing these issues. Whatever the plant’s data infrastructure, however, a good AI ingest pipeline should feature a robust data validation layer, to ensure inconsistencies are flagged and fixed.
Joris Stork is a senior data scientist for DataProphet, developer of the OMNI artificial intelligence platform for manufacturing. Contact him via LinkedIn.
