Statistically valid imputation as a tool for data preprocessing, model validation, and generative modeling.
Imputation is the process of inferring unknown data. It has useful applications in model validation, data preprocessing, and generative modeling. When dealing with sequential data, the hidden Markov model provides one of many possible tools for imputation; it is especially useful in practice because of its generative nature.
Dealing with missing data
In data science, it is very common to encounter incomplete, missing, or insufficient data. Imputation is the practice of inferring data where it is not explicitly known. Some naive methods for imputation include filling in missing values with the previous or next non-missing value. It’s also common to fill in missing data with fixed values such as 0 or the mean of the dataset. These naive techniques may have some utility, depending on the goal of the imputation, but they tend to introduce bias into the data. For sequential data in particular, these methods ignore valuable contextual information.
More sophisticated forms of imputation involve fitting a function to the known data (this method is called interpolation) or, more generally, fitting a mathematical model to the known data and using statistical inference to determine the missing values. Before we talk about some explicit goals of imputation, let’s consider what might go wrong with naive imputation.
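For reference, the naive fills and the interpolation mentioned above might look like this in pandas. This is a minimal sketch: the series `s` and its values are made up for illustration, and NaN is assumed to mark a missing reading.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with two gaps (NaN marks a missing value).
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

forward = s.ffill()          # fill with the previous non-missing value
backward = s.bfill()         # fill with the next non-missing value
zero = s.fillna(0.0)         # fill with a fixed value
mean = s.fillna(s.mean())    # fill with the mean of the observed values
linear = s.interpolate()     # linear interpolation between known neighbors
```

Only the last option uses the values on both sides of a gap, and none of them says anything about how uncertain the filled value is.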
The distinction between an accurate imputation and a statistically valid one may seem minor, but consider a scenario where a wind turbine has an alarm sensor that is “off” 99% of the time. If we impute missing data for the alarm sensor by simply setting every missing value to “off”, we can expect an accuracy of about 99%, which seems good.
However, this method is statistically invalid: the imputed values contain no alarms at all, so the alarm rate in the imputed data understates the actual alarm rate. Furthermore, forecasting with this method amounts to assuming that the alarm will never be triggered, which could be disastrous. It is therefore important to impute in a statistically valid way, in a manner that considers the context of the data and the purpose of the imputation. In what follows, we discuss three possible use cases for imputation.
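The alarm example can be checked with a small simulation. The sketch below is illustrative only: the alarm signal is synthetic, 20% of the readings are hidden at random, and the "statistically informed" alternative simply samples from the observed alarm rate, which still ignores the sequential context that a model such as an HMM would use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical alarm signal: "on" (1) about 1% of the time.
alarm = (rng.random(100_000) < 0.01).astype(int)

# Hide a random 20% of the readings.
missing = rng.random(alarm.size) < 0.2
truth = alarm[missing]

# Naive imputation: every missing value is set to "off".
naive = np.zeros(truth.size, dtype=int)

# Rate-preserving imputation: sample from the alarm rate of the known data.
observed_rate = alarm[~missing].mean()
sampled = (rng.random(truth.size) < observed_rate).astype(int)

naive_accuracy = (naive == truth).mean()   # high, close to 99%
naive_rate = naive.mean()                  # exactly zero alarms
sampled_rate = sampled.mean()              # close to the true alarm rate
true_rate = truth.mean()
```

The naive fill scores well on accuracy while reporting an alarm rate of zero; the sampled fill is less accurate value by value but reproduces the alarm rate.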
One preliminary use of imputation is to validate a trained model. By artificially redacting data and then imputing it, it’s possible to get a sense of how well the model understands the data by comparing the imputed values against the redacted originals. This is a useful tool for understanding the overall fitness of a model.
Diagram showing how imputation is used to validate a trained model.
In the best case, a well-trained model will impute values that equal the original values. More likely, the imputed values are as statistically valid as the original values, given everything we know.
A sample visualization showing model validation from our companion post, “How to use data visualization to validate imputation tasks”.
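The redact-then-impute check described above can be sketched as follows. Everything here is an assumption made for illustration: the signal is a synthetic sine wave, `redact` and `impute_linear` are hypothetical helpers, and linear interpolation stands in for whatever trained model is being validated.

```python
import numpy as np

def redact(values, fraction, rng):
    """Return a copy with a random fraction of entries set to NaN, plus the mask."""
    mask = rng.random(values.shape) < fraction
    mask[0] = mask[-1] = False   # keep the endpoints so every gap has two neighbors
    redacted = values.copy()
    redacted[mask] = np.nan
    return redacted, mask

def impute_linear(values):
    """Stand-in imputer: linear interpolation across the NaN gaps."""
    idx = np.arange(values.size)
    known = ~np.isnan(values)
    return np.interp(idx, idx[known], values[known])

rng = np.random.default_rng(1)
original = np.sin(np.linspace(0, 4 * np.pi, 200))

redacted, mask = redact(original, fraction=0.2, rng=rng)
imputed = impute_linear(redacted)

# Score the imputation only on the values that were hidden.
rmse = np.sqrt(np.mean((imputed[mask] - original[mask]) ** 2))
```

A point error such as `rmse` only measures closeness to the original values. Checking statistical validity would also mean comparing distributions of imputed and original values, as in the alarm example.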
Imputation is also a useful tool for data preprocessing. It is not uncommon to receive data with incomplete or missing observations. Using statistically informed imputation, these values can be imputed by considering the probability distribution of possible values given the known observations surrounding the missing data points.
Diagram showing how imputation is used to fill missing or incomplete data.
The tools that allow imputation to be used for data preprocessing can also be used for short-term forecasting, by predicting “next” observations based on the current set of observations.
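As a simplified illustration of both ideas, consider a plain first-order Markov chain (not yet a hidden Markov model). The transition matrix and the helper names below are hypothetical. For a single gap between two known values, the distribution of the missing value is proportional to the probability of stepping into it from the previous value times the probability of stepping out of it to the next value.

```python
import numpy as np

# Hypothetical first-order Markov chain over three categories; rows sum to 1.
transition = np.array([[0.80, 0.15, 0.05],
                       [0.20, 0.60, 0.20],
                       [0.05, 0.25, 0.70]])

def impute_between(prev_value, next_value):
    """P(x_t | x_{t-1}, x_{t+1}) for a single gap between two known values."""
    weights = transition[prev_value, :] * transition[:, next_value]
    return weights / weights.sum()

def forecast_next(last_value):
    """P(x_{t+1} | x_t): the same model used for a one-step forecast."""
    return transition[last_value]

gap = impute_between(prev_value=0, next_value=2)
forecast = forecast_next(last_value=1)
```

The result is a full distribution over the candidate values, which can be sampled or summarized by its most likely entry depending on the purpose of the imputation.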
For a model with an underlying generative structure, imputation is an important tool for data generation. A model is generative if it models the joint probability distribution of both observations and target data. Because of this property, generative models have been used to great effect in tasks such as image and text generation.
Diagram showing how imputation is used to generate a new dataset with similar features and values to the original one.
Moreover, generative models give a probability distribution over the complete data; in the case of hidden Markov models, this means both observations and hidden states. With just enough data for initial training, such a model can be used to generate an unlimited amount of similar data.
Better imputation with hidden Markov models
The hidden Markov model (HMM) is an important tool for modeling sequential data. It detects hidden patterns in multivariate sequential data by considering not only the characteristics of isolated observations but also the order in which the observations occur.
In this way the HMM captures not only correlations between variables but also correlations between observations that are close in time, modeled via underlying “hidden states”. This information is encoded in terms of transition probabilities (i.e. the probability of moving from one hidden state to the next) and emission probabilities (i.e. the probability of seeing a given observation in a given hidden state).
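To make the two kinds of probabilities concrete, here is a minimal sketch of a two-state HMM over a binary sensor, together with a sampler that generates new sequences from it. All numbers are invented for illustration, and `sample` is a hypothetical helper rather than part of any library.

```python
import numpy as np

# Hypothetical two-state HMM over a binary sensor; all numbers are illustrative.
start = np.array([0.9, 0.1])             # P(first hidden state)
transition = np.array([[0.95, 0.05],     # P(next state | current state)
                       [0.20, 0.80]])
emission = np.array([[0.99, 0.01],       # P(observation | hidden state)
                     [0.30, 0.70]])

def sample(n_steps, rng):
    """Draw a sequence of hidden states and observations from the model."""
    states = np.empty(n_steps, dtype=int)
    observations = np.empty(n_steps, dtype=int)
    states[0] = rng.choice(2, p=start)
    observations[0] = rng.choice(2, p=emission[states[0]])
    for t in range(1, n_steps):
        # The next state depends only on the current state...
        states[t] = rng.choice(2, p=transition[states[t - 1]])
        # ...and each observation depends only on its own hidden state.
        observations[t] = rng.choice(2, p=emission[states[t]])
    return states, observations

rng = np.random.default_rng(0)
states, observations = sample(1000, rng)
```

Because state 1 tends to persist (0.80) and emits "1" most of the time, the generated observations show runs of alarms rather than isolated ones, which is the kind of temporal structure a fixed-value fill cannot reproduce.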
An HMM is typically trained using the Baum-Welch expectation-maximization algorithm, and once trained can be a powerful tool for imputation. For any incomplete observation, an HMM can be used to infer the most likely hidden state from the surrounding observations and the transition probabilities, and then the most likely observation for that state from the emission probabilities.
By determining not only the relative likelihood of hidden states, but also the corresponding likelihood of observations, the HMM can be used to fill in missing values and generate data.
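One way to sketch this is with the forward-backward algorithm on a discrete HMM, treating a missing observation as a step that carries no evidence. The code below is an assumption-laden illustration, not the method of any particular library: the parameters are invented, `impute_missing` is a hypothetical helper, and missing values are assumed to be marked with -1.

```python
import numpy as np

def impute_missing(obs, start, transition, emission, missing=-1):
    """Return P(observation_t | all known observations) for each missing step t."""
    n_steps, n_states = len(obs), len(start)

    # Likelihood of each observation under each state; missing steps carry no evidence.
    like = np.ones((n_steps, n_states))
    for t, o in enumerate(obs):
        if o != missing:
            like[t] = emission[:, o]

    # Forward pass, normalized at each step for numerical stability.
    alpha = np.zeros((n_steps, n_states))
    alpha[0] = start * like[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n_steps):
        alpha[t] = (alpha[t - 1] @ transition) * like[t]
        alpha[t] /= alpha[t].sum()

    # Backward pass, also normalized at each step.
    beta = np.ones((n_steps, n_states))
    for t in range(n_steps - 2, -1, -1):
        beta[t] = transition @ (like[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()

    # Posterior over hidden states given every known observation.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Mix the emission distributions by the state posterior at each missing step.
    return {t: gamma[t] @ emission for t, o in enumerate(obs) if o == missing}

# Illustrative parameters for a two-state HMM over a binary sensor.
start = np.array([0.9, 0.1])
transition = np.array([[0.95, 0.05], [0.20, 0.80]])
emission = np.array([[0.99, 0.01], [0.30, 0.70]])

near_alarms = impute_missing([0, 0, 1, -1, 1, 0], start, transition, emission)
near_quiet = impute_missing([0, 0, -1, 0, 0], start, transition, emission)
```

With these made-up parameters, the gap flanked by alarms is more likely than not to be an alarm, while the gap flanked by quiet readings is very likely to be quiet. Sampling from the returned distributions, rather than always taking the most likely value, is what keeps the imputed data statistically consistent with the model.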
For any imputation method, it is helpful to have a sense of the technical underpinnings of the methodology to get a true understanding of its capabilities and limitations. In an upcoming post on When Machines Learn we’ll dive deeper into the technical methodology of imputation using hidden Markov models.