Imputation is a tool for filling in missing data. There are many ways to impute, and in this post we explain a few practical methods. For imputing sequential data, the hidden Markov model will prove particularly useful.
Collecting Equipment Data
These days it’s very common for everything from powerful commercial wind turbines to common household appliances like ceiling fans, refrigerators, and even garden hoses to come equipped with sensors that feed data continuously to the cloud. When everything is working properly, this makes it easy for consumers to track their data and get daily snapshots of their equipment’s performance. However, anytime the Wi-Fi is down, the system might fail to capture data, resulting in unknown values in the historical record. This is especially problematic if we hope to analyze the data using machine learning algorithms that don’t necessarily handle missing data well.
As data scientists confronted with this situation, we’d like to fill in these values in the smartest way possible. While we can’t know the true value of this missing data, we can get better performance from these algorithms by making an educated guess at what the values would have been. Luckily, missing data points in the context of a more complete dataset leave us with plenty of clues for assigning the missing values.
Imputation is an inference-based method for assigning values to an otherwise unknown range of data. This practice has many applications in data science. Perhaps the most common (and the topic of discussion in this post) is filling in missing data during preprocessing. You can read more about other applications of imputation in Applications of Imputation for Sequential Data, or if you’re new to imputation, you might start by reading this brief Introduction to Imputation. In this post we’ll be taking a closer look at some common machine learning approaches for imputation and discussing why they work…or why they don’t.
To compare different methods of imputation, we’ll be intentionally redacting values from a dataset of energy production vs. demand. The energy production data is gathered from an array of nine 300-watt solar panels located on my roof in Cambridge, MA. This array sends energy production data to a remote server over my home’s Wi-Fi network, where it is then broadcast to a website and mobile app. We’ll also be looking at the electricity demand across the New England electrical grid from October 2018 to October 2020. Since the goal of this post is to measure the effectiveness of various imputation strategies, we’re beginning with a dataset that doesn’t have any missing values.
We’re going to focus on the task of imputation for time series data and evaluate the effectiveness of several different strategies, but before we proceed, a small note about mathematical rigor. In this post we’re going to judge the effectiveness of our imputation by visual inspection only. A lot can be learned by looking at a plot! Of course there are lots of mathematical ways to determine how well an imputation strategy works, but we’ll save those for another post.
Exploring the Data
Let’s start by looking at the data. As expected, energy production follows a very clear seasonal pattern, with higher energy production in the sunny summer months. The seasonality of the energy demand curve is a bit less obvious, but the demand is highest in summertime with a smaller peak in wintertime. If we look at a scatter plot of energy production against energy demand, we don’t see any obvious clustering of the data.
To test the various imputation strategies, we’re going to “forget” several values over the course of the data; this process is known as redacting data points. In the plots below we’ll highlight the redacted data points.
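The redaction step can be sketched in a few lines of NumPy. This is only an illustration: the helper name `redact`, the choice to blank out whole rows at random, and the synthetic two-column array standing in for production and demand are all assumptions, not the procedure used to produce the plots in this post.

```python
import numpy as np

def redact(values, fraction=0.1, seed=0):
    """Return a copy of `values` with a random subset of rows set to NaN, plus the mask."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n_rows = values.shape[0]
    n_missing = int(round(fraction * n_rows))
    # Pick distinct row indices to "forget".
    idx = rng.choice(n_rows, size=n_missing, replace=False)
    mask = np.zeros(n_rows, dtype=bool)
    mask[idx] = True
    redacted = values.copy()
    redacted[mask] = np.nan
    return redacted, mask

# Synthetic stand-in for the (production, demand) columns; not the real data.
data = np.random.default_rng(42).normal(size=(100, 2))
redacted, mask = redact(data, fraction=0.1)
```

Keeping the mask around makes it easy to compare imputed values against the originals later, since the original values at the redacted positions are still available in `data`.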
Our goal now is to fill in these missing values in a way that takes advantage of everything we can learn about the underlying distribution of the data.
Imputation with IterativeImputer
One thing we can do is impute the missing values using scikit-learn’s IterativeImputer. This method works by fitting a regression model for each feature that has missing values (in this case that’s “Energy Production” and “Energy Demand”), using the other features as predictors. The regression model is then used to predict the missing values. Unfortunately for our data, this means that the best this method can do is replace each missing value with the mean value for its feature. If we view this as a time series, we can see that the imputed value is never too wrong, since it’s always at the mean.
However, viewed as a scatter plot (above, on the right), we can see that we haven’t really gained any insight into the data from using this method of imputation, since every missing point, regardless of context, was filled in with the same value.
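For reference, a minimal sketch of calling IterativeImputer is shown here. The data is synthetic (two independent Gaussian columns loosely standing in for production and demand), and the decision to blank out every twentieth row is an assumption made for illustration. Note that IterativeImputer is still marked experimental in scikit-learn, so it has to be enabled with an explicit import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic two-feature data; the numbers are placeholders, not the real measurements.
rng = np.random.default_rng(0)
X = rng.normal(loc=[20.0, 15000.0], scale=[5.0, 2000.0], size=(200, 2))

# Redact both features in every twentieth row.
X_missing = X.copy()
X_missing[::20] = np.nan
missing_rows = np.isnan(X_missing).any(axis=1)

# Fit one regression per feature and fill in the NaNs.
imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X_missing)

# Compare the imputed rows with the column means of the observed data.
print(X_imputed[missing_rows][:3])
print(np.nanmean(X_missing, axis=0))
```

When a row has no observed features at all, the regression has nothing to condition on, which is one way to end up with imputed values that sit at or near the column means.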
Although the clustered data doesn’t present any obvious Gaussian properties, we know that there is some seasonality signal in the data. Therefore, a smart method of imputation would take advantage of the seasonality in the data, and differentiate between the mean values of the different seasonal components of the data.
One such method uses hidden Markov models (HMM). If you’ve never heard of a hidden Markov model, you might start by reading the previous post on this blog about Using Hidden Markov Models to Detect Seasonality in Sequential Data.
An HMM has the ability to detect latent states in multidimensional data that are not always visible to the naked eye. Training an HMM on the energy dataset, we can detect three hidden states, roughly corresponding to winter, summer, and a combined state of spring/fall.
There are now several ways that we can use the hidden Markov model to impute the missing data points. Let’s consider two possible strategies, and what we might gain from each.
HMM Imputation with Averaging
At each moment in time, we have a spread of probabilities across the possible hidden states. If our goal is to impute a value at time t, then we’ll first need to know the relative likelihood of each hidden state at time t, given everything that we know. In terms of probability, we’ll express the relative likelihood of hidden state i at time t as P(hₜ = i | D),
where hₜ denotes the hidden state at time t and D denotes the complete data available to us.
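One standard way to obtain these state probabilities is the forward-backward algorithm. The sketch below is an illustration for a one-dimensional Gaussian HMM whose parameters are already known; the function name `state_posteriors`, the two-state toy parameters, and the choice to give a missing observation an emission likelihood of 1 for every state are assumptions, not necessarily what was used for the plots in this post.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def state_posteriors(obs, start, trans, means, stds):
    """P(h_t = i | D) for a 1-D Gaussian HMM; NaN entries in `obs` are treated as missing."""
    obs = np.asarray(obs, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    T, K = len(obs), len(means)

    # Emission likelihoods. A missing observation carries no information,
    # so every state gets likelihood 1 at that time step.
    lik = np.ones((T, K))
    seen = ~np.isnan(obs)
    lik[seen] = gaussian_pdf(obs[seen][:, None], means[None, :], stds[None, :])

    # Forward pass, normalized at each step to avoid underflow.
    alpha = np.zeros((T, K))
    alpha[0] = start * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * lik[t]
        alpha[t] /= alpha[t].sum()

    # Backward pass, also normalized at each step.
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()

    # Combine and renormalize so each row is a probability distribution.
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

# Toy example: two "sticky" states with well-separated means, two missing points.
obs = [0.1, -0.2, np.nan, 0.3, 5.1, 4.8, np.nan, 5.2]
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
post = state_posteriors(obs, start, trans, means=[0.0, 5.0], stds=[1.0, 1.0])
```

In this toy example, the missing point at index 2 is surrounded by observations near 0, so its probability mass falls mostly on the first state, while the missing point at index 6 leans towards the second state.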
Let’s suppose that each of our hidden states corresponds to a Gaussian distribution. In this case, each hidden state, i, has a corresponding mean, μᵢ, and standard deviation, σᵢ, so observations in this hidden state are drawn from the Gaussian distribution N(μᵢ, σᵢ²).
Keep in mind that in a Gaussian distribution the mean acts as the most likely value to be drawn from the distribution. In terms of hidden states, this suggests that the most likely value to occur in any hidden state is the mean itself. This observation is already enough to devise a method of imputation. At any time, t, where data is missing, we can simply take our imputed value to be a weighted average of the means for each hidden state, with the relative likelihood of each hidden state as its weight, that is, x̂ₜ = Σᵢ P(hₜ = i | D) · μᵢ.
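The weighted average described above amounts to a single matrix product once the state probabilities are known. The following sketch assumes the probabilities are already available as an array with one row per time step; the function name `impute_weighted_mean` and the numbers are made up for illustration.

```python
import numpy as np

def impute_weighted_mean(obs, posteriors, means):
    """Fill NaN entries of `obs` with the probability-weighted average of the state means."""
    obs = np.asarray(obs, dtype=float)
    # Row t of `posteriors` holds the state probabilities at time t.
    expected = np.asarray(posteriors, dtype=float) @ np.asarray(means, dtype=float)
    # Keep observed values; use the weighted average only where data is missing.
    return np.where(np.isnan(obs), expected, obs)

# Illustrative numbers: three hidden states, one missing and one observed point.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])
means = np.array([10.0, 20.0, 30.0])
obs = np.array([np.nan, 28.0])

filled = impute_weighted_mean(obs, posteriors, means)
print(filled)  # first entry is 0.7*10 + 0.2*20 + 0.1*30 = 14, second stays 28
```

Because the weights are non-negative and sum to one, the imputed value always lies between the smallest and largest state means.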
This will deliver an imputed value that is more contextually relevant than scikit-learn’s IterativeImputer, since the relative likelihood of hidden states is completely informed by context. For this reason, we can see in the plot below that each of the missing values is imputed with a slightly different value. However, it is very unlikely that this method will precisely impute the correct value, since the values are typically going to be drawn towards the weighted mean, to the exclusion of extreme values. On the other hand, for data with this kind of seasonal structure we can expect this method to be more accurate than the basic scikit-learn IterativeImputer, and there is an upper bound to how wrong it can be, since the imputed value always lies between the smallest and largest state means. Let’s look at an even smarter version of HMM imputation.
HMM Imputation with Argmax
Similar to the method above, we will consider the spread of relative probabilities for each hidden state at a missing observation. Unlike the method above, we will impute with only the single most probable value, rather than taking a weighted average. Recall that for each hidden state the mean represents the most likely value to appear in the corresponding probability distribution. But also recall that each distribution has its own variance, so in terms of the probability density function, the means of two different Gaussian distributions aren’t considered to be equally likely. To overcome this, we will compute the image of the mean under the probability density function for each hidden state and take the mean that attains the maximum of those values to be our imputed value.
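One possible reading of this rule is sketched below. It assumes that each state's density at its own mean, which for a Gaussian is 1 / (σᵢ√(2π)), is weighted by that state's relative probability at time t before taking the maximum. That weighting is an interpretation rather than something spelled out here, and the function name `impute_argmax` and the numbers are hypothetical.

```python
import numpy as np

def impute_argmax(obs, posteriors, means, stds):
    """Fill NaN entries of `obs` with the mean of the highest-scoring hidden state."""
    obs = np.asarray(obs, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # Gaussian density evaluated at its own mean: 1 / (sigma * sqrt(2 * pi)).
    peak_density = 1.0 / (stds * np.sqrt(2.0 * np.pi))
    # Assumption: weight each peak density by the state's probability at time t.
    scores = np.asarray(posteriors, dtype=float) * peak_density
    best_state = scores.argmax(axis=1)
    # Keep observed values; use the winning state's mean where data is missing.
    return np.where(np.isnan(obs), means[best_state], obs)

# Illustrative numbers: state 0 is the most probable, but state 1 is much narrower.
posteriors = np.array([[0.5, 0.4, 0.1]])
means = np.array([10.0, 20.0, 30.0])
stds = np.array([4.0, 1.0, 2.0])
obs = np.array([np.nan])

filled = impute_argmax(obs, posteriors, means, stds)
print(filled)  # scores are proportional to 0.5/4, 0.4/1, 0.1/2, so state 1 wins
```

Under this reading, a narrow state can win over a slightly more probable but much wider one, which is the effect of taking the variance into account.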
Imputing with this method, we get the values below.
Notice how much closer the imputed values are to the original under the HMM Argmax method as compared to the HMM Averaging method. Of course none of these values are precisely correct, but they are far closer than the values produced by the other algorithms we’ve explored so far. Keep in mind that imputation will rarely yield the precisely correct value, since the internal calculations are always balancing the likelihood of multiple plausible values. Imputed values will have a tendency to drift towards the center.
In summary, each successive method was slightly better at imputing the redacted values. We can see this all together in the plot below. Notice how each successive imputation attempt is drawn closer and closer to the original values.
There are many ways to impute missing data, and we’ve presented just three of them here. As always, there are tradeoffs among the methods. Scikit-learn’s IterativeImputer is open source, easy to use, and requires no setup, but fails to capture the temporal dynamics of the data. When dealing with missing data that has any sort of temporal dependence, there is a lot to be gained by HMM imputation. An HMM is sensitive to the underlying distributions of the data, and it is able to impute values that are far from the overall mean of the data. However, the two HMM methods require a custom implementation and more computationally intensive training. As to which of the Argmax or the Averaging method is best, this depends on the goal of imputation and the nature of the underlying data.
For more about imputation and machine learning with sequential data, check out When Machines Learn.