Data Science

Using Hidden Markov Models to Detect Seasonality in Sequential Data

June 18, 2020

Anna Haensch Ph.D.

TL;DR

In this post we will discuss the advantages of performing seasonal clustering on time series data with a hidden Markov model as compared to a traditional spatial clustering algorithm. We further explore how applying this clustering improves possible applications to important machine learning tasks such as operating mode classification.

Understanding the Seasons

The seasons winter, spring, summer, and fall, are a mechanism that we use for both classifying weather conditions and describing the time of year. Humans have long been aware of the cadence of the seasons, and have a reasonably consistent way of interpreting them in terms of local weather patterns; we know that there are four of them, we know they occur annually in a predictable, sequential way, and we know some general characteristics of the weather for each of the seasons.

In the northeastern United States, summer is usually warm, while winter is usually cold; trees tend to bud in the spring and turn brown in the fall.

What’s not obvious is whether the definition of seasons is something inherent to the weather itself, or something that requires some presupposed knowledge of the cyclic nature of seasons.

If we didn’t know that seasons were cyclic, could we still detect the seasons from the weather?

In this post, we will explore whether it’s possible to reclaim the definition of seasons, directly from raw data through machine learning.

Finding the Seasons in Data

Using weather data from DarkSky, gathered every 12 hours from January 1, 2019 through December 31, 2019 for Cambridge, MA, as well as data from an array of nine 300 Watt Solar panels located in Cambridge, we will consider different types of models that have the potential to learn and identify seasons. We will also consider models that can solve the slightly more complicated problem of correctly identifying the season on seasonably uncharacteristic days.

The uppermost plot above gives the difference in 7-day rolling mean temperature over 30 days. In addition the five continuous measurements above, we also have a categorical measurement giving one of 18 narrative summaries of the weather for each 12-hour block, such as, “heavy rain” or “humid and partly cloudy.”

Raw data variables used as features in training our Machine Learning model.

From the plots above, we can see that raw weather data exhibits a clear seasonality. If we were to cluster dates together by some measure of similarity, it would be reasonable to expect, for example, the warm, sunny, high UV index days of June, July and August to belong to the same cluster.

Spatial Clustering Techniques

One simple method of clustering is K-means clustering, which partitions data into groupings by assigning each observation to a cluster with the closest mean value. We refer to such an algorithm as a spatial clustering technique, since the observed values are considered as points in some high dimensional space — 6 dimensions, in this case — and the clusters are related to their position in this space. In the plot below we perform K-means clustering on the data and specify four clusters to reflect the four seasons we hope to discover.

Clustered states identified in the data with the K-means clustering algorithm.

At first glance, the plot above doesn’t immediately demonstrate seasonal clustering. However, looking closely, we see that in certain date ranges there are dominant clusters which are likely groupings of days which are most typical to the season. For example, cluster 2 is dominant in the summer. A weakness of this algorithm is that it’s not sensitive to outliers. The algorithm has no way of understanding that a cold day in August is different from a cold day in February.

To counter this, we can turn to a more advanced spatial clustering algorithm, such as HDBSCAN, which draws a distinction between core points and outliers. For the purposes of weather classification, the outliers can be thought of as anomalies in the data. Performing HDBSCAN clustering on our data leads to the labeling of seasons below, which we can see comes fairly close to our conception of the seasons, particularly in the clustering of summer days.

Clustered states identified in the data with the HDBSCAN clustering algorithm.

However, this still leaves us with the problem of assigning seasons to the anomalous days. Spatial clustering algorithms do not take into account the sequential nature of the data, in which there is still a great deal of signal to be found. For example, an unusually hot day can often be classified as such by comparing it to the most recent typical day. For this task, we will introduce hidden Markov modeling.

Hidden Markov Modeling

A Markov chain is a series of events (also called states) in which the probability of the next event depends only on the current state. If we loosely categorize all possible weather scenarios into four different types, we can consider the series of weather types for each 12 hour period over one year as a Markov chain.

In this way, we can give some measure to the likelihood of the weather type tomorrow given the weather type today by considering the various transition probabilities.

Modeling the changing weather in this way gives a minimal backbone from which we can describe the changing of the seasons. In particular, we can tune it to be sensitive to the fact that day after day, the season tends to stay the same, and the seasons must follow one after the other in an ordered way.

In our example, we pretend not to know the seasons for each day, but rather, we hope to infer them solely from the observed data measurements. Therefore, in this example the Markov chain is a sequence of hidden states. In general, a system where a series of hidden states satisfying the Markov rule have a corresponding series of observations is called a hidden Markov model (HMM). In its most generic rendering, an HMM can be visualized as in the diagram below, where at each time t we have observed data, X(t), and a hidden state, Z(t).

Graphical representation of a hidden Markov model.

An HMM has three important parameters associated with it:

The initial state probability, which gives the probability of each of the hidden states at the first recorded timestamp.
The emission probability which gives the probability of the observed data (e.g. the weather), given each of the hidden states.
The transition probability matrix which gives the probability of transitioning from one hidden state to another at any given time.

Using the celebrated Baum-Welch expectation-maximization (EM) algorithm, we can use the observed raw weather data to train an HMM to learn optimal initial state, emission, and transition parameters. The EM algorithm is an iterative process of optimizing model parameters by incrementally optimizing the likelihood of the observed data. For somewhat technical reasons, the algorithm is guaranteed to terminate, but will only terminate at locally optimal model parameters.

In practice, this means that the outcome of EM is dependent on the choice of initializing model parameters. It’s important to seed the algorithm carefully both for outcome and run-time considerations. We will soon see how this choice of initial model parameters can impact the performance of the HMM.

In both of the following examples, the initial state probability and emission probabilities are seeded randomly and in both we set the number of hidden states to be equal to four. In the first example, the transition probability matrix is also seeded randomly. This means that the probability of changing from any season to another from one day to the next is totally random.

As we see below, even after training to a locally optimal set of parameters, this model performs quite poorly at the task of detecting seasons.

Detecting four hidden states with a non-cyclic hidden Markov model.

In fact, even though the model knows about four hidden states, only three of them actually occur. This is typical in a poorly seeded HMM, since it’s easy to converge to a locally optimal solution which is far from globally optimal. On the other hand, suppose we seed the EM algorithm with the following more informed transition probability matrix:

The above matrix captures two important features of seasonality. The entry in row i and columns j of this matrix represents the probability of transitioning from hidden state i to hidden state j. In the matrix above, the largest values occur along the main diagonal. This reflects the fact that the probability of remaining in the same season from one day to the next is very high.

In addition, we notice that each row has only one additional non-zero value off the main diagonal. This is a reasonable choice for our transition matrix since we know that each season has only one other season which follows it. The probability of transitioning into any season other than the current one or the season immediately following it must be zero.

With these values set to zero, the cyclic nature of the HMM will be preserved by training with EM since transition probabilities seeded with zero won’t change. Once trained, we can use the Viterbi algorithm with our model to label our data with the most likely sequence of states.

Detecting four hidden states with a cyclic hidden Markov model.

The plot above incorporates the sequential nature of the data as well as the cyclic nature of seasons to detect the seasons of the year with reasonable accuracy.¹ Notice how this differs from the labeling produced by the spatial clustering models which also detected seasonal trends, but were not able to correctly label days that exhibited seasonally uncharacteristic weather patterns.

A visual summary of season detection in weather and solar data.

Conclusion

Clustering is very effective in enhancing the interpretability of data, but the power of hidden Markov modeling is in its ability to contextualize sequential raw data. Consequently, this type of modeling can be used to great effect in decision making processes and this framework has broad applicability to industrial settings including:

Operating mode detection and analysis.
Time-to-event forecasting for industrial assets and machinery.
Imputation of missing data to fill in otherwise degraded or incomplete data.
Generative modeling for the training of robust models even when only a very limited amount of initial training data is available.

Stay tuned for more posts about these topics and more on When Machines Learn.

¹ You might notice that the dominant hidden states for the cyclic HMM plot appear to cycle 0, 1, 3, 2, 0. In fact, if you look closely, you see small colored bands between the dominant cycles, so the cyclic nature of the HMM is still preserved but in an unexpected way.

‍

Copied!