Survival Analysis, Part 2: Taking advantage of static data

In Part 1, we covered the Weibull model and its applicability to modeling the distribution of failure times for a generic piece of equipment. We also delved into the useful statistics that can be extracted from this seemingly simple model using the tools of survival analysis. In Part 2, we’ll talk about how we can use static asset-specific data to improve upon our modeling capabilities.

[Image: Static data engraved on a transformer nameplate.]

Encoding static data

Let’s return to the example of fictional Company X, which maintains a fleet of 10,000 hypothetical machines. Each time they install a machine, Company X records the following information about the machine in their database:

  • Manufacturer
  • Electrical rating
  • Maximum power output

We say each of these features is static, because they are intrinsic properties of the machine and do not change over the machine’s lifetime. Employees at Company X agree that machines made by certain manufacturers tend to last a little longer than others, and that electrical rating and maximum power output are also correlated with each machine’s useful life. But they’re not clear on how to use this information to inform their understanding of how long each asset is likely to last. Luckily, survival modeling is here to help…

The first step when dealing with categorical data (data which can only take on a finite number of values, corresponding to different categories) is to transform it into a numeric value that a machine learning model can make sense of. In general, machine learning models can’t directly use the information that an asset was manufactured by “Manufacturer A” without doing some preprocessing first.

Here’s one simple way to go about it. Imagine all machines in Company X’s fleet are made by one of three manufacturers: Manufacturers A, B, and C. Then we could just assign the code “1” to machines made by manufacturer A, “2” to machines made by manufacturer B, and “3” to machines made by manufacturer C. 1, 2, and 3 are perfectly good numbers, and can be easily operated upon by a machine learning model. Problem solved, right?

Not quite. By blithely creating an arbitrary mapping between categories and numeric values, we impose an ordering and a relative magnitude on the manufacturers without meaning to. For a linear model, the contribution of “machine made by manufacturer C” will necessarily be three times the contribution of “machine made by manufacturer A”, just because of the way we encoded the categorical data! This is an issue, but luckily there’s a simple and elegant solution known as one hot encoding.
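To make the pitfall concrete, here is a minimal Python sketch. The lifetimes are made-up numbers chosen for illustration (they are not Company X data), and the fit is an ordinary straight line through the integer codes. Because manufacturer B sits between A and C on the code axis, the model is forced to predict evenly spaced lifetimes, even though B is clearly the odd one out.

```python
import numpy as np

# Hypothetical average lifetimes (years) per manufacturer; not real data.
codes = np.array([1.0, 2.0, 3.0])        # A -> 1, B -> 2, C -> 3
lifetimes = np.array([10.0, 4.0, 9.0])   # B is the worst, A and C are similar

# Fit y = intercept + slope * code by least squares.
slope, intercept = np.polyfit(codes, lifetimes, deg=1)
predicted = intercept + slope * codes

# A straight line in the code can only produce evenly spaced predictions.
steps = np.diff(predicted)
residuals = lifetimes - predicted

print("predicted lifetimes:", predicted.round(2))
print("step A->B and step B->C:", steps.round(2))
print("residuals:", residuals.round(2))
```

The two steps come out identical, so the model cannot express "B is worse than both A and C" no matter how the slope is chosen.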

One hot encoding

[Image: Original data]
[Image: One hot encoded data]

The example above shows how one hot encoding works. For each unique value of our static variable of interest (in this case, manufacturer), we create a new binary variable that encodes whether the machine was made by that manufacturer or not. The advantage of this approach is that the model can now weight the effect of each manufacturer separately. To understand why, it’s useful to forget about the Weibull model for a minute (we’ll be coming back to it) and see how this information could be used to construct a linear regression model.
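As a rough illustration of the encoding step, the sketch below one hot encodes a handful of machines with NumPy. The machine list is hypothetical, and the variable names are placeholders rather than anything from Company X’s database.

```python
import numpy as np

# Hypothetical manufacturer recorded for four machines; not real data.
manufacturers = np.array(["A", "B", "C", "A"])

# Unique categories in sorted order: A, B, C.
categories = np.unique(manufacturers)

# Compare every machine against every category: one row per machine,
# one 0/1 column per manufacturer.
one_hot = (manufacturers[:, None] == categories[None, :]).astype(int)

for name, row in zip(manufacturers, one_hot):
    print(name, row)
```

Each row contains exactly one 1, in the column of that machine’s manufacturer. In practice, the same result is usually obtained with a library helper such as `pandas.get_dummies`.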

Imagine we’re trying to predict the average lifetime of a machine (in years) based only on the manufacturer. The linear regression problem would look something like

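$$y = \beta_0 + \beta_1 x_A + \beta_2 x_B + \beta_3 x_C$$

where $x_A$, $x_B$, and $x_C$ are the one hot encoded indicators: each is 1 if the machine was made by that manufacturer and 0 otherwise.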

y represents our estimate of the machine’s lifetime in years, and β₀, β₁, β₂, and β₃ are the parameters that we are trying to learn from the observed data. We can only use machines which have already failed in this calculation, since for machines still in service we know only a lower bound on their lifetimes (this phenomenon is known as censoring, and handling it properly is one of the central distinctions that separates survival analysis from other techniques). We can use a simple method such as OLS (Ordinary Least Squares) regression, which minimizes the sum of squared errors, to find the coefficients β₀, β₁, β₂, and β₃ that best fit our data. One technical caveat: because exactly one indicator equals 1 for every machine, the intercept and the three manufacturer coefficients are not uniquely determined unless we add a constraint, such as requiring β₁ + β₂ + β₃ = 0 or dropping one manufacturer as a reference level. Under the sum-to-zero constraint, the coefficients have a very useful interpretation: β₀ is the average lifetime across manufacturers, and the coefficient associated with a particular manufacturer is our best estimate of how much longer (or shorter) its machines are likely to live than that average. If we run OLS regression and find β₁ = -4, we estimate that machines made by manufacturer A are likely to live 4 years less than the average machine.
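Here is a small sketch of that regression in Python. The lifetimes are invented for illustration, and the sum-to-zero constraint on the manufacturer coefficients is an assumption added here so that the intercept can be read as the average across manufacturers. The constraint is appended to the least squares problem as one extra row.

```python
import numpy as np

# Hypothetical lifetimes (years) of machines that have already failed.
manufacturers = np.array(["A", "A", "B", "B", "C", "C"])
lifetimes = np.array([6.0, 8.0, 10.0, 12.0, 14.0, 16.0])

# Design matrix: an intercept column plus one indicator per manufacturer.
categories = np.array(["A", "B", "C"])
indicators = (manufacturers[:, None] == categories[None, :]).astype(float)
X = np.column_stack([np.ones(len(lifetimes)), indicators])

# Assumption: add the constraint beta_1 + beta_2 + beta_3 = 0 as an extra row.
# Without it, the intercept and the indicator coefficients are not unique.
X_aug = np.vstack([X, [0.0, 1.0, 1.0, 1.0]])
y_aug = np.append(lifetimes, 0.0)

# Ordinary least squares on the augmented system.
beta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

for name, value in zip(["beta_0", "beta_1 (A)", "beta_2 (B)", "beta_3 (C)"], beta):
    print(f"{name}: {value:.2f}")
```

With these toy numbers the per-manufacturer averages are 7, 11, and 15 years, so the fit recovers an average of 11 years and a coefficient of -4 for manufacturer A, matching the interpretation described above.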

The Cox Proportional Hazards Model

While linear regression is an excellent tool for many applications, it doesn’t fit naturally into the framework of survival analysis, not least because it has no way to use censored machines. We would therefore like to combine the Weibull hazard based solely on observed machine lifetimes (described in detail in Part 1) with the effect of the machine-specific features we’ve been discussing. A popular method for doing exactly that is the Cox Proportional Hazards model, a mainstay of survival modeling introduced in 1972 by David Cox. As a reminder, the hazard function for an asset is defined as

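$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}, \quad \text{with } T \text{ the failure time,}$$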

which can be read as the failure rate at time t: roughly the probability of failure during the next time period, given the asset has survived up until time t. In the case where we only have machine lifetime data, covered in Part 1, we compute a base hazard function h₀(t) that is our best estimate of the hazard for all of the machines in the fleet. But now that we have access to both the base hazard function and the features of each individual machine, we’d like to make use of the new information to compute machine-specific hazard functions, which will depend on the features of each machine as well as the base hazard. How do we incorporate both of these information sources in a single equation for hazard?
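For readers who want to see the base hazard as numbers, the sketch below evaluates a Weibull hazard and compares it with the conditional probability of failing within the next year. The shape and scale values are hypothetical placeholders, not parameters fitted in Part 1.

```python
import numpy as np

def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_survival(t, shape, scale):
    """Probability that a machine survives beyond time t."""
    return np.exp(-((t / scale) ** shape))

# Hypothetical parameters: shape > 1 means the hazard grows with age.
shape, scale = 2.0, 30.0
t = np.array([5.0, 10.0, 20.0, 40.0])   # machine ages in years

h0 = weibull_hazard(t, shape, scale)

# Probability of failing during the next year, given survival up to age t.
p_next_year = 1.0 - weibull_survival(t + 1.0, shape, scale) / weibull_survival(t, shape, scale)

for age, h, p in zip(t, h0, p_next_year):
    print(f"age {age:>4.0f}: hazard {h:.4f}, P(fail within a year) {p:.4f}")
```

The two columns are close but not identical, which is why the hazard is best thought of as a rate that approximates a one-period failure probability.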

The assumption of the Cox model is that the combination of features belonging to a specific machine multiplies the base hazard by a constant factor that does not change over time, which we call the scaling factor. For example, let’s assume the machine with features

[Image: Static features of a hypothetical machine.]

is twice as likely to fail compared to the average machine in the fleet. Let’s also assume that each machine has three features, X₁, X₂, and X₃. Then we would like a function f that will satisfy:

$$h_n(t) = f(X_1, X_2, X_3)\, h_0(t) = 2\, h_0(t)$$

Our function f does exactly what we’d like: we give f the values of the features for machine n, and it spits out the correct scaling factor, which happens to be 2 in this case. We have to impose some constraints on f, though; it has to return a positive output for all inputs, since the hazard is a failure rate and can never be negative. The next key assumption is that each feature impacts the hazard differently, and therefore requires a tunable parameter β that tells us how much it contributes to the hazard for machine n. This is exactly analogous to the linear regression example we discussed earlier; however, we will need to use something a little more complicated than OLS in order to get the “right” values of β (more on that in a future post!).

Finally, we have all the pieces we need:

  • A base hazard function computed by fitting the Weibull model to data on observed machine lifetimes
  • A set of learned parameters β₁, β₂, and β₃ that encode how much each feature X₁, X₂, and X₃ contributes to hazard (there is no separate intercept β₀ here, since any constant factor is absorbed into the base hazard)
  • A link function that ensures the hazard will be positive (a negative failure rate doesn’t make any sense).

Choosing the exponential link function eˣ, we arrive at the Cox model:

$$h_n(t) = h_0(t)\, \exp(\beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)$$

When we evaluate the exponential term, we simply end up with a positive number that multiplies the base hazard function, resulting in a scaled hazard function tailored to the risk associated with that specific machine. And as in the linear regression case, each parameter βᵢ has a direct interpretation: a positive βᵢ means that larger values of Xᵢ increase the hazard, a negative βᵢ means they decrease it, and holding the other features fixed, a one-unit increase in Xᵢ multiplies the hazard by exp(βᵢ).
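Putting the pieces together, here is a minimal sketch of a machine-specific hazard under the Cox model. Everything numeric is an assumption made for illustration: the Weibull parameters of the base hazard, the meaning assigned to the three features, and the β values, which are hand-picked (not learned) so that the example machine gets a scaling factor of 2.

```python
import numpy as np

def baseline_hazard(t, shape=2.0, scale=30.0):
    """Weibull base hazard h0(t); shape and scale are hypothetical values."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def cox_hazard(t, features, beta):
    """Machine-specific hazard: h0(t) scaled by exp(beta . features)."""
    scaling_factor = np.exp(np.dot(features, beta))
    return scaling_factor * baseline_hazard(t)

# Hypothetical features: X1 = made by manufacturer A (0 or 1),
# X2 = standardized electrical rating, X3 = standardized max power output.
beta = np.array([np.log(2.0), 0.3, -0.1])   # hand-picked, not learned
machine_n = np.array([1.0, 0.0, 0.0])

t = np.array([5.0, 10.0, 20.0])             # machine ages in years
ratio = cox_hazard(t, machine_n, beta) / baseline_hazard(t)

print("hazard ratio vs. base hazard at each age:", ratio.round(3))
print("effect of +1 unit of X2 on hazard:", np.exp(beta[1]).round(3))
```

The ratio is the same at every age, which is the "proportional" in Proportional Hazards: the features shift the whole hazard curve up or down by a constant factor without changing its shape.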

If you’re scratching your head and asking where the learned β parameters come from (or what it means to “learn” a parameter), you’re asking the right question. We’ll delve into the specifics of the machine learning magic behind survival modeling in Part 4, but first, I’ll cover how time-varying features can be incorporated in the survival framework in Part 3. As always, stay tuned for the next installment and thanks for reading!