Introducing Shapley values for TimeGPT

SHAP values provide a powerful approach for better understanding and validating your predictions. See how to use them in TimeGPT.

In the version 2 release of the TimeGPT API, we added support for SHAP (SHapley Additive exPlanations) values. SHAP values are based on principles from cooperative game theory, and they are useful when your predictions rely on exogenous variables.

SHAP values show you the contribution of each exogenous variable to the overall forecast. This allows you to understand the impact of each exogenous variable on TimeGPT’s predictions, providing deeper insights into the decision-making process. This is particularly valuable for model explainability and trust, especially in critical applications.

When to use SHAP values

Exogenous variables are external factors that are not a part of your time series data, but provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.

For example, if you’re forecasting ice cream sales, temperature data could serve as a useful exogenous variable. On hotter days, ice cream sales may increase and on colder days decrease. Exogenous variables are crucial in time series forecasting, because often your predictions are linked to these external factors, even more so than just past historical patterns.

SHAP values allow you to understand the impact of those exogenous variables. When you forecast with exogenous features, you can access the SHAP values for all series at each prediction step, and use the popular shap Python package to make different plots and explain the impact of the features.

An example

Let’s go through an example using the open-access EPF dataset. This dataset includes data from five different electricity markets, each with unique price dynamics, such as varying frequencies and occurrences of negative prices, zeros, and price spikes. Since electricity prices are influenced by exogenous factors, each dataset also contains two additional time series: day-ahead forecasts of two significant exogenous factors specific to each market.

For simplicity, we will focus on the Belgian electricity market (BE). This dataset includes hourly prices (y) and day-ahead forecasts of load (Exogenous1) and electricity generation (Exogenous2). It also includes one-hot encoded columns that indicate the day of the week: day_0 = 1 for a Monday, day_1 = 1 for a Tuesday, and so on.
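To make the day-of-week encoding concrete, here is a minimal sketch of how such columns could be built with pandas. The timestamps are hypothetical and the variable name frame is ours; the EPF dataset already ships with these columns, so this is only an illustration of the encoding.

```python
import pandas as pd

# Hypothetical hourly timestamps covering one full week (Monday to Sunday).
# The real dataset already contains the day_0 ... day_6 columns.
frame = pd.DataFrame({"ds": pd.date_range("2016-12-26", periods=7 * 24, freq="h")})

# pandas numbers weekdays from 0 (Monday) to 6 (Sunday),
# which lines up with the day_0 ... day_6 naming used in the article.
for day in range(7):
    frame[f"day_{day}"] = (frame["ds"].dt.dayofweek == day).astype(int)

day_columns = [f"day_{day}" for day in range(7)]
```

Each row has exactly one of the seven columns set to 1, which is what makes the encoding one-hot.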

A reproducible notebook with this tutorial is available here and you can get started right away by opening it in Colab.

An excerpt from the data is given below:

Table with the first 5 rows of the EPF dataset. Columns include unique_id, ds, y, Exogenous1, Exogenous2, day_0, day_1, day_2, day_3, day_4, day_5, day_6

In this example, we know the future values of the Exogenous1 and Exogenous2 variables. Often, however, you don’t know the future values of your exogenous variables. In that case, you can either (i) predict them using TimeGPT (see an example here), or (ii) use your exogenous variables as historical exogenous variables (see an example here).

We now have:

Historical data: for both our target y and our exogenous variables;
Future data: only for our exogenous variables.

In the example below, df contains our historical data and future_ex_vars_df contains our future exogenous variables.
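As a rough illustration of how the two dataframes relate, the sketch below splits one hypothetical frame into a historical part and a future part. The values are placeholders rather than EPF data, and the names full and cutoff are ours; only df, future_ex_vars_df, and the column names come from the article.

```python
import pandas as pd

h = 24  # forecast horizon in hours

# Placeholder frame: 72 hourly rows for one series. Values are NOT real data.
full = pd.DataFrame({
    "unique_id": "BE",
    "ds": pd.date_range("2016-12-29", periods=72, freq="h"),
    "y": [float(i % 24) for i in range(72)],
    "Exogenous1": [float(i) for i in range(72)],
})

# The first timestamp of the forecast horizon.
cutoff = full["ds"].iloc[-h]

# Historical data: target and exogenous variables.
df = full[full["ds"] < cutoff]

# Future data: exogenous variables only, one row per step in the horizon.
future_ex_vars_df = full[full["ds"] >= cutoff].drop(columns="y")
```

The point of the split is that future_ex_vars_df has no y column and starts right where df ends.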

We’ll use those together in TimeGPT to make our forecasts! We’re going to predict the future values of y, which are hourly electricity prices.

First, we install and import the nixtla library, then set up the client with an API key, which you can get from the TimeGPT dashboard.

Shell command:
pip install nixtla

Python code:
from nixtla import NixtlaClient
nixtla_client = NixtlaClient(api_key="my api key provided by nixtla")

Next, we call the forecast method, and to access the SHAP values, we also need to specify feature_contributions=True in the forecast method.

Python code. timegpt_fcst_ex_vars_df = nixtla_client.forecast(df=df, X_df=future_ex_vars_df, h=24, level=[80], feature_contributions=True)

Let’s explain what these variables mean:

df = our dataframe of historical data (the target y and the exogenous variables);
X_df = our dataframe of future exogenous variables;
h = our forecast horizon. Our data is hourly, so 24 means we forecast 24 hours ahead;
level = the level of the prediction intervals. A value of 80 means TimeGPT returns a lower and an upper bound within which the model expects 80% of the actual observed values to fall;
feature_contributions = set this to True so that the SHAP values are calculated.
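To illustrate what an 80% level means in practice, the sketch below computes how often observed values fall inside the lower and upper bounds. The numbers are made up for illustration and are not TimeGPT output; the frame name check is ours, while the column names follow the TimeGPT-lo-80 and TimeGPT-hi-80 pattern of the forecast output.

```python
import pandas as pd

# Made-up observations and interval bounds, purely for illustration.
check = pd.DataFrame({
    "y":             [50.0, 47.5, 55.0, 44.0, 49.0],
    "TimeGPT-lo-80": [48.0, 46.0, 47.0, 43.0, 45.0],
    "TimeGPT-hi-80": [53.0, 51.0, 52.0, 48.0, 50.0],
})

# True where the observed value lies within the interval (bounds included).
inside = check["y"].between(check["TimeGPT-lo-80"], check["TimeGPT-hi-80"])

# Share of observations covered by the interval.
coverage = inside.mean()
```

Once the actual values are known, a coverage that sits far from the requested level over many observations suggests the intervals are too narrow or too wide.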

This completes in a few seconds, and in just a few lines of code we’ve done our forecast and calculated the SHAP values!

Let’s look at the results of our forecast. The forecast for the target variable that we wanted to predict, hourly prices, is in the TimeGPT column. For example, we predicted an hourly price of 50.98 on 2018-12-24, with 80% prediction interval bounds of 48.7 (lower) and 53.2 (upper).

Table with the first 5 rows of the forecasting results. Columns include unique_id, ds, TimeGPT, TimeGPT-hi-80, TimeGPT-hi-90, TimeGPT-lo-80, TimeGPT-lo-90. The ds values are timestamps, from 2016-12-31 00:00 to 2016-12-31 04:00. TimeGPT results are 51.6322830 to 33.785370.

This seems reasonable. You wouldn’t expect big swings in price from one hour to the next, so it makes sense that the values fall in a similar range. We can also compare against the observed data from 2018-12-23: you would expect similar values there, which makes it another useful sanity check.

In our documentation, we provide methods to validate the forecasts in a more structured and comprehensive way.

SHAP values

We now return to the focus of this blog post: SHAP values. The SHAP values will allow us to understand the contribution of each exogenous variable to the forecast. This enables us to answer questions such as:

Do the exogenous variables have an impact on our forecasts?
Which exogenous variables have an impact on our forecasts and how much do they each contribute?

We can then extract the SHAP values as follows:

Python code. shap_df = nixtla_client.feature_contributions

This returns a DataFrame containing the SHAP values and base values for each series, at each step in the horizon. Let’s have a look:

Table with the first 5 rows of the EPF dataset SHAP values. Columns include unique_id, ds, y, Exogenous1, Exogenous2, day_0, day_1, day_2, day_3, day_4, day_5, day_6. Exogenous1 has the most impact with a range of 27.929638 to 33.785370. Then Exogenous2 with a range of -16.36360 to -20.619830.

In the dataframe, the SHAP values are contained in the exogenous variables columns (Exogenous1, Exogenous2 and day_0 to day_6), and in a `base_value` column. The base value is the prediction of the model if exogenous features were unknown.

Importantly, in every row the forecast from TimeGPT (in the TimeGPT column) equals the sum of the base value and the SHAP values of all exogenous features. This is a key property of SHAP values: the base value and the individual contributions of the exogenous variables add up to the overall forecast. As a result, we immediately get a measure of how much each exogenous variable contributes to the forecast.
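The additivity property can be checked directly. The sketch below uses a small hypothetical frame shaped like the feature-contribution output, with made-up values, and a helper of our own called is_additive. It assumes the frame holds a base_value column, a TimeGPT column, and one column per feature, as described above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the feature-contribution output.
# All numbers are made up for illustration.
shap_like = pd.DataFrame({
    "unique_id": ["BE", "BE"],
    "ds": pd.to_datetime(["2016-12-31 00:00", "2016-12-31 01:00"]),
    "Exogenous1": [27.9, 30.1],
    "Exogenous2": [-16.4, -17.0],
    "day_5": [-3.4, -3.2],
    "base_value": [43.5, 42.0],
    "TimeGPT": [51.6, 51.9],
})
feature_cols = ["Exogenous1", "Exogenous2", "day_5"]


def is_additive(frame, feature_cols, atol=1e-6):
    """Return True if base_value + sum of SHAP values matches the forecast in every row."""
    reconstructed = frame["base_value"] + frame[feature_cols].sum(axis=1)
    return bool(np.allclose(reconstructed, frame["TimeGPT"], atol=atol))


additive = is_additive(shap_like, feature_cols)
```

Running the same check on the real output is a quick way to confirm that you selected the right feature columns before plotting.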

It’s easier to get a sense of the contribution of these different exogenous variables if we plot them. We can use the shap package to make any plots that we want.

The following code creates a waterfall plot for the SHAP values of the first timestamp in our forecast (midnight on December 31, 2016):

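Python code:
import matplotlib.pyplot as plt
import shap
# Feature columns: everything except identifiers, the forecast, and the base value
shap_columns = shap_df.columns.difference(['unique_id', 'ds', 'TimeGPT', 'base_value'])
# Keep only the rows for the first forecast timestamp
selected_ds = shap_df['ds'].min()
filtered_df = shap_df[shap_df['ds'] == selected_ds]
# SHAP values and base value for that timestamp
shap_values = filtered_df[shap_columns].values.flatten()
base_value = filtered_df['base_value'].values[0]
features = list(shap_columns)
# Wrap everything in an Explanation object and draw the waterfall plot
shap_obj = shap.Explanation(values=shap_values, base_values=base_value, feature_names=features)
shap.plots.waterfall(shap_obj, show=False)
plt.title(f'Waterfall Plot: BE, date: {selected_ds}')
plt.show()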

Waterfall plot of SHAP values. Exogenous 1 has the widest bar with a value of +27.93. Exogenous2 has the next biggest bar with -16.36. Then day_5 with -3.41. Then day_1 with -1.88. Then day_6 with +1.1, day_4 with +0.42, day_3 with -0.3, day_0 with +0.08.

The x-axis shows the forecast value for this timestamp. At the bottom, we see E[f(X)], which represents the base_value (the predicted value if exogenous features were unknown).

Then, we see how each feature has impacted the final forecast. Features with a negative SHAP value, such as Exogenous2 and day_5, push the forecast to the left (smaller value). Features with a positive SHAP value, such as Exogenous1, day_4 and day_6, push it to the right (larger value).

Let’s think about this for a moment. When we described the dataset, we stated that Exogenous1 represents electricity load, whereas Exogenous2 represents electricity generation.

Exogenous1, the electricity load, adds positively to the overall prediction. This seems reasonable: if we expect a higher demand, we might expect the price to go up.
Exogenous2, on the other hand, adds negatively to the overall prediction. This seems reasonable too: if electricity generation is higher, we expect the price to be lower, which shows up as a negative contribution from Exogenous2.

At the top right, we see f(x) which is the final output of the model after considering the impact of the exogenous features. Notice that this value corresponds to the final prediction from TimeGPT.
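A waterfall plot explains a single timestamp. To get a rough ranking across the whole horizon, one common summary (our suggestion, not something the forecast output computes for you) is the mean absolute SHAP value per feature. The sketch below uses made-up contributions and a hypothetical frame named contrib.

```python
import pandas as pd

# Made-up SHAP values for three forecast steps, for illustration only.
contrib = pd.DataFrame({
    "Exogenous1": [27.9, 30.1, 29.0],
    "Exogenous2": [-16.4, -17.0, -18.2],
    "day_5": [-3.4, -3.2, -3.0],
    "day_6": [1.1, 0.9, 1.0],
})

# Average magnitude of each feature's contribution, largest first.
# Taking absolute values keeps positive and negative pushes from cancelling out.
importance = contrib.abs().mean().sort_values(ascending=False)
ranking = importance.index.tolist()
```

With the real output, you would select the feature columns of the SHAP DataFrame instead of building contrib by hand.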

SHAP values provide a powerful approach for better understanding and validating your predictions. The feature_contributions attribute in TimeGPT gives you access to all the information you need to explain the impact of exogenous features using the shap package. To enable it, you just need to add the feature_contributions=True argument to your forecast call. So, add it in, make some plots, and see what your exogenous variables are doing for you.