How to leverage XGBoost for Time Series Forecasting
XGBoost is an efficient implementation of gradient boosting for classification and regression problems.
It is both fast and effective, performing well, if not the best, on a broad array of predictive modelling tasks, and it is popular amongst data science contest winners, such as those on Kaggle.
XGBoost can also be leveraged for time series prediction, although it requires that the time series dataset first be converted into a supervised learning problem. It also requires a specialized strategy for assessing the model, referred to as walk-forward validation, as assessing the model with k-fold cross-validation would result in optimistically biased results.
In this guide, you will find out how to develop an XGBoost model for time series prediction.
After going through this guide, you will be aware of:
- XGBoost is an implementation of the gradient boosting ensemble algorithm for classification and regression.
- Time series datasets can be converted into supervised learning leveraging a sliding-window representation.
- How to fit, assess and make forecasts with an XGBoost model for time series prediction.
Tutorial Summarization
This tutorial is subdivided into three portions, which are:
1] XGBoost Ensemble
2] Time Series Data Prep
3] XGBoost for Time Series Forecasting
XGBoost Ensemble
XGBoost is short for Extreme Gradient Boosting and is an effective implementation of the stochastic gradient boosting machine learning algorithm.
The stochastic gradient boosting algorithm, also referred to as gradient boosting machines or tree boosting, is a potent machine learning strategy that performs well, or even best, on a broad array of challenging machine learning problems.
Tree boosting has been demonstrated to provide state-of-the-art outcomes on many conventional classification benchmarks.
It is an ensemble of decision trees algorithm where new trees correct the errors of trees that are already part of the model. Trees are added until no further improvements can be made to the model.
XGBoost furnishes a very effective implementation of the stochastic gradient boosting algorithm and access to a suite of model hyperparameters developed to furnish control over the model training process.
XGBoost is developed for classification and regression on tabular datasets, even though it can be leveraged for time series forecasting.
To start with, the XGBoost library must be installed.
You can set it up leveraging pip as follows:
sudo pip install xgboost
Once installed, you can confirm that it was installed successfully and that you are leveraging a recent version by running the following code:
# xgboost
import xgboost
print("xgboost", xgboost.__version__)
Running the code, you should observe the following version number or higher.
xgboost 1.0.1
Even though the XGBoost library has its own Python API, we can leverage XGBoost models with the scikit-learn API through the XGBRegressor wrapper class.
An instance of the model can be created and leveraged just like any other scikit-learn class for model evaluation. For instance:
…
# define model
model = XGBRegressor()
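As a minimal sketch, the wrapper can be fit and used to predict like any other scikit-learn regressor; the tiny dataset below is made up purely for illustration.

# a minimal sketch: XGBRegressor used like any other scikit-learn regressor
# (the toy data below is invented purely for illustration)
from numpy import asarray
from xgboost import XGBRegressor
X = asarray([[1, 2], [2, 3], [3, 4], [4, 5]])
y = asarray([3, 5, 7, 9])
model = XGBRegressor(objective='reg:squarederror', n_estimators=10)
model.fit(X, y)
print(model.predict(asarray([[5, 6]])))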
Now that we are acquainted with XGBoost, let’s look at how we can prep a time series dataset for supervised learning.
Time Series Data Preparation
Time series data can be framed as a supervised learning problem.
Given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem. We can do this by leveraging prior time steps as input variables and the next time step as the output variable.
Let's make this concrete with an example. Imagine we have a time series as follows.
time, measure
1, 100
2, 110
3, 108
4, 115
5, 120
We can restructure this time series dataset as a supervised learning problem by leveraging the value at the prior time step to forecast the value at the subsequent time-step.
Reorganizing the time series dataset in this manner, the data would appear as follows.
X, y
?, 100
100, 110
110, 108
108, 115
115, 120
120, ?
Note that the time column is dropped and a few rows of data are unusable for training a model, like the first and the last.
This representation is referred to as a sliding window, as the window of inputs and expected outputs is moved forward across time to develop new “samples” for a supervised learning model.
We can leverage the shift() function in Pandas to automatically develop new framings of time series problems provided the desired length of input and output sequences.
This would be a good tool as it would enable us to explore differing framings of a time series problem with machine learning algorithms to observe which might result in better-performing models.
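As a quick, hedged illustration of the idea (the column names X and y below are just labels for this example), applying shift() to the small five-value series from above reproduces the framing shown earlier.

# a minimal sketch of framing with shift(), using the small series above
from pandas import DataFrame
df = DataFrame({'measure': [100, 110, 108, 115, 120]})
# the prior time step becomes the input, the current time step the output
df['X'] = df['measure'].shift(1)
df['y'] = df['measure']
print(df[['X', 'y']])

The NaN in the first row corresponds to the unusable "?, 100" row in the table above and would be dropped before modelling.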
The function below will take a time series as a NumPy array with one or more columns and transform it into a supervised learning problem with the specified number of inputs and outputs.
# transform a time series dataset into a supervised learning dataset
from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values
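For instance, assuming the pandas imports above are in place, applying the function to the small five-value series with two lag inputs produces three usable training rows.

# a minimal sketch applying series_to_supervised() to the small series above
from numpy import asarray
values = asarray([100, 110, 108, 115, 120]).reshape(-1, 1)
print(series_to_supervised(values, n_in=2, n_out=1))
# expected output (two lag columns as inputs, last column as the output):
# [[100. 110. 108.]
#  [110. 108. 115.]
#  [108. 115. 120.]]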
We can leverage this function to prep a time series dataset for XGBoost.
For more on the step-by-step development of this function, see the tutorial.
After the dataset has been prepped, we must be meticulous in how it is leveraged to fit and assess a model.
For instance, it would not be valid to fit the model on data from the future and have it forecast the past. The model must receive training on the past and forecast the future.
This implies that strategies that randomize the dataset during evaluation, like k-fold cross-validation, cannot be leveraged. Rather, we must leverage a strategy referred to as walk-forward validation.
In walk-forward validation, the dataset is initially split into train and test sets by choosing a cut point, for example, all data except the last 12 days is leveraged for training and the last 12 days are leveraged for testing.
If we are interested in making a one-step forecast, for example, a single day, then we can assess the model by training on the training dataset and forecasting the first step in the test dataset. We can then add the real observation from the test set to the training dataset, refit the model, and have the model forecast the second step in the test dataset.
Repeating this procedure for the complete test dataset will provide one-step forecasts for the entire test dataset, from which an error measure can be calculated to assess the skill of the model.
For more on walk-forward validation, see the related tutorial.
The function below performs walk-forward validation.
It takes the complete supervised learning version of the time series dataset and the number of rows to leverage as the test set as arguments.
It then steps through the test set, calling the xgboost_forecast() function to make a single-step prediction. An error measure is calculated and the details are returned for analysis.
# walk-forward validation for univariate data
def walk_forward_validation(data, n_test):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # split test row into input and output columns
        testX, testy = test[i, :-1], test[i, -1]
        # fit model on history and make a prediction
        yhat = xgboost_forecast(history, testX)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
        # summarize progress
        print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
    # estimate prediction error
    error = mean_absolute_error(test[:, -1], predictions)
    return error, test[:, -1], predictions
The train_test_split() function is called upon to split the dataset into train and test sets.
We can define this function below.
# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test, :], data[-n_test:, :]
We can leverage the XGBRegressor class to make a one-step forecast.
The xgboost_forecast() function below implements this, taking the training dataset and test input row as input, fitting a model, and making a one-step prediction.
# fit an xgboost model and make a one step prediction
def xgboost_forecast(train, testX):
    # transform list into array
    train = asarray(train)
    # split into input and output columns
    trainX, trainy = train[:, :-1], train[:, -1]
    # fit model
    model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
    model.fit(trainX, trainy)
    # make a one-step prediction
    yhat = model.predict(asarray([testX]))
    return yhat[0]
Now that we know how to prep time series data for forecasting and how to evaluate an XGBoost model, we can look at leveraging XGBoost on a real dataset.
XGBoost for Time Series Forecasting
In this section, we will look into how to leverage XGBoost for time series forecasting.
We will leverage a conventional univariate time series dataset with the intent of leveraging the model to make a single-step prediction.
You can leverage the code in this portion of the blog as the beginning point in your own project and simply adapt it for multivariate inputs, multivariate forecasts, and multi-step forecasts.
We will leverage the daily female births dataset, that is, the number of female births recorded each day over the course of a single year.
You can download the dataset from here, place it in your present working directory with the filename “daily-total-female-births.csv”
Dataset (daily-total-female-births.csv)
Description (daily-total-female-births.csv)
The first few lines of the dataset appear as follows:
"Date","Births"
"1959-01-01",35
"1959-01-02",32
"1959-01-03",30
"1959-01-04",31
"1959-01-05",44
...
To start with, let’s load and plot the dataset.
The full instance is detailed below.
# load and plot the time series dataset
from pandas import read_csv
from matplotlib import pyplot
# load dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# plot dataset
pyplot.plot(values)
pyplot.show()
Running the instance develops a line plot of the dataset.
We can observe there is no obvious trend or seasonality.
A persistence model can achieve an MAE of approximately 6.7 births when forecasting the last 12 days.
This furnishes a baseline in performance above which a model might be considered skilful.
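If you wish to reproduce that baseline yourself, the hedged sketch below computes the persistence forecast (each day forecasted with the prior day's value) over the last 12 days; it should produce an MAE close to the figure quoted above.

# a minimal sketch of the persistence (naive) baseline over the last 12 days
from pandas import read_csv
from sklearn.metrics import mean_absolute_error
# load dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values[:, 0]
# forecast each of the last 12 days with the value observed the day before
print('Persistence MAE: %.3f' % mean_absolute_error(values[-12:], values[-13:-1]))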
Then, we can assess the XGBoost model on the dataset when making one-step forecasts for the last 12 days of data.
We will leverage just the prior six time steps as input to the model and default model hyperparameters, except we will alter the loss to "reg:squarederror" (to avoid a warning message) and leverage 1,000 trees in the ensemble (to avoid underlearning).
The full example is detailed below.
# forecast daily births with xgboost
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
from matplotlib import pyplot
# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values
# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test, :], data[-n_test:, :]
# fit an xgboost model and make a one step prediction
def xgboost_forecast(train, testX):
    # transform list into array
    train = asarray(train)
    # split into input and output columns
    trainX, trainy = train[:, :-1], train[:, -1]
    # fit model
    model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
    model.fit(trainX, trainy)
    # make a one-step prediction
    yhat = model.predict(asarray([testX]))
    return yhat[0]
# walk-forward validation for univariate data
def walk_forward_validation(data, n_test):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # split test row into input and output columns
        testX, testy = test[i, :-1], test[i, -1]
        # fit model on history and make a prediction
        yhat = xgboost_forecast(history, testX)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
        # summarize progress
        print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
    # estimate prediction error
    error = mean_absolute_error(test[:, -1], predictions)
    return error, test[:, -1], predictions
# load the dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=6)
# evaluate
mae, y, yhat = walk_forward_validation(data, 12)
print('MAE: %.3f' % mae)
# plot expected vs predicted
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
pyplot.show()
Running the instance reports the expected and forecasted values for every step in the test set, then the MAE for all forecasted values.
Your outcomes might vary given the stochastic nature of the algorithm or assessment procedure, or differences in numerical precision. Consider executing the instance a few times and comparing the average outcome.
We can observe that the model performs better than a persistence model, achieving an MAE of approximately 5.9 births compared to 6.7 births.
Can you do better?
You can evaluate XGBoost hyperparameters and numbers of time steps as input to observe if you can accomplish improved performance. Share your outcomes in the comments below.
>expected=42.0, predicted=44.5
>expected=53.0, predicted=42.5
>expected=39.0, predicted=40.3
>expected=40.0, predicted=32.5
>expected=38.0, predicted=41.1
>expected=44.0, predicted=45.3
>expected=34.0, predicted=40.2
>expected=37.0, predicted=35.0
>expected=52.0, predicted=32.5
>expected=48.0, predicted=41.4
>expected=55.0, predicted=46.6
>expected=50.0, predicted=47.2
MAE: 5.957
A line plot is developed contrasting the series of expected values and forecasted values for the last 12 days of the dataset.
This gives a geometric interpretation of how well the model performed on the test set.
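As a starting point for the experimentation suggested above, the hedged sketch below reuses series_to_supervised(), walk_forward_validation(), and the loaded values array from the full example to compare a few input window sizes; the window sizes tried here are arbitrary.

# a minimal sketch comparing input window sizes, reusing the functions
# and the loaded values array from the full example above
for n_in in (3, 6, 12):
    data = series_to_supervised(values, n_in=n_in)
    mae, _, _ = walk_forward_validation(data, 12)
    print('n_in=%d, MAE: %.3f' % (n_in, mae))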
After a final XGBoost model configuration is selected, a model can be finalized and leveraged to make a forecast on fresh data.
This is referred to as an out-of-sample forecast, for example, forecasting beyond the training dataset. This is identical to making a forecast during the assessment of the model, as we always wish to assess a model leveraging the same process that we expect to leverage when the model makes forecasts on new data.
The instance below illustrates fitting a final XGBoost model on all available data and making a single-step forecast beyond the end of the dataset.
# finalize model and make a prediction for daily births with xgboost
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from xgboost import XGBRegressor
# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values
# load the dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# transform the time series data into supervised learning
train = series_to_supervised(values, n_in=6)
# split into input and output columns
trainX, trainy = train[:, :-1], train[:, -1]
# fit model
model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
model.fit(trainX, trainy)
# construct an input for a new prediction
row = values[-6:].flatten()
# make a one-step prediction
yhat = model.predict(asarray([row]))
print('Input: %s, Predicted: %.3f' % (row, yhat[0]))
Running the instance fits the XGBoost model on all available data.
A new row of input is prepped leveraging the last six days of known data, and the next day beyond the end of the dataset is forecasted.
Input: [34 37 52 48 55 50], Predicted: 42.708
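If you want to extend this to multi-step forecasting, one common approach (though not the only one) is a recursive strategy, where each prediction is fed back in as the newest observation. The hedged sketch below continues from the final example above and is only an illustration of that idea.

# a minimal sketch of recursive multi-step forecasting, reusing the fitted
# model and the values array from the final example above
from numpy import asarray
history = list(values[-6:].flatten())
forecasts = list()
for _ in range(3):  # forecast three steps beyond the end of the data
    # predict the next value from the most recent six values
    yhat = model.predict(asarray([history[-6:]]))[0]
    forecasts.append(yhat)
    # feed the prediction back in as the newest "observation"
    history.append(yhat)
print('Recursive 3-step forecast: %s' % forecasts)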
Further Reading
This section furnishes additional resources on the subject if you are seeking to delve deeper.
Conclusion
In this guide, you found out how to develop an XGBoost model for time series forecasting.
Particularly, you learned:
- XGBoost is an implementation of the gradient boosting ensemble algorithm for classification and regression.
- Time series datasets can be transformed into supervised learning leveraging a sliding-window representation.
- How to fit, assess, and make forecasts with an XGBoost model for time series forecasting.