With the rapid proliferation of smart electricity meters and the widespread adoption of electricity generation tech such as solar panels, there is a treasure trove of electricity usage data at our disposal today.

This data represents a multivariate time series of power-related variables, which in turn could be leveraged to model and even predict future electricity consumption.

In this guide, you will learn about a household power consumption dataset for multi-step time series predictions and how to better comprehend the raw data leveraging exploratory analysis.

After going through this guide, you will be aware of:

• The household power consumption dataset that details electricity utilization for a singular household over the course of four years.
• How to look into and comprehend the dataset leveraging a suite of line plots for the series data and histograms for the data distributions.
• How to leverage the new comprehension of the problem to consider differing framings of the forecasting problem, ways in which the data might be prepped, and modelling strategies that might be leveraged.

Tutorial Summarization

This guide is subdivided into five portions, which are:

1. Power Consumption Dataset
2. Loading and Preparing the Dataset
3. Patterns in Observations Over Time
4. Time Series Data Distributions
5. Ideas on Modelling

Household Power Consumption Dataset

The Household Power Consumption Dataset is a multivariate time series dataset that details the power consumption for a singular household across four years.

The information was gathered in the duration between December 2006 and November 2010 and observations of power utilization within the house were accumulated each minute.

It is a multivariate series consisting of seven variables (aside from the date and time), which are:

• global_active_power: The cumulative active power consumed by the household (kilowatts)
• global_reactive_power: The cumulative reactive power consumed by the household (kilowatts)
• voltage: Mean voltage (volts)
• global_intensity: Average current intensity (amps)
• sub_metering_1: Active energy for kitchen (watt-hours of active energy)
• sub_metering_2: Active energy for laundry (watt-hours of active energy)
• sub_metering_3: Active energy for climate control systems (watt-hours of active energy)

Active and reactive energy are a reference to the technical details of alternating current.

Generally speaking, the active energy is the real power utilized by the household, while the reactive energy is the unused power present in the lines.

We can observe that the dataset furnishes the active power in addition to some division of the active power by main circuit in the home, particularly the kitchen, laundry, and climate control. These are not all the circuits in the household.

The remaining watt-hours can be calculated from the active energy by first converting the active energy to watt-hours and then subtracting the other sub-metered active energy in watt-hours, as follows:

sub_metering_remainder = (global_active_power * 1000 / 60) - (sub_metering_1 + sub_metering_2 + sub_metering_3)
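We can sanity-check this formula with the values from the first observation in the raw file (4.216 kW of active power; sub-metered readings of 0, 1, and 17 watt-hours):

```python
# sanity check of the remainder formula using the first raw observation
global_active_power = 4.216   # kilowatts, observed over one minute
sub_metering_1 = 0.0          # watt-hours
sub_metering_2 = 1.0
sub_metering_3 = 17.0

# convert kilowatts over one minute to watt-hours, then subtract the
# sub-metered active energy
sub_metering_remainder = (global_active_power * 1000 / 60) - (
    sub_metering_1 + sub_metering_2 + sub_metering_3)
print(round(sub_metering_remainder, 2))  # -> 52.27
```

This matches the sub_metering_4 value of roughly 52.27 watt-hours computed for the same row further down in the guide.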

The dataset appears to have been furnished with no seminal reference paper.

Nonetheless, this dataset has become a standard for assessing time series forecasting and ML strategies for multi-step forecasting, particularly for predicting active power. Further, it is not clear if the other features in the dataset might furnish advantages to a model in forecasting active power.

The dataset can be obtained from the UCI Machine Learning repository as a singular 20 MB .zip file.

Get the dataset and unzip it into your present working directory. You will then have the file “household_power_consumption.txt”, which is approximately 127 MB in size and contains all of the observations.

Go through the data file.

Listed below are the initial five rows of data (and the header) from the raw data file.

Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000
16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000
16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000
…

We can observe that the data columns are demarcated by semicolons (';').

The data is reported to possess a single row for every minute in the time period.

The data does have absent values; for instance, we can observe 2-3 days' worth of absent data around 28/04/2007.

 1234567 …28/4/2007;00:20:00;0.492;0.208;236.240;2.200;0.000;0.000;0.00028/4/2007;00:21:00;?;?;?;?;?;?;28/4/2007;00:22:00;?;?;?;?;?;?;28/4/2007;00:23:00;?;?;?;?;?;?;28/4/2007;00:24:00;?;?;?;?;?;?;…

We can begin by loading the data file as a Pandas DataFrame and summarizing the loaded data.

It is simple to load the data with the read_csv() function, but a tad tricky to load it in the right way.

Particularly, we require to do a few custom things:

• Specify the separator between columns as a semicolon (sep=';')
• Specify that line 0 has the names for the columns (header=0)
• Specify that we possess tons of RAM to avoid a warning that we are loading the data as an array of objects rather than an array of numbers, due to the '?' values for absent data (low_memory=False)
• Specify that it is okay for Pandas to attempt to infer the date-time format during the parsing of dates, which is way quicker (infer_datetime_format=True)
• Specify that we would like to parse the date and time columns together as a fresh column referred to as 'datetime' (parse_dates={'datetime': [0, 1]})
• Specify that we would wish for our new 'datetime' column to be the index for the DataFrame (index_col=['datetime'])

Bringing all of this together, we can now load the data and summarize the loaded shape and initial few rows.
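A sketch of what that call looks like is below, demonstrated on a tiny inline sample in the raw file's format so that the snippet is self-contained; in practice you would pass the path to 'household_power_consumption.txt' instead. Note that recent versions of Pandas infer the date-time format automatically, so the infer_datetime_format option is only needed on older versions and is left out here.

```python
from io import StringIO
from pandas import read_csv

# a small inline sample in the raw file's format, for illustration only
sample = StringIO(
    'Date;Time;Global_active_power;Global_reactive_power;Voltage;'
    'Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3\n'
    '16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000\n'
    '16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000\n'
)
# the custom options described above: semicolon separator, header row,
# low_memory disabled, and the date and time columns combined into a
# 'datetime' index
dataset = read_csv(sample, sep=';', header=0, low_memory=False,
                   parse_dates={'datetime': [0, 1]}, index_col=['datetime'])
print(dataset.shape)  # (2, 7)
```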

Then, we can mark all absent values signified with the '?' character with a NaN value, which is a float.

This will enable us to operate with the data as a single array of floating point values instead of mixed variants, which has reduced efficiency.

# mark all missing values
dataset.replace('?', nan, inplace=True)

Now we can develop a fresh column that consists of the remainder of the sub-metering, leveraging the calculation from the prior section.

# add a column for the remainder of sub metering
values = dataset.values.astype('float32')
dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])

We can now save the cleansed version of the dataset to a fresh file; in this scenario we will just modify the file extension to .csv and save the dataset as 'household_power_consumption.csv'.

# save updated dataset
dataset.to_csv('household_power_consumption.csv')

To confirm that we have not messed up, we can re-load the dataset and summarize the initial five rows.

Connecting all of this together, the full instance of loading, cleaning-up, and saving the dataset is detailed below.

# load and clean up the power usage data
from numpy import nan
from pandas import read_csv
# load all data
dataset = read_csv('household_power_consumption.txt', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0,1]}, index_col=['datetime'])
# summarize
print(dataset.shape)
print(dataset.head())
# mark all missing values
dataset.replace('?', nan, inplace=True)
# add a column for the remainder of sub metering
values = dataset.values.astype('float32')
dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])
# save updated dataset
dataset.to_csv('household_power_consumption.csv')
# load the new dataset and summarize
dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
print(dataset.head())

Running the instance first loads the raw data and summarizes the shape and initial five rows of the loaded data.

(2075259, 7)
                     Global_active_power  ...  Sub_metering_3
datetime                                  ...
2006-12-16 17:24:00                4.216  ...            17.0
2006-12-16 17:25:00                5.360  ...            16.0
2006-12-16 17:26:00                5.374  ...            17.0
2006-12-16 17:27:00                5.388  ...            17.0
2006-12-16 17:28:00                3.666  ...            17.0

The dataset is then cleansed and saved to a fresh file.

We load this fresh file and again print the initial five rows, displaying the removal of the date and time columns and addition of the new sub-metered column.

                     Global_active_power  ...  sub_metering_4
datetime                                  ...
2006-12-16 17:24:00                4.216  ...       52.266670
2006-12-16 17:25:00                5.360  ...       72.333336
2006-12-16 17:26:00                5.374  ...       70.566666
2006-12-16 17:27:00                5.388  ...       71.800000
2006-12-16 17:28:00                3.666  ...       43.100000

We can look inside the new ‘household_power_consumption.csv’ file and check that the absent observations are indicated with an empty column, which pandas will correctly read as NaN, for instance around row 190,499:

…
2007-04-28 00:20:00,0.492,0.208,236.240,2.200,0.000,0.000,0.0,8.2
2007-04-28 00:21:00,,,,,,,,
2007-04-28 00:22:00,,,,,,,,
2007-04-28 00:23:00,,,,,,,,
2007-04-28 00:24:00,,,,,,,,
2007-04-28 00:25:00,,,,,,,,
…

Now that we possess a cleansed variant of the dataset, we can investigate it further leveraging visualizations.

Patterns in Observations Over Time

The data is a multivariate time series and the ideal way to comprehend a time series is to develop line plots.

We can begin by developing an independent line plot for every one of the eight variables.

The full instance is detailed below.

# line plots
from pandas import read_csv
from matplotlib import pyplot
# load the new file
dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# line plot for each variable
pyplot.figure()
for i in range(len(dataset.columns)):
    pyplot.subplot(len(dataset.columns), 1, i+1)
    name = dataset.columns[i]
    pyplot.plot(dataset[name])
    pyplot.title(name, y=0)
pyplot.show()

Running the instance develops a singular image with eight subplots, one for every variable.

This provides us a really high-level view of the four years of single-minute observations. We can observe that something fascinating was going on in ‘Sub_metering_3’ (environmental control) that might not directly map to hot or cold years. Probably new systems were installed.

Fascinatingly, the contribution of ‘sub_metering_4’ appears to reduce with time, or display a downward trend, probably correlating with the steady increase seen towards the conclusion of the series for ‘Sub_metering_3’.

These observations do reinforce the requirement to honour the temporal ordering of subsequences of this information during fitting and assessing any model.

We might be able to observe the wave of a seasonal impact in the ‘Global_active_power’ and some other variates.

There is some spiky usage that might correlate with a particular period, like weekends.

We can zoom in and concentrate on the ‘Global_active_power’, or ‘active power’ for short.

We can develop a fresh plot of the active power for every year to observe if there are any common patterns throughout the years. The starting year, 2006, has less than a single month of data, so we will delete it from the plot.

The full instance is detailed below.

# yearly line plots
from pandas import read_csv
from matplotlib import pyplot
# load the new file
dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# plot active power for each year
years = ['2007', '2008', '2009', '2010']
pyplot.figure()
for i in range(len(years)):
    # prepare subplot
    ax = pyplot.subplot(len(years), 1, i+1)
    # determine the year to plot
    year = years[i]
    # get all observations for the year
    result = dataset.loc[str(year)]
    # plot the active power for the year
    pyplot.plot(result['Global_active_power'])
    # add a title to the subplot
    pyplot.title(str(year), y=0, loc='left')
pyplot.show()

Running the instance develops one singular image with four line plots, one for every full year (or mostly complete years) of data within the dataset.

We can observe some common gross patterns throughout the years, like around Feb-Mar and around Aug-Sept where we observe a marked reduction in consumption.

We also seem to observe a downward trend during the summer months (middle of the year in the Northern Hemisphere) and probably more consumption in the winter months towards the edges of the plots. These may demonstrate an annual seasonal pattern in consumption.

We can additionally observe a few patches of absent data in at least the first, third, and fourth plots.

We can continue to zoom in on consumption and observe active power for every one of the entirety of 2007.

This might assist tease out gross structures throughout the months, like daily and weekly patterns.

The full instance is detailed below.

# monthly line plots
from pandas import read_csv
from matplotlib import pyplot
# load the new file
dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# plot active power for each month
months = [x for x in range(1, 13)]
pyplot.figure()
for i in range(len(months)):
    # prepare subplot
    ax = pyplot.subplot(len(months), 1, i+1)
    # determine the month to plot
    month = '2007-' + str(months[i])
    # get all observations for the month
    result = dataset.loc[month]
    # plot the active power for the month
    pyplot.plot(result['Global_active_power'])
    # add a title to the subplot
    pyplot.title(month, y=0, loc='left')
pyplot.show()

Executing the instance develops a singular image with twelve line plots, a single one for every month in 2007.

We can observe the sine wave of power utilization over the days within every month. This is good as we would expect some variant of everyday pattern in power utilization.

We can observe that there are stretches of data with very little consumption, like in August and in April. These might indicate vacation periods where the home was unoccupied and where power consumption was minimal.

Lastly, we can zoom in on one additional level and take a closer look at power utilization at the everyday level.

We would expect there to be some pattern to consumption every day, and probably variations in days across a week.

The full instance is detailed below.

# daily line plots
from pandas import read_csv
from matplotlib import pyplot
# load the new file
dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# plot active power for each day
days = [x for x in range(1, 20)]
pyplot.figure()
for i in range(len(days)):
    # prepare subplot
    ax = pyplot.subplot(len(days), 1, i+1)
    # determine the day to plot
    day = '2007-01-' + str(days[i])
    # get all observations for the day
    result = dataset.loc[day]
    # plot the active power for the day
    pyplot.plot(result['Global_active_power'])
    # add a title to the subplot
    pyplot.title(day, y=0, loc='left')
pyplot.show()

Executing the instance develops a singular image with 19 line plots, one for each of the first 19 days in January 2007.

There is commonality across the days; for instance, on several days consumption begins in the early morning, at around 6-7 AM.

A few days display a drop in consumption during mid-day, which would make sense if most occupants are out of the house.

We do observe some strong overnight consumption on a few days, which, in a northern hemisphere January, might match up with a heating system being leveraged.

Time of year, particularly the season and the weather that it brings, will be a critical factor in modelling this information, as would be expected.

Time Series Data Distributions

Another critical area of consideration is the distribution of the variables.

For instance, it might be interesting to be aware if the distributions of observations are Gaussian or some other variant of distribution.

We can look into the distributions of the data by reviewing histograms.

We can begin by developing a histogram for every variable in the time series.

The full instance is detailed below.

# histogram plots
from pandas import read_csv
from matplotlib import pyplot
# load the new file
dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# histogram plot for each variable
pyplot.figure()
for i in range(len(dataset.columns)):
    pyplot.subplot(len(dataset.columns), 1, i+1)
    name = dataset.columns[i]
    dataset[name].hist(bins=100)
    pyplot.title(name, y=0)
pyplot.show()

Executing the instance develops a singular figure with an independent histogram for each one of the eight variables.

We can observe that active and reactive power, intensity, in addition to the sub-metered power are all skewed distributions down towards small watt-hour or kilowatt values.

We can additionally observe that the distribution of voltage data is strongly Gaussian.

The distribution of active power seems to be bi-modal, implying that it has two mean groups of observations.

We can look into this further by observing the distribution of active power consumption for the four complete years of data.

The full instance is detailed below.

# yearly histogram plots
from pandas import read_csv
from matplotlib import pyplot
# load the new file
dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# plot active power for each year
years = ['2007', '2008', '2009', '2010']
pyplot.figure()
for i in range(len(years)):
    # prepare subplot
    ax = pyplot.subplot(len(years), 1, i+1)
    # determine the year to plot
    year = years[i]
    # get all observations for the year
    result = dataset.loc[str(year)]
    # plot the active power for the year
    result['Global_active_power'].hist(bins=100)
    # zoom in on the distribution
    ax.set_xlim(0, 5)
    # add a title to the subplot
    pyplot.title(str(year), y=0, loc='right')
pyplot.show()

Executing the instance develops a singular plot with four figures, one for every one of the years from 2007 to 2010.

We can observe that the distribution of active power consumption across those years appears very similar. The distribution is indeed bimodal with a single peak around 0.3 kW and probably another around 1.3 kW.

There is a long tail on the distribution to higher kilowatt values. It could open the door to notions of discretizing the data and separating it into peak 1, peak 2 or long tail. These groups or clusters for utilization on a day or hour might be beneficial in generating a predictive model.
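As an illustrative sketch of that discretization notion with Pandas, leveraging the cut() function (the 0.8 and 2.0 kW cut points and the readings below are hypothetical, not derived from the dataset):

```python
import pandas as pd

# hypothetical active power readings in kilowatts
power = pd.Series([0.3, 0.4, 1.2, 1.4, 3.5, 0.2, 4.8])

# bucket each reading into peak 1, peak 2, or the long tail; the cut
# points here are illustrative only
groups = pd.cut(power, bins=[0.0, 0.8, 2.0, float('inf')],
                labels=['peak_1', 'peak_2', 'long_tail'])
print(groups.value_counts().to_dict())
```

Counts of such group labels per hour or per day could then serve as inputs to a predictive model.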

It is feasible that the identified groups might vary over the seasons of the year.

We can look into this by looking at the distribution for active power for every month in a year.

The full instance is detailed below.

# monthly histogram plots
from pandas import read_csv
from matplotlib import pyplot
# load the new file
dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# plot active power for each month
months = [x for x in range(1, 13)]
pyplot.figure()
for i in range(len(months)):
    # prepare subplot
    ax = pyplot.subplot(len(months), 1, i+1)
    # determine the month to plot
    month = '2007-' + str(months[i])
    # get all observations for the month
    result = dataset.loc[month]
    # plot the active power for the month
    result['Global_active_power'].hist(bins=100)
    # zoom in on the distribution
    ax.set_xlim(0, 5)
    # add a title to the subplot
    pyplot.title(month, y=0, loc='right')
pyplot.show()

Executing the instance develops an image with 12 plots, one for every month in 2007.

We can observe generally the same data distribution every month. The axes for the plots seem to align (given the similar scales), and we can observe that the peaks are shifted down in the warmer northern hemisphere months and moved up for the colder months.

We can additionally observe a thicker or more prominent tail toward larger kilowatt values for the cooler months of December through to March.

Ideas on Modelling

Now that we are aware of how to load and explore the dataset, we can put forth some ideas on how to model the dataset.

In this portion of the blog, we will take a deeper look at three primary areas when working with the data; they are:

• Problem Framing
• Data Preparation
• Modelling Methods

Problem Framing

There does not seem to be a seminal publication for the dataset to illustrate the intended way to frame the data in a predictive modelling problem.

We are therefore left to guess at potentially useful ways that this data might be leveraged.

The data is just for a singular household, but probably effective modelling strategies could be generalized across to similar households.

Probably the most useful framing of the dataset is to predict an interval of future active power consumption.

Some instances include:

• Forecast hourly consumption for the subsequent day
• Forecast daily consumption for the next week
• Forecast daily consumption for the next month
• Forecast monthly consumption for the next year

Generally speaking, these variants of forecasting problems are referred to as multi-step forecasting. Models that leverage all of the variables might be referred to as multivariate multi-step forecasting models.

None of these models is restricted to predicting the minute-level data; rather, each could model the problem at or below the selected forecast resolution.
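Downsampling the minute data to the selected resolution is straightforward with the resample() function in Pandas; a minimal sketch on a synthetic stand-in for the minute series (the values here are random and purely illustrative):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for two days of minute-level active power readings
index = pd.date_range('2007-01-01', periods=2 * 24 * 60, freq='min')
power = pd.Series(np.random.rand(len(index)), index=index,
                  name='Global_active_power')

# total consumption at the chosen forecast resolution
hourly = power.resample('h').sum()
daily = power.resample('D').sum()
print(hourly.shape, daily.shape)  # (48,) (2,)
```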

Forecasting consumption at scale, in turn, could assist a utility company in forecasting demand, which is a broadly studied and critical problem.

Data Prep

There is a ton of flexibility in prepping this data for modelling purposes.

The particular data prep strategies and their advantages are really dependent on the selected framing of the problem and the modelling strategies. Nonetheless, here is a listing of general data prep strategies that might be beneficial.

• Daily differencing might be useful to adjust for the everyday cycle in the data.
• Yearly differencing might be useful to adjust for any annual cycle in the data.
• Normalization might assist in reducing the variables with differing units to the same scale.
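A minimal sketch of two of these strategies, here illustrated with seasonal differencing over a weekly cycle and min-max normalization on a synthetic daily series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# synthetic daily consumption with a repeating weekly cycle (illustrative)
index = pd.date_range('2007-01-01', periods=28, freq='D')
series = pd.Series(np.tile([3.0, 4.0, 4.0, 4.0, 4.0, 6.0, 7.0], 4),
                   index=index)

# seasonal differencing: subtract the value from the same day a week earlier
weekly_diff = series.diff(7).dropna()

# min-max normalization to the [0, 1] range
normalized = (series - series.min()) / (series.max() - series.min())
print(weekly_diff.abs().max(), normalized.min(), normalized.max())
```

Because the synthetic cycle repeats exactly, the weekly difference is zero everywhere; on real data the differenced series would carry whatever signal remains after the cycle is removed.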

There are several simplistic human factors that might be helpful in engineering features from the data and that might make particular days simpler to predict.

Some instances include:

• Indicating the time of day, to account for the likelihood of individuals being present at home or not.
• Indicating if a day is a weekday or a weekend.
• Indicating if a day is a public holiday or not.

These factors might be considerably less important for predicting monthly data, and probably to a degree for weekly data.

More generalized features might include:

• Indicating the season, which might lead to the type or amount of environmental control systems being leveraged.
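A sketch of engineering such calendar features from a datetime index with Pandas (the index here is a hypothetical stretch of hourly timestamps starting on a Friday):

```python
import pandas as pd

# a hypothetical stretch of 72 hourly timestamps starting on a Friday
index = pd.date_range('2007-01-05', periods=72, freq='h')
features = pd.DataFrame(index=index)

# calendar features of the kind described above
features['hour'] = index.hour                    # time of day
features['is_weekend'] = index.dayofweek >= 5    # Saturday or Sunday
features['month'] = index.month                  # a crude proxy for season
print(features['is_weekend'].sum())  # 48 weekend hours out of 72
```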

Modelling Strategies

There are probably four categories of methods that might be interesting to explore on this issue, which are:

• Naïve methods
• Classical Linear methods
• Machine Learning methods
• Deep Learning methods

Naïve Methods

Naïve methods would include strategies that make really simple, but usually very effective assumptions.

Some instances include:

• Tomorrow will be the same as today.
• Tomorrow will be the same as this day last year.
• Tomorrow will be an average of the last few days.
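These strategies need barely any code; a sketch of the first and third on a made-up daily series:

```python
# illustrative daily consumption history (made-up values)
history = [10.0, 12.0, 11.0, 13.0, 12.5]

# "tomorrow will be the same as today": carry the last observation forward
persistence_forecast = history[-1]

# "tomorrow will be an average of the last few days": here the last three
window = 3
average_forecast = sum(history[-window:]) / window

print(persistence_forecast)        # 12.5
print(round(average_forecast, 2))  # 12.17
```

Simple as they are, such forecasts make useful baselines that any more sophisticated model must beat.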

Classical Linear Methods

Classical linear methods include strategies that are really efficient for univariate time series forecasting.

Two critical instances include:

• SARIMA
• ETS (triple exponential smoothing)

They would require that the extra variables be discarded and that the parameters of the model be configured or tuned to the particular framing of the dataset. Adjusting the data for everyday and seasonal structures can also be supported directly by these methods.
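In practice these models would come from a library such as statsmodels; to give a flavour of the underlying mechanics, here is a minimal hand-rolled single exponential smoothing pass, the simplest building block of ETS (the series and smoothing factor are illustrative):

```python
# single (simple) exponential smoothing: each smoothed value blends the
# new observation with the previous level, weighted by alpha
def simple_exponential_smoothing(series, alpha):
    level = series[0]
    smoothed = [level]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
        smoothed.append(level)
    return smoothed

series = [10.0, 12.0, 11.0, 13.0]
print(simple_exponential_smoothing(series, alpha=0.5))
# -> [10.0, 11.0, 11.0, 12.0]
```

ETS extends this idea with additional trend and seasonal components; SARIMA instead models the series with autoregressive and moving-average terms plus seasonal differencing.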

Machine Learning Methods

Machine learning methods require that the problem be framed as a supervised learning problem.

This would require that lag observations for a series be framed as input features, discarding the temporal relationship in the information.
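A minimal sketch of this sliding-window framing (the helper name and toy series are our own, for illustration):

```python
# frame a series as a supervised learning problem: each row's inputs are
# the previous n_lags observations, and the target is the next value
def make_supervised(series, n_lags):
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i - n_lags:i])
        y.append(series[i])
    return X, y

series = [1, 2, 3, 4, 5, 6]
X, y = make_supervised(series, n_lags=3)
print(X)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(y)  # [4, 5, 6]
```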

A suite of nonlinear and ensemble strategies could be looked into, which include:

• k-nearest neighbours
• Support vector machines
• Decision trees
• Random forest

Meticulous attention is needed to make sure that the fitting and assessment of these models preserves the temporal structure in the data. This is critical so that the strategy is not able to ‘cheat’ by leveraging observations from the future.
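The simplest way to honour that ordering is a time-ordered split rather than a random shuffle; a sketch (the helper name and split fraction are illustrative):

```python
# time-ordered train/test split: earlier observations train the model,
# later ones evaluate it, so no future information leaks into training
def temporal_split(series, train_fraction=0.8):
    split = int(len(series) * train_fraction)
    return series[:split], series[split:]

series = list(range(10))
train, test = temporal_split(series)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```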

These strategies are usually agnostic to large numbers of variables and might assist in teasing out whether the extra variables can be leveraged and add value to predictive models.

Deep Learning Models

Typically, neural networks have not proven really effective at autoregression-type problems.

Nonetheless, strategies like convolutional neural networks are able to automatically learn complicated features from raw data, which includes one-dimensional signal data. And recurrent neural networks, like the long short-term memory network, are capable of directly learning across several parallel sequences of input data.

Further, combinations of these strategies, like the CNN-LSTM and ConvLSTM, have proven effective on time series classification tasks.

It is feasible that these strategies might be able to leverage the massive volume of minute-based data and several input variables.

Conclusion

In this guide, you found out about a household power utilization dataset for multi-step time series forecasting and how to better comprehend the raw data leveraging exploratory analysis.