Deep learning frameworks for human activity identification

Human activity recognition, or HAR for short, is a challenging time series classification task. It involves predicting the movement of a person based on sensor data, and traditionally requires deep domain expertise and signal processing methods to engineer features from the raw data before a machine learning model can be fit. Recently, deep learning methods such as convolutional neural networks and recurrent neural networks have proven capable, and even achieve state-of-the-art results, by automatically learning features from the raw sensor data.

In this blog article, you will discover the problem of human activity recognition and the deep learning methods that are achieving state-of-the-art performance on it.

After going through this post, you will know:

  • Activity recognition is the problem of predicting the movement of a person, often indoors, based on sensor data, such as an accelerometer in a smartphone.
  • Streams of sensor data are often split into subsequences called windows, and each window is associated with a broader activity; this is called a sliding window approach.
  • Convolutional neural networks and long short-term memory networks, and perhaps both together, are well suited to learning features from raw sensor data and predicting the associated movement.

Tutorial Overview

This article is divided into five parts:

  1. Human Activity Recognition
  2. Advantages of Neural Network Modelling
  3. Supervised Learning Data Representation
  4. Convolutional Neural Network Models
  5. Recurrent Neural Network Models

Human Activity Recognition

Human activity recognition, abbreviated as HAR, is a broad field of study concerned with identifying the specific movement or action of a person based on sensor data.

Movements are often typical activities performed indoors, such as walking, talking, standing, and sitting. They may also be more focused activities, such as those performed in a kitchen or on a factory floor.

The sensor data may be recorded remotely, such as by video, radar, or other wireless methods. Alternatively, data may be recorded directly on the subject, such as by carrying custom hardware or smartphones that have accelerometers and gyroscopes.

Sensor-based activity recognition seeks profound, high-level knowledge about human activities from multitudes of low-level sensor readings.

Historically, sensor data for activity recognition was challenging and expensive to collect, requiring custom hardware. Now smartphones and other personal tracking devices used for fitness and health monitoring are cheap and ubiquitous. As such, sensor data from these devices is cheaper to collect, more common, and therefore a more widely studied version of the general activity recognition problem.

The problem is to predict the activity given a snapshot of sensor data, typically data from one or a small number of sensor types. Generally, this problem is framed as a univariate or multivariate time series classification task.

It is a challenging problem because there is no obvious or direct way to relate the recorded sensor data to specific human activities, and each subject may perform an activity with considerable variation, resulting in variations in the recorded sensor data.

The intent is to record sensor data and corresponding activities for specific subjects, fit a model from this data, and generalize the model to classify the activity of new, unseen subjects from their sensor data.

Advantages of Neural Network Modelling

Traditionally, methods from the field of signal processing were used to analyze and distill the collected sensor data.

These methods were used for feature engineering: creating domain-specific, sensor-specific, or signal processing-specific features and views of the original data. Statistical and machine learning models were then trained on the processed version of the data.

A limitation of this approach is the signal processing and domain expertise required to analyze the raw data and engineer the features needed to fit a model. This expertise would be required for each new dataset or sensor modality. In essence, it is expensive and does not scale.

However, in most everyday HAR tasks, those methods may rely heavily on heuristic, handcrafted feature extraction, which is usually limited by human domain knowledge. Furthermore, only shallow features can be learned by those approaches, leading to undermined performance on unsupervised and incremental tasks. Owing to these limitations, the performance of conventional [pattern recognition] methods is restricted in terms of classification accuracy and model generalization.

Ideally, learning methods could be used that automatically learn the features required to make accurate predictions directly from the raw data. This would allow new problems, new datasets, and new sensor modalities to be adopted quickly and cheaply.

Recently, deep neural network models have started delivering on their promise of feature learning and are achieving state-of-the-art results for human activity recognition. They are capable of performing automatic feature learning from the raw sensor data and can outperform models fit on hand-crafted, domain-specific features.

In deep learning methods, feature extraction and model building are often performed simultaneously. The features are learned automatically by the network instead of being manually designed. In addition, a deep neural network can extract high-level representations in its deeper layers, which makes it more suitable for complex activity recognition tasks.

There are two main approaches to neural networks that are appropriate for time series classification and that have been shown to perform well on activity recognition using sensor data from commodity smartphones and fitness tracking devices.

They are convolutional neural network models and recurrent neural network models.

RNNs and LSTMs are recommended for recognizing short activities that have a natural order, while CNNs are better at inferring long-term repetitive activities. The reason is that an RNN can make use of the time-order relationship between sensor readings, while a CNN is more capable of learning deep features contained in recursive patterns.

Supervised Learning Data Representation

Before diving into the specific neural networks that can be used for human activity recognition, we need to talk about data preparation.

Both types of neural networks suitable for time series classification require that data be prepared in a specific manner in order to fit a model. That is, in a ‘supervised learning’ way that allows the model to associate signal data with an activity class.

A straightforward data preparation approach, used both for classical machine learning methods on hand-crafted features and for neural networks, involves dividing the input signal data into windows of signals, where a given window may contain one to a few seconds of observation data. This is often called a ‘sliding window’.

Human activity recognition aims to infer the actions of one or more persons from a set of observations captured by sensors. Usually, this is performed by following a fixed-length sliding window approach for feature extraction, where two parameters have to be fixed: the size of the window and the shift.

Each window is also associated with a specific activity. A given window of data may have multiple variables, such as the x, y, and z axes of an accelerometer sensor.

Let’s make this concrete with an example.

Suppose we have 10 minutes of sensor data; it might look like:

x, y, z, activity
1.1, 2.1, 0.1, 1
1.2, 2.2, 0.2, 1
1.3, 2.3, 0.3, 1

If the data is recorded at 8 Hz, there will be eight rows of data for each second of elapsed time performing an activity.

We may choose to have one window of data represent one second of data; that means eight rows of data for an 8 Hz sensor. If we have x, y, and z data, that means we would have three variables. Therefore, a single window of data would be a two-dimensional array with eight time steps and three features.

One window represents one sample. One minute of data would be 480 sensor data points, or 60 windows of eight time steps. The full ten minutes of data would be 4,800 data points, or 600 windows of data.

It is convenient to describe the shape of our prepared sensor data in terms of the number of samples or windows, the number of time steps in a window, and the number of features observed at each time step.

[samples, time steps, features]

Our example of ten minutes of accelerometer data recorded at 8 Hz would be summarized as a three-dimensional array with the dimensions:

[600, 8, 3]
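
The arithmetic above can be checked with a short NumPy sketch. It assumes the readings are already loaded as an array with one row per reading and one column per axis; the data here is random placeholder data, and the helper name make_windows is made up for this illustration.

```python
import numpy as np

def make_windows(data, window_size):
    """Split a [rows, features] array into non-overlapping windows.

    Returns an array shaped [samples, time steps, features]. Trailing rows
    that do not fill a complete window are dropped.
    """
    n_windows = data.shape[0] // window_size
    trimmed = data[: n_windows * window_size]
    return trimmed.reshape(n_windows, window_size, data.shape[1])

# Ten minutes of synthetic x, y, z readings at 8 Hz (placeholder values).
sample_rate_hz = 8
n_rows = 10 * 60 * sample_rate_hz  # 4,800 rows
signal = np.random.default_rng(0).normal(size=(n_rows, 3))

windows = make_windows(signal, window_size=sample_rate_hz)
print(windows.shape)  # (600, 8, 3)
```

Each of the 600 samples is one second of data, with eight time steps and three features, matching the shape given above.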

There is no best window size; it depends on the specific model being used, the nature of the sensor data that was collected, and the activities that are being classified.

There is a tension between the size of the window and the size of the model. Larger windows require larger models that are slower to train, whereas smaller windows require smaller models that are much easier to fit.

Intuitively, decreasing the window size allows for faster activity detection, as well as reduced resource and energy needs. On the contrary, large data windows are normally considered for the recognition of complex activities.

Nevertheless, it is common to use one to two seconds of sensor data in order to classify a current fragment of an activity.

From the results, reduced windows (2 s or less) are shown to provide the most accurate detection performance. In fact, the most precise recognizer is obtained for very short windows (0.25-0.5 s), leading to the best recognition of most activities. Contrary to what is often thought, this study shows that large window sizes do not necessarily translate into better recognition performance.

There is some risk that splitting the stream of sensor data into windows may result in windows that miss the transition from one activity to another. As such, it was traditionally common to split data into windows with an overlap, such that, in the case of a 50% overlap, the first half of each window contains the observations from the last half of the previous window.

An incorrect length may truncate an activity instance. In many cases, errors appear at the beginning or at the end of the activities, when the window overlaps the end of one activity and the beginning of the next one. In other cases, the window length may be too short to provide the best information for the recognition process.

It is unclear whether windows with overlap are required for a given problem.

When adopting neural network models, the use of overlaps, such as a 50% overlap, will roughly double the size of the training data, which may help when modelling smaller datasets, but may also lead to models that overfit the training dataset.
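
To see the effect of overlap on the number of samples, here is a minimal NumPy sketch. It assumes the same 4,800-row, three-axis layout as the earlier example; the helper name sliding_windows and the placeholder data are illustrative only.

```python
import numpy as np

def sliding_windows(data, window_size, step):
    """Cut [rows, features] data into windows that start every `step` rows."""
    starts = range(0, data.shape[0] - window_size + 1, step)
    return np.stack([data[s : s + window_size] for s in starts])

signal = np.arange(4800 * 3, dtype=float).reshape(4800, 3)  # placeholder data

no_overlap = sliding_windows(signal, window_size=8, step=8)
half_overlap = sliding_windows(signal, window_size=8, step=4)

print(no_overlap.shape)    # (600, 8, 3)
print(half_overlap.shape)  # (1199, 8, 3)
```

With a step of half the window size, the first four rows of each window repeat the last four rows of the previous one, and the sample count nearly doubles, from 600 to 1,199.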

An overlap between adjacent windows is tolerated for certain applications; however, this is less frequently used.

Convolutional Neural Network Models

Convolutional neural network models, or CNNs for short, are a type of deep neural network developed for use with image data, such as handwriting recognition.

They have proven very effective on challenging computer vision problems when trained at scale, for tasks such as identifying and localizing objects in images and automatically describing the content of images.

They are models composed of two main types of elements: convolutional layers and pooling layers.

Convolutional layers read an input, such as a 2D image or a 1D signal, using a kernel that reads in small segments at a time and steps across the entire input field. Each read produces an output that is projected onto a feature map and represents an internal interpretation of the input.

Pooling layers take the feature map projections and distill them to the most essential elements, such as by using a signal averaging or signal maximizing process.

The convolution and pooling layers can be repeated at depth, providing multiple layers of abstraction of the input signals.

The output of these networks is often one or more fully connected layers that interpret what has been read and map this internal representation to a class value.
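
The convolution and pooling steps can be illustrated on a single 1D signal with plain NumPy. This is a sketch, not a trainable layer: the kernel is hand-picked rather than learned, and it uses cross-correlation (no kernel flip), which is an assumption that matches how most deep learning libraries implement "convolution".

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid 1D convolution, computed as cross-correlation (no kernel flip)."""
    k = len(kernel)
    return np.array([np.dot(signal[i : i + k], kernel)
                     for i in range(len(signal) - k + 1)])

def max_pool1d(feature_map, pool_size):
    """Non-overlapping max pooling; leftover values at the end are dropped."""
    n = len(feature_map) // pool_size
    return feature_map[: n * pool_size].reshape(n, pool_size).max(axis=1)

signal = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0])  # one axis, one window
kernel = np.array([-1.0, 0.0, 1.0])  # hand-picked slope detector, not learned

feature_map = conv1d(signal, kernel)  # length 8 - 3 + 1 = 6
pooled = max_pool1d(feature_map, 2)   # length 3

print(feature_map.tolist())  # [2.0, 2.0, 0.0, -2.0, -2.0, 0.0]
print(pooled.tolist())       # [2.0, 0.0, 0.0]
```

The kernel responds positively where the signal rises and negatively where it falls; pooling then keeps only the strongest response in each pair.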

CNNs can be applied to human activity recognition data.

The CNN model learns to map a given window of signal data to an activity, where the model reads across each window of data and prepares an internal representation of the window.

When applied to time series classification like HAR, CNN has two advantages over other models: local dependency and scale invariance. Local dependency means that nearby signals in HAR are likely to be correlated, while scale invariance refers to invariance to different paces or frequencies.

The first important work applying CNNs to HAR was by Ming Zeng, et al. in their 2014 paper entitled “Convolutional Neural Networks for Human Activity Recognition using Mobile Sensors.”

In the paper, the authors develop a simple CNN model for accelerometer data, where each axis of the accelerometer data is fed into separate convolutional layers and pooling layers, then concatenated before being interpreted by hidden fully connected layers.

The image below, taken from the paper, clearly shows the topology of the model. It provides a good template for how a CNN may be used for HAR problems and for time series classification in general.

There are several ways to model HAR problems with CNNs.

One interesting example is by Heeryon Cho and Sang Min Yoon in their 2018 paper entitled “Divide and Conquer-Based 1D CNN Human Activity Recognition Using Test Data Sharpening.”

In it, they divide activities into those that involve movement, called “dynamic”, and those where the subject is stationary, called “static”, then develop a CNN model to discriminate between these two main classes. Then, within each class, models are developed to discriminate between activities of that type, such as “walking” for dynamic and “sitting” for static.

They refer to this as a two-stage modelling approach.

Instead of directly recognizing the individual activities using a single six-class classifier, we apply a divide and conquer approach and build a two-stage activity recognition process, where abstract activities, i.e., dynamic and static activity, are first recognized using a two-class or binary classifier, and then individual activities are recognized using two 3-class classifiers.
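
The routing logic of this divide and conquer scheme can be sketched in a few lines of Python. This only illustrates the control flow: the three functions standing in for trained models (stage_one, dynamic_model, static_model) and the threshold are hypothetical placeholders, not the authors' CNNs.

```python
def two_stage_predict(window, stage_one, dynamic_model, static_model):
    """Route a window through a binary model, then the matching 3-class model."""
    group = stage_one(window)  # "dynamic" or "static"
    if group == "dynamic":
        return dynamic_model(window)
    return static_model(window)

# Toy stand-ins for trained models (hypothetical, for illustration only).
def stage_one(window):
    spread = max(window) - min(window)
    return "dynamic" if spread > 1.0 else "static"

def dynamic_model(window):
    return "walking"  # a real model would choose among three dynamic activities

def static_model(window):
    return "sitting"  # a real model would choose among three static activities

print(two_stage_predict([0.1, 2.3, -1.0, 1.9], stage_one, dynamic_model, static_model))  # walking
print(two_stage_predict([0.1, 0.2, 0.1, 0.2], stage_one, dynamic_model, static_model))   # sitting
```

Each second-stage model only ever sees windows from its own group, so it can specialize on a smaller, more homogeneous set of classes.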

Quite large CNN models were developed, which in turn allowed the authors to claim state-of-the-art results on challenging standard human activity recognition datasets.

Another interesting approach was proposed by Wenchao Jiang and Zhaozheng Yin in their 2015 paper entitled “Human Activity Recognition Using Wearable Sensors by Deep Convolutional Neural Networks.”

Instead of using 1D CNNs on the signal data, they combine the signal data to create “images”, which are then fed to a 2D CNN and processed as image data, with convolutions along the time axis of the signals and across signal variables, specifically accelerometer and gyroscope data.

Firstly, raw signals are stacked row by row into a signal image […]. In the signal image, every signal sequence has the chance to be adjacent to every other sequence, which enables the DCNN to extract hidden correlations between neighbouring signals. Then, a 2D Discrete Fourier Transform (DFT) is applied to the signal image and its magnitude is chosen as our activity image.
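
A simplified version of this transformation can be written with NumPy's FFT routines. Note the assumption: this sketch stacks each signal once, in its original order, whereas the paper describes a stacking order that places every signal next to every other; the window size and random data are placeholders.

```python
import numpy as np

def activity_image(window):
    """Turn a [time steps, signals] window into a 2D DFT magnitude image."""
    signal_image = window.T               # one row per signal, one column per time step
    spectrum = np.fft.fft2(signal_image)  # 2D discrete Fourier transform
    return np.abs(spectrum)               # keep the magnitude only

rng = np.random.default_rng(0)
window = rng.normal(size=(64, 6))  # 64 time steps; 3 accelerometer + 3 gyroscope axes
image = activity_image(window)
print(image.shape)  # (6, 64)
```

The resulting array has the same shape as the signal image and can be passed to a 2D CNN like any single-channel image.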

Below is a depiction of the processing of raw sensor data into images, and then from images into an “activity image”, the result of a discrete Fourier transform.

Finally, another good paper on the topic is by Charissa Ann Ronao and Sung-Bae Cho in 2016, entitled “Human activity recognition with smartphone sensors using deep learning neural networks.”

A careful study of the use of CNNs is performed, showing that larger kernel sizes over the signal data are useful, as is limited pooling.

Experiments show that convnets indeed derive relevant and more complex features with every additional layer, although the difference in feature complexity level decreases with every additional layer. A wider time span of temporal local correlation can be exploited (1×9 – 1×14), and a low pooling size (1×2 – 1×3) is shown to be beneficial.

Usefully, they also provide the full hyperparameter configuration for the CNN models, which may provide a useful starting point on new HAR and other sequence classification problems, summarized below.

Recurrent Neural Network Models

Recurrent neural networks, or RNNs for short, are a type of neural network designed to learn from sequence data, such as sequences of observations over time or a sequence of words in a sentence.

A specific type of RNN called the long short-term memory network, or LSTM for short, is perhaps the most widely used RNN, as its careful design overcomes the general difficulties in training a stable RNN on sequence data.

LSTMs have proven effective on challenging sequence prediction problems when trained at scale, for tasks such as handwriting recognition, language modelling, and machine translation.

A layer in an LSTM model is composed of special units that have gates that govern input, output, and recurrent connections, the weights of which are learned. Each LSTM unit also has internal memory or state that is accumulated as an input sequence is read and can be used by the network as a type of local variable or memory register.
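
To make the gates and internal state concrete, here is a minimal NumPy sketch of the forward pass of one LSTM layer over a single window. It assumes the standard LSTM formulation with input, forget, and output gates; the weights are random rather than learned, and the sizes are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM cell.

    W: [4 * units, features], U: [4 * units, units], b: [4 * units],
    with rows ordered as input gate, forget gate, candidate, output gate.
    """
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * units : 1 * units])  # input gate
    f = sigmoid(z[1 * units : 2 * units])  # forget gate (controls the recurrent state)
    g = np.tanh(z[2 * units : 3 * units])  # candidate values for the memory
    o = sigmoid(z[3 * units : 4 * units])  # output gate
    c = f * c_prev + i * g                 # internal memory accumulated over time
    h = o * np.tanh(c)                     # output passed on to the next layer
    return h, c

rng = np.random.default_rng(0)
features, units = 3, 4
W = rng.normal(scale=0.1, size=(4 * units, features))  # random, not trained
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

window = rng.normal(size=(8, features))  # 8 time steps of x, y, z
h, c = np.zeros(units), np.zeros(units)
for x in window:  # read one time step at a time
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (4,)
```

The cell state c plays the role of the memory register: it is carried from one time step to the next and updated only through the gates.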

Like the CNN, which can read across an input sequence, the LSTM reads a sequence of input observations and develops its own internal representation of the input sequence. Unlike the CNN, the LSTM is trained in a way that pays specific attention to the observations made and the prediction errors made over the time steps in the input sequence, called backpropagation through time.

LSTMs can be applied to the problem of human activity recognition.

The LSTM learns to map each window of sensor data to an activity, where the observations in the input sequence are read one at a time and each time step may consist of one or more variables (e.g. parallel sequences).

There has been limited application of simple LSTM models to HAR problems.

One example is by Abdulmajid Murad and Jae-Young Pyun in their 2017 paper entitled “Deep Recurrent Neural Networks for Human Activity Recognition.”

Importantly, in the paper they comment on the limitation of CNNs in their requirement to operate on fixed-size windows of sensor data, a limitation that LSTMs do not strictly have.

However, the size of convolutional kernels restricts the captured range of dependencies between data samples. As a result, typical models are unadaptable to a wide range of activity-recognition configurations and require fixed-length input windows.

They explore the use of LSTMs that process the sequence data forward only (normal) and in both directions (bidirectional LSTM). Interestingly, the LSTM predicts an activity for each input time step of a subsequence of sensor data; these predictions are then aggregated to predict an activity for the window.

There will be a score for each time step predicting the type of activity occurring at time t. The prediction for the entire window T is obtained by merging the individual scores into a single prediction.
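
One plausible way to merge per-time-step scores into a single window prediction is sketched below. Summing the scores and applying a softmax is an assumption made for this illustration; the exact merging rule in the paper may differ, and the scores are made-up numbers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def window_prediction(step_scores):
    """Merge [time steps, classes] scores into one class index for the window."""
    merged = step_scores.sum(axis=0)  # assumption: merge by summing over time
    probs = softmax(merged)
    return int(np.argmax(probs)), probs

# Made-up scores for a window of four time steps and three activity classes.
step_scores = np.array([
    [2.0, 0.5, 0.1],
    [1.5, 1.0, 0.2],
    [0.3, 1.2, 0.1],
    [1.8, 0.4, 0.3],
])
label, probs = window_prediction(step_scores)
print(label)  # 0
```

The third time step on its own favors class 1, but the merged scores still select class 0 for the window as a whole.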

The figure below, taken from the paper, depicts the LSTM model followed by fully connected layers used to interpret the internal representation of the raw sensor data.

It may be more common to use an LSTM in conjunction with a CNN on HAR problems, in a CNN-LSTM model or ConvLSTM model.

This is where a CNN model is used to extract the features from a subsequence of raw sample data, and the output features from the CNN for each subsequence are then interpreted by an LSTM in aggregate.

An example of this is in the 2016 paper entitled “Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition.”

We introduce a new DNN framework for wearable activity recognition, which we refer to as DeepConvLSTM. This architecture combines convolutional and recurrent layers. The convolutional layers act as feature extractors and provide abstract representations of the input sensor data in feature maps. The recurrent layers model the temporal dynamics of the activation of the feature maps.

A deep network architecture is used with four convolutional layers and no pooling layers, followed by two LSTM layers to interpret the extracted features across multiple time steps.

The authors claim that the removal of the pooling layers is a critical part of their model architecture, as the use of pooling layers after the convolutional layers interferes with the convolutional layers’ ability to learn to downsample the raw sensor data.

In the literature, CNN frameworks often include convolutional and pooling layers successively, as a measure to reduce data complexity and introduce translation-invariant features. Nevertheless, such an approach is not strictly part of the architecture, and in the time series domain […] DeepConvLSTM does not include pooling operations because the input of the network is constrained by the sliding window mechanism […] and this fact limits the possibility of downsampling the data, given that DeepConvLSTM requires a data sequence to be processed by the recurrent layers. However, without the sliding window requirement, a pooling mechanism could be useful to cover different sensor data time scales at deeper layers.

The image below, taken from the paper, makes the architecture clearer. Note that layers six and seven in the image are in fact LSTM layers.

Further Reading

This section provides additional resources on the topic if you are looking to go deeper.


General

  • Deep Learning for Sensor-based Activity Recognition: A Survey, 2018.

Sliding Windows

  • A Dynamic Sliding Window Approach for Activity Recognition, 2011.
  • Window Size Impact in Human Activity Recognition, 2014.

Convolutional Neural Networks

  • Convolutional Neural Networks for Human Activity Recognition using Mobile Sensors, 2014.
  • Divide and Conquer-Based 1D CNN Human Activity Recognition Using Test Data Sharpening, 2018.
  • Human Activity Recognition Using Wearable Sensors by Deep Convolutional Neural Networks, 2015.
  • Human Activity Recognition with Smartphone Sensors using Deep Learning Neural Networks, 2016.

Recurrent Neural Networks

  • Deep Recurrent Neural Networks for Human Activity Recognition, 2017.
  • Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition, 2016.


Summary

In this blog article, you discovered the problem of human activity recognition and the deep learning methods that are achieving state-of-the-art performance on it.

Specifically, you learned:

  • Activity recognition is the problem of predicting the movement of a person, often indoors, based on sensor data, such as an accelerometer in a smartphone.
  • Streams of sensor data are often split into subsequences called windows, and each window is associated with a broader activity; this is called a sliding window approach.
  • Convolutional neural networks and long short-term memory networks, and perhaps both together, are well suited to learning features from raw sensor data and predicting the associated movement.