4.3. Experimental Results and Discussion
In the first experiment, we evaluated our proposed method on the MobiAct, DaLiAc, UCI-HAR, and UC-HAR datasets. The recognition results are presented as confusion matrices in
Figure 4 and summarized as average recognition accuracy in
Table 3. With the MobiAct dataset, our method yields 100% recognition accuracy on the test set. The three remaining datasets are more challenging, but the overall accuracy is still impressive, with 98.9%, 97.11%, and 98.02% accuracy on the DaLiAc, UCI-HAR, and UC-HAR datasets, respectively. With the DaLiAc dataset, our method misrecognized some activities within the same group, for example, vacuuming as sweeping in the HOUSE group and bicycling at 50 W as bicycling at 100 W, because of the similarity of those activities in a realistic environment and the inhomogeneity of the actors.
The activities in the REST (sitting, lying, and standing) and WALK (walking, running, and ascending/descending stairs) groups are recognized with very high accuracy. On the UCI-HAR dataset, our method confused walking with walking downstairs and walking upstairs with walking downstairs, but sitting and lying were recognized precisely. This result is explained by the position of the smartphone during data collection for that dataset; all volunteers wore the smartphone on the waist. In our dataset, standing and stretching were confused because the time between two consecutive stretching actions was considered a standing activity. Furthermore, sweeping was sometimes detected as stretching because of the position of the smartwatch.
In the second experiment, we compared our proposed encoding technique with four other approaches that transform inertial sensor signals into an image for activity representation: the first one [39], called the raw signal plot method, plots the acceleration signal directly as a time series and represents it as a gray-scale image; the second one [40], called the spectrogram method, plots the spectrogram of an inertial signal after computing the squared Short-Time Fourier Transform for input into a deep neural network; the third one [41], called the recurrence plot method, computes a distance matrix that captures temporal patterns in the signal and represents it as an image with texture patterns; the last one [42], called the multichannel method, encodes the acceleration signal (including X, Y, and Z) into the corresponding red, green, and blue channels of a color image by normalizing, scaling, and rounding each real value into an integer for pixel representation. Some example activity images generated by these transformation methods are presented in
Figure 5.
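To make the comparison concrete, the following is a minimal sketch of the multichannel-style encoding [42], which maps each [x, y, z] sample to one RGB pixel; the sensor range, sampling rate, and function name are illustrative assumptions rather than the exact settings used in [42].

```python
import numpy as np

def multichannel_encode(acc, acc_min=-20.0, acc_max=20.0):
    """Encode a window of tri-axial acceleration into an RGB image strip.

    acc: array of shape (n_samples, 3) with [x, y, z] values in m/s^2.
    acc_min/acc_max: assumed sensor range used for normalization.
    """
    # Normalize each reading to [0, 1], then scale and round to 0..255.
    norm = (np.clip(acc, acc_min, acc_max) - acc_min) / (acc_max - acc_min)
    pixels = np.rint(norm * 255).astype(np.uint8)   # shape (n_samples, 3)
    # Each sample [x, y, z] becomes one RGB pixel in the output image.
    return pixels.reshape(1, -1, 3)                 # 1 x n_samples x 3 image

# Example: a 3 s window at an assumed 50 Hz gives a 1 x 150 RGB strip.
window = np.random.uniform(-20, 20, size=(150, 3))
print(multichannel_encode(window).shape)            # (1, 150, 3)
```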
In this experiment, we evaluated and compared the accuracy of HAR using our deep network with the input images generated by each of the different transformation techniques on all four datasets. All of the benchmarked techniques were realized by our own implementations. For the comparison reported in Table 4, Iss2Image was replaced by each of the other transformation methods without modifying our deep neural network. As shown in Table 4, our proposed encoding technique outperformed the other transformation methods for most of the benchmarked datasets: by 0.97% on MobiAct, 6.53% on DaLiAc, 4.87% on UCI-HAR, and 3.72% on UC-HAR on average. Compared with the raw signal plot, spectrogram, and recurrence plot approaches, Iss2Image is much more powerful, being 4.45%, 4.3%, and 6.59% more accurate, respectively, on average across all datasets. Like Iss2Image, the multichannel approach encodes a sensor signal sample [x, y, z] into a pixel with three values for the red, green, and blue channels; however, it encodes only the integer part of each real number, whereas Iss2Image encodes the integer part plus four decimal places. Thus, the precision of the multichannel approach is lower than that of our proposed technique, which lowers its accuracy by approximately 0.36% on average across all datasets. Clearly, plotting a raw signal, spectrogram, or recurrence plot is not an efficient way to represent activity signals as images because of the distortion of the original information during the conversion and the complexity of the operation.
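The precision difference between the two encodings can be illustrated with the hedged snippet below, which contrasts integer-only truncation with truncation that retains four decimal places, as Iss2Image does; the exact pixel mapping of both methods is simplified here.

```python
import numpy as np

# One acceleration sample in m/s^2.
sample = np.array([3.14159, -7.26548, 9.80665])

# Multichannel-style: keep only the integer part before pixel mapping.
integer_only = np.trunc(sample)               # [ 3., -7.,  9.]

# Iss2Image-style: keep the integer part plus four decimal places.
four_decimals = np.trunc(sample * 1e4) / 1e4  # [ 3.1415, -7.2654,  9.8066]

# Quantization error of each representation.
print(np.abs(sample - integer_only))          # up to ~1 m/s^2 lost
print(np.abs(sample - four_decimals))         # at most 1e-4 m/s^2 lost
```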
In the third experiment, we compared the accuracy of our network (trained from scratch and pre-trained on CIFAR-10 [43]) with three other pre-trained CNNs: ResNet18 [44], AlexNet [45], and GoogleNet [46]. Note that the depth of our network is small, with only six layers. ResNet18, containing 18 layers, introduces a deep residual learning framework: instead of hoping that each group of stacked layers directly fits a desired underlying mapping, it explicitly lets these layers fit a residual mapping to address the degradation problem. AlexNet, containing eight layers, adopted ReLUs, local response normalization, and overlapping pooling to improve performance and reduce training time. GoogleNet, containing 22 layers, is based on NIN (Network In Network) [47] and adds the Inception architecture, which combines 1 × 1, 3 × 3, and 5 × 5 convolutions for efficiency, followed by 1 × 1 convolutions for dimension reduction to reduce computation. All three networks classify images into 1000 object categories and were trained on the ImageNet database of 1.2 million images. To pre-train UCNet6, we used the CIFAR-10 dataset, which has 50,000 training images and 10,000 test images in 10 classes.
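As a hedged illustration of the pre-training and fine-tuning setup, the sketch below loads a pre-trained network, replaces its classification head for a new number of activity classes, and fine-tunes it; it uses torchvision's ResNet18 as a stand-in because the UCNet6 definition is not reproduced here, and the class count and learning rate are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_ACTIVITIES = 13  # illustrative number of activity classes

# Load a network pre-trained on a large image dataset and replace its
# classifier head so it predicts activity classes instead of the original
# 1000 object categories.
net = models.resnet18(pretrained=True)
net.fc = nn.Linear(net.fc.in_features, NUM_ACTIVITIES)

# Fine-tune all layers; a small learning rate is typical when starting
# from pre-trained weights.
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimization step on a batch of activity images."""
    optimizer.zero_grad()
    loss = criterion(net(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```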
From the experimental results, the highest accuracy for the different signal transformation methods is achieved by ResNet18 and GoogleNet. The recurrence plot and Iss2Image methods showed their highest accuracy on ResNet18, while the raw signal plot, spectrogram, and multichannel methods showed their highest results on GoogleNet. The raw signal plot has the lowest accuracy on all five networks, which indicates that it is the least effective signal transformation. A likely reason is that the large blank area in such images causes the convolutions to process meaningless information. The spectrogram and recurrence plot showed accuracy between the raw signal plot and the multichannel/Iss2Image group. On all of the networks, the multichannel and Iss2Image methods showed accuracy above 98%, meaning that the images produced by these two methods are concise and informative. UCNet6 trained from scratch shows the lowest performance on the overall average, especially on the high-resolution images, but shows comparable results on multichannel and Iss2Image. Pre-trained UCNet6 overcomes this weakness, showing comparable results on the other three signal transformation methods and an average accuracy of 96.41%. Although this is still lower than the other public pre-trained networks, the difference is under 1%: 0.66% for ResNet18, 0.84% for GoogleNet, and 0.27% for AlexNet. Because UCNet6 is intended to be fast and lightweight so that it can run on mobile platforms in future work, this performance is acceptable, and it has an advantage in training and inference time, as shown in the last part of this section. The three public networks are pre-trained on very large datasets, giving them the ability to classify images with richer feature maps. Pre-trained UCNet6 is trained on a small-scale dataset but possesses well-optimized parameters, which allows comparable performance. Meanwhile, UCNet6 trained from scratch shows the lowest performance, but the gap is trivial. The recognition accuracy comparison of the different transformation methods on the different networks is shown in
Table 5.
In the last experiment, we compared our proposed method with state-of-the-art methods for HAR on the three public datasets, MobiAct, DaLiAc, and UCI-HAR, in terms of recognition accuracy. For a fair comparison, we strictly followed the benchmark setups (such as dataset partition and k-fold validation) indicated in the published research:
MobiAct dataset: In the dataset paper [
36], MobiAct was evaluated using a window size of 5 s with an overlapping ratio of 80%. The recognition was performed using a conventional approach with three components: feature extraction, feature selection, and classification. In particular, the authors extracted and manually selected 64 features to train the IBk and J48 classifiers. Both classifiers yield very high accuracy, 99.88% and 99.30%, on the MobiAct dataset, as shown in Table 6. However, this approach cannot precisely recognize highly similar activities, such as stairs up and stairs down, because of the limitations of feature engineering. The method in [48] resampled the signal to 20 Hz, segmented the data into 10 s windows without overlap, extracted features using an autoregressive model, and classified them with an SVM. However, this approach also confused the similar activities stairs up and stairs down, resulting in 97.45% accuracy. Our Iss2Image-UCNet6 method consistently achieves outstanding performance on MobiAct, with an average accuracy of 100%. In Iss2Image-UCNet6, many features are learned inside the network by the convolutional, ReLU, and pooling layers, producing better recognition accuracy than is available with traditional classifiers such as k-nearest neighbors and decision trees.
DaLiAc dataset: Following the guidance in [
37], we re-evaluated our Iss2Image-UCNet6 method on DaLiAc with a 5 s window and a 50% overlapping ratio. The comparison results are reported in Table 7, using the results for the other methods presented in the dataset paper. The authors extracted 152 features for each sliding window, including time and frequency domain features, and used a hierarchical classification system comprising AdaBoost, a classification and regression tree, k-nearest neighbor (kNN), and SVM. The methods in [49,50] also extracted features from the acceleration signal in both the time and frequency domains; however, they each use a single classifier, a decision tree and kNN, respectively. The work in [51] divided the subjects of the DaLiAc dataset into three subsets for training, validation, and testing. Ten feature extraction steps were applied to the original and magnitude time series, and features were selected by discarding unimportant ones and applying a diversified forward-backward feature selection method. Among six different classifiers, SVM showed the highest accuracy, 93%. Following the experimental configuration in [37], we evaluated the Iss2Image-UCNet6 method with a leave-one-subject-out procedure. In this comparison, our method outperformed the existing approaches with an impressive improvement in accuracy. Compared with traditional classification techniques, a deep neural network is much more powerful in classifying a large dataset. In addition, transforming the sensor signal into an image without much data distortion, as Iss2Image does, is important for achieving high recognition accuracy.
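A minimal sketch of the leave-one-subject-out procedure is given below; the kNN classifier and synthetic data are placeholders standing in for Iss2Image-UCNet6 and the DaLiAc windows, so only the splitting logic reflects the evaluation described above.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins: 600 windows of flattened features, 6 classes, 19 subjects.
rng = np.random.default_rng(0)
features = rng.normal(size=(600, 450))     # e.g., flattened 150 x 3 windows
labels = rng.integers(0, 6, size=600)
subjects = rng.integers(0, 19, size=600)   # subject ID for each window

logo = LeaveOneGroupOut()
accuracies = []
for train_idx, test_idx in logo.split(features, labels, groups=subjects):
    clf = KNeighborsClassifier()           # placeholder for Iss2Image-UCNet6
    clf.fit(features[train_idx], labels[train_idx])
    accuracies.append(clf.score(features[test_idx], labels[test_idx]))

print("Leave-one-subject-out accuracy:", np.mean(accuracies))
```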
UCI-HAR dataset: For a fair comparison, we benchmarked our proposed method with a sliding window size of 2.56 s and 50% overlap. The comparison results are reported in
Table 8. In [
52], the authors extracted 17 features in the time and frequency domains from both accelerometer and gyroscope data and classified the activities using a MultiClass SVM (MC-SVM). Additionally, they developed a lightweight version with fixed-point arithmetic for energy efficiency. The accuracy of the standard and lightweight versions of the MC-SVM is acceptable, approximately 89.30% and 89.00%, respectively. The authors of [19] proposed a HAR system using deep convolutional neural networks that take the accelerometer and gyroscope data as input after some preprocessing steps. To improve performance, those convolutional networks were combined with an MLP. This combination strategy produced 94.79% recognition accuracy on average. Another strategy described in [19] combines the features extracted from the convolutional layers with the features extracted by FFT. That strategy improves on the first, with an average accuracy of 95.75%. The authors of [53] proposed a hierarchical classification scheme named GCHAR with two classification stages. The first stage is group-based classification, which assigns similar activities to a specific activity group. The second stage is context-awareness based and corrects the predicted activity within the proper group from the first stage. Compared with six other traditional classifiers, GCHAR showed the highest accuracy, 94.16%. With our Iss2Image-UCNet6 method, the accuracy suffered from decreasing the window size from 3 s to 2.56 s, so in this experiment, Iss2Image-UCNet6 achieved an average accuracy of only 96.84%, which is still better than the other methods. Our method is much better than the MC-SVM, whereas the accuracy improvement over Convnet is not significant. These results show the power of deep CNNs in classification tasks.
Finally, we benchmarked the sensor signal transformation time, as well as the training and inference times of the different networks, using our collected dataset.
From the perspective of the transformation time from inertial sensor signal to image, we fixed a time budget of 10 s and counted how many images each method created within it for a fair comparison. The raw signal plot and spectrogram methods created only a few images, fewer than 10. The recurrence plot created a moderate number of images, about 700, while the multichannel method and our proposed method created the most, with similar counts of 2838 and 2772 images, respectively.
Because the raw signal plot method plots the x, y, and z signals as separate component images and then combines them into a unified image, it spends much time on signal transformation. Similarly, the spectrogram method spends much time plotting the spectrogram and writing it to an image. Meanwhile, the recurrence plot first captures the temporal patterns in the signal and then plots the overall texture pattern as an image, which takes less time than the raw signal plot and spectrogram.
Compared with the three methods mentioned above, the multichannel method and our proposed approach take less time to convert the raw inertial signal into an image because they directly encode the raw data into pixel values. However, the computational cost of our method is higher than that of the multichannel method because we encode not only the integer part but also the fractional part. The comparison of signal transformation times is shown in
Table 9.
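The throughput measurement can be sketched as follows: each transformation function is given a fixed 10 s budget and the number of images it produces is counted; the placeholder transform and window size are illustrative assumptions.

```python
import itertools
import time
import numpy as np

def count_images_in_budget(transform_fn, window_stream, budget_s=10.0):
    """Count how many windows transform_fn converts to images within budget_s."""
    count = 0
    start = time.perf_counter()
    for window in window_stream:
        if time.perf_counter() - start >= budget_s:
            break
        transform_fn(window)
        count += 1
    return count

# Placeholder transform and data: a 3 s window at an assumed 50 Hz,
# encoded to integers as a stand-in for a real signal-to-image method.
window = np.random.uniform(-20, 20, size=(150, 3))
stream = itertools.repeat(window)   # endless supply of identical windows
print(count_images_in_budget(lambda w: (w * 1e4).astype(np.int64), stream))
```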
From the perspective of training time, we compared ResNet18, GoogleNet, AlexNet, and UCNet6 (trained from scratch and pre-trained) with the five signal transformation methods. A total of 38,279 activity samples were segmented with a sliding window size of three seconds and an overlap of one second. Because the networks require different input sizes, we generated images of different sizes for each method: 224 × 224 for ResNet18 and GoogleNet and 227 × 227 for AlexNet. The input size of the UCNet6 networks is not fixed and can be changed flexibly. The raw signal plot, spectrogram, and recurrence plot produce higher-resolution images, whereas the multichannel method and Iss2Image produce low-resolution images.
The training times of the public pre-trained networks were similar: on average, about 62 min for ResNet18, 57 min for GoogleNet, and 74 min for AlexNet. Within each network, the training times of the different signal transformation methods also did not differ greatly; the gap between the longest and shortest time is 4 min for ResNet18, 6 min for GoogleNet, and 7 min for AlexNet. Iss2Image was the fastest transformation method on GoogleNet but not on the other networks; nevertheless, it remained competitive, ranking third fastest on ResNet18 and second fastest on AlexNet.
UCNet6 trained from scratch took more than 5 h to train on the raw signal plot, spectrogram, and recurrence plot images. Without pre-trained parameters, it takes a long time to reach the desired training performance on high-resolution images. In contrast, training on the multichannel and Iss2Image images was very short, 7 min and 9 min, respectively, showing that the way the parameters are initialized (randomly or from a pre-trained model) does not affect the training speed on low-resolution images. Pre-trained UCNet6 performed best, with an average training time of 9 min. By utilizing transfer learning, it can now handle high-resolution images well. The comparison of training times on the different networks is shown in
Table 10.
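For reference, the segmentation used in this benchmark can be sketched as below, assuming a 50 Hz sampling rate (the actual rates differ per dataset); a 3 s window with a 1 s overlap advances by 2 s between consecutive windows.

```python
import numpy as np

def sliding_windows(signal, rate_hz=50, window_s=3.0, overlap_s=1.0):
    """Segment a (n_samples, 3) signal into overlapping windows.

    With a 3 s window and a 1 s overlap, consecutive windows start 2 s apart.
    """
    win = int(window_s * rate_hz)
    step = int((window_s - overlap_s) * rate_hz)
    return [signal[i:i + win]
            for i in range(0, len(signal) - win + 1, step)]

# Example: 60 s of synthetic tri-axial acceleration at an assumed 50 Hz.
signal = np.random.uniform(-20, 20, size=(60 * 50, 3))
windows = sliding_windows(signal)
print(len(windows), windows[0].shape)   # 29 windows of shape (150, 3)
```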
From the perspective of inference time, we compared ResNet18, GoogleNet, AlexNet, and our network with the five signal transformation methods. A total of 1000 activity samples were segmented with a sliding window size of three seconds and an overlap of one second. Because the networks require different input sizes, we generated images of different sizes for each method: 224 × 224 for ResNet18 and GoogleNet, 227 × 227 for AlexNet, and 150 × 6 for UCNet6. The reported inference time is the average over 10 executions.
As shown in Table 11, all five methods achieved their fastest inference time on UCNet6, with an average below one second, rather than on the other pre-trained networks. This is because UCNet6 has fewer and less complex layers, so an input image passes through the network end-to-end with less computation. Among the three pre-trained networks, GoogleNet has the most layers but is faster than ResNet18 because ResNet18 contains more complex layers, such as batch normalization and addition layers, which increase computation time. AlexNet has relatively few layers compared with the other two networks, which allows it to compute fastest among the three.
From the perspective of the signal transformation methods, Iss2Image was the second or third fastest method on all three pre-trained networks, and the difference from the multichannel method is negligible. The raw signal plot was the fastest at inference, but it is not practical in a real-time recognition environment because creating its images takes too much time. On UCNet6, which is optimized for Iss2Image, the multichannel and Iss2Image methods were the first and second fastest, faster than the raw signal plot.
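The inference timing procedure can be sketched as follows; the toy model stands in for UCNet6 and the other networks, the 150 × 6 input size follows the setting above, and the batch size and class count are illustrative.

```python
import time
import torch
import torch.nn as nn

def average_inference_time(model, inputs, runs=10):
    """Average forward-pass time (seconds) over a number of executions."""
    model.eval()
    with torch.no_grad():
        model(inputs)                       # warm-up pass, excluded from timing
        start = time.perf_counter()
        for _ in range(runs):
            model(inputs)
        return (time.perf_counter() - start) / runs

# Toy stand-in model for a small CNN and a batch of 1000 Iss2Image-sized inputs.
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, 3),                    # (3, 150, 6) -> (8, 148, 4)
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 148 * 4, 13),            # 13 illustrative activity classes
)
batch = torch.randn(1000, 3, 150, 6)
print(average_inference_time(toy_model, batch))
```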