1. Introduction
As a prominent research area in computer vision, human action recognition plays a crucial role in intelligent surveillance, human–computer interaction, video retrieval, and other applications. Recognition of human actions from RGB data has shown poor robustness to factors such as illumination changes and cluttered backgrounds [1,2]. To address these limitations, advanced studies [3,4] have integrated depth information into action recognition methods, aiming for higher accuracy and improved robustness against environmental changes, since depth information is less affected by illumination changes and can effectively filter out irrelevant texture and color information from the background. However, the redundancy in depth image information increases computational complexity, which limits its practical application.
With the continuous advancement of sensor technology, affordable depth cameras such as the Intel RealSense 3D [5] and Xtion PRO have gained popularity. These cameras enable the easy acquisition of 3D human joint coordinates from depth maps, which contain abundant motion information. The 3D skeletal data not only provide rich depth information but also offer advantages such as a simple format, comprehensive motion details, and straightforward computation. Consequently, skeletal information has gradually become the primary focus of research on human action recognition. However, effectively extracting motion information from 3D skeletal data to recognize human movements remains a significant challenge due to the noise present in the raw skeletal information captured by depth sensors and the blurred spatial–temporal relationships among joints. Hand-crafted features were exploited to recognize actions in [6,7]; however, such features are relatively simplistic, resulting in limited recognition accuracy and poor generalizability.
In recent years, deep learning models such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and graph convolutional networks (GCNs) have demonstrated promising results in various computer vision tasks. Building upon this progress, the field of action recognition has integrated deep learning models to improve both generalization capability and recognition performance. Leveraging the strong temporal sequence modeling abilities of RNNs, action recognition models were developed in [8,9]. Nevertheless, RNNs fail to effectively capture inter-joint spatial relationships, leading to underutilization of joint space information and limited improvements in recognition performance. To address this issue, a hybrid spatio-temporal convolutional network was proposed in [10] to exploit the powerful spatial feature extraction capabilities of CNNs for action recognition. In [11,12], action features were extracted from skeleton-sequence-encoded images using CNNs, where each piece of joint information was independently encoded as a color image; however, the related inter-joint information was disregarded. To overcome this limitation, a multi-task learning model was developed in [13] to capture the correlation between skeleton and action classes by maximizing potential edges. Additionally, an action recognition method based on joint distance maps (JDM) was proposed in [14,15], encoding pairwise joint distances as color images but neglecting the spatial constraints among joints during the encoding process, which leads to confusion of joint spatial information and limited recognition accuracy. Addressing this concern, an action recognition method based on a tree skeleton diagram and reference nodes, which incorporates human structure constraints into the skeleton sequence analysis, was introduced in [16]. However, this approach focuses solely on the static characteristics of joints while ignoring their dynamic characteristics and individual participation levels during the action completion process, leading to incompletely encoded motion information and the loss of spatial saliency information of joints, thereby limiting action recognition rates. It should also be noted that, in multi-person scenarios, issues such as the detection and separation of multiple targets and the effectiveness of joint combination must be addressed first; therefore, existing action recognition methods have mainly focused on the single-person case. Consequently, this paper considers the action recognition problem for a single person.
To address the aforementioned issues, this study presents a two-stream CNN (TS-CNN)-based action recognition method that integrates the spatio-temporal and dynamic information of the skeleton. The coordinate system of the skeletal sequences is first transformed by the proposed method, enabling the construction of a skeleton spatial–temporal map (SSTM) that effectively captures the relative positions of joints. Furthermore, the dynamic characteristics of joints are encoded into a joint motion speed map (JMSM) to highlight differences in motion characteristics. Additionally, the SSTM and JMSM features are enhanced by employing motion saliency and morphological operators to increase inter-class differences and reduce intra-class differences. Finally, the enhanced SSTM and JMSM are deeply fused using the TS-CNN to achieve effective action classification. Extensive experimental results demonstrate that the proposed recognition method, based on the fusion of skeletal spatio-temporal and dynamic feature information, outperforms state-of-the-art recognition algorithms in complex scenarios.
Along the above lines, the main contributions of this work can be summarized as follows:
(1) The coordinate system of the skeletal sequences, typically established with a depth sensor as the origin, is transformed into a body coordinate system with the hip joint serving as the origin. This transformation is justified by its relative immobility during movement, allowing for the effective representation of spatial information. By employing this formulated coordinate system, both the absolute coordinates of joints and the relative coordinates among joints are collectively encoded into a color texture map to construct SSTM characterizing the spatial–temporal features of actions.
(2) To effectively distinguish different actions implemented via joints with varying velocities, the velocity information for each joint involved in a specific action is extracted to depict the dynamic properties of the action. Consequently, the amplitude of the velocity information in each direction can be encoded as JMSM. With a view to further enhancing the representation ability of features, both spatial–temporal and dynamic features can be respectively improved by incorporating motion saliency through adjusting the color weight associated with each joint and utilizing morphology operators such as corrosion and expansion operations.
(3) To improve the performance of action recognition by effectively leveraging multiple types of information associated with skeleton sequences, TS-CNN is utilized to deeply integrate the extracted static and dynamic skeleton features, namely SSTM and JMSM. The constructed TS-CNN model comprises two enhanced versions of AlexNet, where both channels in the formulated TS-CNN possess identical structures that are meticulously designed.
The remainder of this work is organized as follows: The proposed method is described in Section 2. The effectiveness of the proposed approach is verified via numerical examples in Section 3. Finally, conclusions are given in Section 4.
2. Methods
The framework of the proposed method is illustrated in Figure 1. The proposed algorithm can be divided into the following four parts: First, the 3D coordinates of the skeletal sequence acquired by Kinect are transformed to a body coordinate system with the hip joint as the origin. Subsequently, the spatio-temporal and joint velocity information are encoded separately as SSTM and JMSM. Furthermore, motion saliency is utilized to enhance joint information exhibiting significant motion characteristics in the SSTM, while morphological operators are employed to improve the speed information in the JMSM. Finally, the TS-CNN is used to extract deep features from each descriptor, and the posterior probabilities are combined via multiplicative fusion to obtain the recognition result.
2.1. Coordinate System Transformation
The human body can be viewed as a hinged structure consisting of the torso and limbs, where each type of movement is executed via a directional circular motion of the limb around the hip joint. However, the skeletal sequences captured by depth sensors such as Kinect are mapped to a Cartesian coordinate system with the camera as the origin (shown in Figure 2). The joint coordinates are independent of each other and fail to characterize their relative positions.
Therefore, a coordinate system transformation of the skeletal 3D coordinates is necessary to obtain a body coordinate system that effectively represents the spatial information. The coordinates of the joints in the body coordinate system are relative to the origin of the coordinates. It should be noted that the choice of origin significantly affects the representation of the relative spatial information between the joints. The body coordinate system used in [9] takes the center of the spine as the origin, disregarding the influence of the bending motion of the upper torso on other joints. Accordingly, when the upper torso moves, the origin of the coordinates, i.e., the center of the spine, constantly changes its position, which causes the coordinate system to change continually between adjacent frames. As a result, this continual change decreases the correlation among joints. Therefore, a joint that remains as static as possible during the movement should be selected as the origin of coordinates to obtain relatively stable coordinate information.
Based on the aforementioned description, a body coordinate system with the hip joint as the origin is constructed here, owing to its small motion during actions. For a video sequence with $T$ frames, the coordinate transformation associated with the $N$ joints can be expressed as follows:
$$\mathbf{p}'_{j,t} = \mathbf{p}_{j,t} - \mathbf{p}_{\mathrm{hip},t}, \quad j = 1, \dots, N,\ t = 1, \dots, T,$$
where $\mathbf{p}_{j,t}$ and $\mathbf{p}'_{j,t}$ are the coordinates of joint $j$ at the $t$-th frame before and after the coordinate system transformation, respectively, and $\mathbf{p}_{\mathrm{hip},t}$ is the coordinate of the hip joint at the $t$-th frame. The joint visualization after the transformation is shown in Figure 3.
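To make the transformation concrete, the following minimal NumPy sketch implements the hip-centered translation described above; the array layout (T, N, 3) and the `HIP_INDEX` constant are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

HIP_INDEX = 0  # hypothetical position of the hip joint in the joint ordering

def to_body_coordinates(skeleton):
    """Translate camera-centered joints into the hip-centered body coordinate
    system: in every frame, the hip coordinates are subtracted from all joints.

    skeleton: array of shape (T, N, 3) with the 3D coordinates of N joints
              over T frames in the depth-sensor coordinate system.
    """
    skeleton = np.asarray(skeleton, dtype=np.float64)
    hip = skeleton[:, HIP_INDEX:HIP_INDEX + 1, :]   # (T, 1, 3), one hip position per frame
    return skeleton - hip                           # broadcast subtraction over all joints
```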
2.2. Skeletal Space-Time Feature Descriptor Construction
When the body’s joints collaborate to perform specific movements, there are distinct variations in the relative positions of joints across different movements. Hence, it is crucial to extract spatial information regarding the relative positions of joints during motion to effectively characterize the movement. To effectively encode the joint spatial position information, the absolute coordinates of joints and the relative coordinates among joints are jointly encoded into a color texture map here to formulate SSTM characterizing the spatial–temporal feature of the action.
With the skeletal sequence $\{\mathbf{p}'_{j,t}\}$ after coordinate transformation, the relative positions of joints can be obtained by the following equation:
$$\mathbf{r}_{m,n,t} = \mathbf{p}'_{m,t} - \mathbf{p}'_{n,t},$$
where $\mathbf{r}_{m,n,t}$ is the 3D coordinate of the $m$-th joint in the $t$-th frame relative to the $n$-th joint and also depicts the spatial information of the skeleton segment connecting the $m$-th and $n$-th joints; for $n = 0$, $\mathbf{r}_{m,0,t} = \mathbf{p}'_{m,t}$ is the absolute coordinate of the $m$-th joint.
With the description above, the spatial–temporal characteristics of the $m$-th joint can be represented by the set of its absolute and relative coordinates $\{\mathbf{r}_{m,n,t}\}$ over the correlated joints $n$ and all frames $t$.
In the skeletal structure, there are pathways among all joints, i.e., any two joints are connected by a certain number of edges (shown in Figure 4). The fewer the edges that connect two joints, the higher the correlation between them. With this spatial constraint, only the first- and second-level correlation information (i.e., joint pairs connected by one or two edges, respectively) is selected to reduce computational complexity and inter-class confusion and to improve intra-class robustness. The first-level correlation information comprises the relative coordinates $\mathbf{r}_{m,n,t}$ of joint pairs connected by only one edge, while the second-level correlation information comprises those of joint pairs connected by two edges; the specific pairs are determined by the skeleton topology in Figure 4.
The receptive field of a CNN expands with network depth, necessitating the extraction of spatial information associated with joint pairs exhibiting higher correlation in shallow layers and lower correlation in deep layers. In contrast to the JDM (shown in Figure 5a) proposed in [14,15], which arranges joint information as color images in a fixed order and ignores the differences in relative spatial information, the coordinate information is arranged here according to the body structure. As depicted in Figure 4, all joints can be categorized into left-arm, right-arm, left-leg, right-leg, and torso groups arranged according to the physical connections among joints. Taking the right arm as an example, the joints numbered 9–12, 24, and 25 in Figure 4 are adjacent to each other and thus highly correlated, so they can be grouped into one block to extract their spatial relationship features more effectively. With the information mentioned above, the resultant SSTM can effectively encode the spatial–temporal information of joints (shown in Figure 5b).
Based on the encoded skeletal sequences, the skeletal spatio-temporal feature map of an action is obtained by stacking, for every frame, the absolute joint coordinates and the selected relative coordinates $\mathbf{r}_{m,n,t}$ in the body-structure order described above. Let the three coordinate components (x, y, z) correspond to the R, G, and B channels; the stacked coordinates can then be transformed into the SSTM shown in Figure 5b, in which each column corresponds to one frame and each row is associated with specific joint coordinate information, so that the horizontal and vertical directions of the SSTM encode the temporal and spatial information, respectively. In this way, the spatio-temporal information relevant to the whole action is effectively encoded into a color texture map. Compared with the joint distances encoded in [14,15], the SSTM is relatively simple to calculate and contains more abundant spatial-domain information. Thus, it can effectively distinguish actions and is more robust to differences between similar actions.
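As an illustration of how the SSTM could be assembled, the sketch below stacks the absolute joint coordinates and the relative coordinates of selected joint pairs row by row and maps x/y/z to RGB; the pair list `FIRST_LEVEL_PAIRS`, the row ordering, and the per-sequence min–max scaling are assumptions made only for this example.

```python
import numpy as np

# Hypothetical first-level joint pairs (connected by one edge); the real pairs
# follow the skeleton topology of Figure 4.
FIRST_LEVEL_PAIRS = [(0, 1), (1, 2), (2, 3)]

def build_sstm(body_skeleton, pairs=FIRST_LEVEL_PAIRS):
    """Encode absolute and relative joint coordinates as a color texture map.

    body_skeleton: (T, N, 3) hip-centered coordinates.
    Returns an (R, T, 3) uint8 image: each row carries the absolute coordinates
    of one joint or the relative coordinates of one joint pair, each column
    corresponds to one frame, and x/y/z are mapped to the R/G/B channels.
    """
    T, N, _ = body_skeleton.shape
    rows = [body_skeleton[:, j, :] for j in range(N)]                 # absolute coordinates
    rows += [body_skeleton[:, m, :] - body_skeleton[:, n, :]          # relative coordinates
             for m, n in pairs]
    sstm = np.stack(rows, axis=0)                                     # (R, T, 3)
    lo, hi = sstm.min(), sstm.max()
    return (255.0 * (sstm - lo) / (hi - lo + 1e-8)).astype(np.uint8)  # write as 8-bit image
```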
2.3. Construction of Skeletal Motion Feature Descriptors
The completion of a specific action involves only certain joints, and different actions require the involvement of different joints with varying velocities. This fact allows the velocity information of each joint to be extracted to characterize the dynamic properties of the action. Due to individual differences in execution, the velocity directions vary widely among similar actions within the same class, which increases intra-class divergence. Consequently, feature descriptors that capture joint motion characteristics are constructed using only the scalar (magnitude) velocity information. The velocity values of joint $j$ in the $x$, $y$, and $z$ directions within the $t$-th frame can be expressed as follows:
$$v^x_{j,t} = \frac{|x_{j,t+1} - x_{j,t}|}{\Delta t}, \quad v^y_{j,t} = \frac{|y_{j,t+1} - y_{j,t}|}{\Delta t}, \quad v^z_{j,t} = \frac{|z_{j,t+1} - z_{j,t}|}{\Delta t},$$
where $|\cdot|$ denotes the absolute value operator, $(x_{j,t}, y_{j,t}, z_{j,t})$ are the 3D coordinates of the joint in the $t$-th frame, and the time step $\Delta t$ is given by
$$\Delta t = \frac{1}{\mathrm{FPS}},$$
where FPS is the frame rate of the employed Kinect camera. By mapping $v^x_{j,t}$, $v^y_{j,t}$, and $v^z_{j,t}$ to the R, G, and B channels, respectively, the joint motion information can be encoded as the JMSM.
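A corresponding sketch of the JMSM encoding is given below; the frame rate value and the min–max scaling used to write the speeds into an 8-bit image are assumptions of this example, since the method itself imposes no constraint on the speed values (see the following paragraph).

```python
import numpy as np

def build_jmsm(body_skeleton, fps=30.0):
    """Encode per-joint speed magnitudes as a color map (JMSM).

    body_skeleton: (T, N, 3) hip-centered joint coordinates.
    fps: sensor frame rate (30 is assumed here for Kinect-like cameras).
    Returns an (N, T-1, 3) uint8 image whose R/G/B channels hold the absolute
    per-frame velocities along x, y, and z, respectively.
    """
    dt = 1.0 / fps                                         # time step between frames
    vel = np.abs(np.diff(body_skeleton, axis=0)) / dt      # (T-1, N, 3) speed magnitudes
    vel = np.transpose(vel, (1, 0, 2))                     # joints as rows, frames as columns
    lo, hi = vel.min(), vel.max()
    return (255.0 * (vel - lo) / (hi - lo + 1e-8)).astype(np.uint8)
```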
The proposed method does not impose constraints on the speed value, and takes into account the sudden change in joint position caused by sensor error within the collected data. This is due to the fact that each joint has a different motion range, and imposing constraints on the speed would result in loss of dynamic features for joints with large ranges of motion and fast speeds. Furthermore, analysis of position mutation reveals that joint velocity in the 3D direction is non-zero, and the motion of a joint in an actual scene is interconnected with adjacent joints. Consequently, joints exhibiting position mutation appear as single-color blocks in JMSM. Leveraging this observation, morphological operators can be utilized to enhance texture information in the motion feature map and eliminate noise associated with position mutation joints to improve speed estimation performance. Further details regarding implementation will be provided in the next section.
2.4. Image Enhancement
The involvement of each joint varies throughout the entire action sequence. From a visual perspective, the joints with more prominent movement are more likely to capture attention. Based on this observation, the spatial information of joints in SSTM with distinct motion characteristics can be enhanced by leveraging motion energy.
The instantaneous energy possessed by joint $j$ with coordinates $\mathbf{p}'_{j,t}$ in the $t$-th frame of an action sequence can be expressed as follows:
$$e_{j,t} = \left\| \mathbf{p}'_{j,t+1} - \mathbf{p}'_{j,t} \right\|_2,$$
where $\|\cdot\|_2$ denotes the Euclidean distance. Consequently, the motion energy of joint $j$ within the whole action sequence can be expressed as follows:
$$E_j = \sum_{t=1}^{T-1} e_{j,t}.$$
With the motion energy $E_j$, the color weight $w_j$ of the $j$-th joint can be obtained by the following equation:
$$w_j = \frac{E_j - E_{\min}}{E_{\max} - E_{\min}},$$
where $E_{\max}$ and $E_{\min}$ are the maximum and minimum values of the motion energy over all joints within the action sequence, respectively. Correspondingly, the enhancement weights relevant to the action can be arranged as a vector $\mathbf{w} = [w_1, w_2, \dots]$ according to the above encoding order. Therefore, the enhanced SSTM image can be depicted as follows:
$$\mathrm{SSTM}_{\mathrm{e}} = \mathbf{w} \odot \mathrm{SSTM},$$
where each row of the SSTM is scaled by the weight of the joint it encodes.
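A possible realization of the motion-energy-based enhancement is sketched below; it follows the equations above, with the mapping from SSTM rows to joints (`joint_row_index`) supplied by the caller as an assumption of this example.

```python
import numpy as np

def motion_energy_weights(body_skeleton):
    """Per-joint color weights from accumulated motion energy: the Euclidean
    distance travelled between consecutive frames is summed over the sequence
    and min-max normalized to [0, 1]."""
    step = np.linalg.norm(np.diff(body_skeleton, axis=0), axis=-1)  # (T-1, N)
    energy = step.sum(axis=0)                                       # (N,) motion energy E_j
    e_min, e_max = energy.min(), energy.max()
    return (energy - e_min) / (e_max - e_min + 1e-8)                # color weights w_j

def enhance_sstm(sstm, weights, joint_row_index):
    """Scale each SSTM row by the weight of the joint whose coordinates it encodes.

    joint_row_index: for every row of the SSTM, the index of the associated joint.
    """
    row_w = weights[np.asarray(joint_row_index)]                    # (R,)
    scaled = sstm.astype(np.float64) * row_w[:, None, None]
    return np.clip(scaled, 0, 255).astype(np.uint8)
```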
It can be seen from Figure 6a that the colors corresponding to the joint information with high motion energy are enhanced, while the other joint colors are defocused. Consequently, the developed adaptive enhancement method can effectively enhance the motion saliency of the SSTM and, subsequently, improve motion classification performance.
Depth sensors such as the Xtion PRO introduce noise during the acquisition of joint coordinates, leading to significant estimation errors in the joint information, which hampers recognition capability. To address this, the texture information of the motion feature map can be enhanced by exploiting morphological operators to improve speed estimation performance. An erosion (corrosion) operation, which is commonly used to eliminate small and meaningless objects, is first performed on the JMSM to eliminate the noise, i.e.,
$$E = B \ominus S,$$
where $B$ is the binary image, $E$ is the result obtained via the erosion operation $\ominus$, and $S$ is the structuring element, which yields
$$E = \{ z \mid S_z \subseteq B \},$$
where $S_z$ denotes the structuring element translated to position $z$. Since erosion can change the size of the original image and distort the original texture, the eroded image is then dilated to restore and smooth the original texture, thereby effectively reducing intra-class velocity differences. Consequently, incorporating a dilation (expansion) operation yields
$$D = E \oplus S,$$
in which $\oplus$ stands for the dilation operator.
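Using OpenCV, the erosion-then-dilation enhancement (i.e., a morphological opening) can be sketched as follows; the 3 × 3 structuring element is an illustrative choice, not a value reported in the text.

```python
import cv2
import numpy as np

def enhance_jmsm(jmsm, kernel_size=3):
    """Erode the JMSM to remove small, isolated color blocks caused by joint
    position mutations, then dilate to restore and smooth the remaining texture
    (erosion followed by dilation is a morphological opening)."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)   # structuring element S
    eroded = cv2.erode(jmsm, kernel, iterations=1)           # corrosion operation
    return cv2.dilate(eroded, kernel, iterations=1)          # expansion operation
```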
It is shown in Figure 6b that the texture of the enhanced image (second row) is smoother than that of the original image (first row), and the irrelevant information is effectively eliminated while the original texture is largely preserved, thereby reducing intra-class action differences.
2.5. TS-CNN Design for Action Recognition
Thanks to its unique advantages in feature extraction and representation, the CNN is widely used in image recognition, speech processing, and other fields [17,18]. However, traditional CNN-based action recognition methods utilize only a single type of skeleton data and pay less attention to other types of state information, which limits their recognition performance. In contrast to conventional deep networks, a TS-CNN can accept multiple types of information as inputs, extract the corresponding deep features separately, and then fuse the extracted features for classification. In light of this, the TS-CNN is exploited here to deeply fuse the extracted static and dynamic skeleton features to improve action recognition performance.
In the proposed method, the TS-CNN extracts and fuses the deep features of the SSTM and the JMSM to make full use of the space–time and dynamic information of skeleton sequences. The TS-CNN model consists of two improved AlexNet networks [19]. The SSTM and JMSM are used as the inputs of the static and dynamic streams, respectively. Following processing through convolutional layers, pooling layers, and fully connected layers, the posterior probabilities generated by each stream are integrated to yield the final identification result.
The two channels in the constructed TS-CNN have identical structures (shown in Figure 7), ensuring equal weight dimensions and independent updates for each channel. An 8-layer CNN model is chosen, comprising five convolutional layers and three fully connected layers. The first convolutional layer uses a stride of 4, the second layer a stride of 2, and the next three layers a stride of 1. Each convolutional layer uses the ReLU activation function to improve the nonlinear mapping and accelerate convergence. The first, second, and fifth convolutional layers are each followed by a maximum pooling layer of size 2 to reduce redundant information and network complexity. Following the fifth convolutional layer, three fully connected layers are employed to comprehensively integrate the extracted deep features. The first and second fully connected layers contain 4096 neurons each, with dropout set to 0.5. To prevent the slowdown in network convergence caused by changes in data distribution, each convolutional layer is followed by a batch normalization (BN) layer.
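A PyTorch sketch of one possible TS-CNN realization is given below. The strides, pooling, dropout, BN placement, and 4096-unit fully connected layers follow the description above, whereas the kernel sizes and channel widths are borrowed from the standard AlexNet as assumptions, since the exact values are not recoverable here.

```python
import torch
import torch.nn as nn

class StreamCNN(nn.Module):
    """One stream of the TS-CNN: five convolutional layers (each followed by
    batch normalization and ReLU) and three fully connected layers."""

    def __init__(self, num_classes: int):
        super().__init__()

        def conv(cin, cout, k, stride, pad):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=k, stride=stride, padding=pad),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )

        self.features = nn.Sequential(
            conv(3, 96, 11, 4, 2), nn.MaxPool2d(2),    # conv1 (stride 4) + pooling
            conv(96, 256, 5, 2, 2), nn.MaxPool2d(2),   # conv2 (stride 2) + pooling
            conv(256, 384, 3, 1, 1),                   # conv3 (stride 1)
            conv(384, 384, 3, 1, 1),                   # conv4 (stride 1)
            conv(384, 256, 3, 1, 1), nn.MaxPool2d(2),  # conv5 (stride 1) + pooling
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(inplace=True), nn.Dropout(0.5),    # fc1
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),  # fc2
            nn.Linear(4096, num_classes),                                   # fc3
        )

    def forward(self, x):
        return self.classifier(self.features(x))


class TSCNN(nn.Module):
    """Two structurally identical, independently updated streams for SSTM and JMSM."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.sstm_stream = StreamCNN(num_classes)
        self.jmsm_stream = StreamCNN(num_classes)

    def forward(self, sstm, jmsm):
        p_sstm = torch.softmax(self.sstm_stream(sstm), dim=1)
        p_jmsm = torch.softmax(self.jmsm_stream(jmsm), dim=1)
        return p_sstm * p_jmsm   # element-wise (multiplicative) fusion
```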
Given a skeletal sequence, the SSTM and JMSM are obtained via the above procedure and then scaled by bilinear interpolation to the fixed input size required for subsequent deep feature extraction. The deep features extracted by each stream CNN are passed to the fully connected layers, whose output is normalized by the softmax function to obtain the following posterior probability:
$$p(k \mid I) = \frac{\exp(o_k)}{\sum_{i=1}^{K} \exp(o_i)},$$
in which $p(k \mid I)$ is the probability of $I$ belonging to the $k$-th action class, $o_k$ is the output of the last fully connected layer corresponding to the $k$-th action class, $I$ stands for the SSTM or JMSM, and $K$ is the number of action classes.
The two stream outputs of the proposed model, $\mathbf{p}_{\mathrm{SSTM}}$ and $\mathbf{p}_{\mathrm{JMSM}}$, are fused by multiplicative fusion to acquire the following final result:
$$\mathbf{p} = \mathbf{p}_{\mathrm{SSTM}} \odot \mathbf{p}_{\mathrm{JMSM}},$$
where $\mathbf{p}_{\mathrm{SSTM}} = [p(1 \mid \mathrm{SSTM}), \dots, p(K \mid \mathrm{SSTM})]$, $\mathbf{p}_{\mathrm{JMSM}} = [p(1 \mid \mathrm{JMSM}), \dots, p(K \mid \mathrm{JMSM})]$, and $\odot$ is the element-wise (dot) product operator.
The model parameters can be updated via the softmax classifier based on the following cross-entropy loss function:
$$L(\theta) = -\sum_{k=1}^{K} y_k \log p_k,$$
where $\theta$ denotes the model parameters, $p_k$ is the fused posterior probability of the $k$-th class, and $y_k$ is the corresponding true label value.
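The fusion and loss can be sketched as follows; whether the cross-entropy loss is applied per stream or to the fused posterior is not specified in the text, so the fused variant shown here is only one reasonable reading, and the renormalization step is an implementation detail of this example.

```python
import torch
import torch.nn.functional as F

def fused_prediction_and_loss(model, sstm, jmsm, labels):
    """Fuse the two streams' posteriors multiplicatively and compute a
    cross-entropy loss on the (renormalized) fused distribution."""
    probs = model(sstm, jmsm)                             # fused posterior, shape (B, K)
    probs = probs / probs.sum(dim=1, keepdim=True)        # renormalize after the product
    loss = F.nll_loss(torch.log(probs + 1e-12), labels)   # cross-entropy on fused posterior
    return probs.argmax(dim=1), loss
```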
3. Experiments and Discussion
The effectiveness of the proposed approach was validated by comparing it with action recognition algorithms based on manually extracted features, RNN models, and CNN models, in terms of viewpoint changes, subject diversity, and the diversity of similar actions, using the following three publicly available action recognition datasets: NTU RGB-D, Northwestern-UCLA, and UTD-MHAD.
3.1. Experimental Environment Configuration
For the experimental hardware and software environment, the model is built on an Intel Core(TM) i7-7700 processor at 3.60 GHz with 32 GB of memory and an NVIDIA GeForce GTX 1070 GPU, using the PyTorch framework with Python 3.7 and CUDA 10.0. Stochastic gradient descent (SGD) is employed to update the network weights; the learning rate is set to 0.001, the momentum to 0.5, and the weight decay to 0.00005. The model is trained for 200 epochs, during which 10% of the training set is randomly selected to tune the training parameters. Additionally, data augmentation techniques such as random vertical flipping, panning, and scaling are employed to increase the number of training samples and enhance the model's generalization capability.
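A training-loop sketch with the reported hyperparameters is shown below; `TSCNN` and `fused_prediction_and_loss` refer to the earlier sketches, and `train_loader` is a placeholder for batches of (SSTM, JMSM, label) triples.

```python
import torch

model = TSCNN(num_classes=60)   # 60 classes for NTU RGB-D
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.5, weight_decay=0.00005)

for epoch in range(200):                        # training period of 200 epochs
    for sstm, jmsm, labels in train_loader:     # placeholder dataloader
        _, loss = fused_prediction_and_loss(model, sstm, jmsm, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```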
3.2. NTU RGB-D Dataset
The NTU RGB-D dataset comprises action sequences captured simultaneously by three Kinect V2 cameras located at three different viewpoints at NTU [9]. The dataset consists of 60 types of actions (including some similar actions, such as reading and writing) performed by 40 subjects, generating a total of 56,880 skeletal sequences with a wide variety of subjects and many similar, noisy actions (shown in Figure 8, where the third row illustrates the diversity of similar actions). This dataset currently stands as the largest and most challenging action recognition dataset available.
Following the evaluation protocol in [9], cross-subject and cross-view experiments are conducted. Specifically, the 40 subjects are divided into training and test sets, where the training subjects are numbered 1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25, 27, 28, 31, 34, 35, and 38. The remaining subjects form the test and validation sets, which possess equal shares of the data. The training set and the combined test and validation sets contain 40,320 and 16,560 samples, respectively. The cross-view experiment uses the samples captured by the first camera for testing, while the samples from the other cameras serve as training data. The training set and the combined test and validation sets contain 37,920 and 18,960 samples, respectively.
Figure 9 shows the overall recognition rate of the cross-view experiment conducted by the proposed algorithm on the NTU RGB-D dataset. In Figure 9, each row is the actual category of the action, each column is the recognition result of the corresponding action achieved by the proposed algorithm, and the main diagonal elements indicate the accuracy of action recognition. It can be seen from the confusion matrix in Figure 9 that most actions exhibit high recognition rates due to the construction of skeletal spatio-temporal descriptors using relevant inter-joint information and joint dynamic information to jointly characterize skeletal motion features. Furthermore, leveraging motion enhancement and visual enhancement through the developed approach leads to significant improvements in the SSTM and JMSM, resulting in a 9.75% increase in recognition rate for certain actions (e.g., sitting down, taking off a jacket) compared to the overall recognition rate. It is worth noting that confusion arises only among actions with subtle differences, such as reading and writing. These findings demonstrate that the proposed method performs well in complex scenes characterized by changing viewpoints, rich noise, and subtle action variations.
The recognition rates of the existing state-of-the-art methods are compared in Table 1.
Due to the large number of training samples in this dataset, methods based on RNN [9] and LSTM [23] exhibit relatively high recognition rates. Moreover, the proposed method achieved higher accuracy in the cross-view experiment than Deep-RNN [9] and ST-LSTM [23], with improvements of 22.78% and 13.87%, respectively, owing to the CNN's effective learning of skeletal spatial information. Additionally, by incorporating human structure constraints along with joint dynamic information and motion saliency features, the proposed method outperformed TSRJI [16] by 5.77% in recognition rate in the cross-subject experiment. It is worth noting that the recognition rate of the proposed method is 0.97% lower than that of MM-Net [27], which can be attributed to the fact that MM-Net extracts multi-dimensional features, such as joint distance (JD) and JD velocity (JDV), joint angle (JA) and JA velocity, along with fast-action joint position (FJP) and slow-action joint position (SJP), while the proposed method only acquires space–time and dynamic features, namely the SSTM and JMSM.
To assess the robustness of the developed approach with regard to input noise, an investigation was conducted on this dataset by adding zero-mean Gaussian noise with standard deviation σ to the skeleton sequences. The results are presented in Table 2. It is evident from Table 2 that the proposed method still maintained high accuracy even with the addition of significant noise (the standard deviation σ was set to 12 cm, which is substantial relative to the scale of the human body), while MM-Net exhibited significant degradation as the noise level increased. This can be attributed to the SSTM and JMSM being enhanced via motion and visual enhancement, respectively, which reduces the impact of noise on the accuracy of our method, whereas MM-Net ignores the influence of noise. This demonstrates that our method exhibits considerable robustness against input noise compared to the comparative approach.
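The noise injection used in this robustness test can be reproduced with a few lines; the assumption that the skeleton coordinates are stored in meters (so that 12 cm corresponds to 0.12) is ours.

```python
import numpy as np

def add_skeleton_noise(skeleton, sigma_cm=12.0, seed=None):
    """Add zero-mean Gaussian noise with standard deviation sigma_cm (in cm)
    to every joint coordinate of a (T, N, 3) skeleton stored in meters."""
    rng = np.random.default_rng(seed)
    return skeleton + rng.normal(0.0, sigma_cm / 100.0, size=skeleton.shape)
```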
3.3. Northwestern-UCLA Dataset
The Northwestern-UCLA dataset [28] comprises 1494 sequences covering 10 action categories captured from diverse viewpoints: picking up with one hand, picking up with both hands, throwing trash, walking, sitting, standing up, putting on, taking off, throwing, and taking. Following the same protocol as in [28], samples obtained from the first two cameras were used for training, while the remaining samples served as test and validation data, each receiving 50% of the remaining data.
As shown in Table 3, the recognition rates achieved by deep learning approaches are significantly higher than those obtained through manual feature extraction methods. Secondly, under the assumption that the skeleton is perpendicular to the ground, HOJ3D [29], which uses skeletal information, neglects inter-skeleton relationships and consequently exhibits a lower accuracy rate. Moreover, LARP [6], based on associated skeletons with variable parameters, outperforms HOJ3D but fails to consider the temporal information of skeleton sequences. Furthermore, HBRNN-L [8] models inter-skeleton dynamic information to achieve a recognition rate of 78.52%, yet its performance improvement is limited by the relatively small number of training samples. In contrast, our proposed method encodes the spatio-temporal information of skeletons and enhances motion saliency through motion enhancement techniques, resulting in an overall recognition rate improvement of 13.17% over LARP. Moreover, by fusing spatial–temporal and motion deep information based on the TS-CNN, our proposed method raises the recognition rate by 8.82% compared to HBRNN-L. Consequently, the developed algorithm demonstrates high action recognition rates for the aforementioned actions, even under viewpoint changes.
3.4. UTD MHAD Dataset
UTD-MHAD [30] is a multimodal dataset comprising data collected from a Kinect camera and wearable inertial sensors. It encompasses 27 classes of actions, with a total of 864 sequences performed by eight subjects (equally distributed between genders). In this study, the cross-subject protocol described in [30] was employed: the samples of odd-numbered subjects were used for training, and the remaining samples were used for testing and validation, with each receiving half of the remaining data.
As shown in Table 4, the recognition rate obtained by the proposed method is significantly higher than that of the compared algorithms. This dataset contains many similar actions, such as drawing circles clockwise and counterclockwise, for which the temporal information of the joints and the relations among them play a crucial role in achieving higher accuracy. This information is integrated in the proposed approach through the SSTM and enhanced by exploiting motion energy, thereby enabling higher accuracy rates to be attained.
3.5. Ablation Experiments
The effectiveness of each module is validated in this subsection on the NTU RGB-D and UTD-MHAD datasets from the following three viewpoints: coordinate system conversion, feature descriptor construction, and image enhancement.
3.5.1. Effectiveness of Coordinate System Conversion
The coordinate system conversion was conducted on the premise that the correlation between the original joint pairs remains unchanged. Taking the eighth action class, 'sit down', in the NTU RGB-D dataset as an example, the correlation among the first-level-related joint pairs was calculated. It is evident from Figure 10 that the proposed method maintains the original correlation among joint pairs while enhancing the correlation among some joint pairs. In contrast, the joint pair correlation obtained in [9] is notably lower. The calculation of the correlation coefficient between the joint pair coordinates shows that the average correlation of the joint pairs is 0.89 after transforming the coordinate system using the proposed approach, which is 0.18 higher than that achieved in [9].
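The joint-pair correlation reported here can be computed, for example, as the average Pearson correlation of the coordinate traces of first-level joint pairs; the per-axis averaging in the sketch below is an assumption, since the exact correlation measure is not detailed in the text.

```python
import numpy as np

def mean_pair_correlation(skeleton, pairs):
    """Average Pearson correlation between the per-axis coordinate traces of
    the given joint pairs over a sequence of shape (T, N, 3)."""
    corrs = []
    for m, n in pairs:
        for axis in range(3):
            a, b = skeleton[:, m, axis], skeleton[:, n, axis]
            if a.std() > 0 and b.std() > 0:          # skip constant traces
                corrs.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(corrs))
```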
3.5.2. Verification of the Complementarity between SSTM and JMSM
The two feature descriptors, SSTM and JMSM, are designed to capture spatial–temporal and motion information, respectively. After training the TS-CNN, the corresponding classification results are presented in Table 5. It can be observed from Table 5 that the fusion of these two feature descriptors leads to an average increase in recognition rate of 8.975% on the NTU RGB-D dataset, indicating a high level of complementarity between them.
3.5.3. Effectiveness of Image Enhancement
The results obtained from comparing the skeletal sequences with and without motion enhancement and visual enhancement on the NTU RGB-D and UTD-MHAD datasets are presented in Table 6. As depicted in Table 6, the recognition rates of the SSTM and JMSM on the UTD-MHAD dataset improved by 3.52% and 8.37%, respectively, after motion and visual enhancements were applied. This suggests that the proposed enhancement approach is effective in reducing intra-class variation, thereby significantly boosting the overall recognition rate.
4. Conclusions
The present study proposed an action recognition method that utilizes 3D skeletal spatio-temporal and dynamic information. Initially, a skeletal spatio-temporal map was constructed using the proposed approach, incorporating inter-joint correlation information while considering the structural constraints of the human body to enhance differentiation among various actions. Subsequently, a joint motion velocity map was established by exploiting the dynamic characteristics of joints. Moreover, motion enhancement was applied to the skeletal spatio-temporal map based on joint motion energy, and morphological operators were employed on the joint motion velocity map for visual enhancement. Finally, the TS-CNN was utilized to extract deep features from both the skeletal spatio-temporal map and the joint motion velocity map, and the classification results obtained from each stream were fused multiplicatively to obtain the final recognition result. The experimental evaluations conducted on three public action recognition datasets (NTU RGB-D, Northwestern-UCLA, and UTD-MHAD) demonstrated that, compared to state-of-the-art action recognition methods, the proposed approach achieves higher recognition rates in complex scenarios involving viewpoint changes, rich noise, diverse subjects, and similar actions.
The developed TS-CNN model based on skeletal spatial–temporal and dynamic features has several shortcomings that need to be addressed. Firstly, analysis of the NTU RGB-D dataset showed that the model easily confuses some actions when only 3D skeleton coordinates are utilized. Considering this issue, we intend to incorporate RGB image features as supplementary information to enhance the network's recognition capability. Moreover, examination of the input feature matrices showed that certain actions exhibit minimal variation between frames, so most of the elements in the matrices remain unchanged; in future work, we will incorporate an attention mechanism to enable the network to focus on more valuable information. Furthermore, the constructed model considers only a few features, which limits its action recognition performance; future research will explore more global and local features to enhance recognition performance. Finally, the built model has more parameters than a single-stream 2D CNN and therefore exhibits high computational complexity. In the future, we will focus on optimizing the model to effectively reduce its parameters and conduct further analyses to determine the most suitable network structure for different features and body parts.