4.1. Experiment Framework
The experiments were carried out on a PC with an Intel Core i5-8250U 1.8-GHz CPU, 24 GB of RAM, a 1 TB hard disk, and the Windows 11 operating system. Python 3.8 was used to process the signals and construct the models.
The framework of the experiment is shown in Figure 4. First, the experimental platform is introduced, and 1200 groups of the PMDCM's current signal are collected with LabVIEW 2018 software. Time-domain and time-frequency-domain features are then extracted from the successive multi-segment current signals, and the extracted features are normalized with Min-Max normalization. A GNB classifier is used to optimize the number of iterations and the proportion of test samples for the normalized features. After the number of iterations and the proportion of test samples are determined, the proposed GVFS method and five other feature selection methods (described in Section 4.6) are used to rank the extracted features, and k-NN classifiers with different k values are used to evaluate different numbers of features, so as to obtain the number of features with the optimal fault diagnosis accuracy and the number of features with the sub-optimal fault diagnosis accuracy. The two hyperparameters of the SVM classifier (the penalty parameter C and the kernel parameter γ) are then optimized on the 70% of the samples with the selected features, and 10-fold cross-validation is utilized to stabilize the performance. Finally, the SVM fault diagnosis model is trained with the optimized hyperparameters and the 70% training samples, and the remaining 30% of the samples are used to test the fault diagnosis accuracy for the PMDCMs.
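As a rough illustration of this pipeline, the sketch below strings together Min-Max normalization, the 70%/30% split, and a 10-fold cross-validated grid search over the two SVM hyperparameters using scikit-learn. The file names, the hyperparameter grid, and the omitted feature selection step are placeholders rather than the authors' actual settings.

```python
# Minimal sketch of the experimental pipeline described above (assumptions:
# the 1200 feature vectors are stored in the hypothetical files below, and the
# hyperparameter grid is illustrative, not the grid used in the paper).
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X = np.load("pmdcm_features.npy")   # hypothetical file of extracted features
y = np.load("pmdcm_labels.npy")     # hypothetical fault labels

# Min-Max normalization of the extracted features
X = MinMaxScaler().fit_transform(X)

# 70% of the samples for hyperparameter optimization/training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Grid search over the two SVM hyperparameters (C and gamma) with 10-fold CV
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X_train, y_train)

# Test the refit model on the remaining 30% of the samples
print("test accuracy:", search.score(X_test, y_test))
```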
4.6. Feature Selection Methods
In this work, the proposed GVFS method is compared with five other feature selection methods, namely Fisher score feature selection (FSFS) [38], mutual information feature selection (MIFS) [10,15,39], variance threshold feature selection (VTFS) [40], F-test feature selection (F-testFS) [41], and chi-square feature selection (chi2FS) [42], by utilizing the PMDCM's data.
- (1) FSFS
Fisher score feature selection (FSFS) [38] is one of the effective filter methods for feature selection. The main idea of FSFS is to find the features for which the distances between different categories are as large as possible, while the distances within the same category are as small as possible. Specifically, let μ_j,k and σ_j,k be the mean and standard deviation of the k-th category for the j-th feature, and let μ_j and σ_j denote the mean and standard deviation of the j-th feature over all samples. c represents the number of categories, and n_k represents the number of samples with the label k. Then the Fisher score of the j-th feature is computed according to Equation (7).
After the Fisher score is calculated for each feature, the features are ranked in descending order according to their Fisher scores.
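A minimal sketch of this score is given below, assuming the standard Fisher score definition (weighted between-category scatter of a feature divided by its weighted within-category variance); it illustrates Equation (7) rather than reproducing the authors' implementation.

```python
# Sketch of the Fisher score per feature, assuming the standard definition.
import numpy as np

def fisher_scores(X, y):
    """Return one Fisher score per column (feature) of X."""
    classes = np.unique(y)
    mu = X.mean(axis=0)                         # overall mean of each feature
    numerator = np.zeros(X.shape[1])
    denominator = np.zeros(X.shape[1])
    for k in classes:
        Xk = X[y == k]
        n_k = Xk.shape[0]                       # number of samples with label k
        numerator += n_k * (Xk.mean(axis=0) - mu) ** 2
        denominator += n_k * Xk.var(axis=0)     # class-conditional variance
    return numerator / denominator

# Features ranked in descending order of Fisher score:
# ranking = np.argsort(fisher_scores(X, y))[::-1]
```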
- (2) MIFS
The MI [39] indicates the degree of correlation between two variables, i.e., how much one variable changes when the other variable is changed. According to the type of variable, there are three types of MI calculation in total: between discrete and discrete data, between continuous and continuous data, and between discrete and continuous data. The MI between discrete and discrete data is calculated as shown in Equation (8), and the MI between continuous and continuous data is calculated as shown in Equation (9).
In Equation (8), p(x, y) is the joint probability distribution function of the two variables X and Y, and p(x) and p(y) are their marginal probability distribution functions, respectively. In Equation (9), p(x, y) is the joint probability density function of X and Y, and p(x) and p(y) are their marginal probability density functions, respectively.
A novel calculation method for the MI between discrete and continuous data was proposed by Ross [39] in 2014 and is widely used. For a point i, one first finds the k-th closest neighbor among the N_xi data points whose value of the discrete variable equals x_i, using some distance metric, and defines d_i as the distance to this k-th neighbor. Then the number of neighbors m_i in the full dataset that lie within distance d_i of point i (including the k-th neighbor itself) is counted. Based on N_xi and m_i, the MI is calculated by Equation (10), where ψ(·) is the digamma function and ⟨·⟩ denotes the average value. In Equation (10), larger k values lead to lower sampling error but higher coarse-graining error; we set k to 3. In this paper, each feature is taken as one variable and the label as the other variable to calculate the correlation between each feature and the label. The features are sorted in descending order according to the MI, the top-ranked features are selected to form the new feature subset, and this feature selection method is called MIFS.
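For reference, a minimal sketch of this ranking using scikit-learn's mutual_info_classif is shown below; that function uses a nearest-neighbor MI estimator of the kind described above, and n_neighbors=3 matches the k = 3 setting, although the exact implementation may differ from the one used in this work.

```python
# Sketch of MIFS: rank features by their estimated MI with the discrete label.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mifs_ranking(X, y, n_neighbors=3):
    mi = mutual_info_classif(X, y, n_neighbors=n_neighbors, random_state=0)
    return np.argsort(mi)[::-1]   # feature indices, highest MI first
```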
- (3) VTFS
For VTFS [40], the values of the same feature over all samples are divided into a group, and the variance of this group is calculated according to Equation (11) and recorded as s². In Equation (11), n represents the number of feature values x_i in the group and x̄ represents their average value; a lower variance indicates a more consistent feature across samples. By default, a feature with zero variance is eliminated, as there is no difference in this feature between samples and it is therefore an ineffective feature. Calculating the variance of the same feature over all samples is one of the most widely utilized approaches in feature selection. The features are sorted in descending order of variance. Generally, a threshold is defined to eliminate the features whose variance is below it. In this paper, to allow comparison with the other feature selection methods, no threshold is defined and the features are simply sorted by variance. Finally, the top-ranked features are selected to form the new feature subset, and this feature selection method is called VTFS.
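A minimal sketch of this ranking, with no variance threshold applied, might look as follows (an illustration, not the authors' code):

```python
# Sketch of VTFS: rank features by the variance of Equation (11), no threshold.
import numpy as np

def vtfs_ranking(X):
    variances = X.var(axis=0)           # variance of each feature over all samples
    return np.argsort(variances)[::-1]  # feature indices, highest variance first
```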
- (4) F-testFS
The F-test feature selection (F-testFS) [41] method ranks the features in descending order according to the f-score (Equation (12)), which, for a given feature, is the ratio of the variance between categories to the variance within a category, both computed according to Equation (11). A greater f-score means that the distance within a category is smaller and the distance between categories is larger. The f-score is given by Equation (12), in which the numerator is the variance between categories and the denominator is the variance within a category.
It is assumed that there are six samples, each of which contains one feature, and that the six samples have three possible distribution states, as shown in Figure 7, where C1, C2, and C3 represent the three categories. For each distribution state in Figure 7, the variance of each category is calculated according to Equation (11) and then averaged to give the within-category variance, and the variance of all six samples in that state is also calculated according to Equation (11); comparing these two quantities across the three states illustrates the behavior of the f-score.
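A minimal sketch of this ranking using scikit-learn's f_classif, which computes a one-way ANOVA f-score per feature, is shown below; it is an illustration rather than the authors' implementation.

```python
# Sketch of F-testFS: rank features by the ANOVA f-score against the labels.
import numpy as np
from sklearn.feature_selection import f_classif

def ftest_ranking(X, y):
    f_scores, _ = f_classif(X, y)      # second return value is the p-value
    return np.argsort(f_scores)[::-1]  # feature indices, highest f-score first
```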
- (5) Chi2FS
The chi-square feature selection (chi2FS) [42] method ranks the features in descending order according to the chi-square value (Equation (13)). The basic idea of the chi-square test lies in the degree of deviation between the actual observed value and the theoretical inferred value of the statistical samples: this degree of deviation determines the magnitude of the chi-square value. The larger the chi-square value, the greater the degree of deviation between them; if the two values are completely equal, the chi-square value is zero, indicating that the theoretical value is completely consistent with the actual value. In Equation (13), O represents the actual observed frequency of the samples and E represents the theoretical inferred frequency of the samples. In this work, the scikit-learn library in Python is utilized to calculate the chi-square value between the discrete labels and the continuous features.
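A minimal sketch of this ranking with scikit-learn's chi2 scorer is shown below; chi2 requires non-negative feature values, which is satisfied here by the Min-Max normalization, and the sketch is an illustration rather than the authors' code.

```python
# Sketch of chi2FS: rank features by their chi-square statistic with the labels.
# Note: the feature matrix must be non-negative (e.g., after Min-Max scaling).
import numpy as np
from sklearn.feature_selection import chi2

def chi2_ranking(X_nonneg, y):
    chi2_values, _ = chi2(X_nonneg, y)     # second return value is the p-value
    return np.argsort(chi2_values)[::-1]   # feature indices, highest chi-square first
```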