Semi-Supervised Bearing Fault Diagnosis and Classification Using Variational Autoencoder-Based Deep Generative Models
IEEE SENSORS JOURNAL, VOL. 21, NO. 5, MARCH 1, 2021
Authorized licensed use limited to: Bibliothèque ÉTS. Downloaded on August 23,2023 at 22:46:08 UTC from IEEE Xplore. Restrictions apply.
ZHANG et al.: SEMI-SUPERVISED BEARING FAULT DIAGNOSIS AND CLASSIFICATION 6477
expensive [13]–[18], and it often requires human knowledge/expertise on the system states [12]. Therefore, the bearing dataset, especially the faulty data, is usually not labeled in real industrial applications [14], [19]. Even if attempts are made to label these unlabeled samples, the accuracy of these labels cannot be guaranteed, since they are also subject to the confirmation biases of the engineers interpreting the data [17]. Therefore, both label scarcity and label accuracy issues pose challenges to the mainstream supervised learning approaches for bearing fault diagnosis.

A promising approach to overcome these challenges is to apply semi-supervised learning algorithms that can leverage the limited labeled data and the massive unlabeled data simultaneously [12]–[19]. Specifically, semi-supervised learning considers the classification problem when only a small part of the data has labels, and so far only a few semi-supervised learning paradigms have been applied to bearing fault diagnosis. For instance, the support vector data description method in [19] uses cyclic spectral coherent domain indicators to construct a feature space and fit a hypersphere, which then calculates the Euclidean distance in order to distinguish the faulty data from the healthy ones. In addition, both [15] and [16] use graph-based methods to construct graphs connecting similar samples in the dataset, so class labels can be propagated from labeled nodes to unlabeled nodes through the graph. However, these methods are very sensitive to their graph structure and need to analyze the graph's Laplacian matrix, which limits their scope. Instead of a graph-based method, [12] uses an α-shape to capture the data structure; the α-shape is mainly used to perform surface estimation and to reduce the effort required for parameter tuning.

Moreover, the semi-supervised deep ladder network is applied in [13] to identify the failure of the primary parallel shaft helical gear in an induction motor system. The ladder network is implemented by modeling hierarchical latent variables to integrate supervised and unsupervised learning strategies. However, the unsupervised components of the ladder network may not contribute to a semi-supervised task if the raw data do not show obvious clustering on the 2-D manifold, which is usually the case for vibration signals. Although GANs have also been used for semi-supervised learning in [14], [17], [18], it is reported in [21] that good generators and good semi-supervised classifiers cannot be obtained simultaneously. Additionally, the well-known difficulty of training GANs has further limited their application to semi-supervised learning tasks in practice [10].

The motivation of the proposed research is both broad and specific, as we strive to tackle the label scarcity and label accuracy issues in bearing fault diagnosis by leveraging both labeled and unlabeled data. Specifically, we adopt a deep generative model grounded in Bayesian theory and use scalable variational inference in a semi-supervised setting. Although some existing work using variational autoencoders (VAE) for bearing fault diagnosis can be found in [22]–[24], these studies only use the discriminative features in the latent space for dimension reduction, and then use these features to train other external classifiers. In this work, however, we take an integrated approach and train the VAE model itself as a classifier by also exploiting its generative capabilities.

We summarize the detailed technical contributions of this work as follows:
1) Semi-supervised deep generative model implementation: This paper applies two semi-supervised VAE-based deep generative models to leverage properties of both the labeled and unlabeled data for bearing fault diagnosis. To mitigate the "KL vanishing" problem in VAE models and further promote the accuracy and robustness of the semi-supervised classifier, this study also adapts the KL cost annealing techniques [25], [26] on top of the original models presented in [27].
2) Strong performance mitigating the label scarcity issue: This work utilizes the CWRU dataset to create test scenarios where only a small subset of data for each fault category has labels, which corresponds to the label scarcity issue discussed in [12]–[19] for real-world applications. The results show that the M2 model can greatly outperform the baseline unsupervised and supervised learning algorithms. Additionally, the VAE-based semi-supervised generative M2 model also compares favorably against four state-of-the-art semi-supervised learning methods.
3) Solid performance mitigating the label accuracy issue: This study also uses the IMS dataset with naturally evolved bearing defects to create test scenarios with the label accuracy issue discussed in [17]. The results demonstrate that incorrect labeling will inevitably reduce the classifier performance of supervised learning algorithms, while adopting semi-supervised deep generative models can be an effective way to mitigate the label accuracy issue. This conclusion is supported by the consistent dominance of the proposed model over CNN when a large amount of healthy data were mislabeled as faulty.

The rest of the paper is organized as follows. In Section II, we introduce the background knowledge of VAE. Next, in Section III, we present the architecture of two VAE-based deep generative models in the semi-supervised setting, with detailed discussions on leveraging a dataset that includes both labeled and unlabeled data. In Section IV, two comparative studies of the proposed models against other popular machine learning and deep learning algorithms are performed using both the University of Cincinnati's Center for Intelligent Maintenance Systems (IMS) dataset [28] and the Case Western Reserve University (CWRU) bearing dataset [29]. Section V concludes the paper by highlighting its technical contributions.

II. BACKGROUND OF VARIATIONAL AUTOENCODERS

The variational inference technique is often used in the training and prediction process, which is effective for solving the posterior of the distribution obtained from neural networks [20]. As demonstrated in Fig. 1, the VAE's architecture specifies a joint distribution pθ(x, z) = pθ(x|z)p(z) over observations x and latent variables z, which are usually sampled from a prior density p(z) subject to a multivariate
Gaussian distribution N(0, I). These latent variables are related to the observed variables x through the likelihood pθ(x|z), which can be regarded as a probabilistic decoder, or generator, that decodes z into a distribution over the observation x. A neural network parameterized by θ is typically used to model the decoder.

After specifying the decoding process, it is necessary to perform inference, i.e., to calculate the posterior pθ(z|x) of the latent variables z given the observations x. In addition, we also seek to optimize the model parameters θ with respect to pθ(x), which is obtained by marginalizing out the latent variables z in the likelihood function pθ(x, z). Since the Gaussian prior p(z) is not conjugate to the likelihood, the true posterior pθ(z|x) is analytically intractable. Therefore, variational inference is used to approximate the posterior with qφ(z|x), whose variational parameters φ are optimized to minimize the Kullback-Leibler (KL) divergence between the approximated posterior and the true posterior. This posterior approximation qφ(z|x) can also be viewed as an encoder with distribution N(z|μφ(x), diag(σφ²(x))), where μφ(x) and σφ(x) are also produced by neural networks.

By definition, the KL divergence measures the dissimilarity between two distributions, expressed as the expectation of the log of the first distribution minus the log of the second distribution. Thus, after applying Bayes' theorem, the KL divergence of the approximated posterior qφ(z|x) with respect to the true posterior pθ(z|x) is

\begin{align}
D_{KL}\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right)
&= \mathbb{E}_{z\sim q_\phi(z|x)}\!\left[\log q_\phi(z|x) - \log p_\theta(z|x)\right] \nonumber\\
&= \mathbb{E}_{z\sim q_\phi(z|x)}\!\left[\log q_\phi(z|x) - \log p(z) - \log p_\theta(x|z)\right] + \log p_\theta(x) \tag{1}
\end{align}

After moving log pθ(x) to the left-hand side of Eqn. (1), it can be written as the sum of a term known as the evidence lower bound (ELBO) and the KL divergence, which satisfies D_KL(qφ(z|x) ‖ pθ(z|x)) ≥ 0:

\begin{align}
\log p_\theta(x)
&= -\,\mathbb{E}_{z\sim q_\phi(z|x)}\!\left[\log q_\phi(z|x) - \log p(z) - \log p_\theta(x|z)\right] + D_{KL}\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right) \nonumber\\
&= \underbrace{\mathbb{E}_{z\sim q_\phi(z|x)}\!\left[\log p(z) + \log p_\theta(x|z) - \log q_\phi(z|x)\right]}_{\text{Evidence Lower Bound (ELBO)}} + \underbrace{D_{KL}\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right)}_{\geq\,0} \tag{2}
\end{align}

Specifically, based on Jensen's inequality, the optimal qφ(z|x) that maximizes the ELBO is pθ(z|x), which also makes the KL divergence term equal to zero. Therefore, maximizing Eqn. (2) with respect to θ and the variational parameters φ is analogous to minimizing the KL divergence, and this optimization can be performed using stochastic gradient descent.

III. SEMI-SUPERVISED DEEP GENERATIVE MODELS BASED ON VARIATIONAL AUTOENCODERS

This section presents two semi-supervised deep generative models based on VAE [27]. When only a small subset of the training data has labels, both models can exploit the VAE's generative power to enhance the classifier's performance. By learning a good variational approximation of the posterior, the VAE's encoder can embed the input data x as a set of low-dimensional latent features z. The approximated posterior qφ(z|x) is formed by a nonlinear transformation, which can be modeled as a deep neural network f(z; x, φ) with variational parameters φ. Similarly, the VAE's generator takes a set of latent variables z and reproduces the observations x using pθ(x|z), which can also be modeled as a deep neural network g(x; z, θ) parameterized by θ.

A. Latent-Feature Discriminative M1 Model

The M1 model [27] trains the VAE-based encoder and decoder in an unsupervised manner. The trained encoder will provide an embedding of input data x in the latent space, which is defined by the latent variables z. In most cases, the dimension of z is much smaller than that of x, and these low-dimensional features can often increase the accuracy of supervised learning models.

As shown in Fig. 2, after training the M1 model, the actual classification task will be carried out in an external classifier, such as a support vector machine (SVM), polynomial regression, etc. Specifically, the VAE encoder will only process the labeled data xl to determine their corresponding latent variables zl, which are then combined with their corresponding labels yl to train this external classifier. The M1 model is considered a semi-supervised method, since it leverages all available data to train the VAE-based encoder and decoder in an unsupervised
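The ELBO in Eqn. (2) becomes concrete once qφ(z|x) is the diagonal Gaussian encoder N(z|μφ(x), diag(σφ²(x))) and p(z) = N(0, I): the KL regularization term then has a well-known closed form [20], and sampling z can be made differentiable with the reparameterization trick. The sketch below illustrates these two ingredients in plain NumPy; the function names are ours, not taken from the paper's implementation.

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(mu, log_var):
    """Closed-form KL divergence D_KL( N(mu, diag(sigma^2)) || N(0, I) ).

    This is the KL term of the VAE objective when the approximate
    posterior q_phi(z|x) is a diagonal Gaussian and the prior p(z)
    is standard normal:
        0.5 * sum( mu^2 + sigma^2 - 1 - log sigma^2 )
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

def reparameterize(mu, log_var, rng):
    """Sample z ~ q_phi(z|x) via z = mu + sigma * eps with eps ~ N(0, I),
    which keeps the sampling step differentiable w.r.t. (mu, log_var)."""
    eps = rng.standard_normal(np.shape(mu))
    return np.asarray(mu) + np.exp(0.5 * np.asarray(log_var)) * eps
```

When q equals the prior (μ = 0, σ = 1), the KL term vanishes, which is exactly the "KL vanishing" failure mode discussed later for the M1 implementation.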
manner, and thereafter it also takes the labeled data (zl, yl) to train an external classifier in a supervised fashion. Compared with purely supervised learning methods that can only be trained using the small subset of data with labels, the M1 model usually yields more accurate classification. This is because the VAE structure is also able to learn from the vast majority of unlabeled data, which enables the extraction of more representative latent features to train its subsequent classifier.

B. Semi-Supervised Generative M2 Model

As briefly mentioned earlier, the major limitation of the M1 model is the disjoint nature of its training process, as it needs to train the VAE network first and thereafter the external classifier. Specifically, the initial training phase of the M1 model's VAE-based encoder and decoder is a purely unsupervised process that does not involve any of the scarce labels yl, and it is completely separated from the subsequent classifier training phase that actually takes yl. To address this issue, another semi-supervised deep generative model, referred to as the M2 model, is also proposed in [27]. The M2 model can handle two situations at the same time: one where the data have labels, and the other where these labels are not available. Therefore, there are also two ways to construct the approximated posterior q and its variational objective.

1) Variational Objective With Unlabeled Data: When labels are not available, two separate posteriors qφ(y|x) and qφ(z|x) will be approximated during the VAE training stage, where z denotes the latent variables as in the M1 model, while y is the unobserved label yu. The posterior approximation qφ(y|x) will be used to construct the classifier as our inference model [27]. Given the observations x, the two approximated posteriors of the corresponding class labels y and latent variables z can be defined as

\begin{align}
q_\phi(y|x) &= \mathrm{Cat}\!\left(y\,|\,\pi_\phi(x)\right) \nonumber\\
q_\phi(z|x) &= N\!\left(z\,|\,\mu_\phi(x),\, \mathrm{diag}(\sigma_\phi^2(x))\right) \tag{3}
\end{align}

where Cat(y|πφ(x)) is the categorical (multinomial) distribution, and πφ(x) can be modeled by a neural network parameterized by φ. Combining the above two posteriors, a joint posterior approximation can be defined as

\begin{equation}
q_\phi(y, z|x) = q_\phi(z|x)\,q_\phi(y|x) \tag{4}
\end{equation}

Therefore, the revised ELBO_U that determines the variational objective of the unlabeled data can be written as Eqn. (5), where L(x, y) is derived from the original ELBO in Eqn. (2).

2) Variational Objective With Labeled Data: Since the goal of semi-supervised learning is to train a classifier using a limited amount of labeled data and the vast majority of unlabeled data, it is beneficial to also include the scarce labels in the training process of this deep generative M2 model. Similarly, Eqn. (6) shows the revised ELBO_L that determines the variational objective for the labeled data.

3) Combined Objective for the M2 Model: In Eqn. (5), the distribution qφ(y|x), which is used to construct the discriminative classifier, is only included in the variational objective of the unlabeled data. This is still an undesirable feature, since the labeled data would not be involved in learning this distribution or the variational parameters φ. Therefore, an additional loss term should be superimposed on the combined model objective, such that both the labeled and unlabeled data can contribute to the training process. Hence, the final objective of the semi-supervised deep generative M2 model is

\begin{equation}
J^{\alpha} = \sum_{x\sim \tilde{p}_u} U(x) + \sum_{(x,y)\sim \tilde{p}_l} \left[\,L(x, y) - \alpha \cdot \log q_\phi(y|x)\,\right] \tag{7}
\end{equation}

in which the hyper-parameter α controls the relative weight between generative and discriminative learning. A rule of thumb is to set α = 0.1 · N in all experiments, where N is the number of labeled data samples.

With this combined objective function, we can integrate a large number of x as a mini-batch to enhance the stability of training the two neural networks used as the encoder and decoder. Finally, stochastic gradient descent is run to update the model parameters θ and the variational parameters φ. The structure of the M2 model is presented in Fig. 3.

C. Model Implementations

1) M1 Model Implementation: The M1 model constructs its encoder qφ(z|x) and decoder pθ(x|z) using two deep neural networks f(z; x, φ) and g(x; z, θ), respectively. The encoder has 2 convolutional layers and 1 fully connected layer with ReLU activation, aided by batch normalization and dropout layers. The decoder consists of 1 fully connected layer followed by 3 transposed convolutional layers, where the first 2 layers use ReLU activation and the last layer uses linear activation.

Due to the "KL vanishing" problem, it is often difficult to achieve a good balance between the likelihood and the KL divergence, as the KL loss can be undesirably reduced to zero, though it is expected to remain a small value. To overcome this problem, the implementation of the M1 model uses "KL cost annealing", or the "β-VAE" [25], which introduces a weight factor β for the KL divergence. The revised ELBO function for the β-VAE is

\begin{equation}
\mathrm{ELBO} = \underbrace{\mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right]}_{\text{Reconstruction}} - \;\beta \cdot \underbrace{D_{KL}\!\left(q_\phi(z|x)\,\|\,p(z)\right)}_{\text{KL Regularization}} \tag{8}
\end{equation}

During training, β is gradually increased from 0 to 1. When β < 1, the latent variables z are trained with an emphasis on capturing useful features for reconstructing the observations x. When β = 1, the z learned in earlier epochs can be taken as a good initialization, which enables more informative latent features to be used by the decoder [26].

After training the M1 model so that it balances its reconstruction and generation capabilities, the latent variables z will be used as discriminative features for the external classifier. This paper uses an SVM classifier, though any preferred classifier can also be used.
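The KL cost annealing described above can be as simple as a scalar schedule for β in Eqn. (8). A minimal sketch follows; the linear ramp shape and the warm-up length are illustrative assumptions, since the exact schedule is not specified here.

```python
def kl_annealing_beta(step, warmup_steps=10_000):
    """Linear KL cost annealing: beta ramps from 0 to 1 over
    `warmup_steps` training steps and then stays at 1, so early
    training emphasizes reconstruction before the KL regularization
    term is fully weighted. The linear shape is an illustrative choice."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

def annealed_elbo(expected_log_likelihood, kl_divergence, step, warmup_steps=10_000):
    """Revised ELBO of Eqn. (8): reconstruction term minus the
    beta-weighted KL regularization term."""
    return expected_log_likelihood - kl_annealing_beta(step, warmup_steps) * kl_divergence
```

At step 0 the objective is pure reconstruction, which is what keeps the KL loss from collapsing to zero before z has learned useful features.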
Fig. 3. Illustration of the semi-supervised generative M2 model.

The M1 model will perform discriminative feature extraction and reduce the dimensionality of the input data, which is expected to increase the performance of the external classifier. In this study, the input data has a dimension of 1,024, which is reduced to 128 in the latent space.

2) M2 Model Implementation: The deep generative M2 model uses the same structure for qφ(z|x) as the M1 model, while the decoder pθ(x|y, z) also has the same settings as M1's pθ(x|z). In addition, the classifier qφ(y|x) consists of 2 convolutional layers and 2 max-pooling layers with dropout and ReLU activation, followed by a final Softmax layer.

Two independent neural networks with the same structure, but different input/output specifications and loss functions, are used for the labeled and the unlabeled data. For labeled data, both xl and y are taken as input to minimize the labeled (x, y) ∼ p̃l part of Eqn. (7), and the output will be the reconstructed xl* and y*. For unlabeled data, xu is the only input, and it is used to reconstruct xu. Other hyper-parameters of the M2 model are selected empirically. We use a batch size of 200 for training, and the latent variable z has a dimension of 128. For the optimizer, we use RMSprop with an initial learning rate of 10⁻⁴.

3) M1 vs. M2 Model: Comparing the M1 and M2 models, the significance of the M1 model lies in its simple and clear network structure, which is easy to implement and saves training time. As shown in Fig. 2, the M1 model is a straightforward implementation of VAE that only includes an encoder and a decoder trained in an unsupervised manner; the learned latent features and labels (zl, yl) of the labeled data are subsequently used to train an external classifier.

On the other hand, the M2 model deals with both labeled and unlabeled data by using two identical encoder networks.

IV. EXPERIMENTAL RESULTS USING THE CWRU DATASET

In this section, we use the CWRU dataset to verify the effectiveness of the two VAE-based semi-supervised deep generative models for bearing fault diagnosis. The developed diagnostic framework will be described in detail, and the performance of the classifier will first be compared with three baseline supervised/unsupervised algorithms, including principal component analysis (PCA), the autoencoder (AE), and the convolutional neural network (CNN). Then, we also compare the proposed methods against several state-of-the-art semi-supervised learning algorithms: low density separation (LDS) [30], the safe semi-supervised support vector machine (S4VM) [31], SemiBoost [32], and semi-supervised smooth alpha layering (S3AL) [12].

A. CWRU Dataset

The CWRU dataset contains vibration signals collected from the drive-end bearing and fan-end bearing in a 2 hp induction motor dyno setup [29]. Single-point defects are manually created on the bearing inner race (IR), outer race (OR), and rolling elements by electro-discharge machining. Defect diameters of 7 mil, 14 mil, 21 mil, 28 mil, and 40 mil are used to represent different levels of fault severity. Two accelerometers mounted on the drive-end and fan-end of the motor housing are used to collect vibration data at a motor load from 0 to 3 hp and a motor speed from 1,720 to 1,797 rpm, at a sampling frequency of 12 kHz or 48 kHz.

The purpose of the proposed bearing fault diagnostic model is to reveal the location and severity of bearing defects, and vibration data collected for the same failure type but at different speeds and load conditions will be considered as having the same class label. Based on this standard, 10 classes are specified according to the size and location of the bearing defect, and TABLE I lists all 10 classes featured in this study.
\begin{align}
\mathrm{ELBO}_U &= \mathbb{E}_{q_\phi(y,z|x)}\!\left[\log p_\theta(x|y,z) + \log p_\theta(y) + \log p_\theta(z) - \log q_\phi(y,z|x)\right] \nonumber\\
&= \mathbb{E}_{q_\phi(y|x)}\!\left[-L(x,y) - \log q_\phi(y|x)\right] \nonumber\\
&= \sum_{y} q_\phi(y|x)\left(-L(x,y)\right) + H\!\left(q_\phi(y|x)\right) = -\,U(x) \tag{5}
\end{align}

\begin{align}
\mathrm{ELBO}_L &= \mathbb{E}_{q_\phi(z|x,y)}\!\left[\log p_\theta(x|y,z) + \log p_\theta(y) + \log p_\theta(z) - \log q_\phi(z|x,y)\right] \nonumber\\
&= \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|y,z) + \log p_\theta(y) + \log p_\theta(z) - \log q_\phi(z|x)\right] = -\,L(x,y) \tag{6}
\end{align}
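Eqns. (5)–(7) translate directly into a loss computation: for unlabeled data, the labeled loss L(x, y) is averaged under the classifier distribution qφ(y|x) and the classifier's entropy is subtracted, and the final objective combines the labeled and unlabeled terms with the α-weighted classification loss. A NumPy sketch under the sign convention of Eqn. (5), where L and U denote losses (negative ELBOs); the function names are illustrative, not from the paper's code:

```python
import numpy as np

def unlabeled_loss_U(q_y, L_per_class):
    """U(x) from Eqn. (5): -U(x) = sum_y q_phi(y|x) * (-L(x, y)) + H(q_phi(y|x)),
    where q_y is the classifier distribution q_phi(y|x) and L_per_class[y]
    is the labeled loss L(x, y) evaluated for every possible label y."""
    q_y = np.asarray(q_y, dtype=float)
    L_per_class = np.asarray(L_per_class, dtype=float)
    entropy = -np.sum(q_y * np.log(np.where(q_y > 0, q_y, 1.0)))
    return np.sum(q_y * L_per_class) - entropy

def combined_objective_J_alpha(U_unlabeled, L_labeled, log_q_true_label, alpha):
    """Final M2 objective of Eqn. (7): the sum of unlabeled losses U(x) plus,
    for each labeled pair, L(x, y) minus alpha * log q_phi(y|x)."""
    labeled = sum(L - alpha * lq for L, lq in zip(L_labeled, log_q_true_label))
    return sum(U_unlabeled) + labeled
```

Note how the entropy term in U(x) rewards an uncertain classifier on unlabeled data, while the α term in the combined objective is what forces qφ(y|x) to also learn from the labeled pairs.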
Fig. 4. Comparison of the original (top row) and the reconstructed (bottom row) bearing vibration signals after training the VAE M1 model.
TABLE I
CLASS LABELS SELECTED FROM THE CWRU DATASET

B. Data Preprocessing

The diagnosis process starts with data segmentation, which divides the collected vibration signal into multiple segments of equal length. For the CWRU dataset, the number of data samples of the drive-end vibration signal for each kind of bearing failure is approximately 120,000 at three different speeds (i.e., 1,730 rpm, 1,750 rpm, and 1,772 rpm). Data collected at these speeds constitute the complete data for each class, which will later be segmented using a fixed window size of 1,024 samples (an 85.3 ms time span at a sampling rate of 12 kHz) and a sliding rate of 0.2. Finally, the numbers of training and test data segments are 12,900 and 900, respectively. All of the test set data are labeled. Although the percentage of test data appears small at first glance, only a maximum of 2,150 training data segments will have labels in the later experiments, so the percentage of test data over labeled training data is around 30%.

After the initial data import and segmentation stage, these data segments are still arranged in the order of their class labels, or fault types. Therefore, data shuffling needs to be carried out to ensure that both the training and test sets represent the overall distribution of the CWRU dataset, which further enhances model generalization and makes it less prone to overfitting. Classical standardization is also applied to the training and test sets to ensure the vibration data have zero mean and unit variance, which is achieved by subtracting the mean of the original data and then dividing the result by its standard deviation.

C. Experimental Results

After training the VAE-based M1 model, the reconstructed bearing vibration signal should be very similar to the actual vibration signal; their comparison is demonstrated in Fig. 4. Although a perfect reconstruction may impact the VAE's generative capabilities and reduce its versatility, a reasonably close reconstruction with a small error indicates that the VAE has achieved a balance between reconstruction and generation, which is critical for leveraging the generative features of the algorithm.

The network structures of the VAE-based deep generative M1 and M2 models have been discussed in detail in Section III.C. In addition to implementing these models for bearing fault diagnosis, other popular unsupervised learning schemes such as PCA and the autoencoder, as well as the supervised CNN, are also trained to serve as baselines. Their parameters are either selected to be consistent with the M1 and M2 models, or obtained through parameter tuning. For example, we use the same optimizer settings as the VAE model (RMSprop with an initial learning rate of 10⁻⁴) to train both the CNN and autoencoder benchmarks. More details are provided as follows:
1) PCA+SVM: the PCA+SVM benchmark is trained using low-dimensional features extracted from the labeled data segments (each consisting of 1,024 data samples) using PCA. The dimension of the feature space is 128, which is consistent with the M1 and M2 models' latent space
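The segmentation and standardization steps described above can be sketched as follows. We interpret the "sliding rate of 0.2" as the window advancing by 20% of its length per step, which is an assumption on our part, and we compute the standardization statistics from the training set, a common choice the text does not spell out.

```python
import numpy as np

def segment_signal(signal, window=1024, sliding_rate=0.2):
    """Divide a 1-D vibration signal into overlapping, fixed-length
    segments; the window advances by sliding_rate * window samples
    (assumed interpretation of the paper's 'sliding rate')."""
    hop = max(1, int(window * sliding_rate))
    n_segments = (len(signal) - window) // hop + 1
    return np.stack([signal[i * hop : i * hop + window] for i in range(n_segments)])

def standardize(train, test):
    """Zero-mean / unit-variance scaling: subtract the mean and divide
    by the standard deviation, with statistics taken from the training set."""
    mu, sigma = train.mean(), train.std()
    return (train - mu) / sigma, (test - mu) / sigma
```

After segmentation, the segments should still be shuffled before the train/test split, as the text notes, since they initially arrive ordered by fault type.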
TABLE II
E XPERIMENTAL R ESULTS OF VAE-B ASED S EMI -S UPERVISED C LASSIFICATION ON CWRU B EARING D ATASET W ITH L IMITED L ABELS
TABLE III
C OMPARISON OF D IFFERENT S EMI -S UPERVISED L EARNING A LGORITHMS W ITH D IFFERENT L ABELED D ATA P ERCENTAGE ν
TABLE V
B EARING FAULT S CENARIOS AND T HEIR D EGRADATION S TARTING P OINTS IN THE IMS D ATASET [34]
TABLE VI
E XPERIMENTAL R ESULTS OF S EMI -S UPERVISED C LASSIFICATION ON IMS B EARING D ATASET W ITH L IMITED L ABELS
15 of them are chosen after their degradation starting points, as determined by the auto-encoder-correlation (AEC) algorithm. For instance, data files 1,832 to 2,042 are selected for the "Subset 1 Bearing 3" scenario, since its estimated degradation starting point is 2,027. On the other hand, healthy data are picked from the first 110 files of "Subset 1 Bearing 3". Each fault scenario has 210 consecutive vibration data files, of which the last 10 files serve as the test set, and the first 200 files constitute the entire training set. Each file contains 20,480 data points, which can be divided into 20 data segments. Therefore, each fault scenario has 4,000 data segments, and the 4 categories (healthy, rolling elements, outer race, inner race) together have 16,000 data segments.

In order to simulate the challenges related to accurate data labeling in practical applications, labels are assigned starting from the last of the 4,000 training data segments for each fault scenario and proceeding backward. For these final data segments, we have the highest confidence in the accuracy of their labels. Then, by labeling more of the preceding files, but with lower confidence, more case studies can be performed. The purpose is to investigate whether incorrect labeling will negatively impact the accuracy of the supervised learning benchmark (CNN), and to assess whether semi-supervised deep generative models can still improve the accuracy of the bearing fault classifier by leaving these data segments unlabeled.

C. Experimental Results

A total of 10 rounds of independent semi-supervised experiments were performed using the IMS dataset, and 10 case studies were conducted by labeling the last 40, 100, 200, 400, 800, 1,000, 2,000, 4,000, and 8,000 data segments of the training set, which account for 0.25%, 0.63%, 1.25%, 2.5%, 5%, 6.25%, 12.5%, 25%, and 50% of the training data, respectively.

TABLE VI presents the classification results after the 10 rounds of experiments. The performance of the M1 model is better than that of PCA, but almost the same as that of the vanilla autoencoder. This shows that the VAE's discriminative feature space has no obvious advantage over the vanilla autoencoder's encoded space. Nevertheless, the performance of the M1 model is also superior to that of the supervised learning algorithm CNN. By incorporating the vast majority of unlabeled data in the training process, the improvement is approximately 5% to 15% as the number of labeled data segments varies from N = 40 to N = 1,000, and the standard deviation is also much smaller.

Similar to the comparison results shown in TABLE II, the classification performance of the VAE-based M2 model is superior to the other four algorithms, showing the advantage of integrating the training process of the VAE model and its built-in classifier. A critical observation can be drawn when the number of labeled training data segments increases from N = 4,000 to N = 8,000: the accuracy of the supervised algorithm CNN is reduced by more than 6%, while the loss for the semi-supervised VAE M2 model is 4%. The performance of the unsupervised learning algorithms, which do not use labels in their training process, remains intact. This can be largely attributed to the many healthy data samples that have been incorrectly labeled as faulty, which creates a dilemma: the classifier's accuracy is impaired by using either insufficient data or more data with inaccurate labels. Specifically, the best attainable accuracy for the three baseline algorithms is 87.74% when N = 4,000 and 84% when N = 8,000, while the VAE-based M2 model achieves an average of 92.01% and 88.11%, respectively.

In summary, the experimental results obtained using the IMS dataset consistently support the previous findings on the CWRU dataset; that is, taking advantage of the large amount of unlabeled data can effectively enhance the classifier's performance using semi-supervised VAE-based deep generative models, especially the M2 model. In addition, the results also imply that inaccurate labeling can reduce the accuracy of supervised learning algorithms. Therefore, in diagnosing naturally evolved bearing faults in real-world applications, it is desirable to leverage semi-supervised learning methods, which only require a small set of data that we can label with confidence while leaving the majority of the data unlabeled.

VI. CONCLUSION

This paper implemented two semi-supervised deep generative models based on VAE for bearing fault diagnosis with limited labels. The results on the CWRU dataset show that the M2 model can greatly outperform the baseline supervised and unsupervised learning algorithms, and this advantage can be up to 27% when only 2.3% of the training data have labels. Additionally, the VAE-based M2 model also compares favorably against four state-of-the-art semi-supervised learning methods.

The CWRU dataset only contains vibration data from manually initiated bearing defects, which is inconsistent with real-world scenarios where these defects evolve naturally over time. Therefore, we also used the IMS
Therefore, we also used the IMS dataset to verify the performance of the two VAE-based semi-supervised deep generative models. The results demonstrate that incorrect labeling will reduce the classifier performance of mainstream supervised learning algorithms, while adopting semi-supervised deep generative models and keeping data with label uncertainties unlabeled can be an effective way to mitigate this issue.

REFERENCES

[1] M. Burgess. (May 2020). What Is the Internet of Things? Wired Explains. [Online]. Available: https://www.wired.co.uk/article/internet-of-things-what-is-explained-iot
[2] J. Krakauer. (Apr. 2020). Data in Action: IoT and the Smart Bearing. [Online]. Available: https://blogs.oracle.com/bigdata/data-in-action-iot-and-the-smart-bearing
[3] J. Harmouche, C. Delpha, and D. Diallo, “Improved fault diagnosis of ball bearings based on the global spectrum of vibration signals,” IEEE Trans. Energy Convers., vol. 30, no. 1, pp. 376–383, Mar. 2015.
[4] F. Immovilli, A. Bellini, R. Rubini, and C. Tassoni, “Diagnosis of bearing faults in induction machines by vibration or current signals: A critical comparison,” IEEE Trans. Ind. Appl., vol. 46, no. 4, pp. 1350–1359, Jul. 2010.
[5] M. Kang, J. Kim, and J.-M. Kim, “An FPGA-based multicore system for real-time bearing fault diagnosis using ultrasampling rate AE signals,” IEEE Trans. Ind. Electron., vol. 62, no. 4, pp. 2319–2329, Apr. 2015.
[6] A.-B. Ming, W. Zhang, Z.-Y. Qin, and F.-L. Chu, “Dual-impulse response model for the acoustic emission produced by a spall and the size evaluation in rolling element bearings,” IEEE Trans. Ind. Electron., vol. 62, no. 10, pp. 6606–6615, Oct. 2015.
[7] M. Blodt, P. Granjon, B. Raison, and G. Rostaing, “Models for bearing damage detection in induction motors using stator current monitoring,” IEEE Trans. Ind. Electron., vol. 55, no. 4, pp. 1813–1822, Apr. 2008.
[8] S. Zhang et al., “Model-based analysis and quantification of bearing faults in induction machines,” IEEE Trans. Ind. Appl., vol. 56, no. 3, pp. 2158–2170, May 2020.
[9] R. Zhao, R. Yan, Z. Chen, K. Mao, P. Wang, and R. X. Gao, “Deep learning and its applications to machine health monitoring,” Mech. Syst. Signal Process., vol. 115, pp. 213–237, Jan. 2019.
[10] S. Zhang, S. Zhang, B. Wang, and T. G. Habetler, “Deep learning algorithms for bearing fault diagnostics—A comprehensive review,” IEEE Access, vol. 8, pp. 29857–29881, 2020.
[11] S. R. Saufi, Z. Ahmad, M. S. Leong, and M. H. Lim, “Challenges and opportunities of deep learning models for machinery fault detection and diagnosis: A review,” IEEE Access, vol. 7, pp. 122644–122662, 2019.
[12] R. Razavi-Far, E. Hallaji, M. Farajzadeh-Zanjani, and M. Saif, “A semi-supervised diagnostic framework based on the surface estimation of faulty distributions,” IEEE Trans. Ind. Informat., vol. 15, no. 3, pp. 1277–1286, Mar. 2019.
[13] R. Razavi-Far et al., “Information fusion and semi-supervised deep learning scheme for diagnosing gear faults in induction machine systems,” IEEE Trans. Ind. Electron., vol. 66, no. 8, pp. 6331–6342, Aug. 2019.
[14] P. Liang, C. Deng, J. Wu, Z. Yang, J. Zhu, and Z. Zhang, “Single and simultaneous fault diagnosis of gearbox via a semi-supervised and high-accuracy adversarial learning framework,” Knowl.-Based Syst., vol. 198, Jun. 2020, Art. no. 105895.
[15] X. Chen, Z. Wang, Z. Zhang, L. Jia, and Y. Qin, “A semi-supervised approach to bearing fault diagnosis under variable conditions towards imbalanced unlabeled data,” Sensors, vol. 18, no. 7, p. 2097, Jun. 2018.
[16] M. Zhao, B. Li, J. Qi, and Y. Ding, “Semi-supervised classification for rolling fault diagnosis via robust sparse and low-rank model,” in Proc. IEEE 15th Int. Conf. Ind. Informat. (INDIN), Jul. 2017, pp. 1062–1067.
[17] D. B. Verstraete, E. L. Droguett, V. Meruane, M. Modarres, and A. Ferrada, “Deep semi-supervised generative adversarial fault diagnostics of rolling element bearings,” Struct. Health Monitor., vol. 19, no. 2, pp. 390–411, Dec. 2019.
[18] T. Pan, J. Chen, J. Xie, Y. Chang, and Z. Zhou, “Intelligent fault identification for industrial automation system via multi-scale convolutional generative adversarial network with partially labeled samples,” ISA Trans., vol. 101, pp. 379–389, Jun. 2020.
[19] C. Liu and K. Gryllias, “A semi-supervised support vector data description-based fault detection method for rolling element bearings based on cyclic spectral analysis,” Mech. Syst. Signal Process., vol. 140, Jun. 2020, Art. no. 106682.
[20] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” 2013, arXiv:1312.6114. [Online]. Available: http://arxiv.org/abs/1312.6114
[21] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov, “Good semi-supervised learning that requires a bad GAN,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6510–6520.
[22] G. San Martin, E. López Droguett, V. Meruane, and M. das Chagas Moura, “Deep variational auto-encoders: A promising tool for dimensionality reduction and ball bearing elements fault diagnosis,” Struct. Health Monitor., vol. 18, no. 4, pp. 1092–1128, Jul. 2019.
[23] A. L. Ellefsen, E. Bjørlykhaug, V. Æsøy, and H. Zhang, “An unsupervised reconstruction-based fault detection algorithm for maritime components,” IEEE Access, vol. 7, pp. 16101–16109, 2019.
[24] M. Hemmer, A. Klausen, H. V. Khang, K. G. Robbersmyr, and T. I. Waag, “Health indicator for low-speed axial bearings using variational autoencoders,” IEEE Access, vol. 8, pp. 35842–35852, 2020.
[25] C. P. Burgess et al., “Understanding disentangling in β-VAE,” 2018, arXiv:1804.03599. [Online]. Available: http://arxiv.org/abs/1804.03599
[26] H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin, “Cyclical annealing schedule: A simple approach to mitigating KL vanishing,” 2019, arXiv:1903.10145. [Online]. Available: http://arxiv.org/abs/1903.10145
[27] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi-supervised learning with deep generative models,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3581–3589.
[28] J. Lee, H. Qiu, G. Yu, and J. Lin, “Bearing data set,” IMS, Univ. Cincinnati, Rexnord Tech. Services, NASA Ames Prognostics Data Repository, NASA Ames Res. Center, Moffett Field, CA, USA, 2007. [Online]. Available: http://ti.arc.nasa.gov/project/prognostic-data-repository
[29] (Nov. 2020). Case Western Reserve University (CWRU) Bearing Data Center. [Online]. Available: https://csegroups.case.edu/bearingdatacenter/pages/project-history
[30] O. Chapelle and A. Zien, “Semi-supervised classification by low density separation,” in Proc. Int. Workshop Artif. Intell. Statist. (AISTATS), Barbados, Jan. 2005, pp. 57–64.
[31] Y.-F. Li and Z.-H. Zhou, “Towards making unlabeled data never hurt,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 175–188, Jan. 2015.
[32] P. K. Mallapragada, R. Jin, A. K. Jain, and Y. Liu, “SemiBoost: Boosting for semi-supervised learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 2000–2014, Nov. 2009.
[33] L. Wen, X. Li, L. Gao, and Y. Zhang, “A new convolutional neural network-based data-driven fault diagnosis method,” IEEE Trans. Ind. Electron., vol. 65, no. 7, pp. 5990–5998, Jul. 2018.
[34] R. M. Hasani, G. Wang, and R. Grosu, “An automated auto-encoder correlation-based health-monitoring and prognostic method for machine bearings,” 2017, arXiv:1703.06272. [Online]. Available: http://arxiv.org/abs/1703.06272

Shen Zhang (Member, IEEE) received the B.S. (Hons.) degree in electrical engineering from the Harbin Institute of Technology, Harbin, China, in 2014, and the M.S. and Ph.D. degrees in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2017 and 2019, respectively.

His research interests include the design, control, condition monitoring, and fault diagnostics of electric machines, control of power electronics, powertrain engineering for electric propulsion, deep learning, and reinforcement learning applied to energy systems.

Fei Ye (Member, IEEE) received the M.S. degree from Northeastern University, Boston, MA, USA, in 2014, and the Ph.D. degree from the University of California at Riverside, Riverside, CA, USA, in 2019, both in electrical and computer engineering.

She is currently a Postdoctoral Researcher with the University of California at Berkeley, Berkeley, CA, USA. Her research interests include intelligent vehicles and transportation, deep learning, and deep reinforcement learning with domain adaptation, and their applications in behavior prediction, decision making, and trajectory planning.

Bingnan Wang (Senior Member, IEEE) received the B.S. degree from Fudan University, Shanghai, China, in 2003, and the Ph.D. degree from Iowa State University, Ames, IA, USA, in 2009, both in physics.

Since then, he has been with Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, where he is a Senior Principal Research Scientist. His research interests include electromagnetics and photonics, and their applications to wireless communications, wireless power transfer, sensing, electric machines, and energy systems.

Thomas G. Habetler (Fellow, IEEE) received the B.S.E.E. and M.S. degrees in electrical engineering from Marquette University, Milwaukee, WI, USA, in 1981 and 1984, respectively, and the Ph.D. degree from the University of Wisconsin–Madison, Madison, WI, USA, in 1989.

From 1983 to 1985, he was employed with the Electro-Motive Division of General Motors as a Project Engineer. Since 1989, he has been with the Georgia Institute of Technology, Atlanta, GA, USA, where he is currently a Professor of Electrical and Computer Engineering. His research interests include electric machine protection and condition monitoring, switching converter technology, and drives. He has published over 300 technical articles in the field. He is also a regular consultant to industry in the field of condition-based diagnostics for electrical systems.

Dr. Habetler was the inaugural recipient of the IEEE-PELS “Diagnostics Achievement Award,” and a recipient of the EPE-PEMC “Outstanding Achievement Award,” the 2012 IEEE Power Electronics Society Harry A. Owen Distinguished Service Award, and the 2012 IEEE Industry Applications Society Gerald Kliman Innovator Award. He has received one transactions prize paper award and four conference prize paper awards from the Industry Applications Society. He has served on the IEEE Board of Directors as the Division II Director, on the Technical Activities Board, and on the Member and Geographic Activities Board as a Director of IEEE-USA. He is also a Past President of the Power Electronics Society, and has served as an Associate Editor for the IEEE TRANSACTIONS ON POWER ELECTRONICS.