1 s2.0 S0952197623006358 Main

Engineering Applications of Artificial Intelligence 123 (2023) 106451
Contents lists available at ScienceDirect
Engineering Applications of Artificial Intelligence

journal homepage: www.elsevier.com/locate/engappai
DCDA-Net: Dual-convolutional dual-attention network for obstructive sleep

apnea diagnosis from single-lead electrocardiograms
Nadeem Ullah, Tahir Mahmood, Seung Gu Kim, Se Hyun Nam, Haseeb Sultan,
Kang Ryoung Park ∗
Division of Electronics and Electrical Engineering, Dongguk University, 30 Pildong-ro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
ARTICLE INFO ABSTRACT

Keywords: Obstructive sleep apnea (OSA) is a breathing-related chronic disease in which the soft palate and tongue
Electrocardiogram collapse and block the upper airway for at least 10 s during sleep. It can lead to many heart diseases
Artificial intelligence such as hypertension, myocardial infarction, and coronary heart syndrome if not detected early. Artificial
Obstructive sleep apnea
intelligence has facilitated the diagnosis of many diseases in healthcare. Polysomnography is a widely used
Scalograms
but unpleasant, time-consuming, technically demanding, and financially expensive procedure to detect OSA.
Spectrograms
Dual-convolutional dual-attention network
Some previous methods have detected OSA using time-domain information from an electrocardiogram (ECG),
whereas others have used frequency-domain information. The limitations of these two approaches can be
handled using the data’s time–frequency representation. Nevertheless, there is room for enhancing the detection
accuracy of OSA using the time–frequency representation approach. Therefore, we propose a novel technique
that takes the ECG signal and detects R-peaks from the QRS complexes. Afterward, we interpolate those R-
peaks by linear interpolation and get an interpolated-R signal. Then we magnify the interpolated-R signal
corresponding to the apnea and normal frequency ranges. After magnification in the time domain, we
transformed the magnified version into a scalogram. We also transformed the original one-minute ECG signal
into a spectrogram after denoising. Overall, we used ECG signals to generate scalograms and spectrograms
for 2 dimensional convolutional neural network (2D CNN) to classify obstructive sleep apnea. For apnea
classification, we proposed a dual convolutional dual attention network (DCDA-Net) that includes a dual
convolutionally modified inception module, a spatial attention module, and a channel attention module.
Finally, we apply a support vector machine to the probability scores obtained from DCDA-Net based on
the scalogram and spectrogram. Extensive experimental results using the open PhysioNet apnea ECG dataset
confirm the effectiveness of our method in terms of accuracy and F1 score of 98% and 97.5%, respectively,
which outperforms state-of-the-art methods.
1. Introduction hypertension (Shao et al., 2022). It can also be fatal and greatly affects
a patient’s life. Therefore, early diagnosis is the key to avoid various
Obstructive sleep apnea (OSA) is a chronic disease caused by the negative effects. Low blood oxygen saturation decreases the formation
collapse of the upper airway, which interrupts the airflow completely of adenosine triphosphate (ATP), which decreases the amount of basic
or partially for at least 10 s during sleep (Tang and Liu, 2021; Yang energy available to millions of cells. The cells die after a few seconds to
et al., 2022). It is a sleep disorder that prevents patients from falling
a minute if the oxygen is continuously interrupted. In contrast, healthy
asleep and endangers their health and well-being by affecting oxygen
sleep is vital because it refreshes the body and brain and resets the
deprivation and arousal (Gutiérrez-Tobal et al., 2015). Various studies
emotions. Learning capabilities and physical development also depend
(Gutiérrez-Tobal et al., 2019) have revealed that almost 200 million
people suffer from sleep apnea (SA). It is more prevalent in males than on sleep (Sharma et al., 2022). Sleep apnea (SA) can be classified into
females, and some studies have found a rate of 4% in adult males and three categories based on its cause: (a) obstructive SA, (b) central SA,
2% in adult females (Mostafa et al., 2019). The dominant factor of SA and (c) mixed SA (Kushida et al., 2005). Obstructive SA appears during
increases with age; therefore, rationally, it is the greatest factor in the sleep when the airflow is blocked by the throat muscles, reducing
elderly (Ernst et al., 2019). If not diagnosed and treated early, it can airflow by more than 90%, and central SA occurs when the brain signals
cause many heart diseases, including heart attacks, arrhythmias, and that regulate breathing are interrupted (Van Steenkiste et al., 2019).
∗ Corresponding author.
E-mail address: [email protected] (K.R. Park).
https://doi.org/10.1016/j.engappai.2023.106451
Received 2 February 2023; Received in revised form 8 May 2023; Accepted 9 May 2023
Available online 22 May 2023
0952-1976/© 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
N. Ullah, T. Mahmood, S.G. Kim et al. Engineering Applications of Artificial Intelligence 123 (2023) 106451
Mixed sleep apnea occurs as central SA and becomes obstructive at localizes features well in the time domain and poorly in the frequency
the end of the apnea episode (Gutiérrez-Tobal et al., 2013). The most domain.
common of the three types of sleep apnea is OSA (Huang et al., 2012). Considering the above limitations, a new preprocessing called mag-
Polysomnography is an early technique that is considered the gold nification and a new deep learning technique called Dual-Convolutional
standard for the diagnosis of SA. However, this technique involves Dual-Attention Network (DCDA-Net) are proposed in this paper. Exper-
taping various sensors to human body parts to record brain waves iments were conducted on a publicly available dataset, and the major
(electroencephalograms (EEG)), eye movements (electrooculograms contributions of this study are highlighted below.
(EOG)), muscle movements (electromyograms (EMG)), breathing pat-
• - We have proposed a novel preprocessing method to detect the
terns (ThorRes/AbdoRes), and heart rhythm (electrocardiogram (ECG)).
‘R’ amplitudes of the ECG signal, interpolate them, and magnify
Although polysomnography (PSG) has high diagnostic accuracy (Ali
the interpolated signal by a magnification factor. The magnifi-
et al., 2019), the high cost, complexity, time, need for expert techni-
cation factor is calculated based on the frequency band of the
cians, synthesis/analysis of many signals, and inconvenience to patients
interpolated signal in which apnea and normal occur.
have prevented this technology from being widely used in public health
• We transformed the magnified signal into a scalogram and bi-
centers.
narized it to determine the region of the largest contribution.
Facilitating and simplifying OSA diagnosis was a major research fo-
Then we localized it in the frequency domain throughout the
cus several years ago. Therefore, many researchers have used machine
time domain and finally input it to the proposed DCDA-Net. We
learning (ML) techniques, such as the support vector machine used by
also transformed the ECG segment into the spectrogram without
(Surrel et al., 2018), to extract essential features using the R amplitudes
targeting the largest contributing region to extract compromised
and R-R intervals (RRI) of ECG signals to diagnose SA. Unfortunately,
features localized in the time and frequency domains to increase
the performance of ML is significantly limited due to the demands of
the number of features.
handcrafted feature engineering.
• We proposed a DCDA network that includes spatial attention
Thanks to advances in machine learning that made it possible to
followed by channel attention to allow the network to focus on
automatically extract salient features to reduce the likelihood of error
salient features in scalograms and spectrograms. In DCDA-Net,
that results from manual processing. Deep Learning (DL) is a branch of
we used the concept of dual-convolution, which splits the input
ML, which can learn key features end-to-end from input to reduce the
feature map channels into groups and convolves them one-by-
errors of the handcrafted ML method and produce promising results.
one by translating through all channels. To compensate for the
DL is impressive when it comes to automatically extracting important
channel communication problem, we use a 1 × 1 convolution in
features and aggregating them for classification purposes. In a previous
parallel with each 3 × 3 group convolution.
study (Fatimah et al., 2020), SA was efficiently diagnosed by proposing
• Our network is available publicly to researchers through (DCDA-
a fast Fourier decomposition method using a support vector machine
Net with algorithms, 2023).
(SVM). In previous research, the authors proposed an expert system for
detecting apnea based on the respiratory signal (Macey et al., 1998). The remainder of this paper is organized as follows. Section 2 presents
(Cheng et al., 2017) detected OSA using a recurrent neural network the previous methods for detecting OSA, and Section 3 details the
(RNN). (Bahrami and Forouzanfar, 2021) used numerous DL models, proposed methods. Section 4 presents the results, an evaluation of
such as long short-term memory (LSTM) and gated recurrent units, to the proposed method, a comparison with existing methods, and a
diagnose sleep apnea. In addition to sleep apnea, previous research discussion. The conclusion of the study is presented in Section 5.
has used SVM to detect ECG changes in patients with partial epilepsies
(Übeyli, 2008). 2. Literature review
To minimize errors due to handcrafted feature engineering in ML, to
reduce memory requirements and feature redundancy due to the end- The diagnosis of OSA is crucial at an early stage (specifically in
to-end method of DL, and to facilitate services in wearable devices, old age) to avoid various cardiovascular abnormalities. Therefore, re-
researchers have recently used convolutional neural networks (CNN). searchers have proposed many techniques to detect it, some based on
Sharan et al. (2020) used a 1-D CNN to diagnose SA disorders from 1D signal processing and others related to image processing (TFRs).
time-domain information. Niroshana et al. (2021) fused two time– The former methods extract features from 1-dimensional data (the time
frequency representations (scalogram and spectrogram) and used a domain), and the latter methods extract them from 2-dimensional data
two-dimensional CNN (2D CNN) to identify OSA. In addition to physio- (time–frequency representation). In the early days, ML was used to
logical signals, CNN is also used to detect various diseases successfully. detect OSA by extracting handcrafted features for classification using
For example, Arsalan et al. detected retinal vasculature using CNN to SVM, K-nearest neighbor (KNN), etc. DL automated the process to
analyze diabetic and hypertensive retinopathy (Arsalan et al., 2022). extract the features of the signals using a deep neural network (DNN)
Similarly, Haider et al. proposed SLS-Net and SLSR-Net to diagnose for classification. Signal and image processing and pattern recognition
glaucoma using retinal fundus images (Haider et al., 2022). In previous techniques were further enhanced after the introduction of CNN. This
research, the authors proposed a randomized CNN to detect human section explains the methods for extracting features from different
emotions using EEG signals (Cheng et al., 2022). domains, as described in detail in the following subsections.
Although the above techniques based on 1D and 2D CNN have
improved the accuracy of detecting OSA compared with ML and DL, 2.1. Time domain-based methods
they still have some limitations: (1) 1D CNN extracts features from
one-dimensional ECG signals (time domain), but the time domain alone The change in the concentration of sodium ions (Na+) and potas-
does not contain enough information to detect OSA well in many sium ions (K+) results in the generation of an electrical potential called
cases. In this case, the time course of the signal is known, but the the action potential, which contracts and expands the heart to pump
information about the frequency is unknown. (2) The problem with the blood. The rhythmicity of the heart is called electrocardiograms, which
time–frequency representation (TFR) (spectrogram and scalogram) is represent the time dynamics of the heart. The observation and process-
the target conflict between time and frequency localization. Increasing ing of these signals with respect to time are called time-domain-based
the width of the splitting window for the short-time Fourier transform methods. These methods use transformation techniques to process sig-
(STFT) and the continuous wavelet transform (CWT) results in the nals in the time domain and extract features to classify them. For
extraction of features that are better localized in the frequency domain instance, Liu et al. (2023) proposed a CNN-transformer network for
and worse localized in the time domain, whereas decreasing the width obstructive sleep apnea detection that used the ECG signal as input and
2
paid attention to some salient features while maintaining the temporal 2.3. Time domain and time–frequency domain-based methods
dependencies. Chang et al. (2020) proposed a preprocessing method
that filtered the ECG signals through a Butterworth fourth-order filter Extracting one-dimensional features from the time domain and two-
with a bandpass of 0.5 to 15 Hz, which reduced the baseline drift dimensional features from the time–frequency domain significantly im-
high-frequency noise. They used a 1D CNN from scratch to identify SA. pacts accuracy. This method provides time-domain and time–frequency
Chen et al. (2022) proposed a novel technique-multi-scaled lightweight domain features to maintain the position and frequency of apneic
network, which fuses the multi-scaled features by using a squeeze and episodes in the count. Therefore, Mendez et al. suggested two decompo-
excitation module to compensate for the missing information during sition approaches in the time domain: empirical mode decomposition,
preprocessing. Razi et al. extracted features (in the time domain) of which is the signal decomposition in the intrinsic mode function, and
the QRS wave of the ECG record using the Pan-Tompkins algorithm. wavelet analysis, which offers a time-scale assessment of the signal
They used linear discriminant analysis (LDA) and principal component (Mendez et al., 2010). They decomposed the ECG signals by scaling
analysis (PCA) to reduce the dimension to five features and evaluated and translating a mother wavelet to obtain every existing feature of
them using a random forest (RF), decision tree (DT), and support the ECG signal and diagnose the problem correctly. However, these
vector machine (SVM) (Razi et al., 2021). Pombo et al. extracted in- methods have the disadvantage of using 1D and 2D CNN, resulting in
formation about ECG-derived respiration and heart rate variability from large memory requirements, and considerable inference time does not
preprocessed ECG signals and used SVM, artificial neural network, LDA, allow it in real-time applications. Furthermore, considering the time
augmented Naïve Bayes classifier, and partial least square regression to domain alone and in the time–frequency domain results in duplicate
detect apnea (Pombo et al., 2020). Sharan et al. proposed a technique to features, which sometimes overfits the dominant class during the score
calculate the R-to-R interval from the QRS complex and used a 1D CNN or feature fusion if not dropped out before the classification layers.
to diagnose apnea segment-wise (Sharan et al., 2020). These methods This is because time–frequency domain methods cover both time and
have the advantage of well-localized features that consider the position frequency information, and only a careful trade-off is needed between
of the apnea episode. However, they lack adequate information, such time and frequency localization.
as frequency, because most of the physiological signals-based problems
are frequency oriented and vary in the designated time window, and 2.4. Time domain, frequency domain, and time–frequency domain-based
the apnea episodes vary in the 60-s window. Therefore, a 1D CNN methods
can extract useful information from 1-dimensional data, but its lack
of features results in comparatively low performance because the time Some researchers have extracted time and frequency information
information alone is unfruitful in many problem-based physiological individually and extracted the combined time–frequency information
signals. to increase the number of features and diagnose SA. For example, Shao
et al. manually extracted HRV and its frequency information from ECG
2.2. Frequency domain-based methods signals. They transformed the same segment into a time–frequency
domain (having both time and frequency information) to diagnose OSA
The transformation from the time domain to the frequency domain remotely using the Internet of Medical Things (IoMT) (Shao et al.,
localizes all the frequency components in the representation. To further 2022). This enhanced the accuracy of OSA detection but required much
enhance the frequency components, power spectral density (PSD) allo- preprocessing, and, therefore, was computationally expensive. How-
cates different densities to each of the frequency components, enabling ever, these methods require abundant preprocessing and comparatively
the network to extract and classify the features well. This method more inference time owing to many computations. The large memory
is helpful in many cases to enhance performance in the diagnostic requirement is another limitation owing to the training of LSTM and
era. For example, Hassan et al. proposed a wavelet transformation CNN individually.
technique, called the tunable-Q factor wavelet transform (WT), to
extract frequency bands from the time-domain ECG segment, and used 2.5. time–frequency domain-based methods
symmetric normal inverse Gaussian (NIG) to extract features that can
identify apnea and non-apnea (Hassan, 2016). Jafari extracted features A composite signal such as an ECG is not easy to interpret and
from the reconstructed phase space and various frequency domains express the data distribution when dealing with one-dimensional data
to classify the data effectively (Jafari, 2013). Sharma et al. used a (either time or frequency) because it does not have more explicit and lo-
preprocessing technique called variational mode decomposition to di- calized information to extract and classify easily. Therefore, TFR plays
agnose SA from ECG data (Sharma and Sharma, 2020). They extracted a vital role in enhancing data demonstration (which is not obvious
spectral entropies, energy, and interquartile range from morphological in the time domain or frequency domain individually) and helps the
variations in the QRS complex due to respiration induction in ECG. network extract key features. STFT and CWT are powerful tools for
They used PCA to reduce the dimension. Atri et al. developed an creating frequency spectra, such as spectrograms and scalograms. TFRs
algorithm to extract features from ECG-derived respiration (EDR), heart have both time and frequency information; therefore, it is easy for a
rate variability (HRV) signal spectra, and high-order spectra (Atri and CNN to extract position and frequency information. Niroshana et al.
Mohebbi, 2015). They used a least-squares SVM to learn these features transformed a one-minute ECG segment into a spectrogram and scalo-
and identify the OSA correctly; thus, they improved the accuracy gram and fused them. They proposed a lightweight network to extract
considerably. These methods are rich in frequency-localized features features and classify them (Niroshana et al., 2021). Thus, the overall
that help the model diagnose many problems based on the frequency accuracy was improved. Nasifoglu et al. proposed a CNN model that
bands of the physiological signals. However, they consider only the modified ResNet18 to identify obstructive SA (Nasifoglu and Erogul,
frequency domain results in mitigating the time localization in the 2021). They transformed the ECG segment into spectrograms and scalo-
signals, which can cause the extraction of frequency information from grams to intensify apneic events using power spectral density. Gupta
contaminated inherited spikes (whose frequency looks like that of the et al. proposed a simple and robust CNN model called OSACN-Net to
QRS complex) because of the lack of positional information of the QRS extract features from spectrograms (Gupta et al., 2022). They success-
complex. Frequency domain methods mainly focus on the frequency fully enhanced the accuracy of their method. Lin et al. proposed an
components included in the ECG signal and ignore the contextual algorithm for SA diagnosis using machine learning methods and a bag-
(spatial or time) information, which sometimes leads to false detection of-features derived from ECG spectrograms to improve accuracy (Lin
due to the contaminated spikes in the signals. et al., 2022). They used a continuous wavelet transform to create a
3
time–frequency representation. Mashrur et al. proposed a scalogram-

based CNN network to diagnose OSA (Mashrur et al., 2021). They
converted the ECG signal to scalograms using CWT to set the input
image for the CNN. Wang et al. suggested the Hamilton algorithm to
extract the R-R interval and R amplitude to detect SA (Wang et al.,
2019). They used cubic interpolation to generate another signal and
evaluated it using LeNet-5. These methods are simple and require
less preprocessing and more related information. A time–frequency
representation has the position and the frequency of events. Therefore,
a CNN can learn the localized and contextual features of the events.
However, they have the disadvantage of a trade-off between time and
frequency localization. The uncertainty principle of signal analysis is
crucial because it determines the resolution of the representation in
term of time or frequency. Usually, it depends on the nature of the
diagnosis problems and can be solved by time variations or frequency
distributions. Considering the disadvantages of previous studies, we
proposed a novel method based on the time–frequency domain using
DCDA-Net and score-level fusion using SVM.
Beyond the above-related studies, transformer-based architectures
are widely being used in different domains such as GeneViT (Gokhale
et al., 2023), a variant of vision transformer to classify cancer. Lv et al.
(2023) proposed a vision transformer-based model SCViT to preserve
spatial-channel information to classify image scenes for remote sensing.
Similarly, Meng et al. (2022) proposed adaptive vision transformers
(AdaViT) to recognize images efficiently.
Table 1 shows a comparison of our proposed methods with other
methods for OSA detection.
3. Proposed methods
Spectrograms and scalograms (time–frequency representation-TFR)

have a more explicit and localized representation than signals that
vary only for one independent variable (time or frequency). This is
because there are abundant features (time vs. frequency) in the TFR; Fig. 1. Complete flow of the proposed method.
thus, any researcher can analyze the information distribution in a TFR,
and a CNN can learn the distinguishable features and optimize the
weights easily. The TFR had different power densities for each spectral
To further explore Fig. 1, we have broadly split it into two cat-
component in a specific time interval. Therefore, TFR is effective in
egories: (a) spectrogram-based classification and (b) scalogram-based
diagnosing OSA from single-lead ECG records because of the dynamic
classification, and explained each step in their process in detail in their
nature of ECG. Thus, we used the same representations (spectrogram
subsections. In summary, before the spectrogram transformation, we
and scalogram) to create input images (TFRs) to evaluate our pro-
denoised the ECG segment in the preprocessing using the smooth data
posed CNN. The major contributions to the proposed preprocessing
function of MATLAB and then resized the spectrogram to 299 × 299 × 3
are magnification in the time domain and binarization of the region
and provided it to the input layer of our proposed CNN. In the scalo-
of interest (ROI). Similarly, the major contributions to CNN are the
gram preprocessing, we targeted specific frequency bands (apnea and
dual-convolution, spatial attention, and channel attention modules.
normal) of the R-interpolated signal, used a time-magnified signal to
make a scalogram, and then used a binarization technique to make the
3.1. Overall procedure of the proposed method
band more distinguishable before input to the CNN.
Fig. 1 shows the complete flow of our proposed methods. We split
the entire ECG record into one-minute segments and transform them 3.2. Spectrogram-based classification
into spectrograms (after removing the artifacts such as high-frequency
noise components) and scalograms (after acquiring the magnified R- We evaluated the proposed model using spectrograms and scalo-
signal) and obtain their individual scores using a lightweight CNN. grams. For the spectrograms, we do not derive the R-peaks of the ECG
The magnification used here was to obtain only a low-frequency R- segments to have more variation in data such that the model learns
interpolated signal in which an apnea episode occurred. Binarization useful features and does not get overfitted.
was used to eliminate small regions (which were not in the relevant fre-
quency range) to allow the network to learn the relevant contributing 3.2.1. Preprocessing and transformation to spectrogram
components in the scalograms easily. We trained the proposed shallow A raw ECG signal contains noise from device vibration, poor elec-
CNN on both spectrograms and scalograms individually and obtained trode connection, and body movements. Removing this noise is nec-
the training probabilities from its soft-max layer to train the SVM. Sim- essary to free the data distribution artifacts. Therefore, we used the
ilarly, we obtained the test data probabilities of both spectrograms and smoothing function in the MATLAB library (MATLAB R2021b, 2022),
corresponding scalograms and tested the trained SVM. The SVM helped which takes the moving average (using a filter with a size of four
to fit the test data well in our case because of the good distribution samples) on samples to eliminate high-frequency noise components.
of our individual training scores, spectrograms, and scalograms which We proceed with one-minute segments of ECG records as there are
clustered the problem well during SVM training and fit the test data annotations per minute in the dataset (60 s and the sampling rate is
accurately. 100 Hz, such that the total samples in one minute are 6000); however,
4
Table 1
Comparison of our proposed methods with other methods.
Domains Methods Advantages Disadvantages Accuracy Dataset
(%)
CNN transformer network (Liu et al., 2023) No preprocessing Low sensitivity metric, 88.2
Apnea-ECG
imbalance data
Time Whole ECG signal + 1D deep CNN (Chang Simple preprocessing Low sensitivity 87.9
et al., 2020)
R-peaks + Multi-scale lightweight network Multi-scale features Computational 90.6 Private
(Chen et al., 2022) complexity
Mean Normal-to-normal interval, RMSSD, Multiple time features Complex preprocessing 95.0
etc. + SVM, Decision Tree, Random Forest
(Razi et al., 2021)
HRV, EDR + SVM (Pombo et al., 2020) Used effective features Complex features 82.1
selection technique extraction Apnea-ECG
R-R intervals’ interpolation + Custom 1-D Not requires feature Costly preprocessing 88.2
CNN (Sharan et al., 2020) processing
Frequency bands (VLF, LF, HF) + SVM Diversified extracted Complex preprocessing 94.8
(Jafari, 2013) features
Frequency
Normal inverse Gaussian (bands) + The rational transfer TQWT is 87.3
Adaptive boosting (Hassan, 2016) function of TQWT computationally
makes it easier to be expensive because of
specified in the the rational transfer
frequency domain function
Spectral entropies + K-NN (Sharma and Efficient preprocessing Difficult selection of 87.5
Sharma, 2020) VMD parameters
HRV, EDR + Least-square SVM (Atri and The bi-spectral analysis Abundant preprocessing 95.5
Mohebbi, 2015) provides
non-Gaussianity for
features generality
Time and R-amplitudes, CWT + Discriminant classifier Effective decomposition Feature selection 89.0
time–frequency (Mendez et al., 2010) problem
Time, frequency, Linear/non-linear, TFR + Custom network Enables remote Computational cost 91.0
time–frequency (Shao et al., 2022) diagnosis of OSA
Spectrogram/scalogram fusion + Lightweight network The challenging 92.4
Lightweight network (Niroshana et al., trade-off between time
2021) and frequency
Time–frequency localizations and lots of
data augmentation is
used
Spectrogram and scalogram + Modified A robust model for Segmenting the ECG 85.2 HomePap
ResNet18 (Nasifoglu and Erogul, 2021) different dataset record as 30 s events and ABC
for Pre-OSA, OSA, and
Non-OSA are
challenging
Spectrogram + OSACN-Net (Gupta et al., Gabor Transform is The challenging 94.8 Apnea-ECG
2022) used to capture trade-off between time
relevant features of and frequency
OSA localizations and lots of
data augmentation is
used
Spectrogram + SVM, KNN (Lin et al., 2022) Different frequency Feature engineering is 91.4 Apnea-ECG
bands and time used, which is less and
segments are accurate than DL/CNN NCKUHSCAD
considered to best suit
the problem
Scalogram + Scalogram-based CNN Lightweight CNN Complex preprocessing 94.3
(Mashrur et al., 2021) Apnea-ECG
RMSSD, SDNN, VLF, LF/HF, etc. + Abundant of features Expensive preprocessing 87.6
Modified-LeNet-5 (Wang et al., 2019)
R-interpolated signal, magnification, Region - Features Requires preprocessing 98.0
of interest + DCDA-Net (Proposed method) diversification and assembly of two
- High accuracy of OSA CNNs and a SVM
classification (increasing resource
consumption).
RMSSD: Root mean square of successive differences, SDNN: Standard deviation of the normal-to-normal interval, VLF: Very low frequency, LF: Low frequency, HF: High frequency,
VMD: Variational mode decomposition, and TQWT: Tunable-Q factor wavelet transform.
for the sake of understanding a visible difference between the raw ECG denoised by moving averaging, as shown in Fig. 2(b). After removing
and denoised, we present segments of 7 s (700 samples) in Fig. 2 (a and the high-frequency noise, the amplitude of the denoised segment is
b). There is high-frequency noise in the raw segment (Fig. 2(a)) that is reduced slightly because moving averaging results in smoothing the
5
Fig. 2. Preprocessed ECG signal by denoising: (a) raw ECG signal, (b) denoised signal.
Fig. 3. Spectrogram representation: (a) Spectrogram of raw ECG segment, (b) Spectrogram of denoised ECG segment.
sharp variations. However, this is not a problem because it is con- Where P is the length of the window, 𝐸𝐶𝐺 [𝑘] is the ECG segment, 𝑊 𝐹
stantly applied throughout the denoising process such that the relative is the window function, and TECG[m, n] is the transformed ECG signal
variations remain the same. by STFT, which creates an RGB-colored spectrogram.
After denoising the segment, we transformed it into a spectrogram
using STFT (STFT, 2022). After trying different combinations of pa- 3.3. Scalogram-based classification
rameters, we selected the optimal STFT parameters (Blackman window
with 640 ms length and 600 ms overlap) to most accurately classify After the spectrograms, we evaluated our proposed CNN using
apnea and normal cases with our training data. The STFT has a very scalograms after significant preprocessing, which is discussed in the
short and fixed window to transform the signal (translating it), which following subsections. We considered the frequency range of the in-
can only be adjusted at the start (Eq. (1)). Fig. 3 shows that the power
terpolated R-signal based on apnea and normal occurrence ranges to
of the high-frequency noise components that exist in (a) are filtered
allow the CNN to extract the most relevant features. An explanation is
out in (b). Furthermore, the spectrogram of the denoised segment has
provided in the following subsections.
sharper intensity variations, helping a CNN to extract the features and
classify them accurately.
3.3.1. Preprocessing and acquisition of magnified R-signal
𝑃∑
−1
−j2𝜋𝑛𝑘∕𝑃 The layout for acquiring the magnified R-signal is shown in Fig. 4.
𝑇 𝐸𝐶𝐺[𝑚, 𝑛] = 𝐸𝐶𝐺 [𝑘] 𝑊 𝐹 [𝑘 − 𝑚] 𝑒 (1)
We manually detected the QRS complex’s R-peaks to generate the
𝑘=0
6
Fig. 4. Layout for acquiring the magnified R-signal (block in Fig. 1).
magnified R-signal. We interpolated them by linear interpolation, as the signal weakness, we considered the relatively major components in
shown in Fig. 5 (a1 and a2), and obtained another signal that varied those cases.
with respect to the R-peak amplitudes. We corrected the baseline to ( ) ( )
4𝜋𝑛 2𝜋𝑛
𝑊 𝐹 (𝑛) = 0.08 − 0.5 cos + 0.42 (2)
analyze it with zero as a reference, as shown in Fig. 5 (b1 and b2). 𝑃 −1 𝑃 −1
The R-interpolated signal is a low variational signal that shows how Where WF is the window function, and P is the window length.
the R-peaks of the ECG segment vary for normal and apneic episodes ∞ ( )
1 𝑡−𝜏
with respect to time. 𝑊𝑥 (𝑎, 𝜏) = √ 𝑥 (𝑡) 𝜓 ∗ 𝑑𝑡 (3)
We analyzed the ECG segments and found that during the apneic 𝑎 ∫−∞ 𝑎
period, a very low-frequency respiratory variations (due to blockage Where 𝑊𝑥 (𝑎, 𝜏) is the wavelet coefficient, 𝑎 is the scaling factor used
of the airway) affect R-peaks of the ECG segment and thus affect to scale the width of the conjugated mother wavelet 𝜓 ∗ while trans-
the overall rhythmicity of the R-interpolated signal by low-frequency forming the input ECG segment 𝑥 (𝑡), and 𝜏 is the time parameter
variations as shown in Fig. 5 (b1). In the case of normal, the respiratory that translates the wavelet window over the input signal. After getting
variations interfere with the ECG signals in a higher frequency band, the scalograms, we binarized them by component labeling method to
and no low-frequency variations affect the R-peaks, as depicted in Fig. 5 consider the major region only and used it as a mask for the original
(b2). Therefore, low-frequency versions of the R-interpolated signals scalogram. The complete process for only the apnea scalogram is shown
indicate that apnea has lower variations as depicted in Fig. 5 (e1 and in Fig. 7. To make the learning of the CNN model more accurate and
e2). According to Huang et al. apnea occurs in the frequency range efficient, we eliminated all the small energies of the scalogram.
of 0.01–0.03 Hz of the R-interpolated signal and the normal occurs in For that, we labeled all the regions in the scalogram, as shown in
the range of 0.15–0.3 Hz (Huang et al., 2012). Therefore, we analyzed Fig. 7(c), and took only the region of interest (ROI) (major region) of
the ECG segments by fast Fourier transform, as shown in Fig. 5 (c1 the scalogram and component-labeled image, as shown in Fig. 7(b)
& c2), and found the highest frequency amplitudes for apnea and and (d). We multiplied both ROIs to mask the scalogram ROI, as
normal, which falls in the above-mentioned frequency bands. Thus, we shown in (e) of Fig. 7. Finally, we resized the masked ROI to 299-by-
considered only the frequency band ranging from 0 to 0.35 Hz, which 299 in Fig. 7(f) to provide input to the CNN. As the apnea episodes
contains both the apnea and normal ranges as depicted in Fig. 5 (d1 occur independently (not on a fixed time) in one minute segment, we
& d2). Fig. 5 (d1 and d2) is constructed by taking the small range (0– considered the whole time axis for the scalogram, as shown in Fig. 7(f).
0.3 Hz) of Fig. 5 (c1 and c2). The apnea has the highest amplitude in Apnea and normal are differentiated by the frequency ranges in the
the 0.01–0.03 Hz frequency range, and the highest amplitude of normal scalograms.
occurs in the range of 0.15–0.3 Hz.
After targeting specific low-frequency components, we transformed 3.4. DCDA-Net
back that frequency range to the time domain using an inverse-Fourier
transform (IFFT) to see how only those frequency components con- After preparing the input image of size 299 × 299 × 3, we designed a
tributed to the R-interpolated signal, as shown in Fig. 5 (e1 and e2). CNN model named dual-convolutional dual-attention network (DCDA-
The signal after IFFT is the same signal as the R-interpolated signal but Net), which contains the inception block modified by dual convolution
contains its low frequencies only ranging from 0 to 0.35 Hz. That is followed by spatial attention and a squeeze and excitation block, as
the range in which the apnea and normal frequencies occur in the R- shown in Fig. 8. A convolutional neural network (CNN) is an ad-
interpolated signal. Subsequently, we scaled the R-interpolated signal vanced deep learning tool that extracts features by convolving the input
by the low-frequency time signal to magnify the information (in the R- feature map with convolutional kernels to learn weights in order to
interpolated signal) relevant to the above-mentioned frequency range, help predict the test data. The CNN structure and feature extraction
as encircled in Fig. 5 (f1 and f2). We call this the magnified R-signal. mode depend on the arrangement of different layers. The selection
and arrangement of layers depend on the input data distribution. Our
3.3.2. Transformation to scalogram and ROI detection model comprises three main blocks: (a) a modified inception block,
After obtaining the magnified R-signal, we transformed it into a (b) a spatial attention block, and (c) a squeeze and excitation block.
scalogram using the Morse wavelet function (Eq. (2)) for the CWT, as We designed our proposed model such that it extracts the correlated
shown in Eq. (3). The scalogram is a time–frequency representation features in groups by taking advantage of the grouped convolution.
(TFR) with time and frequency resolution depending on the width of The demerit of grouped convolution is poor communication among
the mother wavelet, which changes after each iteration of transfor- the output channels because each group takes features from a specific
mation by the CWT. The more the wavelet width shrinks, the finer group of input channels. Therefore, we used dual-convolution which
the time-localized feature extraction; the wider the wavelet width, the uses point-wise convolution to make the channels rich communicated.
more frequency-localized features are extracted. The scalograms for the Dual-convolution effectively extracts the relevant features in a group
magnified R-signals in Fig. 5 (e1) and (e2) are shown in Fig. 6 (a1) and and takes advantage of point-wise convolution to let the channels
(a2), respectively. We can see the major components in both scalograms communicate in order to compensate for the missing features from
are better localized in time and that the apnea range is lower. We other groups. Comprehensive details about dual convolution are given
only considered the largest contributing component in the component in Section 3.4.1. Afterward, the attention mechanisms are used to
analysis and ignored the remaining small components. Furthermore, in scale some key features and degrade some less important features such
some cases, there are only small components contributing because of that the model learns easily and quickly. Detailed explanations about
7
Fig. 5. Generation of magnified R-signal: (a1–f1) represent the apnea segment and (a2–f2) show the normal segment, (a1 & a2) Detection of R-peaks of QRS-complex and their
linear interpolation, (b1 & b2) Baseline adjustment, (c1 & c2) fast Fourier transformation (whole frequency axis), (d1 & d2) frequency range relevant to apnea and normal, (e1 &
e2) Signal by inverse fast Fourier transform. (f1 & f2) Magnified R-signal.
attention mechanisms are in Sections 3.4.2 and 3.4.3. The complete pooling (which takes the maximum value of the receptive area) in
details of our network are presented in the following paragraphs. the network takes distinctive (from the background) features to the
We used spectrograms and scalograms (both transformed from the output. In addition, we use max-pooling, where we need to extract the
same one-minute ECG segment) of size 299 × 299×3 to train our model sharp changes in pixel intensities, such as edges. We used max-pooling
and obtained scores at the soft-max layer to train the SVM. A separable to extract the above features, especially from the scalograms because
convolution is used to keep the computational cost low. Similarly, we have very distinctive features in the scalograms. The network still
stridden convolution is used to avoid feature redundancy and down- learns many redundant features during training; therefore, we used
sample the feature map quickly to allow the network to extract more a dropout layer to drop duplicate features to avoid overfitting the
detailed features with comparatively low computational cost. The max network. The details of the main parts of the network are as follows:
8
Fig. 5. (continued).
3.4.1. Modified inception block to extract multiple features with relevant features in groups. The dual-
The purpose of the modified inception block is to extract multiple Conv used in the modified inception block combines heterogeneous
and group convolution. In dual convolution, the 3 × 3 and 1 × 1
features from the input feature map and concatenate them at the
grouped convolutions convolve the input feature map simultaneously
output. Extracting multiple features is important because we have used
to group the features well and simultaneously eliminate poor chan-
STFT to make the spectrograms, and it uses fixed windows for transfor- nel communication problems (among the channels) of the grouped
mation. Thus, the features from the spectrogram do not contain various convolutions.
features. Therefore, we used an inception block and modified it by In Fig. 9, 𝑁 is the total number of filters, N/G is the number of filters
replacing the standard convolution with dual convolution (dual-Conv) per group, M denotes the number of channels of the input feature map,
9
Fig. 5. (continued).
and M/G is the number of input channels per group. First, we convolved 3 × 3 and 1 × 1 grouped convolutions moves to the next M/G input
all input channels (M) by a 1 × 1 convolutional kernel to enrich the channels to be convolved by the N/G number of filters.
communication among channels. Then we used group convolution of Similarly, the group convolutions go to the last group of the in-
size 3 × 3 with N/G number of filters per group, which convolved put channels, as shown in Fig. 9. The dual convolution reduces the
the M/G number of input channels per group. Simultaneously, the computations of the backbone networks using the group convolution
same group (M/G) of input channels is convolved by a pointwise technique because it convolves the input channels in a depth-wise
grouped convolution to fuse them for better channels’ communication. separable fashion which does not use addition operation, rather a point-
Pointwise (1 × 1) grouped convolution is used to embed the overall wise convolution is used to fuse the feature. It promotes information
context of the M/G input channels into the 3 × 3 grouped convolution sharing among convolutional layers for maximum cross-channel com-
to improve the channels’ communication further. The combination of munication by pointwise convolutions while preserving the original
10
Fig. 6. Examples of scalograms: (a) Apnea scalogram, (b) normal scalogram.
Fig. 7. Binarization for ROI: (a) apnea scalogram, (b) scalogram ROI, (c) component labeling, (d) ROI of component labeled image, (e) masked ROI, (f) resizing.
information of the input feature maps. The effectiveness of the dual- output feature map and makes its features’ distribution distinguishable.
convolution is validated in Table 6 under the ablation study. Further A sigmoid function maps the entire domain (from - ∞ to +∞) to a range
details regarding dual convolution can be found in Zhong et al. (2022). of probabilities from 0 to 1.
The spatial dimension of the input feature map remains unchanged,
3.4.2. Spatial attention module and only the number of channels is squeezed into one during spatial
Spatial attention (Zhu et al., 2019) is a concept that dominates the attention. The squeezed feature map (H × W × 1) has abstract features
key features in the input feature map by merging the overall context of all the channels, which can be used to enhance the key features
of the total channels into one channel (which has some meaning in of each spatial position of all the input channels by multiplying them
spatial dimension). Each pixel of that one-channel spatial dimension is with the corresponding positions. The magnitudes of pixel intensities
multiplied by the input feature map to provide a scaling factor to each of the same position of all channels are enhanced when the intensity of
corresponding position and thereby modulate the features throughout the relevant position of the squeezed channel (H × W × 1) multiplies.
the input feature map. The output of the prior block was used as an Suppose that we have a position 𝑥111 (where the subscript indicates the
input feature map to the spatial attention module in Fig. 10, depicted pixel position and the superscript represents the channel number) in
as a CNN feature map with height (H), width (W), and the number the squeezed feature map, which is multiplied by the corresponding
of channels (C). It is convolved by a convolutional layer with a C/2 positions in the input feature map of all channels, which can be repre-
number of filters; thus, the next of the output feature map becomes H sented as 𝑥𝑐11 . The spatial attention module dominates the same feature
× W × C/2 while keeping the spatial dimension the same, which has throughout the channels, making it easier for the network to learn
the overall context of C channels. Another convolutional layer with one the dominant features efficiently and quickly. Spatial attention focuses
filter is used to merge all channels into one and multiply it after passing on sharp intensity variations in the image. For example, if the image
it from the sigmoid function. The sigmoid function is used here because dataset has very deep features, but the features are not distinguishable
it allows some negative values (from the domain) to map (in range) of well, researchers use spatial attention to provide extra weights to the
the sigmoid function and dominate the positive values. It enhances the key features, making them distinguishable spatially. In our problem,
11
Fig. 8. Proposed network diagram (DCDA-Net): (Conv: convolution, ReLU: Rectified linear unit, BN: Batch normalization, FC: fully connected, GAP: global average pooling,
Concatenation, multiplication).
Fig. 9. Construction of dual convolution: M is the number of channels of the input feature map, G is the number of groups in grouped convolution, and 𝑁 is the number of
convolutional filters.
spectrograms have deep features but do not have distinguishable spatial each channel to a scalar value and then multiplies it to the entire spa-
features; therefore, we used spatial attention to provide additional tial dimension of the corresponding channels. Global average pooling
attention to the desired features. The spatial attention module is used (GAP) is used to squeeze the spatial dimension of each channel to 1,
in the network before the squeeze and excitation (SE) module because which represents the abstract context of the entire spatial dimension.
it needs some more spatial size (as compared to the SE module) to Furthermore, a fully connected layer (FC) with a reduced number of
learn useful features and make the network self-attentive to useful channels (reduced by the r-factor) was used to reduce the computa-
features. Thus, it effectively extracted key features and helped the tional cost. Another fully connected layer (FC2) was considered to make
overall network to classify the problem well, the effectiveness of this the number of channels the same as that of the CNN feature map. The
spatial dimension of each channel becomes a distinct scalar value, as
module is obvious in Table 6 under the ablation study.
shown by the sigmoid function in Fig. 11. Each scalar value of the
squeezed feature map is multiplied by the entire spatial dimension of
3.4.3. Squeeze and excitation module the corresponding channel of the CNN feature map, further enhancing
The squeeze-and-excitation module (Hu et al., 2018) is used as the the corresponding channel by scaling the dominating features. The
channel attention mechanism. This further enhances the key features squeeze and excitation module is used when the image dataset has
by channel-wise scaling. It squeezes the entire spatial dimension of deep and distinct features, but the original context of the features
12
Fig. 10. Spatial attention module.
Fig. 11. Squeeze and excitation module.
vanishes after some convolutional operations. Thus, researchers use the takes two training scores to optimize the weights (𝑤1 and 𝑤2 in Eqs. (4)
squeeze and excitation module to regain and enhance those features and (5)), and use those weights for test samples to predict it finally.
depth-wise. We used this module for the scalograms to regain and In the weighted product, the optimized weights are in the exponent
enhance the original context. In scalograms, we have a distinct region; (to make the boundary line nonlinear) followed by the product as
however, after some convolutional transformations, the intensities of shown in Eqs. (4) and (5) (Mateo, 2012). Additionally, we also apply
some channels lose distinction. To compensate, we used the SE module SVM (Vapnik, 1998) to the probability score. SVM takes these scores
in more depth (more number of channels) to get more number of (two true values) of both the spectrograms and scalograms as two
channels with distinct and effective features that help the model to inputs and uses a kernel to cluster the input data efficiently to well fit
learn and classify the problem intelligently without going very deeper. the test scores. Subsequently, score-level fusion was performed on the
Table 6 under the ablation study validates the existence of the SE two scores using SVM (Eq. (6)) [45]. Eqs. (7)–(10) show various SVM
module in our proposed DCDA-Net model. kernels compared with the experiment conducted using the training
Overall, while considering our classification problem, we proposed data, and the radial basis function (RBF) kernel was selected as the
a novel preprocessing method to enhance the differentiation between
optimal one.
the classes and used the proposed model DCDA-Net to classify the
data efficiently by grouping communication-rich relevant features using Weighted Sum = 𝑤1 𝑠𝑖 + 𝑤2 𝑠𝑗 (4)
dual-convolution technique and highlighting the features further using 𝑤 𝑤
Weighted Product = 𝑠𝑖 1 𝑠𝑗 2 (5)
attention mechanisms to let the model learns the key features quickly
to detect the apnea efficiently. ∑
𝑘
( )
𝑓 (𝑠) = 𝑠𝑖𝑔𝑛( 𝑝𝑖 𝑞𝑖 𝐾 𝑠𝑖 , 𝑠𝑗 + 𝑟) (6)
𝑖=1
3.5. Score-level fusion using weighted sum, weighted product, and SVM
( )
The scores (probabilistic values) for spectrograms and scalograms Linear kernel ∶ 𝐾 𝑠𝑖 , 𝑠𝑗 = 𝑠𝑇𝑖 𝑠𝑗 (7)
( ) 2
were obtained from the soft-max layer of the two DCDA-Nets, as shown RBF kernel ∶ 𝐾 𝑠𝑖 , 𝑠𝑗 = 𝑒𝑥𝑝−𝛾‖𝑠𝑖 −𝑠𝑗 ‖ (8)
in Fig. 1. For each spectrogram (or scalogram), DCDA-Net provided two ( )
prediction values, true and false. For example, if an apnea spectrogram Polynomial kernel ∶ 𝐾 𝑠𝑖 , 𝑠𝑗 = (𝛾(𝑠𝑇𝑖 𝑠𝑗 ) + 𝑐𝑜𝑒𝑓 )𝑑𝑒𝑔𝑟𝑒𝑒 (9)
( ) ( )
predicts 0.95 as true (apnea), the false (normal) value would be 0.05. Sigmoid kernel ∶ 𝐾 𝑠𝑖 , 𝑠𝑗 = 𝑡𝑎𝑛ℎ(𝛾 𝑠𝑇𝑖 𝑠𝑗 + 𝑐𝑜𝑒𝑓 ) (10)
We only considered the true values e.g., true positives and true nega-
tives to the input of all the score fusion methods, which generate the In the above equations, 𝑠𝑖 denotes the score (probability value)
same results if we consider the counterpart of each true value. Thus, for DCDA-Net in the spectrogram-based classification and 𝑠𝑗 denotes
we omit the clustering of false cases to keep the computations low. the score for DCDA-Net in the scalogram-based classification shown in
The weighted sum is a post-processing method that fuses the scores of Fig. 1. 𝑤1 and 𝑤2 are the weights to be optimized on training data and
ensemble models to make a final decision about the test data sample. It used for test samples. 𝑝𝑖 , 𝑞𝑖 , r, and 𝛾 represent the hyperparameters. The
13
Fig. 12. Training and validation accuracy/loss vs. epochs graphs: (a) Training accuracy/loss vs. epochs graphs. (b) Validation accuracy/loss vs. epochs graphs.
Table 2
Summarized dataset description for each fold of 10-fold cross-validation.
Apnea-ECG dataset Total samples Training samples Validation samples Test samples
Normal 20,000 14,400 3,600 2,000
Apnea 13,060 9,403 2,351 1,306
optimal SVM kernel and hyperparameters were obtained from training records from ×01 to ×35. The summary of the dataset description is
data to achieve the highest classification accuracy. shown in Table 2. Furthermore, we considered subject-independent
validation to make our model robust and invariant to inter-subject
4. Results and discussion variance.
4.1. Experimental setup 4.3. Data augmentation
The architectural implementation, training, and testing of the After transforming all ECG records from the time domain to time–
proposed DCDA-Net were executed using MATLAB 2021b (MATLAB frequency representations (spectrograms and scalograms), we obtained
R2021b, 2022) using a desktop computer with an Intel® Core™ i7- a total of 20,000 representations of normal and 13,060 representations
3770K CPU and NVIDIA GTX 1070 (GeForce GTX 1070, 2022), and of apnea. We used 10-fold cross-validation to evaluate our model by
RAM of 16 GB. We developed our model from scratch, trained it with 30 splitting the total data into ten folders. Each folder of 10 contains 2000
epochs and 16 mini-batch sizes, and used the Adam optimizer (Kingma normal and 1306 apnea images. Each folder was tested using our pro-
and Ba, 2017) to optimize the weights with an initial learning rate posed model after being trained using the remaining nine folders. The
of 0.001. We shuffled the data for each epoch in our experiments. remaining images (18,000 normal and 11,754 apnea) are distributed
As shown in Fig. 12(a), the training accuracy and loss converged, as; 80% training set contained 14,400 normal and 9403 apnea images
confirming that our model was sufficiently trained with the training and 20% validation set contained 3600 normal and 2351 apnea. To
data. Furthermore, we validate the proposed model during training by address the data imbalance problem, we oversampled random images
randomly taking 20% of the training data as validation data to allow in the apnea class in the training folder to make the apnea class
the model to train well and predict the test data efficiently during 18,000 and prevent the model from overfitting. More training data are
testing. As shown in Fig. 12(b), the validation accuracy and loss also required to allow the model to fit the data well; thus, we used online
converged, confirming that our model did not overfit the training data. data augmentation of rotation (−8 to +8 pixels), random shearing (−5
to +5 degrees), random horizontal translation (−30 to +30 pixels),
4.2. Dataset and random horizontal flipping. We avoided vertical translation and
flipping in the scalograms because the vertical axis of the scalogram
We used the well-known PhysioNet apnea ECG dataset (Penzel et al., is the frequency axis. This process was repeated to test each fold once
2000). The dataset contains 70 recordings of ECG (single lead) signals while keeping the remaining nine folders as training.
from healthy subjects, and each record is split into one-minute seg-
ments to annotate as apnea or normal by the dataset provider (Penzel 4.4. Metrics used
et al., 2000). Therefore, we used these ECG instances for training and
testing in deep learning model. There were two sets of the recording The model proposed here predicted a scalogram or a spectrogram
released set and a withheld set, each comprising 35 subjects. The as apnea or a normal (non-apnea) based on some differences in the
duration of each signal lies in the range of 420–600 min. This dataset data distribution of apnea and normal samples, then compared it to the
contains recordings from men and women aged 27–63 years, weighing expert annotations and created a matrix called the confusion matrix,
53–135 kg. The sampling rate and resolution of the signals during which contains true positive (TP), false negative (FN), false positive
the recording were set to 100 Hz and 16-bit of the signals during (FP), and true negative (TN) samples. True positive is the case when
recording, respectively. Men and women contributed to the dataset in a a model predicts an expert annotated positive sample as positive, and
25:7 ratio containing both OSA and healthy subjects (Niroshana et al., false negative is when a model predicts an expert annotated positive
2021). Records were obtained twice from some subjects to complete the sample as negative. False positive is when a model predicts the negative
released set of 35 records. The released set included records from a01 to annotated sample as positive. True negative is when a model predicts
a20, b01 to b05, and c01 to c10. Similarly, the withheld set contained a negative annotated sample as negative.
14
The evaluation metrics used here were accuracy, precision, recall, Table 3
Results by the proposed model with or without spectrogram-based and scalogram-based
specificity, and the F1 score. Accuracy is the rate of true cases among
classifications of Fig. 1 (unit: %).
the entire dataset (accuracy can be expressed by Eq. (11)), which shows
TFRs Accuracy Precision Recall Specificity F1 score
how many cases are predicted as true by the model. Precision is an
Spectrograms (STFT) 93.1 93.8 88.5 96.0 91.0
evaluation measure expressing the model’s ability to detect positive
Scalograms (CWT) 93.8 99.3 84.5 99.6 91.3
samples among the true positives and false positives (FP) (negatives Stockwell transform 81.0 86.5 59.5 94.2 70.5
that look positive) (expressed by Eq. (12)). The recall is also called ((Stockwell et al.,
the model’s sensitivity or true positive rate, which shows how a model 1996))
is sensitive to the positive samples (diseased samples) in the positive
domain. A lower value means that the model predicts many positive Table 4
cases as negative. Thus, a subject with the disease would probably be Comparative score-level fusion by weighted sum, weighted product, and SVM methods
excluded from further diagnoses, which results in many complications using the scores of spectrograms (𝑤1 ) and scalogram (𝑤2 ) (weighted sum weights:
later on (recall can be expressed by Eq. (13)). It has a low value, usually 𝑤1 =0.23, 𝑤2 =0.77; weighted product weights: 𝑤1 =0.23, 𝑤2 =0.77).
because the positive class is smaller than the negative in most binary Scores fusion methods Accuracy Precision Recall Specificity F1 score
classification problems. Although, some data imbalance techniques as- Weighted product 92.7 95.4 85.5 97.1 89.9
sist the minor class, but still the major class contains more samples with Weighted sum 94.6 96.1 90.0 97.5 92.8
SVM 98.0 97.4 97.7 98.2 97.5
various features and thus, better predicts it. Nevertheless, the recall
must be sufficiently high to avoid any carelessness about a patient.
Specificity also called the true negative rate, shows how accurately Table 5
a model predicts the negative class (expressed by Eq. (14)). In the case Results by the proposed model with or without the preprocessing of scalogram-based
classification of Fig. 1 (unit: %).
of a data imbalance problem, a model usually predicts the negative
Preprocessing Accuracy Precision Recall Specificity F1 score
class better than the positive class owing to overfitting. The F1 score
is the harmonic mean of precision and recall (expressed by Eq. (15)). Without magnified R-signal 92.2 99.9 79.8 99.9 88.7
With magnified R-signal 93.8 99.3 84.5 99.6 91.3
There is a need to calculate the F1 score; because the best accuracy
does not mean that the network is efficient while testing the model,
However, their accuracies were low and there was room for enhancing
most of the datasets have much bigger benign classes in the test folder
them. Therefore, we performed the fusion of two scores from CNNs us-
than malignant classes. Usually, the model predicts the benign class
well because of its greater variety of data samples (even if there is ing scalograms and spectrograms by weighted sum, weighted product,
augmentation or oversampling in the malignant class, but there is a and SVM as shown in Table 4. By comparing Tables 3 and 4, the score-
lack of variety compared to the benign class). Accuracy gives more pro- level fusion by SVM shows the higher accuracy than other score-level
portional weight to the benign class. It increases even if the malignant fusions and the end-to-end training of CNN model using scalograms and
class is low, which makes the model inefficient and can only predict the spectrograms. In previous researches, this kind of score-level fusion by
negative class well and poorly predicts the positive class. F1 overcomes SVM has been also adopted, which enhanced the accuracy compared
this issue to a great extent, as shown in Eq. (15), which considers the to that without score-level fusion by SVM. For example, Vetrekar
positive predictive value and true positive rate and ignores the true et al. (2023) obtained scores from eight bands of spectral band images
negative rate. and combined them using SVM to detect artificially ripened banana.
TP + TN Similarly, Kim et al. (2022) took probability scores from SoftMax layer
Accuracy = (11) of the DenseNet-161 and DenseNet-169 models and combined them
TP + TN + FN + FP
using SVM to improve the detection performance of spoof detection.
TP Table 5 is based on our contribution to the paper regarding prepro-
Precision = (12)
TP + FP cessing (acquiring a magnified R-signal in this case). The contribution is
TP
Recall = (13) to obtain the magnified R signal, whose frequency spectrum shows that
TP + FN
TN the apnea lies more in the lower frequencies than that of normal and
Specificity = (14) contains most of the energy in the corresponding bands. An obvious
TN + FP
2TP overview of the difference between the normal and magnified scalo-
F1 score = (15)
2TP + FN + FP grams is listed in Table 5. The reason might be the distinct frequency
bands for apnea and normal, which the normal scalograms do not
4.5. Ablation study of the proposed model
have explicitly. Subsequently, we transformed it to a scalogram only
and did not consider the above preprocessing (magnified R-signal) for
A summarized evaluation of the DCDA-Net, using the scalograms
spectrograms because the magnified R-signal has very few specified
and spectrograms individually, is listed in Table 3, and we consid-
features. Therefore, a CWT uses the variable width of the translating
ered the normal spectrograms and magnified scalograms. To obtain
window (after each iteration). It can still extract many features. How-
the scalogram, we used CWT using Morse wavelet function. We also
ever, STFT uses a fixed window to transform the ECG segment; thus,
evaluated our proposed model using a different time–frequency based
a spectrogram cannot extract various features from the magnified R-
method such as Stockwell transform (Stockwell et al., 1996) with
Gaussian window for ablation study. As shown in Table 3, the accuracy signal. It is evident from Table 5 that the accuracy of the model is
by Stockwell transform was lower than that by our spectrograms or improved by the proposed preprocessing.
scalograms. Moreover, we fused the scores of the scalograms and The main contributions of the network design are the implemen-
spectrograms by applying weighted sum, weighted product, and SVM. tation of dual-Conv (instead of the standard Conv in the inception
The SVM-based method has the best results among them as summarized module), spatial attention, and squeeze and excitation modules. There-
in Table 4. Score-level fusion using an SVM resulted in a substantial fore, we evaluated the various combinations in Table 6 to determine
improvement in model accuracy. An SVM learns the training scores (of how our model behaves. We removed one or two module(s) at a
both the scalograms and spectrograms) and generalizes the problem time (details are given in Table 6) from the DCDA-Net and evaluated
efficiently to improve the accuracy and F1 score further. them using the same protocols mentioned above. Thus, we confirmed
Originally, we performed the end-to-end training of CNN model that each module contributed to the overall accuracy of the proposed
using scalograms and spectrograms individually as shown in Table 3. network.
15
Table 6 there is no object, instead, there is pseudo-colored time and fre-

Results by the proposed model with or without the modules of DCDA-Net of Fig. 8
quency localized information. Therefore, our proposed method,
(unit: %).
Magnification/ROI +DCDA-Net, outperformed all SOTA methods
DCDA-Net modules Accuracy Precision Recall Specificity F1 score
including MaxVit-T.
Dual- SA SE
Conv module module
× O O 97.6 96.8 97.3 97.8 97.0 4.7. Visual results of the proposed method
O × × 97.7 97.0 97.2 98.0 97.1
O O × 97.8 97.7 96.5 98.5 97.1 Visual observations for apnea and normal in both the time and
O × O 97.5 97.6 96.3 98.3 96.6
O O O 98.0 97.4 97.7 98.2 97.5
time–frequency domains are shown in Figs. 13 and 14, respectively.
An obvious difference in the time domain is the formation of a com-
paratively low-frequency interpolated R-peak signal, which was even
more obvious when we targeted a specific frequency range for apnea
4.6. Comparison with state-of-the-art methods and normal, as discussed in d1 and d2 in Fig. 5. During an apnea, when
the airway pauses for at least 10 s and then opens, one cycle completes
Comparisons of our proposed method with state-of-the-art (SOTA) in a large time, and a very low-frequency respiratory signal interferes
methods are mainly based on accuracy, sensitivity, and specificity.
with the ECG signals in such a way that the distances among some R-
Table 7 summarizes that the SOTA methods based on 1D CNN (relative
peaks are different from some other R-peaks, while the distances among
to time–frequency representation-TFR) have low results, such as the
the same group of R-peaks might be the same, which directly affects the
following methods: SRE +SVM (Viswabhargav et al., 2019), EDR/HBI
frequency of the R-peak signal, such as bringing it to a lower frequency,
+SVM (Singh et al., 2020), QRS +LS-SVM (Sharma and Sharma, 2016),
etc., having accuracies of 78.1%, 82.4%, 83.8%, etc. Some 1D methods as depicted in column (a) of Fig. 13. In contrast, the ECG of normal
even outperformed some of the SOTA methods based on TFR, such is also affected by respirational signals, but with normal rhythms of
as R-R interval +1D CNN (Sharan et al., 2020), R-peaks/R-R interval respiration; hence, the frequency is relatively higher than the apnea
+SE-MSCNN (Chen et al., 2022), and Frequency bands +SVM (Jafari, case, as shown in column (a) of Fig. 14. Feature extraction from the
2013), with accuracies of 88.2%, 90.6%, and 94.8%, respectively. Most time domain alone is usually not intuitive, and it is difficult for models
of the SOTA methods that detect OSA very well are based on either TFR to rely on one-dimensional information. Therefore, we used scalograms
or mixed time +TFR methods, such as CWT/STFT +CNN (Niroshana and spectrograms (TFRs) to train our network on detailed features, to
et al., 2021), SGS +OSACN-Net (Gupta et al., 2022), Scalogram +SCNN improve the detection accuracy of the problem.
(Mashrur et al., 2021), with accuracies of 92.4%, 94.8%, and 94.3%, To make the distinction between apnea and normal more obvious,
respectively. We also evaluated our preprocessed data using Multi- our proposed method of acquiring the R-magnified signal and trans-
Axis Vision Transformer-Tiny (MaxViT-T) (Tu et al., 2022) with an forming it into a scalogram is presented here. Column (b) of Figs. 13
accuracy of 78.8%. Some possible reasons for degraded performance and 14 demonstrates that the energy in the apnea cases is in the lower
using MaxViT-T could be as follows: frequency range as compared to normal, which makes the two classes
– The MaxViT-T has 31M parameters and the Physionet apnea- distinct to allow the network to classify the problem easily (note that
ECG dataset has around 30K training instances only. Therefore, the vertical axis of both the scalogram and spectrogram is the frequency
the MaxViT-T deeply over-learns such a huge amount of param- in the log scale because such low frequencies could not be analyzed on
eters using comparatively fewer data with low variance. Thus, it a normal scale). The frequency range for apnea is from 0.01 to 0.03 Hz,
possibly traps by overfitting. and the normal range is from 0.15 to 0.3 Hz. The spectrograms in
– The parameters of MaxVit-T have been trained on the ImageNet column (c) of Figs. 13 and 14 for apnea and normal, respectively, also
database. ImageNet database’s samples have some original in- contain most of the energy in the lower frequency range for apnea and
tensity objects to be classified or segmented, while in our case a relatively high range for normal.
Table 7
Performance comparisons of the proposed method with state-of-the-art methods per segment (unit: %).
Methods Accuracy Sensitivity Specificity F1 score
FAEMD + SVM (Tripathy et al., 2020) 79.0 78.7 79.4 79.5
CWT/STFT + CNN (Niroshana et al., 2021) 92.4 92.3 92.6 90.6
SGS + OSACN-Net (Gupta et al., 2022) 94.8 94.5 94.9 94.7
SRE + SVM (Viswabhargav et al., 2019) 78.1 78.0 78.1 –
EDR/HBI + SVM (Singh et al., 2020) 82.4 79.7 – –
QRS + LS-SVM (Sharma and Sharma, 2016) 83.8 79.5 88.4 –
Z-score Normalization + 1D CNN (Chang et al., 2020) 87.9 81.1 92.0 –
HRV/EDR + ANN (Pombo et al., 2020) 82.1 88.4 72.2 –
NIG bands + Adaptive boosting ((Hassan, 2016)) 87.3 81.9 90.7 –
Spectral entropy/IQR/Energy/SD of EDR & R-R interval + 87.5 84.9 88.2
K-NN (Sharma and Sharma, 2020)
HRV/TFR + Bi-LSTM/SqueezeNet (Shao et al., 2022) 91.5 91.0 91.9
CWT/STFT + Modified ResNet18 (Nasifoglu and Erogul, 85.2 86.2 85.0
2021)
CWT + SVM/KNN(Lin et al., 2022) 91.4 89.8 92.4
HMM + SVM (Song et al., 2016) 86.2 82.6 88.4
EDR/HMM + SAE/DNN (Li et al., 2018) 84.7 88.9 82.1 –
CWT + DNN (Singh and Majumder, 2019) 86.2 90.0 83.8 –
R-R interval + 1D CNN (Sharan et al., 2020) 88.2 82.7 91.6
RR/QRS + LS-SVM (Varon et al., 2015) 84.7 84.7 84.7 –
R-peaks/R-R interval + SE-MSCNN (Chen et al., 2022) 90.6 86.0 93.5 –
Frequency bands + SVM (Jafari, 2013) 94.8 94.1 95.4
Scalogram + SCNN (Mashrur et al., 2021) 94.3 94.3 94.5 –
MaxViT-T (Tu et al., 2022) 78.8 67.0 86.1 70.7
Proposed method 98.0 97.7 98.2 97.5
16
Fig. 13. Visual results for true positive (TP) cases: (a) ECG segment labeled as apnea, (b) corresponding scalograms predicted as apnea by the proposed model, and (c) corresponding
spectrograms predicted as apnea by the proposed model.
Fig. 14. Visual results for true negative (TN) cases: (a) ECG segments labeled as normal, (b) corresponding scalograms predicted as normal by the proposed model, and (c)
corresponding spectrograms predicted as normal by the proposed model.
17
Fig. 15. Visual results for false negative (FN) cases by our proposed model: (a) shows the apnea annotated ECG segments, (b) shows the corresponding proposed scalograms
predicted as normal, and (c) shows the corresponding spectrograms predicted as normal.
4.8. Discussion the aforementioned technique was used to make the classes distinctive
and improve the prediction accuracy, while spectrograms were used to
Our model is robust to this problem and allows energy to change provide detailed features about the apnea. We proposed a CNN model
in the designated distinct frequency ranges for apnea and normal. DCDA-Net to extract the effective features in groups while communicat-
However, it still fails to predict some samples with ambiguous fre- ing to compensate for the missing features in each group. The spatial
quency ranges for energy concentration. The false-negative cases are attention module enhanced the key features by scaling them spatially,
shown in Fig. 15. Observing and understanding the reason behind false and the SE module emphasized the key features along channels by
cases in the time domain is difficult. Nevertheless, it is clear from our modulating them along the depth. The effectiveness of DCDA-Net can
proposed scalogram transformation (from the magnified R signal) that be described in terms of the network’s main components e.g., dual-
these samples have energy in the overlapped frequency ranges of apnea convolution, spatial attention module, and SE module. Specifically,
and normal. Therefore, the model learns it as normal, whereas it is DCDA-Net extracted well-communicated grouped features followed by
annotated as apnea. Similarly, in the case of normal prediction, a few channel mechanisms to let the model focus on the key features that are
cases were predicted to be apnea (FP). In this case, the frequency bands necessary for efficient classification. The proposed model was evaluated
were also shared between apnea and normal conditions, as shown in on scalograms and spectrograms individually, and then the score fusion
Fig. 16. methods were applied to finally classify the test samples. The SVM out-
Additional limitation of our proposed method is that it requires performed the weighted sum and weighted product. According to our
preprocessing (obtaining the interpolated R signal, magnifying it, and understanding from the literature study about sleep apnea, DCDA-Net
converting it to a scalogram) and assembly of two CNNs and a SVM outperformed the state-of-the-art methods.
(increasing resource consumption). In the future, we plan to extend our approach to other physiological
signals, e.g., EEG for motor imagery classification, to compensate for
5. Conclusion physiological disabilities by modifying our proposed model to preserve
sequential and contextual information for motor imaging tasks. As this
This study focuses on the TFR instead of processing ECG segments in information is very important in sequential signals. We would also add
the time domain. We targeted the R-peaks of the ECG signals affected some preprocessing mechanisms to sift the time sequence for noise and
by the apnea condition and obtained an interpolated signal of those redundancy and improve the SNR of the sequence in order to extract the
R-peaks. Then, we applied some magnification to the R-interpolated task features easily and outperform the existing methods. We evaluated
signal in the time domain to see where the most energy lied and trans- the proposed method for apnea classification, and could not evaluate
formed it into the scalogram. After performing some image processing the apnea hypopnea index for each subject based on the predicted class
on the magnified scalogram, we observed a gap between the frequency labels because the dataset used in this study does not have hypopnea
ranges of apnea and normal, in which most of the energy was laid annotations. In future work, we will also detect hypopnea using other
making the predictions easy for the model. The scalogram obtained by datasets and calculate the apnea-hypopnea index.
18
Fig. 16. Visual results for false positive (FP) cases by our proposed model: (a) shows the normal annotated ECG segments, (b) shows the corresponding proposed scalograms
predicted as apnea, and (c) shows the corresponding spectrograms predicted as apnea.
CRediT authorship contribution statement Arsalan, M., Haider, A., Lee, Y.W., Park, K.R., 2022. Detecting retinal vasculature
as a key biomarker for deep learning-based intelligent screening and analysis
of diabetic and hypertensive retinopathy. Expert Syst. Appl. 200, 117009. http:
Nadeem Ullah: Methodology, Writing – original draft. Tahir Mah-
//dx.doi.org/10.1016/j.eswa.2022.117009.
mood: Conceptualization. Seung Gu Kim: Data curation. Se Hyun Atri, R., Mohebbi, M., 2015. Obstructive sleep apnea detection using spectrum and
Nam: Data curation. Haseeb Sultan: Investigation. Kang Ryoung bispectrum analysis of single-lead ECG signal. Physiol. Meas. 36, 80. http://dx.doi.
Park: Supervision, Writing – review & editing. org/10.1088/0967-3334/36/9/1963.
Bahrami, M., Forouzanfar, M., 2021. Detection of sleep apnea from single-lead
ECG: Comparison of deep learning algorithms. In: Proceedings of IEEE Interna-
Declaration of competing interest tional Symposium on Medical Measurements and Applications. MeMeA, Lausanne,
Switzerland, 23–25 June 2021, pp. 1–5. http://dx.doi.org/10.1109/MeMeA52024.
The authors declare that they have no known competing finan- 2021.9478745.
cial interests or personal relationships that could have appeared to Chang, H.-Y., Yeh, C.-Y., Lee, C.-T., Lin, C.-C., 2020. A sleep apnea detection
system based on a one-dimensional deep convolution neural network model
influence the work reported in this paper. using single-lead electrocardiogram. Sensors 20 (4157), http://dx.doi.org/10.3390/
s20154157.
Data availability Chen, X., Chen, Y., Ma, W., Fan, X., Li, Y., 2022. Toward sleep apnea detection
with lightweight multi-scaled fusion network. Knowl.-Based Syst. 247, 108783.
http://dx.doi.org/10.1016/j.knosys.2022.108783.
Data will be made available on request. Cheng, W.X., Gao, R., Suganthan, P.N., Yuen, K.F., 2022. EEG-based emotion recog-
nition using random convolutional neural networks. Eng. Appl. Artif. Intell. 116,
Acknowledgments 105349. http://dx.doi.org/10.1016/j.engappai.2022.105349.
Cheng, M., Sori, W.J., Jiang, F., Khan, A., Liu, S., 2017. Recurrent neural network
based classification of ECG signal features for obstruction of sleep apnea detection.
This research was supported in part by the National Research Foun- In: Proceedings of IEEE International Conference on Computational Science and
dation of Korea (NRF) funded by the Ministry of Science and ICT (MSIT) Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous
through the Basic Science Research Program (NRF-2021R1F1A104 Computing. EUC, 21–24 July 2017, pp. 199–202. http://dx.doi.org/10.1109/CSE-
5587), in part by the NRF, Republic of Korea funded by the MSIT EUC.2017.220.
DCDA-Net with algorithms, 2023. Available online: https://github.com/dguispr/
through the Basic Science Research Program (NRF-2022R1F1A
Obstructive-Sleep-Apnea-OSA-diagnosis.git. (Accessed 1 February 2023).
1064291), and in part by the MSIT, Korea, under the ITRC (Infor- Ernst, G., Mariani, J., Blanco, M., Finn, B., Salvado, A., Borsini, E., 2019. Increase in
mation Technology Research Center) support program (IITP-2023- the frequency of obstructive sleep apnea in elderly people. Sleep Sci. 12, 222–226.
2020-0-01789) supervised by the IITP (Institute for Information & http://dx.doi.org/10.5935/1984-0063.20190081.
Communications Technology Planning & Evaluation). Fatimah, B., Singh, P., Singhal, A., Pachori, R.B., 2020. Detection of apnea events
from ECG segments using Fourier decomposition method. Biomed. Signal Process.
Control 61, 102005. http://dx.doi.org/10.1016/j.bspc.2020.102005.
References GeForce GTX 1070, 2022. Available online: https://www.nvidia.com/en-gb/geforce/
products/10series/geforce-gtx-1070/. (Accessed 15 March 2022).
Ali, S.Q., Khalid, S., Brahim Belhaouari, S., 2019. A novel technique to diagnose sleep Gokhale, M., Mohanty, S.K., Ojha, A., 2023. OSACN-Net: GeneViT: Gene Vision
apnea in suspected patients using their ECG data. IEEE Access 7, 35184–35194. Transformer with Improved DeepInsight for cancer classification. Comput. Biol.
http://dx.doi.org/10.1109/ACCESS.2019.2904601. Med. 155, 106643. http://dx.doi.org/10.1016/j.compbiomed.2023.106643.
19
Gupta, K., Bajaj, V., Ansari, I.A., 2022. OSACN-Net: Automated classification of sleep based on empirical mode decomposition and wavelet analysis. Physiol. Meas. 31,
apnea using deep learning model and smoothed gabor spectrograms of ECG signal. 273–289. http://dx.doi.org/10.1088/0967-3334/31/3/001.
IEEE Trans. Instrum. Meas. 71, 1–9. http://dx.doi.org/10.1109/TIM.2021.3132072. Meng, L., Li, H., Chen, B.C., Lan, S., Wu, Z., Jiang, Y.G., Lim, S.N., 2022. AdaViT:
Gutiérrez-Tobal, G.C., Alonso-Álvarez, M.L., Álvarez, D., del Campo, F., Terán-Santos, J., Adaptive vision transformers for efficient image recognition. In: Proceedings of
Hornero, R., 2015. Diagnosis of pediatric obstructive sleep apnea: Preliminary IEEE Conference on Computer Vision and Pattern Recognition. CVPR, New Orleans,
findings using automatic analysis of airflow and oximetry recordings obtained at Louisiana, 19–24 June 2022, pp. 12309–12318. http://dx.doi.org/10.48550/arXiv.
patients’ home. Biomed. Signal Process. Control 18, 401–407. http://dx.doi.org/10. 2111.15668.
1016/j.bspc.2015.02.014. Mostafa, S.S., Mendonça, F., G. Ravelo-García, A., Morgado-Dias, F., 2019. A systematic
review of detecting sleep apnea using deep learning. Sensors 19 (4934), http:
Gutiérrez-Tobal, G.C., Álvarez, D., Crespo, A., del Campo, F., Hornero, R., 2019. //dx.doi.org/10.3390/s19224934.
Evaluation of machine-learning approaches to estimate sleep apnea severity from
Nasifoglu, H., Erogul, O., 2021. Obstructive sleep apnea prediction from electrocardio-
at-home oximetry recordings. IEEE J. Biomed. Health Inf. 23, 882–892. http:
gram scalograms and spectrograms using convolutional neural networks. Physiol.
//dx.doi.org/10.1109/JBHI.2018.2823384.
Meas. http://dx.doi.org/10.1088/1361-6579/ac0a9c.
Gutiérrez-Tobal, G.C., Álvarez, D., Marcos, J.V., del Campo, F., Hornero, R., 2013. Niroshana, S.M.I., Zhu, X., Nakamura, K., Chen, W., 2021. A fused-image-based
Pattern recognition in airflow recordings to assist in the sleep apnoea–hypopnoea approach to detect obstructive sleep apnea using a single-lead ECG and a 2D
syndrome diagnosis. Med. Biol. Eng. Comput. 51, 1367–1380. http://dx.doi.org/ convolutional neural network. PLoS One 16, e0250618. http://dx.doi.org/10.1371/
10.1007/s11517-013-1109-7. journal.pone.0250618.
Haider, A., Arsalan, M., Lee, M.B., Owais, M., Mahmood, T., Sultan, H., Park, K.R., Penzel, T., Moody, G., Mark, R., Goldberger, A., Peter, J., 2000. The apnea-ECG
2022. Artificial intelligence-based computer-aided diagnosis of glaucoma using database. Comput. Cardiol. 27, 255–258.
retinal fundus images. Expert Syst. Appl. 207, 117968. http://dx.doi.org/10.1016/ Pombo, N., Silva, B.M.C., Pinho, A.M., Garcia, N., 2020. Classifier precision analysis
j.eswa.2022.117968. for sleep apnea detection using ECG signals. IEEE Access 8, 200477–200485.
http://dx.doi.org/10.1109/ACCESS.2020.3036024.
Hassan, A.R., 2016. Computer-aided obstructive sleep apnea detection using normal
Razi, A.P., Einalou, Z., Manthouri, M., 2021. Sleep apnea classification using random
inverse Gaussian parameters and adaptive boosting. Biomed. Signal Process. Control
forest via ECG. Sleep Vigil. 5, 141–146. http://dx.doi.org/10.1007/s41782-021-
29, 22–30. http://dx.doi.org/10.1016/j.bspc.2016.05.009.
00138-4.
Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-excitation networks. In: Proceedings of Shao, S., Han, G., Wang, T., Song, C., Yao, C., Hou, J., 2022. Obstructive sleep apnea
IEEE Conference on Computer Vision and Pattern Recognition. CVPR, Salt Lake detection scheme based on manually generated features and parallel heterogeneous
City, USA, 18–22 June 2018, pp. 7132–7714. http://dx.doi.org/10.48550/arXiv. deep learning model under IoMT. IEEE J. Biomed. Health Inform. PP, http://dx.
1709.01507. doi.org/10.1109/JBHI.2022.3166859.
Huang, T., Chen, H., Fang, W.-C., 2012. Real-time obstructive sleep apnea detection Sharan, R.V., Berkovsky, S., Xiong, H., Coiera, E., 2020. ECG-derived heart rate
based on ECG derived respiration signal. In: Proceedings of IEEE International variability interpolation and 1-D convolutional neural networks for detecting sleep
Symposium on Circuits and Systems. ISCAS, Seoul, Korea (South). 20–23 May 2012, apnea. In: 2020 42nd Annual International Conference of the IEEE Engineering
pp. 341–344. http://dx.doi.org/10.1109/ISCAS.2012.6272031. in Medicine & Biology Society. EMBC, pp. 637–640. http://dx.doi.org/10.1109/
EMBC44109.2020.9175998.
Jafari, A., 2013. Sleep apnoea detection from ECG using features extracted from Sharma, M., Kumbhani, D., Tiwari, J., Kumar, T.S., Acharya, U.R., 2022. Automated
reconstructed phase space and frequency domain. Biomed. Signal Process. Control detection of obstructive sleep apnea in more than 8000 subjects using frequency
8, 551–558. http://dx.doi.org/10.1016/j.bspc.2013.05.007. optimized orthogonal wavelet filter bank with respiratory and oximetry signals.
Kim, S.G., Choi, J., Hong, j.S., Park, K.R., 2022. Spoof detection based on score fusion Comput. Biol. Med. 144, 105364. http://dx.doi.org/10.1016/j.compbiomed.2022.
using ensemble networks robust against adversarial attacks of fake finger-vein 105364.
images. J. King Saud Univ. Comput. Inf. Sci. 34, 9343–9362. http://dx.doi.org/ Sharma, H., Sharma, K.K., 2016. An algorithm for sleep apnea detection from single-
10.1016/j.jksuci.2022.09.012. lead ECG using Hermite basis functions. Comput. Biol. Med. 77, 116–124. http:
//dx.doi.org/10.1016/j.compbiomed.2016.08.012.
Kingma, D.P., Ba, J., 2017. Adam: A method for stochastic optimization.
Sharma, H., Sharma, K.K., 2020. Sleep apnea detection from ECG using variational
Kushida, C.A., Littner, M.R., Morgenthaler, T., Alessi, C.A., Bailey, D., Coleman, Jr., J., mode decomposition. Biomed. Phys. Eng. Express 6, 015026. http://dx.doi.org/10.
Friedman, L., Hirshkowitz, M., Kapen, S., Kramer, M., Lee-Chiong, T., Loube, D.L., 1088/2057-1976/ab68e9.
Owens, J., Pancer, J.P., Wise, M., 2005. Practice parameters for the indications for Singh, S.A., Majumder, S., 2019. A novel approach OSA detection using single-lead
polysomnography and related procedures: An update for 2005. Sleep 28, 499–523. ECG scalogram based on deep neural network. J. Mech. Med. Biol. http://dx.doi.
http://dx.doi.org/10.1093/sleep/28.4.499. org/10.1142/S021951941950026X.
Li, K., Pan, W., Li, Y., Jiang, Q., Liu, G., 2018. A method to detect sleep apnea based Singh, H., Tripathy, R.K., Pachori, R.B., 2020. Detection of sleep apnea from heart beat
on deep neural network and hidden Markov model using single-lead ECG signal. interval and ECG derived respiration signals using sliding mode singular spectrum
Neurocomputing 294, 94–101. http://dx.doi.org/10.1016/j.neucom.2018.03.011. analysis. Digit. Signal Process. 104, 102796. http://dx.doi.org/10.1016/j.dsp.2020.
102796.
Lin, C.-Y., Wang, Y.-W., Setiawan, F., Trang, N.T.H., Lin, C.-W., 2022. Sleep apnea
Song, C., Liu, K., Zhang, X., Chen, L., Xian, X., 2016. An obstructive sleep apnea
classification algorithm development using a machine-learning framework and bag-
detection approach using a discriminative hidden Markov model from ECG signals.
of-features derived from electrocardiogram spectrograms. J. Clin. Med. 11 (192),
IEEE Trans. Biomed. Eng. 63, 1532–1542. http://dx.doi.org/10.1109/TBME.2015.
http://dx.doi.org/10.3390/jcm11010192.
2498199.
Liu, H., Cui, S., Zhao, X., Cong, F., 2023. SCViT: A spatial-channel feature preserving STFT, 2022. Available online: https://en.wikipedia.org/wiki/Short-time_Fourier_
vision transformer for remote sensing image scene classification. IEEE Trans. Geosci. transform. (Accessed 15 March 2022).
Remote Sens. 60, 4409512. http://dx.doi.org/10.1109/TGRS.2022.3157671. Stockwell, R.G., Mansinha, L., Lowe, R.P., 1996. Localization of the complex spectrum:
Lv, P., Wu, W., Zhong, Y., Du, F., Zhang, L., 2023. Detection of obstructive sleep apnea The S transform. IEEE Trans. Signal Process. 44, 998–1001. http://dx.doi.org/10.
from single-channel ECG signals using a CNN-transformer architecture. Biomed. 1109/78.492555.
Signal Process. Control 82, 104581. http://dx.doi.org/10.1016/j.bspc.2023.104581. Surrel, G., Aminifar, A., Rincón, F., Murali, S., Atienza, D., 2018. Online obstructive
sleep apnea detection on medical wearable sensors. IEEE Trans. Biomed. Circuits
Macey, P.M., Li, J.S.J., Ford, R.P.K., 1998. Expert system for the detection of apnoea.
Syst. 12, 762–773. http://dx.doi.org/10.1109/TBCAS.2018.2824659.
Eng. Appl. Artif. Intell. 11, 425–438. http://dx.doi.org/10.1016/S0952-1976(98)
Tang, L., Liu, G., 2021. The novel approach of temporal dependency complexity
00007-4.
analysis of heart rate variability in obstructive sleep apnea. Comput. Biol. Med.
Mashrur, F.R., Islam, Md. S., Saha, D.K., Islam, S.M.R., Moni, M.A., 2021. SCNN: 135, 104632. http://dx.doi.org/10.1016/j.compbiomed.2021.104632.
Scalogram-based convolutional neural network to detect obstructive sleep apnea Tripathy, R.K., Gajbhiye, P., Acharya, U.R., 2020. Automated sleep apnea detection
using single-lead electrocardiogram signals. Comput. Biol. Med. 134, 104532. http: from cardio-pulmonary signal using bivariate fast and adaptive EMD coupled with
//dx.doi.org/10.1016/j.compbiomed.2021.104532. cross time–frequency analysis. Comput. Biol. Med. 120, 103769. http://dx.doi.org/
Mateo, J.R.S.C., 2012. Weighted sum method and weighted product method. In: Multi 10.1016/j.compbiomed.2020.103769.
Criteria Analysis in the Renewable Energy Industry, Vol. 82. Springer Sci. Bus. Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y., 2022. MaxViT:
Media, pp. 19–22. http://dx.doi.org/10.1007/978-1-4471-2346-0_4. Multi-axis vision transformer. In: Proceedings of European Conference on Computer
Vision. ECCV, Tel Aviv, Israel, 23–27 October, 2022, Springer, pp. 459–479.
MATLAB R2021b, 2022. Available online: https://www.mathworks.com/products/ http://dx.doi.org/10.48550/arXiv.2204.01697.
matlab.html. (Accessed 15 March 2022).
Übeyli, E.D., 2008. Support vector machines for detection of electrocardiographic
Mendez, M.O., Corthout, J., Van Huffel, S., Matteucci, M., Penzel, T., Cerutti, S., changes in partial epileptic patients. Eng. Appl. Artif. Intell. 21, 1196–1203.
Bianchi, A.M., 2010. Automatic screening of obstructive sleep apnea from the ECG http://dx.doi.org/10.1016/j.engappai.2008.03.012.
20
Van Steenkiste, T., Groenendaal, W., Deschrijver, D., Dhaene, T., 2019. Automated Wang, T., Lu, C., Shen, G., Hong, F., 2019. Sleep apnea detection from a single-lead ECG
sleep apnea detection in raw respiratory signals using long short-term memory signal with automatic feature-extraction through a modified LeNet-5 convolutional
neural networks. IEEE J. Biomed. Health Inf. 23, 2354–2364. http://dx.doi.org/10. neural network. PeerJ 7, e7731. http://dx.doi.org/10.7717/peerj.7731.
1109/JBHI.2018.2886064. Yang, Q., Zou, L., Wei, K., Liu, G., 2022. Obstructive sleep apnea detection from
Vapnik, V., 1998. Statistical Learning Theory. Wiley. single-lead electrocardiogram signals using one-dimensional squeeze-and-excitation
Varon, C., Caicedo, A., Testelmans, D., Buyse, B., Van Huffel, S., 2015. A novel algo- residual group network. Comput. Biol. Med. 140, 105124. http://dx.doi.org/10.
rithm for the automatic detection of sleep apnea from single-lead ECG. IEEE Trans. 1016/j.compbiomed.2021.105124.
Biomed. Eng. 62, 2269–2278. http://dx.doi.org/10.1109/TBME.2015.2422378. Zhong, J., Chen, J., Mian, A., 2022. DualConv: Dual convolutional kernels for
Vetrekar, N., Ramachandra, R., Gad, R.S., 2023. Multilevel fusion of multispectral lightweight deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 1–8.
images to detect the artificially ripened banana. IEEE Sens. Lett. 7, http://dx.doi.
http://dx.doi.org/10.1109/TNNLS.2022.3151138.
org/10.1109/LSENS.2022.3233464.
Zhu, X., Cheng, D., Zhang, Z., Lin, S., Dai, J., 2019. An empirical study of spatial
Viswabhargav, C., Tripathy, R.K., Acharya, U.R., 2019. Automated detection of sleep
attention mechanisms in deep networks. In: Proceedings of IEEE/CVF International
apnea using sparse residual entropy features with various dictionaries extracted
Conference on Computer Vision. ICCV, Seoul, Korea (South), 27 Oct–02 Nov 2019,
from heart rate and EDR signals. Comput. Biol. Med. 108, 20–30. http://dx.doi.
pp. 6688–6697. http://dx.doi.org/10.1109/ICCV.2019.00679.
org/10.1016/j.compbiomed.2019.03.016.
21

1 s2.0 S0952197623006358 Main

Uploaded by

Copyright:

Available Formats

1 s2.0 S0952197623006358 Main

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1 s2.0 S0952197623006358 Main

Uploaded by

Copyright:

Available Formats

Engineering Applications of Artificial Intelligence 123 (2023) 106451

Contents lists available at ScienceDirect

Engineering Applications of Artificial Intelligence

DCDA-Net: Dual-convolutional dual-attention network for obstructive sleep

ARTICLE INFO ABSTRACT

time–frequency representation. Mashrur et al. proposed a scalogram-

Spectrograms and scalograms (time–frequency representation-TFR)

Fig. 6. Examples of scalograms: (a) Apnea scalogram, (b) normal scalogram.

Fig. 10. Spatial attention module.

Fig. 11. Squeeze and excitation module.

4.1. Experimental setup 4.3. Data augmentation

Table 6 there is no object, instead, there is pseudo-colored time and fre-

You might also like