
Received November 26, 2018, accepted December 23, 2018, date of publication January 9, 2019, date of current version January 29, 2019.

Digital Object Identifier 10.1109/ACCESS.2019.2891548

Mass Spectral Substance Detections Using Long Short-Term Memory Networks

JUNXIU LIU1, (Member, IEEE), JINLEI ZHANG1, YULING LUO1, SU YANG2, JINLING WANG3, AND QIANG FU4
1 Faculty of Electronic Engineering, Guangxi Normal University, Guilin 541004, China
2 School of Computing, Engineering and Intelligent Systems, Ulster University, Londonderry BT48 7JJ, U.K.
3 School of Computing, Ulster University, Belfast BT37 0QB, U.K.
4 College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China

Corresponding author: Yuling Luo ([email protected])


This work was supported in part by the National Natural Science Foundation of China under Grant 61603104, in part by the Guangxi
Natural Science Foundation under Grant 2017GXNSFAA198180 and Grant 2016GXNSFCA380017, in part by the funding of Overseas
100 Talents Program of Guangxi Higher Education, in part by the Doctoral Research Foundation of Guangxi Normal University under
Grant 2016BQ005, and in part by the Innovation Project of Guangxi Graduate Education under Grant XYCSZ2018080.

ABSTRACT In this paper, mass spectral substance detection methods are proposed, which employ long short-term memory (LSTM) recurrent neural networks to classify mass spectrometry data and can accurately detect chemical substances. As the LSTM is excellent at retaining historical information and classifying time series data, a high detection rate is obtained on a dataset collected by a time-of-flight proton-transfer mass spectrometer. In addition, the differential operation is used as a pre-processing method to determine the start time points of the detections, which improves the detection accuracy by 123%. The Relief feature selection algorithm is also used in this paper to select the most significant channels of the mass spectrometer. It reduces the computing resource cost, and the results show that the network size is reduced by 28% and the training speed is improved by 35%. By using these two pre-processing methods, the LSTM-based substance detection system achieves a trade-off between a high detection rate and low computing resource consumption, which is beneficial to devices with constrained computing resources such as low-cost embedded hardware systems.

INDEX TERMS Mass spectral substance detections, long short-term memory networks, chemometrics.

I. INTRODUCTION

Unusual substance detections are ubiquitous in daily life and are of significant importance, especially in security and safety applications such as hazard detection, environmental monitoring, testing and identification of chemical or biological substances, analysis of inorganic, organic and biological aerosol components, and even explosives and drugs detection [1]. There are two major solutions for detecting unusual substances: i) sniffer dogs, and ii) chemical detectors such as the mass spectrometer. Although using sniffer dogs for detection has always been an effective solution [2], it still has drawbacks, such as the resource cost of training and feeding and the limited working hours. The dogs are also vulnerable to deliberate distractions and diseases. The other approach is to use a mass spectrometer to measure the chemical or physical properties of the environment and produce time-series mass spectra that describe the current conditions, where an unusual substance is represented by the time-related mass spectrometry data. Mass spectrometry data is recorded over time and contains specific patterns of temporal information. Unusual substances can be detected via anomalies in the mass spectrometry data. Detecting outliers or anomalies in data has been studied since the 19th century [3]. Anomaly detection refers to finding patterns in data that do not conform to expected behavior. These non-conforming patterns are often referred to as anomalies, outliers, discordant observations, or exceptions in different application domains [4]. Various anomaly detection techniques have been proposed for particular or general application domains [4]. However, due to missing data, irregular sampling, and different recording lengths, anomaly detection still faces some challenges. For mass spectrum applications, this problem becomes even worse since the sensors often change their properties over time, increasing the complexity of the mass spectra. Therefore, the identification of unusual substances in mass spectrometry data is investigated in this paper.


Mass spectrometry can be used to analyze a wide range of compounds. The development of different ionization techniques allows the analysis of gas, liquid and solid samples regardless of the nature of the samples (e.g. metallic, inorganic, organic, polymeric or biological) [1]. The mass spectrum is a plot of the relative abundance of each ion versus the mass-to-charge (m/z) ratio, which can be generated by the mass spectrometer. The mass spectrometer is designed to convert neutral atoms or molecules into a beam of positive or negative ions, to separate the ions on the basis of their mass-to-charge (m/z) ratio, and to measure the relative abundance of each type of ion [5]. Using the mass spectrum, both the molecular mass and the molecular formula of an unknown compound can be determined. In real applications, a small volume of substance is injected into the mass spectrometer, which then actuates the data acquisition. However, the mass spectrum data from a mass spectrometer is always affected by electronic noise, sensor drift, etc. [6]. At the same time, the mass spectrum of each elemental distribution of the substances is not completely pure and is disturbed by the presence of isotopes. Some mass spectrometers also need manual calibration after working for a period of time [6]. Hence, it is important to extract the key features from complete spectra, especially in the presence of noise and distortion. This is also beneficial to improving system integration and portability. In addition, the ion abundances measured by the mass spectrum are not ideal inputs for classifiers, since they do not correlate well with the presence of structural features in different compounds [7]. It is therefore desirable to transform the raw spectrum into a more suitable feature set. Although it might be assumed that using all the features as inputs to the neural network would give a good result, in practice this leads to a quite large neural network and unnecessarily long training time [6]. Thus, selecting and feeding the features with significant characteristics for classification to the neural network is the key to improving system performance. As a feature selection method, the Relief algorithm and its variants are known to be relatively efficient in practice; they estimate features according to how well their values distinguish among instances that are near to each other [8]. The Relief algorithm was initially proposed and used for binary classification [9]. The aim is to seek the two nearest neighbors from the same class and from a different class, which are defined as the nearest hit and the nearest miss, respectively. Based on Relief, the feature selection process is improved in the approach of Kononenko et al. [8], which is known as the ReliefF algorithm, i.e. a variant of the Relief algorithm. It can cope with incomplete and noisy data, and solve multi-class problems [10].

After the mass spectrometry data is pre-processed, it can be fed into the detection system. The artificial neural network (ANN) processes information by imitating the structure of the neural network in the brain. It is an important machine learning technique and has excellent performance for classification tasks [11]. It is composed of input, hidden and output layers, where the data transmission between layers is one-way propagation. Multi-layer feed-forward artificial neural networks (FFNNs) have been used for spectroscopy [12], where data reduction, robust regression, and instrumental drifts were also considered. However, the ANNs and other types of feed-forward neural networks such as FFCC [13] also have some constraints [14]. For example, it is a challenge to design an ANN with appropriate size and structure, and this should be based on three aspects: the complexity of the solution, the desired prediction accuracy, and the data characteristics [15]. For the first two aspects, i.e. precision and complexity, good performance can be obtained by using the FFNNs. However, for the data characteristics, Recurrent Neural Networks (RNNs) are found to be more suitable than FFNNs [16]. FFNNs face over-fitting and convergence problems, and are difficult to apply to time series data analysis [17]. In the conventional neural network model, the flow of information is from the input to the hidden layers and then to the output layer. The pre- and post-layers are fully connected, and there are no connections between the neurons in the same layer. This neural network architecture cannot achieve a good performance for some applications such as time series data processing tasks [11]. Compared to the conventional neural network, the RNN adds a weighted sum of the hidden layer with the previous input when calculating the output of the hidden layer. Therefore the input of the hidden layer includes not only the output of the pre-layer, but also the previous output of the hidden layer. This introduces a feedback mechanism in the hidden layer to learn context-related information, which can effectively process sequence data (e.g. time series). The RNN has been applied to many applications, such as action recognition [18] and multilingual machine translation [19]. As it is capable of dealing with time series data, this work aims to investigate and design substance detection systems based on RNNs. As a type of RNN, Elman networks use simplified derivative calculations but have some drawbacks for reliable learning. Recent research shows that long short-term memory recurrent neural networks (LSTMs) [20] achieve a better performance than Elman networks. The main contributions of this work are as follows:

(a). Novel substance detection methods are proposed, which are based on the LSTMs. A good detection performance for time series mass spectrometry data and a low computing resource cost are achieved.

(b). By using the differential operation and the ReliefF algorithm, the classification accuracy, the speed and the required computing resources are improved.

(c). Results demonstrate that a detection accuracy of 81.81% is achieved by one of the proposed substance detection systems, where the dimension of the raw dataset is significantly reduced from 270 to 50, the training speed of the neural network is increased by 35% and the network size is reduced by 28.46%.


The remainder of this paper is organized as follows. Section II provides the motivation and related works. Section III describes the differential operation, the ReliefF algorithm and the proposed substance detection system. Section IV provides the results and performance analysis. Section V concludes the paper.

II. MOTIVATION AND PREVIOUS STUDIES

Various learning tasks in practice require dealing with sequential data [21]. Sequence classification is closely related to the sequential supervised learning problem, which is different from the classical supervised learning problem [22]. The sequence imposes an order on the observations which must be preserved during model training and decision-making. Most of the existing research on detecting anomalies in discrete sequences focuses on one of the following three problem formulations [23]: (a) sequence-based anomaly detection, i.e. detecting anomalous sequences from a database of test sequences; (b) contiguous subsequence-based anomaly detection, i.e. detecting an anomalous contiguous subsequence within a long sequence; (c) pattern frequency-based anomaly detection, i.e. detecting patterns in a test sequence with an anomalous frequency of occurrence. These formulations are fundamentally different, and hence require distinct solutions. Using neural networks to solve problems in chemical application domains has been proposed and implemented in spectroscopy (mass, infrared, nuclear magnetic resonance, ultraviolet), structure/activity relationships, etc. [24]. Results show advantages of high precision and low computing complexity [25]. Therefore in this paper, we focus on substance detection using time-series mass spectrometry data.

In previous research, neural networks have been used in the field of mass spectra [6], [26], where good performance is achieved. The FFNNs can classify low-resolution mass spectra of unknown compounds [7]. A method for identification of the structural features of compounds from mass spectrometry data is proposed in the approach of Eghbaldar et al. [27], which uses an optimized artificial neural network. Ion mobility spectra have been successfully classified with neural networks [28], using a combination of drift times and the number, intensity and shape of peaks. Based on the selection of the relevant input data, an optimized ANN model is used to analyze instrumentation spectra, where the reduction of the input dimension improves the robustness of the model [29]. However, all these aforementioned networks are based on the feed-forward structure with slight variations, and they are not suitable for temporally correlated data. In addition to the conventional neural networks, deep neural networks (DNNs) have also been applied to spectral data. Deep learning [30] is a method which can extract features directly from original data. The deep belief network, one of the deep learning methods, has been used to predict molecular substructure in mass spectral data [31]. DNNs can approximate arbitrary nonlinear functions, which overcomes the limitations of classical linear methods and is beneficial in time series processing [32]. However, the DNNs can only be applied to problems whose inputs and targets are encoded with fixed dimensional vectors. For sequence classification, the input sequence of the DNN is required to be divided into small overlapping sub-sequences. The time steps of the input sequence become features to the network, and the sub-sequences overlap to simulate a window along the sequence. The limitations include (a) the size of the sliding window is fixed and must be imposed on all inputs, and (b) the size of the output is also fixed. The DNN has capability for sequence classification but still suffers from this key limitation, i.e. the scope of temporal dependence between observations needs to be specified before the model development. This is a constraint, as many problems are expressed by sequences whose lengths (dimensions) are unknown in advance [14]. For time-series mass spectrometry data, observing only one mass spectrum at a specific time (i.e. point anomaly detection) makes it difficult to classify the substances. Time-series data has been extensively investigated with contextual anomaly detection strategies [33]–[35]. Compared to the point anomaly detection techniques, contextual anomaly detection can achieve a better performance [4]. The RNN is one contextual anomaly detection method, which is able to exploit a dynamically changing contextual window over the input sequence history [36]. Furthermore, the LSTMs can solve many time series tasks which are impossible for FFNNs with fixed time window sizes [37]. The LSTM networks do not need a pre-defined time window and are capable of accurately modeling complex multivariate sequences [38].

In summary, the ANN-based models are widely used in the chemical application domains, especially for analyzing spectra/structure correlations. However, contextual anomaly detection in this domain needs to be further investigated. Recent research shows that the RNN is a powerful and practical tool for supervised learning from sequences [21]. One key challenge of the RNNs is how to train the networks effectively, e.g. how to avoid the vanishing and exploding gradients. The LSTM overcomes this challenge [39], thus it is employed in this approach. In the meantime, the computational complexity of an anomaly detection technique should also be considered, especially when it is deployed to devices with limited computing resources [4]. Therefore, an LSTM-based substance classification system is proposed in this paper, where challenges such as improving the detection rate and reducing the computing resource overhead are addressed. This work is a contiguous subsequence-based, multi-class anomaly detection system [23], which aims at high accuracy, low overhead, fast response and sensitivity. It is described in detail in the following sections.


FIGURE 1. Mass spectrum of dopamine.

III. THE LSTM-BASED MASS SPECTRAL SUBSTANCE DETECTIONS

The chemometric community aims to analyze instrumentation spectra efficiently and accurately, and to overcome noise and instrumental drifts [29]. Chemometrics uses mathematics, statistics and formal logic to design and select optimal experimental procedures, to provide maximum relevant chemical information by analyzing chemical data, and to obtain knowledge about chemical systems [40]. It is important to discriminate the composition of a substance by estimating the material activity ratio in its spectra. If the redundant and irrelevant elements of the spectra are not completely separated, they will interfere with the determination of the substance type. Therefore, the overlapping of peaks makes the analysis of spectra and the interpretation of results difficult. In addition, in order to detect the substance types quickly and accurately, the training speed of the neural network needs to be improved and the system computing overhead needs to be reduced. In the following subsections, the mass spectrum and its characteristics are explained, then the proposed systems using the differential operation, the ReliefF algorithm and the LSTM are presented.
A. MASS SPECTRUM

The quality of the mass spectrum obtained on a given instrument is highly dependent on the purity of the mass spectrum and the condition of the mass spectrometer. The relationships between the mass spectrum and resolution, the presence of isotopes, and the fragmentation of molecules and molecular ions are crucial in understanding a mass spectrum, and will be discussed in the following subsections.
The mass spectrum has different resolutions. Low-resolution mass spectrometry refers to instruments which are capable of distinguishing among ions of different nominal masses [5]. High-resolution mass spectrometry measures the precise mass of each compound, which can help distinguish compounds of the same nominal mass. For example, compounds with the molecular formulas C3H6O and C3H8O have nominal masses of 58 and 60, respectively, and can be distinguished by low-resolution mass spectrometry. However, the compounds C3H8O and C2H4O2 have the same nominal mass of 60 and cannot be distinguished by a low-resolution mass spectrometer; high-resolution mass spectrometry is needed in that case. The presence of isotopes also has an effect on identifying the compound type. Fig. 1 is the mass spectrum of dopamine (C8H11NO2), where the nominal mass of the molecular ion appears at m/z 153. But there is a small peak at m/z 154, which comes from an ion 1 atomic mass unit (amu) heavier than the molecular ion of dopamine, and corresponds to the presence in the ion of a single heavier isotope of H, C, N, or O.
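As a small illustration of the nominal-mass arithmetic above (our own sketch, not from the paper; the helper nominal_mass and the isotope table are assumptions), integer isotope masses can be summed from a parsed formula:

```python
import re

# Nominal (integer) masses of the most abundant isotopes.
NOMINAL = {"C": 12, "H": 1, "N": 14, "O": 16}

def nominal_mass(formula):
    """Sum integer isotope masses, e.g. 'C3H6O' -> 58."""
    total = 0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element:
            total += NOMINAL[element] * (int(count) if count else 1)
    return total

print(nominal_mass("C3H6O"))    # 58
print(nominal_mass("C3H8O"))    # 60
print(nominal_mass("C2H4O2"))   # 60, same as C3H8O at nominal resolution
print(nominal_mass("C8H11NO2")) # 153, the dopamine molecular ion
```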
It is common to use electrons with energies of 70 eV (approximately 6750 kJ/mol) for electron ionization. This energy is sufficient to dislodge one or more electrons from a molecule and to cause extensive fragmentation [5]. These fragments may be unstable and, in turn, break apart into even smaller fragments.
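The approximate figure of 6750 kJ/mol follows from the standard conversion factor of about 96.485 kJ/mol per eV (our own arithmetic check, not part of the original text):

\[ 70\,\mathrm{eV} \times 96.485\,\frac{\mathrm{kJ\,mol^{-1}}}{\mathrm{eV}} \approx 6754\,\mathrm{kJ\,mol^{-1}} \approx 6750\,\mathrm{kJ\,mol^{-1}} \]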
The molecular ions of some compounds have a sufficiently long lifetime in the analyzing chamber. They are observed in the mass spectrum, sometimes as the base (most intense) peaks. Molecular ions of other compounds have a shorter lifetime and are present in low abundance or not at all. As a result, the mass spectrum of an ionized compound consists of a peak for the molecular ion and a series of peaks for fragment ions. In a mass spectrum, the peak resulting from the most abundant cation is defined as the base peak, and it is assigned an arbitrary intensity of 100. The relative abundances of all other cations in a mass spectrum are reported as percentages of the base peak.
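A minimal sketch of this base-peak convention (our own illustration; the example intensities are made up): every intensity is scaled so that the most abundant ion reads 100.

```python
import numpy as np

# One spectrum: intensity per m/z channel (values are made up).
spectrum = np.array([3.0, 120.0, 40.0, 6.0])

# Scale so the base peak (most abundant cation) has intensity 100.
relative_abundance = 100.0 * spectrum / spectrum.max()
print(relative_abundance)  # [  2.5  100.   33.33   5. ]
```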


FIGURE 2. Mass spectrum of shower gel.

Based on the aforementioned three aspects, for a realistic mass spectrum, the resolution of the mass spectrometer used for the measurement can be high or low, and both the number of isotopes contained in the elements of the measured substance and the distribution of the fragmentation affect the analysis. Fig. 2 shows real mass spectrometry data of shower gel, where the mass-to-charge ratio ranges from 1 to 270 (i.e. a total of 270 features). How the optimal analysis results can be achieved when analyzing such similar mass spectra is a challenge, which will be addressed in the next subsection.

B. DIFFERENTIAL OPERATION

The differential operation can eliminate noise interference and highlight signal changes. In this approach, it is described by

x_d(n) = x(n) − x(n − 1)    (1)

where x_d(n) is the difference signal, and x(n) and x(n − 1) are the raw spectrum data at time steps n and n − 1, respectively.

When a mass spectrometer is used for detection, a substance can appear regularly or randomly. Once it is detected, the mass spectrometer produces a period of time-related mass spectrometry data. These data can be clearly observed after the differential operation, and the time point when the substance appears can be quickly located. Therefore, the differential operation is the first data processing step in this approach. In addition, the raw mass spectrometry data has significant magnitude differences, which is not conducive to neural network training. The post-processed data is more uniform, which is beneficial for training the neural networks. By using the differential operation, the data fed into the neural network becomes more concise and contains less interference.
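As a concrete illustration of (1), a minimal sketch (our own, not from the paper; the array shapes and variable names are assumptions) that differences a whole recording along the time axis:

```python
import numpy as np

def differentiate(raw):
    """First-order difference along time, Eq. (1): x_d(n) = x(n) - x(n-1).

    raw: array of shape (T, 270), one row per time step.
    Returns an array of shape (T-1, 270).
    """
    return np.diff(raw, n=1, axis=0)

# Example with random stand-in data (58,500 steps, 270 channels).
raw = np.random.rand(58500, 270)
diff = differentiate(raw)
```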
C. FEATURE SELECTION OF THE MASS SPECTRUM

The raw mass spectrum has 270 features. Through the ReliefF algorithm, it is eventually reduced to 50 features. The specific operational process is as follows. Suppose there are K category tags in a given dataset; the training dataset is defined as D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i ∈ R^p and y_i ∈ R^K represent the feature set and the class label space of the classified samples, respectively. If the sample x_i belongs to class k, then y_i(k) = 1, otherwise y_i(k) = 0. Thus the p × n feature matrix X = [x_1, x_2, ..., x_n] and the K × n label matrix Y = [y_1, y_2, ..., y_n] constitute the classification sample set D. According to the ReliefF algorithm, the features of the raw mass spectrum are rearranged. The ReliefF algorithm chooses an instance R_i randomly and seeks k of its nearest hits H_j and nearest misses M_j(C), respectively. The basic idea of ReliefF is to assign a weight to each feature in the feature set of the classification samples, and then iteratively update the weights. Then, feature subsets are selected according to the feature weights, so that the selected features separate different samples and aggregate similar samples. Finally, according to the features rearranged by the ReliefF algorithm, the top 50 most heavily weighted features are selected as the final input of the neural network.

FIGURE 3. Mass spectrum channel selection by using the ReliefF algorithm for the DSTL spectrum dataset.

For the DSTL spectrum dataset, an example of the data processed by the ReliefF algorithm is shown in Fig. 3. The x-axis corresponds to the time step, and the y-axis is the actual measured value of the mass spectrum. The details of the DSTL spectrum dataset will be provided in Section IV. The processed dataset is ultimately used as input to the LSTM-based substance classification system. The ReliefF algorithm is used in this approach because not all mass spectrometer channels are needed. It can find the channels that have high correlations with the output result. Mass spectrometer channels with weak correlations (e.g. isotope-induced) can be removed. This also reduces the computing overhead of the classification systems.
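The paper does not list its ReliefF implementation; the following simplified sketch (our own, assuming roughly balanced classes and an L1 distance) follows the weight-update idea just described (random instance R_i, k nearest hits and misses) and keeps the 50 highest-weighted channels:

```python
import numpy as np

def relieff_weights(X, y, n_iter=200, k=10):
    """Simplified ReliefF: reward features that separate classes.

    X: (n_samples, n_features), y: (n_samples,) integer labels.
    A larger weight means the feature better distinguishes nearby
    instances of different classes. Assumes balanced classes.
    """
    rng = np.random.default_rng(0)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)  # L1 distance to all samples
        dist[i] = np.inf                     # exclude the instance itself
        order = np.argsort(dist)
        hits = [j for j in order if y[j] == y[i]][:k]
        misses = [j for j in order if y[j] != y[i]][:k]
        # Penalize separation from same-class neighbours,
        # reward separation from other-class neighbours.
        w -= np.abs(X[hits] - X[i]).mean(axis=0)
        w += np.abs(X[misses] - X[i]).mean(axis=0)
    return w

# Keep the 50 highest-weighted of the 270 channels.
# X, y = load_dstl()  # hypothetical loader, not part of the paper
# top50 = np.argsort(relieff_weights(X, y))[::-1][:50]
# X_reduced = X[:, top50]
```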
D. RECURRENT NEURAL NETWORK

One problem of the RNN is the vanishing (or exploding) gradient problem. In this section, the architecture of RNNs and the gradient vanishing problem are briefly discussed. Then the LSTM, which can address this problem, is introduced.


FIGURE 4. The architecture of RNNs [16].


FIGURE 5. The architecture of LSTM [20].

1) RECURRENT NEURAL NETWORKS

The RNN is the extension of the conventional FFNNs in the time scale. Assume an input sequence, a hidden state sequence and an output vector sequence denoted by x, h and y, respectively. RNNs combine the input vector with the previous state vector to produce a new state vector. The hidden state h_t is described as

h_t = f(U x_t + W h_{t−1} + b_1)    (2)

where U and W are the weights for the connections from the input layer to the hidden layer and from the hidden layer to the hidden layer, respectively. The hidden state h_t, equivalent to a memory container, captures information from all the previous time steps. Similar to the FFNNs, the output vector of the RNN, y_t, is described as

y_t = g(V h_t + b_2)    (3)

where V is the weight matrix for the connections from the hidden layer to the output layer. In (2) and (3), f and g are activation functions that squash the dot products to a specific range. The function f is usually tanh or ReLU, and g can also be a softmax. The biases b_1 and b_2 help offset the outputs from the origin. Fig. 4 shows a common topology of RNNs.
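Equations (2) and (3) translate directly into the following sketch (our own illustration; the layer sizes and random weights are placeholders, and g is taken as the identity here):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b1, b2):
    """One recurrent step, Eqs. (2)-(3)."""
    h_t = np.tanh(U @ x_t + W @ h_prev + b1)  # f = tanh
    y_t = V @ h_t + b2                        # g = identity; often softmax
    return h_t, y_t

# 50 input channels (after ReliefF); hidden size 32 is an assumed example.
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(32, 50)), rng.normal(size=(32, 32)), rng.normal(size=(5, 32))
b1, b2 = np.zeros(32), np.zeros(5)
h = np.zeros(32)
for x_t in rng.normal(size=(10, 50)):  # unroll over a 10-step window
    h, y = rnn_step(x_t, h, U, W, V, b1, b2)
```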
The conventional RNNs use Back Propagation Through Time (BPTT) to handle a variable-length sequence input [41]. There are two widely known issues during the training of RNNs, the vanishing and the exploding gradient problems [42], [43]. Unfortunately, the range of contextual information that standard RNNs can access is quite limited in practice [44]. Similar to the RNN, the LSTM has recurrent connections, so that the state from previous activations of the neuron in the previous time step is used as context for formulating an output. But unlike other RNNs, the LSTM has a unique formulation that allows it to avoid the two aforementioned problems.

2) LONG SHORT-TERM MEMORY

The LSTM is an architecture which was first proposed by Hochreiter and Schmidhuber [39] and refined by many other researchers. Fig. 5 shows a single LSTM cell.

For each time step t, x_t is the input to the memory cell layer, σ is the logistic sigmoid function, and i_t, f_t and o_t are the values of the input, forget and output gates, respectively. They are described by

i_t = σ(W_xi x_t + W_hi h_{t−1} + b_i)    (4)
f_t = σ(W_xf x_t + W_hf h_{t−1} + b_f)    (5)
o_t = σ(W_xo x_t + W_ho h_{t−1} + b_o)    (6)

In the LSTM, these three gates control the information flow. The input gate decides which values will be updated. The forget gate defines how much of the previous state h_{t−1} is allowed to pass through, and the output gate defines how much of the internal state is exposed to the next layer. The candidate value g_t is computed from the current input x_t and the previous hidden state h_{t−1}. The key to the LSTM is the cell state c_t, as i_t, f_t and g_t interact with c_t. The g_t and c_t are described by

g_t = tanh(W_xg x_t + W_hg h_{t−1} + b_g)    (7)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t    (8)

where ⊙ denotes element-wise multiplication, and W_xi, W_xf, W_xo and W_xg are the weights for the connections from the input data to the input gate, forget gate, output gate and candidate value. Similarly, W_hi, W_hf, W_ho and W_hg are the connection weights from the hidden layer (at the previous time step) to the input gate, forget gate, output gate and candidate value (at the current time step), and b_i, b_f, b_o and b_g are the corresponding biases.

Finally, the hidden state h_t at time t is computed by multiplying tanh(c_t) with the output gate:

h_t = o_t ⊙ tanh(c_t)    (9)

In the LSTM, the three gates (input, forget, output) are used to solve the vanishing and exploding gradient problems, and the recurrent hidden layer of the conventional RNN is replaced by the LSTM cell.
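For reference, the gate equations (4) to (9) map line by line onto the following sketch (our own; the parameter dictionary p and all shapes are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM cell step, Eqs. (4)-(9); p holds the weight matrices and biases."""
    i_t = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["bi"])  # (4) input gate
    f_t = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["bf"])  # (5) forget gate
    o_t = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["bo"])  # (6) output gate
    g_t = np.tanh(p["Wxg"] @ x_t + p["Whg"] @ h_prev + p["bg"])  # (7) candidate
    c_t = f_t * c_prev + i_t * g_t                               # (8) cell state
    h_t = o_t * np.tanh(c_t)                                     # (9) hidden state
    return h_t, c_t
```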
IV. EXPERIMENTAL RESULTS

The performance of the proposed LSTM-based substance classification systems is analyzed and discussed in this section. The experimental environment is the Anaconda platform using Python, with an i3-6100 CPU @ 3.70 GHz and 12.0 GB RAM. To demonstrate that the proposed classification systems can be easily applied to spectrometry data, the DSTL spectrum dataset [2] is used in the experiment.


FIGURE 6. Mass spectrum data of brewed coffee.


FIGURE 7. Rapid detection mechanism where the spectrum data is shown from time step 360 to 450.

It is a mass spectrometry dataset collected by using a highly sensitive time-of-flight proton-transfer mass spectrometer. At various intervals, a number of different substances were introduced to the sensor, which gradually changes its properties over time, at different distances and strengths. The strength was manually marked as weak, medium or strong. The mass spectrometry data within a time period represents a substance, and each substance has a unique profile. The dataset has a total of 58,500 data samples, where 20,000 samples are used for testing and the rest for training, i.e. the ratio of the training set to the testing set is about 7:3. It has various degrees of complexity and different levels of adulteration with anomalous substances. Fig. 6 shows an example of the mass spectrometry data of brewed coffee between time steps 1,540 and 1,590. The color codes represent the features (i.e. channels) of the mass spectrometer. The main advantages of this work include accurately detecting the presence of substances, reducing the system computing resource overhead, and handling multiple time-related mass spectrometry data.
A. RAPID DETECTION MECHANISM

In order to classify the substances, the first step is to detect whether a substance is present at the mass spectrometer. To distinguish the substances from the background samples, a rapid detection mechanism is critical. In this work, the differential operation is used to detect whether a substance is present. If it is present, the spectrum data shows an intense change which indicates the starting time point of the substance. In this experiment, as the substance locations in the DSTL spectrum dataset are randomly distributed, it is crucial to locate these substances. Selecting a suitable period over which to observe the data is a challenge: if the selected period is not appropriate, there is a risk that the data of a substance will be cut apart. However, the proposed rapid detection mechanism is suitable for either periodic or non-periodic distributions. Comparing the two examples in the DSTL spectrum dataset (time steps from 360 to 400 and from 400 to 440) in Fig. 7, it can be seen that severe changes in amplitude correspond to the presence of a substance, and the amplitude change is very small if no substance is present. By using the differential operation, when a substance is present, the rapid detection mechanism ensures that the location of the substance is clearly and accurately detected.
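The paper does not give the exact decision rule; one plausible sketch of such an onset detector (the threshold value and names are our assumptions) sums the magnitude of the differenced signal over all channels and flags sharp changes:

```python
import numpy as np

def detect_onsets(raw, threshold=5.0):
    """Flag time steps where the differenced spectrum changes sharply.

    raw: (T, channels). Returns indices n where the total absolute
    change between steps n-1 and n exceeds the threshold.
    """
    change = np.abs(np.diff(raw, axis=0)).sum(axis=1)  # Eq. (1) magnitude
    return np.where(change > threshold)[0] + 1

# onsets = detect_onsets(spectra)  # candidate start points of substances
```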
TABLE 1. Detection rates and training times of the LSTMs with different window sizes.


TABLE 2. Different classification systems with and without ReliefF algorithm and differential operation.

TABLE 3. Detection rates of the LSTM substance classification system.

TABLE 4. Detection rates of the R-LSTM substance classification system.

B. OPTIMAL WINDOW SIZE FOR THE LSTM-BASED SUBSTANCE CLASSIFICATION SYSTEMS

In the LSTM used in this approach, the recurrent connections add state, or memory, to the network and allow it to learn and harness the ordered nature of the observations within input sequences. Due to the recurrent connections, the states from previous activations of the neurons are used as context for formulating an output. The LSTM has internal states, is explicitly aware of the temporal structure in the inputs, and is able to model multiple parallel input series separately. It possesses memory which can overcome the issues of long-term temporal dependency within input sequences. Therefore, the number of memory cells (i.e. the window size in this work) should be determined, as different window sizes produce different results. Table 1 shows the detection rates and training times of the LSTM-based substance classification systems with different window sizes. It can be seen that the lowest detection rate, 78.78%, is obtained when the window size is 5 time steps, and the detection rates are the same (81.81%) for the other window sizes. Table 1 also provides the training times, where the training time of the LSTM with a window size of 2 time steps is used as the baseline. For example, if the window size is 5 time steps, the training time is shorter, i.e. 74.03% of the baseline. The LSTM with a window size of 10 time steps achieves the lowest training time, 67.8% of the baseline. If the window size continues to increase, the training time increases again due to the intensive computation of a large number of memory cells. Thus the window size of 10 time steps is optimal in this experiment due to its high detection rate and low training time.
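The training code is not listed in the paper; a minimal Keras-style sketch using the optimal window of 10 time steps might look as follows (the number of LSTM units, the class count and the helper names are our assumptions):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

WINDOW = 10   # optimal window size found in Table 1
N_FEAT = 50   # channels kept by ReliefF

def make_windows(x, labels, window=WINDOW):
    """Slice a (T, N_FEAT) series into overlapping windows for the LSTM."""
    xs = np.stack([x[i:i + window] for i in range(len(x) - window)])
    ys = labels[window:]
    return xs, ys

model = Sequential([
    LSTM(64, input_shape=(WINDOW, N_FEAT)),  # 64 units: an assumed size
    Dense(8, activation="softmax"),          # 8 classes: an assumed count
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(*make_windows(x_train, y_train), epochs=20, batch_size=128)
```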
using the ReliefF algorithm. Table 4 shows the result of the
C. DETECTION RATES OF THE LSTM-BASED SUBSTANCE CLASSIFICATION SYSTEMS

In this experiment, the detection rates of different classification systems are provided; these classification systems include the LSTMs with and without the ReliefF algorithm and the differential operation. Table 2 shows the four different classification systems, and their performances under the optimal window size are compared in this section.

1) LSTM SUBSTANCE CLASSIFICATION SYSTEM

In this experiment, the LSTM is used as the classifier and the raw DSTL spectrum dataset is used for the substance classifications. The results are shown in Table 3. The ReliefF algorithm and the differential operation are not used in this experiment. It can be seen that only about one-third of the total substances are detected. The numbers of matched (13) and missed (11) substances are almost the same, and the number of missed and misclassified substances is relatively high. In total, 13 of the 33 substances are matched, which gives a detection rate of 39.39%.

2) R-LSTM SUBSTANCE CLASSIFICATION SYSTEM

In this experiment, the raw DSTL spectrum dataset is processed by the ReliefF algorithm and then fed to the LSTM classifier. The dataset dimension is reduced from 270 to 50 by the ReliefF algorithm. Table 4 shows the results of the R-LSTM substance classification system. The proportion of matched (15) and missed (14) substances is still close to 1:1. However, the number of misclassified substances has been reduced to 4. It can be seen that the overall detection rate is 45.45%, which is better than that of the LSTM substance classification system in Table 3. This R-LSTM system has the advantages of high training speed and small network size, which will be further discussed with the R-D-LSTM system.
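The detection rates quoted in Tables 3 and 4 follow directly from the matched counts; as a quick check of the arithmetic (our own, not from the paper):

```python
matched, total = 13, 33
print(f"{100 * matched / total:.2f}%")  # 39.39% (Table 3)
matched = 15
print(f"{100 * matched / total:.2f}%")  # 45.45% (Table 4)
```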


TABLE 5. Detection rates of the D-LSTM substance classification system.

TABLE 6. Detection rates of the R-D-LSTM substance classification system.

3) D-LSTM SUBSTANCE CLASSIFICATION SYSTEM

In this experiment, the differential operation alone is applied to the raw DSTL spectrum dataset, and the data is then input to the LSTM. Table 5 shows the detection rates, where a high overall detection rate of 87.88% is obtained by this system. The D-LSTM effectively detects most instances of shampoo, shaving gel, brewed coffee and olive oil. It misses 1, 2 and 1 substances for the shower gel, coffee beans and smoked ham, respectively. In total, 4 of the 33 substances are missed, which gives a successful detection rate of 87.88%, and no false-positive detection occurs.

4) R-D-LSTM SUBSTANCE CLASSIFICATION SYSTEM

In this experiment, the raw DSTL dataset is pre-processed by both the ReliefF algorithm and the differential operation. The R-D-LSTM system integrates the advantages of the R-LSTM and D-LSTM systems. The results are shown in Table 6. To improve computing efficiency, the ReliefF algorithm is used to select the most significant features from the dataset and reduce the input dimensions of the neural network. The dimension of the raw dataset is significantly reduced from 270 to 50, giving a reduction rate of 81.48%. Compared to the results of the D-LSTM in Table 5, the detection accuracy of 81.81% is slightly lower because two substances are not matched: one smoked ham instance is missed and one shower gel instance is misclassified. However, the advantage of the R-D-LSTM is that, as the dimension of the raw dataset is significantly reduced, the training speed of the neural network is increased by 35% and the network size is reduced by 28.46%. This is beneficial for substance classification systems with limited computing resources, such as the embedded hardware systems which are typical platforms for mass spectrometers.

By comparing the performances of these four different classification systems, the D-LSTM substance classification system has the best overall detection rate of 87.88%. The LSTM substance classification system has the lowest detection rate, 39.39%, on the raw dataset. Compared to the LSTM system, the R-LSTM improves the detection rate to 45.45%, as the weakly contributing features in the dataset have been removed. The R-D-LSTM system combines the advantages of the differential operation and the ReliefF algorithm, where the former improves the detection rate and the latter reduces the computing resource overhead. Although its overall detection rate of 81.81% is lower than that of the D-LSTM, the computing resource overhead is greatly reduced.

The DSTL spectrum dataset has also been used in other approaches, such as the receptor density algorithm (RDA) in [45]. The RDA is inspired by T-cell signaling, and it has been used for the substance classifications of the DSTL spectrum dataset, where it achieved a detection rate of 86.5% and a false-positive rate of 3.2% [2]. Compared to the RDA, the D-LSTM and R-D-LSTM systems in this paper achieve detection rates of 87.88% and 81.81% and false-positive rates of 0% and 3.57%, respectively, i.e. the D-LSTM performs better than the RDA and the R-D-LSTM performs lower. Based on the trade-off between the detection performance and the computing resource requirements, the D-LSTM and R-D-LSTM can be selected for different application domains.


V. CONCLUSION

In this work, LSTM based substance detection methods are proposed, which consist of two parts, i.e., the pre-processing of the mass spectrometry data and the classification of the chemical substances. For the former, the differential operation and the ReliefF algorithm are used to improve the classification accuracies and to reduce the feature dimensions and computing cost, respectively. For the latter, LSTM based substance detection systems are designed, where the optimal parameters of the LSTM model are obtained through experiments. Results show that the D-LSTM substance classification system has the best overall detection rate, and the R-D-LSTM system achieves a balance between high classification accuracy and low computing resource requirements by combining the advantages of the differential operation and the ReliefF algorithm. These detection systems are suitable for implementation in the embedded hardware systems of different real applications. Future work includes further optimizing the substance detection systems, such as the neural network parameters, and reducing the requirements on hardware computing resources.

REFERENCES
[1] F. Aubriet and V. Carré, "Potential of laser mass spectrometry for the analysis of environmental dust particles—A review," Anal. Chim. Acta, vol. 659, nos. 1–2, pp. 34–54, 2010.
[2] J. A. Hilder et al., "Parameter optimisation in the receptor density algorithm," in Proc. Int. Conf. Artif. Immune Syst., vol. 6825, 2011, pp. 226–239.
[3] F. Y. Edgeworth, "On discordant observations," London, Edinburgh, Dublin Philos. Mag. J. Sci., vol. 23, no. 143, pp. 364–375, 1887.
[4] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Comput. Surv., vol. 41, no. 3, 2009, Art. no. 15.
[5] W. H. Brown, B. Iverson, E. V. Anslyn, and C. Foote, Organic Chemistry, 8th ed. Boston, MA, USA: Cengage Learning, 2017.
[6] C. S. Tong and K. C. Cheng, "Mass spectral search method using the neural network approach," Chemometrics Intell. Lab. Syst., vol. 49, no. 2, pp. 135–150, 1999.
[7] B. Curry and D. E. Rumelhart, "MSnet: A neural network which classifies mass spectra," Tetrahedron Comput. Methodol., vol. 3, nos. 3–4, pp. 213–237, 1990.
[8] I. Kononenko, E. Šimec, and M. Robnik-Šikonja, "Overcoming the myopia of inductive learning algorithms with RELIEFF," Appl. Intell., vol. 7, no. 1, pp. 39–55, 1997.
[9] K. Kira and L. A. Rendell, "The feature selection problem: Traditional methods and a new algorithm," in Proc. 10th Nat. Conf. Artif. Intell., 1992, pp. 129–134.
[10] Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, and Y.-F. Li, "Multi-instance multi-label learning," Artif. Intell., vol. 176, no. 1, pp. 2291–2320, 2012.
[11] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, Jan. 2015.
[12] D. A. Cirovic, "Feed-forward artificial neural networks: Applications to spectroscopy," TrAC Trends Anal. Chem., vol. 16, no. 3, pp. 148–155, 1997.
[13] H. M. Azamathulla, M. C. Deo, and P. B. Deolalikar, "Alternative neural networks to estimate the scour below spillways," Adv. Eng. Softw., vol. 39, no. 8, pp. 689–698, 2008.
[14] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, pp. 3104–3112.
[15] G. Zhang, B. E. Patuwo, and M. Y. Hu, "Forecasting with artificial neural networks: The state of the art," Int. J. Forecasting, vol. 14, no. 1, pp. 35–62, 1998.
[16] M. Hüsken and P. Stagge, "Recurrent neural networks for time series classification," Neurocomputing, vol. 50, pp. 223–235, Jan. 2003.
[17] M. Dixon, "Sequence classification of the limit order book using recurrent neural networks," J. Comput. Sci., vol. 24, pp. 277–286, Jan. 2018.
[18] J. Donahue et al., "Long-term recurrent convolutional networks for visual recognition and description," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 677–691, Apr. 2017.
[19] D. Bahdanau, K. Cho, and Y. Bengio. (2014). "Neural machine translation by jointly learning to align and translate." [Online]. Available: https://arxiv.org/abs/1409.0473
[20] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
[21] Z. C. Lipton, J. Berkowitz, and C. Elkan. (2015). "A critical review of recurrent neural networks for sequence learning." [Online]. Available: https://arxiv.org/abs/1506.00019
[22] T. G. Dietterich, "Machine learning for sequential data: A review," in Structural, Syntactic, and Statistical Pattern Recognition. Berlin, Germany: Springer, 2002, pp. 15–30.
[23] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection for discrete sequences: A survey," IEEE Trans. Knowl. Data Eng., vol. 24, no. 5, pp. 823–839, May 2012.
[24] J. Zupan and J. Gasteiger, "Neural networks: A new method for solving chemical problems or just a passing phase?" Anal. Chim. Acta, vol. 248, no. 1, pp. 1–30, 1991.
[25] A. Sagheer and M. Kotb, "Time series forecasting of petroleum production using deep LSTM recurrent networks," Neurocomputing, vol. 323, no. 5, pp. 203–213, 2019.
[26] Y. Kalegowda and S. L. Harmer, "Classification of time-of-flight secondary ion mass spectrometry spectra from complex Cu–Fe sulphides by principal component analysis and artificial neural networks," Anal. Chim. Acta, vol. 759, pp. 21–27, Jan. 2013.
[27] A. Eghbaldar, T. P. Forrest, and D. Cabrol-Bass, "Development of neural networks for identification of structural features from mass spectral data," Anal. Chim. Acta, vol. 359, no. 3, pp. 283–301, 1998.
[28] S. Bell, E. Nazarov, Y. F. Wang, and G. A. Eiceman, "Classification of ion mobility spectra by functional groups using neural networks," Anal. Chim. Acta, vol. 394, nos. 2–3, pp. 121–133, 1999.
[29] Z. Boger, "Selection of quasi-optimal inputs in chemometrics modeling by artificial neural network analysis," Anal. Chim. Acta, vol. 490, nos. 1–2, pp. 31–40, 2003.
[30] Y. Bengio and Y. LeCun, "Scaling learning algorithms towards AI," Large Scale Kernel Mach., vol. 34, pp. 321–360, 2007.
[31] Z.-S. Zhang, L.-L. Cao, J. Zhang, P. Chen, and C.-H. Zheng, "Prediction of molecular substructure using mass spectral data based on deep learning," in Intelligent Computing Theories and Methodologies, vol. 9226. Cham, Switzerland: Springer, 2015, pp. 520–529.
[32] G. Dorffner, "Neural networks for time series processing," Neural Netw. World, vol. 6, no. 1, pp. 447–468, 1996.
[33] B. Abraham and G. E. P. Box, "Bayesian analysis of some outlier problems in time series," Biometrika, vol. 66, no. 2, pp. 229–236, 1979.
[34] A. M. Bianco, M. G. Ben, E. J. Martínez, and V. J. Yohai, "Outlier detection in regression models with ARIMA errors using robust estimates," J. Forecasting, vol. 20, no. 8, pp. 565–579, 2001.
[35] B. Abraham and A. Chuang, "Outlier detection and time series modeling," Technometrics, vol. 31, no. 2, pp. 241–248, May 1989.
[36] H. Sak, A. Senior, and F. Beaufays. (2014). "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition." [Online]. Available: https://arxiv.org/abs/1402.1128
[37] F. A. Gers, D. Eck, and J. Schmidhuber, "Applying LSTM to time series predictable through time-window approaches," in Proc. Int. Conf. Artif. Neural Netw., 2001, pp. 669–676.
[38] P. Malhotra, L. Vig, G. Shroff, and P. Agarwal, "Long short term memory networks for anomaly detection in time series," in Proc. Eur. Symp. Artif. Neural Netw., Comput. Intell. Mach. Learn., 2015, pp. 89–94.
[39] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[40] M. Esteban, C. Ariño, and J. M. Díaz-Cruz, "Chemometrics for the analysis of voltammetric data," TrAC Trends Anal. Chem., vol. 25, no. 1, pp. 86–92, 2006.
[41] P. J. Werbos, "Backpropagation through time: What it does and how to do it," Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, Oct. 1990.
[42] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
[43] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. Int. Conf. Mach. Learn., 2013, pp. 1310–1318.
[44] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 855–868, May 2009.
[45] N. D. L. Owens, A. Greensted, J. Timmis, and A. Tyrrell, "The receptor density algorithm," Theor. Comput. Sci., vol. 481, pp. 51–73, Apr. 2013.


JUNXIU LIU, photograph and biography not available at the time of publication.

JINLEI ZHANG received the B.E. degree from the Harbin University of Science and Technology, China, in 2013. He is currently pursuing the master's degree with the Faculty of Electronic Engineering, Guangxi Normal University. His research interests include data analytics, long short-term memory networks, and their applications.

YULING LUO received the Ph.D. degree in information and communication engineering from the South China University of Technology, Guangzhou, China. She is currently an Associate Professor with the Faculty of Electronic Engineering, Guangxi Normal University, Guilin, China. Her research interests include information security, image processing, chaos theory, and embedded system implementations.

SU YANG received the B.A. degree in mechanical engineering from the Changchun University of Technology, Changchun, China, in 2008, the M.Sc. degree in information technology from the University of Abertay Dundee, Dundee, U.K., in 2010, and the Ph.D. degree in electronic engineering from the University of Kent, Canterbury, U.K., in 2015. He was with the Intelligent Interactions Research Group, School of Engineering and Digital Arts, where his research was focused on using EEG for biometric person recognition. He was a Postdoctoral Research Associate with the College of Engineering, Temple University, Philadelphia, PA, USA, from 2016 to 2017. He is currently a Senior Research Associate with the Intelligent Systems Research Centre, Ulster University, Londonderry, U.K. His current research interests include signal processing, pattern recognition, EEG-event detection, and MEG source reconstruction/localization.

JINLING WANG received the M.Sc. degree (Hons.) in computing and information systems from the School of Computing and Intelligent Systems, University of Ulster, in 2003, and the Ph.D. degree from the Intelligent Systems Research Centre, University of Ulster, in 2016. She is currently a Research Associate of computing science with the University of Ulster. Her major research interests include intelligent data analysis, bio-inspired adaptive systems, spiking neural networks, artificial neural networks, online learning, machine learning, data mining, and pattern recognition for a wide range of datasets. She serves as a Reviewer for several international conferences and journals.

QIANG FU is currently pursuing the Ph.D. degree with the Computer Science and Technology College, Harbin Engineering University, China. His major research interests include intelligent information processing and computational intelligence.

