A Comparative Study of Machine Learning Algorithms For Gas Leak Detection


J.E. Raghavendra Prasad1, Senthil M2, Akhil Yadav3, Paras Gupta4 and Anusha K S5
1 Department of Electronics and Communication Engineering, Amrita School of Engineering,
Coimbatore, Amrita Vishwa Vidyapeetham, India

Abstract. A gas leak detection system takes many factors into account when
detecting leaks. Sensors are placed around leak-prone areas, and the presence
of a leak is determined from the concentration values the sensors report. The
results vary with the type of algorithm used to decide whether a leak is
present. An error in detection may have harmful consequences if the gas is
explosive or corrosive. In this paper we take the concentration values and
apply four machine learning techniques, namely Decision Tree, Random Forest,
ACF (an autocorrelation-function-based model) and Naïve Bayes, to concentration
data from a 20-sensor network, and compare the results. The experimental
results show that Random Forest performs best among the algorithms compared.

Keywords: ACF, Random forest, Decision tree, Naïve Bayes, gas leak

1 Introduction

Gas leaks in industry are hazardous because exposure to these gases can be very
harmful and, in extreme cases, fatal. Most leaks that occur in industry go
unidentified, and the gases escape into the environment for prolonged periods,
causing pollution. There have been many such cases in India, the most notable
being the Bhopal gas tragedy. By the end of 2020, 32,737 km of gas pipelines are
to be laid through the length and breadth of the country, with some of these
pipes passing through densely populated areas. In these cases prevention of
leaks is mandatory, and anomalies arising from human negligence and error are
not acceptable. These scenarios call for a cost-effective and accurate leak
detection system to avoid dangerous occurrences in these areas. Leak detection
systems take the data acquired by sensors in leak-prone areas as input to an
algorithm, which then decides whether a leak is present. The algorithm is chosen
according to the kind of signal the sensors acquire, such as sound or
concentration.
Leak localization techniques have also been used to determine the exact location
of the leak [1]. In [1], localization techniques that can be used are proposed,
and their performances are compared and analyzed.

In [2], conventional machine learning algorithms such as Naïve Bayes are applied
to a predictive breast cancer dataset, and the authors report how well these
algorithms perform in detecting breast cancer. In this paper we take a similar
approach: we apply conventional algorithms to leak detection and compare and
analyze their performance.

1.1 Related works


Correct prediction of leaks in industry is vital because of their impact on both
the environment and the human life around them. Many industries work with
explosive gases which, if leaked, may have hazardous consequences. In [3], an
acoustic-based detection system using adaptive filter technology is developed to
detect leaks in natural gas pipelines. The detection system is simulated in
LabVIEW by differentiating the leak characteristics. Another method [4] detects
VOC (volatile organic compound) gas leaks using pyro-electric infrared (PIR)
sensors. The authors build a circuit that produces an analog time-series output
from the PIR sensor. After the data are processed into wavelet coefficients, a
Markov-model-based classifier is developed to detect the gas leaks.
Many solutions using machine learning and neural networks have also been
proposed. In [5], a three-level back propagation algorithm is proposed for
predicting leak points in the petrochemical industry. This algorithm improves
the response speed of the learning process and lowers the prediction time. The
authors also propose a method to decrease the learning error and improve the
convergence rate of the algorithm. In [6], a Gaussian-based model is proposed to
detect small leaks in gas transportation pipes by learning the distribution of
small leaks in the pipeline. The model is based on the analysis of acoustic
signals and takes into account high environmental and random noise.

2 Data and pre-processing

The data used in this paper were collected in the experiments reported in [7],
performed at the Texas A&M Engineering Extension Service facility, College
Station, TX, USA. The sensor network consists of 20 sensors in a 4 × 5
configuration. The sensors are placed at an elevation of 2.25 m, and the two
leak sources are at elevations of 0.5 m and 5.5 m respectively. 60 releases were
performed during the experiment, each with a leak duration of 2 minutes.
Although there were only two release sources, the releases differed in nozzle
size and gas flow rate. The nozzle sizes were 2, 6.35, 19 and 63.5 mm, while the
flow rates ranged from 1.35 to 1020 lb/hr. The sensors have a measurement range
of 0 to 2 per cent (0 to 20,000 ppm). Concentration readings arrive at a rate of
one measurement every 5 seconds. Initially all sensor readings are taken as
zero, and at each recorded instant one sensor value is updated. The dataset
therefore consists of 20 concentration columns, with one sensor value updated at
each instance of time, together with the corresponding status (leak or no-leak)
for the scenario in which the readings were taken. We keep only the rows in
which every sensor value has been updated from its initial value, i.e. the rows
after the first 20 instances of time. We have also removed any duplicate values
from the dataset.
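
The two filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (a tuple of sensor readings plus a 0/1 status) and the function name preprocess are hypothetical, and duplicates are taken to mean rows that repeat an earlier row exactly.

```python
def preprocess(rows, n_sensors=20):
    """Drop warm-up rows and exact duplicates from time-ordered records.

    rows      : list of (readings, status) pairs, oldest first, where
                readings is a sequence of sensor values and status is
                1 for leak or 0 for no-leak (assumed layout).
    n_sensors : number of leading rows to skip; with one sensor updated
                per time instance, the first n_sensors rows still hold
                initial zeros for some sensors.
    """
    kept = []
    seen = set()
    for readings, status in rows[n_sensors:]:
        record = (tuple(readings), status)
        if record in seen:      # exact duplicate of an earlier row
            continue
        seen.add(record)
        kept.append(record)
    return kept
```

With a toy 3-sensor log, preprocess(rows, n_sensors=3) skips the first three rows and keeps one copy of each remaining distinct row.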

Fig. 1. A data sample showing how the values deviate during a leak. The sudden
spikes in the data mark instances of leak.

Fig. 1 shows how the data change during a leak, seen as a sudden rise in the
volume percentage. Because propane is heavier than air, it lingers for a period
of time before dispersing; for this reason we eliminate a few of the data values
that still show elevated readings after the leak period has ended. Once the data
have been taken from their raw state to a processed state, we feed them into the
algorithms discussed in Section 3, where any further modification of the
processed data is described.

3 Description of algorithms used:

3.1 Naïve Bayes:


This classifier predicts the membership probability of each class. In our model
the classes are leak and no-leak, and the class with the highest probability is
the predicted scenario.
These classifiers are an application of Bayes' theorem. Naïve Bayes is not a
single algorithm but a family of algorithms sharing a common assumption, namely
that every pair of features is independent of each other given the class. The
feature matrix contains all the vectors (rows) of the dataset, where each vector
holds the values of the features. The response vector contains the value of the
class variable, which is the output or predicted value for each row of the
feature matrix. In this paper we use a Naïve Bayes classifier to distinguish the
leak and no-leak conditions.
Bayes' theorem can be expressed as:

$$P(I \mid J) = \frac{P(J \mid I)\,P(I)}{P(J)} \tag{1}$$

which can be theoretically represented as,

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}} \tag{2}$$

This is a rather simple transformation, but it bridges the gap between what we
want to do and what we can do.
A downside of the algorithm is its naive assumption that all features are
independent. Despite its simplicity, Naïve Bayes forms a foundation on which the
other models have been developed.
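
To make the classification rule concrete, here is a minimal sketch of a Naïve Bayes classifier in plain Python. It assumes Gaussian class-conditional likelihoods for the continuous sensor readings; the paper does not say which Naïve Bayes variant was used, so that choice, the function names and the small variance floor are all assumptions made for illustration.

```python
import math


def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean/variance for each class."""
    model = {}
    n = len(y)
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / n
        stats = []
        for col in zip(*rows):                      # iterate over features
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col)
            stats.append((mean, var + 1e-9))        # floor avoids zero variance
        model[c] = (prior, stats)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior.

    The evidence term is the same for every class, so it is omitted.
    """
    best_class, best_score = None, -math.inf
    for c, (prior, stats) in model.items():
        score = math.log(prior)
        for v, (mean, var) in zip(x, stats):
            # log of the Gaussian density; features are treated as independent
            score += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Working in log space turns the product of per-feature likelihoods into a sum, which avoids numerical underflow when many sensors are combined.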

3.2 ACF model:


In this model we first convert the 20 sensor values in each row into a single
value by determining the likelihood of leak and the likelihood of no leak at
each instance of time. We first compute the probabilities of events in the
individual columns of the data: for each sensor we find the probability of its
reading under both leak (z = 1) and no leak (z = 0). We then use the likelihood
function to determine the likelihood of leak and the likelihood of no leak at a
specific instance of time. The likelihood at each time step is calculated by:

$$P_t(z) = \prod_{i=1}^{20} p\big(s_i(t) \mid z\big) \tag{3}$$

P = likelihood function
z = event of leak or no leak
i = sensor number (1 to 20)
s_i(t) = reading of sensor i at time t
t = time instance
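
A short sketch of how equation (3) might be evaluated at one time instance is given below. The paper does not spell out how the per-sensor probabilities are estimated, so the code takes them from a caller-supplied function; likelihoods_at_time and toy_cond_prob are hypothetical names, and the toy probabilities are invented purely for illustration.

```python
import math


def likelihoods_at_time(readings, cond_prob):
    """Evaluate equation (3) for both events at one time instance.

    readings  : sensor values s_i(t) at time t (index i is 0-based here)
    cond_prob : callable (i, value, z) -> p(s_i(t) | z); how these
                probabilities are estimated is left to the caller
    Returns {0: likelihood of no leak, 1: likelihood of leak}.
    """
    return {
        z: math.prod(cond_prob(i, value, z) for i, value in enumerate(readings))
        for z in (0, 1)
    }


def toy_cond_prob(i, value, z):
    """Hypothetical table: a reading above 0.1 is more likely under a leak."""
    high = value > 0.1
    if z == 1:
        return 0.8 if high else 0.2
    return 0.1 if high else 0.9
```

For readings [0.0, 0.5] the toy table gives 0.2 × 0.8 = 0.16 for leak and 0.9 × 0.1 = 0.09 for no leak. With 20 sensors the product can become very small, so summing logarithms is a common alternative.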

Fig. 2. Likelihood of leak plot

To determine anomalies, we use methods developed for separating abnormal EEG
signals from normal ones. In [8], a neural network is applied to classify
epileptic and normal EEG signals. We take a different approach based on
autocorrelation, which measures the extent of correlation between values of the
same data set at different time steps. Neural networks could also be used to
find anomalies in the leak signals, but we use the autocorrelation function
because it gives the leak detection system a real-time monitoring approach. The
approach of [9], where autocorrelation is used for robust artefact detection in
ECG, is used in this model to detect leaks. Autocorrelation is the degree of
similarity between a given time series and a copy of itself lagged by a factor k
chosen by the user. It is computed like the correlation between two time series,
except that here both are the same series: one is the current series and the
other the lagged one. We take a time lag of 5 seconds. We then use cosine
similarity to determine the similarity between the data points in the two time
series.

Fig. 3. Likelihood of no-leak plot

Fig. 4. The red lines represent the predicted leak points in the graph

The cosine similarity between two vectors X and Y is given by the equation

$$\frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2}} \tag{4}$$

This equation is an extension of the dot product between two vectors, used to
determine the cosine of the angle between the two vectors X and Y:

$$\cos\phi = \frac{X \cdot Y}{\|X\|\,\|Y\|} \tag{5}$$

In our scenario we have two lists, Xi and Yi: one holds the present likelihood
values and the other the likelihood values from 10 seconds earlier, since the
window size used for comparison in the autocorrelation function is 10 seconds.
We determine the cosine similarity between the two lists using equation (4). The
values range from -1 to 1, where 1 means the lists point in the same direction
and -1 means they point in opposite directions. Cosine similarity is used
because it measures how similar the series are irrespective of their magnitude.
It measures the cosine of the angle between two data points projected in a
multi-dimensional space, and so captures the orientation of the data rather than
its magnitude. Two series that are far apart in Euclidean distance because of
their magnitudes may still have a small angle between them, which makes them
similar under this measure. If the dissimilarity between the data points exceeds
the given threshold, a leak is declared. The threshold used in the model was the
90th percentile.
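
The comparison step can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation: the window and lag are expressed in samples rather than seconds, dissimilarity is taken as one minus the cosine similarity of equation (4), the percentile uses a nearest-rank rule, and the names cosine_similarity and flag_anomalies are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Equation (4): dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0              # convention chosen here for all-zero windows
    return dot / (norm_x * norm_y)


def flag_anomalies(series, window, lag, percentile=90):
    """Flag windows that differ most from the window `lag` samples earlier.

    Returns the start indices whose dissimilarity exceeds the
    nearest-rank `percentile` (an integer) of all dissimilarities.
    """
    starts = range(lag, len(series) - window + 1)
    dissim = [
        1.0 - cosine_similarity(series[s:s + window],
                                series[s - lag:s - lag + window])
        for s in starts
    ]
    if not dissim:
        return []
    ranked = sorted(dissim)
    # integer ceiling of percentile * len / 100, converted to a 0-based index
    k = max(0, (percentile * len(ranked) + 99) // 100 - 1)
    threshold = ranked[k]
    return [s for s, d in zip(starts, dissim) if d > threshold]
```

On a flat series with a single spike, for example [1]*6 + [5] + [1]*5 with window=2 and lag=1, only the window starting at the spike is flagged.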

3.3 Decision tree:


We use a decision tree for decision analysis (to determine the presence of a
leak) and to help identify the simplest strategy for reaching that goal. The
decision tree is a supervised learning method that performs well on both
categorical and continuous data. It divides the data samples into two or more
subsets based on the most significant differentiator among the input variables.
A decision tree starts from a root node containing all the samples; the
algorithm then breaks the data set down into smaller and smaller subsets while
the associated tree is built incrementally. The final tree consists of decision
nodes and leaf nodes. A decision node has two or more branches, while a leaf
node corresponds to a classification or decision, in our context leak or no
leak. The samples are split from the root node into further decision nodes,
which are in turn divided until leaf nodes are reached.
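
The idea of choosing the "most significant differentiator" can be illustrated with a single split on one feature. The sketch below scores candidate thresholds by Gini impurity; the paper does not name its splitting criterion, so Gini is an assumption here, and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n          # fraction of leak (1) labels
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Find the threshold t minimizing the weighted impurity of the
    two children produced by the test `value <= t` on one feature.

    Returns (threshold, weighted_impurity).
    """
    best = (None, float("inf"))
    n = len(values)
    # the largest value is skipped: splitting there leaves one side empty
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best
```

A full tree would apply best_split to every feature, keep the best one as a decision node, and recurse on each child until the nodes are pure or a depth limit is reached.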

3.4 Random forest:


A random forest consists of a large number of individual decision trees that
work together as a single classifier. A large number of decision trees acting
as an ensemble outperforms any of the individual trees. In [10], water leaks in
a pipeline are detected using the random forest algorithm; the algorithm works
very well in that conventional setting, where inexpensive pressure sensors are
installed and an accuracy of nearly 96% is obtained. This model works very well
because the data have very low correlation with one another. The algorithm uses
bagging and feature randomness when building each individual tree, creating a
forest of trees that are uncorrelated with each other. Assuming that all sensor
values are independent of each other, the data fit the model well.
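
Bagging, feature randomness and majority voting can be sketched compactly if each tree is reduced to a single split (a decision stump). That reduction is a simplification made only to keep the example short; a real random forest grows deeper trees and usually samples a feature subset at every split. All names below are hypothetical.

```python
import random


def train_stump(X, y, feature):
    """Fit a one-split tree on a single feature by minimizing training errors."""
    best_errors, best_t, best_above = float("inf"), 0.0, 1
    for t in sorted({row[feature] for row in X}):
        for above in (0, 1):            # label predicted when value > t
            errors = 0
            for row, label in zip(X, y):
                predicted = above if row[feature] > t else 1 - above
                errors += predicted != label
            if errors < best_errors:
                best_errors, best_t, best_above = errors, t, above
    return feature, best_t, best_above


def train_forest(X, y, n_trees=25, seed=0):
    """Bagging plus feature randomness, with stumps standing in for full trees."""
    rng = random.Random(seed)
    n_rows, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n_rows) for _ in range(n_rows)]  # bootstrap sample
        feature = rng.randrange(n_features)                    # random feature
        forest.append(
            train_stump([X[i] for i in rows], [y[i] for i in rows], feature)
        )
    return forest


def predict_forest(forest, x):
    """Majority vote over all stumps; returns 1 (leak) or 0 (no leak)."""
    votes = sum(above if x[f] > t else 1 - above for f, t, above in forest)
    return 1 if 2 * votes > len(forest) else 0
```

Because every stump sees a different bootstrap sample and a randomly chosen feature, their individual mistakes tend to differ, and the vote averages them out.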

4 Results and inference:

18 data samples were taken and the ACF model was applied to predict the leak points
in the sample. The model predicted 59 out of the 78 samples correctly with an accuracy
of 75.64%.

Fig. 6. Number of true and false predictions obtained for each of the 18 data
samples using the ACF model.

The above detection algorithms have been applied to the data and the results have
been compared with following parameters

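Mean Absolute Error (MAE) and Mean Squared Error (MSE)

$$\mathrm{MAE}(h, \hat{h}) = \frac{1}{n} \sum_{i=0}^{n-1} \left| h_i - \hat{h}_i \right| \tag{6}$$

$$\mathrm{MSE}(h, \hat{h}) = \frac{1}{n} \sum_{i=0}^{n-1} \left( h_i - \hat{h}_i \right)^2 \tag{7}$$

ĥ_i is the predicted value of the i-th sample
h_i is the corresponding true value of the i-th sample
n is the number of samples

$$\text{Precision} = \frac{tp}{tp + fp} \tag{8}$$

Where
tp is the number of true positives
fp is the number of false positives

$$\text{Recall} = \frac{tp}{tp + fn} \tag{9}$$

Where
tp is the number of true positives
fn is the number of false negatives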

The F1 score can be interpreted as the harmonic mean of precision and recall

$$F_1\ \text{score} = \frac{2 \times (\text{precision} \times \text{recall})}{\text{precision} + \text{recall}} \tag{10}$$
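
The metrics in equations (6) to (10) can be computed directly from true and predicted labels. The sketch below assumes binary labels with 1 for leak and 0 for no-leak; the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def f1_from(precision, recall):
    """Equation (10); returns 0.0 when both inputs are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def classification_metrics(y_true, y_pred):
    """Equations (6) to (10) for 0/1 labels, where 1 means leak."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / n,
        "mae": sum(abs(t - p) for t, p in pairs) / n,
        "mse": sum((t - p) ** 2 for t, p in pairs) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }
```

For 0/1 labels each error is either 0 or 1, so the absolute error equals the squared error and MAE, MSE and one minus the accuracy all coincide. As a check on equation (10), a precision of 0.974 and a recall of 0.906 give an F1 score of about 0.939.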

Model name       Accuracy (%)   MAE     MSE     Precision   Recall   F1 score
Naïve Bayes      77.48          0.225   0.225   0.481       0.337    0.396
Decision Tree    94.25          0.059   0.056   0.849       0.898    0.873
Random Forest    97.4           0.026   0.026   0.974       0.906    0.939

Table 1. Performance of each of the algorithms applied

From the table we deduce that the random forest algorithm has the best accuracy
of the models developed. The ACF model shows an accuracy of 75.64 per cent. The
ACF model runs in real time and depends on the quality of the data it works on,
unlike the other three algorithms, which predict the outcome from the training
performed on the model. If more features are taken into account, the accuracy of
the decision tree and random forest improves, because they consider many
features and choose the best one on which to make each decision. The mean
squared error highlights large errors, which do more damage to the algorithm
than an equivalent amount of small errors. Naïve Bayes performs worst of all the
models and has a very low F1 score: if a value that was not seen during training
occurs during prediction, the model assigns it zero probability and no
prediction is made. Once detection has been done, if the leak-prone region is so
large that detection alone is insufficient, localization is also called for.
Localization becomes the next focus on the path to an automated leak detection
system. Techniques such as the one in [11], where a gas diffusion model that
accounts for gas dispersion due to wind is proposed using the weighted centroid
algorithm, or the one in [12], where the support vector machine regression
algorithm is proposed to localize single and multiple targets in an indoor
environment, could be used to localize a leak occurring in the pipeline. This is
beyond the scope of this paper and can be done as an extension of the leak
detection system.

5 References
1. Anusha K. S., Dr. Ramanathan R., and Dr. Jayakumar M., “Device Free Localisation
Techniques in Indoor Environments”, Defence Science Journal (DSJ), vol. 69, no. 4, pp.
378-388 (2019).
2. Kedar Potdar, Rishab Kinnerkar, “A Comparative Study of Machine Learning Algorithms
applied to Predictive Breast Cancer Data”, International Journal of Science & research,
vol.5, no. 9, pp. 1550-1553 (2016)
3. Jiang ChunLei, Wang Yuan, “The research of natural gas pipeline leak detection based on
adaptive filter technology”, Proceedings of 2013 2nd IEEE International Conference on
Measurement, Information and Control (2013)
4. Fatih Erden, E. Birey Soyer, B. Ugur Toreyin, A. Enis Cetin, “VOC gas leak detection using
Pyro-electric Infrared sensors”, Acoustics, Speech, and Signal Processing, 1988. ICASSP-
88., 1988 International Conference on (2010)
5. Kun Wang, Linchao Zhuo, Yun Shao, Dong Yue and Kim Fung Tsang, “Toward Distributed
Data Processing on Intelligent Leak-Points Prediction in Petrochemical Industries”, IEEE
Transactions On Industrial Informatics, Vol. 12, No. 6 (2016).
6. J. Li, G.Chen, C. Liu and J. Tang, “Gaussian-based Models for Small Leak Identification of
Gas Transportation Pipes”, IEEE International Conference of Safety Produce
Informatization (IICSPI), Chongqing, China, pp. 1-5 (2018).
7. Fabien Chraim, Yusuf Bugra Erol, and Kris Pister, “Wireless Gas Leak Detection and
Localization”, IEEE Transactions on Industrial Informatics, Vol. 12, No. 2 (2016)
8. Anusha K.S, Mathew T. Mathews, Subha D. Puthankattil, “Classification of Normal and
Epileptic EEG Signal using Time & Frequency Domain Features through Artificial Neural
Network”, International Conference on Advances in Computing and Communications.
(2010)
9. C. Varon, D. Testelmans, B. Buyse, J. Suykens, and S. Van Huffel, “Robust artefact
detection in long-term ECG recordings based on autocorrelation function similarity and
percentile analysis,” in Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), pp.
3151–3154. (2012)
10. L. Aymon et al., "Leak Detection using Random Forest and Pressure Simulation," 2019 6th
Swiss Conference on Data Science (SDS), Bern, Switzerland, 2019, pp. 109-110, doi:
10.1109/SDS.2019.00008. (2019)
11. Li Qiuming, Liu Zhigang, Wang Jinkuan, Xiao Xianda, “A gas source localization algorithm
based on wireless sensor network”, Proceedings of the 11th World Congress on Intelligent
Control and Automation, Shenyang, pp. 2514-2518 (2014).
12. Anusha K.S., Ramanathan R, Jayakumar M, “Link distance support vector regression (LD-
SVR) based device free localization technique in indoor environment,” in Engineering
Science and Technology, an International Journal (2019)
