A Comparative Study of Machine Learning Algorithms For Gas Leak Detection
J.E. Raghavendra Prasad1, Senthil M2, Akhil Yadav3, Paras Gupta4 and Anusha K S5
1 Department of Electronics and Communication Engineering, Amrita School of Engineering,
Coimbatore, Amrita Vishwa Vidyapeetham, India
[email protected], [email protected],
[email protected], [email protected],
[email protected].
Abstract. A gas leak detection system takes many factors into consideration when
detecting leaks. Sensors are placed around leak-prone areas, and the presence of a
leak is determined from the concentration values reported by the sensors. Detection
models produce different results depending on the type of algorithm used to
determine a leak. An error in detecting a leak may have harmful consequences if the
gas is explosive or corrosive in nature. In this paper we consider the concentration
values and apply four machine learning techniques, namely Decision Tree, Random
Forest, ACF, and Naïve Bayes, to the concentration data of a 20-sensor network,
and compare the results. The experimental results show that Random Forest
performs best among the algorithms compared.
Keywords: ACF, Random forest, Decision tree, Naïve Bayes, gas leak
1 Introduction
Gas leaks in industrial settings are hazardous because exposure to the leaked gases
can be very harmful and, in extreme cases, fatal. Most leaks that occur in industry go
unidentified, and the gases escape into the environment for prolonged periods,
causing pollution. There have been many such cases in India, the most notable being
the Bhopal gas tragedy. By the end of 2020, 32,737 km of gas pipelines are to be laid
across the length and breadth of the country, with some of these pipes passing through
densely inhabited areas. In these cases, prevention of leaks is mandatory, and
anomalies caused by human negligence and error are not acceptable. These scenarios
call for a cost-effective and accurate leak detection system to avoid dangerous
occurrences due to leaks in these areas. Leak detection systems take the data acquired
by sensors in the leak-prone areas as input to an algorithm, which then decides whether
a leak is present. The algorithm is chosen according to the kind of signal acquired by
the sensors, such as sound or concentration.
Leak localization techniques have also been used to determine the exact location of
the leak [1]. In that paper, localization techniques that can be used have been proposed,
and their performances have been compared and analyzed.
In [2], conventional machine learning algorithms such as Naïve Bayes are applied to
a predictive breast cancer dataset, and the performance of these algorithms in
detecting breast cancer is assessed. In this paper we take a similar approach:
conventional algorithms that can be used to detect leaks are applied, and their
performances are compared and analyzed.
The data used in this paper were collected from the tests performed in [7], in an
experiment at the Texas A&M Engineering Extension Service facility, College Station,
TX, USA. The sensor network consists of 20 sensors in a 4 × 5 configuration. The
sensors are placed at an elevation of 2.25 m, and the two leak sources are at elevations
of 0.5 m and 5.5 m respectively. 60 releases were performed during this experiment,
each with a leak duration of 2 minutes. Although there were only 2 release sources,
the releases differed from each other in the source's nozzle size and the flow rate of
the gas. The nozzle sizes were 2, 6.35, 19 and 63.5 mm, while the flow rates ranged
from 1.35 to 1020 lb/hr. The sensors have a measurement range of 0 to 2 per cent
(0 to 20,000 ppm). The data consist of sensor concentration readings at a rate of
1 measurement every 5 seconds. Initially all the sensor readings were taken as zero,
and in each recorded instance one sensor value is updated. The dataset consists of 20
concentration values, with one sensor value updated at each instance of time, and the
corresponding status (leak or no-leak) based on the scenario during which the readings
were taken. We have only taken the rows in which all the sensor values have been
updated from their initial ones, i.e. rows after the first 20 instances of time. We have
also removed any duplicate values from the dataset.
Fig. 1. A data sample showing how the values deviate during a leak. The sudden spikes
in the data indicate instances of a leak.
The figure above shows how our data change during a leak, seen as the sudden rise in the
volume percentage in Fig. 1. Because propane is heavier than air, it lingers for a period of
time before dispersing; for this reason we eliminate a few of the data values that persist
after the leak time has ended. Once the data have been taken from their raw state to a
processed state, we feed them into the algorithms discussed in Section 3. Where the data are
modified further from the processed state described in this section, the details are given
there.
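The row filtering described in this section can be sketched in Python as follows. This is a minimal illustration, not the code used in the experiment: it assumes each row is stored as a tuple of sensor readings plus a status label, and the names `preprocess` and `warmup` are hypothetical. The removal of lingering post-leak readings is not shown, since the text does not specify how those rows were selected.

```python
from typing import List, Sequence, Tuple

# Assumed row layout: (sensor readings, status label "leak" / "no-leak").
Row = Tuple[Tuple[float, ...], str]


def preprocess(rows: Sequence[Row], warmup: int = 20) -> List[Row]:
    """Drop the first `warmup` rows, then drop exact duplicate rows.

    The warm-up rows are the instances during which some sensors still
    hold their initial zero reading. Row order is preserved.
    """
    seen = set()
    cleaned: List[Row] = []
    for readings, status in rows[warmup:]:
        key = (tuple(readings), status)
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(key)
    return cleaned
```

With a 20-sensor network, `preprocess(rows)` keeps only rows recorded after every sensor has reported at least once.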
Put simply, this is a rather simple transformation, but it bridges the gap between
what we want to do and what we can do.
However, a downside of the algorithm is its naive assumption that all features
are independent. Despite its simplicity, Naïve Bayes forms a basis on which the other
models have been developed.
$$P_t(z) = \prod_{i=1}^{20} p\left(s_i(t) \mid z\right) \tag{3}$$
P = likelihood function
z = event of leak or no leak
i = sensor number (1 to 20)
t = time instance
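As a rough illustration of equation (3), the sketch below multiplies per-sensor likelihoods and picks the event z with the larger product. The text does not state which density is used for p(s_i(t) | z), so a Gaussian with per-sensor mean and standard deviation is assumed here; the `params` structure and all function names are hypothetical, and equal priors for the two events are also an assumption.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical per-class parameters: one (mean, std) pair per sensor.
Params = Dict[str, Sequence[Tuple[float, float]]]


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Normal density, standing in for p(s_i(t) | z) (an assumption)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def likelihood(readings: Sequence[float], z: str, params: Params) -> float:
    """P_t(z): product over sensors of p(s_i(t) | z), as in equation (3)."""
    return math.prod(
        gaussian_pdf(s, mean, std) for s, (mean, std) in zip(readings, params[z])
    )


def classify(readings: Sequence[float], params: Params) -> str:
    """Return the event z with the larger likelihood (equal priors assumed)."""
    return max(params, key=lambda z: likelihood(readings, z, params))
```

For 20 sensors the product of small densities can underflow, so a practical version would usually sum log-likelihoods instead of multiplying.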
To determine anomalies, we use methods that are used for distinguishing abnormal EEG
signals from normal ones. In [8], a neural network is applied to classify epileptic
EEG signals from normal ones, but we use a different approach based on
autocorrelation, which measures the extent of correlation between two values in the
same data set at different time steps. Neural networks could also be used to find
anomalies in the leak signals, but we use the autocorrelation function to provide
a real-time monitoring approach to the leak detection system. The approach of [9],
where autocorrelation is used for robust artefact detection in ECG, has been used in this
work.
Fig. 4. The red lines represent the predicted leak points in the graph
This equation is the extension of the dot product between two vectors, used to determine
the cosine of the angle between two vectors X and Y:
$$\cos\phi = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} \tag{5}$$
In our scenario, however, we have two lists Xi and Yi: one holds the present
likelihood values and the other the likelihood values from 10 seconds earlier,
since the window size used for comparison in the autocorrelation function is 10 seconds.
We therefore determine the cosine similarity between the two lists using equation 4. The
values range from -1 to 1, where 1 is perfectly similar and -1 is perfectly dissimilar.
Cosine similarity is used because it can determine how similar the series are
irrespective of their size. It measures the cosine of the angle between two data points
projected in a multi-dimensional space, capturing the orientation of the data rather
than the magnitude. Even when two series are far apart by Euclidean distance
(magnitude) because of their size, they may have a small angle between them, making
them more similar. If the dissimilarity between the data points exceeds the provided
threshold, a leak is declared. The threshold used in the model was the 90th
percentile.
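The comparison described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: it assumes that dissimilarity is measured as one minus the cosine similarity and that the threshold is the 90th percentile of the observed dissimilarities. The function names and the handling of all-zero lists are hypothetical choices.

```python
import math
from typing import List, Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between x and y: (X . Y) / (||X|| * ||Y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: an all-zero list is treated as uncorrelated
    return dot / (norm_x * norm_y)


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0 to 100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def flag_leaks(dissimilarity: Sequence[float], q: float = 90.0) -> List[bool]:
    """Flag the time steps whose dissimilarity exceeds the q-th percentile."""
    threshold = percentile(dissimilarity, q)
    return [d > threshold for d in dissimilarity]
```

Under these assumptions, each dissimilarity would be computed as `1.0 - cosine_similarity(present, past)`, where `past` holds the likelihood values from 10 seconds (two measurement intervals) earlier.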
18 data samples were taken, and the ACF model was applied to predict the leak points
in each sample. Across these samples, 59 of the model's 78 predictions were correct,
giving an accuracy of 75.64%.
Fig. 6. Number of true and false predictions obtained for each of the 18 data samples
using the ACF model.
The above detection algorithms have been applied to the data, and the results have
been compared using the following parameters:
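$$\text{Precision} = \frac{tp}{tp + fp} \tag{8}$$
where
tp is the number of true positives
fp is the number of false positives
$$\text{Recall} = \frac{tp}{tp + fn} \tag{9}$$
where
tp is the number of true positives
fn is the number of false negatives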
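Equations (8) and (9) can be computed from paired actual and predicted labels as in the sketch below. The function names and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration. The F1 helper uses the standard definition (harmonic mean of precision and recall), which is not spelled out in this section.

```python
from typing import Sequence, Tuple


def precision_recall(
    actual: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (precision, recall), with True meaning 'leak'."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    # Assumption: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard definition)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0
```

For example, with 2 true positives, 1 false positive and 2 false negatives, precision is 2/3 and recall is 1/2.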
Table 1. Performance of each of the algorithms that have been applied
From the table we deduce that the random forest algorithm has better accuracy than
the other developed models. The ACF model shows an accuracy of 75.64 per cent. The
ACF model runs in real time and depends on the quality of the data it works on, unlike
the other three algorithms, which predict the outcome from the training performed on
the model. If more features are taken into account, the accuracy of decision tree and
random forest improves, because they consider many features and choose the best one
on which to make a decision. The mean squared error can be used to detect large
errors, which cause more damage to the algorithm than an equivalent amount of small
errors. Naïve Bayes shows the poorest performance of all the models and has a very
low F1 score: if a value that was not seen during training occurs during prediction,
the model assigns it a zero probability and a prediction is not made. Once detection
has been done, if the leak-prone region is so large that detection alone is
insufficient, localization is also called for. Localization becomes the next focus
on the path to an automated leak detection system. Techniques like the one in [11],
where a gas diffusion model is proposed that accounts for gas dispersion due to wind
using the weighted centroid algorithm, or the one in [12], where the support vector
machine regression algorithm is proposed to localize single and multiple targets in an
indoor environment, can be used to localize a leak occurring in the pipeline. This is
beyond the scope of this paper and can be done as an extension of the leak detection
system.
5 References
1. Anusha K. S., Dr. Ramanathan R., and Dr. Jayakumar M., “Device Free Localisation
Techniques in Indoor Environments”, Defence Science Journal (DSJ), vol. 69, no. 4, pp.
378-388 (2019).
2. Kedar Potdar, Rishab Kinnerkar, “A Comparative Study of Machine Learning Algorithms
applied to Predictive Breast Cancer Data”, International Journal of Science & research,
vol.5, no. 9, pp. 1550-1553 (2016)
3. Jiang ChunLei, Wang Yuan, “The research of natural gas pipeline leak detection based on
adaptive filter technology”, Proceedings of 2013 2nd IEEE International Conference on
Measurement, Information and Control (2013)
4. Fatih Erden, E. Birey Soyer, B. Ugur Toreyin, A. Enis Cetin, “VOC gas leak detection using
Pyro-electric Infrared sensors”, Acoustics, Speech, and Signal Processing, 1988. ICASSP-
88., 1988 International Conference on (2010)
5. Kun Wang, Linchao Zhuo, Yun Shao, Dong Yue and Kim Fung Tsang, “Toward Distributed
Data Processing on Intelligent Leak-Points Prediction in Petrochemical Industries”, IEEE
Transactions On Industrial Informatics, Vol. 12, No. 6 (2016).
6. J. Li, G.Chen, C. Liu and J. Tang, “Gaussian-based Models for Small Leak Identification of
Gas Transportation Pipes”, IEEE International Conference of Safety Produce
Informatization (IICSPI), Chongqing, China, pp. 1-5 (2018).
7. Fabien Chraim, Yusuf Bugra Erol, and Kris Pister, “Wireless Gas Leak Detection and
Localization”, IEEE Transactions on Industrial Informatics, Vol. 12, No. 2 (2016)
8. Anusha K.S, Mathew T. Mathews, Subha D. Puthankattil, “Classification of Normal and
Epileptic EEG Signal using Time & Frequency Domain Features through Artificial Neural
Network”, International Conference on Advances in Computing and Communications.
(2010)
9. C. Varon, D. Testelmans, B. Buyse, J. Suykens, and S. Van Huffel, “Robust artefact
detection in long-term ECG recordings based on autocorrelation function similarity and
percentile analysis,” in Proc. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), pp.
3151–3154. (2012)
10. L. Aymon et al., "Leak Detection using Random Forest and Pressure Simulation," 2019 6th
Swiss Conference on Data Science (SDS), Bern, Switzerland, 2019, pp. 109-110, doi:
10.1109/SDS.2019.00008. (2019)
11. Li Qiuming, Liu Zhigang, Wang Jinkuan, Xiao Xianda, “A gas source localization algorithm
based on wireless sensor network”, Proceedings of the 11th World Congress on Intelligent
Control and Automation, Shenyang, pp. 2514-2518 (2014).
12. Anusha K.S., Ramanathan R, Jayakumar M, “Link distance support vector regression (LD-
SVR) based device free localization technique in indoor environment,” in Engineering
Science and Technology, an International Journal (2019)