
Volume 6, Issue 11, November – 2021 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Self-Attention GRU Networks for Fake Job Classification
Ankit Kumar
University of Delhi
New Delhi, India

Abstract:- This paper analyses the Employment Scam Aegean Dataset and compares various machine learning algorithms, including Logistic Regression, Decision Tree, Random Forest, XGBoost, K-Nearest Neighbor, Naïve Bayes and Support Vector Classifier, on the task of fake job classification. The paper also proposes two self-attention enhanced Gated Recurrent Unit networks, one with a vanilla RNN architecture and the other with a bidirectional architecture, for separating fake job postings from real ones. The proposed framework uses Gated Recurrent Units with a multi-head self-attention mechanism to enhance long-term retention within the network. In comparison to the other algorithms, the two GRU models proposed in this paper obtain better results.

Keywords:- Fake Job Classification; Text Classification; Gated Recurrent Unit; Recurrent Neural Networks.

I. INTRODUCTION

The 21st century is the century of data. There has never been more data available to humans at once than now. Data is available in various formats: text, audio, video, images, graphs and more. There was a time when reaching people or accessing things was not easy, but with the advent of the internet everything has changed. People are one text message or internet call (audio or video) away from each other irrespective of their geographical locations. Books, journals, news, recruitment notices, information about anything and everything used to be difficult to access; with the internet, accessing such information has become far easier. Within three decades of the arrival of the internet, we have moved from a time of not enough data to a time of far too much data. Having so much data available at once is an advantage. However, just as every boon carries some bane, this abundance of data also brings hidden issues, especially when there is no way to validate the data. With the advent of social media platforms it has become very easy to share information drawn from this data with other people. This ease, however, has brought a major problem with it: people can and do share information without verifying it. Unverified information can pose a real threat to those who rely on it. For instance, a well-known journalist in India believed she had been offered a teaching position at one of the top-ranked universities in the world. She quit her job to accept the position, only to learn later that the offer was fake and no teaching job existed; by then she had already left her journalism career. This is just one instance of people falling into the trap of fake or unverified information.

A large amount of the data we encounter is text based. Text data requires considering both the semantic and the syntactic significance of words. With deep learning, Natural Language Processing (NLP) has reached great heights, enabling machines to examine, comprehend and extract the important context from written compositions. The Recurrent Neural Network (RNN) has emerged as a powerful alternative that has stood the test of time on not just one but numerous text-based tasks.

Recurrent Neural Networks have been used for many applications, such as text classification [1, 2, 3, 4], speech recognition [5], language translation [6], image captioning [7], and various others. In theory, vanilla Recurrent Neural Networks exhibit rich dynamic behaviour on time-series tasks. However, Hochreiter [8] and Bengio et al. [9] showed that vanilla Recurrent Neural Networks suffer from vanishing or exploding gradients. To overcome the vanishing gradient problem, Hochreiter and Schmidhuber proposed Long Short-Term Memory (LSTM) in their 1997 paper [10]. An LSTM cell combines three gates, namely the input, forget and output gates, which together mitigate the gradient problem. A more compact adaptation of the LSTM, the Gated Recurrent Unit (GRU), was proposed in 2014 by Cho et al. [11]. Both LSTM and GRU cells have been used in RNN architectures for a variety of tasks and have produced many state-of-the-art results. Since the GRU has only two gates instead of the LSTM's three, GRUs are computationally faster than LSTMs.

The rest of this paper is structured as follows. Section II describes the GRU cell, the use of GRU-based RNN architectures for text classification, and the calculation of self-attention weights. Section III gives the details of our models. Section IV covers the dataset, implementation, results and the observations we draw from the outcomes of our experiments. We conclude the paper in Section V.

II. BACKGROUND

A. Recurrent Neural Networks for Text Classification
A recurrent neural network is a sequential network in which the output at each step is calculated as a function of the current input and the outputs obtained from the previous inputs. With recent progress in text classification using RNNs, recurrent networks are being applied to a wide assortment of tasks. Irsoy et al. [12], in 2014, used RNNs for opinion mining. Pollastri et al. [13], in 2002, used RNNs for estimating protein secondary structure. Tang et al. [14] performed sentiment classification using gated recurrent networks in 2015.

Arevian [15], in 2007, used RNNs to classify real-world text data. Mesnil et al. [17], in 2015, used RNNs for the task of slot filling. A combination of RNNs and Convolutional Neural Networks (CNNs) was used by Lai et al. [16] in 2015 to classify texts. Liu et al. [2] used recurrent neural networks to implement a joint intent detection and slot filling model in 2016. Lee et al. [18] also used RNNs in combination with CNNs to classify short texts in 2016. RNNs are thus already employed for a variety of text classification tasks, and it seems natural to employ them for sequence-based tasks such as ours.

B. Gated Recurrent Unit
In a GRU, as depicted in Fig. 1, the activation h_t^k of the k-th recurrent unit at time t is calculated as the linear interpolation between the previous activation h_{t-1}^k and the candidate activation \bar{h}_t^k:

h_t^k = (1 - z_t^k) h_{t-1}^k + z_t^k \bar{h}_t^k        (1)

The update gate z_t^k decides how much information the unit updates. It is computed as

z_t^k = \sigma(W_z x_t + U_z h_{t-1})^k        (2)

The reset gate allows the GRU to read the new input as if it were the first word of the sequence, by forgetting the previous computations done on the input. It is calculated as

r_t^k = \sigma(W_r x_t + U_r h_{t-1})^k        (3)

Here W_r and W_z are the input-to-hidden weights, \sigma is the sigmoid function, and U_r and U_z are the hidden-to-hidden weights within the layer.

Fig. 1. A GRU cell with update (z) and reset (r) gates, h as the activation and \bar{h} as the candidate activation.
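For illustration, the following minimal NumPy sketch implements a single GRU step following Eqs. (1)-(3). The candidate-activation weights W_h and U_h, the tanh non-linearity for the candidate, and all dimensions are standard-but-assumed details rather than values taken from this paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU time step following Eqs. (1)-(3).

    x_t:    input vector at time t, shape (d_in,)
    h_prev: previous hidden state h_{t-1}, shape (d_h,)
    W_*:    input-to-hidden weights, shape (d_h, d_in)
    U_*:    hidden-to-hidden weights, shape (d_h, d_h)
    """
    z = sigmoid(W_z @ x_t + U_z @ h_prev)             # update gate, Eq. (2)
    r = sigmoid(W_r @ x_t + U_r @ h_prev)             # reset gate, Eq. (3)
    h_bar = np.tanh(W_h @ x_t + U_h @ (r * h_prev))   # candidate activation (assumed tanh)
    return (1.0 - z) * h_prev + z * h_bar             # new activation, Eq. (1)

# Toy usage with random weights (illustrative only).
rng = np.random.default_rng(0)
d_in, d_h = 8, 4
weights = [rng.standard_normal((d_h, d_in)) if i % 2 == 0
           else rng.standard_normal((d_h, d_h)) for i in range(6)]
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), *weights)
```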
C. Bidirectional RNNs
One flaw of the vanilla RNN is that it uses past context only. Bidirectional RNNs (BRNNs) overcome this shortcoming by processing the data in both the forward and the backward direction in time before the layer's output is fed to the output layer. A bidirectional RNN computes the forward hidden sequence, the backward hidden sequence, and the output sequence y by iterating the backward layer in time from n = T to 1 and the forward layer in time from n = 1 to T. The output y_n is then computed as:

\overrightarrow{h}_n = F(W_{x\overrightarrow{h}} x_n + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{n-1} + b_{\overrightarrow{h}})        (8)

\overleftarrow{h}_n = F(W_{x\overleftarrow{h}} x_n + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{n+1} + b_{\overleftarrow{h}})        (9)

y_n = W_{\overrightarrow{h}y} \overrightarrow{h}_n + W_{\overleftarrow{h}y} \overleftarrow{h}_n + b_y        (10)

Bidirectional RNNs in combination with LSTM cells allow the architecture to access longer-range context in both directions.

Fig. 2. A bidirectional RNN with forward and backward layers.
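Similarly, a minimal NumPy sketch of Eqs. (8)-(10) is given below; it uses a plain tanh recurrence as a stand-in for the GRU step above, and all weights, dimensions and the toy sequence are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h, T = 8, 4, 5
xs = [rng.standard_normal(d_in) for _ in range(T)]

# A toy recurrent step (tanh RNN) standing in for the GRU step above.
W_x, W_h = rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h))
step = lambda x, h: np.tanh(W_x @ x + W_h @ h)

def run_direction(seq, step_fn, d_h, reverse=False):
    """Iterate the recurrence over the sequence in one time direction."""
    h, states = np.zeros(d_h), []
    for x in (reversed(seq) if reverse else seq):
        h = step_fn(x, h)
        states.append(h)
    return states[::-1] if reverse else states

h_fwd = run_direction(xs, step, d_h)                 # Eq. (8), n = 1..T
h_bwd = run_direction(xs, step, d_h, reverse=True)   # Eq. (9), n = T..1

# Eq. (10): combine both directions into the output at each step.
W_fy, W_by, b_y = (rng.standard_normal((3, d_h)),
                   rng.standard_normal((3, d_h)), np.zeros(3))
ys = [W_fy @ hf + W_by @ hb + b_y for hf, hb in zip(h_fwd, h_bwd)]
```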
D. Self-Attention
From Bahdanau's attention model [19] to the Transformer [20], many attention models have been proposed in deep learning. An attention model allows the output to pay extra attention to the inputs while the output is being estimated. In contrast, self-attention allows the inputs to interact with each other, i.e. the model computes the attention of every input with respect to every other input. Text classification requires focusing on all the words, so for our experiments we apply the self-attention mechanism proposed by Lin et al. [21] in 2017.

The self-attention sub-layers use h attention heads. To obtain the sub-layer output, a parameterized linear transformation is applied to the concatenation of the outputs of the individual heads. Each attention head computes a new sequence z = (z_1, z_2, ..., z_n) by operating on the input sequence x = (x_1, x_2, ..., x_n):

z_i = \sum_{k=1}^{n} \alpha_{ik} (x_k W^V)        (11)

Each weight coefficient \alpha_{ik} is obtained with a softmax function:

\alpha_{ik} = \frac{\exp(e_{ik})}{\sum_{j=1}^{n} \exp(e_{ij})}        (12)

Further, e_{ik} is calculated by comparing the two inputs:

e_{ik} = \frac{(x_i W^Q)(x_k W^K)^T}{\sqrt{d_z}}        (13)

Here W^V, W^Q, W^K \in R^{d_x \times d_z} are the learned projection matrices; these matrices are unique to every attention head and every layer.
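For illustration, the NumPy sketch below computes one self-attention head exactly as in Eqs. (11)-(13); the random projection matrices and the dimensions are placeholders.

```python
import numpy as np

def attention_head(X, W_Q, W_K, W_V):
    """One self-attention head over an input sequence X of shape (n, d_x)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                     # project inputs
    d_z = Q.shape[-1]
    E = (Q @ K.T) / np.sqrt(d_z)                            # compatibility scores, Eq. (13)
    A = np.exp(E) / np.exp(E).sum(axis=-1, keepdims=True)   # softmax weights, Eq. (12)
    return A @ V                                            # weighted sum of values, Eq. (11)

# Toy usage: a sequence of n = 6 inputs with d_x = 16, d_z = 8.
rng = np.random.default_rng(0)
n, d_x, d_z = 6, 16, 8
X = rng.standard_normal((n, d_x))
Z = attention_head(X,
                   rng.standard_normal((d_x, d_z)),
                   rng.standard_normal((d_x, d_z)),
                   rng.standard_normal((d_x, d_z)))
print(Z.shape)  # (6, 8): one output vector z_i per input x_i
```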
III. MODEL DETAILS

In this work, we have conducted experiments with two models: a GRU Classifier with Self-Attention (GRUSA) and a Bidirectional GRU Classifier with Self-Attention (BGRUSA). GRUSA uses GRU cells in a vanilla (unidirectional) Recurrent Neural Network, while BGRUSA uses a bidirectional GRU network.

A. GRU Classifier with Self-Attention
GRUSA uses GRU cells in a unidirectional Recurrent Neural Network architecture. On top of this architecture, a self-attention layer is added. This layer ensures that the model attends to every word with respect to all the other words in the input.

B. Bidirectional GRU Classifier with Self-Attention
BGRUSA uses the bidirectional approach, in which one RNN with GRU cells runs in the forward direction and another runs in the backward direction. Running an RNN in both directions provides extra context and therefore an opportunity for better decision making. On top of this architecture, a self-attention layer is added. This layer ensures that the model attends to every word with respect to all the other words in the input.

IV. EXPERIMENTS

A. Dataset
We have conducted training and testing of our models using the Employment Scam Aegean Dataset (EMSCAD) [6], a publicly available dataset containing 17,880 real-life job advertisements that aims to give the research community a clear picture of the employment scam problem. The EMSCAD records were annotated by hand and classified into two categories: the dataset contains 17,014 legitimate and 866 fraudulent job advertisements, published between 2012 and 2014.

The dataset was divided randomly into training, validation and testing sets, with 60% of the real and fake records used for training, 20% for validation and 20% for testing.

B. Implementation
The training data is first preprocessed to prepare it for training the models. Each string is encoded as UTF-8 and all words are converted to lowercase. The Porter stemmer is applied to the whole dataset to remove common morphological and inflexional endings from the words. Several features of the original dataset are removed: the job id column is dropped since all its values are unique, and records with a missing value in the description column are removed. We then prepare two variations of the data. The first variation has the location, company profile, department, requirements, benefits, employment type, required experience, required education and industry columns removed, as these columns have more than 60% missing values; in addition, the length of the description is added as an extra feature. In the second variation, several columns are combined into one: the description, location, department, company profile, requirements, benefits, employment type, required experience, required education, industry and function columns are merged into a single feature. Hence, the values that were dropped in the first variation are appended to the description column, providing more context for the decision. One-hot representation is then used to represent both variations of the dataset, and the resulting data is used to train all the models. For the RNN models based on the first variation the sequence length used is 1024, while for the second variation the sequence length is 2608. These values were chosen based on the median length of the sentences.

Fig. 3. Heatmap of null values in the dataset.
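A hedged pandas/NLTK sketch of this preprocessing and splitting pipeline is given below. The file name emscad.csv, the column names (job_id, description, fraudulent, and the sparse-column list), and the use of train_test_split are assumptions made for illustration; the authors' exact code is not available.

```python
import pandas as pd
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split

stemmer = PorterStemmer()

def clean_text(text: str) -> str:
    """Encode as UTF-8, lowercase and Porter-stem every word."""
    text = str(text).encode("utf-8", errors="ignore").decode("utf-8").lower()
    return " ".join(stemmer.stem(word) for word in text.split())

df = pd.read_csv("emscad.csv")                      # assumed file name
df = df.drop(columns=["job_id"])                    # all values unique
df = df.dropna(subset=["description"])              # drop records without a description

# First variation: drop sparse columns, add description length as a feature.
sparse_cols = ["location", "company_profile", "department", "requirements",
               "benefits", "employment_type", "required_experience",
               "required_education", "industry"]
var1 = df.drop(columns=sparse_cols)
var1["description_length"] = df["description"].str.len()
var1["description"] = var1["description"].apply(clean_text)

# 60/20/20 split into train, validation and test sets.
train, rest = train_test_split(var1, test_size=0.4,
                               stratify=var1["fraudulent"], random_state=42)
val, test = train_test_split(rest, test_size=0.5,
                             stratify=rest["fraudulent"], random_state=42)
```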
For the first variation of the dataset, the GRUSA model uses 1024 GRU cells followed by the self-attention layer to improve learning over longer lengths. This is followed by a dense hidden layer of 2048 units with the sigmoid activation function, which is in turn followed by the output layer using softmax as the activation function. The same configuration is used for the BGRUSA model. For the second variation of the data, we use 2608 GRU cells followed by the self-attention layer; the dense layer for this variation has 4096 units with the sigmoid activation function, followed by an output layer with the softmax function. Again, the same configuration is used for the BGRUSA model.
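One possible Keras realisation of the BGRUSA configuration for the first data variation is sketched below. Only the 1024-unit GRU layer, the 2048-unit sigmoid dense layer and the softmax output come from the description above; the embedding layer, vocabulary size, number of attention heads, the use of tf.keras.layers.MultiHeadAttention and the pooling step are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_bgrusa(seq_len=1024, vocab_size=20000, embed_dim=128, num_heads=4):
    """Bidirectional GRU classifier with a self-attention layer (sketch)."""
    inputs = layers.Input(shape=(seq_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    # Bidirectional GRU returning the full sequence of hidden states.
    x = layers.Bidirectional(layers.GRU(1024, return_sequences=True))(x)
    # Multi-head self-attention over the GRU outputs.
    x = layers.MultiHeadAttention(num_heads=num_heads, key_dim=64)(x, x)
    x = layers.GlobalAveragePooling1D()(x)          # assumed pooling step
    x = layers.Dense(2048, activation="sigmoid")(x)
    outputs = layers.Dense(2, activation="softmax")(x)  # real vs. fake job
    return tf.keras.Model(inputs, outputs)

model = build_bgrusa()
model.summary()
```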
To compare our models' learning, we have also implemented other machine learning algorithms: Logistic Regression, Decision Tree, Random Forest, XGBoost, K-Nearest Neighbor, Naïve Bayes and Support Vector Classifier. Besides these, we have also implemented the base GRU and Bidirectional GRU models without self-attention. We use accuracy, precision, recall and F1 score to evaluate the learning of all the models.
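The baseline comparison and metrics could be reproduced along the lines of the scikit-learn/XGBoost sketch below; apart from the 20 neighbours reported for KNN in Table I, the hyperparameters shown are library defaults, and a synthetic imbalanced dataset stands in for the one-hot EMSCAD features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from xgboost import XGBClassifier

# Synthetic stand-in for the one-hot EMSCAD features (imbalanced like the real data).
X, y = make_classification(n_samples=2000, n_features=100, weights=[0.95],
                           random_state=0)
X = np.abs(X)  # MultinomialNB expects non-negative features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)

baselines = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(),
    "KNN (20n)": KNeighborsClassifier(n_neighbors=20),
    "Naive Bayes": MultinomialNB(),
    "SVC": SVC(),
}

for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(f"{name}: acc={accuracy_score(y_test, pred):.4f} "
          f"prec={precision_score(y_test, pred, zero_division=0):.4f} "
          f"rec={recall_score(y_test, pred, zero_division=0):.4f} "
          f"f1={f1_score(y_test, pred, zero_division=0):.4f}")
```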

All the models use the binary cross-entropy loss function, the Adam optimizer, and a learning rate of 0.004.
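Continuing the Keras sketch above, this training configuration would look roughly as follows; the batch size, epoch count and the X_*/y_* arrays are placeholders, not values reported in the paper.

```python
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.004),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Placeholder batch size and epoch count; with the two-unit softmax output,
# the label arrays passed here would need to be one-hot encoded.
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          batch_size=64, epochs=10)
```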
C. Results
The results of all the models are reported in Table I. All four GRU-based models learn the patterns better than the other baseline models. Among the baselines, KNN performs best in terms of accuracy with 95.24%, while XGBoost performs best in terms of F1 score with 93.26%. The basic GRU model attains an accuracy of 93.01% and an F1 score of 94.25%, while the bidirectional GRU (BGRU) model attains an accuracy of 94.62% and an F1 score of 94.76%. The self-attention models perform better than the non-attention models, with the GRUSA model attaining an accuracy of 95.49% and an F1 score of 94.98%. The bidirectional self-attention GRU model (BGRUSA) outperforms all the other models on all four metrics, with an accuracy of 97.40% and an F1 score of 95.56%.

TABLE I. RESULTS OF ALL THE MODELS (%)

Models               Accuracy  Precision  Recall  F1 Score
GRU                  93.01     85.36      95.82   94.25
BGRU                 94.62     86.01      96.78   94.76
GRUSA                95.49     86.99      97.21   94.98
BGRUSA               97.40     88.90      98.38   95.56
Logistic Regression  83.23     78.98      86.33   84.76
Decision Tree        82.28     78.48      88.43   83.37
Random Forest        93.04     84.91      93.67   91.03
XGBoost              94.67     84.13      95.71   93.26
KNN (20n)            95.24     81.55      91.39   88.45
Naïve Bayes          94.03     78.01      85.41   79.60
SVC                  79.10     75.58      86.27   80.57

D. Observations
Based on the results obtained from the set of experiments conducted in this work, we make the following observations:
 The bidirectional models perform better than the unidirectional architectures.
 Self-attention models perform better than the non-attention models.
 The bidirectional models learn the sequences almost as well as the self-attention based unidirectional models.
 Balancing the dataset increases the overall performance of all the models.

V. CONCLUSION

In this paper, we have implemented a series of experiments with unidirectional and bidirectional RNN architectures using GRU cells, first with self-attention and then without. Our results show that even the basic GRU model performs better than the other baseline algorithms. The bidirectional GRU is able to remember the text sequences better than the basic GRU model. When trained with self-attention, both the unidirectional and bidirectional models perform better than the non-attention models, and the bidirectional self-attention model performs better than every other model on the task of fake job classification. In addition, resampling and balancing the dataset strongly affects learning and allows more stable learning to take place.

REFERENCES

[1]. J. C. Chang and C. C. Lin, "Recurrent-neural-network for language detection on Twitter code-switching corpus." arXiv preprint arXiv:1412.4314 (2014).
[2]. B. Liu and I. Lane, "Attention-based recurrent neural network models for joint intent detection and slot filling." arXiv preprint arXiv:1609.01454 (2016).
[3]. P. Li, J. Li, F. Sun, and P. Wang, "Short Text Emotion Analysis Based on Recurrent Neural Network." In Proceedings of the 6th International Conference on Information Engineering, p. 6. ACM, 2017.
[4]. D. Tang, B. Qin, and T. Liu, "Document modeling with gated recurrent neural network for sentiment classification." In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1422-1432. 2015.
[5]. A. Graves, A. R. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks." In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645-6649. IEEE, 2013.
[6]. I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks." In Advances in Neural Information Processing Systems, pp. 3104-3112. 2014.
[7]. O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator." In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pp. 3156-3164. IEEE, 2015.
[8]. S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions." International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6, no. 02 (1998): 107-116.
[9]. R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks." In International Conference on Machine Learning, pp. 1310-1318. 2013.
[10]. S. Hochreiter and J. Schmidhuber, "Long short-term memory." Neural Computation 9, no. 8 (1997): 1735-1780.
[11]. K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
[12]. O. Irsoy and C. Cardie, "Opinion mining with deep recurrent neural networks." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 720-728. 2014.
[13]. G. Pollastri, D. Przybylski, B. Rost, and P. Baldi, "Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles." Proteins: Structure, Function, and Bioinformatics 47, no. 2 (2002): 228-235.

[14]. Y. Tang and J. Liu, "Gated Recurrent Units for Airline Sentiment Analysis of Twitter Data."
[15]. G. Arevian, "Recurrent neural networks for robust real-world text classification." In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 326-329. IEEE Computer Society, 2007.
[16]. G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tür, and X. He, "Using recurrent neural networks for slot filling in spoken language understanding." IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, no. 3 (2015): 530-539.
[17]. J. Wang, L. C. Yu, K. R. Lai, and X. Zhang, "Dimensional sentiment analysis using a regional CNN-LSTM model." In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, pp. 225-230. 2016.
[18]. J. Y. Lee and F. Dernoncourt, "Sequential short-text classification with recurrent and convolutional neural networks." arXiv preprint arXiv:1603.03827 (2016).
[19]. D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
[20]. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need." In Advances in Neural Information Processing Systems, pp. 5998-6008. 2017.
[21]. Z. Lin, M. Feng, C. N. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A structured self-attentive sentence embedding." arXiv preprint arXiv:1703.03130 (2017).
