T-BERTSum: Topic-Aware Text Summarization Based on BERT
Abstract—In the era of social networks, the rapid growth of data mining in information retrieval and natural language processing makes automatic text summarization necessary. Currently, pretrained word embeddings and sequence-to-sequence models can be effectively adapted in social network summarization to extract significant information with strong encoding capability. However, how to tackle long-text dependence and utilize the latent topic mapping has become an increasingly crucial challenge for these models. In this article, we propose a topic-aware extractive and abstractive summarization model named T-BERTSum, based on Bidirectional Encoder Representations from Transformers (BERT). This is an improvement over previous models, in which the proposed approach can simultaneously infer topics and generate summaries from social texts. First, the encoded latent topic representation, obtained through the neural topic model (NTM), is matched with the embedded representation of BERT to guide the generation with the topic. Second, the long-term dependencies are learned through the transformer network to jointly explore topic inference and text summarization in an end-to-end manner. Third, long short-term memory (LSTM) network layers are stacked on the extractive model to capture sequence timing information, and the effective information is further filtered on the abstractive model through a gated network. In addition, a two-stage extractive–abstractive model is constructed to share information. Compared with previous work, the proposed model T-BERTSum focuses on pretrained external knowledge and topic mining to capture more accurate contextual representations. Experimental results on the CNN/Daily Mail and XSum datasets demonstrate that our proposed model achieves new state-of-the-art results while generating consistent topics compared with the most advanced methods.

Index Terms—Bidirectional Encoder Representations from Transformers (BERT), neural topic model (NTM), social network, text summarization.

I. INTRODUCTION

Automatic text summarization generates a condensed version of the input documents while still retaining their key content. This technique plays an important role in information retrieval and natural language processing (NLP) [1] and is a research hotspot in a multitude of fields, such as computer science, multimedia, and statistics [2]. Currently, typical summarization methods include extractive and abstractive ones [3]. The extractive method selects salient sentences or reorganizes sentences from the original text, whereas the abstractive method generates novel words or phrases with comprehension. Generally, most existing methods are designed to encode paragraphs and then decode with different mechanisms. Nevertheless, there is a large amount of information loss in the encoding and decoding stages. Therefore, existing works on summarization mainly focus on word embedding or contextual contents.

However, the advantages of word embedding are limited by specific small datasets, which require richer contextual information. The future direction of summarization is to predict words from the full context by language modeling and representation learning [3]. Consequently, in this article, the following challenging research problem is studied: how to use a pretrained language model for text representation and generation. Unfortunately, it is an open challenge to generate sentences related to a topic with overall coherence and discourse relatedness [4]. In order to summarize the original text well, we propose a topic-aware extractive and abstractive summarization model named T-BERTSum. The proposed model T-BERTSum faces several challenges. First, the model is expected to obtain accurate and updatable topics corresponding to the specific article. Second, the model must match topic information with word embeddings in an end-to-end manner to guide the generation of topic-aware summaries.
Long short-term memory (LSTM) layers are stacked on top of the output layer to classify whether a sentence belongs to the summary for the extractive model. Besides, a gated network is added for the abstractive model to remove useless information. As a consequence, the dependencies of sentences with a relatively long span can be captured. T-BERTSum adopts joint learning of topic modeling and text summarization and separates the optimizers of the encoder and the decoder to accommodate the fact that the former is pretrained and the latter must be trained from scratch. In addition, a two-stage summarization framework is constructed, based on T-BERTSum, to generate summaries from the extracted sentences, sharing information while greatly reducing sentence redundancy. Generally, the main contributions of this work are given as follows.
1) We propose T-BERTSum, which applies BERT to text summarization and introduces rich semantic features, based on a modified transformer architecture that achieves efficient and parallel computation.
2) Background information is integrated into the encoding as additional knowledge, encoded as an adjustable topic representation, aiming to guide the generation of summaries in an end-to-end manner.
3) The ability to generate smooth summaries with low redundancy is improved by sharing information between different tasks in a two-stage extractive–abstractive model.
The rest of this article is organized as follows. In Section II, we review related works that can be adapted for the text summarization task in this article. In Section III, we elaborate on our proposed framework T-BERTSum in detail, followed by the experimental analysis in Section IV. According to the experimental results, we draw conclusions in Section V.

II. RELATED WORKS

Current works on text summarization mainly focus on word embedding, which represents each element in some way [8], [9]. However, word embedding cannot completely solve the problem of polysemy. In order to alleviate this problem, Embeddings from Language Models (ELMo) [10] adopted a bidirectional LSTM to train the language model, where the hierarchical LSTM can grasp information of different granularity. The LSTM captures word features, syntactic features, and semantic features, respectively, from the shallow layers to the deep ones, but its parallelism is poor. By contrast, the transformer [23] accelerates the deep learning training process based on the attention mechanism, greatly improves the feature extraction capability, and contributes to parallel processing. Moreover, generative pretraining (GPT) [11] obtained a better context representation by using a feature extractor with a unidirectional transformer. Furthermore, BERT [5] is trained on a corpus of about 3.3 billion words via masked language modeling and next sentence prediction, which yields better word embeddings. In recent years, BERT has been successfully applied to various NLP tasks, such as textual entailment, named entity recognition, and machine reading comprehension. RoBERTa [12] made several adjustments on the basis of BERT, namely, more training data, a larger batch size, longer training time, and removal of the next sentence prediction loss. In our study, we focus on BERT to extract context information effectively for sequence encoding. We believe that the BERT-based model can achieve better performance.

Most importantly, one of the limitations of automatic summarization is how to reflect the implicit information conveyed between different texts and the background influence [13]. Akhtar et al. [14] used latent Dirichlet allocation (LDA) to label documents with topics and used formal concept analysis (FCA) to automatically organize them in a lattice structure. In this way, topic identification and document organization help with text mining. Roul et al. [15] proposed a heuristic method that used the LDA technique to identify the optimum number of independent topics present in the corpus, ensuring that all the important contents from the corpus of documents are captured in the extracted summary. In addition, a two-tiered topic model based on the pachinko allocation model (PAM) has been combined with the TextRank method for summarization [16]. The word–topic distribution of LDA has also been combined with the sequence-to-sequence model to improve abstractive sentence summarization [17]. Yang et al. [18] introduced a novel neighborhood preserving semantic (NPS) measure to capture the sparse candidate topics under a low-rank matrix factorization model. These techniques used the topic model, as an additional mechanism, to improve text generation. Nonetheless, these models have some sparseness problems and are difficult to train. Our approach utilizes the neural topic model (NTM) to induce implicit topics in neural networks, which is easy to explain and extend. We also demonstrate the influence of topics on text summarization.

From another aspect, text summarization is basically divided into extractive and abstractive models. Sadiq et al. [19] consider that the target user lacks background knowledge or reading ability and propose a linear combination of feature scores for social networks. NeuSum [20] integrated the selection strategy into the scoring model, which resolves the previous separation between sentence scoring and sentence selection and enables end-to-end training without human intervention. ExDoS [21] is the first approach to combine supervised and unsupervised algorithms in a single framework for document summarization; it iteratively minimizes the error rate of the classifier in each cluster with the help of dynamic local feature weighting. Wang et al. [22] used a convolutional seq2seq model and the policy gradient algorithm to summarize and optimize the text. With the emergence of self-attention [23], an increasing number of methods [24], [25] adopt self-attention instead of the RNN sequence model and use multihead attention to capture different semantic information, based on the fact that each element contributes differently to the sequence. Su et al. [26] propose a two-stage method for variable-length abstractive summarization; it consists of a text segmentation module and a two-stage transformer-based summarization module and has achieved good results in capturing the relationship between sentences.

In addition, some works [27], [28] have recently focused on summarization using pretrained language models.
Topic embedding is appended to each input sequence; it is the output representation of the topic information in the implicit sequence trained by the NTM described in Section III-B. The main contribution of topic embedding is to represent the topic information hidden in each word or sequence, which can mine the gist of the article and alleviate the problem of polysemy. As an example, the word "novel" can be understood as fiction or as new in different contexts. Therefore, it is necessary to dig out and represent the background information of each word.
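The paper does not spell out how the four embeddings are fused, so the following PyTorch sketch should be read as one plausible realization under stated assumptions (summation-based fusion in the style of BERT's own input embeddings, plus a linear projection of the NTM topic mixture); the class and parameter names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class TopicAwareInputEmbedding(nn.Module):
    """Hypothetical sketch: combine BERT-style token, segment, and position embeddings
    with a topic embedding derived from the NTM topic mixture (Section III-B).
    Summation is an assumed fusion; the paper only says topic embedding is appended."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, num_topics=1):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.seg = nn.Embedding(2, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        self.topic_proj = nn.Linear(num_topics, hidden)   # project theta from the NTM into the hidden space

    def forward(self, token_ids, segment_ids, theta):
        # token_ids, segment_ids: (batch, seq_len); theta: (batch, num_topics)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        topic = self.topic_proj(theta).unsqueeze(1)       # (batch, 1, hidden), broadcast over all tokens
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions) + topic
```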
B. Neural Topic Model

Our topic model is inspired by the NTM, which induces latent topics in neural networks. We assume a topic matrix t ∈ R^{K×H}, where K is the number of topics and H is the dimension of the embedding. The token embedding e indicates the meaning of each token. The topic distribution over words for a given topic assignment z_n is p(w_n | β, z_n) = Multi(β_{z_n}), where Multi is the multinomial topic distribution, which is generated by computing the token embedding and the topic embedding as follows:

\beta_K = \mathrm{softmax}\left(e \cdot t_K^{\top}\right).  (3)

We assume that β is obtained by calculating the semantic similarity between topics and words. Here, the prior parameters μ and σ in Fig. 1 are defined through G(θ | μ0, σ0²), in which the Gaussian sample x ∼ N(x | μ0, σ0²) is drawn from a Gaussian distribution whose hyperparameters μ0 and σ0² are set for a zero mean and unit variance. We use the Gaussian softmax to generate θ = g(x) and variational inference [34] to approximate a posterior distribution q(θ | d) over x. The loss function of the topic model is defined as

L_{NTM} = \mathbb{E}_{q(\theta|d)} \Big[ \sum_{n=1}^{N} \log \sum_{z_n} p(w_n \mid \beta_{z_n})\, p(z_n \mid \theta) \Big] - D_{KL}\big( q(\theta|d) \,\|\, p(\theta \mid \mu_0, \sigma_0^2) \big)  (4)

where q(θ | d) is the variational distribution approximating the true posterior p(θ | d). The KL term in (4) can be easily integrated as a Gaussian KL-divergence. We generate the variational parameters μ(d) and σ(d) through the inference network for document d so that we can estimate the variational lower bound by sampling θ from q(θ | d) = G(θ | μ(d), σ²(d)). We leave out the derivation details and refer the readers to [7].
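For concreteness, a minimal PyTorch sketch of the NTM described above is given below, assuming a bag-of-words input per document, a standard normal prior, and a single-hidden-layer inference network (all of which are assumptions about details the text leaves open). The KL term is written in the usual closed form for a diagonal Gaussian against N(0, I).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NTM(nn.Module):
    """Minimal sketch of the neural topic model of Section III-B (after Miao et al. [7]).
    Layer sizes, softmax axes, and the inference-network architecture are assumptions."""
    def __init__(self, vocab_size, num_topics=1, hidden=768, embed_dim=768):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, num_topics)
        self.to_logvar = nn.Linear(hidden, num_topics)
        self.topic_emb = nn.Parameter(torch.randn(num_topics, embed_dim))   # t in R^{K x H}
        self.word_emb = nn.Parameter(torch.randn(vocab_size, embed_dim))    # token embedding e

    def forward(self, bow):                        # bow: (batch, vocab_size) float word counts of document d
        h = self.encoder(bow)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        x = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterized Gaussian sample
        theta = F.softmax(x, dim=-1)                             # Gaussian softmax: theta = g(x)
        beta = F.softmax(self.word_emb @ self.topic_emb.t(), dim=0).t()   # Eq. (3): beta_K = softmax(e . t_K^T)
        recon = -(bow * torch.log(theta @ beta + 1e-10)).sum(-1)          # expected log-likelihood term of Eq. (4)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)       # closed-form Gaussian KL to N(0, I)
        return theta, (recon + kl).mean()
```

The returned topic mixture θ is what the summarizer consumes as topic embedding, and the returned loss is the L_NTM term that enters the joint objective of Section III-D.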
C. Summarization

The transformer is used as the encoder, based on the self-attention mechanism. Compared with an RNN, which needs to process the input sequence word by word, it calculates the context vector of each word through self-attention, which gives excellent parallelism and low computational complexity. We stack a six-layer transformer in which each layer has multihead attention and feedforward sublayers. The final output of the encoder is the contextual embedding, as described in Section III-A. As shown in Fig. 1, two modes are built for summarization: the extractive mode and the abstractive mode.

1) Extractive: Extractive summarization can be defined as the task of assigning a label Y_t ∈ {0, 1} to each sentence, indicating whether the sentence should be included in the summary or not. Moreover, the LSTM combined with the transformer still has its unique advantages [35]: it adds a forget gate to the simple RNN model to control historical state information. Therefore, the LSTM is used for classifying sentences for the summary. After the encoder, the sentence embeddings S = {s_1, s_2, ..., s_n} are obtained as the input of the extractive model, which filters key sentences by document-level features. At the t-th time step, the input is the vector s_t, and the output is calculated as follows:

f_t = \sigma(w_f s_t + b_f)  (5)
i_t = \sigma(w_i s_t + b_i)  (6)
o_t = \sigma(w_o s_t + b_o)  (7)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(w_c s_t + b_c)  (8)
h_t = o_t \odot \tanh(c_t)  (9)

where σ is the sigmoid function; s_t is the current input; f_t, i_t, and o_t are the forget, input, and output gates; c_t and h_t are the context vector and the output vector; and w_f, b_f, w_i, b_i, w_o, b_o, w_c, and b_c are the weights and biases of the forget gate, the input gate, the output gate, and the context vector, respectively. The final output layer uses a sigmoid function to calculate the final prediction score Ŷ_t, as shown in (10), to determine whether the sentence should be included in the summary. The loss of the whole model is the binary classification entropy of Ŷ_t against the gold label Y_t

\hat{Y}_t = \mathrm{sigmoid}(w_o h_t + b_o).  (10)
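A compact sketch of this extractive head follows, using PyTorch's built-in LSTM in place of the hand-written gate equations (5)-(9); the number of stacked layers and the hidden size are assumptions.

```python
import torch
import torch.nn as nn

class ExtractiveHead(nn.Module):
    """Sketch of the extractive classifier: stacked LSTM layers over the encoder's
    sentence vectors s_1..s_n, followed by a sigmoid scoring layer (Eqs. (5)-(10))."""
    def __init__(self, hidden=768, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(hidden, hidden, num_layers=num_layers, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, sent_vecs):                  # (batch, n_sents, hidden)
        h, _ = self.lstm(sent_vecs)                # h_t of Eq. (9) at every sentence position
        return torch.sigmoid(self.score(h)).squeeze(-1)   # \hat{Y}_t of Eq. (10)

# Training would use the binary cross entropy of the scores against the gold labels Y_t, e.g.:
# loss = nn.functional.binary_cross_entropy(head(sent_vecs), gold_labels.float())
```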
2) Abstractive: There are many words in the sequence, and only a few of them capture the key information of the entire sequence, which is exactly what we need. In order to filter the key information of the input sequence, a gated network is added to improve the transformer before the decoder, as shown in Fig. 3. Generally, the gated network is used to control the information flow from the input sequence to the output sequence, which makes the decoder focus on generating summaries from key information and removing unnecessary information. Here, the input of the gated network is the sentence representation s. The hidden-layer representation corresponding to [CLS] is used as the representation of the input sequence, that is, s = h_0. At the t-th time step, the output is a new vector h̃_t obtained by filtering h_t; the gated network generates a threshold as follows:

g_t = \mathrm{sigmoid}(W_g[h_t, s] + b)  (11)

where W_g is a linear transformation, b is the bias, g_t indicates the importance of the word, and h̃_t carries the filtered information, which is computed as

\tilde{h}_t = g_t \odot h_t.  (12)

The filtered sequence is put into the N-layer transformer decoder, which is shown in Fig. 3 (right). In order to efficiently decode the sequence and better capture the information passed by the encoder, the transformer's multihead attention is chosen to help the decoder learn the soft alignment between the summary and the source document. The decoder's learning objective is to minimize the negative log-likelihood of the conditional probability

L_{dec} = -\sum_{t=1}^{|a|} \log P(a_t = \hat{y}_t \mid a_{<t}, H).  (13)

Fig. 3. The encoder_ours architecture with the transformer, which consists of the gated network and the encoder–decoder framework based on multihead attention.
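A sketch of the gated filtering of (11) and (12) applied to the encoder states before they are handed to the transformer decoder; taking the [CLS] state as s = h_0 follows the text, while the tensor layout is an assumption.

```python
import torch
import torch.nn as nn

class GatedFilter(nn.Module):
    """Sketch of the gated network: each encoder state h_t is rescaled by a gate
    computed from [h_t, s], where s is the [CLS] representation h_0 (Eqs. (11)-(12))."""
    def __init__(self, hidden=768):
        super().__init__()
        self.gate = nn.Linear(2 * hidden, hidden)

    def forward(self, enc_states):                          # (batch, seq_len, hidden)
        s = enc_states[:, 0:1, :].expand_as(enc_states)     # broadcast the [CLS] vector h_0
        g = torch.sigmoid(self.gate(torch.cat([enc_states, s], dim=-1)))   # Eq. (11)
        return g * enc_states                               # Eq. (12): filtered states \tilde{h}_t
```

The filtered states then serve as the memory for the decoder's encoder–decoder attention, which is trained with the negative log-likelihood objective of (13).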
In addition, we propose a two-stage text summarization model based on the above work. The extracted sentences are first processed by the encoder in the previous stage to obtain the encoder output sequence representation X = {x_1, x_2, ..., x_n}. This representation is used as the input of the decoder; the decoder then predicts the final summary representation Y = {y_1, y_2, ..., y_n} through the transformer with the gated network. We found that the model can take advantage of the information shared between these two tasks, without fundamentally changing its architecture, to produce a more complete sequence.
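Read as pseudocode, the two-stage flow just described amounts to the following sketch; the extractor and abstractor interfaces and the choice of top-k are assumptions.

```python
import torch

def two_stage_summarize(sentences, extractor, abstractor, top_k=3):
    """Hypothetical sketch of the two-stage extractive-abstractive pipeline: score sentences
    with the extractive model, keep the highest-scoring ones in document order, and let the
    abstractive model rewrite them into the final summary."""
    scores = extractor(sentences)                            # one \hat{Y}_t score per sentence
    k = min(top_k, scores.numel())
    keep = torch.topk(scores, k=k).indices.sort().values     # restore the original sentence order
    salient = [sentences[int(i)] for i in keep]
    return abstractor(salient)                               # abstractive stage generates the summary
```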
D. Joint Learning

The entire model integrates the NTM and the text summarization, which can be updated simultaneously in one framework. In this framework, we jointly deal with topic modeling and summary generation and define the loss function of the overall framework as follows:

L_{final} = \begin{cases} \lambda L_{NTM} + L_{Ext}, & \text{mode} = \text{ext} \\ \lambda L_{NTM} + L_{Abs}, & \text{mode} = \text{abs} \\ \lambda L_{NTM} + L_{Ext} + L_{Abs}, & \text{mode} = \text{two-stage} \end{cases}  (14)

where L_NTM represents the loss of the NTM, and L_Ext and L_Abs, respectively, represent the losses of extractive summarization and abstractive summarization: L_Ext is the binary classification entropy of the extractive scores in (10), and L_Abs is the decoder loss L_dec in (13). λ is the tradeoff parameter controlling the balance between the topic model and text summarization. To accommodate the fact that the encoder is pretrained and the decoder must be trained from scratch, the encoder's optimizer is separated from the decoder's. We follow Liu and Lapata [30] in using different warm-up steps, learning rates, and two Adam optimizers for the encoder and the decoder, respectively. This makes the fine-tuning more stable.
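As a sketch, (14) is simply a mode-dependent sum of the three losses, with λ as the tradeoff weight (set to 1.0 in Section IV-B):

```python
def final_loss(l_ntm, l_ext=None, l_abs=None, mode="ext", lam=1.0):
    """Sketch of the joint objective of Eq. (14)."""
    if mode == "ext":
        return lam * l_ntm + l_ext
    if mode == "abs":
        return lam * l_ntm + l_abs
    if mode == "two-stage":
        return lam * l_ntm + l_ext + l_abs
    raise ValueError(f"unknown mode: {mode}")
```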
IV. EXPERIMENT

A. Data Preparation

We conduct experiments on two benchmark datasets, namely, CNN/Daily Mail [36] and XSum [6], which are both common and well-known corpora for text summarization [1]. In recent years, the former has been widely used in automatic text summarization tasks due to its large data volume and long text content. The latter spurs further research on summarization models because of its single news outlet and uniform summarization style. These datasets have different scales and methods of generation; some favor abstraction, and some favor extraction of a prominent or leading sentence. Table I reports the statistics of these datasets, including data segmentation, average text length, average summary length, and the percentage of novel bigrams in the gold summaries. We used the standard splits of [36] for training, validation, and testing (90 266/1220/1093 CNN documents and 196 961/12 148/10 397 DailyMail documents). We used the splits of Narayan et al. [6] for training, validation, and testing (204 045/11 332/11 334 XSum documents).

TABLE I. Basic statistics of the datasets: size of the training, validation, and test sets, and average document and summary length (in terms of words and sentences).

1) CNN/Daily Mail: There are 287 227 examples for training, 13 368 for validation, and 11 490 for testing. CNN/Daily Mail consists of news articles, each paired with a summary corresponding to several manually annotated highlight sentences. There are 52.90% novel bigrams in the CNN reference summaries and 52.16% in DailyMail. It is widely used in automatic text summarization tasks due to its large corpus and long texts, and it is suitable for both extractive and abstractive models. The original dataset is available online.¹

¹https://github.com/abisee/cnn-dailymail

2) XSum: It contains 226 711 BBC articles, each accompanied by a one-sentence summary that answers the question of what the article is about. There are 83.31% novel bigrams in the XSum reference summaries. The articles and summaries in the XSum dataset are shorter, but the vocabulary
is large enough to be compared to CNN. The original dataset is available online.²

²https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset

B. Experimental Setup

The experimental setup is described in terms of model settings, comparison models, and evaluation metrics. Among them, the evaluation metrics include automatic evaluation and manual evaluation.

1) Model Settings: All models were implemented on the PyTorch³ [37] version of OpenNMT [38]. To reduce GPU memory, we choose "BERT-base"⁴ for fine-tuning, which has 110M total parameters. The size of the vocabulary is 30 522, and the dimension of the word embedding is 768. For the number of topics, we set K = 1. When K = 1, the guidance of summary generation is the best. When K > 1, the model is slightly disturbed; we observed that the ability of words to represent multiple topics is deficient. However, on the whole, multiple topics do not deviate from the topic too far, which is closer to the reference summary than the effect of setting K to 0. We obtain a probability distribution over topics for each word, and the topic distribution can be inferred for any new document. We follow the grid search of Miao et al. [7] in tuning the hyperparameters of the NTM on the development set for the held-out perplexity. We check sparsity (between 1e-3 and 0.75) to estimate perplexity. For the optimal parameter setting, γ = 0.8 and λ = 1.0 control the effects of the NTM and the summarization. μ0 and σ0² are hyperparameters set for a zero-mean, unit-variance Gaussian. For extractive summarization, we obtain the score of each sentence from the output layer, rank the sentences from high to low, and select the first three sentences as the key sentences. For abstractive summarization, we use beam search with a beam size of 4. During beam search, we set the probability of duplicate words to 0 and delete sentences with fewer than three words from the result set until an end-of-sequence token is emitted. We use a six-layer transformer decoder with 512 hidden units, six-head attention blocks, and 2048 hidden feedforward units. The batch size is 140 with gradient accumulation every five steps.

³https://pytorch.org/
⁴https://github.com/huggingface/pytorch-pretrained-BERT

We use the Adam optimizer [39] and follow Vaswani et al. [23], with learning rates of 2e−3 and 0.05 for training the encoder and the decoder, respectively. In addition, we set two Adam optimizers with β1 = 0.9 and β2 = 0.999 for the encoder and the decoder, respectively. Model checkpoints were saved and evaluated on the validation set every 2000 steps. The maximum length of the summary sentence is set to 512. For regularization, we use dropout [40] and set the dropout rate to 0.1.
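A sketch of the separate-optimizer scheme described above; the attribute names model.encoder and model.decoder, and the omission of the per-optimizer warm-up schedules, are simplifying assumptions.

```python
import torch

def build_optimizers(model, lr_encoder=2e-3, lr_decoder=0.05):
    """Sketch: the pretrained BERT encoder and the freshly initialized decoder are updated
    by two separate Adam optimizers (beta1=0.9, beta2=0.999), following Liu and Lapata [30];
    in practice each optimizer would also use its own warm-up and learning-rate schedule."""
    enc_opt = torch.optim.Adam(model.encoder.parameters(), lr=lr_encoder, betas=(0.9, 0.999))
    dec_opt = torch.optim.Adam(model.decoder.parameters(), lr=lr_decoder, betas=(0.9, 0.999))
    return enc_opt, dec_opt
```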
In addition, we conduct three comparative experiments: T-BERTSum(Ext) extracts vital sentences based on the pretrained encoder and stacked LSTM; T-BERTSum(Abs) combines the six-layer transformer encoder–decoder and the gated network to generate summaries; and T-BERTSum(ExtAbs) integrates the extractive and abstractive models to generate sentence-level sequences.

2) Comparison Models: To further illustrate the superiority of our model on the two datasets, we compare its performance with many recent methods. They are divided into two groups based on whether they are extractive or abstractive models. All comparison models are described in detail as follows.
1) Leading Sentences (Lead-3): Lead-3 is a baseline that directly extracts the first three sentences of the article as the summary; it is an extractive baseline.
2) SummaRuNNer: Proposed by Nallapati et al. [41], it casts extraction as a sequence classification problem and selects the final subset of sentences with an RNN; it is an extractive baseline.
3) Refresh: Proposed by Narayan et al. [42], it optimizes the ROUGE evaluation by combining the maximum-likelihood cross entropy with a reinforcement learning objective to make sentence ranking more accurate; it is an extractive baseline.
4) HSSAS: Proposed by Al-Sabahi et al. [43], it creates sentence and document embeddings with a hierarchical self-attention mechanism; it is an extractive baseline. We followed Al-Sabahi et al. [43] in setting the maximum sentence length to 50 words and the maximum number of sentences per document to 100. At training time, the batch size was set to 64.
5) BERT + Transformer: A simple variant of BERT that combines it with the transformer to integrate sentences for extractive summarization, proposed by Liu [44]. We followed the original paper in selecting the top three checkpoints based on the evaluation losses and using trigram blocking to reduce redundancy.
6) Pointer-Generator + Coverage: See et al. [45] copy words directly from the original text through a pointer while retaining the ability to generate new words through the generator; it is an abstractive baseline.
7) Bottom-Up: Proposed by Gehrmann et al. [46], it identifies phrases in the source document that should be part of the summary by using a data-efficient content selector as a bottom-up attention step; it is an abstractive baseline.
8) DCA: Çelikyilmaz et al. [47] use multiple agents to represent a document and a hierarchical attention mechanism to decode over the agents. It was the best abstractive model of 2018 and serves as an abstractive baseline.
9) BERTSum: Proposed by Liu and Lapata [30], it uses pretrained language models to summarize effectively in generation tasks and can be used as a baseline for new methods.⁵
10) BEAR: Proposed by Wang et al. [31], it uses BERT word embeddings as input and integrates the extractive network and the generation network into a unified model through reinforcement learning; it is an abstractive baseline. We followed the original paper in setting the learning rate to 3 × 10⁻⁴, the maximum length

⁵https://github.com/nlpyang/BertSum
TABLE V. Comparison of the ground-truth summary and the summaries generated by the baseline BEAR model and our model on the CNN/Daily Mail dataset. For brevity, the article has been shortened; for readability, capitalization was added manually.

Fig. 6. Results of the ablation study for (a) the extractive model and (b) the abstractive model.

TABLE IV. Human evaluation of six models. We compare the scores of salience, coherence, and redundancy.
sentences. On the redundancy evaluation of the abstractive models, PTGEN + COV obtained the highest score because the method incorporates a copying mechanism based on the pointer network to avoid generating overlapping words. However, our model's score is not much lower, since we consider that the gated network plays a certain role in filtering information at each step. In terms of salience, we obtained the highest score in the manual evaluation, which is a recognition of our model's ability and of the quality of its summaries. We found that our model has better comprehensive ability, which indicates that integrating topic information can better summarize the original text on the premise of strong representation ability.

We present the example in Table V for comparison between our model and the baseline model. We can see that our model does not lose crucial information over long distances in the long article, while capturing the topic reliably. The words marked with underlines in the table are the important topics of the text. As an example, our model captures core ideas around the topic, such as "fierce fight", compared to the baseline model, which well reflects the event described by the article. When both models capture the same topic, our model can also generate new topic-related vocabulary, which is effective and accurate.

V. CONCLUSION AND FUTURE WORK

In this work, we propose a general extractive and abstractive model for text summarization, which is based on BERT's powerful architecture and additional topic embedding information to guide the capture of contextual information. For a good summary, an accurate representation is extremely important. This article introduces the representation of a powerful pretrained language model (BERT) to lay the foundation of the source text encoding and emphasizes the subjectivity of the generated content. The fusion of topic embedding is a direct and effective way to achieve high-quality generation through NTM inference. The combination of token embedding, segment embedding, position embedding, and topic embedding can more richly embed the information that the original text should contain. Stacking transformer layers in the encoding stage enhances BERT's ability to represent source texts, makes full use of self-attention, and judges the importance of different components of the sentence through different attention scores. The two-stage extractive–abstractive model can share information and generate salient summaries, which reduces redundancy to a certain degree. The experimental results show that the model proposed in this article achieves state-of-the-art results on the CNN/Daily Mail dataset and the XSum dataset. The analysis shows that the model can generate high-quality summaries with outstanding consistency with the original text.

Although the model has made some progress in text summarization, it also has some limitations. For long articles with multiple topics, our model has limited processing power. In future work, we will try to extend our work to the multitopic setting with the transformer network, capturing multiple topics hierarchically by imitating multihead self-attention, and further prove the validity of this approach. In addition, we need to further solve another big problem, namely that the generated summaries may not match the facts of the source text: on the one hand, how to introduce additional structured knowledge so that the encoder can consider not only the contextual representation but also additional knowledge information; on the other hand, how to extend the topic information so that we can obtain multiple topics and subtopics of the article to enhance sentence information and consolidate document-level knowledge. Finally, we can consider how to process topic information and additional structured knowledge in parallel on the basis of the method in this article, so as to make a qualitative leap in the task of text summarization while keeping the generated summary consistent with the original facts.

REFERENCES

[1] M. Allahyari et al., "Text summarization techniques: A brief survey," 2017, arXiv:1707.02268. [Online]. Available: http://arxiv.org/abs/1707.02268
[2] T. Ma, Q. Liu, J. Cao, Y. Tian, A. Al-Dhelaan, and M. Al-Rodhaan, "LGIEM: Global and local node influence based community detection," Future Gener. Comput. Syst., vol. 105, pp. 533–546, Apr. 2020, doi: 10.1016/j.future.2019.12.022.
[3] A. Khan and N. Salim, "A review on abstractive summarization methods," J. Theor. Appl. Inf. Technol., vol. 59, no. 1, pp. 64–72, 2014.
[4] M. Gambhir and V. Gupta, "Recent automatic text summarization techniques: A survey," Artif. Intell. Rev., vol. 47, no. 1, pp. 1–66, Jan. 2017, doi: 10.1007/s10462-016-9475-9.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1. Minneapolis, MN, USA: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186, doi: 10.18653/v1/n19-1423.
[6] S. Narayan, S. B. Cohen, and M. Lapata, "Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization," 2018, arXiv:1808.08745. [Online]. Available: http://arxiv.org/abs/1808.08745
[7] Y. Miao, E. Grefenstette, and P. Blunsom, "Discovering discrete latent topics with neural variational inference," 2017, arXiv:1706.00359. [Online]. Available: http://arxiv.org/abs/1706.00359
[8] K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proc. 2014 Conf. Empirical Methods Natural Lang. Process. (EMNLP), Doha, Qatar, Oct. 2014, pp. 1724–1734, doi: 10.3115/v1/d14-1179.
[9] H. Su et al., "Improving multi-turn dialogue modelling with utterance ReWriter," 2019, arXiv:1906.07004. [Online]. Available: http://arxiv.org/abs/1906.07004
[10] M. E. Peters et al., "Deep contextualized word representations," 2018, arXiv:1802.05365. [Online]. Available: http://arxiv.org/abs/1802.05365
[11] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," OpenAI, San Francisco, CA, USA, Tech. Rep., 2018.
[12] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692. [Online]. Available: http://arxiv.org/abs/1907.11692
[13] T. Ma, Y. Zhao, H. Zhou, Y. Tian, A. Al-Dhelaan, and M. Al-Rodhaan, "Natural disaster topic extraction in sina microblogging based on graph analysis," Expert Syst. Appl., vol. 115, pp. 346–355, Jan. 2019, doi: 10.1016/j.eswa.2018.08.010.
[14] N. Akhtar, H. Javed, and T. Ahmad, "Hierarchical summarization of text documents using topic modeling and formal concept analysis," in Data Management, Analytics and Innovation. Singapore: Springer, 2019, pp. 21–33.
[15] R. K. Roul, S. Mehrotra, Y. Pungaliya, and J. K. Sahoo, "A new automatic multi-document text summarization using topic modeling," in Proc. Int. Conf. Distrib. Comput. Internet Technol. (ICDCIT), vol. 11319. Cham, Switzerland: Springer, Jan. 2019, pp. 212–221, doi: 10.1007/978-3-030-05366-6_17.
[16] C. Lin and E. H. Hovy, "The automated acquisition of topic signatures for text summarization," in Proc. 18th Int. Conf. Comput. Linguistics (COLING). San Mateo, CA, USA: Morgan Kaufmann, Jul./Aug. 2000, pp. 495–501. [Online]. Available: https://www.aclweb.org/anthology/C00-1072/
[17] H. Pan, H. Liu, and Y. Tang, "A sequence-to-sequence text summarization model with topic based attention mechanism," in Proc. Int. Conf. Web Inf. Syst. Appl., vol. 11817. Cham, Switzerland: Springer, Sep. 2019, pp. 285–297, doi: 10.1007/978-3-030-30952-7_29.
[18] Z. Yang, Y. Yao, and S. Tu, "Exploiting sparse topics mining for temporal event summarization," in Proc. IEEE 5th Int. Conf. Image, Vis. Comput. (ICIVC), Jul. 2020, pp. 322–331.
[19] A. T. Sadiq, Y. H. Ali, and M. S. M. N. Fadhil, "Text summarization for social network conversation," in Proc. Int. Conf. Adv. Comput. Sci. Appl. Technol., Dec. 2013, pp. 13–18.
[20] Q. Zhou, N. Yang, F. Wei, S. Huang, M. Zhou, and T. Zhao, "Neural document summarization by jointly learning to score and select sentences," 2018, arXiv:1807.02305. [Online]. Available: http://arxiv.org/abs/1807.02305
[21] S. Ghodratnama, A. Beheshti, M. Zakershahrak, and F. Sobhanmanesh, "Extractive document summarization based on dynamic feature space mapping," IEEE Access, vol. 8, pp. 139084–139095, 2020, doi: 10.1109/ACCESS.2020.3012539.
[22] L. Wang, J. Yao, Y. Tao, L. Zhong, W. Liu, and Q. Du, "A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization," 2018, arXiv:1805.03616. [Online]. Available: http://arxiv.org/abs/1805.03616
[23] A. Vaswani et al., "Attention is all you need," 2017, arXiv:1706.03762. [Online]. Available: http://arxiv.org/abs/1706.03762
[24] T. Ma, H. Wang, L. Zhang, Y. Tian, and N. Al-Nabhan, "Graph classification based on structural features of significant nodes and spatial convolutional neural networks," Neurocomputing, vol. 423, pp. 639–650, Jan. 2021, doi: 10.1016/j.neucom.2020.10.060.
[25] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Proc. CCF Int. Conf. Natural Lang. Process. Chin. Comput. (NLPCC), vol. 11838. Cham, Switzerland: Springer, Oct. 2019, pp. 512–524, doi: 10.1007/978-3-030-32233-5_40.
[26] M.-H. Su, C.-H. Wu, and H.-T. Cheng, "A two-stage transformer-based approach for variable-length abstractive summarization," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 2061–2072, 2020, doi: 10.1109/TASLP.2020.3006731.
[27] X. Zhang, F. Wei, and M. Zhou, "HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization," 2019, arXiv:1905.06566. [Online]. Available: http://arxiv.org/abs/1905.06566
[28] A. Hoang, A. Bosselut, A. Celikyilmaz, and Y. Choi, "Efficient adaptation of pretrained transformers for abstractive summarization," 2019, arXiv:1906.00138. [Online]. Available: http://arxiv.org/abs/1906.00138
[29] H. Zhang, J. Xu, and J. Wang, "Pretraining-based natural language generation for text summarization," 2019, arXiv:1902.09243. [Online]. Available: http://arxiv.org/abs/1902.09243
[30] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process. (EMNLP-IJCNLP). Stroudsburg, PA, USA: Association for Computational Linguistics, Nov. 2019, pp. 3728–3738, doi: 10.18653/v1/D19-1387.
[31] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Appl. Sci., vol. 9, no. 21, p. 4701, Nov. 2019.
[32] A. Srikanth, A. S. Umasankar, S. Thanu, and S. J. Nirmala, "Extractive text summarization using dynamic clustering and co-reference on BERT," in Proc. 5th Int. Conf. Comput., Commun. Secur. (ICCCS), Patna, India, Oct. 2020, pp. 1–5, doi: 10.1109/ICCCS49678.2020.9277220.
[33] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450. [Online]. Available: http://arxiv.org/abs/1607.06450
[34] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, "Variational inference: A review for statisticians," 2016, arXiv:1601.00670. [Online]. Available: http://arxiv.org/abs/1601.00670
[35] M. X. Chen et al., "The best of both worlds: Combining recent advances in neural machine translation," 2018, arXiv:1804.09849. [Online]. Available: http://arxiv.org/abs/1804.09849
[36] K. M. Hermann et al., "Teaching machines to read and comprehend," 2015, arXiv:1506.03340. [Online]. Available: http://arxiv.org/abs/1506.03340
[37] A. Paszke et al., "Automatic differentiation in PyTorch," in Proc. NIPS Autodiff Workshop, Future Gradient-Based Mach. Learn. Softw. Techn., Long Beach, CA, USA, Dec. 2017.
[38] G. Klein, Y. Kim, Y. Deng, V. Nguyen, J. Senellart, and A. M. Rush, "OpenNMT: Neural machine translation toolkit," 2018, arXiv:1805.11462. [Online]. Available: http://arxiv.org/abs/1805.11462
[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, May 2015.
[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014. [Online]. Available: http://dl.acm.org/citation.cfm?id=2670313
[41] R. Nallapati, F. Zhai, and B. Zhou, "SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents," in Proc. 31st AAAI Conf. Artif. Intell., San Francisco, CA, USA: AAAI Press, Feb. 2017, pp. 3075–3081. [Online]. Available: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14636
[42] S. Narayan, S. B. Cohen, and M. Lapata, "Ranking sentences for extractive summarization with reinforcement learning," 2018, arXiv:1802.08636. [Online]. Available: http://arxiv.org/abs/1802.08636
[43] K. Al-Sabahi, Z. Zuping, and M. Nadher, "A hierarchical structured self-attentive model for extractive document summarization (HSSAS)," IEEE Access, vol. 6, pp. 24205–24212, 2018, doi: 10.1109/ACCESS.2018.2829199.
[44] Y. Liu, "Fine-tune BERT for extractive summarization," 2019, arXiv:1903.10318. [Online]. Available: http://arxiv.org/abs/1903.10318
[45] A. See, P. J. Liu, and C. D. Manning, "Get to the point: Summarization with pointer-generator networks," in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics (ACL), vol. 1. Stroudsburg, PA, USA: Association for Computational Linguistics, Jul./Aug. 2017, pp. 1073–1083, doi: 10.18653/v1/P17-1099.
[46] S. Gehrmann, Y. Deng, and A. M. Rush, "Bottom-up abstractive summarization," 2018, arXiv:1808.10792. [Online]. Available: http://arxiv.org/abs/1808.10792
[47] A. Çelikyilmaz, A. Bosselut, X. He, and Y. Choi, "Deep communicating agents for abstractive summarization," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol. (NAACL-HLT), vol. 1. Stroudsburg, PA, USA: Association for Computational Linguistics, Jun. 2018, pp. 1662–1675, doi: 10.18653/v1/n18-1150.
[48] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, 2004, pp. 74–81.

Tinghuai Ma (Member, IEEE) received the bachelor's and master's degrees from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 1997 and 2000, respectively, and the Ph.D. degree from the Chinese Academy of Sciences, Beijing, China, in 2003.
He was a Post-Doctoral Associate with AJOU University, Suwon, South Korea, in 2004. From November 2007 to July 2008, he visited the China Meteorological Administration, Beijing. From February 2009 to August 2009, he was a Visiting Professor with the Ubiquitous Computing Laboratory, Kyung Hee University, Seoul, South Korea. He is currently a Professor of computer sciences with the Nanjing University of Information Science and Technology, Nanjing, China. He has published more than 100 journal articles and conference papers. His research interests include data mining, cloud computing, ubiquitous computing, and privacy preservation.

Qian Pan received the bachelor's degree in software engineering from the Nanjing University of Information Science and Technology, Nanjing, China, in 2021.
She is currently a Computer Professional Researcher with the Nanjing University of Information Science and Technology. Her research interest lies in data mining, especially the text summarization task.
Huan Rong received the Ph.D. degree in computer science from the Nanjing University of Information Science and Technology, Nanjing, China, in 2020.
He is currently a Visiting Scholar with the University of Central Arkansas, Conway, AR, USA. He is also an Assistant Professor with the School of Artificial Intelligence, Nanjing University of Information Science and Technology. His research interests lie in deep learning and the application of artificial intelligence, especially in sentiment analysis and other interdisciplinary tasks. His research contributions have been published in Information Sciences, IEEE Transactions on Affective Computing, Soft Computing, and other venues.

Yuan Tian received the master's and Ph.D. degrees from Kyung Hee University, Seoul, South Korea, in 2009 and 2012, respectively.
She is currently an Assistant Professor with the College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. She is also an Associate Professor with the School of Computer, Nanjing Institute of Technology, Nanjing, China. Her research interests are broadly divided into privacy and security, which are related to the cloud.
Dr. Tian is also a member of the technical committees of several international conferences and an active reviewer for many international journals.