T-BERTSum: Topic-Aware Text Summarization Based on BERT
Abstract—In the era of social networks, the rapid growth of data mining in information retrieval and natural language processing makes automatic text summarization necessary. Currently, pretrained word embeddings and sequence-to-sequence models can be effectively adapted in social network summarization to extract significant information with strong encoding capability. However, how to tackle long-text dependence and utilize the latent topic mapping has become an increasingly crucial challenge for these models. In this article, we propose a topic-aware extractive and abstractive summarization model named T-BERTSum, based on Bidirectional Encoder Representations from Transformers (BERT). This is an improvement over previous models, in which the proposed approach can simultaneously infer topics and generate summaries from social texts. First, the encoded latent topic representation, obtained through the neural topic model (NTM), is matched with the embedded representation of BERT to guide the generation with the topic. Second, the long-term dependencies are learned through the transformer network to jointly explore topic inference and text summarization in an end-to-end manner. Third, long short-term memory (LSTM) network layers are stacked on the extractive model to capture sequence timing information, and the effective information is further filtered on the abstractive model through a gated network. In addition, a two-stage extractive–abstractive model is constructed to share information. Compared with previous work, the proposed model T-BERTSum focuses on pretrained external knowledge and topic mining to capture more accurate contextual representations. Experimental results on the CNN/Daily Mail and XSum datasets demonstrate that our proposed model achieves new state-of-the-art results while generating consistent topics compared with the most advanced methods.

Index Terms—Bidirectional Encoder Representations from Transformers (BERT), neural topic model (NTM), social network, text summarization.

I. INTRODUCTION

Automatic text summarization generates a condensed version of the input documents while still retaining their key content. This technique plays an important role in information retrieval and natural language processing (NLP) [1] and is a research hotspot in a multitude of fields, such as computer science, multimedia, and statistics [2]. Currently, typical summarization methods include extractive and abstractive ones [3]. The extractive method selects salient sentences or reorganizes sentences from the original text, whereas the abstractive method generates novel words or phrases with comprehension. Generally, most existing methods are designed to encode paragraphs and then decode with different mechanisms. Nevertheless, there is a large amount of information loss in the encoding and decoding stages. Therefore, existing works on summarization mainly focus on word embedding or contextual contents.

However, the advantages of word embedding are limited by specific small datasets, which require richer contextual information. The future direction of summarization is to predict words from the full context by language modeling and representation learning [3]. Consequently, in this article, the following challenging research problem is studied: how to use a pretrained language model for text representation and generation. Unfortunately, it is an open challenge to generate sentences related to a topic with overall coherence and discourse relatedness [4]. In order to summarize the original text well, we propose a topic-aware extractive and abstractive summarization model named T-BERTSum. The proposed model T-BERTSum faces several challenges. First, the model is expected to obtain accurate and updatable topics corresponding to the specific article. Second, the model must match topic information with word embeddings in an end-to-end manner to guide the generation of topic-aware summaries.
Long short-term memory (LSTM) layers are stacked on top of the output layer to classify whether a sentence belongs to the summary for the extractive model. Besides, a gated network is added for the abstractive model to remove useless information. As a consequence, the dependencies of sentences with a relatively long span can be captured. T-BERTSum adopts joint learning of topic modeling and text summarization and separates the optimizers of the encoder and the decoder to accommodate the fact that the former is pretrained and the latter must be trained from scratch. In addition, a two-stage summarization framework is constructed, based on T-BERTSum, to generate summaries from the extracted sentences, sharing information while greatly reducing sentence redundancy. Generally, the main contributions of this work are given as follows.
1) We propose T-BERTSum, which applies BERT to text summarization and introduces rich semantic features, based on a modified transformer architecture that achieves efficient and parallel computation.
2) Background information is integrated into the encoding as additional knowledge, encoded as an adjustable topic representation, aiming to guide the generation of summaries in an end-to-end manner.
3) The ability to generate smooth summaries with low redundancy is improved by sharing information between different tasks in a two-stage extractive–abstractive model.
The rest of this article is organized as follows. In Section II, we review related works that can be adapted for the text summarization task in this article. In Section III, we elaborate on our proposed framework T-BERTSum in detail, followed by the experimental analysis in Section IV. According to the experimental results, we draw conclusions in Section V.

II. RELATED WORKS

Current works on text summarization mainly focus on word embedding, which represents each element in some way [8], [9]. However, word embedding cannot completely solve the problem of polysemy. In order to alleviate this problem, Embeddings from Language Models (ELMo) [10] adopted a bidirectional LSTM to train the language model, where the hierarchical LSTM can grasp information of different granularity. The LSTM captures word features, syntactic features, and semantic features, respectively, from the shallow layers to the deep ones, but its parallelism is poor. By contrast, the transformer [23] accelerates the deep learning training process based on the attention mechanism, greatly improves the feature extraction capability, and contributes to parallel processing. Moreover, generative pretraining (GPT) [11] obtained a better context representation by using a feature extractor with a unidirectional transformer. Furthermore, BERT [5] is trained on a corpus of about 3.3 billion words via masked language modeling and next sentence prediction, which yields better word embeddings. In recent years, BERT has been successfully applied to various NLP tasks, such as textual entailment, named entity recognition, and machine reading comprehension. RoBERTa [12] made several adjustments on the basis of BERT, namely, more training data, a larger batch size, longer training time, and removal of the next sentence prediction loss. In our study, we focus on BERT to extract context information effectively for sequence encoding. We believe that the BERT-based model can achieve better performance.

Most importantly, one of the limitations of automatic summarization is how to reflect the implicit information conveyed between different texts and the background influence [13]. Akhtar et al. [14] used latent Dirichlet allocation (LDA) to label documents with topics and used formal concept analysis (FCA) to automatically organize them in a lattice structure. In this way, topic identification and document organization help with text mining. Roul et al. [15] proposed a heuristic method that used the LDA technique to identify the optimum number of independent topics present in the corpus, ensuring that all the important contents from the corpus of documents are captured in the extracted summary. In addition, a two-tiered topic model based on the pachinko allocation model (PAM) has been combined with the TextRank method for summarization [16]. The word–topic distribution of LDA has also been combined with the sequence-to-sequence model to improve abstractive sentence summarization [17]. Yang et al. [18] introduced a novel neighborhood preserving semantic (NPS) measure to capture the sparse candidate topics under a low-rank matrix factorization model. These techniques used the topic model, as an additional mechanism, to improve text generation. Nonetheless, these models have some sparseness problems and are difficult to train. Our approach utilizes the neural topic model (NTM) to induce implicit topics in neural networks, which is easy to explain and extend. We also demonstrate the influence of topics on text summarization.

From another aspect, text summarization is basically divided into extractive and abstractive models. Sadiq et al. [19] consider that the target user lacks background knowledge or reading ability and propose a linear combination of feature scores for social networks. NeuSum [20] integrated the selection strategy into the scoring model, which resolves the previous separation between sentence scoring and sentence selection and enables end-to-end training without human intervention. ExDoS [21] is the first approach to combine supervised and unsupervised algorithms in a single framework for document summarization; it iteratively minimizes the error rate of the classifier in each cluster with the help of dynamic local feature weighting. Wang et al. [22] used a convolutional seq2seq model and the policy gradient algorithm to summarize and optimize the text. With the emergence of self-attention [23], an increasing number of methods [24], [25] adopt self-attention instead of the RNN sequence model and use multihead attention to capture different semantic information, based on the fact that each element contributes differently to the sequence. Su et al. [26] propose a two-stage method for variable-length abstractive summarization; it consists of a text segmentation module and a two-stage transformer-based summarization module and has achieved good results in capturing the relationship between sentences.

In addition, some works [27], [28] have recently focused on summarization using pretrained language models.
Topic embedding is appended to each input sequence; it is the output representation of the topic information in the implicit sequence trained by the NTM described in Section III-B. The main contribution of topic embedding is to represent the topic information hidden in each word or sequence, which can mine the gist of the article and alleviate the problem of polysemy. As an example, the word "novel" can be understood as fiction or as new in different contexts. Therefore, it is necessary to dig out and represent the background information of each word.
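The paper does not spell out how the four embeddings are fused, so the following PyTorch sketch should be read as one plausible realization under stated assumptions (summation-based fusion in the style of BERT's own input embeddings, plus a linear projection of the NTM topic mixture); the class and parameter names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class TopicAwareInputEmbedding(nn.Module):
    """Hypothetical sketch: combine BERT-style token, segment, and position embeddings
    with a topic embedding derived from the NTM topic mixture (Section III-B).
    Summation is an assumed fusion; the paper only says topic embedding is appended."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, num_topics=1):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.seg = nn.Embedding(2, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        self.topic_proj = nn.Linear(num_topics, hidden)   # project theta from the NTM into the hidden space

    def forward(self, token_ids, segment_ids, theta):
        # token_ids, segment_ids: (batch, seq_len); theta: (batch, num_topics)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        topic = self.topic_proj(theta).unsqueeze(1)       # (batch, 1, hidden), broadcast over all tokens
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions) + topic
```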
B. Neural Topic Model

Our topic model is inspired by the NTM, which induces latent topics in neural networks. We assume a topic matrix t ∈ R^{K×H}, where K is the number of topics and H is the dimension of the embedding. The token embedding e indicates the meaning of each token. The topic distribution over words for a given topic assignment z_n is p(w_n | β, z_n) = Multi(β_{z_n}), where Multi is the multinomial topic distribution, which is generated by computing the token embedding and the topic embedding as follows:

\beta_K = \mathrm{softmax}\left(e \cdot t_K^{\top}\right).  (3)

We assume that β is obtained by calculating the semantic similarity between topics and words. Here, the prior parameters μ and σ in Fig. 1 are defined through G(θ | μ0, σ0²), in which the Gaussian sample x ∼ N(x | μ0, σ0²) is drawn from a Gaussian distribution whose hyperparameters μ0 and σ0² are set for a zero mean and unit variance. We use the Gaussian softmax to generate θ = g(x) and variational inference [34] to approximate a posterior distribution q(θ | d) over x. The loss function of the topic model is defined as

L_{NTM} = \mathbb{E}_{q(\theta|d)} \Big[ \sum_{n=1}^{N} \log \sum_{z_n} p(w_n \mid \beta_{z_n})\, p(z_n \mid \theta) \Big] - D_{KL}\big( q(\theta|d) \,\|\, p(\theta \mid \mu_0, \sigma_0^2) \big)  (4)

where q(θ | d) is the variational distribution approximating the true posterior p(θ | d). The KL term in (4) can be easily integrated as a Gaussian KL-divergence. We generate the variational parameters μ(d) and σ(d) through the inference network for document d so that we can estimate the variational lower bound by sampling θ from q(θ | d) = G(θ | μ(d), σ²(d)). We leave out the derivation details and refer the readers to [7].
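For concreteness, a minimal PyTorch sketch of the NTM described above is given below, assuming a bag-of-words input per document, a standard normal prior, and a single-hidden-layer inference network (all of which are assumptions about details the text leaves open). The KL term is written in the usual closed form for a diagonal Gaussian against N(0, I).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NTM(nn.Module):
    """Minimal sketch of the neural topic model of Section III-B (after Miao et al. [7]).
    Layer sizes, softmax axes, and the inference-network architecture are assumptions."""
    def __init__(self, vocab_size, num_topics=1, hidden=768, embed_dim=768):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, num_topics)
        self.to_logvar = nn.Linear(hidden, num_topics)
        self.topic_emb = nn.Parameter(torch.randn(num_topics, embed_dim))   # t in R^{K x H}
        self.word_emb = nn.Parameter(torch.randn(vocab_size, embed_dim))    # token embedding e

    def forward(self, bow):                        # bow: (batch, vocab_size) float word counts of document d
        h = self.encoder(bow)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        x = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterized Gaussian sample
        theta = F.softmax(x, dim=-1)                             # Gaussian softmax: theta = g(x)
        beta = F.softmax(self.word_emb @ self.topic_emb.t(), dim=0).t()   # Eq. (3): beta_K = softmax(e . t_K^T)
        recon = -(bow * torch.log(theta @ beta + 1e-10)).sum(-1)          # expected log-likelihood term of Eq. (4)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)       # closed-form Gaussian KL to N(0, I)
        return theta, (recon + kl).mean()
```

The returned topic mixture θ is what the summarizer consumes as topic embedding, and the returned loss is the L_NTM term that enters the joint objective of Section III-D.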
C. Summarization

The transformer is used as the encoder, based on the self-attention mechanism. Compared with an RNN, which needs to process the input sequence word by word, it calculates the context vector of each word through self-attention, which gives excellent parallelism and low computational complexity. We stack a six-layer transformer in which each layer has multihead attention and feedforward sublayers. The final output of the encoder is the contextual embedding, as described in Section III-A. As shown in Fig. 1, two modes are built for summarization: the extractive mode and the abstractive mode.

1) Extractive: Extractive summarization can be defined as the task of assigning a label Y_t ∈ {0, 1} to each sentence, indicating whether the sentence should be included in the summary or not. Moreover, the LSTM combined with the transformer still has its unique advantages [35]: it adds a forget gate to the simple RNN model to control historical state information. Therefore, the LSTM is used for classifying sentences for the summary. After the encoder, the sentence embeddings S = {s_1, s_2, ..., s_n} are obtained as the input of the extractive model, which filters key sentences by document-level features. At the t-th time step, the input is the vector s_t, and the output is calculated as follows:

f_t = \sigma(w_f s_t + b_f)  (5)
i_t = \sigma(w_i s_t + b_i)  (6)
o_t = \sigma(w_o s_t + b_o)  (7)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(w_c s_t + b_c)  (8)
h_t = o_t \odot \tanh(c_t)  (9)

where σ is the sigmoid function; s_t is the current input; f_t, i_t, and o_t are the forget, input, and output gates; c_t and h_t are the context vector and the output vector; and w_f, b_f, w_i, b_i, w_o, b_o, w_c, and b_c are the weights and biases of the forget gate, the input gate, the output gate, and the context vector, respectively. The final output layer uses a sigmoid function to calculate the final prediction score Ŷ_t, as shown in (10), to determine whether the sentence should be included in the summary. The loss of the whole model is the binary classification entropy of Ŷ_t against the gold label Y_t

\hat{Y}_t = \mathrm{sigmoid}(w_o h_t + b_o).  (10)
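A compact sketch of this extractive head follows, using PyTorch's built-in LSTM in place of the hand-written gate equations (5)-(9); the number of stacked layers and the hidden size are assumptions.

```python
import torch
import torch.nn as nn

class ExtractiveHead(nn.Module):
    """Sketch of the extractive classifier: stacked LSTM layers over the encoder's
    sentence vectors s_1..s_n, followed by a sigmoid scoring layer (Eqs. (5)-(10))."""
    def __init__(self, hidden=768, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(hidden, hidden, num_layers=num_layers, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, sent_vecs):                  # (batch, n_sents, hidden)
        h, _ = self.lstm(sent_vecs)                # h_t of Eq. (9) at every sentence position
        return torch.sigmoid(self.score(h)).squeeze(-1)   # \hat{Y}_t of Eq. (10)

# Training would use the binary cross entropy of the scores against the gold labels Y_t, e.g.:
# loss = nn.functional.binary_cross_entropy(head(sent_vecs), gold_labels.float())
```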
2) Abstractive: There are many words in the sequence, and only a few of them capture the key information of the entire sequence, which is exactly what we need. In order to filter the key information of the input sequence, a gated network is added to improve the transformer before the decoder, as shown in Fig. 3. Generally, the gated network is used to control the information flow from the input sequence to the output sequence, which makes the decoder focus on generating summaries from key information and removing unnecessary information. Here, the input of the gated network is the sentence representation s. The hidden-layer representation corresponding to [CLS] is used as the representation of the input sequence, that is, s = h_0. At the t-th time step, the output is a new vector h̃_t obtained by filtering h_t; the gated network generates a threshold as follows:

g_t = \mathrm{sigmoid}(W_g[h_t, s] + b)  (11)

where W_g is a linear transformation, b is the bias, g_t indicates the importance of the word, and h̃_t carries the filtered information, which is computed as

\tilde{h}_t = g_t \odot h_t.  (12)

The filtered sequence is put into the N-layer transformer decoder, which is shown in Fig. 3 (right). In order to efficiently decode the sequence and better capture the information passed by the encoder, the transformer's multihead attention is chosen to help the decoder learn the soft alignment between the summary and the source document. The decoder's learning objective is to minimize the negative log-likelihood of the conditional probability

L_{dec} = -\sum_{t=1}^{|a|} \log P(a_t = \hat{y}_t \mid a_{<t}, H).  (13)

Fig. 3. The encoder_ours architecture with the transformer, which consists of the gated network and the encoder–decoder framework based on multihead attention.
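A sketch of the gated filtering of (11) and (12) applied to the encoder states before they are handed to the transformer decoder; taking the [CLS] state as s = h_0 follows the text, while the tensor layout is an assumption.

```python
import torch
import torch.nn as nn

class GatedFilter(nn.Module):
    """Sketch of the gated network: each encoder state h_t is rescaled by a gate
    computed from [h_t, s], where s is the [CLS] representation h_0 (Eqs. (11)-(12))."""
    def __init__(self, hidden=768):
        super().__init__()
        self.gate = nn.Linear(2 * hidden, hidden)

    def forward(self, enc_states):                          # (batch, seq_len, hidden)
        s = enc_states[:, 0:1, :].expand_as(enc_states)     # broadcast the [CLS] vector h_0
        g = torch.sigmoid(self.gate(torch.cat([enc_states, s], dim=-1)))   # Eq. (11)
        return g * enc_states                               # Eq. (12): filtered states \tilde{h}_t
```

The filtered states then serve as the memory for the decoder's encoder–decoder attention, which is trained with the negative log-likelihood objective of (13).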
In addition, we propose a two-stage text summarization model based on the above work. The extracted sentences are first processed by the encoder in the previous stage to obtain the encoder output sequence representation X = {x_1, x_2, ..., x_n}. This representation is used as the input of the decoder; the decoder then predicts the final summary representation Y = {y_1, y_2, ..., y_n} through the transformer with the gated network. We found that the model can take advantage of the information shared between these two tasks, without fundamentally changing its architecture, to produce a more complete sequence.
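Read as pseudocode, the two-stage flow just described amounts to the following sketch; the extractor and abstractor interfaces and the choice of top-k are assumptions.

```python
import torch

def two_stage_summarize(sentences, extractor, abstractor, top_k=3):
    """Hypothetical sketch of the two-stage extractive-abstractive pipeline: score sentences
    with the extractive model, keep the highest-scoring ones in document order, and let the
    abstractive model rewrite them into the final summary."""
    scores = extractor(sentences)                            # one \hat{Y}_t score per sentence
    k = min(top_k, scores.numel())
    keep = torch.topk(scores, k=k).indices.sort().values     # restore the original sentence order
    salient = [sentences[int(i)] for i in keep]
    return abstractor(salient)                               # abstractive stage generates the summary
```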
D. Joint Learning

The entire model integrates the NTM and the text summarization, which can be updated simultaneously in one framework. In this framework, we jointly deal with topic modeling and summary generation and define the loss function of the overall framework as follows:

L_{final} = \begin{cases} \lambda L_{NTM} + L_{Ext}, & \text{mode} = \text{ext} \\ \lambda L_{NTM} + L_{Abs}, & \text{mode} = \text{abs} \\ \lambda L_{NTM} + L_{Ext} + L_{Abs}, & \text{mode} = \text{two-stage} \end{cases}  (14)

where L_NTM represents the loss of the NTM, and L_Ext and L_Abs, respectively, represent the losses of extractive summarization and abstractive summarization: L_Ext is the binary classification entropy of the extractive scores in (10), and L_Abs is the decoder loss L_dec in (13). λ is the tradeoff parameter controlling the balance between the topic model and text summarization. To accommodate the fact that the encoder is pretrained and the decoder must be trained from scratch, the encoder's optimizer is separated from the decoder's. We follow Liu and Lapata [30] in using different warm-up steps, learning rates, and two Adam optimizers for the encoder and the decoder, respectively. This makes the fine-tuning more stable.
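As a sketch, (14) is simply a mode-dependent sum of the three losses, with λ as the tradeoff weight (set to 1.0 in Section IV-B):

```python
def final_loss(l_ntm, l_ext=None, l_abs=None, mode="ext", lam=1.0):
    """Sketch of the joint objective of Eq. (14)."""
    if mode == "ext":
        return lam * l_ntm + l_ext
    if mode == "abs":
        return lam * l_ntm + l_abs
    if mode == "two-stage":
        return lam * l_ntm + l_ext + l_abs
    raise ValueError(f"unknown mode: {mode}")
```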
IV. EXPERIMENT

A. Data Preparation

We conduct experiments on two benchmark datasets, namely, CNN/Daily Mail [36] and XSum [6], which are both common and well-known corpora for text summarization [1]. In recent years, the former has been widely used in automatic text summarization tasks due to its large data volume and long text content. The latter spurs further research on summarization models because of its single news outlet and uniform summarization style. These datasets have different scales and methods of generation; some favor abstraction, and some favor extraction of a prominent or leading sentence. Table I reports the statistics of these datasets, including data segmentation, average text length, average summary length, and the percentage of novel bigrams in the gold summaries. We used the standard splits of [36] for training, validation, and testing (90 266/1220/1093 CNN documents and 196 961/12 148/10 397 DailyMail documents). We used the splits of Narayan et al. [6] for training, validation, and testing (204 045/11 332/11 334 XSum documents).

TABLE I. Basic statistics of the datasets: size of the training, validation, and test sets, and average document and summary length (in terms of words and sentences).

1) CNN/Daily Mail: There are 287 227 examples for training, 13 368 for validation, and 11 490 for testing. CNN/Daily Mail consists of news articles, each paired with a summary corresponding to several manually annotated highlight sentences. There are 52.90% novel bigrams in the CNN reference summaries and 52.16% in DailyMail. It is widely used in automatic text summarization tasks due to its large corpus and long texts, and it is suitable for both extractive and abstractive models. The original dataset is available online.¹

¹https://github.com/abisee/cnn-dailymail

2) XSum: It contains 226 711 BBC articles, each accompanied by a one-sentence summary that answers the question of what the article is about. There are 83.31% novel bigrams in the XSum reference summaries. The articles and summaries in the XSum dataset are shorter, but the vocabulary
is large enough to be compared to CNN. The original dataset is available online.²

²https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset

B. Experimental Setup

The experimental setup is described in terms of model settings, comparison models, and evaluation metrics. Among them, the evaluation metrics include automatic evaluation and manual evaluation.

1) Model Settings: All models were implemented on the PyTorch³ [37] version of OpenNMT [38]. To reduce GPU memory, we choose "BERT-base"⁴ for fine-tuning, which has 110M total parameters. The size of the vocabulary is 30 522, and the dimension of the word embedding is 768. For the number of topics, we set K = 1. When K = 1, the guidance of summary generation is the best. When K > 1, the model is slightly disturbed; we observed that the ability of words to represent multiple topics is deficient. However, on the whole, multiple topics do not deviate from the topic too far, which is closer to the reference summary than the effect of setting K to 0. We obtain a probability distribution over topics for each word, and the topic distribution can be inferred for any new document. We follow the grid search of Miao et al. [7] in tuning the hyperparameters of the NTM on the development set for the held-out perplexity. We check sparsity (between 1e-3 and 0.75) to estimate perplexity. For the optimal parameter setting, γ = 0.8 and λ = 1.0 control the effects of the NTM and the summarization. μ0 and σ0² are hyperparameters set for a zero-mean, unit-variance Gaussian. For extractive summarization, we obtain the score of each sentence from the output layer, rank the sentences from high to low, and select the first three sentences as the key sentences. For abstractive summarization, we use beam search with a beam size of 4. During beam search, we set the probability of duplicate words to 0 and delete sentences with fewer than three words from the result set until an end-of-sequence token is emitted. We use a six-layer transformer decoder with 512 hidden units, six-head attention blocks, and 2048 hidden feedforward units. The batch size is 140 with gradient accumulation every five steps.

³https://pytorch.org/
⁴https://github.com/huggingface/pytorch-pretrained-BERT

We use the Adam optimizer [39] and follow Vaswani et al. [23], with learning rates of 2e−3 and 0.05 for training the encoder and the decoder, respectively. In addition, we set two Adam optimizers with β1 = 0.9 and β2 = 0.999 for the encoder and the decoder, respectively. Model checkpoints were saved and evaluated on the validation set every 2000 steps. The maximum length of the summary sentence is set to 512. For regularization, we use dropout [40] and set the dropout rate to 0.1.
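A sketch of the separate-optimizer scheme described above; the attribute names model.encoder and model.decoder, and the omission of the per-optimizer warm-up schedules, are simplifying assumptions.

```python
import torch

def build_optimizers(model, lr_encoder=2e-3, lr_decoder=0.05):
    """Sketch: the pretrained BERT encoder and the freshly initialized decoder are updated
    by two separate Adam optimizers (beta1=0.9, beta2=0.999), following Liu and Lapata [30];
    in practice each optimizer would also use its own warm-up and learning-rate schedule."""
    enc_opt = torch.optim.Adam(model.encoder.parameters(), lr=lr_encoder, betas=(0.9, 0.999))
    dec_opt = torch.optim.Adam(model.decoder.parameters(), lr=lr_decoder, betas=(0.9, 0.999))
    return enc_opt, dec_opt
```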
In addition, we conduct three comparative experiments: T-BERTSum(Ext) extracts vital sentences based on the pretrained encoder and stacked LSTM; T-BERTSum(Abs) combines the six-layer transformer encoder–decoder and the gated network to generate summaries; and T-BERTSum(ExtAbs) integrates the extractive and abstractive models to generate sentence-level sequences.

2) Comparison Models: To further illustrate the superiority of our model on the two datasets, we compare its performance with many recent methods. They are divided into two groups based on whether they are extractive or abstractive models. All comparison models are described in detail as follows.
1) Leading Sentences (Lead-3): Lead-3 is a baseline that directly extracts the first three sentences of the article as the summary; it is an extractive baseline.
2) SummaRuNNer: Proposed by Nallapati et al. [41], it casts extraction as a sequence classification problem and selects the final subset of sentences with an RNN; it is an extractive baseline.
3) Refresh: Proposed by Narayan et al. [42], it optimizes the ROUGE evaluation by combining the maximum-likelihood cross entropy with a reinforcement learning objective to make sentence ranking more accurate; it is an extractive baseline.
4) HSSAS: Proposed by Al-Sabahi et al. [43], it creates sentence and document embeddings with a hierarchical self-attention mechanism; it is an extractive baseline. We followed Al-Sabahi et al. [43] in setting the maximum sentence length to 50 words and the maximum number of sentences per document to 100. At training time, the batch size was set to 64.
5) BERT + Transformer: A simple variant of BERT that combines it with the transformer to integrate sentences for extractive summarization, proposed by Liu [44]. We followed the original paper in selecting the top three checkpoints based on the evaluation losses and using trigram blocking to reduce redundancy.
6) Pointer-Generator + Coverage: See et al. [45] copy words directly from the original text through a pointer while retaining the ability to generate new words through the generator; it is an abstractive baseline.
7) Bottom-Up: Proposed by Gehrmann et al. [46], it identifies phrases in the source document that should be part of the summary by using a data-efficient content selector as a bottom-up attention step; it is an abstractive baseline.
8) DCA: Çelikyilmaz et al. [47] use multiple agents to represent a document and a hierarchical attention mechanism to decode over the agents. It was the best abstractive model of 2018 and serves as an abstractive baseline.
9) BERTSum: Proposed by Liu and Lapata [30], it uses pretrained language models to summarize effectively in generation tasks and can be used as a baseline for new methods.⁵
10) BEAR: Proposed by Wang et al. [31], it uses BERT word embeddings as input and integrates the extractive network and the generation network into a unified model through reinforcement learning; it is an abstractive baseline. We followed the original paper in setting the learning rate to 3 × 10⁻⁴, the maximum length

⁵https://github.com/nlpyang/BertSum
TABLE V. Comparison of the ground-truth summary and the summaries generated by the baseline BEAR model and our model on the CNN/Daily Mail dataset. For brevity, the article has been shortened; for readability, capitalization was added manually.

Fig. 6. Results of the ablation study for (a) the extractive model and (b) the abstractive model.

TABLE IV. Human evaluation of six models. We compare the scores of salience, coherence, and redundancy.
sentences. On the redundancy evaluation of the abstractive models, PTGEN + COV obtained the highest score because the method incorporates a copying mechanism based on the pointer network to avoid generating overlapping words. However, our model's score is not much lower, since we consider that the gated network plays a certain role in filtering information at each step. In terms of salience, we obtained the highest score in the manual evaluation, which is a recognition of our model's ability and of the quality of its summaries. We found that our model has better comprehensive ability, which indicates that integrating topic information can better summarize the original text on the premise of strong representation ability.

We present the example in Table V for comparison between our model and the baseline model. We can see that our model does not lose crucial information over long distances in the long article, while capturing the topic reliably. The words marked with underlines in the table are the important topics of the text. As an example, our model captures core ideas around the topic, such as "fierce fight", compared to the baseline model, which well reflects the event described by the article. When both models capture the same topic, our model can also generate new topic-related vocabulary, which is effective and accurate.

V. CONCLUSION AND FUTURE WORK

In this work, we propose a general extractive and abstractive model for text summarization, which is based on BERT's powerful architecture and additional topic embedding information to guide the capture of contextual information. For a good summary, an accurate representation is extremely important. This article introduces the representation of a powerful pretrained language model (BERT) to lay the foundation of the source text encoding and emphasizes the subjectivity of the generated content. The fusion of topic embedding is a direct and effective way to achieve high-quality generation through NTM inference. The combination of token embedding, segment embedding, position embedding, and topic embedding can more richly embed the information that the original text should contain. Stacking transformer layers in the encoding stage enhances BERT's ability to represent source texts, makes full use of self-attention, and judges the importance of different components of the sentence through different attention scores. The two-stage extractive–abstractive model can share information and generate salient summaries, which reduces redundancy to a certain degree. The experimental results show that the model proposed in this article achieves state-of-the-art results on the CNN/Daily Mail dataset and the XSum dataset. The analysis shows that the model can generate high-quality summaries with outstanding consistency with the original text.

Although the model has made some progress in text summarization, it also has some limitations. For long articles with multiple topics, our model has limited processing power. In future work, we will try to extend our work to the multitopic setting with the transformer network, capturing multiple topics hierarchically by imitating multihead self-attention, and further prove the validity of this approach. In addition, we need to further solve another big problem, namely that the generated summaries may not match the facts of the source text: on the one hand, how to introduce additional structured knowledge so that the encoder can consider not only the contextual representation but also additional knowledge information; on the other hand, how to extend the topic information so that we can obtain multiple topics and subtopics of the article to enhance sentence information and consolidate document-level knowledge. Finally, we can consider how to process topic information and additional structured knowledge in parallel on the basis of the method in this article, so as to make a qualitative leap in the task of text summarization while keeping the generated summary consistent with the original facts.

REFERENCES

[1] M. Allahyari et al., "Text summarization techniques: A brief survey," 2017, arXiv:1707.02268. [Online]. Available: http://arxiv.org/abs/1707.02268
[2] T. Ma, Q. Liu, J. Cao, Y. Tian, A. Al-Dhelaan, and M. Al-Rodhaan, "LGIEM: Global and local node influence based community detection," Future Gener. Comput. Syst., vol. 105, pp. 533–546, Apr. 2020, doi: 10.1016/j.future.2019.12.022.
[3] A. Khan and N. Salim, "A review on abstractive summarization methods," J. Theor. Appl. Inf. Technol., vol. 59, no. 1, pp. 64–72, 2014.
[4] M. Gambhir and V. Gupta, "Recent automatic text summarization techniques: A survey," Artif. Intell. Rev., vol. 47, no. 1, pp. 1–66, Jan. 2017, doi: 10.1007/s10462-016-9475-9.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1. Minneapolis, MN, USA: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186, doi: 10.18653/v1/n19-1423.
[6] S. Narayan, S. B. Cohen, and M. Lapata, "Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization," 2018, arXiv:1808.08745. [Online]. Available: http://arxiv.org/abs/1808.08745
[7] Y. Miao, E. Grefenstette, and P. Blunsom, "Discovering discrete latent topics with neural variational inference," 2017, arXiv:1706.00359. [Online]. Available: http://arxiv.org/abs/1706.00359
[8] K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proc. 2014 Conf. Empirical Methods Natural Lang. Process. (EMNLP), Doha, Qatar, Oct. 2014, pp. 1724–1734, doi: 10.3115/v1/d14-1179.
[9] H. Su et al., "Improving multi-turn dialogue modelling with utterance ReWriter," 2019, arXiv:1906.07004. [Online]. Available: http://arxiv.org/abs/1906.07004
[10] M. E. Peters et al., "Deep contextualized word representations," 2018, arXiv:1802.05365. [Online]. Available: http://arxiv.org/abs/1802.05365
[11] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," OpenAI, San Francisco, CA, USA, Tech. Rep., 2018.
[12] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692. [Online]. Available: http://arxiv.org/abs/1907.11692
[13] T. Ma, Y. Zhao, H. Zhou, Y. Tian, A. Al-Dhelaan, and M. Al-Rodhaan, "Natural disaster topic extraction in sina microblogging based on graph analysis," Expert Syst. Appl., vol. 115, pp. 346–355, Jan. 2019, doi: 10.1016/j.eswa.2018.08.010.
[14] N. Akhtar, H. Javed, and T. Ahmad, "Hierarchical summarization of text documents using topic modeling and formal concept analysis," in Data Management, Analytics and Innovation. Singapore: Springer, 2019, pp. 21–33.
[15] R. K. Roul, S. Mehrotra, Y. Pungaliya, and J. K. Sahoo, "A new automatic multi-document text summarization using topic modeling," in Proc. Int. Conf. Distrib. Comput. Internet Technol. (ICDCIT), vol. 11319. Cham, Switzerland: Springer, Jan. 2019, pp. 212–221, doi: 10.1007/978-3-030-05366-6_17.
[16] C. Lin and E. H. Hovy, "The automated acquisition of topic signatures for text summarization," in Proc. 18th Int. Conf. Comput. Linguistics (COLING). San Mateo, CA, USA: Morgan Kaufmann, Jul./Aug. 2000, pp. 495–501. [Online]. Available: https://www.aclweb.org/anthology/C00-1072/
[17] H. Pan, H. Liu, and Y. Tang, "A sequence-to-sequence text summarization model with topic based attention mechanism," in Proc. Int. Conf. Web Inf. Syst. Appl., vol. 11817. Cham, Switzerland: Springer, Sep. 2019, pp. 285–297, doi: 10.1007/978-3-030-30952-7_29.
[18] Z. Yang, Y. Yao, and S. Tu, "Exploiting sparse topics mining for temporal event summarization," in Proc. IEEE 5th Int. Conf. Image, Vis. Comput. (ICIVC), Jul. 2020, pp. 322–331.
[19] A. T. Sadiq, Y. H. Ali, and M. S. M. N. Fadhil, "Text summarization for social network conversation," in Proc. Int. Conf. Adv. Comput. Sci. Appl. Technol., Dec. 2013, pp. 13–18.
[20] Q. Zhou, N. Yang, F. Wei, S. Huang, M. Zhou, and T. Zhao, "Neural document summarization by jointly learning to score and select sentences," 2018, arXiv:1807.02305. [Online]. Available: http://arxiv.org/abs/1807.02305
[21] S. Ghodratnama, A. Beheshti, M. Zakershahrak, and F. Sobhanmanesh, "Extractive document summarization based on dynamic feature space mapping," IEEE Access, vol. 8, pp. 139084–139095, 2020, doi: 10.1109/ACCESS.2020.3012539.
[22] L. Wang, J. Yao, Y. Tao, L. Zhong, W. Liu, and Q. Du, "A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization," 2018, arXiv:1805.03616. [Online]. Available: http://arxiv.org/abs/1805.03616
[23] A. Vaswani et al., "Attention is all you need," 2017, arXiv:1706.03762. [Online]. Available: http://arxiv.org/abs/1706.03762
[24] T. Ma, H. Wang, L. Zhang, Y. Tian, and N. Al-Nabhan, "Graph classification based on structural features of significant nodes and spatial convolutional neural networks," Neurocomputing, vol. 423, pp. 639–650, Jan. 2021, doi: 10.1016/j.neucom.2020.10.060.
[25] T. Cai, M. Shen, H. Peng, L. Jiang, and Q. Dai, "Improving transformer with sequential context representations for abstractive text summarization," in Proc. CCF Int. Conf. Natural Lang. Process. Chin. Comput. (NLPCC), vol. 11838. Cham, Switzerland: Springer, Oct. 2019, pp. 512–524, doi: 10.1007/978-3-030-32233-5_40.
[26] M.-H. Su, C.-H. Wu, and H.-T. Cheng, "A two-stage transformer-based approach for variable-length abstractive summarization," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 2061–2072, 2020, doi: 10.1109/TASLP.2020.3006731.
[27] X. Zhang, F. Wei, and M. Zhou, "HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization," 2019, arXiv:1905.06566. [Online]. Available: http://arxiv.org/abs/1905.06566
[28] A. Hoang, A. Bosselut, A. Celikyilmaz, and Y. Choi, "Efficient adaptation of pretrained transformers for abstractive summarization," 2019, arXiv:1906.00138. [Online]. Available: http://arxiv.org/abs/1906.00138
[29] H. Zhang, J. Xu, and J. Wang, "Pretraining-based natural language generation for text summarization," 2019, arXiv:1902.09243. [Online]. Available: http://arxiv.org/abs/1902.09243
[30] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process. (EMNLP-IJCNLP). Stroudsburg, PA, USA: Association for Computational Linguistics, Nov. 2019, pp. 3728–3738, doi: 10.18653/v1/D19-1387.
[31] Q. Wang, P. Liu, Z. Zhu, H. Yin, Q. Zhang, and L. Zhang, "A text abstraction summary model based on BERT word embedding and reinforcement learning," Appl. Sci., vol. 9, no. 21, p. 4701, Nov. 2019.
[32] A. Srikanth, A. S. Umasankar, S. Thanu, and S. J. Nirmala, "Extractive text summarization using dynamic clustering and co-reference on BERT," in Proc. 5th Int. Conf. Comput., Commun. Secur. (ICCCS), Patna, India, Oct. 2020, pp. 1–5, doi: 10.1109/ICCCS49678.2020.9277220.
[33] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450. [Online]. Available: http://arxiv.org/abs/1607.06450
[34] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, "Variational inference: A review for statisticians," 2016, arXiv:1601.00670. [Online]. Available: http://arxiv.org/abs/1601.00670
[35] M. X. Chen et al., "The best of both worlds: Combining recent advances in neural machine translation," 2018, arXiv:1804.09849. [Online]. Available: http://arxiv.org/abs/1804.09849
[36] K. M. Hermann et al., "Teaching machines to read and comprehend," 2015, arXiv:1506.03340. [Online]. Available: http://arxiv.org/abs/1506.03340
[37] A. Paszke et al., "Automatic differentiation in PyTorch," in Proc. NIPS Autodiff Workshop, Future Gradient-Based Mach. Learn. Softw. Techn., Long Beach, CA, USA, Dec. 2017.
[38] G. Klein, Y. Kim, Y. Deng, V. Nguyen, J. Senellart, and A. M. Rush, "OpenNMT: Neural machine translation toolkit," 2018, arXiv:1805.11462. [Online]. Available: http://arxiv.org/abs/1805.11462
[39] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, May 2015.
[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014. [Online]. Available: http://dl.acm.org/citation.cfm?id=2670313
[41] R. Nallapati, F. Zhai, and B. Zhou, "SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents," in Proc. 31st AAAI Conf. Artif. Intell., San Francisco, CA, USA: AAAI Press, Feb. 2017, pp. 3075–3081. [Online]. Available: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14636
[42] S. Narayan, S. B. Cohen, and M. Lapata, "Ranking sentences for extractive summarization with reinforcement learning," 2018, arXiv:1802.08636. [Online]. Available: http://arxiv.org/abs/1802.08636
[43] K. Al-Sabahi, Z. Zuping, and M. Nadher, "A hierarchical structured self-attentive model for extractive document summarization (HSSAS)," IEEE Access, vol. 6, pp. 24205–24212, 2018, doi: 10.1109/ACCESS.2018.2829199.
[44] Y. Liu, "Fine-tune BERT for extractive summarization," 2019, arXiv:1903.10318. [Online]. Available: http://arxiv.org/abs/1903.10318
[45] A. See, P. J. Liu, and C. D. Manning, "Get to the point: Summarization with pointer-generator networks," in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics (ACL), vol. 1. Stroudsburg, PA, USA: Association for Computational Linguistics, Jul./Aug. 2017, pp. 1073–1083, doi: 10.18653/v1/P17-1099.
[46] S. Gehrmann, Y. Deng, and A. M. Rush, "Bottom-up abstractive summarization," 2018, arXiv:1808.10792. [Online]. Available: http://arxiv.org/abs/1808.10792
[47] A. Çelikyilmaz, A. Bosselut, X. He, and Y. Choi, "Deep communicating agents for abstractive summarization," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol. (NAACL-HLT), vol. 1. Stroudsburg, PA, USA: Association for Computational Linguistics, Jun. 2018, pp. 1662–1675, doi: 10.18653/v1/n18-1150.
[48] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, 2004, pp. 74–81.

Tinghuai Ma (Member, IEEE) received the bachelor's and master's degrees from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 1997 and 2000, respectively, and the Ph.D. degree from the Chinese Academy of Sciences, Beijing, China, in 2003.
He was a Post-Doctoral Associate with AJOU University, Suwon, South Korea, in 2004. From November 2007 to July 2008, he visited the China Meteorological Administration, Beijing. From February 2009 to August 2009, he was a Visiting Professor with the Ubiquitous Computing Laboratory, Kyung Hee University, Seoul, South Korea. He is currently a Professor of computer sciences with the Nanjing University of Information Science and Technology, Nanjing, China. He has published more than 100 journal articles and conference papers. His research interests include data mining, cloud computing, ubiquitous computing, and privacy preservation.

Qian Pan received the bachelor's degree in software engineering from the Nanjing University of Information Science and Technology, Nanjing, China, in 2021.
She is currently a Computer Professional Researcher with the Nanjing University of Information Science and Technology. Her research interest lies in data mining, especially the text summarization task.
Huan Rong received the Ph.D. degree in computer science from the Nanjing University of Information Science and Technology, Nanjing, China, in 2020.
He is currently a Visiting Scholar with the University of Central Arkansas, Conway, AR, USA. He is also an Assistant Professor with the School of Artificial Intelligence, Nanjing University of Information Science and Technology. His research interests lie in deep learning and the application of artificial intelligence, especially in sentiment analysis and other interdisciplinary tasks. His research contributions have been published in Information Sciences, IEEE Transactions on Affective Computing, Soft Computing, and other venues.

Yuan Tian received the master's and Ph.D. degrees from Kyung Hee University, Seoul, South Korea, in 2009 and 2012, respectively.
She is currently an Assistant Professor with the College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. She is also an Associate Professor with the School of Computer, Nanjing Institute of Technology, Nanjing, China. Her research interests are broadly divided into privacy and security, which are related to the cloud.
Dr. Tian is also a member of the technical committees of several international conferences and an active reviewer for many international journals.