Bangla Text Summarization Using Natural Language Processing


Bangla News Summarization using Natural Language Processing


Al-Mahmud, MD. Abdul Alim and MD. Rezaul Karim
Department of Computer Science and Engineering
Khulna University of Engineering & Technology
Khulna-9203, Bangladesh.

Abstract - Automatic news summarization is the technique of compressing an original news document into a shorter form that conveys the same meaning and information as the original text. The brief summary produced by a summarization system allows readers to understand the content of the original documents quickly and easily without having to read each individual document. The overall aim of text summarization is to convey the meaning of a text using fewer words and sentences. The domain of our thesis lies within Natural Language Processing (NLP), which includes analysis, classification and summarization of raw text obtained from Bangla news or articles. In our proposed methodology, we first pre-process the data, including tokenization and stemming of every word of the document. Stemming is the key contribution of our thesis: instead of conventional rule-based stemming, we create a stemmer database that provides the root word of any word. We then calculate a score for every sentence of the document based on features such as tf-idf value, sentence frequency, numerical figure count, title words and positional value, and rank the sentences according to score. Finally, we choose the highest-ranked one third of the sentences for the summary. We focus mainly on pre-processing of the input data, which makes the methodology more accurate and efficient.

I. INTRODUCTION

Nowadays, information overload on the World Wide Web (WWW) is becoming a problem for an increasingly large number of web users. To reduce this information overload, automatic text summarization can be an indispensable tool. The abstracts or summaries can be used as document surrogates in place of the original documents. Put another way, summaries help the reader get a quick overview of an entire news document. Another important issue related to the information explosion on the internet is that many documents with the same or similar topics are duplicated. This data duplication increases the need for effective document summarization. The domain of our thesis lies within Natural Language Processing (NLP), which includes analysis, classification and summarization of raw news obtained from news documents.

News summarization is the process of making a summary or abstract of a given news item. Automatic news summarization is the technique whereby a computer summarizes a news item. A summary is a text that is produced out of one or more (possibly multimedia) texts, that contains (some of) the same information of the original text(s), and that is no longer than half of the original text(s), as described by Lin and Hovy [1-3]. English document summarization systems already exist and perform with satisfactory accuracy, but there is no complete system for Bangla document summarization.

News summarization has recently become a research topic for many researchers. A large number of articles proposing different methods have been published in scientific journals, but only a few papers have been published for the Bangla language. The existing methods for summarizing Bangla news are Pronoun Replacement and Improved Sentence Ranking [1], Sentence Scoring and Ranking [2], Sentence Frequency and Clustering [3], and Sentence Extraction [4].

II. RELATED WORK

Much work has been done on English document summarization, but only a few works exist so far for Bangla document summarization, although Bangla is one of the most significant languages in the world. We study some works in this field to find out how the accuracy of text summarization has been improved.

The earliest work on English document summarization, by Luhn [20], proposed the frequency of a particular word in a document as a useful measure of significance. Although it was a preliminary step in text summarization, many of his ideas are still found to be effective. First, all stop words were removed and the remaining words were stemmed to their root forms.

Haque, Pervin and Begum were the first to propose that pronoun replacement is necessary to minimize dangling pronouns in the summary [1]. After pronouns are replaced, sentences are ranked using term frequency, sentence frequency, numerical figures and title words. If two sentences have at least 60% cosine similarity, the frequency of the larger sentence is increased and the smaller sentence is removed to eliminate redundancy. Moreover, the first sentence is always included in the summary if it contains any title word. In Bangla text, numerical figures can be presented both in words and in digits, in a variety of forms.

An extraction-based summarization technique for Bangla text documents was developed in [2]. The system summarizes a single document at a time. Before the summary is created, the document is pre-processed by tokenization, stop-word removal and stemming. In the summarization process, countable features such as word frequency and sentence positional value are used to make the summary more precise and concrete. Attributes such as cue words and the skeleton of the document are also included, which helps make the summary more relevant to the content of the document. The technique was compared with summaries generated by human professionals; the evaluation shows that 83.57% of the summary sentences selected by the system agreed with those chosen by humans.

Tawhidul and Mostafa proposed Bhasa, a corpus-based search engine and summarizer [21]. It uses a vector space retrieval method on keywords to perform document indexing and information retrieval. Bhasa prioritizes the corpus files based on term frequency. The system uses a tokenizer capable of detecting different words, tags, abbreviations, etc., and then performs document ranking to summarize the tokenized document.

Haque, Pervin and Begum were also the first to present a frequency- and clustering-based sentence feature in which redundancy elimination is a consequence [3]. Another notable aspect is sentence clustering on the basis of the similarity ratio among sentences. Summary sentences are selected from all the clusters so that the summary covers as much information as possible even if the information is scattered across the input document. Two sets of human-generated summaries were used, one to train the system and the other for performance evaluation. The method was found to perform better than the then state-of-the-art method for summarizing Bengali news documents. The performance evaluation shows average Precision, Recall and F-measure values of 0.608, 0.664 and 0.632 respectively.

[Fig. 1 flow chart; recoverable step labels: Tokenization, Stop Words, Stemming]

Fig. 1. The flow chart of the Summarization Process

III. PROPOSED METHOD

First, we pre-process the document and then calculate the score, or rank, of each sentence. Finally, the document summary is determined from the ranking. The proposed Bangla text summarization approach is described in the following steps.

A. Pre-processing
In the Bangla document summarization process, some pre-processing is needed before the sentence scoring algorithm is executed. Pre-processing prepares the documents for ranking and summary generation. The pre-processing steps are as follows:
Tokenization: A news document is a combination of sentences, and each sentence consists of words. Here every word is considered a token, and a document is treated as a chain of tokens. Separating the web-crawled data into tokens is essential for ease of processing, analysis and classification. Every news document is acquired as a passage. It is split into individual sentences, which are stored in records, and each sentence is then divided into single words.
Two types of tokenization are done:
i) Sentence Tokenization
ii) Word Tokenization
Stop Words Removal: In Bangla, words like এবং (and), অথবা (or), কিন্তু (but), etc. are used frequently in sentences but have little significance for the meaning of a document. These words can simply be removed before classification. In computing, stop words are words that are filtered out before or after the processing of natural language data. There is no universal list of stop words in NLP research. In Bangla, the prepositions, conjunctions and interjections (অব্যয় - Obboy) are tagged as stop words. A list of 363 Bangla stop words has been collected [5].
Special character removal: In this step we remove special characters using regular expressions. Examples of such characters are ?, /, \, !, .9
Stemming: A word can occur in different forms in the same document. These forms have to be converted to their original form for simplicity. A stemming algorithm transforms words to their canonical forms; for example, গ্রামের, গ্রামকে, গ্রামও etc. should be converted to their original form গ্রাম. In this work, we use a stemmer database (Fig. 2) that returns the root word of each word used in a sentence. The stemmer database contains more than one hundred and fifty thousand words and provides more accurate stems than conventional rule-based stemming. In this database each word of the document is mapped to its root word in XML format; the XML format of the stemming mapping is shown in Fig. 2. The problem with rule-based stemming is that many words are stemmed to something other than their root word (Table I). For example, if we strip the suffix ‘য়ে’ from the word ‘মেয়ে’, it is converted into ‘মে’, which is unwanted.
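The pre-processing steps described in this section can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the XML layout (`<entry word="..." root="..."/>`), the three-word stop list and all function names are hypothetical, since the paper's actual stemmer schema and the full 363-word stop list [5] are not reproduced in the text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XML layout for the stemmer database (assumption: the real
# schema is not given in the text). Each entry maps a word form to its root.
STEMMER_XML = """<stemmer>
  <entry word="গ্রামের" root="গ্রাম"/>
  <entry word="গ্রামকে" root="গ্রাম"/>
  <entry word="গ্রামও" root="গ্রাম"/>
</stemmer>"""

# Tiny illustrative stop list; the paper uses a list of 363 words [5].
STOP_WORDS = {"এবং", "অথবা", "কিন্তু"}


def load_stemmer(xml_text):
    """Parse the XML mapping into a {word form: root word} dictionary."""
    root = ET.fromstring(xml_text)
    return {e.get("word"): e.get("root") for e in root.iter("entry")}


def sentence_tokenize(text):
    """Split a passage on the Bangla danda (U+0964), '?' and '!'."""
    parts = re.split(r"[\u0964?!]", text)
    return [p.strip() for p in parts if p.strip()]


def word_tokenize(sentence):
    """Replace special characters with spaces, then split on whitespace."""
    cleaned = re.sub(r"[?/\\!.,;:()'\-\u0964]", " ", sentence)
    return cleaned.split()


def preprocess(text, stemmer, stop_words=STOP_WORDS):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    result = []
    for sentence in sentence_tokenize(text):
        tokens = [t for t in word_tokenize(sentence) if t not in stop_words]
        # Dictionary lookup instead of suffix rules; unknown words are
        # kept unchanged (an assumption of this sketch).
        result.append([stemmer.get(t, t) for t in tokens])
    return result
```

Because stemming is a plain dictionary lookup, a word that is missing from the database is left as-is rather than being mangled by a suffix rule, which is the failure mode of rule-based stemming that Table I illustrates.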
Fig. 2. XML code format of stemming mapping

TABLE I. EXAMPLE OF RULE BASED STEMMING PROBLEMS

Rule (Suffix) | Example | Wrongly Stemmed | Actual Stemming
য় ন -> া | িমরন -> িরা | যামবন -> যাবা | যামবন -> যাওয়া
 | | কিমেন -> কিো | কিমেন -> য়িওয়া
 | | কিমবন -> কিবা | কিমবন -> য়িওয়া
য় র -> ε | িামের -> িাে | য়েমের -> য়েে | য়েমের -> য়েমে
 | | স্ত্রীময়র -> স্ত্রীয় | স্ত্রীময়র -> স্ত্রী
 | | যামির -> যাি | যামির -> যামির
য়ে -> ε | বাড়ীমে -> বাড়ী | হামে -> হা | হামে -> হাে
 | | ভামে -> ভা | ভামে -> ভাে

B. Sentence Ranking
For sentence scoring, the values of several attributes are first calculated for every sentence and then summed to compute the score of each sentence. Six attributes are considered in this method: 1) tf-idf calculation, 2) sentence frequency calculation, 3) existence of numerical data, 4) title word score calculation, 5) positional score calculation, 6) treating the first sentence specially.

1) Term frequency inverse document frequency score calculation (STF-IDF):
The TF-IDF score is calculated with the following equations:

\[ \mathrm{TF\text{-}IDF}(t) = TF \times \log\left(\frac{N}{DF}\right) \tag{1} \]

\[ S_{TF\text{-}IDF}(k) = \sum_{i=1}^{T} \mathrm{TF\text{-}IDF}(t_i) \tag{2} \]

where N is the number of documents in the corpus and DF is the number of documents in which the term t appears. S_{TF-IDF}(k) is the TF-IDF score of the kth sentence, i.e. the sum of the TF-IDF scores of all the terms t_1, ..., t_T of sentence k [1].

2) Sentence frequency calculation (SSF) and redundancy elimination:
The proposed procedure introduces sentence frequency (SSF), based on cosine similarity, as the second attribute. The sentence frequency of each sentence is initially set to 1 (one). If any sentence has a cosine similarity of 60% or more with another sentence, the smaller sentence is removed and the frequency of the larger sentence is set to the sum of the frequencies of both sentences. Because sentences are removed on the basis of 60% or greater similarity, redundancy is eliminated.
The cosine similarity between two sentences Si = [wi1, wi2, ..., wim] and Sj = [wj1, wj2, ..., wjm] is measured as [6]:

\[ Sim(S_i, S_j) = \frac{\sum_{k=1}^{m} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{m} w_{ik}^2 \cdot \sum_{k=1}^{m} w_{jk}^2}} \tag{3} \]

where i, j = 1, 2, 3, ..., n, w denotes the words in the sentences and n is the total number of sentences [1].

3) Counting the existence of numerical figures in digits and words (SNC):
The third attribute counts the numerical figures in every sentence (SNC). The value of SNC for each sentence is initially set to 0 (zero) and is incremented by 1 for each numerical figure found. In [7-9], numerical figures (in digits) were counted per sentence. All sentences are segmented into words [w1s1, w2s1, ..., w1s2, w2s2, ..., wnsn] in the pre-processing step, and numerical figures in digits and in words are counted using the following equations:

\[ \forall i \in \{1, \dots, n\}: \; N_{digit}(i) = \mathrm{Regexp}(S(i), [0,1,2,3,4,5,6,7,8,9]) \tag{4} \]

\[ \forall i \in \{1, \dots, n\}: \; N_{words}(i) = \mathrm{Regexp}(S(i), [\mathrm{FormofNumInWords}]) \tag{5} \]

\[ \forall i \in \{1, \dots, n\}: \; S_{NC}(i) = N_{digit}(i) + N_{words}(i) \tag{6} \]

4) Considering title words for sentence scoring:
Existing methods [11, 12] consider title words for sentence scoring, because analysis of many news documents shows that title words convey the theme of the document. The title-word score of every sentence is initially set to 0 (zero) and incremented by 1 for each title word the sentence contains.

5) Positional value (SPV):
The position of a sentence in a document has a considerable influence over the content of the document. The positional value of a sentence is computed by assigning the highest value to the first sentence and the lowest value to the last sentence of the document. The positional value PV is calculated using the formula:

\[ PV_k = \frac{1}{\sqrt{k}} \tag{7} \]

where k is the actual position of the sentence in the document.

6) Treating the first sentence specially:
In some existing methods [13-15], the sentence score depends on position: the positional score is highest for the first sentence and lowest for the last, decreasing gradually in between. However, in most cases, especially for Bangla news documents, the first sentence is much more important than any other sentence.
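The sentence-level features of Eqs. (1)-(7) can be sketched in Python as below. This is a minimal sketch, not the authors' code. It assumes that TF is the term count within the document, that the logarithm is base 10, that sentence vectors for cosine similarity are raw term counts, and that the title score counts distinct title words. The function names and the three-word `NUMBER_WORDS` set are hypothetical.

```python
import math
import re
from collections import Counter

# Matches ASCII digits and Bangla digits (U+09E6 to U+09EF).
DIGITS = re.compile(r"[0-9\u09E6-\u09EF]+")
# Illustrative number words only; a real list would be much longer.
NUMBER_WORDS = {"এক", "দুই", "তিন"}


def tf_idf_score(tokens, term_freq, doc_freq, n_docs):
    """Eqs. (1)-(2): sum TF * log(N / DF) over the distinct terms of a sentence.

    term_freq is a Counter of term counts in the document; terms missing
    from doc_freq are treated as DF = 1 (assumption).
    """
    return sum(
        term_freq[t] * math.log10(n_docs / doc_freq.get(t, 1))
        for t in set(tokens)
    )


def cosine_similarity(a, b):
    """Eq. (3) on term-count vectors of two tokenized sentences."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values()) *
                     sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def sentence_frequency(sentences, threshold=0.6):
    """Merge near-duplicates: the longer sentence absorbs the shorter one's frequency."""
    freq = [1] * len(sentences)
    removed = set()
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if i in removed:
                break
            if j in removed:
                continue
            if cosine_similarity(sentences[i], sentences[j]) >= threshold:
                if len(sentences[i]) >= len(sentences[j]):
                    keep, drop = i, j
                else:
                    keep, drop = j, i
                freq[keep] += freq[drop]
                removed.add(drop)
    return freq, removed


def numeric_count(tokens):
    """Eqs. (4)-(6): tokens containing digits plus tokens that are number words."""
    n_digit = sum(1 for t in tokens if DIGITS.search(t))
    n_words = sum(1 for t in tokens if t in NUMBER_WORDS)
    return n_digit + n_words


def title_score(tokens, title_tokens):
    """Number of distinct title words that occur in the sentence."""
    return len(set(tokens) & set(title_tokens))


def positional_value(k):
    """Eq. (7): k is the 1-based position of the sentence."""
    return 1 / math.sqrt(k)
```

For example, with sentences `[["a", "b", "c", "d"], ["a", "b", "c"], ["x", "y"]]` the first two have a cosine similarity of about 0.87, so the shorter second sentence is dropped and the first sentence's frequency becomes 2.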
C. Summary Generation
After all the attributes are measured, the score of each sentence is computed using the following equation, where Sk is the score of the kth sentence:

\[ S_k = \begin{cases} w_1 S_{TF\text{-}IDF}(k) + w_2 S_{SF}(k) + w_3 S_{NC}(k) + w_4 S_{T}(k) + w_5 S_{PV}(k), & k > 1 \\ \max(S_k) + 1, & \text{if } k = 1 \text{ and } S(1) \text{ contains any title word} \end{cases} \tag{8} \]

where 0 <= w1, w2, w3, w4, w5 <= 1; k = n, n-1, n-2, ..., 1; n is the number of sentences; and S_T(k) and S_PV(k) are the title-word score and the positional value of sentence k. The score of the first sentence is set to the highest score + 1 if it contains any title word, so that it is always selected.
The values of the coefficients w1, w2, w3, w4 and w5 in the above equation are obtained by tuning them for better summary generation results.

Fig. 3. F-measure for various values of w.

After sentence ranking, the top-ranked one third of the sentences are extracted as summary sentences, as in the following equation:

\[ \forall i \in \{1, \dots, n/3\}: \; SumSen = SumSen \cup ExtTopScored(S) \tag{9} \]

where n is the number of sentences, the ExtTopScored function extracts the top-scored sentence from the sentence set S, and SumSen is the set of summary sentences. The number of summary sentences is kept at approximately one third of the total number of sentences, following the ratio of source document to summary in [18].

D. Improved Pronoun Replacement
If pronouns are replaced before summary generation [1], nouns are likely to become redundant in the summary. Here we replace pronouns after summary generation so that nouns are not redundant.

First, we check the words of the sentences in the summary. When we find a subjective pronoun (তিনি, ইনি, উনি and সে) in a summary sentence, we look back two sentences in the main document. If a single noun is found and the sentence containing that noun is not selected for the summary, the pronoun is replaced by the noun; otherwise we leave it as it is. Similarly, when we find an objective pronoun (তার, তাকে, তাহাকে and তাহার) in a summary sentence, we look back two sentences in the main document. If a single noun is found with one of the suffixes য়ি, য়র, এর, য়রর, র and the sentence containing that noun is not selected for the summary, the pronoun is replaced by the noun together with its suffix; otherwise we leave it as it is.

IV. EXPERIMENTAL RESULTS AND DISCUSSION

In our experiment, we evaluate the proposed method with three evaluation techniques that have been used for evaluating automatically generated summaries. For evaluation we use the same dataset [10] that was used by [1]. In our experiments the proposed method outperforms the methods of [1, 2, 4, 13, 19].
Dataset: Since there is no benchmark dataset of Bangla news documents, we evaluate the proposed method on the dataset provided by [1], [10]. The dataset has two sets of text documents. Each set has 100 news documents (each document has 10-20 lines of Unicode text) and 100 human-generated summaries. These news documents cover a broad range of topics such as politics, sports, crime, economy and environment. The first set of 100 documents and summaries is used for tuning the values of w1, w2, w3, w4 and w5 in the previous section, and the other set of 100 documents with the corresponding model summaries is used as the performance evaluation set.
Evaluation: In this research, the summaries produced by the proposed system have been compared with the model summaries of 200 news documents. Precision, Recall and F-measure are used here, as they have long been important evaluation metrics in the information retrieval field [18].
The evaluation process is as follows. If A denotes the set of sentences retrieved by the summarizer and B the set of sentences that are relevant with respect to the target set, then Precision, Recall and F-measure are computed as:

\[ Precision\ (P) = \frac{|A \cap B|}{|A|} \tag{10} \]

\[ Recall\ (R) = \frac{|A \cap B|}{|B|} \tag{11} \]

\[ F\text{-}measure = \frac{2 \times P \times R}{P + R} \tag{12} \]

Experimental Result and Analysis: To judge the effectiveness of the proposed method, experiments were conducted on 200 news documents. Each time, the system-generated summary is compared with three model summaries of the document, and the average values of Precision, Recall and F-measure are computed with the ROUGE automatic evaluation package (Table II). We also used another method, based on cosine similarity checking, to evaluate the proposed method (Table II). The result of this evaluation is depicted in Fig. 4.
TABLE II. AVERAGE OF ROUGE-1, ROUGE-2 AND COSINE SIMILARITY BASED SCORES OF THE PROPOSED SYSTEM FOR 200 DOCUMENTS WITH 95% CONFIDENCE INTERVAL

Score | Avg Precision | Avg Recall | Avg F-measure
Avg of ROUGE-1 score | 0.6787 | 0.7447 | 0.7044
Avg of ROUGE-2 score | 0.6082 | 0.6869 | 0.6395
Avg of CosSim score | 0.5753 | 0.6404 | 0.6019

Fig. 6. Comparison based on ROUGE-2 scores of 200 documents
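The score combination of Eq. (8), the one-third selection of Eq. (9) and the sentence-level Precision, Recall and F-measure of Eqs. (10)-(12) can be sketched as follows. This is a hedged illustration rather than the evaluated implementation: the function names are hypothetical, the feature rows are assumed to be ordered as (tf-idf, sentence frequency, numeric count, title score, positional value), a first sentence without a title word is assumed to keep its weighted sum, and the ROUGE package used for Table II is not reproduced here.

```python
def combine_scores(features, weights, first_has_title_word):
    """Eq. (8): weighted sum per sentence; optionally boost the first sentence.

    features: one row of five feature values per sentence, in document order.
    weights:  (w1, ..., w5), each between 0 and 1.
    """
    scores = [sum(w * f for w, f in zip(weights, row)) for row in features]
    if first_has_title_word and scores:
        # The first sentence gets the highest other score + 1 so it is always picked.
        scores[0] = max(scores[1:], default=0.0) + 1
    return scores


def select_summary(scores):
    """Eq. (9): indices of roughly the top third of sentences, in document order."""
    n_pick = max(1, round(len(scores) / 3))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_pick])


def precision_recall_f(system, reference):
    """Eqs. (10)-(12) on sets of selected sentence indices."""
    system, reference = set(system), set(reference)
    overlap = len(system & reference)
    p = overlap / len(system) if system else 0.0
    r = overlap / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Returning the selected indices in document order keeps the summary readable, since extracted sentences then appear in the same sequence as in the source news item.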
We implemented the proposed procedure in Python. In our implementation, one third of the sentences are selected as the final summary. We evaluate our implementation on the same dataset [10] used in [1]. For comparing F-measure values, we use the measurements reported there for method 1 presented in [2], method 2 presented in [4], method 3 presented in [19], method 4 presented in [13] and method 5 presented in [1]. The comparison results based on ROUGE-1 and ROUGE-2 are depicted in Fig. 5 and Fig. 6.

Fig. 4. Avg_precision, Avg_recall & Avg_F_measure values of proposed method calculated from Cosine Similarity based approach

Fig. 5. Comparison based on ROUGE-1 scores of 200 documents

Discussion on Results: The focus of our proposed method is appropriate stemming. If a word is not stemmed correctly, it directly produces wrong results in several steps, such as tf-idf score calculation (Fig. 7), cosine similarity measurement for the sentence frequency count, and title word score calculation.

Fig. 7. Comparison of F-measure value considering only tf-idf scoring of method [1] and our proposed method

The marked improvement in F-measure shown in Fig. 7 indicates that stemming of tokens is very important for summarization, because words in the Bangla language are highly inflected. For example, the Bangla word “যাওয়া” can take various forms such as যাব, যামবন, যামেন, যাচ্ছেমেন, কিময়কিমেন, যাইমেমিন, যাইমেকিমেন etc. Such a wide range of inflections makes it very difficult (in some cases impossible) to reduce a word to its root with rules (stemming). Our method addresses this problem.

V. CONCLUSION

After studying the existing systems, we conclude that our proposed technique provides a more realistic and efficient Bangla news summarization. Stemming-based sentence ranking is more efficient than other methods because many of the score calculation techniques depend on appropriate stemming.
In our proposed methodology, we first pre-process the data, including tokenization and stemming of every word of the document. We found that our proposed method improved average precision, recall and F-score compared to the existing work. We mainly focus on pre-processing, more
specifically the stemming of words: we use a stemming database that directly returns the stem of a word instead of applying conventional stemming rules. As expected, this increases the accuracy of the tf-idf values and the cosine similarity.

This thesis work applies to Bangla news documents only. In future, we hope to make it work for any kind of Bangla document and to introduce more features for sentence ranking, so that the system-generated summary comes closer to a human-generated summary. The number of entries in the stemming database can also be increased. In this work we have addressed extraction-based summarization only; in future, summarization could be done by paraphrasing.

REFERENCES

[1] M. M. Haque, S. Pervin, and Z. Begum, “An innovative approach of Bangla text summarization by introducing pronoun replacement and improved sentence ranking,” Journal of Information Processing Systems, vol. 13, no. 4, pp. 752-777, Aug. 2017.
[2] M. I. A. Efat, M. Ibrahim, and H. Kayesh, “Automated Bangla text summarization by sentence scoring and ranking,” in 2013 International Conference on Informatics, Electronics and Vision (ICIEV), Aug. 2013.
[3] M. M. Haque, S. Pervin, and Z. Begum, “Automatic Bengali news documents summarization by introducing sentence frequency and clustering,” in 18th International Conference on Computer and Information Technology (ICCIT), 21-23 Dec. 2015.
[4] K. Sarkar, “Bengali text summarization by sentence extraction,” in International Conference on Business and Information Management (ICBIM 2012), NIT Durgapur, pp. 233-245, Jan. 2012.
[5] Indian Statistical Institute, “List of stop words for Bengali language,” 2016 [Online]. Available: http://www.isical.ac.in/~fire/data/stopwords/
[6] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, vol. 24, no. 5, pp. 513-523, 1988.
[7] N. Uddin and S. A. Khan, “A study on text summarization techniques and implement few of them for Bangla language,” in Proceedings of 10th International Conference on Computer and Information Technology, Dhaka, Bangladesh, 2007, pp. 1-4.
[8] A. Abuobieda, N. Salim, A. T. Albaham, A. H. Osman, and Y. J. Kumar, “Text summarization features selection method using pseudo genetic-based model,” in Proceedings of International Conference on Information Retrieval & Knowledge Management, Kuala Lumpur, Malaysia, 2012, pp. 193-197.
[9] M. A. Fattah and F. Ren, “GA, MR, FFNN, PNN and GMM based models for automatic text summarization,” Computer Speech and Language, vol. 23, no. 1, pp. 126-144, 2009.
[10] Bangla Natural Language Processing Community [Online]. Available: http://bnlpc.org/research.php.
[11] H. P. Edmundson, “New methods in automatic extracting,” Journal of the ACM, vol. 16, no. 2, pp. 264-285, 1969.
[12] M. I. Efat, M. Ibrahim, and H. Kayesh, “Automated Bangla text summarization by sentence scoring and ranking,” in Proceedings of International Conference on Informatics, Electronics & Vision (ICIEV), Dhaka, Bangladesh, 2013, pp. 1-5.
[13] K. Sarkar, “A keyphrase-based approach to text summarization for English and Bengali documents,” International Journal of Technology Diffusion (IJTD), vol. 5, no. 2, pp. 28-38, 2014.
[14] A. Abuobieda, N. Salim, A. T. Albaham, A. H. Osman, and Y. J. Kumar, “Text summarization features selection method using pseudo genetic-based model,” in Proceedings of International Conference on Information Retrieval & Knowledge Management, Kuala Lumpur, Malaysia, 2012, pp. 193-197.
[15] M. A. Fattah and F. Ren, “GA, MR, FFNN, PNN and GMM based models for automatic text summarization,” Computer Speech and Language, vol. 23, no. 1, pp. 126-144, 2009.
[16] H. P. Edmundson, “New methods in automatic extracting,” Journal of the ACM, vol. 16, no. 2, pp. 264-285, 1969.
[17] M. I. Efat, M. Ibrahim, and H. Kayesh, “Automated Bangla text summarization by sentence scoring and ranking,” in Proceedings of International Conference on Informatics, Electronics & Vision (ICIEV), Dhaka, Bangladesh, 2013, pp. 1-5.
[18] S. Hariharan, T. Ramkumar, and R. Srinivasan, “Enhanced graph based approach for multi document summarization,” The International Arab Journal of Information Technology, vol. 10, no. 4, pp. 334-341, 2013.
[19] K. Sarkar, “An approach to summarizing Bengali news documents,” in Proceedings of the International Conference on Advances in Computing, Communications and Informatics, Chennai, India, 2012, pp. 857-862.
[20] H. P. Luhn, “The automatic creation of literature abstracts,” IBM Journal of Research and Development, vol. 2, no. 2, pp. 159-165, 1958.
[21] M. T. Islam and S. M. Al Masum, “Bhasa: A corpus-based information retrieval and summariser for Bengali text,” in Proceedings of the 7th International Conference on Computer and Information Technology, 2004.
