
International Journal of Engineering and Techniques - Volume 3 Issue 6, Nov - Dec 2017

RESEARCH ARTICLE OPEN ACCESS

A Comparative Study on Text Summarization Methods


Fr. Augustine George¹, Dr. Hanumanthappa²
¹ Computer Science, Kristu Jayanti College, Bangalore
² Computer Science, Bangalore University

Abstract:
With the advent of the Internet, the amount of data being added online is increasing at an enormous rate. Businesses are waiting for models that can extract useful information from this large volume of data, which gives this research its significance. Various statistical and NLP models are in use, and each is efficient in its own way. Here we make a comparative study to bring out their parameters and efficiency, and thus deduce when and how a particular model can best be used. Text summarization is the technique which automatically creates an abstract, or summary, of a text; the technique has been under development for many years.
Summarization is a line of research in NLP which concentrates on producing a meaningful summary using various NLP tools and techniques. Since a huge amount of information is used across the digital world, automatic summarization techniques are highly essential. Extractive and abstractive summarization are the two summarization techniques available. A lot of research work is being carried out in this area, especially in extractive summarization. The techniques involved here are text summarization with statistical scoring, the linguistic method, graph-based methods, and artificial intelligence.

Keywords — Text Summarization, Natural Language Processing, Lexicon, Graph-based.

I. INTRODUCTION

Text summarization reduces a large amount of information to a brief form by selecting important information and discarding unwanted and redundant information. Automatic Text Summarization (ATS) [6] is necessary in the field of information retrieval because of the sheer amount of textual information present on the World Wide Web (WWW). The process of condensing a source text into a shorter version while preserving its information content is called summarization. Automated summarization tools help people grasp the main concepts of an information source in a short time, and statistical models provide a principled, mathematically sound framework for accomplishing this.

An automatic summary [7] can be indicative, assembled from selected parts of the original document, or informative, covering all the relevant information of the text. Text summarization methods fall into two classes: extractive and abstractive summarization. An extractive summarization method works by selecting important sentences, paragraphs, etc. from the original document and concatenating them into a shorter form; the importance of sentences is judged from statistical and linguistic features of the sentences. Abstractive summarization [8][9] works by understanding the main concepts of a document and then expressing them in natural language.

The extractive text summarization process can further be divided into two steps. The preprocessing step builds a structured representation of the original text; it typically involves sentence boundary identification, stop-word elimination, and stemming. In the processing step, features influencing the relevance of sentences are decided and calculated, weights are assigned to these features using a weight-learning method, and the final score of each sentence is determined using a feature-weight equation. Summary assessment [10][11][12] is also a very important aspect of text summarization.
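As a concrete illustration of the processing step just described, the following minimal sketch (not from the paper; the feature functions, their weights, and all names are illustrative assumptions) computes each sentence's final score with a feature-weight equation and keeps the top-scoring sentences:

```python
# Sketch of extractive scoring: final score = weighted sum of features.
# The two features and their weights are illustrative placeholders.

def final_score(features: dict[str, float], weights: dict[str, float]) -> float:
    """Feature-weight equation: score = sum over i of w_i * f_i."""
    return sum(weights[name] * value for name, value in features.items())

def summarize(sentences: list[str], weights: dict[str, float], top_n: int = 2) -> list[str]:
    scored = []
    for i, sent in enumerate(sentences):
        feats = {
            "length": min(len(sent.split()) / 20.0, 1.0),  # longer sentences, capped
            "position": 1.0 / (i + 1),                     # earlier sentences score higher
        }
        scored.append((final_score(feats, weights), i))
    # Keep the top-n sentences, restored to document order.
    top = sorted(sorted(scored, reverse=True)[:top_n], key=lambda t: t[1])
    return [sentences[i] for _, i in top]

doc = ["Automatic summarization condenses a source text.",
       "It keeps important sentences and drops redundant ones.",
       "The cafeteria serves lunch at noon."]
print(summarize(doc, weights={"length": 0.5, "position": 0.5}))
```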


The biggest challenge for abstractive summarization is the representation problem: systems' capabilities are constrained by the richness of their representations and by their ability to generate such structures, and a system cannot summarize what its representation cannot capture.

II. LITERATURE SURVEY

Automatic text summarization arose in the fifties, when it was suggested to weight the sentences of a document as a function of high-frequency words [13], disregarding the very high frequency common words. One well-known and widely used statistical model of text is latent Dirichlet allocation (LDA) [5], a latent variable mixture model in which a document is modeled as a mixture over T clusters known as topics. Informally, a topic is a semantically focused set of words. Formally, LDA represents a topic as a probability vector, or distribution, over the words in a vocabulary. Thus, a topic about "football" would give high probability to words such as "football", "quarterback" and "touchdown", and low (or zero) probability to all other, non-football-related words. Similarly, a topic about "traveling" would give high probability to traveling-related words, and low (or zero) probability to non-traveling-related words.
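As an illustration of topics as word distributions, here is a minimal sketch using the gensim library (gensim is not used or cited in the paper; the toy documents and parameter settings are assumptions):

```python
# Sketch: fit LDA on a toy corpus; each learned topic is a probability
# distribution over the vocabulary. Documents and settings are illustrative.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["football", "quarterback", "touchdown", "season"],
        ["travel", "flight", "hotel", "passport"],
        ["football", "touchdown", "coach"]]
dictionary = Dictionary(docs)                    # vocabulary
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [(w, round(float(p), 3)) for w, p in words])
```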
The following methods also play a role in determining sentence weights.

A. Supervised classification
The vast majority of existing work on sentence classification employs a supervised learning approach. Common classifiers include conditional random fields, naive Bayes classifiers, support vector machines, hidden Markov models and maximum entropy models.

The scope of the task refers to whether classification is performed on the abstract sentences only (thought to be an easier task, since fewer sentence types occur in the abstract) or on the entire text of the article. Alternatively, other past work has focused on a specific section within the article [2]. The second aspect in which past work differs is the annotation scheme, i.e. the set of labels used for classification. The most basic annotation scheme is modelled after the scientific method: aim, method, results, conclusion [1].

B. Semi-supervised and unsupervised classification
Guo et al. [3] use four semi-supervised classifiers for sentence classification: three variants of the support vector machine and a conditional random field model. The semi-supervised classifiers either (1) start with a small set of labeled data and choose, at each iteration, additional unlabeled data to be labeled and added to the training set (known as active learning), or (2) include the unlabeled data in the classifier formulation with an estimate of, or distribution over, the unknown labels. They perform sentence classification on biomedical abstracts using a version of the Argumentative Zones annotation scheme developed specifically for biology articles, and they present experiments using only 100 labeled abstracts (approximately 700 sentences) to train the different classifiers.

Wu et al. [4] use a hidden Markov model to label sentences in scientific abstracts. They first label a set of 106 abstracts (709 sentences), use the labeled data to extract pairs of words from sentences that are strong indicators of a particular label, and then use these word pairs and the labeled sentences to train a hidden Markov model. Again, we use less labeled data than Wu et al., and the annotation scheme used by Wu et al. (based on the scientific method) differs from the annotation scheme used in this paper.

C. Annotation scheme
We use an annotation scheme derived from Argumentative Zones (AZ) [5]. There are five labels in our annotation scheme: own, contrast, basis, aim and miscellaneous. The AZ annotation scheme includes one additional label, textual, which describes sentences that discuss the structure of the article, e.g. "In Section 3, we show that...". We removed the label textual because it was not of obvious use for other applications. We also collapsed two of the AZ labels, neutral and other, into one label, miscellaneous. The label neutral describes sentences that refer to past work in a neutral way; the label other describes sentences that state generally accepted background information.
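To make the classification setting concrete, here is a compact scikit-learn sketch (an assumption, not any cited system's implementation) that trains a classifier over sentences tagged with the labels above:

```python
# Compact sketch of supervised sentence classification (an assumption;
# not any cited system's implementation): tf-idf features feed a naive
# Bayes classifier over the labels of the annotation scheme above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = ["We aim to classify sentences in scientific abstracts.",
             "Unlike prior work, our method needs little labeled data.",
             "We build on the annotation scheme of Argumentative Zones.",
             "Our system uses a conditional random field model."]
labels = ["aim", "contrast", "basis", "own"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(sentences, labels)
print(clf.predict(["We aim to summarize full journal articles."]))
```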


III. TEXT SUMMARIZATION

A. Steps for text summarization:
1. Topic Identification: The most prominent information in the text is identified. Different techniques are used for topic identification, such as position, cue phrases and word frequency; methods based on the position of phrases are the most useful for topic identification.
2. Interpretation: Abstract summaries need to go through an interpretation step, in which different subjects are fused in order to form a general content.
3. Summary Generation: In this step, the system uses a text generation method.

Fig.1 Text Summarization

IV. EXTRACTIVE TEXT SUMMARIZATION

This paper concentrates on the extractive text summarization process, which is divided into a preprocessing step and a processing step. Preprocessing builds a structured representation of the original text. It usually includes: a) sentence boundary identification (in English, a sentence boundary is identified by the presence of a full stop at the end of the sentence); b) stop-word elimination, the removal of common words with no semantics; and c) stemming, whose purpose is to obtain the stem or radix of each word, which emphasizes its semantics. A sketch of this pipeline is given below; the processing step then scores sentences with methods such as the following.
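The following minimal sketch implements the three preprocessing steps with NLTK (an assumption; the paper prescribes no toolkit):

```python
# Sketch of the three preprocessing steps with NLTK (an assumption; the
# paper prescribes no toolkit). Requires the "punkt" and "stopwords" data:
#   import nltk; nltk.download("punkt"); nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Text summarization reduces documents. It keeps the important parts."
stops = set(stopwords.words("english"))
stemmer = PorterStemmer()

for sentence in sent_tokenize(text):                      # a) sentence boundaries
    tokens = [t for t in word_tokenize(sentence.lower()) if t.isalpha()]
    content = [t for t in tokens if t not in stops]       # b) stop-word elimination
    print([stemmer.stem(t) for t in content])             # c) stemming
```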
A. Pseudo Statistical scoring methods

1) Title method
This method states that sentences containing words that appear in the title are considered more important and are more likely to be included in the summary. The score of a sentence is calculated as the number of words it shares with the title. The title method cannot be effective if the document does not include any title information.

2) Location method
This method assumes that a sentence's importance follows from its position in the document structure: sentences at the beginning or end of a document, or at the start of a paragraph, tend to carry the most prominent information and are therefore more likely to be included in the summary.
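The sketch below shows both scoring rules (not from the paper; the whitespace tokenizer and the position weighting are illustrative assumptions):

```python
# Sketch of the title and location scoring rules (not from the paper);
# the whitespace tokenizer and position weighting are illustrative.

def title_score(sentence: str, title: str) -> int:
    """Title method: count the words shared between sentence and title."""
    return len(set(sentence.lower().split()) & set(title.lower().split()))

def location_score(index: int, total: int) -> float:
    """Location method: favour sentences at the start or end of the text."""
    if index == 0 or index == total - 1:
        return 1.0
    return 1.0 / (index + 1)

title = "A Comparative Study on Text Summarization Methods"
sentences = ["Text summarization methods reduce a document to a short summary.",
             "The weather was pleasant that day.",
             "This study compares extractive text summarization methods."]
for i, s in enumerate(sentences):
    print(title_score(s, title), round(location_score(i, len(sentences)), 2), s)
```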


3) tf-idf method
Term frequency-inverse document frequency (tf-idf) is a numerical statistic which reflects how important a word is to a document. It is often used as a weighting factor in information retrieval and text mining, and it is widely used for stop-word filtering in text summarization and categorization applications. The tf-idf value of a word increases proportionally with the number of times the word appears in the document. tf-idf weighting schemes are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

The term frequency tf(t, d) is the raw frequency of a term in a document, that is, the number of times that term t occurs in document d. The inverse document frequency measures whether the term is common or rare across all documents; it is obtained by dividing the total number of documents by the number of documents containing the term (usually on a logarithmic scale).
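A minimal sketch of tf-idf sentence scoring follows; the aggregation rule (treating each sentence as a document and summing the tf-idf values of its words) is an assumption, since the paper does not specify one:

```python
# Sketch of tf-idf sentence scoring: each sentence is treated as a
# "document", idf uses a log scale, and a sentence's score is the sum of
# its words' tf-idf values (the aggregation rule is an assumption).
import math
from collections import Counter

def tfidf_scores(sentences: list[list[str]]) -> list[float]:
    n = len(sentences)
    df = Counter(word for sent in sentences for word in set(sent))  # document frequency
    scores = []
    for sent in sentences:
        tf = Counter(sent)                                # raw term frequency tf(t, d)
        scores.append(sum(tf[w] * math.log(n / df[w]) for w in tf))
    return scores

docs = [["text", "summarization", "reduces", "text"],
        ["summarization", "selects", "important", "sentences"],
        ["the", "weather", "is", "nice"]]
print([round(s, 2) for s in tfidf_scores(docs)])
```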
4) Cue word method
A weight is assigned to text based on its significance: positive weights for cue words such as "verified", "significant", "best" and "this paper", and negative weights for words such as "hardly" and "impossible". Cue phrases are usually genre dependent. Sentences containing such cue phrases can be included in the summary. The cue phrase method is based on the assumption that such phrases provide a "rhetorical" context for identifying important sentences; the source abstraction in this case is a set of cue phrases and the sentences that contain them. All of the above statistical features are used by extractive text summarization.

Bayesian classifier: given the features F_1, F_2, ..., F_k of a sentence s and the set S of summary sentences, Bayes' rule gives

$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{P(F_1, F_2, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, F_2, \ldots, F_k)}$$

and, assuming the features are independent,

$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$$

• Each probability is estimated empirically from a corpus.
• Higher-probability sentences are chosen for the summary.

B. Linguistic Approaches
Linguistics is the scientific study of language, which includes the study of semantics and pragmatics. Semantics concerns how meaning is inferred from words and concepts, and pragmatics concerns how meaning is inferred from context. Linguistic approaches consider the connections between words and try to find the main concept by analyzing the words. Abstractive text summarization is based on the linguistic method, which involves semantic processing for summarization. Linguistic approaches face some difficulties in using high-quality linguistic analysis tools (a discourse parser, etc.) and linguistic resources (WordNet, lexical chains, context vector space, etc.).

C. Lexical chain
The concept of lexical chains was first introduced by Morris and Hirst. Basically, lexical chains exploit the cohesion among an arbitrary number of related words, and they can be computed in a source document by grouping (chaining) sets of words that are semantically related. Identity, synonymy, and hypernymy/hyponymy are the relations among words that might cause them to be grouped into the same lexical chain. Lexical chains are also used for information retrieval and grammatical error correction. In computing lexical chains, the noun instances must be grouped according to the above relations, but each noun instance must belong to exactly one lexical chain. There are several difficulties in determining which lexical chain a particular word instance should join: words must be grouped such that they create the strongest and longest lexical chains.

Generally, a procedure for constructing lexical chains follows three steps:
1. Select a set of candidate words;
2. For each candidate word, find an appropriate chain relying on a relatedness criterion among members of the chains;
3. If such a chain is found, insert the word in the chain and update the chain accordingly.

1) WordNet
WordNet is an on-line lexical database for the English language. It groups English words into sets of synonyms called synsets, provides a short definition for each synset, and records the semantic relations between synsets. WordNet thereby also serves as a thesaurus and an on-line dictionary, and it is used by many systems for determining the relationships between words. (A thesaurus is a reference
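To make the three-step chain construction concrete, here is a rough sketch using NLTK's WordNet interface (an assumption; the paper names no implementation, and the relatedness test is a simplification):

```python
# Rough sketch of the three-step chain construction using NLTK's WordNet
# interface (an assumption; the paper names no implementation). Requires:
#   import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def related(word_a: str, word_b: str) -> bool:
    """Relatedness criterion: identity, shared synset, or hypernym link."""
    if word_a == word_b:
        return True
    syns_a = set(wn.synsets(word_a, pos=wn.NOUN))
    syns_b = set(wn.synsets(word_b, pos=wn.NOUN))
    if syns_a & syns_b:                                   # synonymy
        return True
    hypers_a = {h for s in syns_a for h in s.hypernyms()}
    hypers_b = {h for s in syns_b for h in s.hypernyms()}
    return bool(hypers_a & syns_b) or bool(hypers_b & syns_a)

def build_chains(nouns: list[str]) -> list[list[str]]:
    chains: list[list[str]] = []
    for noun in nouns:                                    # step 1: candidate words
        for chain in chains:                              # step 2: find a related chain
            if any(related(noun, member) for member in chain):
                chain.append(noun)                        # step 3: insert and update
                break
        else:
            chains.append([noun])  # each noun instance joins exactly one chain
    return chains

print(build_chains(["car", "automobile", "wheel", "apple", "fruit"]))
```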


work that contains a list of words grouped together according to similarity of meaning.) Semantic relations between words are represented by synonym sets and hyponym trees, and WordNet is used for building lexical chains according to these relations. WordNet contains more than 118,000 different word forms. LexSum is a summarization system which uses WordNet for generating the lexical chains.

D. Graph Theory
Graph theory can be applied to represent the structure of a text as well as the relationships between the sentences of a document. Sentences in the document are represented as nodes, and the edges between nodes are connections between sentences, related by a similarity relation. By developing different similarity criteria, the similarity between two sentences is calculated and each sentence is scored; whenever a summary is to be produced, the sentences with the highest scores are chosen for it. In graph ranking algorithms, the importance of a vertex within the graph is iteratively computed from the entire graph, so network (graph) based methods can also capture higher semantic and syntactic structures.

1) Graph Approach
In this technique, there is a node for every sentence. Two sentences are connected with an edge if they share some common words, in other words, if their similarity is above some threshold. This representation gives two results. First, the partitions contained in the graph (those sub-graphs that are unconnected to the other sub-graphs) form the distinct topics covered in the document. Second, the graph-theoretic method identifies the important sentences of the document: nodes with high cardinality (number of edges connected to the node) are the important sentences of their partition, and hence carry a higher preference to be included in the summary. Formally, let G(V, E) be a weighted undirected graph, where V is the set of nodes, E is the set of weighted edges, and the edge weights w(u, v) define a measure of pairwise similarity between nodes u and v.
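A small sketch of this degree-based selection follows (not from the paper; the Jaccard word-overlap similarity and the threshold value are illustrative assumptions):

```python
# Sketch of the graph approach: sentences become nodes, an edge joins two
# sentences whose word overlap exceeds a threshold, and high-degree nodes
# are picked for the summary. Tokenizer and threshold are illustrative.

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def summarize_by_degree(sentences: list[str], threshold: float = 0.1,
                        top_n: int = 2) -> list[str]:
    words = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(words[i], words[j]) > threshold:   # similarity edge
                degree[i] += 1
                degree[j] += 1
    # Nodes with high cardinality are taken as the important sentences.
    ranked = sorted(range(n), key=lambda i: degree[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(ranked)]

doc = ["Graphs can represent the structure of a text.",
       "Each sentence of the text becomes a node in the graph.",
       "Edges in the graph connect similar sentences.",
       "Lunch is served at noon."]
print(summarize_by_degree(doc))
```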
Fig.2 Example of Graph-based Representations

TABLE I: GRAPH METHOD SAMPLE

Data         | Directed? | Node     | Edge
Web          | yes       | page     | link
Citation Net | yes       | citation | reference relation
Text         | no        | sentence | semantic connectivity

E. Neural Network Approach
A neural network is trained on a corpus of documents and is then modified, through feature fusion, to produce a summary of the highly ranked sentences in a document. Through feature fusion, the network discovers the importance (and unimportance) of the various features used to determine the summary-worthiness of each sentence. The input to the neural network can be either real-valued or binary vectors. The first phase of the process involves training the neural network to learn the types of sentences that should be included in the summary. This is accomplished by training the network with sentences from several test paragraphs, where a human reader has identified each sentence as to whether it should be included in the summary or not. The neural network learns the patterns inherent in sentences that should be included in the summary and in those that should not. We use a three-layered feed-forward neural network, which has been proven to be a universal function approximator: it can discover the patterns and approximate the inherent function of any data to arbitrary accuracy as long as there are no contradictions in the data set. Our neural network consists of seven input-layer neurons, six hidden-layer neurons, and one output-layer neuron; after training, unnecessary connections and neurons can be pruned without affecting the performance of the network.
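A toy sketch of such a 7-6-1 feed-forward classifier using scikit-learn follows (an assumption; the paper does not name an implementation, and random vectors stand in for human-labeled sentence features):

```python
# Toy sketch of the 7-6-1 feed-forward network described above, using
# scikit-learn (an assumption); random feature vectors and a synthetic
# labeling rule stand in for real, human-judged sentence features.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((40, 7))                        # 7 sentence features per example
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)      # 1 = include in summary (toy rule)

# Seven inputs, six hidden neurons, one (sigmoid) output neuron.
net = MLPClassifier(hidden_layer_sizes=(6,), activation="logistic",
                    max_iter=2000, random_state=0).fit(X, y)
print(net.predict(rng.random((3, 7))))         # summary-worthiness of new sentences
```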


Parameters        | Cue method               | Title method       | Location method                | Frequency measures                           | Lexical chains    | Word similarity computation
Relevance factors | Keywords, alert words    | Word density ratio | Document structure             | Individual word count (excluding stop-words) | Homogeneity index | Based on similarity; based on corpus
Accuracy          | Good                     | Relatively good    | Better than frequency measures | Comparatively lower than word-density method | Good              | Average accuracy
Application       | Signature-based analysis | Dynamic reporting  | Dynamic reporting              | Dynamic reporting                            | NA                | Signature-based reinforcement

TABLE II: COMPARISON BASED ON PARAMETERS USED

V. CONCLUSION

This comparative survey paper concentrates on extractive summarization methods. The extractive method selects important sentences from the original text based on statistical and linguistic features of the sentences, and many variations of this approach are discussed in this paper. We found that the use of Natural Language Processing methods would provide cohesion and semantics. If a text contains multiple topics or meanings, the summary generated by an extractive method might not be balanced. Deciding proper weights for the individual features is very important, as the quality of the final summary depends on it. The biggest challenge for text summarization is to summarize content from a number of textual and semi-structured sources, including databases and web pages, in the right way. Text summarization software should produce an effective summary in less time and with the least redundancy. We have shown how summarization strategies must be adapted using different methods. Our future work is to summarize web pages and journal articles, taking into account contextual information that guides sentence selection.

REFERENCES:
1. S. Agarwal and H. Yu, "Automatically classifying sentences in full-text biomedical articles into introduction, method, results and discussion", Bioinformatics, 3(23):3174–3180, December 2009.
2. M. A. Angrosh, S. Cranefield, and N. Stanger, "Context identification of sentences in related work sections using a conditional random field: towards intelligent digital libraries", in Proc. of the 10th Annual Joint Conference on Digital Libraries, pages 43–302, 2010.
3. Y. Guo, A. Korhonen, and T. Poibeau, "A weakly-supervised approach to argumentative zoning of scientific documents", in Proc. of the 2011 Conference on Empirical Methods in Natural Language Processing, 2011.
4. W. Jien-Chen, C. Yu-Chia, H.-C. Liou, and J. Chang, "Computational analysis of move structures in academic abstracts", in Proc. of the COLING/ACL Interactive Presentation Sessions, COLING-ACL '06, pages 41–44, 2006.
5. S. Teufel and M. Moens, "Summarizing scientific articles: experiments with relevance and rhetorical status", Computational Linguistics, 28(4):409–445, December 2002.
6. Karel Jezek and Josef Steinberger, "Automatic Text Summarization", in Vaclav Snasel (Ed.): Znalosti 2008, pp. 112, ISBN 978-80-227-2827-0, FIIT STU Bratislava, Ustav Informatiky a softveroveho inzinierstva, 2008.
7. Weiguo Fan, Linda Wallace, Stephanie Rich, and Zhongju Zhang, "Tapping into the Power of Text Mining", Journal of ACM, Blacksburg, 2005.


8. G. Erkan and Dragomir R. Radev, "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", Journal of Artificial Intelligence Research, Vol. 22, pp. 457–479, 2004.
9. Udo Hahn and Martin Romacker, "The SYNDIKATE text knowledge base generator", in Proceedings of the First International Conference on Human Language Technology Research, Association for Computational Linguistics, Morristown, NJ, USA, 2001.
10. Ani Nenkova and Rebecca Passonneau, "Evaluating content selection in summarization: The Pyramid method", in HLT-NAACL, pp. 145–152, 2004.
11. Chin-Yew Lin, "ROUGE: A package for automatic evaluation of summaries", in Proc. of the ACL Workshop on Text Summarization Branches Out, 2004.
12. Eduard Hovy, Chin-Yew Lin, Liang Zhou, and Junichi Fukumoto, "Automated Summarization Evaluation with Basic Elements", in Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), 2006.
13. H. P. Luhn, "The Automatic Creation of Literature Abstracts", presented at the IRE National Convention, New York, pp. 159–165, 1958.
