Automatic Text Summarization using Python
Steven Ace B. Galedo
College of Information Technology Education Department
Technological Institute of the Philippines
Manila, Philippines
[email protected]

Abstract—Manual text summarization takes a lot of time. That is why, in this paper, a simple automatic text summarization tool using the cosine similarity measure was created. Cosine similarity is one way of summarizing text: sentence vectors are made out of the text, the sentences are ranked using a formula, and the top-ranking sentences are included in the final summary.

Keywords—Text Summarization, Python, Extractive Summarization, Cosine Similarity, Sentence Vectors

I. INTRODUCTION

Scanning large piles of text and interpreting their meaning is a hard task, which is why summarization has become a crucial part of many industries. Reading a summary gives the reader a glimpse of the meaning or gist of a text without having to read the whole thing, especially when the document is lengthy, such as a court case or a scientific paper. Summarization is the process of reducing a text to its main idea and necessary information. Summarizing differs from paraphrasing in that a summary leaves out details and terms, while a paraphrase is a restatement of the meaning of a text or passage using other words. Summarizing helps you understand and learn important information by reducing it to its key ideas; summaries can also be used for annotation and study notes, as well as to expand the depth of your writing [1]. It is very difficult for human beings to manually extract the summary of a large document. There is a great deal of text material available on the Internet, so there is the problem of searching for relevant documents among those available and absorbing the relevant information from them.
II. AUTOMATIC TEXT SUMMARIZATION

Automatic text summarization deals with employing machines or computers to perform the summarization of a document or documents using some form of heuristics or statistical methods. There are different approaches to automatic text summarization, each of which is explained below.

A. Automatic Text Summarization Based on Input Type

Single-document summarization focuses on summarizing a single document only, while multi-document summarization focuses on summarizing multiple documents [2].

B. Automatic Text Summarization Based on Purpose

Automatic text summarization can also be done based on purpose. There are three purpose-based approaches: generic, domain-specific and query-based. Generic summarization focuses on obtaining a generic summary or abstract of the collection (whether documents, sets of images, videos, news stories, etc.). Domain-specific summarization summarizes only a particular kind of document, and query-based summarization summarizes objects specific to a query [2].

C. Automatic Text Summarization Based on Output Type

In terms of output type, there are likewise two approaches: abstractive and extractive text summarization. In abstractive summarization, the method selects words based on semantic understanding, even words that did not appear in the source documents. It aims at producing the important material in a new way, interpreting and examining the text with advanced natural language techniques in order to generate a new, shorter text that conveys the most critical information from the original. In extractive summarization, on the other hand, the method attempts to summarize articles by selecting a subset of words that retain the most important points. This approach weights the important parts of sentences and uses them to form the summary; different algorithms and techniques are used to define weights for the sentences and further rank them based on importance and similarity to each other [3].

III. ABSTRACTIVE AND EXTRACTIVE TEXT SUMMARIZATION

A. Abstractive Text Summarization

In order to achieve abstractive text summarization, certain techniques are applied. These techniques are divided into two categories, the structured based approach and the semantic based approach. The structured based approach encodes the most important information from the document through cognitive schemes such as templates, extraction rules, and other structures such as trees, ontologies, and lead and body phrase structures. In the semantic based approach, a semantic representation of the document is fed into a natural language generation (NLG) system; this method focuses on identifying noun phrases and verb phrases by processing linguistic data. The different techniques for each approach are explained below in Table I and Table II.

TABLE I. ABSTRACTIVE TEXT SUMMARIZATION METHODS: USING STRUCTURED BASED APPROACH [4]

Tree Based Method:
- It uses a dependency tree to represent the text of a document.
- It uses either a language generator or an algorithm for generation of the summary.

Template Based Method:
- It uses a template to represent a whole document.
- Linguistic patterns or extraction rules are matched to identify text snippets that are mapped into template slots.

Ontology Based Method:
- It uses an ontology (knowledge base) to improve the process of summarization.
- It exploits fuzzy ontology to handle uncertain data that a simple domain ontology cannot.

Lead and Body Phrase Method:
- This method is based on operations on phrases (insertion and substitution) that have the same syntactic head chunk in the lead and body sentences, in order to rewrite the lead sentence.

Rule Based Method:
- Documents to be summarized are represented in terms of categories and a list of aspects.
TABLE II. ABSTRACTIVE TEXT SUMMARIZATION METHODS: USING SEMANTIC BASED APPROACH [4]

Multimodal Semantic Model:
- A semantic model, which captures concepts and the relationships among concepts, is built to represent the contents of multimodal documents.

Information Item Based Method:
- The contents of the summary are generated from an abstract representation of the source documents, rather than from the sentences of the source documents.
- The abstract representation is the Information Item, the smallest element of coherent information in a text.

Semantic Graph Based Method:
- This method summarizes a document by creating a semantic graph called a Rich Semantic Graph (RSG) for the original document and then reducing the generated semantic graph.
B. Extractive Text Summarization

The different techniques for achieving extractive text summarization are listed below in Table III; an illustrative code sketch of the first technique follows the table.

TABLE III. EXTRACTIVE TEXT SUMMARIZATION TECHNIQUES [4]

Term Frequency-Inverse Document Frequency Method:
- Sentence frequency is defined as the number of sentences in the document that contain a given term.
- The sentence vectors are scored by similarity to the query, and the highest-scoring sentences are picked to be part of the summary.

Cluster Based Method:
- It is intuitive to think that summaries should address the different "themes" appearing in the documents.
- If the document collection for which the summary is being produced is of totally different topics, document clustering becomes almost essential to generate a meaningful summary.
- Sentence selection is based on the similarity of the sentence to the theme of the cluster (Ci), the location of the sentence in the document (Li), and its similarity to the first sentence of the document to which it belongs (Fi). Each sentence is scored as
  Si = W1*Ci + W2*Fi + W3*Li
  where W1, W2 and W3 are the weights used for inclusion in the summary.
- The k-means clustering algorithm is applied.

Graph Theoretic Approach:
- The graph-theoretic representation of passages provides a method for the identification of themes.
- After the common pre-processing steps, namely stemming and stop-word removal, the sentences in the documents are represented as nodes in an undirected graph.

Machine Learning Approach:
- The summarization process is modelled as a classification problem: sentences are classified as summary sentences and non-summary sentences based on the features that they possess.
- The classification probabilities are estimated statistically using the Naive Bayes rule:
  P(s ∈ S | F1, F2, ..., FN) = P(F1, F2, ..., FN | s ∈ S) * P(s ∈ S) / P(F1, F2, ..., FN)

LSA Method:
- It gets the name LSA (Latent Semantic Analysis) because SVD applied to document-word matrices groups documents that are semantically related to each other, even when they do not share common words.

Text Summarization with Neural Networks:
- This method involves training neural networks to learn the types of sentences that should be included in the summary.
- It uses a three-layered feed-forward neural network.

Automatic TS Based on Fuzzy Logic:
- This method considers each characteristic of a text, such as similarity to the title, sentence length, and similarity to key words, as an input to the fuzzy system.
An Approach to Concept-Obtained Text Summarization:
- The idea of this approach is to obtain the concepts of words based on HowNet, and to use the concept, instead of the word, as the feature.
- This approach uses a conceptual vector space model to form a rough summarization, and then calculates the degree of semantic similarity between sentences to reduce redundancy [5].

Text Feature Summarization Using Regression for Estimating Feature Weights:
- Mathematical regression is a good model for estimating the text feature weights. In this model, a mathematical function relates output to input: the feature parameters of many manually summarized English documents are used as independent input variables, the corresponding dependent outputs are specified in the training phase, and a relation between inputs and outputs is established. Testing data are then introduced to the system model to evaluate its efficiency [5].

Multi-Document Extractive Summarization:
- Multi-document extractive summarization deals with the extraction of summarized information from multiple texts written about the same topic. The resulting summary report allows individual users, as well as professional information consumers, to quickly familiarize themselves with the information contained in a large cluster of documents. Multi-document summarization creates information reports that are both concise and comprehensive; with different opinions being put together and outlined, every topic is described from multiple perspectives within a single document [5].
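To make the first technique in Table III concrete, the following is a minimal sketch of TF-IDF sentence scoring in Python using scikit-learn. It is not part of the original tool: the function name, the naive sentence splitting, and the scoring scheme (summing each sentence's TF-IDF weights instead of scoring against a query) are illustrative assumptions.

# Minimal, illustrative TF-IDF sentence scorer (an assumption, not the paper's method).
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_summary(text, top_n=2):
    # Naive sentence splitting, mirroring the paper's own read_article approach.
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    # Treat each sentence as a "document" so IDF down-weights ubiquitous words.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # Score each sentence by the sum of its TF-IDF weights.
    scores = tfidf.sum(axis=1).A1
    ranked = sorted(zip(scores, sentences), reverse=True)
    return ". ".join(s for _, s in ranked[:top_n])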
IV. COSINE SIMILARITY

While the approaches presented in Tables I, II and III have been shown to produce good summaries, it is still worthwhile to produce a simple text summarization tool using the cosine similarity measure. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: it measures the cosine of the angle between them. Since we will be representing our sentences as vectors, we can use it to find the similarity among sentences; the angle is 0 when the sentences are identical [3]. Cosine similarity is computed to get a grasp of which sentences are related to each other and can be included in the final summary. Sentence vectors are created by listing all the words in the two sentences being compared and counting the occurrences of each word in each sentence; the cosine similarity of the two sentence vectors is then calculated. The formula for cosine similarity is shown below in Fig. 1.

similarity = cos(θ) = (A · B) / (||A|| ||B||)

Fig. 1 Cosine Similarity Formula
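As a worked example of the sentence-vector construction just described (the two sentences are invented for illustration): for "the cat sat" and "the cat ran", the combined word list is [the, cat, sat, ran], giving the vectors A = (1, 1, 1, 0) and B = (1, 1, 0, 1). Then cos(θ) = (A · B) / (||A|| ||B||) = 2 / (√3 · √3) = 2/3 ≈ 0.67, and the same value can be checked with NLTK:

from nltk.cluster.util import cosine_distance

# Count vectors over the combined word list [the, cat, sat, ran]
vector1 = [1, 1, 1, 0]  # "the cat sat"
vector2 = [1, 1, 0, 1]  # "the cat ran"
print(1 - cosine_distance(vector1, vector2))  # 0.666..., i.e. cos(theta) = 2/3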
The Python Program Using Cosine Similarity

By applying the formula in a Python program, a simple text summarizer was created. The code for the program is reflected below:

#!/usr/bin/env python
# coding: utf-8

import nltk
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx


def read_article(file_name):
    file = open(file_name, "r")
    filedata = file.readlines()
    article = filedata[0].split(". ")
    sentences = []
    for sentence in article:
        sentences.append(sentence.replace("[^a-zA-Z]", " ").split(" "))
    sentences.pop()  # drop the trailing fragment left over from the split
    return sentences


def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []

    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]

    all_words = list(set(sent1 + sent2))

    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)

    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1

    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1

    return 1 - cosine_distance(vector1, vector2)


def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2:  # ignore if both are the same sentence
                continue
            similarity_matrix[idx1][idx2] = sentence_similarity(
                sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix


def generate_summary(file_name, top_n=5):
    nltk.download("stopwords")
    stop_words = stopwords.words('english')
    summarize_text = []

    # Flow: input article -> split into sentences -> remove stop words ->
    # build a similarity matrix -> generate rank based on matrix ->
    # pick top N sentences for summary.

    # Step 1 - Read text and split it
    sentences = read_article(file_name)

    # Step 2 - Generate similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort the rank and pick top sentences
    ranked_sentence = sorted(((scores[i], s) for i, s in enumerate(sentences)),
                             reverse=True)

    for i in range(top_n):
        summarize_text.append(" ".join(ranked_sentence[i][1]))

    # Step 5 - Of course, output the summarized text
    print("Summarized Text: \n", ". ".join(summarize_text))


# let's begin
generate_summary("sample1.txt", 1)

So, there are four functions in the code: read_article, sentence_similarity, build_similarity_matrix and generate_summary. The read_article function is where we open the text file defined in the code; sentence_similarity is where we compute the similarity of two sentences using cosine similarity; build_similarity_matrix builds the similarity matrix over all sentences; and generate_summary is what we call to generate the summary, given the preferred number of top sentences and the name of the text file.
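As a usage note: since read_article reads only filedata[0], the program expects the whole article to sit on the first line of the input file, and the call generate_summary("sample1.txt", 1) prints the single highest-ranked sentence. A hypothetical invocation (the script name is assumed here) would look like:

$ python summarizer.py
Summarized Text:
 <highest-ranked sentence of sample1.txt>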
V. CONCLUSION

From this review of text summarization techniques, we were able to create an automatic text summarizer using Python. The method used for summarization is cosine similarity. In the end, it has been shown that the cosine similarity measure can be used to create an automatic text summarization tool.

ACKNOWLEDGMENT

First of all, I would like to thank the Almighty God for giving me overflowing knowledge and wisdom to make this review and the project a successful one, and for providing everything I needed from the start until the end of this project. I would also like to thank my parents for their unwavering support while I did this research.

REFERENCES

[1] D. Dean, "Original Article with Highlighting and Annotations Bats."
[2] "Unsupervised Text Summarization using Sentence Embeddings." [Online]. Available: https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1. [Accessed: 18-Sep-2019].
[3] "Understand Text Summarization and create your own summarizer in python." [Online]. Available: https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70. [Accessed: 18-Sep-2019].
[4] D. K. Gaikwad and C. N. Mahender, "A Review Paper on Text Summarization," Int. J. Adv. Res. Comput. Commun. Eng., vol. 5, no. 3, pp. 154–160, 2016.
[5] V. Gupta and G. S. Lehal, "A Survey of Text Summarization Extractive Techniques," J. Emerg. Technol. Web Intell., vol. 2, no. 3, pp. 258–268, 2010.