Text Summarization
A summary can be defined as a text that is produced from one or more texts, that contains a
significant portion of the information in the original text(s), and that is no longer than half of the
original text(s). Text summarization is the process of distilling the most important information
from a source (or sources) to produce an abridged version for a particular user (or users) and task
(or tasks). When this is done by means of a computer, i.e. automatically, it is called Automatic
Text Summarization (ATS).
Automatic text summarization is the technique in which a computer summarizes a text. A text is
entered into the computer and a summarized text is returned, which is a non-redundant extract
from the original text. The technique has its roots in the 1960s and has been developed over some 30
years, but with the Internet and the WWW it has become more important.
Today there are basically two common approaches to text summarization.
The first approach tries to analyze the text and to rewrite or rephrase it in a shorter form. The
second approach tries to extract the key sentences from the text by using a text ranking algorithm,
and then tries to put them together properly. The first approach has not achieved any substantial
results so far.
Some applications of automatic text summarization are:
- Search engines, to extract keywords and to obtain summaries of the texts found
- Searching in foreign languages and obtaining an automatic summary of the machine-translated text
- Summarizing text that has been downloaded from the Internet on a WAP mobile phone
In Ethiopia, various attempts have been made to develop automatic text summarization (ATS)
systems for the Amharic language. These attempts use different approaches and algorithms to have
the computer summarize texts automatically and display the summary in Amharic.
This project is intended to help users achieve their text summarization objective
without difficulty.
Ethiopia is a linguistically diverse country where more than 80 languages are used in day-to-day
communication. Although many languages are spoken in Ethiopia, Amharic is dominant in that it
is spoken as a mother tongue by a substantial segment of the population and it is the most
commonly learned second language throughout the country.
Amharic is the official language of the federal government of the country. According to the
1998 census of the country (ECSA, 1998), Amharic is the first language of more than 17 million
people and the second language of more than 5 million people.
2 Statement of problem
Information overload is a problem in this information era because of the mass production of
information in many formats, which is amplified by internet technology. Amharic text
documents are part of this mass production.
Amharic, being the official spoken and written language of Ethiopia, is used to produce text
documents for readers. These text documents are available digitally and their number is
increasing every day. Even Google now supports searching for documents in Amharic fonts
using queries in Amharic font. Interested users now spend time searching for Amharic
documents online, which returns a large number of text documents and creates information
overload for the user.
Few studies have been conducted locally in the area of automatic text summarization, applying
different methodologies. More research has to be done to find the approach that is most
compatible, effective and efficient for Amharic text.
This project is part of the effort needed to fill the gap in the area. It aims to
provide automatic text summarization for the Amharic language using the Python programming language.
The main aim of this project was to develop an Automatic Text Summarization (ATS) system for
the Amharic language. The summarization process covers text documents only; other multimedia
data types are not included in the project.
4 Literature review
The first research, Amharic News Text Summarization, was done by Kamil Nuri in 2004. The
system was developed by using natural language processing techniques and statistical methods.
Title words, head sentences, head sentence words, paragraph starting sentences, cue phrases and
high frequency key words were used as extraction features. The evaluation of the system
showed 74.4% precision and 58% recall (Kamil, 2004). He used nine news articles that were in
printed format, since at the time the research was conducted there were no web-based Amharic
news service providers. However, since he used the Perl programming language, which did not
support the Amharic alphabet, he transliterated the nine news articles. That means the
Amharic alphabet was not fed directly to the system, so the system cannot be applied directly
to a real-world Amharic text summarization application.
The second work, The Application of Machine Learning Technique for Automatic Text
Summarization: The Case of Amharic News Text, was done by Teferi Andargie in 2005. He used
predefined features such as the location of a sentence in a document, title words occurring in the
sentence, and cue words occurring in the sentence, which were found to be good indicators for
producing an optimum summary. The experiment used a corpus of 480 news articles that had been used
by Saba (2001), Theodros (2003) and Samuel (2004) for testing different retrieval models. A
manual summary at a 30% extraction rate was prepared for the 480 articles. The naive Bayes
algorithm was used to classify sentences as summary or non-summary sentences based on the feature
vectors (Teferi, 2005). A prototype was developed that extracts sentences to a desired
compression level. The result of the experiment shows that the location feature gives the best
result in the classification of sentences when individual features are used.
The other study, Automatic Text Summarization for Amharic Legal Judgments, is a work done
by Helen Adane in 2006. The study focuses on producing a prototype text
summarization system for Amharic legal judgments using the Python programming language (Helen, 2006). The
methodology employed is an extraction technique applied to Amharic legal judgments rendered by the
Supreme Court of Ethiopia. To evaluate the performance of the system, a random summary was
generated automatically at the same extraction rates (10% and 20%) as used by the system for the
selected legal judgments. Using an extrinsic evaluation technique, the system
summary and the random summary were compared with an ideal summary
prepared manually by legal experts. The result showed that the sentences extracted by
the system using different extraction features are much closer to the manual summary
(Helen, 2006). The writing style of the research domain, legal judgments, is the inverted pyramid or
downward triangle, which states the most important part of the paragraph in the last sentence
of the paragraph.
5 Implementation
The project mainly used Python 2.5.4 to develop and implement the system. I
chose Python because it has a great many features for writing the code that implements the
summarization tasks. The program takes the content (source) text, performs these
tasks, and automatically generates the summary.
A text summarization system performs many tasks on the text. The text
summarizer that I developed in my project tries to extract the key sentences from the text, and then
tries to put them together properly. Some of the tasks performed in my project to develop
the Automatic Text Summarization system are:
Segmenting the text into paragraphs, sentences and words means dividing the text into discrete
paragraphs, sentences and words/tokens. To do this,
my project first reads the content of the text and then splits it into its respective
paragraphs, sentences and words. The following is some of the sample code used to split
the text into paragraphs, sentences and words/tokens.
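def split_content_to_sentences(self, content):
    content = content.replace("\n", "::")  # line breaks also end a sentence
    return content.split("::")  # assumed: sentences end with "::"

def split_content_to_paragraphs(self, content):
    return content.split("\n\n")  # paragraphs are separated by blank lines

s1 = set(sent1.split(" "))  # words/tokens of one sentence
s2 = set(sent2.split(" "))  # words/tokens of another sentence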
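The segmentation step can be illustrated with a minimal, self-contained sketch. It is not the project's actual code: it is written for Python 3 (the project used Python 2.5.4), and the helper name split_sentence_to_words and the handling of the Ethiopic full stop (U+1362) are assumptions added for illustration.

```python
# Sketch only: names mirror the sample code where possible.
# Treating the Ethiopic full stop like "::" is an assumption.
ETHIOPIC_FULL_STOP = "\u1362"


def split_content_to_paragraphs(content):
    """Split the text into paragraphs on blank lines."""
    return [p for p in content.split("\n\n") if p.strip()]


def split_content_to_sentences(content):
    """Split the text into sentences on "::", the full stop, or a line break."""
    content = content.replace(ETHIOPIC_FULL_STOP, "::").replace("\n", "::")
    return [s.strip() for s in content.split("::") if s.strip()]


def split_sentence_to_words(sentence):
    """Split a sentence into words/tokens on whitespace (hypothetical helper)."""
    return sentence.split()
```

For example, the text "A b c:: D e f::" followed by a blank line and "G h::" yields two paragraphs, and the first paragraph yields the sentences "A b c" and "D e f".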
This task finds the words that sentences have in common, which are used to rank the sentences as a
key for the summarization. It uses a function that takes two sentences as arguments and
returns a score for the intersection between the sentences, or zero if the sentences have no
intersection. First I split each sentence into words/tokens, then count
how many tokens the sentences have in common, and then normalize the result by the
average length of the two sentences to obtain the score for the intersection between the sentences.
In this step formatted sentences are also identified, and all the sentences are ranked according to
their score. Formatted sentences are the sentences that remain after the removal of non-alphabetic
characters. The following is some of the sample code used to calculate
the intersection of the sentences.
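def sentences_intersection(self, sent1, sent2):  # function name assumed
    # Split each sentence into a set of words/tokens
    s1 = set(sent1.split(" "))
    s2 = set(sent2.split(" "))
    if (len(s1) + len(s2)) == 0:
        return 0
    # Common tokens normalized by the average length (reconstructed from the text);
    # 2.0 forces floating-point division in Python 2
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2.0)

def format_sentence(self, sentence):
    # Removal of non-alphabetic characters is not shown in this excerpt
    return sentence

# Start of the ranking step
sentences = self.split_content_to_sentences(content)
n = len(sentences)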
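The intersection score can be illustrated with a small worked example. This is a minimal sketch for Python 3, not the project's code; the function name sentences_intersection is an assumption, and tokens are split on any whitespace.

```python
def sentences_intersection(sent1, sent2):
    """Return the number of shared tokens divided by the average token count."""
    s1 = set(sent1.split())
    s2 = set(sent2.split())
    # Both sentences are empty: there is nothing to compare
    if len(s1) + len(s2) == 0:
        return 0.0
    # Normalize the number of common tokens by the average length
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2.0)


# Two of four tokens are shared, and the average length is four tokens
print(sentences_intersection("a b c d", "c d e f"))  # 0.5
```

Identical sentences score 1.0 and sentences with no token in common score 0.0, so the score can be read as the degree of overlap between the two sentences.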
This task converts the content (the source text) into a dictionary that contains a key and a
rank. It receives the text as input and calculates a score for each sentence. The
calculation is performed in two passes.
In the first pass the text is split into sentences, and the intersection value between every pair of
sentences is stored in a matrix. In the second pass I calculate an individual score for each sentence and
store it in a key-value dictionary, where the (formatted) sentence itself is the key and the value is the
total score. The score is obtained by summing up all of the sentence's intersections with the other
sentences in the text, excluding itself. The best sentences are then selected according to the sentences
dictionary.
The following is some of the sample code used to build the sentences dictionary.
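# Build the sentences dictionary
sentences_dic = {}
for i in range(n):
    # Sum the intersections of sentence i with every other sentence
    score = 0
    for j in range(n):
        if i == j:
            continue
        score += values[i][j]
    sentences_dic[self.format_sentence(sentences[i])] = score
return sentences_dic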
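def get_best_sentence(self, paragraph, sentences_dic):  # function name assumed
    sentences = self.split_content_to_sentences(paragraph)
    # Ignore paragraphs with fewer than two sentences
    if len(sentences) < 2:
        return ""
    # Keep the sentence with the highest score in the dictionary
    best_sentence = ""
    max_value = 0
    for s in sentences:
        strip_s = self.format_sentence(s)
        if strip_s and sentences_dic[strip_s] > max_value:
            max_value = sentences_dic[strip_s]
            best_sentence = s
    return best_sentence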
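The two passes of the ranking step, followed by the selection of the best sentence, can be sketched as below. This is a self-contained illustration for Python 3 under stated assumptions, not the project's code: the function names get_sentences_ranks and get_best_sentence are hypothetical, both functions take an already split list of sentences, and format_sentence uses a regular expression that keeps only letters, digits and underscores.

```python
import re


def format_sentence(sentence):
    """Strip everything except word characters; the result is the dictionary key."""
    return re.sub(r"\W+", "", sentence)


def sentences_intersection(sent1, sent2):
    """Shared tokens divided by the average token count of the two sentences."""
    s1, s2 = set(sent1.split()), set(sent2.split())
    if len(s1) + len(s2) == 0:
        return 0.0
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2.0)


def get_sentences_ranks(sentences):
    """Map each formatted sentence to the sum of its intersection scores."""
    n = len(sentences)
    # First pass: intersection value for every pair of sentences
    values = [[sentences_intersection(sentences[i], sentences[j])
               for j in range(n)] for i in range(n)]
    # Second pass: total score per sentence, excluding the sentence itself
    sentences_dic = {}
    for i in range(n):
        score = 0.0
        for j in range(n):
            if i != j:
                score += values[i][j]
        sentences_dic[format_sentence(sentences[i])] = score
    return sentences_dic


def get_best_sentence(sentences, sentences_dic):
    """Return the highest-scoring sentence, or "" for fewer than two sentences."""
    if len(sentences) < 2:
        return ""
    best_sentence, max_value = "", 0
    for s in sentences:
        key = format_sentence(s)
        if key and sentences_dic.get(key, 0) > max_value:
            max_value = sentences_dic[key]
            best_sentence = s
    return best_sentence
```

With the sentences "a b c", "b c d" and "c d e", the middle sentence overlaps most with the other two (a total score of 4/3 against 1.0 for each of the others), so it is returned as the best sentence.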
The last task performed by my project is generating the final summary. The final
summary is generated automatically: the text is split into paragraphs, sentences and words,
the intersections are calculated, the sentences dictionary is built with the intersection score of
each sentence, and finally the summary is generated by
selecting the best sentence from each paragraph according to the sentences dictionary. In
addition to the summary, the system also reports the ratio between the summary length and the
original text length.
The following is some of the sample code used to generate the summary.
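# Split the content into paragraphs
paragraphs = self.split_content_to_paragraphs(content)
# Start the summary with the title, followed by a blank line
summary = []
summary.append(title.strip())
summary.append("")
# Add the best sentence of each paragraph
for p in paragraphs:
    sentence = self.get_best_sentence(p, sentences_dic).strip()  # helper name assumed
    if sentence:
        summary.append(sentence)
return "\n".join(summary)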
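The summary generation step and the length ratio can be sketched as follows. This is a minimal illustration for Python 3, not the project's code. The parameter best_sentence_of is a hypothetical callable that returns the best sentence of a paragraph (or an empty string), and measuring length in characters in summary_ratio is an assumption, since the text does not say which unit is used.

```python
def get_summary(title, content, best_sentence_of):
    """Join the title and the best sentence of each paragraph into a summary."""
    summary = [title.strip(), ""]
    for paragraph in content.split("\n\n"):
        sentence = best_sentence_of(paragraph).strip()
        if sentence:
            summary.append(sentence)
    return "\n".join(summary)


def summary_ratio(summary, original):
    """Summary length as a fraction of the original length, in characters."""
    if len(original) == 0:
        return 0.0
    return len(summary) / float(len(original))


# Stand-in selector for the demonstration: take the first "::"-terminated sentence
def first_sentence(paragraph):
    return paragraph.split("::")[0]


content = "a b:: c d::\n\ne f:: g h::"
summary = get_summary(" T ", content, first_sentence)
print(summary)
print(summary_ratio(summary, content))
```

Passing the selector as a parameter keeps the sketch independent of how sentences are ranked; in the project the selection is based on the sentences dictionary.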
One of the biggest challenges that I encountered in the course of this project was the lack
of reference material on text summarization for the Amharic language, together with my
lack of experience with the tools required to develop the text summarization system.
Another challenge that made the task hard to deal with was the shortage of time
given to accomplish it.
The following are the limitations of the project, which result from
the challenges encountered during the project.
- I didn't use any software package such as the NLTK toolkit to do the segmentation automatically.
- My project doesn't perform stemming or stop word removal.
- I didn't use any evaluation method other than calculating the ratio of the original and summarized texts.
8 Future Work
The project accomplished the basic goals it was proposed to achieve, but because of the
challenges encountered, some features were not developed in this
project. I recommend the following for others to work on:
- Develop the system using software packages such as the NLTK toolkit.
- Perform abstractive summarization rather than extractive summarization.
- Perform stemming and stop word removal to improve the performance of the system.
- Develop a GUI for the system, which currently has none.