Wan 2008


Multi-Document Summarization Using Cluster-Based Link Analysis


Xiaojun Wan and Jianwu Yang
Institute of Computer Science and Technology, Peking University, Beijing 100871, China
{wanxiaojun, yangjianwu}@icst.pku.edu.cn

ABSTRACT
The Markov Random Walk model has been recently exploited for multi-document summarization by making use of the link relationships between sentences in the document set, under the assumption that all the sentences are indistinguishable from each other. However, a given document set usually covers a few topic themes with each theme represented by a cluster of sentences. The topic themes are usually not equally important and the sentences in an important theme cluster are deemed more salient than the sentences in a trivial theme cluster. This paper proposes the Cluster-based Conditional Markov Random Walk Model (ClusterCMRW) and the Cluster-based HITS Model (ClusterHITS) to fully leverage the cluster-level information. Experimental results on the DUC2001 and DUC2002 datasets demonstrate the good effectiveness of our proposed summarization models. The results also demonstrate that the ClusterCMRW model is more robust than the ClusterHITS model, with respect to different cluster numbers.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – abstracting methods; I.2.7 [Artificial Intelligence]: Natural Language Processing – text analysis

General Terms: Algorithms, Experimentation, Performance

Keywords: Multi-Document Summarization, Cluster-based Link Analysis, Conditional Markov Random Walk Model, HITS

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR’08, July 20–24, 2008, Singapore.
Copyright 2008 ACM 978-1-60558-164-4/08/07...$5.00.

1. INTRODUCTION
Multi-document summarization aims to produce a summary delivering the majority of information content from a set of documents about an explicit or implicit main topic. Multi-document summary can be used to concisely describe the information contained in a cluster of documents and facilitate the users to understand the document cluster. For example, a number of news services (e.g. Google News) have been developed to group news articles into news topics, and then produce a short summary for each news topic. The users can easily understand the topic they have interest in by taking a look at the short summary.

Automated multi-document summarization has drawn much attention in recent years. In the communities of natural language processing and information retrieval, a series of workshops and conferences on automatic text summarization (e.g. NTCIR, DUC), special topic sessions in ACL, COLING, and SIGIR have advanced the summarization techniques and produced a couple of experimental online systems.

A particular challenge for multi-document summarization is that a document set might contain diverse information, which is either related or unrelated to the main topic, and hence we need effective summarization methods to analyze the information stored in different documents and extract the globally important information to reflect the main topic. Another challenge for multi-document summarization is that the information stored in different documents inevitably overlaps with each other, and hence we need effective summarization methods to merge information stored in different documents, and if possible, contrast their differences. In recent years, both unsupervised and supervised methods have been proposed to analyze the information contained in a document set and extract highly salient sentences into the summary, based on syntactic or statistical features.

Most recently, the Markov Random Walk Model (abbr. MRW) has been successfully used for multi-document summarization by making use of the “voting” or “recommendations” between sentences in the documents [4, 21, 25]. The model first constructs a directed or undirected graph to reflect the relationships between the sentences and then applies the graph-based ranking algorithm to compute the rank scores for the sentences. The sentences with large rank scores are chosen into the summary. However, the model makes uniform use of the sentences in the document set, i.e. all the sentences are ranked without considering the higher-level information beyond the sentence-level information. Actually, given a document set, there usually exist a number of themes or subtopics, and each theme or subtopic is represented by a cluster of highly related sentences [6, 7]. The theme clusters are usually of different size and have different importance for users to understand the document set. For example, the theme clusters close to the main topic of the document set are usually more important than the theme clusters far away from the main topic of the document set. The cluster-level information is deemed to have great influence on the sentence ranking process. Moreover, the sentences in the same theme cluster cannot be treated uniformly. Some sentences in the cluster are more important than other sentences because of their different distances to the cluster’s centroid. In brief, neither the cluster-level information nor the sentence-to-cluster relationship can be taken into account in the Markov Random Walk Model.

In order to address the above limitations of the Markov Random Walk Model, we propose two models to incorporate the cluster-level information into the process of sentence ranking. The first model is the Cluster-based Conditional Markov Random Walk Model (abbr. ClusterCMRW), which incorporates the cluster-level information into the link graph. The second model is the Cluster-based HITS Model (abbr. ClusterHITS), which considers the clusters and sentences as hubs and authorities in the
HITS algorithm. Experiments have been performed on the DUC2001 and DUC2002 datasets, and the results demonstrate the good effectiveness of the two models. The experimental results also demonstrate that the ClusterCMRW model is more robust than the ClusterHITS model, with respect to different cluster numbers.

The rest of this paper is organized as follows: Section 2 introduces the related work. The basic Markov Random Walk Model is introduced in Section 3. The two proposed models are presented in Section 4. In Section 5, we describe the experiments and results. Lastly, we conclude this paper in Section 6.

2. RELATED WORK
2.1 Multi-Document Summarization
A variety of multi-document summarization methods have been developed recently. Generally speaking, those methods can be either extractive summarization or abstractive summarization. Extractive summarization involves assigning saliency scores to some units (e.g. sentences, paragraphs) of the documents and extracting those with highest scores, while abstractive summarization (e.g. NewsBlaster) usually needs information fusion [2], sentence compression [10] and reformulation [20]. In this study, we focus on extractive summarization.

The centroid-based method [23] is one of the most popular extractive summarization methods. MEAD¹ is an implementation of the centroid-based method that scores sentences based on sentence-level and inter-sentence features, including cluster centroids, position, TFIDF, etc. NeATS [15] uses sentence position, term frequency, topic signature and term clustering to select important content, and uses MMR [5] to remove redundancy. To further explore user interface issues, iNeATS [14] is developed based on NeATS. XDoX [7] is a cross document summarizer designed specifically to summarize large document sets by identifying the most salient themes within the set by passage clustering and then composing an extraction summary that reflects these main themes. The passages are clustered based on n-gram matching. Much other work also explores finding topic themes in the documents for summarization, e.g. Harabagiu and Lacatusu [6] investigate five different topic representations and introduce a novel representation of topics based on topic themes. In addition, Marcu [19] selects important sentences based on the discourse structure of the text. TNO’s system [11] scores sentences by combining a unigram language model approach with a Bayesian classifier based on surface features.

¹ http://www.summarization.com/mead/

Most recently, the graph-based ranking methods have been proposed to rank sentences or passages based on the “votes” or “recommendations” between each other. Websumm [18] uses a graph-connectivity model and operates under the assumption that nodes which are connected to many other nodes are likely to carry salient information. LexPageRank [4] is an approach for computing sentence importance based on the concept of eigenvector centrality. It constructs a sentence connectivity matrix and computes sentence importance based on an algorithm similar to PageRank. Mihalcea and Tarau [21] also propose a similar algorithm based on PageRank [22] to compute sentence importance for single document summarization, and for multi-document summarization, they use a meta-summarization process to summarize the meta-document produced by assembling all the single summaries of each document. Wan and Yang [25] improve the graph-ranking algorithm by differentiating intra-document links and inter-document links between sentences. All these methods make use of the relationships between sentences and select sentences according to the “votes” or “recommendations” from their neighboring sentences.

Other related work includes topic-focused document summarization [3], which aims to produce a summary biased to a given topic or query.

2.2 Link Analysis
PageRank [22] and HITS [9] are two popular algorithms for link analysis between web pages and they have been successfully used to improve web retrieval. More advanced web link analysis methods have been proposed to leverage the multi-layer relationships between web pages. The Conditional Markov Random Walk Model has been successfully applied in the tasks of web page retrieval based on two-layer web graph [17]. Hierarchical structure of the web graph is also exploited for link analysis in [26]. In recent years, a few studies have focused on using link analysis methods to re-rank search results in order to improve the retrieval performance [12, 13, 27]. The links between documents are induced by computing the similarity between documents using the Cosine measure or language model measure. In addition, link analysis methods have also been applied in social network analysis [28] and other tasks.

3. THE BASIC MODEL
The Markov Random Walk Model (MRW) is essentially a way of deciding the importance of a vertex within a graph based on global information recursively drawn from the entire graph. The basic idea is that of “voting” or “recommendation” between the vertices. A link between two vertices is considered as a vote cast from one vertex to the other vertex. The score associated with a vertex is determined by the votes that are cast for it, and the score of the vertices casting these votes.

Figure 1: One-layer link graph (figure omitted; sentence vertices connected by the edge set E)

Formally, given a document set D, let G=(V, E) be a graph to reflect the relationships between sentences in the document set, as shown in Figure 1. V is the set of vertices and each vertex v_i in V is a sentence in the document set. E is the set of edges, which is a subset of V×V. Each edge e_ij in E is associated with an affinity weight f(i→j) between sentences v_i and v_j (i≠j). The weight is computed using the standard cosine measure [1] between the two sentences.

$$
f(i \to j) = \mathrm{sim}_{\mathrm{cosine}}(v_i, v_j) = \frac{\vec{v}_i \cdot \vec{v}_j}{|\vec{v}_i| \times |\vec{v}_j|} \tag{1}
$$

where $\vec{v}_i$ and $\vec{v}_j$ are the corresponding term vectors of v_i and v_j. Two vertices are connected if their affinity weight is larger than 0 and we let f(i→i)=0 to avoid self transition.

The transition probability from v_i to v_j is then defined by normalizing the corresponding affinity weight, as given in Equation (2).
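The affinity weights of Equation (1) and the row normalization just described can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: sentences are tokenized on whitespace and represented by raw term counts (the text only says "term vectors", so the weighting scheme is our assumption), and the function names `term_vector`, `cosine`, `affinity_matrix` and `transition_matrix` are hypothetical. Rows whose affinities sum to zero are left as zeros here.

```python
import math
from collections import Counter


def term_vector(sentence):
    # Assumption: lowercase whitespace tokens with raw term counts.
    return Counter(sentence.lower().split())


def cosine(u, v):
    # Standard cosine measure between two sparse term vectors.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def affinity_matrix(sentences):
    # f(i -> j) as in Equation (1); f(i -> i) = 0 avoids self transition.
    vecs = [term_vector(s) for s in sentences]
    n = len(vecs)
    return [[0.0 if i == j else cosine(vecs[i], vecs[j]) for j in range(n)]
            for i in range(n)]


def transition_matrix(F):
    # Normalize each row of affinities into transition probabilities.
    P = []
    for row in F:
        total = sum(row)
        if total > 0:
            P.append([w / total for w in row])
        else:
            P.append([0.0] * len(row))  # no outgoing affinity
    return P


if __name__ == "__main__":
    F = affinity_matrix(["the cat sat", "the cat ran", "dogs bark"])
    for row in transition_matrix(F):
        print(row)
```

Because the cosine measure is symmetric, the affinity matrix is symmetric, while the normalized transition matrix generally is not.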
$$
p(i \to j) =
\begin{cases}
\dfrac{f(i \to j)}{\sum_{k=1}^{|V|} f(i \to k)}, & \text{if } \sum f \neq 0 \\
0, & \text{otherwise}
\end{cases}
\tag{2}
$$

Note that p(i→j) is usually not equal to p(j→i). We use the row-normalized matrix $\tilde{M} = (\tilde{M}_{i,j})_{|V| \times |V|}$ to describe G with each entry corresponding to the transition probability.

$$
\tilde{M}_{i,j} = p(i \to j) \tag{3}
$$

In order to make $\tilde{M}$ a stochastic matrix, the rows with all zero elements are replaced by a smoothing vector with all elements set to 1/|V|.

Based on the matrix $\tilde{M}$, the saliency score SenScore(v_i) for sentence v_i can be deduced from those of all other sentences linked with it and it can be formulated in a recursive form as in the PageRank algorithm.

$$
\mathrm{SenScore}(v_i) = \mu \cdot \sum_{\text{all } j \neq i} \mathrm{SenScore}(v_j) \cdot \tilde{M}_{j,i} + \frac{1-\mu}{|V|} \tag{4}
$$

And the matrix form is

$$
\vec{\lambda} = \mu \tilde{M}^{T} \vec{\lambda} + \frac{1-\mu}{|V|} \vec{e} \tag{5}
$$

where $\vec{\lambda} = [\mathrm{SenScore}(v_i)]_{|V| \times 1}$ is the vector of saliency scores for the sentences. $\vec{e}$ is a column vector with all elements equal to 1. μ is the damping factor usually set to 0.85, as in the PageRank algorithm.

The above process can be considered as a Markov chain by taking the sentences as the states and the final transition matrix is given by $A = \mu \tilde{M}^{T} + \frac{1-\mu}{|V|} \vec{e}\,\vec{e}^{T}$, which is irreducible. The stationary probability distribution of each state is obtained by the principal eigenvector of the transition matrix.

For implementation, the initial scores of all sentences are set to 1 and the iteration algorithm in Equation (4) is adopted to compute the new scores of the sentences. Usually the convergence of the iteration algorithm is achieved when the difference between the scores computed at two successive iterations for any sentences falls below a given threshold (0.0001 in this study).

Note that after the saliency scores of sentences have been obtained, a variant of the MMR algorithm used in [27] is applied to penalize the sentences that highly overlap with informative sentences and finally choose both informative and novel sentences into the summary.

4. THE PROPOSED MODELS
4.1 Overview
In the basic MRW model, all the sentences are indistinguishable from each other, i.e. the sentences are treated uniformly. However, as we mentioned in the previous section, there may be many factors that can have influence on the importance analysis of the sentences. As shown in [6, 7], a document set usually contains a few topic themes and each theme can be represented by a cluster of topic-related sentences. The theme clusters are not equally important. Our assumption is that the sentences in an important theme cluster should be ranked higher than the sentences in other theme clusters, and an important sentence in a theme cluster should be ranked higher than other sentences in the cluster.

In order to leverage the cluster-level information, we propose two models to make use of the relationships between sentences and clusters. The first model is the Cluster-based Conditional Markov Random Walk Model (ClusterCMRW), which is an improvement of the MRW model or the PageRank algorithm [22] by incorporating the cluster-level information into the link graph. The second model is the Cluster-based HITS Model (ClusterHITS), which formalizes the sentence-cluster relationships as the authority-hub relationships in the HITS algorithm [9]. Both models are based on link analysis techniques.

Note that the above models are used to compute the saliency scores of the sentences in a document set, and other steps are needed to produce the final summary. The overall summarization framework consists of the following three steps:

1. Theme cluster detection: This step aims to detect theme clusters in the document set. In this study, we simply use the clustering algorithm to group sentences into a few theme clusters.
2. Sentence score computation: This step aims to compute the saliency scores of the sentences in the document set by using either the ClusterCMRW model or the ClusterHITS model to incorporate the cluster-level information.
3. Summary extraction: The same algorithm [27] as in the basic model is applied to remove redundancy and choose summary sentences.

The first two steps are key steps and the details will be described in the following sections respectively. The last step is quite straightforward and we omit its details in this paper.

4.2 Theme Cluster Detection
In the experiments, three popular clustering algorithms are explored to produce theme clusters. In this study, given a document set, it is hard to predict the actual cluster number, and thus we typically set the number k of expected clusters as follows.

$$
k = \sqrt{|V|} \tag{6}
$$

where |V| is the number of all sentences in the document set.

The clustering algorithms are described as follows [8]:

Kmeans Clustering: It is a partition based clustering algorithm. The algorithm randomly selects k sentences as the initial centroids of the k clusters and then iteratively assigns all sentences to the closest cluster, and recomputes the centroid of each cluster, until the centroids do not change. The similarity between a sentence and a cluster centroid is computed using the standard cosine measure.

Agglomerative Clustering: It is a bottom-up hierarchical clustering algorithm and starts with the sentences as individual clusters and, at each step, merges the most similar or closest pair of clusters, until the number of the clusters reduces to the desired number k. The similarity between two clusters is computed using the AverageLink method, which computes the average of the cosine similarity values between any pair of sentences belonging to the two clusters respectively.

Divisive Clustering: It is a top-down hierarchical clustering algorithm and starts with one, all-inclusive cluster and, at each step, splits the largest cluster (i.e. the cluster with most sentences) into two small clusters using the Kmeans algorithm until the number of clusters increases to the desired number k.
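The theme cluster detection step can be illustrated with a small Python sketch of the Kmeans variant using the cosine measure. It rests on several assumptions that are not fixed by the text: sentences are already given as dense term-weight vectors of equal length, the square root of |V| is rounded to the nearest integer, centroids are arithmetic means of member vectors, and the names `expected_cluster_count` and `kmeans_cosine` are hypothetical. The loop stops when assignments no longer change, which implies that the centroids no longer change either.

```python
import math
import random


def expected_cluster_count(num_sentences):
    # k = sqrt(|V|); rounding to an integer (at least 1) is our assumption.
    return max(1, round(math.sqrt(num_sentences)))


def cosine(u, v):
    # Standard cosine measure between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def kmeans_cosine(vectors, k, seed=0, max_iter=100):
    # Randomly pick k sentences as the initial centroids.
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every sentence to its most similar centroid.
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments (and hence centroids) are stable
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    k = expected_cluster_count(len(toy))
    print(k, kmeans_cosine(toy, k))
```

The divisive variant described above could reuse `kmeans_cosine` with k=2 on the largest cluster at each step; that extension is not shown here.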
4.3 Cluster-based Conditional Markov Random Walk Model
In order to incorporate the cluster-level information and the sentence-to-cluster relationship, the Conditional Markov Random Walk Model is based on the two-layer link graph including both the sentences and the clusters. The novel representation is shown in Figure 2. As can be seen, the lower layer is just the traditional link graph between sentences in the basic MRW model. And the upper layer represents the theme clusters. The dashed lines between these two layers indicate the conditional influence between the sentences and the clusters.

Figure 2: Two-layer link graph (figure omitted; upper layer: theme clusters, lower layer: sentences, with edge sets E_sc and E_ss)

Formally, the new representation for the two-layer graph is denoted as G*=<V_s, V_c, E_ss, E_sc>, where V_s=V={v_i} is the set of sentences and V_c=C={c_j} is the set of hidden nodes representing the detected theme clusters; E_ss=E={e_ij|v_i,v_j∈V_s} corresponds to all links between sentences and E_sc={e_ij|v_i∈V_s, c_j∈V_c and c_j=clus(v_i)} corresponds to the correlation between a sentence and its cluster. Here, clus(v_i) denotes the theme cluster containing sentence v_i. For further discussions, we let π(clus(v_i)) ∈[0,1] denote the importance of cluster clus(v_i) in the whole document set D, and let ω(v_i, clus(v_i)) ∈[0,1] denote the strength of the correlation between sentence v_i and its cluster clus(v_i).

We incorporate the two factors into the transition probability from v_i to v_j and the new transition probability is defined as follows:

$$
p(i \to j \mid \mathrm{clus}(v_i), \mathrm{clus}(v_j)) =
\begin{cases}
\dfrac{f(i \to j \mid \mathrm{clus}(v_i), \mathrm{clus}(v_j))}{\sum_{k=1}^{|V|} f(i \to k \mid \mathrm{clus}(v_i), \mathrm{clus}(v_k))}, & \text{if } \sum f \neq 0 \\
0, & \text{otherwise}
\end{cases}
\tag{7}
$$

f(i→j|clus(v_i), clus(v_j)) is the new affinity weight between two sentences v_i and v_j, conditioned on the two clusters containing the two sentences. We propose to compute the conditional affinity weight by linearly combining the affinity weight conditioned on the source cluster (i.e. f(i→j|clus(v_i))) and the affinity weight conditioned on the destination cluster (i.e. f(i→j|clus(v_j))) as follows:

$$
\begin{aligned}
f(i \to j \mid \mathrm{clus}(v_i), \mathrm{clus}(v_j))
&= \lambda \cdot f(i \to j \mid \mathrm{clus}(v_i)) + (1-\lambda) \cdot f(i \to j \mid \mathrm{clus}(v_j)) \\
&= \lambda \cdot f(i \to j) \cdot \pi(\mathrm{clus}(v_i)) \cdot \omega(v_i, \mathrm{clus}(v_i)) \\
&\quad + (1-\lambda) \cdot f(i \to j) \cdot \pi(\mathrm{clus}(v_j)) \cdot \omega(v_j, \mathrm{clus}(v_j)) \\
&= f(i \to j) \cdot \big(\lambda \cdot \pi(\mathrm{clus}(v_i)) \cdot \omega(v_i, \mathrm{clus}(v_i)) \\
&\quad + (1-\lambda) \cdot \pi(\mathrm{clus}(v_j)) \cdot \omega(v_j, \mathrm{clus}(v_j))\big)
\end{aligned}
\tag{8}
$$

where λ∈[0,1] is the combination weight controlling the relative contributions from the source cluster and the destination cluster.

Various methods can be used to compute the cluster importance and the sentence-to-cluster correlation strength, including the cosine measure, the language model measure, etc. In this study, we adopt the widely used cosine measure to measure the two factors.

π(clus(v_i)) aims to evaluate the importance of the cluster clus(v_i) in the document set D, and it is set to the cosine similarity value between the cluster and the whole document set²:

$$
\pi(\mathrm{clus}(v_i)) = \mathrm{sim}_{\mathrm{cosine}}(\mathrm{clus}(v_i), D) \tag{9}
$$

² A sentence cluster (or document set) is treated as a single text by concatenating all the sentence texts (or document texts).

ω(v_i, clus(v_i)) aims to evaluate the correlation between the sentence v_i and its cluster clus(v_i), and it is set to the cosine similarity value between the sentence and the cluster:

$$
\omega(v_i, \mathrm{clus}(v_i)) = \mathrm{sim}_{\mathrm{cosine}}(v_i, \mathrm{clus}(v_i)) \tag{10}
$$

Then the new row-normalized matrix $\tilde{M}^{*}$ is defined as follows:

$$
\tilde{M}^{*}_{i,j} = p(i \to j \mid \mathrm{clus}(v_i), \mathrm{clus}(v_j)) \tag{11}
$$

The saliency scores for the sentences are then computed based on $\tilde{M}^{*}$ by using the iterative form in Equation (4). The final transition matrix in the Markov chain is then denoted by $A^{*} = \mu \tilde{M}^{*T} + \frac{1-\mu}{|V|} \vec{e}\,\vec{e}^{T}$ and the sentence scores are obtained by the principal eigenvector of the new transition matrix A*.

4.4 Cluster-based HITS Model
Different from the MRW model and the ClusterCMRW model, the HITS model distinguishes the hubs and authorities in the objects. A hub object has links to many good authorities, and an authority object has high-quality content and there are many hubs linking to it. The hub scores and authority scores are computed in a reinforcement way. In this study, we consider the theme clusters as hubs and the sentences as authorities. Figure 3 gives the bipartite graph representation, where the upper layer is the hubs and the lower layer is the authorities. The HITS model makes only use of the sentence-to-cluster relationships.

Figure 3: Bipartite link graph (figure omitted; upper layer: theme clusters, lower layer: sentences, with edge set E_sc)

Formally, the representation for the bipartite graph is denoted as G#=<V_s, V_c, E_sc>, where V_s=V={v_i} is the set of sentences (i.e. authorities) and V_c=C={c_j} is the set of theme clusters (i.e. hubs); E_sc={e_ij|v_i∈V_s, c_j∈V_c} corresponds to the correlations between any sentence and any cluster. Each edge e_ij is associated with a weight w_ij denoting the strength of the relationship between the sentence v_i and the cluster c_j. Similarly, the weight w_ij is computed
by using the cosine measure. We let $L = (L_{i,j})_{|V_s| \times |V_c|}$ denote the adjacency matrix and L is defined as follows.

$$
L_{i,j} = w_{ij} = \mathrm{sim}_{\mathrm{cosine}}(v_i, c_j) \tag{12}
$$

Then the authority score AuthScore^(t+1)(v_i) of sentence v_i and the hub score HubScore^(t+1)(c_j) of cluster c_j at the (t+1)th iteration are computed based on the hub scores and authority scores at the t-th iteration as follows.

$$
\mathrm{AuthScore}^{(t+1)}(v_i) = \sum_{c_j \in V_c} w_{ij} \cdot \mathrm{HubScore}^{(t)}(c_j) \tag{13}
$$

$$
\mathrm{HubScore}^{(t+1)}(c_j) = \sum_{v_i \in V_s} w_{ij} \cdot \mathrm{AuthScore}^{(t)}(v_i) \tag{14}
$$

And the matrix form is

$$
\vec{a}^{(t+1)} = L \vec{h}^{(t)} \tag{15}
$$

$$
\vec{h}^{(t+1)} = L^{T} \vec{a}^{(t)} \tag{16}
$$

where $\vec{a}^{(t)} = [\mathrm{AuthScore}^{(t)}(v_i)]_{|V_s| \times 1}$ is the vector of authority scores for the sentences at the t-th iteration and $\vec{h}^{(t)} = [\mathrm{HubScore}^{(t)}(c_j)]_{|V_c| \times 1}$ is the vector of hub scores for the clusters at the t-th iteration. In order to guarantee the convergence of the iterative form, $\vec{a}$ and $\vec{h}$ are normalized after each iteration as follows.

$$
\vec{a}^{(t+1)} = \vec{a}^{(t+1)} / \|\vec{a}^{(t+1)}\| \tag{17}
$$

$$
\vec{h}^{(t+1)} = \vec{h}^{(t+1)} / \|\vec{h}^{(t+1)}\| \tag{18}
$$

It can be proved that the authority vector $\vec{a}$ converges to the dominant eigenvector of the authority matrix $LL^{T}$, and the hub vector $\vec{h}$ converges to the dominant eigenvector of the hub matrix $L^{T}L$. For numerical computation of the scores, the initial scores of all sentences and clusters are set to 1 and the above iterative steps are used to compute the new scores until convergence. Usually the convergence of the iteration algorithm is achieved when the difference between the scores computed at two successive iterations for any sentences and clusters falls below a given threshold (0.0001 in this study).

Finally, we use the authority scores as the saliency scores for the sentences. The sentences are then ranked and chosen into summary.

5. EXPERIMENTS
5.1 Data Set
Generic multi-document summarization has been one of the fundamental tasks in DUC 2001³ and DUC 2002⁴ (i.e. task 2 in DUC2001 and task 2 in DUC2002), and we used the two tasks for evaluation. DUC2001 provided 30 document sets and DUC2002 provided 59 document sets (D088 is excluded from the original 60 document sets by NIST) and generic abstracts of each document set with lengths of approximately 100 words or less were required to be created. The documents were news articles collected from TREC-9. The sentences in each article have been separated and the sentence information has been stored into files. The summary of the two datasets is shown in Table 1.

³ http://www-nlpir.nist.gov/projects/duc/guidelines/2001.html
⁴ http://www-nlpir.nist.gov/projects/duc/guidelines/2002.html

Table 1: Summary of data sets
                       DUC 2001     DUC 2002
Task                   Task 2       Task 2
Number of documents    309          567
Number of clusters     30           59
Data source            TREC-9       TREC-9
Summary length         100 words    100 words

5.2 Evaluation Metric
We used the ROUGE [16] toolkit⁵ for evaluation, which has been widely adopted by DUC for automatic summarization evaluation. It measures summary quality by counting overlapping units such as the n-gram, word sequences and word pairs between the candidate summary and the reference summary. ROUGE-N is an n-gram recall measure computed as follows:

$$
\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \{\mathrm{RefSum}\}} \sum_{n\text{-gram} \in S} \mathrm{Count}_{\mathrm{match}}(n\text{-gram})}{\sum_{S \in \{\mathrm{RefSum}\}} \sum_{n\text{-gram} \in S} \mathrm{Count}(n\text{-gram})} \tag{19}
$$

where n stands for the length of the n-gram, and Count_match(n-gram) is the maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries. Count(n-gram) is the number of n-grams in the reference summaries.

⁵ We use ROUGEeval-1.4.2 downloaded from http://haydn.isi.edu/ROUGE/

The ROUGE toolkit reports separate scores for 1, 2, 3 and 4-gram, and also for longest common subsequence co-occurrences. Among these different scores, unigram-based ROUGE score (ROUGE-1) has been shown to agree with human judgment most [16]. We show three of the ROUGE metrics in the experimental results: ROUGE-1 (unigram-based), ROUGE-2 (bigram-based), and ROUGE-W (based on weighted longest common subsequence, weight=1.2).

In order to truncate summaries longer than the length limit, we use the “-l” option in the ROUGE toolkit and we also use the “-m” option for word stemming.

5.3 Experimental Results
The proposed ClusterCMRW and ClusterHITS models with different clustering algorithms are compared with the baseline MRW model, the top three performing systems and two baseline systems on DUC2001 and DUC2002 respectively. The top three systems are the systems with highest ROUGE scores, chosen from the performing systems on each task respectively. The lead baseline and coverage baseline are two baselines employed in the generic multi-document summarization tasks of DUC2001 and DUC2002. The lead baseline takes the first sentences one by one in the last document in the collection, where documents are assumed to be ordered chronologically. And the coverage baseline takes the first sentence one by one from the first document to the last document. Tables 2 and 3 show the comparison results on DUC2001 and DUC2002 respectively. In Table 2, SystemN, SystemP and SystemT are the top three performing systems for DUC2001. In Table 3, System19, System26, System28 are the top
three performing systems for DUC2002. ClusterCMRW and ClusterHITS rely on the underlying clustering algorithm. For example, ClusterCMRW(Kmeans) refers to the ClusterCMRW model using the Kmeans algorithm to detect theme clusters. For the ClusterCMRW models, the combination weight λ is typically set to 0.5 without tuning, i.e. the two clusters for two sentences contribute equally to the conditional transition probability.

Table 2: Comparison results on DUC2001
System                         ROUGE-1     ROUGE-2     ROUGE-W
ClusterCMRW (Kmeans)           0.35824     0.06458*    0.10770
ClusterCMRW (Agglomerative)    0.35707     0.06548*    0.10841
ClusterCMRW (Divisive)         0.35549     0.06073     0.10722
ClusterHITS (Kmeans)           0.35756     0.05944     0.10771
ClusterHITS (Agglomerative)    0.36897*    0.06392*    0.11139*
ClusterHITS (Divisive)         0.37419*    0.06881*    0.11245*
MRW                            0.35527     0.05608     0.10641
SystemN                        0.33910     0.06853     0.10240
SystemP                        0.33332     0.06651     0.10068
SystemT                        0.33029     0.07862     0.10215
Coverage                       0.33130     0.06898     0.10182
Lead                           0.29419     0.04033     0.08880

Table 3: Comparison results on DUC2002
System                         ROUGE-1     ROUGE-2     ROUGE-W
ClusterCMRW (Kmeans)           0.38221*    0.08321     0.12362
ClusterCMRW (Agglomerative)    0.38546*    0.08652*    0.12490*
ClusterCMRW (Divisive)         0.37999     0.08389     0.12384*
ClusterHITS (Kmeans)           0.37643     0.08135     0.12141
ClusterHITS (Agglomerative)    0.37768     0.07791     0.12271
ClusterHITS (Divisive)         0.37872     0.08133     0.12282
MRW                            0.37595     0.08304     0.12173
System26                       0.35151     0.07642     0.11448
System19                       0.34504     0.07936     0.11332
System28                       0.34355     0.07521     0.10956
Coverage                       0.32894     0.07148     0.10847
Lead                           0.28684     0.05283     0.09525

(* indicates that the improvement over the baseline MRW model is statistically significant.)

As seen from the tables, both the ClusterCMRW model and the ClusterHITS model with different clustering algorithms can outperform the basic MRW model and other baselines over almost all three metrics on both DUC2001 and DUC2002 datasets. The results demonstrate the good effectiveness of the proposed models. Moreover, the three clustering algorithms are validated to be as effective as each other. It is a little disappointing that one proposed model cannot always outperform the other proposed model on both datasets.

In order to investigate how the combination weight influences the summarization performance of the ClusterCMRW model, we vary the combination weight λ from 0 to 1 and Figures 4-7 show the ROUGE-1 and ROUGE-2 curves on the DUC2001 and DUC2002 datasets respectively. The similar ROUGE-W curves are omitted due to the page limit. We can see from the figures that the proposed ClusterCMRW model with different clustering algorithms can almost always outperform the baseline MRW model, under different values of λ. The results show the robustness of the proposed ClusterCMRW model, with respect to different combination weights.

Figure 4: ROUGE-1 vs. λ for ClusterCMRW on DUC2001 (plot omitted; curves: ClusterCMRW(Kmeans), ClusterCMRW(Agglomerative), ClusterCMRW(Divisive), MRW)

Figure 5: ROUGE-2 vs. λ for ClusterCMRW on DUC2001 (plot omitted; same four curves)

Figure 6: ROUGE-1 vs. λ for ClusterCMRW on DUC2002 (plot omitted; same four curves)
Figure 7: ROUGE-2 vs. λ for ClusterCMRW on DUC2002 (plot omitted; curves: ClusterCMRW(Kmeans), ClusterCMRW(Agglomerative), ClusterCMRW(Divisive), MRW)

Figure 9: ROUGE-2 vs. r on DUC2001 (plot omitted; curves: ClusterCMRW(Kmeans), ClusterCMRW(Agglomerative), ClusterCMRW(Divisive), ClusterHITS(Kmeans), ClusterHITS(Agglomerative), ClusterHITS(Divisive), MRW)
Note that in the above experiments, the cluster number k is typically set to the square root of the number of sentences. We further vary k to investigate how the cluster number influences summarization performance. Given a document set, let V denote its sentence collection; k is then set as follows:

$$k = r \times |V| \tag{20}$$

where $r \in (0,1)$ is a ratio controlling the expected cluster number for the document set. The larger r is, the more clusters are produced and used in the algorithm. In the experiments r ranges from 0.1 to 0.9, and Figures 8-11 show the ROUGE-1 and ROUGE-2 results of ClusterCMRW and ClusterHITS on the DUC2001 and DUC2002 datasets, respectively.
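
As a rough illustration of Equation (20), the following sketch computes the cluster number from the ratio r and the number of sentences |V|, and compares it with the square-root setting used in the earlier experiments. The function names, the rounding to the nearest integer, and the clamping to [1, |V|] are assumptions of this sketch; the text only specifies the product r × |V|.

```python
import math


def cluster_count(num_sentences: int, r: float) -> int:
    """Return k = r * |V| as an integer.

    Rounding and clamping to [1, |V|] are assumptions of this sketch;
    Equation (20) only gives the product r * |V|.
    """
    if num_sentences < 1:
        raise ValueError("num_sentences must be positive")
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie in the open interval (0, 1)")
    k = round(r * num_sentences)
    return max(1, min(num_sentences, k))


def sqrt_cluster_count(num_sentences: int) -> int:
    """Default setting from the earlier experiments: k close to sqrt(|V|)."""
    return max(1, round(math.sqrt(num_sentences)))


if __name__ == "__main__":
    num_sentences = 400  # hypothetical document-set size, not from the paper
    for r in (0.1, 0.5, 0.9):
        print(f"r={r}: k={cluster_count(num_sentences, r)}")
    print("sqrt setting: k =", sqrt_cluster_count(num_sentences))
```

For a document set of this hypothetical size, even the smallest ratio tried (r = 0.1) yields more clusters than the square-root setting, which gives a sense of how far the larger ratios move away from the default.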

Figure 10: ROUGE-1 vs. r on DUC2002 (curves: ClusterCMRW(Kmeans), ClusterCMRW(Agglomerative), ClusterCMRW(Divisive), ClusterHITS(Kmeans), ClusterHITS(Agglomerative), ClusterHITS(Divisive), MRW)
Figure 11: ROUGE-2 vs. r on DUC2002 (curves: ClusterCMRW(Kmeans), ClusterCMRW(Agglomerative), ClusterCMRW(Divisive), ClusterHITS(Kmeans), ClusterHITS(Agglomerative), ClusterHITS(Divisive), MRW)

As the figures show, the ClusterCMRW models almost always outperform the baseline MRW model, no matter how many clusters are used. The ClusterHITS models, however, are strongly influenced by the cluster number, and a very large number of clusters degrades their performance. We can see from Figures 8, 10 and 11 that the ClusterHITS models perform even worse than the baseline MRW model when r is set to a large value. The results demonstrate that the ClusterCMRW model is more robust than the ClusterHITS model with respect to the cluster number. This can be explained by the fact that the ClusterCMRW model uses both the sentence-to-sentence relationships and the sentence-to-cluster relationships, while the ClusterHITS model uses only the sentence-to-cluster relationships, so the performance of the ClusterHITS model is highly affected by the detected theme clusters.
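
To make the dependence on the detected clusters concrete, here is a minimal sketch of HITS on a bipartite graph with clusters as hubs and sentences as authorities. It is not the paper's implementation: the function name `cluster_hits`, the iteration count, the L2 normalization, and the example weights are all hypothetical. The point is only that the sentence scores are computed from the sentence-to-cluster weights alone, so any change in the clustering changes every score.

```python
import math


def _normalise(scores):
    """Scale a score vector to unit L2 norm (left unchanged if all zero)."""
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm for s in scores] if norm > 0 else scores


def cluster_hits(weights, iterations=50):
    """Run HITS on a bipartite sentence-cluster graph.

    weights[i][j] is a non-negative link weight between sentence i
    (authority) and cluster j (hub). Returns (authority, hub) score lists.
    """
    n_sent = len(weights)
    n_clus = len(weights[0])
    auth = [1.0] * n_sent
    hub = [1.0] * n_clus
    for _ in range(iterations):
        # A sentence gains authority from the clusters it is linked to.
        auth = _normalise(
            [sum(weights[i][j] * hub[j] for j in range(n_clus)) for i in range(n_sent)]
        )
        # A cluster gains hub score from the sentences it is linked to.
        hub = _normalise(
            [sum(weights[i][j] * auth[i] for i in range(n_sent)) for j in range(n_clus)]
        )
    return auth, hub


if __name__ == "__main__":
    # Hypothetical sentence-to-cluster weights: 4 sentences, 2 clusters.
    weights = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.3]]
    auth, hub = cluster_hits(weights)
    ranking = sorted(range(len(auth)), key=lambda i: -auth[i])
    print("sentence ranking:", ranking)
    print("cluster hub scores:", hub)
```

No sentence-to-sentence term appears anywhere in the update, which is consistent with the explanation above of why this family of models is more sensitive to the cluster number.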

Figure 8: ROUGE-1 vs. r on DUC2001 (curves: ClusterCMRW(Kmeans), ClusterCMRW(Agglomerative), ClusterCMRW(Divisive), ClusterHITS(Kmeans), ClusterHITS(Agglomerative), ClusterHITS(Divisive), MRW)

6. CONCLUSION AND FUTURE WORK
In this paper we propose two novel summarization models that make use of the theme clusters in the document set. The first model incorporates the cluster information into the Conditional Markov Random Walk Model, and the second model uses the HITS algorithm by treating the clusters as hubs and the sentences as authorities. Experimental results on the DUC2001 and DUC2002 datasets demonstrate the effectiveness of the models, and the Cluster-based Conditional Markov Random Walk Model is validated to be more robust than the Cluster-based HITS Model.
In this study, the themes in the document set are discovered by simply clustering the sentences, and the quality of the clusters might not be guaranteed. In future work we will use other theme detection methods to find meaningful theme clusters. Moreover,

we will exploit other link analysis methods to incorporate the cluster-level information.

7. ACKNOWLEDGMENTS
This work was supported by the National Science Foundation of China (No.60703064) and the Research Fund for the Doctoral Program of Higher Education of China (No.20070001059). We thank the anonymous reviewers for their useful comments.

8. REFERENCES
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press and Addison Wesley, 1999.
[2] R. Barzilay, K. R. McKeown, and M. Elhadad. Information fusion in the context of multi-document summarization. In Proceedings of ACL1999.
[3] H. Daumé and D. Marcu. Bayesian query-focused summarization. In Proceedings of COLING-ACL2006.
[4] G. Erkan and D. Radev. LexPageRank: prestige in multi-document text summarization. In Proceedings of EMNLP2004.
[5] J. Goldstein, M. Kantrowitz, V. Mittal, and J. Carbonell. Summarizing Text Documents: Sentence Selection and Evaluation Metrics. In Proceedings of ACM SIGIR1999.
[6] S. Harabagiu and F. Lacatusu. Topic themes for multi-document summarization. In Proceedings of SIGIR2005.
[7] H. Hardy, N. Shimizu, T. Strzalkowski, L. Ting, G. B. Wise, and X. Zhang. Cross-document summarization by concept classification. In Proceedings of SIGIR2002.
[8] A. K. Jain, M. N. Murty and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.
[9] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[10] K. Knight and D. Marcu. Summarization beyond sentence extraction: a probabilistic approach to sentence compression. Artificial Intelligence, 139(1), 2002.
[11] W. Kraaij, M. Spitters and M. van der Heijden. Combining a mixture language model and Naïve Bayes for multi-document summarization. In SIGIR2001 Workshop on Text Summarization.
[12] O. Kurland and L. Lee. PageRank without hyperlinks: structural re-ranking using links induced by language models. In Proceedings of SIGIR2005.
[13] O. Kurland and L. Lee. Respect my authority! HITS without hyperlinks, utilizing cluster-based language models. In Proceedings of SIGIR2006.
[14] A. Leuski, C.-Y. Lin and E. Hovy. iNeATS: interactive multi-document summarization. In Proceedings of ACL2003.
[15] C.-Y. Lin and E.H. Hovy. From Single to Multi-document Summarization: A Prototype System and its Evaluation. In Proceedings of ACL2002.
[16] C.-Y. Lin and E.H. Hovy. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In Proceedings of HLT-NAACL 2003.
[17] T.-Y. Liu and W.-Y. Ma. Webpage importance analysis using Conditional Markov Random Walk. In Proceedings of IEEE WI2005.
[18] I. Mani and E. Bloedorn. Summarizing Similarities and Differences Among Related Documents. Information Retrieval, 1(1), 2000.
[19] D. Marcu. Discourse-based summarization in DUC-2001. In SIGIR 2001 Workshop on Text Summarization, 2001.
[20] K. McKeown, J. Klavans, V. Hatzivassiloglou, R. Barzilay, and E. Eskin. Towards multidocument summarization by reformulation: progress and prospects. In Proceedings of AAAI1999.
[21] R. Mihalcea and P. Tarau. A language independent algorithm for single and multiple document summarization. In Proceedings of IJCNLP2005.
[22] L. Page, S. Brin, R. Motwani and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Libraries, 1998.
[23] D. R. Radev, H. Y. Jing, M. Stys and D. Tam. Centroid-based summarization of multiple documents. Information Processing and Management, 40: 919-938, 2004.
[24] H. Saggion, K. Bontcheva, and H. Cunningham. Robust generic and query-based summarization. In Proceedings of EACL2003.
[25] X. Wan and J. Yang. Improved affinity graph based multi-document summarization. In Proceedings of HLT-NAACL2006.
[26] G.-R. Xue, Q. Yang, H.-J. Zeng, Y. Yu and Z. Chen. Exploiting the hierarchical structure for link analysis. In Proceedings of SIGIR2005.
[27] B. Zhang, H. Li, Y. Liu, L. Ji, W. Xi, W. Fan, Z. Chen, and W.-Y. Ma. Improving web search results using affinity graph. In Proceedings of SIGIR2005.
[28] D. Zhou, S. A. Orshanskiy, H. Zha and C. L. Giles. Co-ranking authors and documents in a heterogeneous network. In Proceedings of IEEE ICDM2007.
