Using Incremental PLSI for Threshold-Resilient Online Event Analysis

Tzu-Chuan Chou and Meng Chang Chen
Abstract: The goal of online event analysis is to detect events and track their associated documents in real time from a continuous stream of documents generated by multiple information sources. Unlike traditional text categorization methods, event analysis approaches consider the temporal relations among documents. However, such methods suffer from the threshold-dependency problem, so they only perform well for a narrow range of thresholds. In addition, if the contents of a document stream change, the optimal threshold (that is, the threshold that yields the best performance) often changes as well. In this paper, we propose a threshold-resilient online algorithm, called the Incremental Probabilistic Latent Semantic Indexing (IPLSI) algorithm, which alleviates the threshold-dependency problem and simultaneously maintains the continuity of the latent semantics to better capture the story line development of events. The IPLSI algorithm is theoretically sound and empirically efficient and effective for event analysis. The results of the performance evaluation performed on the Topic Detection and Tracking (TDT)-4 corpus show that the algorithm reduces the cost of event analysis by as much as 15 to 20 percent and increases the acceptable threshold range by 200 to 300 percent over the baseline.
Index Terms: Clustering, online event analysis, probabilistic algorithms, text mining.

1 INTRODUCTION

Publishing activities are now ubiquitous because of the rapid growth of the Internet. When an event occurs,
numerous independent authors (called information sources)
publish articles about the event during its life span. Such
articles form a document stream, which is a collection of
chronologically ordered documents reporting concurrent
events. The goal of the DARPA-sponsored Topic Detection
and Tracking (TDT) Project [1] is to promote TDT research,
where a topic is defined as "something associated with a specific action that occurs at some place and time." Event
analysis is similar to TDT in that it automatically identifies
events and associated documents in a document stream. In
many ways, an event is the same as a topic in TDT, except
that from our perspective, an event has a story line, and its
focus may change during its life span. In addition, an event
may last for a long time (for example, an international
conflict) or for a short period (for example, a car accident).
In terms of processing outcomes, event analysis is similar to
traditional document clustering. Moreover, both techniques
partition a set of documents into several coherent parts,
each of which relates to a single event.
There are two types of event analysis: online and
retrospective. In online event analysis, documents are
generated continuously and ordered chronologically so
that the analysis process needs to make decisions in real
time with all or some of the published documents. When a
new document arrives, the online analysis process either
. The authors are with the Institute of Information Science, Academia Sinica
Taiwan, 128 Sec. 2, Academia Rd., Nankang, Taipei 115 Taiwan.
E-mail: {tzuchuan, mcc}@iis.sinica.edu.tw.
Manuscript received 4 Jan. 2007; revised 10 July 2007; accepted 4 Oct. 2007;
published online 12 Oct. 2007.
For information on obtaining reprints of this article, please send e-mail to:
[email protected], and reference IEEECS Log Number TKDE-0006-0107.
Digital Object Identifier no. 10.1109/TKDE.2007.190702.
1041-4347/08/$25.00 2008 IEEE

assigns it to a known event or creates a new event for it. In


the latter case, it becomes the first document of the new
event. In addition to accurate analysis, prompt responses
and reasonable computational overheads are the major
issues in online event analysis. In contrast, in retrospective
event analysis, all the documents are known in advance,
and the processing time may not be a concern, because the
processing is usually performed offline.
In online event analysis, the incremental clustering
algorithm [1] is employed to cluster documents one by one
in chronological order. A document is deemed a member of
a certain cluster if the similarity between its text and that of
the other documents in the cluster is above a predefined
threshold. If no cluster is similar enough to the document, a
new cluster is created, and the document is treated as the
first document of the newly created topic. The performance
of the incremental clustering algorithm on the TDT task is
excellent, as long as the contents of the topic maintain a
high degree of similarity during the topic's life span.
However, the textual content of a long-term event or a
multitopic event may change over time to reflect theme
changes or different aspects of the event. Allan et al. [2]
noted that temporally approximate documents probably
relate to the same event. By constantly raising the similarity
threshold of the event detection method for each time
increment, it is possible to prevent temporally remote
documents from being merged into the same cluster. Therefore,
documents with similar contents that relate to different
events can be correctly distinguished.
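As an illustration of this scheme, the following is a minimal Python sketch of single-pass incremental clustering; the bag-of-words tokenization, the cosine helper, and the threshold value are illustrative assumptions rather than details taken from [1].

```python
import math
from collections import Counter

def cosine_sim(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def incremental_clustering(docs, threshold=0.25):
    """Single-pass clustering: each incoming document either joins the
    most similar existing cluster or starts a new one (a new event)."""
    clusters = []                      # each cluster: list of term vectors
    for doc in docs:                   # docs arrive in chronological order
        vec = Counter(doc.split())
        best, best_sim = None, 0.0
        for cluster in clusters:
            # Similarity to a cluster: maximum similarity to its members.
            sim = max(cosine_sim(vec, member) for member in cluster)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best.append(vec)           # assign to a known event
        else:
            clusters.append([vec])     # first document of a new event
    return clusters
```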
Traditional document classification methods employ the
conventional vector space model (VSM) to represent
documents as vectors. Each vector is a set of terms whose
weights are usually determined by the TF-IDF scheme [17],
whereas the similarity between two vectors is measured by
the cosine similarity. Yang et al. [19] applied the VSM to the
task of news TDT and used a time window with a decaying function (TW-DF) to model the temporal relations between documents and events. In this method, the size of the time window dictates the number of previous documents to be considered for clustering, and the decaying function weights the influence of a document in the window according to the gap between it and the newest document. Like the time-based threshold approach, remote documents within the time window have less impact on clustering than documents that are chronologically close. In addition, analogous events that occur in different time periods are less likely to be clustered together. This method is one of the best online news detection and tracking algorithms available [20], [21].

[Fig. 1. The detection error trade-off costs of TW-DF and PLSI for TDT-2, TDT-3, and TDT-4. (a) Cost of the TW-DF method. (b) F1 score of the TW-DF method. (c) Cost of the PLSI algorithm. (d) F1 score of the PLSI algorithm.]
The TF-IDF approach achieves a reasonably good
classification and clustering performance if a proper threshold is selected. However, the performance degrades rapidly
when the selected threshold differs from the optimum
threshold (that is, the threshold that achieves the best
performance) by even a small amount. We apply the TW-DF method to the official TDT corpora TDT-2, TDT-3, and TDT-4 [22]. The detection error trade-off costs and F1 scores are
shown in Figs. 1a and 1b, respectively. In the figures, we
observe that the cost and F1 score degrade
sharply from their respective optimal thresholds. We call
this phenomenon the threshold-dependency problem. Interestingly, although this is an important issue, it is frequently
neglected.
The Probabilistic Latent Semantic Indexing (PLSI)
algorithm proposed by Hofmann [10], [11] uses a probabilistic model and the expectation-maximization (EM)
method for text classification and other applications. We
found that the PLSI algorithm can alleviate the threshold-dependency problem. Figs. 1c and 1d show that the

cost and F1 score degrade gently from the
optimal threshold, which suggests that a near-optimal
threshold can achieve a reasonable performance.
The PLSI algorithm uses the training data to build a
probabilistic model, which estimates the parameters of new
documents in the test phase. This process is called the fold-in approach [10], [11]. Because of this characteristic, the
PLSI algorithm is only suitable for offline applications and
not for applications with a continuous incoming data
stream such as online news event analysis. To apply the
PLSI algorithm in online event analysis, the algorithm has
to be rerun, and the latent semantic indices have to be
reestimated for every time period. Thus, to analyze events
that cover different time periods, one can establish the
connection between the latent semantic indices of adjacent
time periods by calculating their similarities. For instance,
to construct event trails, Mei and Zhai utilized the Kullback-Leibler (KL) divergence to calculate the similarities between
the latent semantic indices of different time periods [15].
However, under this approach, the latent semantics
between time periods may be incompatible, because the
PLSI algorithm may converge to different local optima.
Thus, since this approach cannot maintain latent continuity,
the smooth presentation of the resulting event trails may be
disrupted.
In this paper, we propose a threshold-resilient online
algorithm, called the Incremental Probabilistic Latent
Semantic Indexing (IPLSI) algorithm. In contrast to PLSI, IPLSI
processes incoming documents incrementally for each time
period (or after collecting a certain number of documents),
discards out-of-date documents and terms not used in
recent documents already processed, and folds in new
terms and documents for that time period. Therefore, the
latent semantic indices are likely preserved from one time
period to the next. (We call this property maintaining latent
continuity.) Consequently, IPLSI can track the development
of events better than PLSI. In addition, IPLSI alleviates the
threshold-dependency problem and thereby extends the
acceptable threshold range. The evaluation results of the
IPLSI algorithm on the TDT-4 corpus show that it reduces
the trade-off cost of errors in event detection by as much as
15 to 20 percent and increases the acceptable threshold range by 200 to 300 percent over the
baseline.
The remainder of this paper is structured as follows: In
Section 2, we introduce some related work. In Section 3, we
describe the original PLSI algorithm, the naive incremental
approach, and our proposed IPLSI algorithm. Section 4
considers our test corpora, the performance measures, and
the baseline method and discusses the threshold-dependency problem. Section 5 details the experiment results. In
Section 6, we discuss the issue of latent semantic continuity.
Then, in Section 7, we summarize our work and present our
conclusions.

2 RELATED WORK

There are several noteworthy related works. Li et al. [13] adopted a mixture of unigram models to handle text
information, a Gaussian Mixture Model to handle time
information, and a generative model to combine text and
time information for clustering and summarizing news
topics. Although the method avoids the inflexible use of time stamps employed in traditional event detection algorithms, it is designed for retrospective, not online,
news event detection. For the online approach, Morinaga
and Yamanishi [16] use a Gaussian Mixture Model to deal
with text information and a time-stamp-based discounting
learning algorithm that tracks topic structures adaptively by
deleting out-of-date statistical data. For computational
simplicity, the Gaussian Mixture Model employed in [16]
assumes that the covariance matrices of all Gaussian
distributions are diagonal.
For online event analysis, Surendran and Sra [18]
proposed incrementally Built Aspect Models (BAMs) to
dynamically discover new themes from document streams.
BAMs are probabilistic models designed to accommodate
new topics with the spectral algorithm and use a fold-in
approach similar to that of the original PLSI. This approach
retains all the conditional probabilities of the old words,
given the old latent variables, and the spectral step is used
to estimate the probabilities of new words and new
documents. The new model becomes the starting point for
discovering subsequent new themes so that the latent
variables in the BAM model can be grown incrementally.
Under this approach, the probabilities of old latent variables
are retained, and the probabilities of new latent variables
are detected as needed while the streaming data is being
processed. This is an excellent means of applying incremental algorithms to online text analysis. Although this is a
new theme (or latent) detection mechanism, it is not an
online text clustering approach; therefore, its purpose
differs from that of online event analysis.
Chakrabarti et al. proposed a framework of evolutionary
clustering [6]. They argued that evolutionary clustering
should simultaneously optimize the clustering accuracy of
snapshots and the clustering consistency along a timeline.
In their work, a user-defined change parameter is required to balance the trade-off between the two objectives. The authors
proposed several greedy approaches to modify traditional
clustering methods, K-Means methods, and agglomerative
hierarchical clustering algorithms to fulfill their requirements. These greedy approaches provide users with a
smooth view of cluster changes when the input data drifts
from the current clusters. However, in the TDT new event
detection and tracking task, it is a requirement that every
news document should be grouped with documents related
to the same real-world event. Evolutionary clustering tends
to maintain the consistency of clustering by sacrificing the
clustering accuracy; hence, it is not suitable for event
analysis tasks.

3 THE PROBABILISTIC LATENT SEMANTIC INDEXING ALGORITHM AND THE PROPOSED ALGORITHM

3.1 Probabilistic Latent Semantic Indexing Algorithm
The PLSI model incorporates higher level latent concepts/
semantics to smooth the weights of terms in documents
[10], [11]. The latent semantic variables can be viewed as
intermediate concepts or topics placed between documents
and terms. Meanwhile, the associations between documents, concepts, and terms are represented as conditional
probabilities and are estimated by the EM algorithm, an
iterative technique that converges to a maximum likelihood estimator under incomplete data [9]. After the PLSI
parameters have been estimated, the similarities between
new documents (called query documents in [10], [11]) and
existing documents can be calculated by using the
smoothed term vectors. The PLSI algorithm, which can be
used in text classification and information retrieval applications [4], [12], achieves better results than traditional VSM
methods [10], [11]. We discovered another advantage of the
PLSI model in that it can expand the acceptable threshold
range. We discuss this aspect later in the paper.
The following notations are used in the PLSI algorithm: d
denotes a document, w denotes a term in a document, z
denotes a latent variable, D denotes the set of documents,
W denotes the set of terms, Z denotes the set of latent
variables, and q denotes a new (query) document. We also
use these notations in the proposed IPLSI algorithm.
In the PLSI algorithm, a latent variable z is introduced between documents and terms so that their association can be represented as conditional probabilities P(w|z) and P(z|d) [4], [11]. The probabilities can also be represented by P(z), P(d|z), and P(w|z), as reported in [3], [10]. The PLSI algorithm assumes that a term w and a document d are conditionally independent given z; that is, P(w,d|z) = P(w|z)P(d|z), P(w|z,d) = P(w|z), and P(d|z,w) = P(d|z). Using these definitions and assumptions, we can define a generative model for term/document co-occurrences as follows:

. select a document d with probability P(d),
. pick a latent class z with probability P(z|d), and
. generate a word w with probability P(w|z).

This yields an observation pair (d,w), whereas the latent class variable z is discarded. The translation of the data generation process into a joint probability model is shown as follows:

\[ P(w,d) = \sum_{z\in Z} P(z)\,P(w|z)\,P(d|z) = P(d)\sum_{z\in Z} P(w|z)\,P(z|d) = P(w)\sum_{z\in Z} P(d|z)\,P(z|w). \tag{1} \]
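As an illustration, the three-step generative process of (1) can be simulated directly. The sketch below assumes the three distributions are given as NumPy arrays with columns indexed by the conditioning variable, which is a notational assumption rather than part of the original formulation.

```python
import numpy as np

def sample_pair(P_d, P_z_given_d, P_w_given_z, rng):
    # One draw from the generative model of (1): select a document d, pick
    # a latent class z given d, generate a word w given z; z is discarded.
    d = rng.choice(len(P_d), p=P_d)
    z = rng.choice(P_z_given_d.shape[0], p=P_z_given_d[:, d])
    w = rng.choice(P_w_given_z.shape[0], p=P_w_given_z[:, z])
    return d, w

# Usage: rng = np.random.default_rng(0); d, w = sample_pair(P_d, P_zd, P_wz, rng)
```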

The parameters of the PLSI model are estimated by the iterative EM algorithm, which uses a training document set D to maximize the log-likelihood function L:

\[ L = \sum_{d\in D}\sum_{w\in d} f(w,d)\,\log P(w,d), \tag{2} \]

where f(w,d) is the frequency of a word w in a document d. The PLSI parameters P(w|z) and P(z|d) are initialized randomly and normalized, after which they are revised by applying the EM procedure iteratively until they converge. According to the Bayes rule, the conditional probability P(z|w,d) can be estimated by the following in the E (Estimation) step:

\[ P(z|w,d) = \frac{P(w|z)\,P(z|d)}{\sum_{z'\in Z} P(w|z')\,P(z'|d)}. \tag{3} \]

In the M (Maximization) step, the probabilities P(w|z) and P(z|d) can be estimated, respectively, by the following:

\[ P(w|z) = \frac{\sum_{d\in D} f(w,d)\,P(z|w,d)}{\sum_{d\in D}\sum_{w'\in d} f(w',d)\,P(z|w',d)}, \tag{4} \]

\[ P(z|d) = \frac{\sum_{w\in d} f(w,d)\,P(z|w,d)}{\sum_{w\in d} f(w,d)}. \tag{5} \]

The equations used in the M step are derived by the Lagrange Multiplier Method. Note that the parameters P(z|w,d), P(w|z), and P(z|d) are iteratively refined until they converge. The above process is called the training process of the PLSI model (see Fig. 2).

[Fig. 2. The PLSI training process.]

After the completion of the above training process, the estimated P(w|z) parameters are used to estimate new parameters P(z|q) and P(z|w,q) for a new document q. This is called the folding-in process [4], [10]. In this process, the probability P(z|q) is first initialized randomly and normalized; then, it is revised using (6) and (7) for the E and M steps, respectively. Note that in the EM procedure, all P(w|z) remain fixed. Consequently, the folding-in process can usually be accomplished in just a few iterations:

\[ P(z|w,q) = \frac{P(w|z)\,P(z|q)}{\sum_{z'\in Z} P(w|z')\,P(z'|q)}, \tag{6} \]

\[ P(z|q) = \frac{\sum_{w\in q} f(w,q)\,P(z|w,q)}{\sum_{w\in q} f(w,q)}. \tag{7} \]

When the PLSI algorithm is used in text classification applications, a document d is represented by a smoothed version of the term vector (P(w_1|d), P(w_2|d), P(w_3|d), ...), where

\[ P(w|d) = \sum_{z\in Z} P(w|z)\,P(z|d). \tag{8} \]

The new document q is represented by (P(w_1|q), P(w_2|q), P(w_3|q), ...), where

\[ P(w|q) = \sum_{z\in Z} P(w|z)\,P(z|q). \tag{9} \]

Then, after weighting by the IDF, the similarity between any two documents can be calculated by the following cosine function:

\[ \mathrm{sim}(\tilde{d}_1, \tilde{d}_2) = \frac{\tilde{d}_1 \cdot \tilde{d}_2}{|\tilde{d}_1|\,|\tilde{d}_2|}, \tag{10} \]

where

\[ \tilde{d} = \big(P(w_1|d)\cdot\mathrm{IDF}(w_1),\; P(w_2|d)\cdot\mathrm{IDF}(w_2),\; \ldots\big), \tag{11} \]

\[ \mathrm{IDF}(w) = \log\frac{N}{\mathrm{DF}(w)}. \tag{12} \]

In (12), N is the total number of documents (that is, the existing documents d plus new documents q), and DF(w) is the number of documents that contain the term w. In the original work on PLSI [10], [11], it was shown by experiment that the algorithm improves text classification performance.
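To make the estimation procedure concrete, the following is a minimal dense NumPy sketch of the training process (3)-(5), the folding-in process (6)-(7), and the smoothed representation (8)-(12). The matrix layout, iteration counts, and the small smoothing constant are illustrative assumptions, not details of the original implementation.

```python
import numpy as np

def train_plsi(F, K, iters=50, seed=0):
    """EM training of PLSI, following (3)-(5).
    F is a |W| x |D| term-document count matrix f(w, d).
    Returns P(w|z) as |W| x K and P(z|d) as K x |D|."""
    rng = np.random.default_rng(seed)
    W, D = F.shape
    P_wz = rng.random((W, K)); P_wz /= P_wz.sum(axis=0, keepdims=True)
    P_zd = rng.random((K, D)); P_zd /= P_zd.sum(axis=0, keepdims=True)
    for _ in range(iters):
        # E step (3): P(z|w,d) proportional to P(w|z) P(z|d).
        P_zwd = P_wz[:, :, None] * P_zd[None, :, :]        # |W| x K x |D|
        P_zwd /= P_zwd.sum(axis=1, keepdims=True) + 1e-12
        # M steps (4) and (5): reweight the posteriors by the counts f(w,d).
        T = F[:, None, :] * P_zwd
        P_wz = T.sum(axis=2)
        P_wz /= P_wz.sum(axis=0, keepdims=True) + 1e-12
        P_zd = T.sum(axis=0) / (F.sum(axis=0, keepdims=True) + 1e-12)
    return P_wz, P_zd

def fold_in(F_q, P_wz, iters=10, seed=0):
    """Folding-in of new (query) documents per (6) and (7); P(w|z) is held
    fixed, so only P(z|q) is estimated and convergence takes few iterations."""
    rng = np.random.default_rng(seed)
    K = P_wz.shape[1]
    P_zq = rng.random((K, F_q.shape[1]))
    P_zq /= P_zq.sum(axis=0, keepdims=True)
    for _ in range(iters):
        P_zwq = P_wz[:, :, None] * P_zq[None, :, :]
        P_zwq /= P_zwq.sum(axis=1, keepdims=True) + 1e-12
        P_zq = (F_q[:, None, :] * P_zwq).sum(axis=0)
        P_zq /= F_q.sum(axis=0, keepdims=True) + 1e-12
    return P_zq

def smoothed_idf_vectors(P_wz, P_zd, DF, N):
    """Smoothed, IDF-weighted document vectors per (8), (11), and (12);
    the cosine of two columns of the result then reproduces (10)."""
    idf = np.log(N / DF)               # DF: document frequency of each term
    return (P_wz @ P_zd) * idf[:, None]
```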
The model of the PLSI algorithm, which is learned from
training data, is constant during the whole processing
period; that is, data that arrives after the training phase does
not affect the original model. When a new document q is
input to the system, the parameters of the PLSI algorithm,
that is, P(z|w,d), P(w|z), and P(z|d), remain fixed; thus, only the parameters related to the new document q, that is, P(z|q) and P(z|w,q), are estimated in the folding-in process. For example, all the P(w|z) remain fixed during the folding-in process, even if the term w occurs in the new documents. In addition, new terms w_new, which only occur in new documents, are ignored in the folding-in process. In an
online application, new documents arrive continuously, so
the parameters of the PLSI algorithm must be updated
regularly to accommodate changes, because the concepts of
the latent variables may change. Since the original PLSI
algorithm cannot handle this process effectively, we
designed the IPLSI algorithm.

3.2 A Naive Incremental Probabilistic Latent Semantic Indexing Approach
In online event analysis, the system contains an initial set of
documents, and new documents arrive continuously. The
system compares an incoming document with existing
documents to decide which event the new document
belongs to, or it generates a new event if the document
does not relate to an existing event. Because of the aging
nature of events [7], an old inactive event is less likely to attract new documents than a recently active event. Therefore, an
event analysis system has to consider the temporal relations
of documents. Incorporating a lookup window is a popular
way of limiting the time frame that an incoming document
can relate to. An example of a window-based document
scope system is shown in Fig. 3. With each advance of the
window, which can be measured in time units or by a
certain number of documents, the system discards old
documents and folds in new ones.
[Fig. 3. The discarding and folding-in processes.]

A naive way of performing event analysis in a window-based system using the PLSI technique is to run the PLSI algorithm for each advance of the window. This means that for every advance of the window, the EM algorithm uses a new random initial setting to reestimate all the parameters of the PLSI algorithm. Although the approach is straightforward, some problems may arise. For example, the EM algorithm can only guarantee obtaining a local optimum, so different initial settings of the PLSI parameters may result in different local optima. Even if we choose the same initial setting every time, the EM algorithm may still derive different local optima if the processed document sets are different. Thus, the latent semantic variables may not be continuous for each window advance. In Section 6, we present some evaluation results that exemplify this discontinuous scenario. Another problem is that the long execution time of the naive approach for each window advance makes it unsuitable for online processing. We also address this point in Section 5.

3.3 The Proposed Incremental Probabilistic Latent Semantic Indexing Algorithm
In event analysis, an incoming document q is classified into an existing category or is assigned as the first story of a new category. In the original PLSI algorithm, only the probabilities P(z|q) and P(z|w,q) are estimated when a new document q is added to the system. In other words, the other system parameters are not adjusted, and new terms in the new documents are completely ignored. However, in the case of online analysis, the story line of an event may evolve, so a new document may actually indicate a turning point in the story line of the event. Hence, a new document d_new must be folded into the system with all other PLSI parameters, that is, P(z|w_old, d_old), P(z|w_old, d_new), P(z|w_new, d_new), P(w_old|z), P(w_new|z), P(z|d_old), and P(z|d_new), where d_old and w_old are old documents and old terms, respectively. Since the original PLSI algorithm is not suitable for online news analysis, we propose the IPLSI algorithm to resolve the problems in online event analysis.

In the IPLSI algorithm, the PLSI algorithm is executed once on the initial documents. Then, for each window advance, the IPLSI algorithm performs four steps to update the model (note that steps 2, 3, and 4 employ the EM algorithm):
1. Discard old documents and terms. As the time window advances, the IPLSI algorithm removes out-of-date documents d_out and old terms w_out not used in recent documents. During this process, the PLSI parameters P(w_out|z), P(d_out|z), P(z|w_out), and P(z|d_out) are also removed. In order to observe the basic principle of probability that the total probability will be equal to 1, the remaining P(w|z) and P(d|z) must be renormalized proportionally as follows (note that P_0(w|z) and P_0(d|z) are the probabilities of the remaining terms and documents, respectively, whereas W_0 and D_0 are the respective sets of the remaining terms and documents):

\[ P(w|z) = \frac{P_0(w|z)}{\sum_{w'\in W_0} P_0(w'|z)}, \qquad P(d|z) = \frac{P_0(d|z)}{\sum_{d'\in D_0} P_0(d'|z)}. \tag{13} \]

2. Fold in new documents. In this step, the new documents d_new are folded in, and all P(z|d_new) are initialized randomly. P(w|z) are fixed during this step and are used to estimate P(z|d_new). The EM algorithm is used to estimate P(z|w,d_new) and P(z|d_new), where the following are the E and M steps, respectively (note that the folding-in method, which is the same as that used in [4] and [10], simply replaces a query document q with a new document d_new):

\[ P(z|w,d_{new}) = \frac{P(w|z)\,P(z|d_{new})}{\sum_{z'\in Z} P(w|z')\,P(z'|d_{new})}, \tag{14} \]

\[ P(z|d_{new}) = \frac{\sum_{w\in d_{new}} f(w,d_{new})\,P(z|w,d_{new})}{\sum_{z'\in Z}\sum_{w\in d_{new}} f(w,d_{new})\,P(z'|w,d_{new})}. \tag{15} \]

3. Fold in new terms. New terms w_new found in the new documents are folded in. To estimate P(z|w_new), the P(d_new|z) must be calculated as follows, since P(z|d_new) have been estimated in the previous step (note that D_new is a set of new documents):

\[ P(z|w,d_{new}) = \frac{P(w|z)\,P(z|d_{new})}{\sum_{z'\in Z} P(w|z')\,P(z'|d_{new})}, \tag{16} \]

\[ P(d_{new}|z) = \frac{\sum_{w\in d_{new}} f(w,d_{new})\,P(z|w,d_{new})}{\sum_{d\in D_{new}}\sum_{w\in d} f(w,d)\,P(z|w,d)}. \tag{17} \]

After all P(d_new|z) have been calculated, P(z|w_new) are initialized randomly and are normalized. Meanwhile, the P(d_new|z) parameters are fixed and used to estimate P(z|w_new) for new terms. The EM algorithm is applied in this folding-in process, and the probabilities P(z|w_new,d_new) and P(z|w_new) are estimated as follows in the E and M steps, respectively:

\[ P(z|w_{new},d_{new}) = \frac{P(d_{new}|z)\,P(z|w_{new})}{\sum_{z'\in Z} P(d_{new}|z')\,P(z'|w_{new})}, \tag{18} \]

\[ P(z|w_{new}) = \frac{\sum_{d\in D_{new}} f(w_{new},d)\,P(z|w_{new},d)}{\sum_{d'\in D_{new}} f(w_{new},d')}. \tag{19} \]

4. Update the PLSI parameters. Before revising the PLSI parameters, we need to calculate P(w_new|z) and adjust P(w_old|z), because the values of P(w_new|z) did not exist in the previous window, and w_out may occur in the new documents d_new. For the new terms w_new, we use (18) to calculate P(z|w,d), and for the old terms w_old, we use (4) to calculate P(z|w,d). Next, to observe the basic principle that the total probability will be equal to 1, all P(w|z) are normalized using the following:

\[ P(w|z) = \frac{\sum_{d\in D\cup D_{new}} f(w,d)\,P(z|w,d)}{\sum_{d'\in D\cup D_{new}}\sum_{w'\in d'} f(w',d')\,P(z|w',d')}. \tag{20} \]

Then, the EM algorithm used by the PLSI algorithm, as described in (3), (4), and (5), is executed to revise all the PLSI parameters, because new terms w_new and new documents d_new have been introduced.

The IPLSI process for one window advance is summarized in the flowchart shown in Fig. 4. As the window advances, the four-step design of the IPLSI algorithm preserves the probability and continuity of the latent parameters during each revision of the model.

[Fig. 4. The flowchart of the IPLSI algorithm.]
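To summarize the four steps, the sketch below walks through one window advance, reusing the fold_in helper from the PLSI sketch in Section 3.1. It is a simplified illustration: the new-term fold-in of step 3, (16)-(19), is elided, and dense matrices stand in for whatever sparse structures a real implementation would use.

```python
import numpy as np

def iplsi_window_advance(F_old, P_wz, P_zd, keep_w, keep_d, F_new, em_iters=5):
    """One IPLSI window advance (steps 1, 2, and 4), a simplified sketch.
    F_old: previous window's count matrix; keep_w / keep_d: boolean masks of
    the terms and documents retained in the window; F_new: counts of the
    incoming documents over the kept vocabulary."""
    # Step 1: discard out-of-date documents and terms; renormalize P(w|z)
    # over the surviving vocabulary, per (13).
    P_wz = P_wz[keep_w]
    P_wz /= P_wz.sum(axis=0, keepdims=True)
    P_zd = P_zd[:, keep_d]             # each column of P(z|d) still sums to 1
    # Step 2: fold in the new documents with P(w|z) held fixed, per (14)-(15).
    P_zq = fold_in(F_new, P_wz)
    # Step 4: a few full EM sweeps of (3)-(5) over old + new documents,
    # warm-started from the carried-over parameters instead of a random
    # initialization; this is what preserves latent continuity.
    F_all = np.hstack([F_old[keep_w][:, keep_d], F_new])
    P_zd = np.hstack([P_zd, P_zq])
    for _ in range(em_iters):
        P_zwd = P_wz[:, :, None] * P_zd[None, :, :]
        P_zwd /= P_zwd.sum(axis=1, keepdims=True) + 1e-12
        T = F_all[:, None, :] * P_zwd
        P_wz = T.sum(axis=2)
        P_wz /= P_wz.sum(axis=0, keepdims=True) + 1e-12
        P_zd = T.sum(axis=0) / (F_all.sum(axis=0, keepdims=True) + 1e-12)
    return P_wz, P_zd
```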
Our proposed approach folds in new terms and new documents in separate steps. In contrast, a naive approach folds in such terms and documents simultaneously. The problem with this simple approach is that after each advance of the window, the sum of all probabilities of terms in W_0 under z is 1, that is,

\[ \sum_{w\in W_0} P(w|z) = 1. \tag{21} \]

After new documents arrive, all P(w_new|z) are initialized randomly, which means that the sum of the probabilities of terms under z will be larger than 1. Hence, the probabilities must be normalized such that

\[ \sum_{w\in W} P(w|z) = 1, \tag{22} \]

where W = W_0 ∪ W_new, and W_new is the set of new terms. However, as P(w_new|z) are random values between 0 and 1, it is meaningless to normalize the probabilities P(w|z) for all w in W.

To avoid the above problem, we fold in new documents in the second step. In the third step, we derive P(d_new|z) from P(z|d_new) by (16) and (17), and we fix P(d_new|z) to estimate P(z|w_new) by (18) and (19). This way, we can systematically ensure that the total probability is equal to 1, and we add the parameters of the new terms w_new and new documents d_new smoothly at different times, unlike in simple methods.

The advantages of the IPLSI algorithm are twofold. First, the convergence time of the EM algorithm is reduced substantially, because most parameters remain the same, or they are only modified slightly. The second advantage is that the continuity of the latent semantic variables is maintained. Because of these advantages, the proposed IPLSI algorithm is more effective and efficient than the naive IPLSI approach, which reestimates all the parameters by random initialization of the PLSI algorithm for each window advance.

4 CORPORA AND SETTINGS FOR EVALUATION

4.1 Corpora
In the evaluation, we used the standard corpora TDT-2, TDT-3, and TDT-4 from the NIST TDT corpora [22]. Only English documents tagged as definitely news topics (that is, tagged YES) were chosen for evaluation. The statistical data of the corpora is shown in Table 1.

[TABLE 1. The Statistical Data of the Evaluation Corpora]

4.2 Performance Metrics
We follow the performance measurements defined in [19]. An event analysis system may generate any number of clusters, but only the clusters that best match the labeled topics are used for evaluation.

[TABLE 2. A Cluster-Topic Contingency Table]

Table 2 illustrates a 2 × 2 contingency table for a cluster-topic pair, where a, b, c, and d represent the numbers of documents in the four cases. Four singleton evaluation measures, Recall, Precision, Miss, and False Alarm, and two primary evaluation measures, F1 and normalized Cost (also called the Normalized Detection Error Tradeoff Cost [5], [14]), are defined as follows:

. Recall = a/(a + c) if a + c > 0; otherwise, it is undefined.
. Precision = a/(a + b) if a + b > 0; otherwise, it is undefined.
. Miss = c/(a + c) if a + c > 0; otherwise, it is undefined.
. False Alarm = b/(b + d) if b + d > 0; otherwise, it is undefined.
. F1 = 2 · Recall · Precision / (Recall + Precision).
. Cost_Det = C_miss · P_target · Miss + C_FA · (1 - P_target) · False Alarm.
. (Cost_Det)_Norm = Cost_Det / min(C_miss · P_target, C_FA · (1 - P_target)).

In the definition of Cost, C_miss and C_FA are the costs of missed detection and false alarms, respectively, and P_target is the probability of finding a relevant story. According to the standard TDT cost function used for all evaluations in TDT, C_miss = 1, C_FA = 0.1, and P_target = 0.02 [14]. In the following, we use Cost to denote the Normalized Detection Error Tradeoff Cost (Cost_Det)_Norm.

In our evaluations, we apply the microaverage method to the global performance measurement. The microaverage is obtained by merging the contingency tables of the topics (by summing the corresponding cells) and then using the merged table to derive global performance measurements.
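These definitions translate directly into code. The sketch below assumes raw cell counts a, b, c, and d from a single (or micro-averaged) contingency table.

```python
def contingency_metrics(a, b, c, d, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Evaluation measures from a 2x2 cluster-topic contingency table:
    a = in cluster and on topic, b = in cluster but off topic,
    c = on topic but not in cluster, d = neither. Undefined measures
    are returned as None."""
    recall = a / (a + c) if (a + c) > 0 else None
    precision = a / (a + b) if (a + b) > 0 else None
    miss = c / (a + c) if (a + c) > 0 else None
    false_alarm = b / (b + d) if (b + d) > 0 else None
    f1 = None
    if recall is not None and precision is not None and (recall + precision) > 0:
        f1 = 2 * recall * precision / (recall + precision)
    cost_norm = None
    if miss is not None and false_alarm is not None:
        cost = c_miss * p_target * miss + c_fa * (1 - p_target) * false_alarm
        cost_norm = cost / min(c_miss * p_target, c_fa * (1 - p_target))
    return {"recall": recall, "precision": precision, "miss": miss,
            "false_alarm": false_alarm, "f1": f1, "cost_norm": cost_norm}
```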

4.3 Evaluation Baseline and the Threshold-Dependency Problem
We use the TW-DF method [19] as the baseline, since it is one of the most efficient online news event detection and tracking algorithms available [20], [21]. For comparison purposes, we also incorporate a time window and a time decay function into the proposed IPLSI algorithm. As the online process does not have any knowledge of incoming documents, statistics such as the IDF must be modified to process new vocabulary from such documents. For instance, the IDF is modified by substituting (12) as follows:

\[ \mathrm{IDF}(w,t) = \log\frac{N_t}{\mathrm{DF}(w,t)}, \tag{23} \]

where t is the current time, N_t is the number of documents in the window at time t, and DF(w,t) is the number of documents containing the term w at time t. The similarity between two documents in the same time window is modified by substituting (10) as follows, where T(d) is the time stamp of document d:

\[ \mathrm{sim}(d_1,d_2) = \left(1 - \frac{|T(d_1)-T(d_2)|}{\mathrm{window\ size}}\right)\frac{\tilde{d}_1 \cdot \tilde{d}_2}{|\tilde{d}_1|\,|\tilde{d}_2|}. \tag{24} \]

The similarity between a new document and an event is defined as the maximum similarity between the new
document and documents previously clustered into the
event. A document is deemed the first story of a new event
if its similarity with all the events in the current window is
below a predetermined threshold; otherwise, it is assigned
to the event that is the most similar.
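A small sketch of the modified statistics, assuming integer document counts and numeric time stamps; the linear (1 - gap/window) damping is our reading of the decay factor in (24).

```python
import math

def idf_t(df_w_t: int, n_t: int) -> float:
    # Windowed IDF per (23): computed only from the documents currently
    # in the time window, so new vocabulary is handled as it arrives.
    return math.log(n_t / df_w_t)

def decayed_sim(cosine: float, t1: float, t2: float, window_size: float) -> float:
    # Time-decayed similarity per (24): the cosine similarity of the two
    # IDF-weighted vectors, damped linearly by the time-stamp gap.
    return (1.0 - abs(t1 - t2) / window_size) * cosine

def doc_event_similarity(doc, event_docs, sim):
    # Document-to-event similarity: the maximum similarity between the new
    # document and the documents previously clustered into the event.
    return max(sim(doc, d) for d in event_docs)
```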
Following the conventions of the TDT contest, the
window size of TW-DF is set to 500 source files, which
covers a period of 30 to 40 days. The obtained detection
error trade-off costs and F1 scores are shown in Figs. 1a and
1b, respectively. Based on the figures, we observe that the
cost and the F1 score degrade sharply from
their respective optimal thresholds. In practice, it is not
possible to determine the true optimal threshold, because
there is no knowledge about incoming and future news
documents. Thus, to preserve the high performance for
nonoptimal thresholds, an event analysis algorithm has to be resilient. We call the range of thresholds whose performance is within an error bound of the best performance the Optimal Threshold-Resilient Range (OTRR). For
instance, the OTRR of the TW-DF method for TDT-4 for
F1 > 0.800 is 0.09, and for Cost < 0.3, it is 0.11. In the next
section, we show that the IPLSI can provide a wider OTRR
to alleviate the threshold-dependency problem.

5 PERFORMANCE EVALUATION

To evaluate the efficacy of the proposed IPLSI algorithm, we compare its performance with that of two traditional
methods and three variants of the PLSI algorithm. The
traditional methods are 1) the TW-DF algorithm [19], which
was mentioned in Sections 1 and 4, and 2) the Evolutionary
Clustering algorithm [6], which modifies traditional clustering methods so that they can support the evolutionary
clustering task. For the Evolutionary Clustering algorithm, we use the modified
K-Means method and pick the best performance from
different change parameters cp for comparison (see [6] for
further details).
The three PLSI variants are the original PLSI algorithm,
the Naive-IPLSI algorithm, and the IPLSI algorithm that
ignores all new words. As mentioned previously, the
original PLSI algorithm is not an online algorithm, so all
the documents for processing are required at the start time.
Therefore, the original PLSI algorithm is expected to
outperform all the online algorithms, because they do not
have complete knowledge of all the documents for
processing at the start time; that is, the online algorithms
use the knowledge of documents chronologically. The
second method is the Naive-IPLSI algorithm discussed in
Section 3.2. The third method is the IPLSI algorithm that
ignores all new words as the window advances. The
window size for each of the three variants and IPLSI is set
to 500 source files, which is identical to the setting of the
TW-DF algorithm. The deferral periods are all set to
10 source files. In this experiment, the number of latent
variables is set to 32. Experiments with different numbers of
latent variables are discussed later in this section. We
evaluated all the methods by using 99 threshold values
(from 0.01 to 0.99). Then, for each method, we picked the
threshold that yielded the best performance.
The results of the six methods evaluated on TDT-4 are listed in Table 3.

[TABLE 3. Performance of the Six Evaluated Methods]

The performance of the Evolutionary Clustering (K-Means) algorithm is the least accurate. This is
because the objective of this algorithm is to maintain the
consistency of clustering along a timeline, which may not be
suitable for applications like event analysis, where an
event's life cycle consists of the phases of creation, growth,
and decay.
The performance of the original PLSI algorithm is second
to that of the proposed IPLSI algorithm. It performs better
than the remaining methods, because it ignores the
chronological order of documents and uses the complete
knowledge of all documents, including any available
information about documents yet to be published. Even
so, the IPLSI algorithm still slightly outperforms the
original PLSI algorithm. The proposed IPLSI algorithm
and the Naive-IPLSI algorithm are both variants of the PLSI
algorithm, and they both use the EM algorithm to estimate latent variables. The latent variables generated by the Naive-IPLSI algorithm are discontinuous, whereas the
latent variables generated by the IPLSI algorithm are
continuous. This is indirect evidence that latent continuity
can improve the performance.
The IPLSI algorithm clearly outperforms the baseline and
the other online methods. The improvements in terms of the
best F1 and Cost over the baseline are 2.7 percent and
14.5 percent, respectively.
The F1 performance of the Naive-IPLSI algorithm is
slightly better than that of the baseline algorithm, but the
Cost performance is not as good. In the Naive-IPLSI
approach, the PLSI algorithm is applied for every advance
of the window; hence, it has a long execution time and large
memory space overhead compared to IPLSI. Thus, the
Naive-IPLSI approach is unsuitable for online processing.
The experiment was performed on a PC with an AMD
Athlon 3200+ CPU, 2 Gbytes of memory, and the Windows
XP Professional SP2 platform. Table 4 details the average
execution times and the number of iterations required to
achieve convergence. The computation time of the Naive-IPLSI approach is 10 times longer than that of the proposed IPLSI algorithm. For example, the total time of the Naive-IPLSI approach is 5,731 seconds (1:35:31), with the number of latent variables K set at 16. However, for the same setting,
the proposed IPLSI algorithm takes only 468 seconds (7:48),
which is the sum of the PLSI time (4:13) and the folding-in
time (3:35). IPLSI also converges much faster than the Naive-IPLSI approach.

[TABLE 4. Execution Times of Naive IPLSI and IPLSI]
To understand why the Naive-IPLSI approach does not
perform substantially better than the baseline method, we
investigate the continuity of the latent variables. We believe
that the IPLSI algorithm's good convergence time is due to
the fact that it successfully maintains the continuity of latent


semantics, as it discards out-of-date documents and old terms and folds in new ones. We consider the issue of latent
continuity in detail in Section 6.
To assess the influence of the number of latent variables used in the IPLSI algorithm, we set the number of variables at 16, 24, 32, 40, and 48. The results of the evaluations, as shown in Tables 5 and 6, demonstrate that the proposed IPLSI algorithm outperforms the TW-DF method by 1.85 percent and 14.73 percent for F1 and Cost, respectively, as mentioned previously.

[TABLE 5. The Results of the Proposed IPLSI Algorithm (Minimum Cost)]

[TABLE 6. The Results of the Proposed IPLSI Algorithm (Maximum F1)]
We further examined the impact of different numbers of latent variables by averaging the conditional probabilities P(w|d) of the models with different numbers of latent variables. In the experiment, we averaged the conditional probabilities of 24, 32, and 40 latent variables (denoted as IPLSI-1 in Tables 5 and 6) and 16, 24, 32, 40, and 48 latent variables (denoted as IPLSI-2 in Tables 5 and 6), as follows (both IPLSI-1 and IPLSI-2 outperformed the baseline approach in most cases, and in this evaluation, IPLSI is not sensitive to the number of latent variables):

\[ P^{*}(w|d) = \frac{1}{3}\big(P_{24}(w|d) + P_{32}(w|d) + P_{40}(w|d)\big), \tag{25} \]

\[ P^{*}(w|d) = \frac{1}{5}\big(P_{16}(w|d) + P_{24}(w|d) + P_{32}(w|d) + P_{40}(w|d) + P_{48}(w|d)\big). \tag{26} \]
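The averaging in (25) and (26) amounts to an elementwise mean of the smoothed P(w|d) matrices. The sketch below assumes they are NumPy arrays of identical shape; the variable names are hypothetical.

```python
import numpy as np

def average_models(smoothed_matrices):
    # Elementwise average of the smoothed P(w|d) matrices produced by IPLSI
    # models with different numbers of latent variables, per (25) and (26).
    return np.mean(smoothed_matrices, axis=0)

# For example, IPLSI-1 averages the 24-, 32-, and 40-latent-variable models
# (hypothetical names): P_avg = average_models([P24_wd, P32_wd, P40_wd])
```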
[Fig. 5. (a) Cost and (b) F1 of TW-DF and IPLSI.]

The F1 and Cost performances with different thresholds for the proposed IPLSI algorithm are shown in Fig. 5. We observe that the ranges of the OTRRs of the IPLSI algorithm are much larger than those of the baseline method. For example, Fig. 5a shows that for Cost < 0.3, the OTRR of the TW-DF method is 0.22 - 0.11 = 0.11, whereas the OTRR of IPLSI-32 is 0.42 - 0.17 = 0.25. Similarly, Fig. 5b shows that for F1 > 0.8, the OTRR of the TW-DF method is 0.22 - 0.13 = 0.09, whereas the OTRR of IPLSI-32 is 0.44 - 0.18 = 0.26.

[Fig. 6. OTRR of (a) Cost and (b) F1 for TW-DF and IPLSI.]
To examine the changes in OTRR over a spectrum of values of the evaluation metrics (for example, F1 and Cost), we apply the TW-DF and IPLSI algorithms to the
TDT-2, TDT-3, and TDT-4 corpora and record the OTRR
values, as shown in Fig. 6. This figure shows that the
OTRR of the proposed IPLSI algorithm for any value of
Cost and F 1 is wider than that of the TW-DF method. In
other words, the IPLSI algorithm is less dependent on the
selected threshold to achieve an acceptable performance.
Note that the OTRR properties of the other PLSI variants
are similar. This positive characteristic is evidently
inherited from the PLSI algorithm and is useful for real-world applications. In practice, the optimal threshold is usually unknown; thus, the proposed IPLSI algorithm can
help alleviate the threshold-dependency problem and
achieve a reasonable performance.

6 DISCUSSION

With regard to the issue of the continuity of the latent variables, Figs. 7 and 8 show part of the evolution of the variables in the proposed IPLSI algorithm and the Naive-IPLSI approach, respectively. In these figures, the numbers in the large blocks are event IDs, and the numbers in the small blocks are latent IDs. For instance, in the upper left-hand corner of Fig. 7, 40004 is the event ID, and 4 is the latent ID. To determine the event to which a latent variable belongs, we use the KL divergence rate [8] to measure the distance between the latent variables and events as follows:

\[ KL(e\,\|\,z) = \sum_{w\in e} p(w|e)\,\log\frac{p(w|e)}{p(w|z)}, \tag{27} \]

where e is an event, and p(w|e) is the normalized frequency of a word w that occurs in the documents of the event e. For a latent variable z, the event e with the smallest KL(e||z) below a certain threshold is deemed the event to which z belongs. In other words, if all KL(e||z) are greater than the threshold, then E(z) does not exist; otherwise, E(z) = arg min_e KL(e||z). Thus, an event may be associated with more than one latent variable.

[Fig. 7. Part of the evolution of the latent semantics in the proposed IPLSI algorithm.]

[Fig. 8. Part of the evolution of the latent semantics in the Naive-IPLSI approach.]
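A minimal sketch of this assignment rule, assuming the distributions are word-to-probability dictionaries; the eps floor for words unseen under z is an assumption, since the text does not specify smoothing.

```python
import math

def kl_rate(p_e, p_z, eps=1e-12):
    # KL(e||z) per (27); p_e and p_z map words to probabilities.
    return sum(pe * math.log(pe / max(p_z.get(w, 0.0), eps))
               for w, pe in p_e.items() if pe > 0)

def event_of_latent(p_z, event_dists, threshold):
    # E(z): the event with the smallest KL(e||z), if that divergence is
    # below the threshold; otherwise the latent variable stays unassigned.
    best = min(event_dists, key=lambda e: kl_rate(event_dists[e], p_z))
    return best if kl_rate(event_dists[best], p_z) <= threshold else None
```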
As shown in Fig. 7, the IPLSI algorithm successfully
maintains the continuity of the latent semantics as the
window advances. In contrast, the evolution of the latent
semantics in the Naive-IPLSI approach is discontinuous, as
shown in Fig. 8. This is because the Naive-IPLSI approach reruns the EM algorithm from a new random initialization for each window advance; thus, it requires a large number of iterations to converge to a different local optimum after the window advances.
However, in the proposed IPLSI algorithm, randomized
initiation is only needed for the first window, and the latent
semantics are adjusted by the discarding and folding-in
processes as the window advances. Hence, the algorithm
maintains the content of latent variables for each event. This
explains why the Naive-IPLSI approach takes much longer
to converge to the local optima than the proposed IPLSI
algorithm.
Based on Figs. 7 and 8, we observe that the latent
variables of the same event in the IPLSI algorithm are more
continuous than the latent variables in the Naive-IPLSI
approach. To measure the continuity, we calculate 1) the
average KL divergence rate of the same real event in two
adjacent time windows, as shown in (28), and 2) the average KL divergence rate between the two closest latent variables in two adjacent time windows of the
same event for both the Naive-IPLSI approach and the
IPLSI algorithm, as shown in (29).
\[ KL_{AVG\_Event}(t) = \frac{1}{|C_t|} \sum_{\substack{e_t\in C_t \\ E(e_t)=E(e_{t-1})}} \frac{KL(e_{t-1}\,\|\,e_t) + KL(e_t\,\|\,e_{t-1})}{2}, \tag{28} \]

\[ KL_{AVG\_Latent}(t) = \frac{1}{|C_t|} \sum_{e_t\in C_t} \min_{\substack{z_{t-1},\,z_t \\ E(z_{t-1})=E(z_t)=E(e_t)}} \frac{KL(z_{t-1}\,\|\,z_t) + KL(z_t\,\|\,z_{t-1})}{2}. \tag{29} \]

[Fig. 9. Average KL divergence of events and latent variables for the same event.]

In the above equations, C_t is the set of events that occur in both time window t and time window t-1, |C_t| is the number of events in the set, z_t and z_{t-1} are latent variables produced in time window t and time window t-1, respectively, e_t and e_{t-1} are events in time window t and time window t-1, respectively, and E(e_t) denotes the event ID of e_t. Note that comparing the differences among all time windows via an asymmetric distance measure, that is, the KL divergence, is not feasible; therefore, we average KL(e_{t-1}||e_t) and KL(e_t||e_{t-1}) in (28), and KL(z_{t-1}||z_t) and KL(z_t||z_{t-1}) in (29). Based on Fig. 9, we observe that for the
proposed IPLSI algorithm, the average KL divergence rates
between latent variables in adjacent windows of the same
event are much lower than those for the Naive-IPLSI
approach. They are also much closer to the average KL
divergence rates of events between adjacent windows than
those in the Naive-IPLSI approach. Furthermore, the
continuity of the content in the latent variables generated
by the Naive-IPLSI approach is clearly very unstable,
whereas the proposed IPLSI algorithm maintains good
continuity in the content of latent variables and in the
content of real events.
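The symmetrization used in (28) and (29) can be isolated in a one-line helper, reusing kl_rate from the sketch above.

```python
def sym_kl(p, q):
    # The symmetrized divergence inside (28) and (29): KL is asymmetric,
    # so the two directions are averaged before comparing adjacent windows.
    return 0.5 * (kl_rate(p, q) + kl_rate(q, p))
```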

7 CONCLUSIONS

Event analysis is a challenging research topic that has many applications, such as "hot news stories" provided by Internet portals, Internet event detection, e-mail event
Internet portals, Internet event detection, e-mail event
detection [16], and discussion board topic detection. The
challenge of online event analysis is to detect unknown
events and track their story line development from a
continuous document stream generated by uncoordinated
information sources. However, conventional text classification methods do not perform the tasks well, because the
temporal relationships among documents are difficult to
handle. The proposed IPLSI algorithm not only improves
event detection and reduces the computation time but also
alleviates the threshold-dependency problem, which traditional event analysis methods do not consider. The
performance of such methods depends on optimal thresholds that are usually unknown in practice. Even though the
thresholds are obtained by using training data sets, which is a popular practice in text classification, the story lines of
events develop over time; hence, the training data sets
become unrepresentative. The proposed IPLSI algorithm

alleviates the threshold-dependency problem in online
event analysis tasks. Furthermore, the algorithm successfully maintains the continuity of the latent semantics along the timeline and thus ensures the quality of event detection.

ACKNOWLEDGMENTS
Meng Chang Chen is the corresponding author. The authors
wish to thank the anonymous reviewers for their valuable
and constructive comments, which have helped improve
the quality of this paper. This work was supported in part
by the National Science Council of Taiwan under Grants 94-2524-S-001-001 and 95-2524-S-001-001 and by the National
Digital Archives Program, Taiwan.

REFERENCES
[1] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang, "Topic Detection and Tracking Pilot Study: Final Report," Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.
[2] J. Allan, R. Papka, and V. Lavrenko, "Online New Event Detection and Tracking," Proc. ACM SIGIR '98, 1998.
[3] D.M. Blei and P.J. Moreno, "Topic Segmentation with an Aspect Hidden Markov Model," Proc. ACM SIGIR '01, 2001.
[4] T. Brants, F. Chen, and I. Tsochantaridis, "Topic-Based Document Segmentation with Probabilistic Latent Semantic Analysis," Proc. 11th ACM Int'l Conf. Information and Knowledge Management (CIKM '02), 2002.
[5] T. Brants and F. Chen, "A System for New Event Detection," Proc. ACM SIGIR '03, 2003.
[6] D. Chakrabarti, R. Kumar, and A. Tomkins, "Evolutionary Clustering," Proc. ACM SIGKDD '06, 2006.
[7] C.C. Chen, Y.T. Chen, and M.C. Chen, "An Aging Theory for Event Life Cycle Modeling," IEEE Trans. Systems, Man, and Cybernetics Part A, vol. 37, no. 2, pp. 237-248, Mar. 2007.
[8] Language Modeling and Information Retrieval, W.B. Croft and J. Lafferty, eds. Kluwer Academic Publishers, 2003.
[9] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Royal Statistical Soc. B, vol. 39, pp. 1-38, 1977.
[10] T. Hofmann, "Probabilistic Latent Semantic Indexing," Proc. ACM SIGIR '99, 1999.
[11] T. Hofmann, "Unsupervised Learning by Probabilistic Latent Semantic Analysis," Machine Learning, vol. 42, pp. 177-196, 2001.
[12] X. Jin, Y. Zhou, and B. Mobasher, "Web Usage Mining Based on Probabilistic Latent Semantic Analysis," Proc. ACM SIGKDD '04, 2004.
[13] Z.W. Li, B. Wang, M.J. Li, and W.Y. Ma, "A Probabilistic Model for Retrospective News Event Detection," Proc. ACM SIGIR '05, 2005.
[14] R. Manmatha, A. Feng, and J. Allan, "A Critical Examination of TDT's Cost Function," Proc. ACM SIGIR '02, 2002.
[15] Q. Mei and C.X. Zhai, "Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining," Proc. ACM SIGKDD '05, 2005.
[16] S. Morinaga and K. Yamanishi, "Tracking Dynamics of Topic Trends Using a Finite Mixture Model," Proc. ACM SIGKDD '04, 2004.
[17] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[18] A. Surendran and S. Sra, "Incremental Aspect Models for Mining Document Streams," Proc. 17th European Conf. Machine Learning (ECML '06), 2006.
[19] Y. Yang, T. Pierce, and J. Carbonell, "A Study on Retrospective and Online Event Detection," Proc. ACM SIGIR '98, 1998.
[20] J. Zhang, Y. Yang, and J. Carbonell, "New Event Detection with Nearest Neighbor, Support Vector Machines and Kernel Regression," Technical Report CMU-CS-04-118 (CMU-LTI-04-180), Carnegie Mellon Univ., 2007.
[21] J. Zhang, Z. Ghahramani, and Y. Yang, "A Probabilistic Model for Online Document Clustering with Application to Novelty Detection," Proc. Conf. Neural Information Processing Systems (NIPS '04), 2004.
[22] NIST Topic Detection and Tracking Corpus, http://www.nist.gov/speech/tests/tdt/tdt98/index.htm, 1998.
gov/speech/tests/tdt/tdt98/index.htm, 1998.
Tzu-Chuan Chou received the MS and PhD
degrees in computer science and information
engineering from Tamkang University, Taipei, in
1998 and 2004, respectively. He is currently a
postdoctoral fellow in the Institute of Information
Science, Academia Sinica, Taiwan. His research
interests include clustering algorithms, information retrieval, image compression, and prediction
markets.

Meng Chang Chen received the BS and MS degrees in computer science from the National
Chiao-Tung University, Taiwan, in 1979 and
1981, respectively, and the PhD degree in
computer science from the University of California, Los Angeles, in 1989. He joined AT&T Bell
Laboratories in 1989 as a member of technical
staff and led several R&D projects in the area of
data quality of distributed databases for mission
critical systems. From 1992 to 1993, he was an
associate professor at the National Sun Yat-Sen University, Taiwan.
Since then, he has been with the Institute of Information Science,
Academia Sinica, Taiwan, where he is currently a research fellow. He
was the deputy director from August 1999 to July 2002. For three years
(from 2001), he was the chair of the Standards and Technology Transfer
Group, National Science and Technology Program for Telecommunications Office (NTPO), Taiwan. His current research interests include
information retrieval, knowledge management and engineering, wireless
network, QoS networking, and operating systems.
