Using Incremental PLSI For Threshold-Resilient Online Event Analysis
VOL. 20,
NO. 3,
MARCH 2008
INTRODUCTION
Fig. 1. The detection error trade-off costs of TW-DF and PLSI for TDT-2,
TDT-3, and TDT-4. (a) Cost of the TW-DF method. (b) F1 score of the
TW-DF method. (c) Cost of the PLSI algorithm. (d) F1 score of the PLSI
algorithm.
RELATED WORK
CHOU AND CHEN: USING INCREMENTAL PLSI FOR THRESHOLD-RESILIENT ONLINE EVENT ANALYSIS
3
3.1
$$P(w, d) = P(d) \sum_{z \in Z} P(w \mid z)\, P(z \mid d)$$
$$P(w, d) = P(w) \sum_{z \in Z} P(d \mid z)\, P(z \mid w).$$
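As a minimal sketch of the aspect-model factorization above, the following Python function evaluates P(d) multiplied by the sum over z of P(w|z) P(z|d) for a single word-document pair. The function name and the dictionary layout are illustrative assumptions, not part of the original description.

```python
def joint_probability(p_d, p_w_given_z, p_z_given_d):
    """Return P(d) * sum_z P(w|z) * P(z|d) for one (w, d) pair.

    p_d          -- prior probability of document d
    p_w_given_z  -- hypothetical dict: latent variable z -> P(w|z)
    p_z_given_d  -- hypothetical dict: latent variable z -> P(z|d)
    """
    # Mix the per-latent-variable word probabilities by P(z|d).
    mixture = sum(p_w_given_z[z] * p_z_given_d[z] for z in p_z_given_d)
    return p_d * mixture
```

With two latent variables, `joint_probability(0.5, {"z1": 0.2, "z2": 0.4}, {"z1": 0.5, "z2": 0.5})` mixes the two word probabilities equally before scaling by the document prior.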
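$$P(z \mid d) = \frac{\sum_{w \in d} f(w, d)\, P(z \mid w, d)}{\sum_{w \in d} f(w, d)}.$$
$$P(z \mid q) = \frac{\sum_{w \in q} f(w, q)\, P(z \mid w, q)}{\sum_{w \in q} f(w, q)}.$$
$$P(w \mid q) = \sum_{z \in Z} P(w \mid z)\, P(z \mid q).$$
$$\mathrm{sim}(\vec{d}_1, \vec{d}_2) = \frac{\vec{d}_1 \cdot \vec{d}_2}{|\vec{d}_1|\,|\vec{d}_2|}, \qquad (10)$$
where
$$\vec{d} = \langle P(w_1 \mid d) \times \mathrm{IDF}(w_1),\; P(w_2 \mid d) \times \mathrm{IDF}(w_2),\; \ldots \rangle, \qquad (11)$$
$$\mathrm{IDF}(w) = \log \frac{N}{\mathrm{DF}(w)}. \qquad (12)$$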
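The similarity computation can be sketched in Python as follows: each document is represented by its P(w|d) values weighted by IDF(w) = log(N / DF(w)), and two documents are compared by the cosine of their vectors. The helper names and the dictionary-based inputs are assumptions made for illustration only.

```python
import math


def idf(num_docs, doc_freq):
    """IDF(w) = log(N / DF(w))."""
    return math.log(num_docs / doc_freq)


def weighted_vector(p_w_given_d, idf_by_word):
    """Build <P(w|d) * IDF(w)> over the vocabulary in idf_by_word.

    Words missing from p_w_given_d are treated as having probability 0.
    """
    return [p_w_given_d.get(w, 0.0) * idf_by_word[w] for w in sorted(idf_by_word)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Sorting the vocabulary fixes one component order for every document, so vectors built from different documents are directly comparable.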
3.2
3.3
$$P(z \mid w_{new}) = \frac{\sum_{d \in D_{new}} f(w_{new}, d)\, P(z \mid w_{new}, d)}{\sum_{d' \in D_{new}} f(w_{new}, d')}. \qquad (19)$$
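A minimal Python sketch of the update for a new word: P(z|w_new) is the frequency-weighted average, over the new documents, of the posteriors P(z|w_new, d). The function name, argument names, and dictionary layout are hypothetical and chosen only for this example.

```python
def fold_in_new_word(freq, posterior, latent_ids):
    """Estimate P(z | w_new) for every latent variable z.

    freq       -- hypothetical dict: new document d -> f(w_new, d)
    posterior  -- hypothetical dict: (z, d) -> P(z | w_new, d)
    latent_ids -- iterable of latent variable identifiers
    """
    total = sum(freq.values())
    if total == 0:
        raise ValueError("w_new does not occur in any new document")
    return {
        z: sum(freq[d] * posterior[(z, d)] for d in freq) / total
        for z in latent_ids
    }
```

If the posteriors sum to one over z for each document, the returned values also sum to one, because the denominator is the total frequency of the new word.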
TABLE 1
The Statistical Data of the Evaluation Corpora
CORPORA AND SETTINGS FOR EVALUATION
4.1 Corpora
In the evaluation, we used the standard corpora TDT-2,
TDT-3, and TDT-4 from the NIST TDT corpora [22]. Only
English documents tagged as definitely news topics (that is,
tagged YES) were chosen for evaluation. The statistical data
of the corpora is shown in Table 1.
4.2 Performance Metrics
We follow the performance measurements defined in [19].
An event analysis system may generate any number of
clusters, but only the clusters that best match the labeled
topics are used for evaluation.
Table 2 illustrates a 2 × 2 contingency table for a cluster-topic pair, where a, b, c, and d represent the numbers of
documents in the four cases. Four singleton evaluation
measures, Recall, Precision, Miss, and False Alarm, and
two primary evaluation measures, F1 and normalized Cost
(also called the Normalized Detection Error Tradeoff Cost [5],
[14]), are defined as follows:
$$F1 = \frac{2 \times \mathit{Recall} \times \mathit{Precision}}{\mathit{Recall} + \mathit{Precision}}.$$
$$\mathit{Cost}_{Det} = C_{miss} \times P_{target} \times \mathit{Miss} + C_{FA} \times (1 - P_{target}) \times \mathit{False\ Alarm}.$$
$$(\mathit{Cost}_{Det})_{Norm} = \frac{\mathit{Cost}_{Det}}{\min\bigl(C_{miss} \times P_{target},\; C_{FA} \times (1 - P_{target})\bigr)}.$$
In the definition of Cost, $C_{miss}$ and $C_{FA}$ are the costs of
missed detection and false alarms, respectively, and $P_{target}$
is the probability of finding a relevant story. According to
the standard TDT cost function used for all evaluations in
TDT, $C_{miss} = 1$, $C_{FA} = 0.1$, and $P_{target} = 0.02$ [14]. In the
following, we use Cost to denote the Normalized Detection
Error Tradeoff Cost $(\mathit{Cost}_{Det})_{Norm}$.
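The two primary measures can be sketched in Python using the TDT constants quoted above. The function names are illustrative assumptions; the inputs are the singleton measures (Recall, Precision, Miss, False Alarm) already computed for a cluster-topic pair.

```python
C_MISS = 1.0     # cost of a missed detection
C_FA = 0.1       # cost of a false alarm
P_TARGET = 0.02  # probability of finding a relevant story


def f1_score(recall, precision):
    """F1 = 2 * Recall * Precision / (Recall + Precision); 0.0 if both are zero."""
    if recall + precision == 0.0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)


def detection_cost(miss, false_alarm):
    """Unnormalized detection cost."""
    return C_MISS * P_TARGET * miss + C_FA * (1.0 - P_TARGET) * false_alarm


def normalized_cost(miss, false_alarm):
    """Detection cost divided by the cheaper of the two trivial baselines."""
    baseline = min(C_MISS * P_TARGET, C_FA * (1.0 - P_TARGET))
    return detection_cost(miss, false_alarm) / baseline
```

With these constants the normalizer is min(0.02, 0.098) = 0.02, so a system that misses every story and raises no false alarms has a normalized Cost of 1.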
In our evaluations, we apply the microaverage method to
the global performance measurement. The microaverage is
obtained by merging the contingency tables of the topics (by
summing the corresponding cells) and then using the
merged table to derive global performance measurements.
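The microaverage procedure can be sketched as follows. Because the contingency table itself is not reproduced here, the cell layout is an assumption: a = in cluster and on topic, b = in cluster but off topic, c = on topic but not in cluster, d = neither. The function name is likewise hypothetical.

```python
def microaverage(tables):
    """Merge per-topic 2x2 tables by summing cells, then derive global measures.

    tables -- non-empty list of (a, b, c, d) tuples, using the ASSUMED layout:
              a = in cluster & on topic, b = in cluster & off topic,
              c = not in cluster & on topic, d = neither.
    """
    a, b, c, d = (sum(column) for column in zip(*tables))

    def ratio(numerator, denominator):
        # Guard against empty rows or columns in the merged table.
        return numerator / denominator if denominator else 0.0

    return {
        "recall": ratio(a, a + c),
        "precision": ratio(a, a + b),
        "miss": ratio(c, a + c),
        "false_alarm": ratio(b, b + d),
    }
```

Summing the cells first means that topics with many documents weigh more heavily in the global measures than small topics do, which is the distinguishing property of microaveraging.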
4.3
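$$\mathrm{IDF}(w, t) = \log \frac{N_t}{\mathrm{DF}(w, t)}, \qquad (23)$$
Eq. (24): $\mathrm{sim}(d_1, d_2)$, which combines the cosine term $\frac{\vec{d}_1 \cdot \vec{d}_2}{|\vec{d}_1|\,|\vec{d}_2|}$ with the time term $\frac{|T(d_1) - T(d_2)|}{\text{window size}}$.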
PERFORMANCE EVALUATION
TABLE 3
Performance of the Six Evaluated Methods
TABLE 5
The Results of the Proposed IPLSI Algorithm (Minimum Cost)
$$P(w \mid d) = \frac{1}{5}\bigl(P_{16}(w \mid d) + P_{24}(w \mid d) + P_{32}(w \mid d) + P_{40}(w \mid d) + P_{48}(w \mid d)\bigr). \qquad (26)$$
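A short Python sketch of this combination, assuming the equation is a plain arithmetic mean of the P(w|d) estimates produced by the five PLSI models with 16, 24, 32, 40, and 48 latent variables. The function name and the dictionary representation are illustrative assumptions.

```python
def average_word_probs(model_estimates):
    """Average P(w|d) over several models for one document d.

    model_estimates -- non-empty list of hypothetical dicts, one per model,
                       each mapping word w -> that model's P(w|d).
    Words absent from a model's dict count as probability 0 for that model.
    """
    vocabulary = set().union(*model_estimates)
    count = len(model_estimates)
    return {
        w: sum(estimate.get(w, 0.0) for estimate in model_estimates) / count
        for w in vocabulary
    }
```

If each input is a proper distribution over the same vocabulary, the average is also a distribution, so it can be used wherever a single model's P(w|d) would be.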
TABLE 4
Execution Times of Naive IPLSI and IPLSI
Fig. 6. OTRR of (a) Cost and (b) F1 for TW-DF and IPLSI.
method is 0.22 − 0.13 = 0.09, whereas the OTRR of IPLSI32 is 0.44 − 0.18 = 0.26.
To examine the changes in OTRR over a spectrum of
values of the evaluation metrics (for example, F1 and
Cost), we apply the TW-DF and IPLSI algorithms to the
TDT-2, TDT-3, and TDT-4 corpora and record the OTRR
values, as shown in Fig. 6. This figure shows that the
OTRR of the proposed IPLSI algorithm for any value of
Cost and F1 is wider than that of the TW-DF method. In
other words, the IPLSI algorithm is less dependent on the
selected threshold to achieve an acceptable performance.
Note that the OTRR properties of the other PLSI variants
are similar. This positive characteristic is evidently
inherited from the PLSI algorithm and is useful for real-world applications. In practice, the optimal threshold is
usually unknown; thus, the proposed IPLSI algorithm can
help alleviate the threshold-dependency problem and
achieve a reasonable performance.
small blocks are latent IDs. For instance, in the upper left
hand corner of Fig. 7, 40004 is the event ID, and 4 is the
latent ID. To determine the event to which a latent variable
belongs, we use the KL divergence rate [8] to measure the
distance between the latent variables and events as follows:
$$KL(e \,\|\, z) = \sum_{w \in e} p(w \mid e) \log \frac{p(w \mid e)}{p(w \mid z)}, \qquad (27)$$
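The assignment of a latent variable to an event can be sketched in Python as below. The divergence follows the sum over the words of e of p(w|e) log(p(w|e) / p(w|z)); the `epsilon` floor for words that the latent variable has never generated is an assumption added to keep the logarithm finite, and all names are hypothetical.

```python
import math


def kl_divergence(p_event, p_latent, epsilon=1e-12):
    """KL(e || z) over the words of event e.

    p_event  -- hypothetical dict: word -> p(w|e)
    p_latent -- hypothetical dict: word -> p(w|z)
    epsilon  -- ASSUMED floor for p(w|z) when a word is missing or zero
    """
    total = 0.0
    for word, p in p_event.items():
        if p > 0.0:
            q = max(p_latent.get(word, 0.0), epsilon)
            total += p * math.log(p / q)
    return total


def nearest_event(p_latent, events):
    """Return the ID of the event with the smallest KL(e || z)."""
    return min(events, key=lambda eid: kl_divergence(events[eid], p_latent))
```

A latent variable is then attributed to the event whose word distribution it diverges from least, which is how the latent IDs can be grouped under event IDs.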
DISCUSSION
Fig. 7. Part of the evolution of the latent semantics in the proposed IPLSI
algorithm.
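$$KL_{AVG\_Latent}(t) = \frac{1}{|C_t|} \sum_{e \in C_t} \; \min_{\substack{\text{for all } z_{t-1},\, z_t \\ \text{where } E_{z_{t-1}} = E_{z_t} = E_e}} \frac{KL(z_{t-1} \,\|\, z_t) + KL(z_t \,\|\, z_{t-1})}{2}. \qquad (29)$$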
CONCLUSIONS
ACKNOWLEDGMENTS
Meng Chang Chen is the corresponding author. The authors
wish to thank the anonymous reviewers for their valuable
and constructive comments, which have helped improve
the quality of this paper. This work was supported in part
by the National Science Council of Taiwan under Grants 94-2524-S-001-001 and 95-2524-S-001-001 and by the National
Digital Archives Program, Taiwan.
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]