Automatic Performance Evaluation for Video Summarization
Abstract
This paper describes a system for automated performance evaluation of video summarization
algorithms. We call it SUPERSIEV (System for Unsupervised Performance Evaluation of Ranked
Summarization in Extended Videos). It is primarily designed for evaluating video summarization
algorithms that perform frame ranking. The task of summarization is viewed as a kind of database
retrieval, and we adopt some of the concepts developed for performance evaluation of retrieval
in database systems. First, ground truth summaries are gathered in a user study from many
assessors and for several video sequences. For each video sequence, these summaries are combined
to generate a single reference file that represents the majority of assessors’ opinions. Then the
system determines the best target reference frame for each frame of the whole video sequence and
computes matching scores to form a lookup table that rates each frame. Given a summary from
a candidate summarization algorithm, the system can then evaluate this summary from different
aspects by computing recall, cumulated average precision, redundancy rate and average closeness.
With this evaluation system, we can not only grade the quality of a video summary, but also (1)
compare different automatic summarization algorithms and (2) make stepwise improvements on
algorithms, without the need for new user feedback.
Keywords:
Automatic performance evaluation, video summarization, frame ranking, average precision,
TREC, ground truth
Support of this research by Hitachi Systems Development Laboratory under Contract 0210108898 and by the
Department of Defense under Contract MDA904-02-C-0406 is gratefully acknowledged.
1 Introduction
Video summaries can serve several usage modalities:
1. Speed watching: helping the user understand videos at a faster pace than the normal pace.
This is a form of speed reading for videos.
2. Browsing: allowing the user to navigate the video at a higher altitude and ignore details,
and dive down when a region of interest is detected.
3. Query guided search: allowing the user to see subsets of videos in response to queries such
as “Show me all the goals in this recorded soccer game”.
For each of these modalities, the summaries can chop the videos either at the level of shots or
at the level of frames: in the first case, summaries take the form of video skims, in which some
shots are dropped completely, while others are played with audio but may be truncated [17, 13, 7];
in the second case, they use individual key frames. The key frames can be presented to the user
either in temporal form as a smart fast-forward function [1, 2], in spatial form as a storyboard
[14], or in spatial-temporal form as a storyboard of animated thumbnails.
Methods that let the user set up the level of summarization (also called abstraction level [8],
compression level [11], or altitude [1, 2]) of the video stream seem desirable as they provide
more control and more freedom compared with those that decide for the user what the best
level of summarization should be. A better solution seems to be a combination of these two
views, where the system is initialized to an optimal default level when it is started; the user can
explore other altitudes by dragging the slider of a scrollbar, and return to the default level by
clicking a button [1]. This slider control can be implemented by ranking individual frames or
shots. The evaluation system described in this paper was developed specifically for such ranked
summarization algorithms that rank each individual frame [1, 2]. For more complete surveys on
video summarization techniques, refer to [6, 8].
Existing evaluations of video summarization algorithms typically follow one of three approaches:
1. In the most elementary type, storyboards produced by the candidate algorithm and by a
competing algorithm (such as an elementary summarization by equal-time-interval sampling)
are shown in parallel figures, and the reader is asked to confirm that the candidate algorithm
has produced the better storyboard [2, 15]. It is evident that if there are more than two
algorithms to be compared, it becomes harder for readers to rank them without quantifying
their intuitive estimations against specified criteria.
2. In a more quantitative approach, a pool of assessors (often reduced to the author of the eval-
uated algorithm alone) review a given video, and designate specific frames (for key frame
summarization) or shots (for video skim generation) as being important for the understand-
ing of the video. Then the output key frames or shots of the algorithm are manually matched
by researchers to those of the ground truth. The matches are counted, and some measures,
such as precision and/or recall, are computed to grade the algorithms [6, 4, 12]. This grade
can be used to compare the candidate algorithm to a previously published one [4].
3. Another quantitative approach uses psychological metrics and evaluates the cognitive value
of summaries. Questionnaires are used to measure the degree to which users understand the
original video when they only see the summaries. Subjects are asked to judge the coherence
of summaries on a scale of 1 to n, and give their confidence in answering questions about
the content of the video (who, what, where, when) on preset scales [13, 7, 19], or give
their satisfaction when watching summaries compared with watching the whole video [19].
Summarization algorithms are then ranked on the grades received on these tests. This type
of evaluation is complex if it is to be done right. For example, assessors may have some prior
knowledge about the video content, so special questionnaires should be designed to address
this, e.g., subjects are asked to answer questions before and after they see the summary [7].
Furthermore, it is hard to decide the appropriate number of grades or scales; in [19], three
grades may be too coarse to reflect individual differences, while too many grades require very
detailed guidelines and definitions, thus increasing the difficulty for assessors.
The above approaches are similar in three main aspects, which we see as their main drawbacks.
First and foremost, they are designed only to tell whether an algorithm is good or bad, or better
or worse than another; they are not aimed at giving hints on how to improve the algorithm, which
we consider valuable to researchers and worth including in the evaluation. Second, the time
investment in the user study for evaluating one algorithm is not reusable
for evaluating another algorithm; all the effort has to be repeated each time an algorithm has been
changed or a new algorithm has been developed. Finally, these approaches are manual, which is
very time consuming and error-prone.
On the other hand, there are several well-developed automatic evaluation systems in the fields of
language processing, specifically in the areas of retrieval of relevant text documents [16], machine
translation [10], and automatic text summarization [11]. Though they are not targeted at video
summarization algorithms, their principles and techniques can be adopted and adapted to our
purpose.
In the TREC protocol, the algorithm under study returns a ranked list of elements, and a precision
value is recorded each time an element is matched to a relevant item in the reference file. When a
specified number N of elements from the ranked list has been considered, all the recorded precisions
are averaged to give the final mark. This protocol yields a grade for each comparison to a reference
file. After repeating the process for many data sets, the average of the resulting grades, called the
mean average precision (MAP), is used as a global grade (between 0 and 1) for the algorithm under study.
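To make the protocol concrete, the following minimal sketch (in Python; the function and variable names are ours, not TREC's) computes average precision for one ranked list and MAP over several reference files, under the reading of the protocol given above.

def average_precision(ranked_ids, relevant_ids, n):
    # A precision value is recorded each time a ranked element matches a
    # relevant item; the recorded precisions are averaged after n elements.
    hits, precisions = 0, []
    for rank, elem in enumerate(ranked_ids[:n], start=1):
        if elem in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    # runs: list of (ranked_ids, relevant_ids, n) tuples, one per data set.
    return sum(average_precision(r, rel, n) for r, rel, n in runs) / len(runs)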
In TREC, the elements are documents, and the algorithms under study are designed for query-
based retrieval of documents. An instance would be “Find all the documents about the indepen-
dence of Quebec”, and assessors were asked to mark the documents as relevant or non-relevant
under this criterion. In the evaluation of the evaluation system itself, it was shown that assessors
sometimes have non-overlapping opinions about which documents are relevant for specific queries,
but that MAP averages out these differences and is a useful indicator of quality for algorithm
improvement [16]. Clearly, the principles of the TREC evaluation can be applied to any system
that ranks elements of a data set according to some criterion, provided that human assessors can
be used to produce reference files under the same criterion.
The principle of BLEU (BiLingual Evaluation Understudy), a system developed at IBM for
evaluation of machine translation [10], is similar to that of TREC, but the grading process is
different: a single precision is calculated, using the rule that marking an element as relevant
by an algorithm is correct as long as at least one assessor judged it relevant; conversely,
it counts as incorrect if none of the assessors regarded it as relevant. In the case of
machine translation, the data set is the whole vocabulary of the target language, and a specific
translation can be interpreted as a file in which the words required to express the meaning of
the translation are marked as relevant in the data set, while the other words are not relevant.
The order of the translated words is not used in the evaluation; i.e., a scrambled translation gets
the same grade as one in a grammatically correct order. Also, there is no attempt to match
words with their synonyms; only exact matches are used. Intuitively, the need for synonyms
subsides as more translations are pooled as references and more variations for translating the
same sentence are included. In spite of its simplicity, BLEU was shown to be useful for improving
machine translation algorithms, even when few reference translations are used [16]. There were
attempts to extend this framework to the domain of automatic text summarization, with mixed
preliminary results [11]. Note that in BLEU, the algorithm is asked to perform the same task
as the human assessors (mark elements of the data set as relevant). Therefore, with BLEU one
can take a human assessment and grade it as if it came from a machine. This was useful to show
that BLEU consistently gives better grades to humans than to machines, a comforting sanity
check. Such tests are not possible with TREC-type evaluation systems, since the element ranking
by algorithms is required for MAP estimation, while assessors are not asked to rank the data set
elements (it would be too much work). Even with this advantage for BLEU, the average precision
of the TREC approach is more appropriate for evaluating ranked summarization algorithms for
videos than the basic precision used by BLEU, as will become clear in Section 4.
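The pooled-relevance rule described above can be sketched as follows; this is a simplification for illustration (it ignores BLEU's n-gram machinery and brevity penalty), and the names are ours.

def pooled_precision(candidate_elements, assessor_relevance_sets):
    # An element counts as correct if at least one assessor marked it relevant.
    pooled = set().union(*assessor_relevance_sets)
    correct = sum(1 for e in candidate_elements if e in pooled)
    return correct / len(candidate_elements) if candidate_elements else 0.0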
Figure 2: Three steps of fine-to-coarse polygon simplification, where at each step the vertex
removal that produces the minimal reduction in the total length of the polygon is selected. A
removed vertex is indicated by a crossed-out point and a line segment bypassing the point. Note
that the points removed first are those most aligned with their neighbors. The mapping ensures
that such points correspond to predictable frames. The order of removal provides a ranked list
of video frames from least relevant to most relevant. The process is stopped when only the end
points remain.
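The ranking procedure sketched in the caption can be written down as follows, assuming each frame has been reduced to a feature vector; the vector construction and the distance are placeholders for whatever representation the summarizer uses.

import numpy as np

def rank_by_polygon_simplification(features):
    # features: list of per-frame feature vectors tracing a polygonal curve.
    # Repeatedly remove the interior vertex whose removal shrinks the total
    # polygon length the least; frames removed earlier are less relevant.
    d = lambda a, b: float(np.linalg.norm(np.asarray(features[a]) - np.asarray(features[b])))
    idx = list(range(len(features)))   # surviving frame indices
    removal_order = []                 # filled from least to most relevant
    while len(idx) > 2:
        costs = [(d(idx[k - 1], idx[k]) + d(idx[k], idx[k + 1]) - d(idx[k - 1], idx[k + 1]), k)
                 for k in range(1, len(idx) - 1)]
        _, k = min(costs)
        removal_order.append(idx.pop(k))
    return removal_order + idx         # least relevant first; end points are never removed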
We apply the automatic evaluation system to compare the performance of the distance function
using the following feature vectors, which share the same construction scheme but combine
different color components.
• Basic YUV (bYUV for short): We define four histogram buckets of equal size for each of the
three color attributes in the YUV color space of MPEG encoding. Each bucket contributes
three feature components: the pixel count and the 2D spatial coordinates of the centroid of
the pixels in the bucket (see the sketch after this list).
• Basic Y (bY): same as Basic YUV, with removal of U and V information.
• Smooth YUV (sYUV): There could be occurrences when a large population of pixels falls
in one bin, but very close to the border with a neighbor bin. Illumination flickering could
make this population jump to the neighbor bin, producing a spike in the trajectory and
an erroneous high relevance. To make the histogram less sensitive to noise, we smooth a
256-bin histogram by Gaussian convolution to make the large peaks spread out over several
bins and then combine bins of the smoothed histogram to form the final histogram as in
Basic YUV.
• Smooth Y (sY): same as Smooth YUV, with removal of U and V information.
• 3DYUV: This construction considers each pixel to be a 3D point with Y, U, V components.
Each bin has the same three components as in other constructions.
• logRGB : Here we apply Slaney’s technique for constructing a 3D histogram [12]. The total
bin number is 512, and each bin index is expressed using 9 bits.
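As an illustration of the Basic YUV construction, here is a minimal sketch; it assumes the frame is already available as separate Y, U and V planes with values in [0, 255], and the array names are placeholders.

import numpy as np

def basic_yuv_features(y, u, v, buckets=4):
    # For each channel and each of `buckets` equal-size value ranges, emit the
    # pixel count and the 2D centroid (row, column) of the pixels in the range.
    rows, cols = np.indices(y.shape)
    feats = []
    for channel in (y, u, v):
        bins = np.minimum(channel.astype(int) * buckets // 256, buckets - 1)
        for b in range(buckets):
            mask = bins == b
            count = int(mask.sum())
            if count:
                feats += [count, float(rows[mask].mean()), float(cols[mask].mean())]
            else:
                feats += [0, 0.0, 0.0]
    return np.array(feats)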
4 System Design
Before we describe the evaluation system’s architecture and performance measures for video sum-
marization algorithms, a key issue we must face and solve is how to define a good summary in
general. With algorithm improvement as our final goal, we need to compare the outputs of can-
didate algorithms with some benchmark summaries, which should be optimal ones. We think
summaries from human assessors can be used as the benchmarks for two reasons: first, video
summaries are targeted at human users, and people know what a good summary is and how to make
one; second, we neither believe nor expect that computers can outperform general human assessors
in understanding video content.
Therefore, we define a good video summarization algorithm as one producing summaries that
match human summaries well. For improvement purposes, it is preferable to break this general
definition into detailed aspects, i.e., to define a “good match” from different viewpoints, so that
it will be clear to researchers which aspects of the algorithm contribute to its overall performance
after evaluation. Specifically, a good match implies four aspects:
1. As many frames in the reference summary as possible are matched by the candidate summary;
2. As few frames in the candidate summary as possible fail to match reference frames;
3. As few frames in the candidate summary as possible match the same reference frame;
4. Each matched frame in the candidate is as similar as possible to its corresponding target frame
in the reference summary.
Generally, the first three aspects are necessary, and (4) is a reinforcement of (1). But these
are not always sufficient requirements. For specially designed video summarization algorithms,
additional aspects should be considered, e.g. for frame ranking algorithms such as [1], one would
also like to see whether the top ranked part of the output frame list is enjoyable and informative
enough (See Section 4.2.2).
Once we have reached an agreement on what makes a good summary, we can build our system
architecture and design measures to evaluate the given candidate summary from different aspects.
Intuitively, there are two possible schemes to evaluate a candidate summary against the reference
summaries from assessors. One is to compare the candidate summary with each of the reference
files and compute the performance measures for each comparison; as Fig. 3(a) shows, the final
performance is the average of the scores over all reference summaries. The other, as shown in
Fig. 3(b), synthesizes reference summaries first to get a final reference file, and then evaluates the
candidate summary against this synthesized reference.
(a) (b)
Figure 3: Two possible schemes. (a) Compare the candidate with each of the N references and
compute the mean performance. (b) Synthesize the N references into one and compute the
performance against this synthesized one.
When the number of assessors is small, the advantage of synthesizing is not so significant.
However if there are many assessors, through synthesis a large amount of time can be saved by
avoiding repeating the same frame pair matching procedure. In addition, we argue that synthesis
can lead to a more reasonable and efficient performance evaluation (See Section 6.1) and we adopt
this scheme in SUPERSIEV.
The first three aspects of a good match listed in Section 4 can be interpreted as high recall, high precision, and low redundancy, respectively.
It is evident that the comparison of two frame lists is based on the comparison of single frame
pairs, thus we first need to define a distance measure between frames. Clearly, within shots,
two frames with exactly the same index can be safely matched, but in addition, two frames with
different indices could also be matched if their visual content is similar. For long static shots,
two frames could be several seconds apart in the video and still be good matches. Therefore,
we need to define a measure of dissimilarity between frames characterizing the distance between
visual content of the frames. In our experiment, we represent each frame as a thumbnail, by
applying a 2D Gaussian smoother to the original image and line-scanning all pixels to form a
high-dimensional vector. The distance between frames is computed as the Euclidean distance between
these feature vectors, although the proposed performance evaluation scheme could accommodate
other distances as well.
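A minimal sketch of this representation and distance follows; the smoothing width and thumbnail scale are illustrative choices, not the exact parameters used in our experiments.

import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def frame_vector(frame, thumb_scale=0.125, sigma=2.0):
    # frame: H x W x 3 array. Smooth with a 2D Gaussian, shrink to a thumbnail,
    # and line-scan the pixels into a single high-dimensional vector.
    smoothed = gaussian_filter(frame.astype(float), sigma=(sigma, sigma, 0))
    thumbnail = zoom(smoothed, (thumb_scale, thumb_scale, 1), order=1)
    return thumbnail.ravel()

def frame_distance(frame_a, frame_b):
    # Euclidean distance between the two thumbnail feature vectors.
    return float(np.linalg.norm(frame_vector(frame_a) - frame_vector(frame_b)))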
In the following subsections, we describe how we adapted conventional performance measures
to our purpose of evaluating curve simplification video summarization algorithms.
4.2.1 Recall
In our settings, the goal of recall is to measure how many frames in the ground truth are matched
by the candidate summary. A reference frame is matched if there exists at least one frame in the
candidate summary that is similar to the reference frame according to some preset criterion.
Recall is computed as:
Recall = N_{ref,m} / N_{ref}
where N_{ref} is the total number of frames in the reference summary, and N_{ref,m} is the number of
reference frames that are matched.
Precision is computed as the number of matched reference frames divided by the number of
candidate frames used, evaluated at the point where the candidate summary first reaches 100%
recall. Since N_{ref} is the same for all candidates, precision can be interpreted as the time
cost of reaching a certain degree of recall.
Is this enough for comparing such algorithms? Let’s look at an example. As Fig. 4 illustrates,
the reference summary is composed of five frames (which are matched by Class A frames) and
the Case 1 candidate summary and Case 2 candidate summary both use eight frames to reach
100% recall, i.e. precision is the same for both cases (P = 5/8 = 0.625). Does this mean the
two have equal performance? For frame ranking video summarization algorithms, one would also
like to evaluate the ranking performance without obtaining ranked reference summaries. We can
measure the ranking quality in a more general way. As Fig. 4 shows, it is easy to tell that the
candidate summary in Case 1 is better than that in Case 2, because high relevance frames in the
first candidate correspond to reference frames.
Figure 4: Evaluation of video summarizers that rank frames by relevance. In Case 1, frames
of highest relevance correspond to reference frames. In Case 2, one has to go further down the
ranked list to retrieve all the reference frames.
Based on this observation, we see that the precision at the point where recall first reaches a certain
degree is not enough to reflect the quality of the whole ranked list, so we turn to Average Precision
(AP for short), which averages the precisions recorded each time a new Class A frame is encountered:
AP = \frac{1}{N_{ref,m}} \sum_{k} p(k) I(k)

where p(k) is the precision at the k-th frame in the candidate summary, and I(k) is an indicator
that is 1 only when frame k increases recall, i.e., is a Class A frame that brings a new frame match.
The previous example shows that AP captures the requirement for the candidate summary of
Case 1 to receive a higher mark than that of Case 2. For Case 1, a new Class A frame is found
in iterations 1, 2, 3, 4 and 8, and its average precision is
AP_1 = (P_1 + P_2 + P_3 + P_4 + P_5)/5 = (1/1 + 2/2 + 3/3 + 4/4 + 5/8)/5 = 0.925
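Recall and AP as defined above can be sketched as follows; matched_reference is a hypothetical helper that returns the reference frame matched by a candidate frame (or None), for instance by thresholding the frame distance.

def recall_and_ap(ranked_candidate, matched_reference, n_ref):
    # ranked_candidate: candidate frame indices, most relevant first.
    # matched_reference(f): reference frame matched by candidate frame f, or None.
    matched_refs, precisions = set(), []
    for k, frame in enumerate(ranked_candidate, start=1):
        ref = matched_reference(frame)
        # I(k) = 1 only when frame k brings a new reference frame match (a Class A frame)
        if ref is not None and ref not in matched_refs:
            matched_refs.add(ref)
            precisions.append(len(matched_refs) / k)   # p(k)
    recall = len(matched_refs) / n_ref
    ap = sum(precisions) / len(precisions) if precisions else 0.0
    return recall, ap

On the Case 1 example above, the loop records the precisions 1/1, 2/2, 3/3, 4/4 and 5/8, reproducing AP_1 = 0.925.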
Figure 5: AP vs. frame index plot of two ranked frame lists for video FlyingCar.mpeg. Summary 1
uses 38 frames to reach 100% recall and has a much higher AP than Summary 2, which uses 27
frames to reach 100% recall.

Figure 6: CAP vs. frame index plot of the two candidate summaries in Fig. 5.
where p(k) is the precision at the k-th frame in the candidate summary. Fig. 6 shows the CAP for
the same two summaries.
Figure 7: 0-1 match for a single frame pair.

Figure 8: Probability match for a single frame pair.
where N_{can,v} is the number of frames in the candidate summary that can match at least one
frame in the reference summary. For frame-ranking-based summarization algorithms, N_{can,v} is no
more than N_{100}.
where d_i is the distance between the i-th matched reference frame and its corresponding candidate
frame, and V(d_i) is a mapping function with range [0, 1].
So far, we have discussed the four aspects mentioned in the beginning of Section 4 and designed
performance measures for them. In the following sections, we describe the evaluation process and
the reference gathering and processing procedures.
5 Evaluation Process
After reference synthesis, we can now evaluate a candidate summary by single-frame matching
tests and sequence alignment.
5.1 Continuous Scoring Function
As mentioned when introducing the performance measure Average Closeness, the typical 0-1
classification enforces a discontinuity: the two frames lying just below and just above the distance
threshold T are assigned to different classes (see Fig. 7), even though they are very much alike and
are probably neighboring frames along the time axis. This enforced discontinuity contradicts human
judgment. Fig. 8, on the other hand, “softens” the classification by applying a continuous matching
score function of image distance, so that two frame pairs can be compared in more detail.
For choosing such a continuous function, its desirable properties should be first determined.
Obviously, it should be a function of image distance and decrease as distance increases. But
a simple reciprocal function is not suitable because it is infinite for zero distance. The function
should have the following properties:
1. continuous;
2. decreasing as the image distance increases;
3. finite and equal to 1 for zero distance, so that identical frames receive a full score.
These properties let one think of the exponential family. In SUPERSIEV, we define the mapping
function as e^{-dist(i,a)^2/\lambda^2}, where dist(i,a) is a distance metric and i and a are the indices of the frames
whose matching degree is to be estimated. λ is an adjustable parameter, reflecting the subjective
strictness exerted in the scoring procedure. One way to choose λ is to define a reference matching
score for two randomly selected images and then to do a parameter estimation. In our experiments,
we set λ = 10. The value itself does not contain much intrinsic information; it is meaningful and
useful only in comparisons.
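A one-line sketch of this scoring function, with λ = 10 as in our experiments:

import math

def matching_score(dist, lam=10.0):
    # Continuous match score in (0, 1]: 1 for identical frames, decaying
    # smoothly toward 0 as the image distance grows.
    return math.exp(-dist * dist / (lam * lam))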
5.3 Offline Lookup Table
When the number of candidate summaries that need to be compared is large, there would be many
repetitive frame pair matching tests, which degrades the system's efficiency. Here we propose to use
an offline lookup table to accelerate the online evaluation procedure. Given a reference summary,
the matching target and matching score of each frame in the video sequence can be determined
shot by shot. Specifically, the lookup table of one shot can be computed in the following steps:
1. locate the potential targets of each frame i, which are its temporal neighbors among the
reference frames of the same shot;
2. for each potential target of frame i, compute its matching score using the continuous scoring
function mentioned above;
3. if frame i has two potential targets, the one with the larger matching score is chosen as
the final target of frame i, and this target's frame index and matching score are recorded
in the lookup table.
After the lookup table is obtained, the online matching procedure for a single frame in the
candidate summary only consists of retrieving the corresponding table entries by frame index.
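A minimal sketch of the table construction for one shot, reusing the frame_distance and matching_score sketches above; frames, shot_frames and reference_frames are hypothetical containers of the decoded frames, the shot's frame numbers, and the reference frame numbers inside the shot (sorted by time).

import bisect

def build_lookup_table(frames, shot_frames, reference_frames, frame_distance, matching_score):
    # Returns {frame number: (best reference frame, matching score)} for the shot.
    table = {}
    for f in shot_frames:
        # potential targets: the temporal neighbors of f among the reference frames
        pos = bisect.bisect_left(reference_frames, f)
        candidates = reference_frames[max(0, pos - 1):pos + 1]
        if not candidates:
            continue
        score, target = max((matching_score(frame_distance(frames[f], frames[r])), r)
                            for r in candidates)
        table[f] = (target, score)
    return table

# Online matching of a candidate frame is then a single table lookup:
#   target, score = table[candidate_frame]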
Generally speaking, a summary should be concise and keep the main points of the content of
the original material. This is very straightforward, yet still too general to allow users to grasp the
appropriate abstraction level. So we give detailed guidelines as follows. A summary must be:
1. as complete as possible, preserving the main content of the original video;
2. as neutral as possible, to let the viewers themselves discern the abstract meanings.
For the first guideline to be well carried out, we suggest to our subjects that they summarize
only after viewing the whole video so that they can get an overall impression about the content.
We also suggest that after summarization they replay the summary in the user study interface (see
Fig. 10) to refine it, for example by deleting unimportant or repetitive consecutive frames.
The second guideline reflects the neutrality of automatic summarization algorithms. People from
different backgrounds may have different viewpoints about the same video, so we ask subjects to
try to exclude their personal preferences as much as possible, suggesting that they put themselves
in the position of making a summary for a general audience. This applies to general summarization
algorithms; if an algorithm is aimed at generating summaries for special audiences, the guideline
should be adapted correspondingly.
6.3.1 Motivation
As we mentioned before, there are two ways to evaluate a candidate summary against a pool of
reference summaries from assessors, and we argue that synthesizing references can lead to a more
reasonable evaluation besides its time efficiency. Here we point out the shortcomings of the scheme
based on N comparisons and performance averaging, as shown in Fig. 3(a), to demonstrate the
advantage of reference synthesis.
One problem is illustrated in Fig. 11. In this case, reference S1 selects frames A, B and E for a
shot, and reference S2 selects A and E for the same shot. Suppose candidate summary C only
selects D for the shot, and D is within the distance threshold from A and B but is not similar to
E. Then, when C is evaluated against S1, D is matched to B because of the enforcement of time
order, or perhaps because of its shorter distance to B than to A; while when C is evaluated against S2, D is
matched to A. This does not seem to affect the results of recall, AP or CAP, but it does affect average
closeness. The inconsistency of D matching different reference frames for different references
makes the performance averaging procedure unreasonable.
Figure 11: Inconsistency of matching targets makes the averaged performance unreasonable.
The second problem is with the rationality of the averaging procedure. By averaging perfor-
mance results over all individual comparisons against each reference, the final performance favors
a candidate that compromises with all references under the criterion of minimum summed distances.
In Fig. 12, the line is some kind of projection axis (e.g. a performance measure); the filled black
circles represent the reference summaries; the circled 2 represents their mean; and the circled 1
represents a local mean. When the circled 2 is viewed as the optimal, the candidate summary
(represented as a triangle) that is close to it is better than the one that is near to the circled 1.
However, it is impractical and not our aim to design an auto-summarization system that can cater
to all users; our goal is to generate a video summary that matches the opinion of the majority of
users. Thus, for this purpose, it is clear that the circled 1 is a better choice of optimum than the circled 2.
Figure 12: A good summary under the mean performance criterion may deviate far from the
majority opinion.
To deal with the above two problems, we adopt the second evaluation scheme (Fig. 3(b)): we
use a synthesis of all the reference summaries that reflects the mainstream opinion among all
assessors, and then evaluate the candidate summary against this synthesized summary.
Figure 13: Pseudocode for reference synthesis
Figure 14: Histogram used to decide the summary length of a shot
• FlyingCar.mpeg, composed of 2 shots, is a sports news story full of quick action. The whole
length is 451 frames, and the length of the synthesized reference summary is 12 frames.
• WizzardDies.mpeg, composed of 17 shots, is extracted from the movie The Lord of the Rings.
The whole length is 1024 frames, and the length of the synthesized reference summary is 19
frames.
Figure 15: Distance bisection method for 13 frames. The frames are represented by circles. The
number along each circle is the frame index. The beginning of the sequence is static, with slow
distance change, and the end is dynamic, with large inter-frame distances. Bisection takes place
along the vertical axis, and the order of choice of frames in the ranking process is given by the
numbers along the vertical axis. Frames of the dynamic part of the sequence tend to be ranked
higher than with a time-based bisection.
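Our reading of the distance bisection procedure can be sketched as follows: frames are placed on a cumulative inter-frame distance axis, and the frame nearest each interval midpoint is chosen, recursing breadth-first; tie handling and the treatment of the end points are assumptions.

from collections import deque
import numpy as np

def distance_bisection_rank(features, frame_distance):
    # Position of each frame along the cumulative inter-frame distance axis.
    steps = [frame_distance(features[i], features[i + 1]) for i in range(len(features) - 1)]
    pos = np.concatenate(([0.0], np.cumsum(steps)))
    ranked = [0, len(features) - 1]               # end points first (assumption)
    queue = deque([(0, len(features) - 1)])
    while queue:
        lo, hi = queue.popleft()
        inner = range(lo + 1, hi)
        if not inner:
            continue
        mid_value = (pos[lo] + pos[hi]) / 2.0
        mid = min(inner, key=lambda i: abs(pos[i] - mid_value))  # frame nearest the midpoint
        ranked.append(mid)
        queue.append((lo, mid))
        queue.append((mid, hi))
    return ranked                                  # most relevant first, under this reading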
Our goal is to see whether the Curve Simplification algorithm outperforms the other two
in summarization ability and to make corresponding improvements. We therefore tested several
different types of feature vectors (see Section 3) and, for the same feature vector construction
scheme, tried various numbers of bins or bin combinations to find out which feature vector performs
best. To find out whether non-histogram-based feature vectors perform better, we also tried
thumbnail image vectors C8, BW8, C16 and BW16, corresponding to the RGB image after 8×8 block
Gaussian smoothing, the grayscale image after 8×8 block Gaussian smoothing, the RGB image after
16×16 block Gaussian smoothing, and the grayscale image after 16×16 block Gaussian smoothing.
Thumbnail image vectors are abbreviated as tImg in the tables.
use this cutoff point to approximate curve comparison. Tables 1 to 7 present the performance
of each candidate summary. AC, CAP and N80 denote average closeness, cumulated average
precision, and the number of frames a candidate needs to reach 80% recall. To save space,
results for the redundancy rate are not presented; a coarse estimate of redundancy can be obtained
by comparing N_{ref} with N80. The entries in bold italic correspond to the best performance
in each table.
Table 5: Performance of various bin combinations used in sYUV
2. About assessors’ behaviors
(a) for videos with many shots, assessors tend to choose one frame for each shot
(b) very short shots are usually ignored
(c) usually the first and last frames of a video, especially the last one, are ignored
(d) for a shot with regular motion, such as the second shot of FlyingCar.mpeg, the selected
key frames are quite evenly distributed along the temporal dimension.
8 Conclusions and Future Work
We have developed a performance evaluation system, SUPERSIEV, for video summarization algo-
rithms, which has three main characteristics: (1) automatic evaluation; (2) synthesizing reference
summaries obtained from user study into one that expresses the majority of assessors’ opinions;
(3) using four performance measures to quantitatively evaluate a candidate summary from dif-
ferent viewpoints, which can help give a clear view of the behavior of the summarization
algorithm and shed light on algorithm improvements. We also proposed to use an offline lookup
table to accelerate the online performance computation.
To illustrate the operation of our system and how it is used to help analyze and compare algo-
rithms, we evaluated three frame-ranking-based summarization algorithms. We performed a
large number of experiments to help select appropriate feature vectors for the curve simplification
algorithm.
For the current evaluation system itself, there is still a lot of room for improvement, such as:
• Image content distance computed by means of feature vectors may sometimes contradict
human intuition, thus leading to wrong matching relationships. This is inevitable. Im-
age distances that better model human cognition can be introduced in the framework of
SUPERSIEV.
• Since the current evaluation scheme is shot-based, it treats two coherent shots, such
as shots from the same scene, the same way as incoherent shots. Moreover, because assessors
are not required to choose at least one frame per shot, the final reference summary is not a
shot-based one. One way to deal with this inconsistency is to make sure that the final
reference has at least one frame for each shot. High level video segmentation techniques,
e.g., scene clustering, can help lessen this shot boundary dependency, but are more difficult
to implement.
• It is worth finding a way to align different videos (shot number, video length, genre, etc.),
so that we could give a more global mark on the quality of an algorithm by averaging its
performance over different videos.
• For frame ranking algorithms, the current system still falls short of proposing an effective
means to evaluate the correctness of individual rankings of frames, because assessors are not
required to rank frames (this would be too much work), the same problem as in TREC-type
evaluation systems.
In addition, our current system is based on low-level features, so it inherits all the drawbacks
of video applications using low-level features. Finally, algorithm complexity and robustness are
not evaluated, since our present focus is only on the summarization process.
References
[2] D. DeMenthon, L. Latecki, A. Rosenfeld and M. Vuilleumier Stückelberg, “Relevance Ranking
of Video Data using Hidden Markov Model Distances and Polygon Simplification”, VISUAL
2000, pp. 49–61.
[3] A. Divakaran, K.A. Peker, R. Radhakrishnan, Z. Xiong and R. Cabasson, “Video Summariza-
tion using MPEG-7 Motion Activity and Audio Descriptors”, Video Mining, A. Rosenfeld,
D. Doermann and D. DeMenthon, eds., Kluwer Academic Press, 2003.
[4] M.S. Drew and J. Au, “Video Keyframe Production by Efficient Clustering of Compressed
Chromaticity Signatures”, ACM Multimedia 2000, Juan-Les-Pins, France, pp. 365–368,
November 2000.
[5] Y. Gong and X. Liu, “Video Summarization using Singular Value Decomposition”, CVPR
2000, pp. 2174–2180.
[6] A. Hanjalic and H. Zhang, “An Integrated Scheme for Automated Video Abstraction based
on Unsupervised Cluster-Validity Analysis”, IEEE Transactions on Circuits and Systems for
Video Technology, Vol. 9, No. 8, pp. 1280–1289, 1999.
[8] Y. Li, T. Zhang and D. Tretter, “An Overview of Video Abstraction Techniques”, HP Lab-
oratory Technical Report, HPL-2001-191, July 2001.
[9] S.B. Needleman and C.D. Wunsch, “A general method applicable to the search for similarities
in the amino acid sequence of two proteins”, J. Molecular Biology, vol. 48, pp. 443–453,
1970.
[10] K. Papineni, S. Roukos, T. Ward and W-J. Zhu, “Bleu: a Method for Automatic Evaluation of
Machine Translation”, ACL 2002, pp. 311–318.
[12] M. Slaney, D. Ponceleon and J. Kaufman, “Understanding the Semantics of Media”, Video
Mining, A. Rosenfeld, D. Doermann and D. DeMenthon, eds., Kluwer Academic Press, 2003.
[13] H. Sundaram, L. Xie and S-F. Chang, “A utility framework for the automatic generation of audio-
visual skims”, ACM Multimedia 2002, pp.189–198.
[14] S. Uchihashi, J. Foote, A. Girgensohn and J. Boreczky, “Video Manga: Generating Seman-
tically Meaningful Video Summaries”, ACM Multimedia 99, Orlando, vol. 1, pp. 383–392,
1999.
[15] J. Vermaak, P. Perez, M. Gangnet and A. Blake, “Rapid Summarization and Browsing of
Video Sequences”, BMVC 2002, pp. 424–433.
[16] E. M. Voorhees, “Variations in relevance judgments and the measurement of retrieval effec-
tiveness”, Information Processing and Management, vol. 36, no. 5, pp. 697–716, 2000.
[17] H. Wactlar, M. Christel, Y. Gong and A. Hauptmann, “Lessons Learned from Building a
Terabyte Digital Library”, IEEE Computer, Vol. 32, No. 2, pp. 66–73, 1999.
[18] K. Yoon, D. DeMenthon and D. Doermann, “Event detection from MPEG video in the
compressed domain”, ICPR 2000, Vol. 1, pp. 819–822.
[19] Yu-Fei Ma, Lie Lu, Hong-jiang Zhang and Mingjing Li, “A User Attention Model for Video
Summarization”, ACM Multimedia 2002, Juan-les-Pins, France, December 2002.
Appendix
A Assumptions about Assessors Behavior
The reference synthesis procedure is based on two assumptions about assessors’ behaviors:
1) Each assessor has his/her own stable semantic gap between consecutive summary frames, which
can be viewed as constant within one video. That is, the reference summary from one assessor
will not give viewers a feeling that the change between the content of some neighboring frames
jumps sharply while in other parts the content goes on continuously and smoothly. Under this
assumption, we can define the semantic gap that most assessors agree upon as the mainstream
gap and normalize each assessor's summary by inserting frames for a sparse style or combining
several frames into one for a dense style. We can then cluster all normalized reference summary
frames into several groups with the gap between centers of neighboring groups along the time
dimension close to the main stream semantic gap. The resultant summary, composed of group
centers, can be viewed as consistent with the majority opinion of assessors.
For example, in Fig. 17, suppose the lines represent the semantics axis (equal intervals mean
equal semantic gaps). The squares and circles on the first two lines correspond to reference frames
from two assessors (S1 and S2 ) respectively. And the filled circles on the third line correspond
to the main stream summary of all assessors, which is 4 frames in length. Given a main stream
summary, we can normalize those summaries (S1 and S2 in this case) whose style (semantic gap)
differs greatly from the average, based on the above assumption: combining every 4 frames into one
for S1, and inserting two frames for S2 (the results are marked as filled triangles).
The problem is that what we discuss here is based on an ideal semantics axis, whose implemen-
tation involves semantic level content representation and distance computation, but at present
there is no mature technology to deal with this. So, as an alternative, we determine the summary
length (number of frames) that most assessors agree upon, use it as the expected number of clusters,
and apply an iterative procedure to model the mainstream summary in the low-level image feature space.
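A hedged sketch of this synthesis for one shot: the mainstream summary length is taken as the per-assessor frame count that occurs most often (the histogram of Fig. 14), and the grouping is illustrated here with a plain k-means over frame feature vectors; this stands in for, and is not identical to, the pseudocode of Fig. 13.

from collections import Counter
import numpy as np

def synthesize_shot_reference(assessor_summaries, frame_features):
    # assessor_summaries: per-assessor lists of selected frame numbers for this shot.
    # frame_features: dict mapping frame number -> low-level feature vector.
    lengths = Counter(len(s) for s in assessor_summaries)
    k = lengths.most_common(1)[0][0]            # mainstream summary length for the shot
    if k == 0:
        return []
    pooled = sorted(f for s in assessor_summaries for f in s)
    X = np.array([frame_features[f] for f in pooled], dtype=float)
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)]
    for _ in range(20):                         # simple k-means iterations
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    reference = []
    for j in range(k):                          # keep the pooled frame nearest each center
        members = np.where(labels == j)[0]
        if members.size:
            best = members[np.argmin([np.linalg.norm(X[i] - centers[j]) for i in members])]
            reference.append(pooled[int(best)])
    return sorted(set(reference))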
2) Assessors have some common judgment about the current video. That is, the following
situation is considered anomalous and is assumed not to occur in real cases. In Fig. 18, S1 thinks
the former part of the video is very important and reflects the whole content, while the rest of
the video clip is insignificant; S2 and S3 have very different impressions. If we allowed this to
happen, the mainstream summary obtained by the above method would in fact be an incorrect
model of the assessors' summaries: it would be a simple addition of the three summaries, and
none of the assessors would agree that it represents his/her opinion.
Figure 18: When no common opinion about the video content can be derived, no mainstream
summary can be extracted. The so-called mainstream summary in this case does not agree with
any assessor's summary.
We assume that video producers (movie makers) will not waste film on producing a large
number of trivial frames, i.e., each frame does contribute to the whole content. So the above
situation means the assessors have evidently imposed their personal interests on their summaries
and the completeness of the video content is not preserved, which is against our guidelines.
Note that there do exist situations such as the one shown in Fig. 19, in which most assessors select
one key frame. This implies that the amount of information in the segment is low, and assessors
think that one frame is enough for abstraction. This is typical of a stationary video segment. As
for the assessor who chooses two frames (S2), we think he/she did not follow our suggestion; this
is a minority case which will not impede the mainstream summary extraction.
Though both figures show large differences between assessors, the main difference between Fig.
18 and Fig. 19 is that in the former case assessors choose more than one frame at very close
positions, that is, they sample densely locally, which cannot be regarded as a random selection;
and these local dense clusters are far away from each other, implying large cognitive divergence
between assessors. In the latter case, the randomly distributed key frames are a good sign that
the information is distributed relatively evenly in the segment and assessors agree that the segment
can be summarized by one frame. This, of course, is a kind of cognitive convergence.
When no mainstream opinion of assessors can be extracted, evaluation cannot be conducted.
So this second assumption of consistent common judgment provides a condition under which a
reasonable mainstream reference does exist.
Tables 8 to 14 show the performance at 100% recall. The column N_{100} is the number of
frames the candidate summary needs to reach 100% recall.
Table 10: Performance of various bin combinations used in 3DYUV
Table 13: Performance of thumbnail image vector