
Int. j. inf. tecnol.

https://doi.org/10.1007/s41870-018-0210-4

ORIGINAL RESEARCH

An experimental comparative study on slide change detection in lecture videos

Purushotham Eruvaram1 • Kasarapu Ramani2 • C. Shoba Bindu3

Received: 20 September 2017 / Accepted: 30 June 2018


© Bharati Vidyapeeth's Institute of Computer Applications and Management 2018

Abstract In today's world, e-learning is one of the popular modes of learning, and video lectures are prominent in keeping learners engaged with a course. The Internet has made it possible to keep a large number of video lectures on-line, and searching for a required topic or subtopic in this huge video repository is becoming very tedious. One way to search for a particular topic is keyword based search, which relies on extracting the text content available in lecture video files; to achieve this, metadata has to be maintained. To prepare the metadata associated with a video, the frames containing text need to be processed. As a video contains multiple frames per second, it is not necessary to consider each and every frame, so the frames containing distinct content, called key frames, need to be identified. The identification of key frames plays a crucial role in the lecture video searching process. In this paper, different techniques for key frame identification are experimentally tested and the results are compared.

Keywords Video lecture • Key frame • Color Histogram • Black Pixel Distribution Difference • Precision • Recall • Slide change detection

✉ Purushotham Eruvaram
e.purushotham@gmail.com

Kasarapu Ramani
ramanidileep@yahoo.com

C. Shoba Bindu
shobabindhu@gmail.com

1 Department of CSE, JNTUK, Kakinada, Andhra Pradesh 533003, India
2 Department of IT, Sree Vidyanikethan Engineering College, A.Rangampet, Tirupati, Andhra Pradesh 517102, India
3 Department of CSE, JNTUA College of Engineering, Ananthapuramu, Andhra Pradesh 515002, India

1 Introduction

E-learning has changed the way knowledge is acquired; it provides knowledge and education in places and situations which were never imagined [1]. Lecture videos are among the most popular e-learning resources, as they can be viewed, paused and replayed at the convenience of users. Nowadays, universities and educational organizations produce large collections of lecture videos and publish them on-line for users to view irrespective of time and location, which has resulted in a huge increase of lecture videos on the web. Without a search function within a video repository, it is very difficult for a user to find a required video. Even after finding a video, it is hard for the user to decide whether it is useful just by looking at the title and the other metadata tagged to the video, which are very brief. Sometimes the required information may be covered in only a few minutes of the whole video, and the user wants to view that particular portion without watching the complete video.

The metadata tagged to lecture videos, which is usually prepared manually, is not sufficient to make the search for specific information possible. So, a system is required which can look into the content of the video, extract information and prepare metadata for the video file, in order to provide accurate results to the user.

Lecture videos are usually of different types: the entire frame shows only lecture slides; or a single frame has two parts, one for the speaker and the other for the slides; or both speaker and slides appear on the same screen.


Lecture videos have two types of information sources: the audio, and the text in the video frames. Different methods are available to extract the information in lecture videos: Automatic Speech Recognition (ASR) for audio information and Optical Character Recognition (OCR) for text information. The ASR technique extracts the speech or voice from the audio and converts it into textual information; the quality of the information retrieved using ASR depends upon the speaker's fluency. There are different lecture video indexing and searching methods which make use of text extracted from the lecture video [2–6], extracted audio [7, 8], or both text and audio [9–12].

OCR recognizes the text in images and produces editable text. A video can be converted into a sequence of images (frames) and text can be extracted from each frame. As a video consists of many frames per second, many of the extracted frames will have similar content. For example, a video of 60 min duration may consist of about 40–50 slides, but the number of frames in the video may be in the order of thousands. It is not necessary to process all the frames to extract the text, so the identification of distinct slides (frames) is crucial; these frames are referred to as key frames.

In this paper, different approaches for key frame identification are tested on different lecture videos and the results are compared using precision and recall as performance measures.

2 Literature survey

Automatic video indexing and retrieval is an active area of research with a wide range of applications [13], and various approaches have been developed for it. Usually videos are first segmented into shots; this segmentation process has three steps: feature extraction, computing similarity and detecting shot boundaries [13].

Lecture videos differ from normal videos in that they display a sequence of slides, each shown for some time duration. In this case a shot becomes a series of frames that display the same slide. Slide change detection is a critical task for lecture video indexing, and various methods are in use for detecting slide transitions.

Yang et al. [10] developed a lecture video retrieval system where adjacent frames 3 s apart are compared for slide change detection. For each selected pair of frames, canny edge maps are created and a pixel level differential image is computed; connected component analysis is performed on this image and the connected component (CC) count is used as the segmentation threshold. Further, the title and content regions are divided at 23 and 70% of the frame height, respectively. To identify a slide change at the title region, CC differencing with a small threshold is considered; the same is applied to the first and last bounding box objects in the content region, and the slide change is identified.

Jeong et al. [14] presented an automatic slide transition detection method for lecture videos. In this method, the slide region is identified and the Scale Invariant Feature Transform (SIFT) algorithm is applied to the slide region to extract its features. Slide change detection time is reduced by using a recursive interval pruning algorithm, the threshold for slide change identification is estimated from the mean and standard deviation of sample frame similarities, and backward slide transitions are also detected by this method.

Ma et al. [4] proposed a machine learning based approach for automatic lecture video segmentation and indexing. In this approach five features (Color Histogram Difference, Adjacent Frame Difference, Black Pixel Distribution Difference, size of skin area and features from Gabor filters with different orientations) are extracted and a Decision Tree Classifier is used to label the video frames.

Wang and Kankanhalli [15] proposed a method for aligning presentation videos with slides, combining SIFT key points and color features. The video segmentation method first calculates a 64 bin gray level histogram for adjacent frames 1 s apart; the Chi square distance between the histograms is computed, and a slide transition is detected when the value crosses the threshold. Texture features are used to handle defocused slide regions.

De Lucia et al. [16] presented a method where the RGB average image of a frame is computed and the sum of pixel to pixel differences of adjacent RGB averaged frames is used to detect slide changes. The detection accuracy can be improved with user interaction, and false slide changes due to camera motion are prevented.

Masneri and Schreer [5] proposed a method in which the video is classified into one of four classes: Talk, Presentation, Black Board, and Mix. The Presentation parts of the video are further examined to identify slide transitions; a Color Histogram Difference metric is used, and a slide change is detected when the value crosses the detection threshold.

Adcock et al. [6] developed a lecture webcast search engine where frame difference based analysis is used to detect slide changes within the lecture video. The Global Pixel Difference between frames 3 s apart is computed, and when 1% of the pixels in the frame exceed the threshold a key frame is extracted.

Haubold and Kender [12] performed augmented segmentation of presentation videos, in which histogram changes are computed between frames and long term changes are identified by comparing the degree of change over time. For shot boundary detection, 2–4 s windows are compared, and a shot boundary is considered when the difference between them changes significantly relative to the mean of both windows.


3 Slide change detection methods

Slide change detection in videos is crucial to make content based video search possible. For this, the video needs to be processed by dividing it into frames; the following methods of slide change detection are compared in this paper.
3.1 Color Histogram based algorithm

A slide change in a lecture video is recognized by computing the difference between two adjacent frames, separated by different time intervals, using

$$D_i = \sum_{c,b} \left| \sum_{x,y} h^{i+1}_{cb}(x,y) - \sum_{x,y} h^{i}_{cb}(x,y) \right| \qquad (1)$$

where

$$h_{cb}(x,y) = \begin{cases} 0 & \text{if } 32\,b \le I_c(x,y) \le 32\,(b+1) \\ 1 & \text{otherwise.} \end{cases}$$

In Eq. (1), c represents the color channel {R, G, B}, b specifies the bin number of the histogram, and x and y are the image pixel coordinates. D_i is the difference between frames i and i + 1, and I_c(x,y) is the intensity value of the pixel. When the difference between two frames (D_i) is greater than a threshold value, a slide transition is detected.

Algorithm:

1. Select a lecture video file as input.
2. Extract the frames at different uniform time intervals.
3. Compute the color histogram difference between every pair of adjacent extracted frames using Eq. (1).
4. If the Histogram Difference ≥ Slide Change Threshold, then a slide change is detected.
5. Repeat the above steps till the end of the video.
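For illustration, a minimal Python sketch of this method, assuming OpenCV and NumPy, is given below. The 8 bins per channel follow the 32-intensity bin width of Eq. (1); the sampling interval and the slide change threshold are illustrative assumptions, not values prescribed by the paper.

```python
import cv2
import numpy as np

def color_histogram_difference(frame_a, frame_b, bins=8):
    """Per-channel, per-bin histogram difference, as in Eq. (1).

    Eq. (1) counts, per bin, the pixels *outside* the bin (h = 0 inside),
    but for two frames of equal size the per-bin absolute differences
    equal those of ordinary histogram counts, which are used here.
    """
    diff = 0.0
    for c in range(3):  # B, G, R channels
        ha = cv2.calcHist([frame_a], [c], None, [bins], [0, 256]).ravel()
        hb = cv2.calcHist([frame_b], [c], None, [bins], [0, 256]).ravel()
        diff += np.abs(ha - hb).sum()
    return diff

def detect_slide_changes(video_path, interval_s=1.0, threshold=50000.0):
    """Yield timestamps (seconds) where the difference crosses the threshold.

    `threshold` is an illustrative value; the paper does not fix one.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(round(fps * interval_s)))  # frames per sampling interval
    prev, idx = None, 0
    while True:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        if prev is not None and color_histogram_difference(prev, frame) >= threshold:
            yield idx / fps  # slide change detected at this time
        prev, idx = frame, idx + step
    cap.release()
```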
3.2 Dark pixel distribution difference

Usually a lecture slide consists of text, each slide has a different number of lines, and every line's character count is different. This distribution can also be used to identify a change of slides. For this, the image is binarized first.

The pixel distribution difference is given by

$$P_x = \sum_{i=0}^{\lceil w/10 \rceil} \left| b_{x_i} - b'_{x_i} \right| \qquad (2)$$

$$P_y = \sum_{j=0}^{\lceil h/10 \rceil} \left| b_{y_j} - b'_{y_j} \right| \qquad (3)$$

$$b_{x_i} = \sum_{y=0}^{h} \sum_{x=i \cdot 10}^{(i+1) \cdot 10} \bigl( 1 - I(x,y) \bigr), \quad i \in \{0, 1, 2, \ldots, \lceil \text{width}/10 \rceil\} \qquad (4)$$

$$b_{y_j} = \sum_{x=0}^{w} \sum_{y=j \cdot 10}^{(j+1) \cdot 10} \bigl( 1 - I(x,y) \bigr), \quad j \in \{0, 1, 2, \ldots, \lceil \text{height}/10 \rceil\} \qquad (5)$$

In the above equations, P_x and P_y denote the pixel distribution difference along the X and Y axes, b_{x_i} and b_{y_j} are the values in the different bins, and I(x,y) is the binarized image pixel intensity.

When the pixel distribution difference of two frames along the X axis or the Y axis exceeds the threshold, a slide transition is detected.

Algorithm:

1. Select a lecture video file as input.
2. Extract the frames at different uniform time intervals.
3. Convert the frame into a binary image.
4. Compute the horizontal and vertical pixel distribution differences using Eqs. (2), (3), (4) and (5).
5. If the Horizontal or Vertical Pixel Distribution Difference ≥ Slide Change Threshold, then a slide change is detected.
6. Repeat the above steps till the end of the video.
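A corresponding Python sketch of the dark pixel profiles of Eqs. (2)–(5) might look as follows; Otsu binarization and the numeric threshold are assumptions, while the 10-pixel band width follows the equations.

```python
import cv2
import numpy as np

def dark_pixel_profiles(gray, band=10):
    """Column/row dark-pixel profiles of Eqs. (4) and (5) over 10-pixel bands."""
    # Binarize: text pixels become 0, background 1 (Otsu picks the split).
    _, binary = cv2.threshold(gray, 0, 1, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    dark = 1 - binary  # 1 - I(x, y): count the dark (text) pixels
    h, w = dark.shape
    bx = np.array([dark[:, i * band:(i + 1) * band].sum()
                   for i in range(int(np.ceil(w / band)))])
    by = np.array([dark[j * band:(j + 1) * band, :].sum()
                   for j in range(int(np.ceil(h / band)))])
    return bx, by

def distribution_difference(frame_a, frame_b):
    """P_x and P_y of Eqs. (2) and (3) between two frames."""
    bxa, bya = dark_pixel_profiles(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY))
    bxb, byb = dark_pixel_profiles(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY))
    return np.abs(bxa - bxb).sum(), np.abs(bya - byb).sum()

def is_slide_change(frame_a, frame_b, threshold=2000):
    """Flag a change when either profile difference crosses the (assumed) threshold."""
    px, py = distribution_difference(frame_a, frame_b)
    return px >= threshold or py >= threshold
```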


3.3 Connected component based differencing

Lecture slides consist of lines of text, diagrams, tables, etc., which are in effect connected components. In this method, the canny edge maps for two consecutive frames are generated first and the pixel difference image of these edge maps is computed. Connected component analysis is performed on the differential image, and the number of connected components is used as the measure to identify a slide change: when this value crosses a threshold (20), a slide transition is identified.

Algorithm:

1. Select a lecture video file as input.
2. Extract the frames at different uniform time intervals.
3. Compute the canny edge maps for the frames.
4. Compute the pixel difference image from the edge maps.
5. Compute the number of connected components in the difference image.
6. If the Connected Components count ≥ Slide Change Threshold, then a slide change is detected.
7. Repeat the above steps till the end of the video.
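A possible realization of this measure in Python/OpenCV is sketched below; the Canny thresholds are assumed values, while the component count threshold of 20 is the one stated above.

```python
import cv2

def connected_component_count(frame_a, frame_b):
    """Count connected components in the difference of canny edge maps."""
    edges_a = cv2.Canny(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), 100, 200)
    edges_b = cv2.Canny(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), 100, 200)
    diff = cv2.absdiff(edges_a, edges_b)  # pixel difference image of edge maps
    # Label the 8-connected components of the differential edge image;
    # label 0 is the background, so subtract it from the count.
    n_labels, _ = cv2.connectedComponents(diff, connectivity=8)
    return n_labels - 1

def is_slide_change(frame_a, frame_b, threshold=20):
    """Threshold of 20 components, as stated in the text above."""
    return connected_component_count(frame_a, frame_b) >= threshold
```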
3.4 Global pixel difference

In this method the frames are extracted from the lecture video at uniform time intervals and binarized, and then the pixel level difference is computed by the image subtraction method. The non-zero pixel count of the resulting image is normalized by the size of the image:

$$D_i = \sum_{x=0}^{w} \sum_{y=0}^{h} \left| b_i(x,y) - b_{i-1}(x,y) \right| \qquad (6)$$

$$P_i = \frac{D_i}{w \cdot h} \qquad (7)$$

In the above equations, b_i(x,y) represents the intensity value of the binarized image at pixel (x,y) and w × h is the image size. When the P_i value exceeds a transition threshold (0.03), a slide transition is detected.

Algorithm:

1. Select a lecture video file as input.
2. Extract the frames at different uniform time intervals.
3. Convert the frame into a binary image.
4. Compute the pixel level difference using Eq. (6).
5. Compute the non-zero pixel count and normalize it by the image size to get P_i using Eq. (7).
6. If P_i ≥ Slide Change Threshold, then a slide change is detected.
7. Repeat the above steps till the end of the video.
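This measure can be sketched in Python as follows; Otsu binarization is an assumed choice, while the 0.03 transition threshold is the value stated above.

```python
import cv2
import numpy as np

def global_pixel_difference(frame_a, frame_b):
    """P_i of Eq. (7): normalized count of differing binarized pixels."""
    ga = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gb = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    _, ba = cv2.threshold(ga, 0, 1, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    _, bb = cv2.threshold(gb, 0, 1, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    diff = np.abs(ba.astype(np.int16) - bb.astype(np.int16))  # Eq. (6)
    h, w = diff.shape
    return np.count_nonzero(diff) / float(w * h)

def is_slide_change(frame_a, frame_b, threshold=0.03):
    """Transition threshold of 0.03, as given in the text."""
    return global_pixel_difference(frame_a, frame_b) >= threshold
```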
3.5 Using SIFT and adaptive threshold

In this method SIFT features are used to find the similarity between two slides; the SIFT features are extracted from frames sampled at uniform intervals. Let M(S_t, S_{t+k}) denote the matching features between slide S_t in frame F_t at time t and slide S_{t+k} in frame F_{t+k} at time t + k. A slide transition is detected using three slide-to-slide similarities, M_1 = M(S_t, S_{t+k}), M_2 = M(S_{t+k}, S_{t+2k}) and M_3 = M(S_t, S_{t+2k}); these M_i are compared with an adaptive threshold τ computed using Eqs. (8) and (9):

$$\tau = \bar{M} \left( 1 - \frac{1}{\mathrm{MAD}} \right) \qquad (8)$$

where MAD is defined by

$$\mathrm{MAD} = \frac{1}{3} \sum_{i=1}^{3} \left| M_i - \bar{M} \right| \qquad (9)$$

and $\bar{M}$ is the average of the M_i values.

A slide transition is detected between S_t and S_{t+k} when M_1 < τ, M_2 > τ and M_3 < τ; between S_{t+k} and S_{t+2k} when M_1 > τ, M_2 < τ and M_3 < τ; and between both (S_t and S_{t+k}) and (S_{t+k} and S_{t+2k}) when M_1 < τ, M_2 < τ and M_3 > τ, or M_1 < τ, M_2 < τ and M_3 < τ.

Algorithm:

1. Select a lecture video file as input.
2. Extract the frames at different uniform time intervals.
3. Compute the SIFT features for the frames.
4. Find the matching features between F_i & F_{i+1} (M_1), F_{i+1} & F_{i+2} (M_2) and F_i & F_{i+2} (M_3).
5. Compute the adaptive threshold τ using Eq. (8).
6. If (M_1 < τ and M_2 > τ and M_3 < τ): slide change between F_i and F_{i+1};
   else if (M_1 > τ and M_2 < τ and M_3 < τ): slide change between F_{i+1} and F_{i+2};
   else if (M_1 < τ, M_2 < τ and M_3 > τ) or (M_1 < τ, M_2 < τ and M_3 < τ): slide change between F_i & F_{i+1} and F_{i+1} & F_{i+2}.
7. Repeat the above steps till the end of the video.
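A Python sketch of this scheme follows. The brute-force matcher with Lowe's ratio test and the guard against MAD = 0 are assumptions not specified above; Eqs. (8) and (9) and the decision rules follow the text.

```python
import cv2
import numpy as np

def sift_match_count(frame_a, frame_b, ratio=0.75):
    """Number of SIFT feature matches between two frames, M(S_t, S_{t+k}).

    Lowe's ratio test (0.75) is a common choice; the paper does not
    specify the matching strategy.
    """
    sift = cv2.SIFT_create()
    _, da = sift.detectAndCompute(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), None)
    _, db = sift.detectAndCompute(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), None)
    if da is None or db is None:
        return 0
    matches = cv2.BFMatcher().knnMatch(da, db, k=2)
    return sum(1 for pair in matches
               if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance)

def classify_transition(f0, f1, f2):
    """Apply the decision rules above to three consecutive sampled frames."""
    m = np.array([sift_match_count(f0, f1),   # M1
                  sift_match_count(f1, f2),   # M2
                  sift_match_count(f0, f2)])  # M3
    mad = np.abs(m - m.mean()).mean()                    # Eq. (9)
    tau = m.mean() * (1 - 1 / mad) if mad > 0 else 0.0   # Eq. (8); MAD = 0 guarded
    if m[0] < tau and m[1] > tau and m[2] < tau:
        return "change between F_i and F_i+1"
    if m[0] > tau and m[1] < tau and m[2] < tau:
        return "change between F_i+1 and F_i+2"
    if m[0] < tau and m[1] < tau:  # covers both the M3 > tau and M3 < tau rules
        return "changes at both boundaries"
    return "no slide change"
```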


4 Experimental results

Two types of lecture videos are chosen as input for the experimentation. Type-1: video consisting of only slides; type-2: video consisting of a slide view and a presenter view. Fig. 1 shows an example of each type; the video lectures selected for this process are related to the computer science domain.

Fig. 1 Example of two types of lecture videos

Frames are extracted from the videos at different time intervals (1, 2, 3 and 4 s). The algorithms (Color Histogram Difference, Global Pixel Difference, Connected Component based differencing, SIFT with adaptive threshold and Dark Pixel Distribution Difference) are applied to the frames obtained from the input videos.

To measure the slide change detection accuracy, the slide changes in all the videos are identified manually. These manually identified slide changes are compared with the slide changes detected by the methods in this paper. The precision and recall used to evaluate the performance are defined as follows [4]:

$$\text{Recall} = \frac{\text{Number of correctly recognized slide changes}}{\text{Number of actual slide changes}}$$

$$\text{Precision} = \frac{\text{Number of correctly recognized slide changes}}{\text{Number of slide changes recognized}}$$

The average recall and precision for the videos of the two types are calculated for performance comparison.
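Once detected and manually labelled change times are available, these measures can be computed as sketched below; the ±1 s tolerance used to match a detection to a labelled change is an assumption, as the matching criterion is not stated in the paper.

```python
def precision_recall(detected, actual, tolerance_s=1.0):
    """Precision and recall of detected slide-change times (seconds)
    against manually labelled ones, matching within +/- tolerance_s."""
    matched = set()
    correct = 0
    for d in detected:
        # Greedily match each detection to the nearest unmatched label.
        candidates = [(abs(d - a), i) for i, a in enumerate(actual)
                      if i not in matched and abs(d - a) <= tolerance_s]
        if candidates:
            matched.add(min(candidates)[1])
            correct += 1
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall

# Example: one missed labelled change and one false alarm.
p, r = precision_recall(detected=[12.0, 47.5, 61.0], actual=[12.2, 47.0, 90.0])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```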
Table 1 Bar colors and the corresponding slide change detection methods in the graphs of Figs. 2, 3, 4, 5, 6, 7, 8 and 9

S.No.  Method
1      Color Histogram based
2      Global Pixel Difference
3      Connected Component based Differencing
4      Using SIFT and Adaptive Threshold
5      Dark Pixel Distribution Difference

The graphs in Figs. 2, 3, 4 and 5 show the performance of the slide change detection methods (Table 1) with recall as the parameter (with frame extraction time intervals of 1, 2, 3 and 4 s, respectively). The recall graphs show that the Dark Pixel Distribution Difference method and the Color Histogram based method give high recall values for both type-1 and type-2 videos. The Connected Component and SIFT based methods give the same recall values for type-1 videos; for type-2 videos, the SIFT based method's values are slightly greater than the Connected Component based method's.

Though the Global Pixel Difference method's recall is lower than that of the other methods for both type-1 and type-2 videos, its results are not poor. As the time interval increases, the recall values reduce slightly for all methods and for both types of videos.

From the graphs in Figs. 4 and 5, with time intervals of 3 and 4 s, respectively, it is observed that the recall values are slightly less than the values in Figs. 2 and 3, with time intervals of 1 and 2 s, respectively. The slide transitions missed in the former case (time intervals of 3 and 4 s), which play for only one or two seconds, are in most cases transitions that occurred accidentally while the presenter navigated from one slide to another, so they can be neglected.

The graphs in Figs. 6, 7, 8 and 9 show the performance of the slide change detection methods (Table 1) with precision as the parameter (with frame extraction time intervals of 1, 2, 3 and 4 s, respectively). The precision graphs show that the Global Pixel Difference, Connected Component based and Dark Pixel Distribution Difference methods' precision values are higher for type-1 videos, and that only the Global Pixel Difference and Connected Component based methods' precision values are higher for type-2 videos.

The Global Pixel Difference and Connected Component methods' values dominate those of the other methods for both type-1 and type-2 videos. The precision values of the Color Histogram, SIFT and Dark Pixel Distribution Difference methods for type-2 videos are very low compared to the other methods.

Fig. 2 Performance comparison of different methods on the two types of videos with recall, where frames are extracted with a time interval of 1 s

Fig. 3 Performance comparison of different methods on the two types of videos with recall, where frames are extracted with a time interval of 2 s

Fig. 4 Performance comparison of different methods on the two types of videos with recall, where frames are extracted with a time interval of 3 s

Fig. 5 Performance comparison of different methods on the two types of videos with recall, where frames are extracted with a time interval of 4 s

As the time interval increases, the precision values increase for all the methods and for both types of videos. From the graphs in Figs. 8 and 9, with time intervals of 3 and 4 s, respectively, it is clear that the precision values are more than the values in Figs. 6 and 7, with time intervals of 1 and 2 s, respectively.

The experimental results show that:

• The Dark Pixel Distribution Difference method can identify most of the slide changes accurately for all the considered inter-frame time intervals for type-1 videos; however, for type-2 videos the number of false slide transitions it detects is high compared to the other methods.


Fig. 6 Performance comparison of different methods on the two types of videos with precision, where frames are extracted with a time interval of 1 s

Fig. 7 Performance comparison of different methods on the two types of videos with precision, where frames are extracted with a time interval of 2 s

Fig. 8 Performance comparison of different methods on the two types of videos with precision, where frames are extracted with a time interval of 3 s

Fig. 9 Performance comparison of different methods on the two types of videos with precision, where frames are extracted with a time interval of 4 s

• When both type-1 and type-2 videos are considered in evaluating slide change detection performance, the Global Pixel Difference and Connected Component based methods are good in both recall and precision values.

• It is observed that the time interval between frames selected for transition detection influences the slide change detection performance.

• The time interval between frames can be selected as 3 or 4 s, because the missed slides of 1 or 2 s duration can be neglected.

• By choosing the interval as 3 or 4 s, the complexity can also be reduced, as the number of frames used for slide change detection is greatly reduced. But with a time interval of 3 or 4 s the exact slide transition time may not be detected.


5 Conclusions

Identifying key frames is crucial for content based lecture video indexing and search. In this paper, slide changes in a video are detected using different methods. The experimental results show that the Global Pixel Difference and Connected Component based methods are good in both recall and precision values compared to all the other methods mentioned in this paper, for both type-1 and type-2 lecture videos and for different segmentation intervals. Choosing either the Connected Component based or the Global Pixel Difference method with a time interval of 4 s is recommended for slide transition detection.

In future work, other types of videos, in which the presenter interferes with the slide region, can also be considered in evaluating the performance of the methods in this paper. A key frame detection method which suits different types of lecture videos can be identified, and a lecture video indexing and search system can be developed.

Acknowledgements This work is supported by the University Grants Commission (UGC) under the Minor Research Project titled "Fast Content Based Search, Navigation and Retrieval system for E-Learning", Project Id: F.No:4-4/2015(MRP/UGC-SERO).

References

1. Yamin M, Aljehani SA (2016) E-learning and women in Saudi Arabia: an empirical study. BVICAM's Int J Inf Technol 8(1):950–954
2. Daga BS, Thakare VM (2017) Semantic enriched lecture video retrieval system using feature mixture and hybrid classification. Adv Image Video Process 5(3):01
3. Li K et al (2015) Structuring lecture videos by automatic projection screen localization and analysis. IEEE Trans Pattern Anal Mach Intell 37(6):1233–1246
4. Ma D, Xie B, Agam G (2014) A machine learning based lecture video segmentation and indexing algorithm. In: Document recognition and retrieval XXI, vol 9021. International Society for Optics and Photonics, p 90210V
5. Masneri S, Schreer O (2014) SVM-based video segmentation and annotation of lectures and conferences. In: Computer vision theory and applications (VISAPP), 2014 international conference on, vol 2. IEEE
6. Adcock J et al (2010) Talkminer: a lecture webcast search engine. In: Proceedings of the 18th ACM international conference on multimedia. ACM
7. Sandesh BJ et al (2017) Lecture video indexing and retrieval using topic keywords. World Acad Sci Eng Technol Int J Comput Electr Autom Control Inf Eng 11(9):1007–1011
8. Repp S, Grob A, Meinel C (2008) Browsing within lecture videos based on the chain index of speech transcription. IEEE Trans Learn Technol 1(3):145–156
9. Radha N (2016) Video retrieval using speech and text in video. In: Inventive computation technologies (ICICT), international conference on, vol 2. IEEE
10. Yang H, Meinel C (2014) Content based lecture video retrieval using speech and video text information. IEEE Trans Learn Technol 7(2):142–154
11. Kamabathula VK, Iyer S (2011) Automated tagging to enable fine-grained browsing of lecture videos. In: Technology for education (T4E), 2011 IEEE international conference on. IEEE
12. Haubold A, Kender JR (2005) Augmented segmentation and visualization for presentation videos. In: Proceedings of the 13th annual ACM international conference on multimedia. ACM
13. Hu W et al (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C (Appl Rev) 41(6):797–819
14. Jeong HJ et al (2015) Automatic detection of slide transitions in lecture videos. Multimed Tools Appl 74(18):7537–7554
15. Wang X, Kankanhalli M (2009) Robust alignment of presentation videos with slides. In: Proceedings of the 10th pacific rim conference on multimedia, PCM'09. Springer, Berlin, pp 311–322
16. De Lucia A et al (2008) Migrating legacy video lectures to multimedia learning objects. Softw Pract Exp 38(14):1499
