An Overview of Audio Event Detection Methods From Feature Extraction To Classification
I. INTRODUCTION
Audio Event Detection (AED) aims to detect different types of audio signals
such as speech and non-speech within a long and unstructured stream of audio.
AED can be considered a relatively new area of research, with the ambitious goal of
replacing traditional surveillance systems (TSS) with intelligent surveillance
systems (ISS). Traditional systems require regions of interest (ROI) equipped with
cameras, microphones, or other kinds of sensors to be constantly monitored by
human operators, and they record audio data into a multimedia dataset. Such a
dataset often consists of millions of audio clips, comprising environmental,
speech, and music sounds together with other non-speech utterances for use
in AED.
Feature extraction and audio classification form the basis of most
AED-related research fields and applications. They are central tasks in approaches
developed across numerous areas and environments, including the detection of
abnormal events (e.g., gunshots) in security [1], speech
recognition [2-4], speaker recognition [5, 6], animal vocalization [7-10], home care
application [11], medical diagnostic problems [12], bioacoustics monitoring [8, 9],
sport events [13-15], faults and failure detection in complex industrial systems [16]
and several others. The performance of an AED system, in terms of its computational
complexity, classification accuracy, and false-alarm rate, depends heavily on the
extracted audio features and on the choice of classifier [17, 18].
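As a minimal illustration of this dependence (not a method from the surveyed literature; the features and threshold here are illustrative toy choices), an AED front end can be sketched as frame-level feature extraction followed by a trivial classifier:

```python
import numpy as np

def frame_features(signal, frame_len=512, hop=256):
    """Compute a toy per-frame feature vector: [short-time energy, zero-crossing rate].

    Illustrative only; real AED systems typically use richer features
    such as MFCCs or spectral descriptors.
    """
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))
        # Fraction of adjacent sample pairs whose sign changes
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        feats.append([energy, zcr])
    return np.array(feats)

# Synthetic example: half a second of silence followed by a 440 Hz tone "event"
sr = 8000
t = np.arange(sr) / sr
signal = np.concatenate([np.zeros(sr // 2),
                         0.5 * np.sin(2 * np.pi * 440 * t[: sr // 2])])

X = frame_features(signal)          # one 2-D feature vector per frame
# A trivial energy-threshold "classifier" flags frames containing the event
events = X[:, 0] > 0.01
```

Each frame is mapped to a point in a low-dimensional feature space, and the classifier operates on those points; how discriminative the features are, and how well the classifier separates them, jointly determine detection accuracy and false alarms.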
Feature extraction is one of the most significant steps in audio signal
processing [19]. An audio signal carries many features, not all of which are
essential for audio processing. All classification systems employ a set of
features extracted from the input audio signal, where each feature represents an
element of a vector in the feature space. Therefore, many different audio classification