Audio Watermark: A Comprehensive Foundation Using MATLAB

Yiqing Lin • Waleed H. Abdulla
The University of Auckland, Auckland, New Zealand
Preface
This book is intended to provide a shortcut into this interesting field with minimal effort. The commonly known techniques are well explained and supplemented with MATLAB codes to give a clear idea of how each technique performs. In addition, the reader can reproduce the functional figures of the book with the provided MATLAB scripts written specifically for this purpose.
From the robustness and security perspectives, the commonly used audio watermarking techniques have limited resistance to various attacks (especially desynchronization attacks) and/or limited security against unauthorized detection. Thus, in this book we develop a new robust and secure audio watermarking algorithm, which is well explained and implemented in the MATLAB environment. This algorithm can embed imperceptible, robust, blind, and secure watermarks into digital audio files for the purpose of copyrights protection. In the developed algorithm, additional requirements such as data payload and computational complexity are also taken into account and detailed.
Apart from the improvement of audio watermarking algorithms, another highlight of this book is the exploration of benchmarking approaches to evaluate different algorithms in a fair and objective manner. For the application of copyrights protection, audio watermarking schemes are mainly evaluated in terms of imperceptibility, robustness, and security. In particular, the extent of imperceptibility is graded by perceptual quality assessment, which mostly involves a laborious process of subjective judgment. To facilitate automatic perceptual measurement, we explore a new method for reliably predicting the perceptual quality of watermarked audio signals. A comprehensive evaluation technique is illustrated to show the reader how to pinpoint the strengths and weaknesses of each technique. The evaluation techniques are supported with tested MATLAB codes.
In addition to extensively illustrating several commonly used audio watermarking algorithms for copyrights protection and improving the associated benchmarking approaches, this book makes the following new contributions:
• We introduce a spread spectrum based audio watermarking algorithm for copy-
rights protection, which involves Psychoacoustic Model 1, multiple scrambling,
adaptive synchronization, frequency alignment, and coded-image watermark.
In comparison with other existing audio watermarking schemes [1–10], the
proposed scheme achieves a better compromise between imperceptibility, robust-
ness, and data payload.
• We design a performance evaluation which consists of perceptual quality assess-
ment, robustness test, security analysis, estimations of data payload, and com-
putational complexity. The presented performance evaluation can serve as one
comprehensive benchmarking of audio watermarking algorithms.
• We portray objective quality measures adopted in speech processing for perceptual quality evaluation of audio watermarking. Compared to traditional perception modelling, objective quality measures provide a faster and more efficient alternative.
Chapter 5 evaluates the performance of the proposed scheme. First, perceptual quality assessment is carried out (including measurement of the SNR value). Then, the basic robustness test and the advanced robustness test (including a test with StirMark for Audio, a test under collusion, and a test under multiple watermarking) are carried out. In addition, a security analysis is followed by estimations of data payload and computational complexity. At the end of this chapter, a comparison between the proposed scheme and other reported systems is also presented.
Chapter 6 presents an investigation of objective quality measures for perceptual
quality evaluation in the context of different audio watermarking techniques. The
definitions of selected objective quality measures are described. In the experiments,
two types of Pearson correlation analysis are conducted to evaluate the performance
of these measures for predicting the perceptual quality of the watermarked audio
signals.
Contents

1 Introduction
   1.1 Information Hiding: Steganography and Watermarking
   1.2 Overview of Digital Watermarking
       1.2.1 Framework of the Digital Watermarking System
       1.2.2 Classifications of Digital Watermarking
       1.2.3 Applications of Digital Watermarking
             1.2.3.1 Copyrights Protection
             1.2.3.2 Content Authentication
             1.2.3.3 Broadcast Monitoring
             1.2.3.4 Copy Control
   1.3 Audio Watermarking for Copyrights Protection
       1.3.1 Requirements for the Audio Watermarking System
             1.3.1.1 Imperceptibility
             1.3.1.2 Robustness
             1.3.1.3 Security
             1.3.1.4 Data Payload
             1.3.1.5 Computational Complexity
       1.3.2 Benchmarking on Audio Watermarking Techniques
             1.3.2.1 Perceptual Quality Assessment
             1.3.2.2 Robustness Test
             1.3.2.3 Security Analysis

2 Principles of Psychoacoustics
   2.1 Physiology of the Auditory System
       2.1.1 The Outer Ear
       2.1.2 The Middle Ear
       2.1.3 The Inner Ear
   2.2 Sound Perception Concepts
       2.2.1 Sound Pressure Level and Loudness
       2.2.2 Hearing Range and Threshold in Quiet
       2.2.3 Critical Bandwidth

References
1 Introduction
Over the last decade, online distribution of digital multimedia including images, audio, video, and documents has proliferated rapidly. In this open environment, it is
easy to get free access to various information resources. Along with the convenience
and high fidelity by which digital formatted data can be copied, edited, and
transmitted, massive numbers of copyright infringements have arisen from illegal reproduction and unauthorized redistribution, which hinders the digital multimedia industry from progressing steadily [12]. To prevent these violations, the enforcement of ownership management has become an urgent necessity and is attracting more and
more attention. As a result, digital watermarking has been proposed to identify the
owner or distributor of digital data for the purpose of copyrights protection.
This chapter serves as an overall introduction to the book. First of all, background
knowledge on information hiding, focusing on the differences between steganogra-
phy and watermarking, is presented to ascertain the essence of watermarking. Then
an overview of digital watermarking technology, including system framework, clas-
sifications, and applications, is introduced. Afterward, we focus on the requirements
and benchmarking of audio watermarking for copyrights protection.
1.1 Information Hiding: Steganography and Watermarking

Information hiding is a general concept of hiding data in content. The term "hiding"
can be interpreted as either keeping the existence of the information secret or making
the information imperceptible [13]. Steganography and watermarking are two
important subdisciplines of information hiding. Steganography seeks for ways to
make communication invisible by hiding secrets in a cover message, whereas water-
marking originates from the need for the copyrights protection of the content [14].
The word steganography is derived from the Greek words steganos and graphia, which literally mean "covered writing." As defined in [13], steganography refers to the
practice of undetectably altering a cover to embed a secret message, i.e., conveying
hidden information in such a manner that nobody apart from the sender and intended
Y. Lin and W.H. Abdulla, Audio Watermark: A Comprehensive Foundation Using MATLAB, 1
DOI 10.1007/978-3-319-07974-5__1, © Springer International Publishing Switzerland 2015
recipient suspects the very existence of the message. Steganography has been used
in a number of ways throughout time, for example, hidden tattoos, invisible inks,
microdots, character arrangement, null ciphers, code words, covert channels, and
spread spectrum communication [15, 16].
Note that steganography appears to be akin to cryptography, but not synonymous.
Both cryptography and steganography are means to provide secrecy, but their
methods of concealment are different. In cryptography, the message is encrypted
to protect its content. One can tell that a message has been encrypted, but cannot
decrypt it without the proper cipher. Once the data are decrypted, the protection is
removed and there is no privacy any longer. In steganography, the message exists, but its presence is unknown to anyone apart from the communicating parties, such as an eavesdropping adversary. It is due to this lack of attention that the secret is well preserved. As stated in [15],
“A cryptographic message can be intercepted by an eavesdropper, however, the
eavesdropper may not even know a steganographic message exists.” Therefore,
steganography not only protects confidential information, as does cryptography,
but also keeps the communicating parties safe to some extent. In the meantime,
steganography and cryptography can be combined to provide two levels of security.
That is, we encrypt a message using cryptography and then hide the encryption
within the cover using steganography. This notion can be adopted in digital watermarking systems to increase security.
Watermarking refers to the practice of imperceptibly altering an object to embed
a message about that object [13], i.e., hiding specific information about the object
without noticeable perceptual distortion. Watermarking has a long history dating
back to the late thirteenth century, when “watermarks” were invented by paper
mills in Italy to indicate the paper brand or paper maker and also served as the
basis of authenticating paper. By the eighteenth century, watermarks began to be
used as anticounterfeiting measures on money and other documents. To this day, the most common form of paper watermark remains that found on banknotes in many countries. The
first example of a technology similar to our notion of watermarks—imperceptible
information about the objects in which they are embedded—was a patent filed for
“watermarking” musical works by Emil Hembrooke in 1954. He inserted Morse
code to identify the ownership of music, so that any forgery could be discerned. The
term “digital watermarking” is the outcome of the digital era, which appears to have
been first used by Komatsu and Tominaga in 1988. Since 1995, digital watermarking
has gained a lot of attention and has evolved very fast [13, 14].
Watermarking and steganography are two areas of information hiding with differ-
ent emphases. Both of them are required to be robust to protect the secret message.
However, secrecy in watermarking is not strictly necessary, whereas steganography
has to be secret by definition. For instance, it is preferred that everybody knows the
presence of the watermark on bills and can recognize it easily against the light.
Steganography requires only limited robustness as it generally relates to covert
point-to-point communication between trusting parties, while watermarking must
be quite robust to resist any attempts at removing the secret data as it is open
to the public. Furthermore, the concealed message in watermarking is related to
the cover object, which is itself of value. Therefore, no deterioration
of the perceptual quality of the object is desired. But this is not compulsory in
steganography, because the object there may be merely a carrier and has no intrinsic
value [13, 14].
In the next section, we focus on digital watermarking, that is, watermark-
ing applied to digital data. Key aspects will be discussed towards a deeper
understanding.
1.2 Overview of Digital Watermarking

Digitization over all fields of technology has greatly broadened the notion of
watermarking, and many new possibilities have been opened up. In particular, it
is possible to hide information within digital image, audio, and video files in an
unperceived and statistically undetectable sense. Driven by concerns over digital
rights management (DRM), a new technique called digital watermarking has been
put forward for intellectual property and copyrights protection [17, 18]. Digital
watermarking is not designed to reveal the exact relationship between copyrighted
content and the users, unless one violates its legal use.
Digital watermarking is the process of imperceptibly embedding watermark(s)
into digital media as permanent signs and then extracting the watermark(s) from
the suspected media to assure the authenticity [19]. The watermark(s) is always
associated with the digital media to be protected or to its owner, which means
that each digital media has its individual watermark or each owner has his/her
sole watermark. For the purpose of copyrights protection, the advantage of digital
watermarking over traditional steganography and cryptography is that digital media
can be used in an overt manner, despite the presence of watermarks. In other words,
we do not restrict the access to the watermarks residing in digital media, but make
extra efforts to enhance their robustness against various attacks.
It is worth mentioning that some researchers provide another term closely related
to the issue of copyrights protection, the so-called digital fingerprinting [20–23].
Fingerprints are characteristics of an object that tend to distinguish it from other
similar objects. In a strict sense, fingerprinting refers to the process of identifying
and recording fingerprints that are already intrinsic to the object [14]. (Although fingerprinting is sometimes also related to the practice of extracting inherent features that uniquely identify the content, we avoid that usage to prevent confusion [13].) It is often regarded as a form of forensic watermarking used to trace authorized users who distribute the content illicitly, i.e., the traitor tracing problem. Note that the greatest
differences between digital watermarking and digital fingerprinting are the origin
of hidden messages and operating mode. In digital watermarking, the watermark
is an arbitrary message containing the information on proprietorship, while the
fingerprint in digital fingerprinting is derived from the host itself and converted
into a unique but much shorter number or string. Essentially, digital fingerprinting
produces a metafile which describes the contents of the source file, so that a piece
of work can be easily found and compared against other works in the database
[17]. For this reason, digital fingerprinting was initially conceived for use in high-
speed searching. Somewhat differently, digital watermarking stemmed from the
motivation for copyrights protection of digital multimedia products. It is able to
stand alone as an effective tool for copyright enforcement.
[Fig. 1.1 Framework of the digital watermarking system:² the watermark generator produces the watermark signal w_s from the original watermark w_o and the watermark key k_w; the watermark embedder inserts w_s into the host signal s_o to produce the watermarked signal s_w; after transmission and communication, during which attacks AT(·) may occur, the watermark detector extracts the watermark w_e from the attacked signal s_a; a secret key k_s may be employed at both embedding and detection.]

² Hereafter, items indicated by dashed lines in the diagrams are optional.
The watermark embedder combines the watermark signal w_s with the host signal s_o, where the secret key k_s is employed to provide extra security, and outputs the watermarked signal s_w.
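Purely as an illustrative sketch (not the specific algorithm developed in this book), additive embedding might look as follows in MATLAB; the file name, watermark bits, key value, and embedding strength `alpha` are all assumptions:

```matlab
% Illustrative additive watermark embedding (generic sketch, not the
% algorithm developed in this book).
[so, Fs] = audioread('host.wav');   % host signal s_o (assumed file name)
so = so(:, 1);                      % use one channel
wo = [1 0 1 1 0 1 0 0];             % original watermark bits w_o (example)
ks = 12345;                         % secret key k_s (example value)

rng(ks);                            % seed the PRNG with the secret key
alpha = 0.005;                      % embedding strength (assumed)
N  = floor(length(so) / numel(wo)); % samples carrying each bit
ws = zeros(size(so));               % watermark signal w_s
for i = 1:numel(wo)
    pn  = sign(randn(N, 1));        % pseudorandom chip sequence
    seg = (i-1)*N + (1:N);
    ws(seg) = alpha * (2*wo(i) - 1) * pn;  % map bit {0,1} -> {-1,+1}
end
sw = so + ws;                       % watermarked signal s_w
```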
1.2.2 Classifications of Digital Watermarking

Compared with image and video watermarking, audio watermarking is more challenging owing to less redundancy in audio files and the high sensitivity of the human auditory system (HAS). With the rapid development of audio compression techniques,
audio products are becoming ever more popular on the Internet. Therefore, audio
watermarking has attracted more and more attention in recent years. Text document
watermarking also has applications wherever copyrighted electronic documents are
distributed [17, 24, 28].
• Imperceptible or perceptible
For images and video, perceptible watermarks are visual patterns such as logos merged into one corner of the images, visible but not obstructive. Although perceptible watermarking is easy to implement in practice, it is not the focus of digital watermarking. As defined before, digital watermarking intends to imperceptibly embed the watermark into digital media [25].
• Robust, semi-fragile, or fragile
Watermark robustness accounts for the capability of the watermark to survive
various manipulations. A robust watermark is a watermark that is hard to remove
without deterioration of the original digital media. It is usually involved in copy-
rights protection, ownership verification, or other security-oriented applications.
Conversely, a fragile watermark is a watermark that is vulnerable to any modification, mainly for the purpose of data authentication. Between these two extremes, a semi-fragile watermark is marginally robust and moderately sensitive to some attacks [24–26].
• Blind (public) or non-blind (private)
Blind (public) digital watermarking does not require the host signal for watermark
detection. On the contrary, digital watermarking that requires the host signal to
extract the watermark is non-blind (private). Generally, watermark detection is more robust if the original unwatermarked data are available. However, access to the original host signal cannot be guaranteed in most real-world scenarios. Therefore, blind watermarking is more flexible and practical [24, 28].
• Nonreversible or reversible
In reversible watermarking, the watermark can be completely removed from the watermarked signal, thus allowing an exact recovery of the host signal. However, the price of such reversibility is some loss of robustness and security. Nonreversible watermarking usually introduces a slight but irreversible degradation in the original signal. Watermark reversibility need only be considered in applications where complete restoration of the host signal is in great demand [24, 27].
1.2.3 Applications of Digital Watermarking

1.2.3.1 Copyrights Protection
The exploration of digital watermarking was driven by the desire for copyrights
protection. The idea is to embed a watermark with copyright information into the
media. When proprietorial disputes happen, the watermark can be extracted as
reliable proof to make an assertion about the ownership. To this end, the watermark
must be inseparable from the host and robust against various attacks intended
to destroy it. Moreover, the system requires a high level of security to survive statistical detection. With these properties, the owner can demonstrate the presence of the watermark to claim the copyright on the disputed media. In addition,
since it is not necessary for the watermark to be very long, the data payload for this
application does not have to be high [29, 30].
1.2.3.2 Content Authentication

In authentication applications, the objective is to verify whether the content has been
tampered with or not. Since the watermarks undergo the same transformations as
the host media, it is possible to learn something about the occurrences by looking at
the resulting watermarks. For this purpose, fragile watermarks with a low robustness
are commonly employed. If the content is manipulated in an illegal fashion, fragile
watermarks will be changed to reveal that the content is not authentic [13, 14].
1.2.3.3 Broadcast Monitoring

The target of broadcast monitoring is to collect information about the content being
broadcast. This information is then used as the evidence to verify whether the
content was broadcast as agreed or for some other purposes, such as billing or
statistical analysis for product improvement. In this case, the robustness of the
watermark is not a major concern due to a lower risk of distortion. Instead, imperceptibility, i.e., transparent or unnoticeable watermarks, is the more important requirement [13, 31].
1.3 Audio Watermarking for Copyrights Protection

1.3.1 Requirements for the Audio Watermarking System
The audio watermarking system for copyrights protection has to comply with
the following main requirements: excellent imperceptibility for preserving the
perceptual quality of the audio file, strong robustness against various attacks,
and high-level security for preventing unauthorized detection. Data payload and
computational complexity are two additional criteria [30].
1.3.1.4 Data Payload
Data payload refers to the amount of bits carried within a unit of time [13]. In digital audio watermarking, it is defined as the number of bits embedded in one second of audio, expressed in bits per second (bit/s or bps). The data payload of an audio watermarking system varies greatly, depending on the embedding parameters.³

³ Random samples cropping includes deliberate removal of the header or footer of a signal; therefore, the watermark should be spread throughout the entire audio signal.
1.3.2 Benchmarking on Audio Watermarking Techniques

1.3.2.1 Perceptual Quality Assessment
Similar to evaluating the quality of perceptual codecs in the audio, image, and video fields [39], perceptual quality assessment of watermarked audio files is usually classified into two categories: subjective listening tests by human
acoustic perception and objective evaluation tests by perception modelling or quality
measures. Both of them are indispensable to the perceptual quality evaluation of
audio watermarking.
As perceptual quality is essentially decided by human opinion, subjective
listening tests on audiences from different backgrounds are required in most
applications [39]. In subjective listening tests, the subjects are asked to discern
the watermarked and host audio clips. Two popular modes are the ABX test [40, 41] and the MUSHRA test (MUlti Stimuli with Hidden Reference and Anchors) [42], derived from ITU-R Recommendations BS.1116 [43] and BS.1534 [44],⁴ respectively. Moreover, the watermarked signal is graded relative to the host signal according to a five-grade impairment scale (see Table 1.2) defined in ITU-R BS.562.⁵ The result is known as the subjective difference grade (SDG), which equals the difference between the subjective ratings given separately to the watermarked and host signals. Therefore, an SDG near 0 means that the watermarked signal is perceptually indistinguishable from the host signal, whereas an SDG near −4 indicates a seriously distorted watermarked signal.
However, such audibility tests are not only costly and time-consuming, but also
heavily depend on the subjects and surrounding conditions [46]. Therefore, the
industry desires the use of objective evaluation tests to achieve automatic perceptual
measurement. Currently, the most commonly used objective evaluation is perception
modelling, i.e., assessing the perceptual quality of audio data via a simulated ear,
such as Evaluation of Audio Quality (EAQUAL) [47], Perceptual Evaluation of
Audio Quality (PEAQ) [48], and Perceptual Model-Quality Assessment (PEMO-Q)
[49]. Moreover, objective quality measures are exploited as an alternative approach
to quantify the dissimilarities caused by audio watermarking. For instance, a widely
used quality measure is the signal-to-noise ratio (SNR), calculated as follows [50]:
⁴ ITU-R: Radiocommunication Sector of the International Telecommunication Union; BS: Broadcasting service (sound).
⁵ ITU-R BS.562 has been replaced by ITU-R BS.1284 [45].
$$\mathrm{SNR}(s_w, s_o) = 10\log_{10}\frac{\sum_n [s_o(n)]^2}{\sum_n [s_w(n)-s_o(n)]^2},\tag{1.3}$$

where $\sum_n [s_o(n)]^2$ is the power of the host signal $s_o$ and $\sum_n [s_w(n)-s_o(n)]^2$ is the power of the noise caused by watermarking.
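Eq. (1.3) translates directly into MATLAB; the helper below is a minimal sketch (the function name is assumed, not taken from the book's scripts):

```matlab
% SNR between the host and watermarked signals, per Eq. (1.3).
function snr_db = watermark_snr(so, sw)
    noise  = sw - so;                                % watermarking-induced noise
    snr_db = 10 * log10(sum(so.^2) / sum(noise.^2)); % Eq. (1.3)
end
```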
1.3.2.2 Robustness Test

The goal of the robustness test is to assess the ability of a watermarking system to resist signal modifications encountered in real applications. In the robustness test, various attacks are applied to the watermarked signal to produce a number of attacked signals. Then, watermark detection is performed on each attacked signal to check whether the embedded watermark survives. In particular, the detection result is quantified by the bit error rate (BER), i.e., the ratio of erroneously detected watermark bits to the total number of embedded bits.
⁶ As described in Sect. 3.1.3.2, multiple watermarking embeds several watermarks sequentially.
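As a minimal sketch of the BER computation (the function name is assumed), where the embedded and extracted watermarks are vectors of bits:

```matlab
% Bit error rate (BER) between the embedded and extracted watermarks.
function ber = bit_error_rate(w_embedded, w_extracted)
    ber = sum(w_embedded(:) ~= w_extracted(:)) / numel(w_embedded);
end
```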
2 Principles of Psychoacoustics

2.1 Physiology of the Auditory System
Hearing is the sense by which sound is perceived [52]. Human hearing is performed
primarily by the auditory system, in which the peripheral part is of more relevance
to our study. The peripheral auditory system (the ear, that portion of the auditory
system not in the brain [53]) includes three components: the outer ear, the middle
ear, and the inner ear, as illustrated in Fig. 2.1.
The whole process of capturing the sound through the ear to create neurological
signals is an intricate and ingenious procedure. First, the sound wave travels through
the auditory canal and causes the eardrum to vibrate. This vibration is transmitted
via the ossicles of the middle ear to the oval window at the cochlea inlet. The
movement of the oval window forces the fluid in the cochlea to flow, which results in
the vibration of the basilar membrane that lies along the spiral cochlea. This motion
causes the hair cells on the basilar membrane to be stimulated and to generate neural
responses carrying the acoustic information. Then, the neural impulses are sent to
the central auditory system through the auditory nerves to be interpreted by the brain
[54, 55].
2.1.1 The Outer Ear

Sounds reach the auditory system via the outer ear. The pinna and its deep
center portion, the concha, constitute the externally visible part of the outer ear that
serves to focus the sound waves at the entrance of the auditory canal (or auditory
meatus). Since human pinna has no useful muscles, it is nearly immobile. Therefore,
the head must be reoriented towards the direction of acoustical disturbance for a
better collection and localization of sound. The auditory canal (usually 2–3 cm in
length) is a tunnel through which the sound waves are conducted, and it is closed
with the eardrum (or tympanic membrane).¹ The eardrum is stretched tightly across
the inner end of the auditory canal and is pulled slightly inward by structures in the
middle ear [58]. Upon travelling through the auditory canal, sound waves impinge
on the eardrum and cause it to vibrate. Then, these mechanical vibrations which
respond to the pressure fluctuations of acoustic stimuli are passed along to the
middle ear.
The outer ear plays an important role in human hearing. The pinna is of great
relevance to sound localization, since it reflects the arriving sound in ways that
depend on the angle of the source. The resonances occurring in the concha and
auditory canal bring about an increase in sound pressure level (SPL) for frequencies between 1.5 kHz and 7 kHz. The extent of amplification depends on both the frequency and the angle of the incident wave, as indicated in Fig. 2.2. For example, the gain is about 10–15 dB in the frequency range from 1.5 kHz to 7 kHz at an azimuthal angle of 45°. Moreover, the outer ear protects the eardrum and the middle ear against
extraneous bodies and changes in humidity and temperature [59].
2.1.2 The Middle Ear

The eardrum vibrations are transferred through the middle ear to the inner ear. The middle ear is an air-filled chamber, bounded by the eardrum laterally and
by the oval window of the cochlea medially. It contains three tiny bones known
as the ossicles: the malleus (or hammer), incus (or anvil), and stapes (or stirrup).
These three ossicles are interconnected sequentially and suspended in the middle
ear cavity by ligaments and muscles. As shown in Fig. 2.1, the malleus is fused to
the eardrum and articulates with the incus; the incus is connected to both the other
bones; the stapes is attached to the incus and its footplate fits into the oval window
of the cochlea. The oval window is a membrane-covered opening which leads from
the middle ear to the vestibule of inner ear.
As an interface between the outer and inner ears, the middle ear has two functions.
One function is to serve as an impedance-matching transformer that ensures an
efficient transmission of sound energy. As we know, the outer and middle ear
cavities are filled with air, while the inner ear is filled with fluid. So the passage of
pressure waves from the outer ear to the inner ear involves a boundary between air
and fluid, two media with different acoustic impedances.² In fact, approximately
99.9 % of sound energy incident on air/fluid boundary is reflected back within the
air medium, so that only 0.1 % of the energy is transmitted to the fluid. It means that
¹ In this sense, the auditory canal, closed with the eardrum at its proximal end, has the configuration of a resonator.
² Acoustic impedance is a constant related to the propagation of sound waves in an acoustic medium. Technically, sound waves encounter much less resistance when travelling in air than in fluid.
[Fig. 2.2 Average pressure levels at the auditory canal entrance versus free-field pressure, at six azimuthal angles of incidence (θ = 0°, 45°, 90°, 180°, 270°, 315°) [60]. Notes: (1) The sound pressure was measured with a probe tube located at the left ear of the subject. (2) A point source of sound was moved around a horizontal circle of radius 1 m with the subject's head at the center. At θ = 0° the subject was facing the source, and at θ = 90° the source was normally incident at the plane of the left ear.]
if sound waves were to hit the oval window directly, the energy would undergo a loss
of 30 dB before entering the cochlea. To minimize this reduction, the middle ear has
two features to match up the low impedance at the eardrum with high impedance at
the oval window. The first is related to the relative sizes of the eardrum and the stapes
footplate which clings to the oval window. The effective area of the eardrum is about
55 mm² and that of the footplate is about 3.2 mm²; thus they differ in size by a factor of about 17 (55 mm²/3.2 mm² ≈ 17). So, if all the force exerted on the eardrum is
transferred to the footplate, then the pressure (force per unit area) at the oval window
is 17 times greater than at the eardrum. The second depends on the lever action of the
ossicular chain that amplifies the force of the incoming auditory signals. The lengths
of the malleus and incus correspond to the distances from the pivot to the applied
and resultant forces, respectively. Measurements indicate that the ossicles, acting as a lever system, increase the force at the eardrum by a factor of 1.3. Consequently, the
combined effect of these actions effectively counteracts the reduction caused by
the impedance mismatch [58]. Another function of the middle ear is to diminish
the transmission of bone-conducted sound to the cochlea by muscle contraction. If these sounds were sent over to the cochlea, they would appear so loud that they might be harmful to the inner ear [61].
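Taken together, the area ratio (17) and the lever ratio (1.3) multiply to a pressure gain of about 22. As a quick worked check (our arithmetic, not a figure from the book):

$$20\log_{10}(17 \times 1.3) = 20\log_{10}(22.1) \approx 27\ \mathrm{dB},$$

which roughly offsets the 30 dB loss at the air/fluid boundary.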
2.1.3 The Inner Ear

The inner ear transduces the vibratory stimulation from the middle ear to neural
impulses which are transmitted to the brain. The vestibular apparatus and the
cochlea are the main parts in the inner ear. The vestibular apparatus is responsible
for the sense of balance. It includes three semicircular canals and the vestibule. The
cochlea is the central processor of the ear, where the organ of corti, the sensory
organ of hearing, is located. The cochlea is a spiral-shaped bony tube structure of
decreasing diameter, which coils up 2¾ times around a middle core containing the auditory nerve, as shown in Fig. 2.3a.³ The duct is filled with almost incompressible
fluids and is enclosed by the oval window (the opening to the middle ear) and the
round window (a membrane at the rear of the cochlea). When the stapes pushes back
and forth on the oval window, the motion of the oval window causes the fluid to flow
and impels the round window to move reciprocally, which leads to variations of the fluid pressure in the cochlea. The movements of the oval and round windows are
indicated by the solid and dotted arrows in Fig. 2.3a.
Figure 2.3c shows the cross-section through one cochlea turn. Two membranes,
Reissner’s membrane and the basilar membrane, divide the cochlea along the spiral
direction into three fluid-filled compartments: scala vestibuli, scala media, and scala
tympani. The scala vestibuli and scala tympani are merged through a small opening
called helicotrema at the apex, and they contain the same fluid (the perilymph) with
most of the nervous system. The scala media is segregated from other scalae and
contains a different fluid (the endolymph). On the scala media surface of basilar
membrane (BM) lies the organ of corti. The changes of fluid pressure in the cochlea
will cause the BM to deform, so that the hair cells⁴ on the organ of corti are
³ Note that the cochlea is a cavity within the skull, not a structure by itself [58]. Hence the unraveled cochlea in Fig. 2.3b is impossible in practice and is shown only for the sake of illustration.
⁴ The hair cells, including the outer and inner hair cells (OHC and IHC), are auditory receptors on the organ of corti.
[Fig. 2.3 Anatomy of the cochlea: (a) relative location of the cochlea in the inner ear [61]; (b) schematic of the unraveled cochlea, showing the oval window, helicotrema, and round window; (c) cross-section through one cochlea turn [65], showing the scala vestibuli, scala media, scala tympani, Reissner's membrane, and the basilar membrane.]
stimulated to transduce the movement of the BM into neural impulses. Then the neural signals are carried over to the brain via the auditory nerve, which ultimately leads to the perception of sound.
The basilar membrane extends along the spirals of the cochlea and is about 32 mm
long. It is relatively narrow and stiff at the base (near the windows), while it becomes wider and more flexible at the apex (near the helicotrema). Accordingly, each location on the BM has a different vibratory amplitude in response to sounds of
different frequencies, which means that each point resonates at a specific charac-
teristic frequency (CF) [54]. As exemplified in Fig. 2.4a, for high-frequency tones,
the maximum displacement of the BM occurs near the base, with tiny movement
on the remainder of the membrane. For low-frequency tones, the vibration travels
all the way along the BM, reaching its maximum close to the apex.⁵ Figure 2.4b shows the distribution of resonant frequencies along the BM.

⁵ One fact is worth attention: any location on the BM will also respond to a wide range of tones lower than its CF. That is why low frequencies are less selective than high frequencies.

[Fig. 2.4 Resonant properties of the basilar membrane: (a) envelopes of vibration patterns on the basilar membrane in response to sounds of different frequencies [66]; (b) distribution of resonant frequencies along the basilar membrane [64].]

When two tones are sufficiently far apart in frequency, the BM separates them and responds to each of them individually. That is, there are two vibration peaks along the BM, at the positions identical
to where they would be if two tones were presented independently. However, if two
tones are quite close in frequency, the basilar membrane would fail to separate the
combination into two components, which results in the response with one fairly
broad peak in displacement instead of two single peaks [58]. How far apart two tones must be in frequency in order to be discriminated depends on the critical bands and critical bandwidths discussed next.
2.2 Sound Perception Concepts

Sounds are rapid variations in pressure, which are propagated through the air away
from acoustic stimulus. Our sense of hearing allows us to perceive sound waves
of frequencies between about 20 Hz and 20 kHz. As discussed for the mechanism of the human ear, the perception of sound involves a complex chain of events to read the information from sound sources. Naturally, we are often surrounded by a
mixture of various sounds and the perception of one sound is likely to be obscured
by the presence of others. This phenomenon is called auditory masking, which is fundamental to psychoacoustic modelling. Here, some basic terms related to
auditory masking are introduced.
2.2.1 Sound Pressure Level and Loudness

Sound reaches the human ear in the form of pressure waves varying in time, s(t). Physically, the pressure p is defined as force per unit area, and its unit in the MKS system is the Pascal (Pa), where 1 Pa = 1 N/m². Likewise, the intensity I is defined as power per unit area, with units of W/m². In psychoacoustics, values of sound pressure vary from about 10⁻⁵ Pa (ATH, absolute threshold of hearing) to 10² Pa (threshold of pain). To cover such a broad range, the sound pressure level (SPL) is defined in logarithmic units (dB) as

$$L_{\mathrm{SPL}}/\mathrm{dB} = 10\log_{10}\left(\frac{p}{p_0}\right)^{2} = 10\log_{10}\frac{I}{I_0},\tag{2.1}$$

where $p_0 = 20\ \mu\mathrm{Pa}$ is the reference sound pressure and $I_0 = 10^{-12}\ \mathrm{W/m^2}$ is the reference intensity.
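As a small numerical sketch of Eq. (2.1) (the anonymous function is our own, using the standard reference pressure):

```matlab
% SPL from sound pressure in Pa, per Eq. (2.1), with p0 = 20 uPa.
p0  = 20e-6;                  % reference sound pressure (Pa)
spl = @(p) 20*log10(p ./ p0); % 10*log10((p/p0).^2) = 20*log10(p/p0)
spl(1)                        % 1 Pa corresponds to about 94 dB SPL
```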
[Fig. 2.5 Equal-loudness contours from 10 to 120 phons, together with the minimum audible field (MAF) curve; sound pressure level (dB) versus frequency.]

Each equal-loudness
contour represents SPLs required at different frequencies in order that all tones on
the contour are perceived as equally loud [68]. For example, a 100 Hz tone at 50 dB SPL on the 20-phon contour is perceived to be as loud as a 1 kHz tone at 20 dB SPL. In Fig. 2.5, the equal-loudness contours at high loudness levels (e.g., 120 phons) deviate less from the maximum-sensitivity region than those at low levels (e.g., 10 phons). This indicates that the sensitivity of the HAS to frequency changes is relatively higher at low loudness levels than at high ones. Hence, complex sounds with identical frequency and phase components might sound different due to variations in loudness [58].
2.2.2 Hearing Range and Threshold in Quiet

[Fig. 2.6 The human hearing area, bounded below by the threshold in quiet and above by the limit of damage risk, with the region of music indicated; sound pressure level (dB), sound intensity (W/m²), and sound pressure (Pa) versus frequency.]
The threshold in quiet specifies the SPL required for a tone to be just audible in a noiseless environment; the HAS cannot perceive sounds at SPLs below that threshold. In other words, frequency components that fall below the threshold in quiet are insignificant to our perception of sound and need not be processed [51]. This property is crucial to the development of the psychoacoustic model, where the threshold in quiet is approximated by the following frequency-dependent function:
$$\text{Threshold in Quiet}(f)/\mathrm{dB} = 3.64\left(\frac{f}{1000}\right)^{-0.8} - 6.5\,\exp\left[-0.6\left(\frac{f}{1000}-3.3\right)^{2}\right] + 10^{-3}\left(\frac{f}{1000}\right)^{4},\tag{2.2}$$
as plotted on both linear and logarithmic scales in Fig. 2.7. Regarding Eq. (2.2), one point to note is that it only applies to the frequency range 20 Hz ≤ f ≤ 20 kHz.
[Fig. 2.7 Approximation for the threshold in quiet: (a) frequency on a linear scale; (b) frequency on a logarithmic scale.]
2.2.3 Critical Bandwidth

The audible frequency range can be partitioned into about 24 critical bands, as tabulated in Appendix D. The resulting scale is called the critical band rate scale and its unit is the Bark. One Bark represents one critical band and corresponds to a distance along the basilar membrane of about 1.3 mm.⁶ Considering the nonlinear spacing of resonant frequencies on the basilar membrane, it is expected that critical bandwidths are nonuniform, varying as a function of frequency. The following equation describes the dependence of the Bark scale on frequency [11]:
$$z/\mathrm{Bark} = 13\arctan\left(\frac{0.76\,f_l}{1000}\right) + 3.5\arctan\left[\left(\frac{f_l}{7500}\right)^{2}\right],\tag{2.3}$$
where $f_l$ is the lower frequency limit of the critical bandwidth. For example, the threshold in quiet of Fig. 2.7 is plotted on the Bark scale in Fig. 2.8.
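Eq. (2.3) translates directly into MATLAB; the anonymous function below is an illustrative helper (the name is assumed):

```matlab
% Frequency (Hz) to critical band rate (Bark), per Eq. (2.3).
hz2bark = @(f) 13*atan(0.76*f/1000) + 3.5*atan((f/7500).^2);
hz2bark(2000)   % a 2 kHz tone lies at roughly 13 Bark
```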
Note that each critical bandwidth depends only on the center frequency of the passband. This is demonstrated in Fig. 2.9, where the critical bandwidth at 2 kHz is measured. As shown in Fig. 2.9a, the hearing threshold is flat at about 33 dB until the two tones are about 300 Hz apart, and then it drops off rapidly. A similar result is obtained from Fig. 2.9b: the hearing threshold is rather flat at about 46 dB until the two noises are more than 300 Hz apart [51]. Consequently, the critical bandwidth is 300 Hz for a center frequency of 2 kHz. It is worth mentioning that the threshold in Fig. 2.9b is at 46 dB versus only 33 dB in Fig. 2.9a, which means narrowband noises reduce audibility more than tones do.

⁶ The whole length of the 32 mm basilar membrane divided by 24 critical bands gives about 1.3 mm per band.
[Fig. 2.8 The threshold in quiet plotted on the critical band rate scale (Bark).]

[Fig. 2.9 Determination of the critical bandwidth [11]: (a) the threshold for a narrowband noise at 2 kHz centered between two tones of 50 dB, as a function of the frequency separation between the two tones; (b) the threshold for a 2 kHz tone centered between two narrowband noises of 50 dB, as a function of the frequency separation between the cutoff frequencies of the two noises.]
This fact is referred to as the "asymmetry of masking"; more details will be discussed in the next section.
On the basis of experimental data, an analytic expression is derived to better describe the critical bandwidth $\Delta f$ as a function of the center frequency $f_c$ [11]:

$$\Delta f/\mathrm{Hz} = 25 + 75\left[1 + 1.4\left(\frac{f_c}{1000}\right)^{2}\right]^{0.69}.\tag{2.4}$$
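As an illustrative one-liner for Eq. (2.4) (the function name is assumed):

```matlab
% Critical bandwidth (Hz) at center frequency fc (Hz), per Eq. (2.4).
cbw = @(fc) 25 + 75*(1 + 1.4*(fc/1000).^2).^0.69;
cbw(2000)   % about 300 Hz, matching the measurement in Fig. 2.9
```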
2.3 Auditory Masking
Due to the effect of auditory masking, the perception of one sound is related to not
only its own frequency and intensity, but also its neighbor components. Auditory
masking refers to the phenomenon that one faint but audible sound (the maskee)
becomes inaudible in the presence of another louder audible sound (the masker).
It has a great influence on hearing sensation and involves two types of masking,
i.e., simultaneous masking and nonsimultaneous masking (including pre-masking
and post-masking) as displayed in Fig. 2.10. Due to auditory masking, any signals
below these curves cannot be heard. Therefore, by virtue of auditory masking, we
can modify audio signals in a certain way without perceptible deterioration, as long as the modifications can be properly "masked." This notion is the essence of audio watermarking [70, 71].
[Fig. 2.10 Types of auditory masking: simultaneous masking while the masker is on, pre-masking before its onset, and post-masking after its removal (time axis roughly −20 to 250 ms); a masker S0 raises the masking threshold above the threshold in quiet, so that sound S1 is only partially masked while sounds S2 and S3, lying below the masking threshold, become inaudible.]
Most often, the case of NMT (noise-masking-tone) occurs, where the masker is a narrowband noise⁷ and the maskees are tones located in the same critical band.

⁷ Here, narrowband means a bandwidth equal to or smaller than a critical band.

[Fig. 2.12 Masking thresholds for a 60 dB narrowband noise masker centered at different frequencies [51].]

Figure 2.12 shows the masking thresholds for a narrowband noise masker masking tones, where the noise is
at an SPL of 60 dB and centered at 0.25, 1, and 4 kHz, respectively. In the graph, solid lines represent masking thresholds, and the dashed line at the bottom is the threshold in quiet.⁸ The masking thresholds have a number of important features. For example, the form of the curve varies with different maskers, but always reaches a maximum near the masker's center frequency. This means that the amount of masking is greatest when the maskee is located at the same frequency as the masker. The masking ability of a masker is indicated by the minimum signal-to-mask ratio (SMR), i.e., the minimum difference in SPL between the masker and its masking threshold. Therefore, a higher minimum SMR implies less masking. Another point is that a low-frequency masker produces a broader masking threshold and provides more masking than a high-frequency one. Here, the 0.25, 1, and 4 kHz thresholds have minimum SMRs of 2, 3, and 5 dB, respectively.
Figure 2.12 is sketched in normal frequency units, where the masking thresholds of different frequencies are dissimilar in shape. If graphed on the Bark scale, all the masking thresholds look similar in shape, as shown in Fig. 2.13.⁹ In this case, it is easier to model the masking threshold by the use of the so-called spreading function of Sect. 2.4.1.1. As a result, the Bark scale is widely used in the area of auditory masking.
Moreover, the masking thresholds from a 1 kHz narrowband noise masker at different SPLs, $L_{CB}$, are outlined in Fig. 2.14. Although the SPLs of the maskers differ, the minimum SMR remains constant at around 3 dB, corresponding to the value in Fig. 2.12. This means that the minimum SMR in NMT depends solely on the center frequency of the masker. Also notice that the masking threshold becomes more asymmetric around the center frequency as the SPL increases. At frequencies lower than 1 kHz, all the curves have a steep rise. But at frequencies higher than 1 kHz,
⁸ Hereafter, this applies to all the graphs in Sect. 2.3.
⁹ For illustration, all the curves are shifted upward to the masker's SPL (60 dB).
[Fig. 2.13 Masking thresholds for a 60 dB narrowband noise masker centered at different frequencies, on the Bark scale [51].]
[Fig. 2.14 Masking thresholds from a 1 kHz narrowband noise masker at different SPLs [51].]
the slopes for maskers at higher SPLs decrease more gradually. Recalling Fig. 2.4a, it is reasonable to expect that a masker is good at masking tones whose frequencies are higher than its own frequency, rather than lower-frequency tones [58]. To show the similarity in shape over all the masking thresholds, Fig. 2.15 plots the curves on the Bark scale again.
[Fig. 2.15 Masking thresholds from a 1 kHz narrowband noise masker at different SPLs, on the Bark scale [51].]

The early work on auditory masking started from experiments on tones masking tones within the same critical band. Since both the masker and maskee are pure tones, their interference is likely to result in the occurrence of beats. Therefore,
besides the masker and maskee, additional beating tones become audible and accordingly disturb the subjects. Figure 2.16 shows the masking thresholds from a 1 kHz tonal masker at different SPLs. During the course of approaching 1 kHz, the maskee was set 90° out of phase with the masker to prevent beating. Similar to Fig. 2.14, the masking thresholds also spread more broadly towards higher frequencies than lower ones. However, an obvious difference lies in the minimum SMR: roughly 15 dB in Fig. 2.16 versus about 3 dB in Fig. 2.14. This indicates that narrowband noise is a better masker than a pure tone, referred to as the "asymmetry of masking" [73]. This fact has actually been demonstrated in Fig. 2.9 already: the masking threshold for the narrowband noise masker in Fig. 2.9b is valued at 46 dB, higher than the 33 dB for the tonal masker in Fig. 2.9a. So in psychoacoustic modelling, we should identify whether frequency components are noise-like or tone-like and then calculate their masking thresholds separately.
[Fig. 2.16 Masking thresholds from a 1 kHz tonal masker at different SPLs [51].]
2.3.2 Nonsimultaneous Masking

In addition to simultaneous masking, auditory masking can also take place when
the maskee is present immediately preceding or following the masker. This is called
nonsimultaneous masking or temporal masking. As exemplified in Fig. 2.10, one
200 ms masker masks a tone burst with very short duration relative to the masker.
There are two kinds of nonsimultaneous masking: (1) pre-masking or backward
masking, occurring just before the onset of masker, and (2) post-masking or forward
masking, occurring after the removal of masker. In general, the physiological
basis of nonsimultaneous masking is that the auditory system requires a certain
integration time to build the perception of sound, where louder sounds require longer
integration intervals than softer ones [51].
2.4 Psychoacoustic Model
The knowledge of auditory masking provides the foundation for developing psy-
choacoustic models. In psychoacoustic modelling, we use empirically determined
masking models to analyze which frequency components contribute more to the
masking threshold and how much “noise” can be mixed in without being perceived.
This notion is applicable to audio watermarking, of which the imperceptibility is
one prerequisite. Typically, in some audio watermarking techniques such as spread
spectrum watermarking [74, 75] and wavelet domain watermarking [7, 76], the
watermark signal is added to the host signal as a faint additive noise. To keep
the watermarks inaudible, we often utilize the minimum masking threshold (MMT) calculated from the psychoacoustic model to shape the amplitude of the watermark signal.
Models for the spreading of masking are developed to delineate the excitation patterns of maskers. As noticed from the two examples of excitation patterns in Figs. 2.13 and 2.15, the shapes of the curves are quite similar and easy to describe on the Bark scale, because the Bark scale is linearly related to basilar membrane distances. Accordingly, we define the spreading function $SF(dz)$ as a function of the difference between the maskee and masker frequencies on the Bark scale, $dz/\mathrm{Bark} = z(f_{\mathrm{maskee}}) - z(f_{\mathrm{masker}})$. Apparently, $dz \ge 0$ when the masker is located at a lower frequency than the maskee, and $dz < 0$ when the masker is located at a higher frequency than the maskee.
There are a number of spreading functions introduced to imitate the characteristics of maskers. For instance, the two-slope spreading function is the simplest one, using a triangular function:

$$10\log_{10} SF(dz)/\mathrm{dB} = \begin{cases}\left[-27 + 0.37\max\{L_M - 40,\, 0\}\right]dz, & dz \ge 0\\ 27\,dz, & dz < 0,\end{cases}\tag{2.5}$$

where $L_M$ is the SPL of the masker. Schroeder proposed a smooth analytic spreading function:

$$10\log_{10} SF(dz)/\mathrm{dB} = 15.81 + 7.5\,(dz + 0.474) - 17.5\sqrt{1 + (dz + 0.474)^2}.\tag{2.6}$$

The spreading function utilized in ISO/IEC MPEG Psychoacoustic Model 2 is a modified form of Eq. (2.6):

$$\begin{aligned}10\log_{10} SF(dz)/\mathrm{dB} ={}& 15.8111389 + 7.5\,(1.05\,dz + 0.474) - 17.5\sqrt{1 + (1.05\,dz + 0.474)^2}\\ &+ 8\min\left[0,\,(1.05\,dz - 0.5)^2 - 2\,(1.05\,dz - 0.5)\right].\end{aligned}\tag{2.7}$$
It should be noted that the two spreading functions, Eqs. (2.6) and (2.7), are independent of the masker's SPL, which is advantageous for reducing computation when generating the overall masking threshold.
The spreading function utilized in ISO/IEC MPEG Psychoacoustic Model 1¹⁰ is different from that of Psychoacoustic Model 2:

$$10\log_{10} SF(dz)/\mathrm{dB} = \begin{cases} 17\,dz - 0.4\,L_M + 11, & -3 \le dz < -1\\ (0.4\,L_M + 6)\,dz, & -1 \le dz < 0\\ -17\,dz, & 0 \le dz < 1\\ -17\,dz + 0.15\,L_M\,(dz - 1), & 1 \le dz < 8. \end{cases}\tag{2.8}$$
¹⁰ ISO: International Organization for Standardization; IEC: International Electrotechnical Committee; MPEG: Moving Picture Experts Group.
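Eq. (2.8) can be written as a small MATLAB function; the following is an illustrative sketch (the function name and the −Inf return outside the −3 to 8 Bark support are our assumptions):

```matlab
% Psychoacoustic Model 1 spreading function, per Eq. (2.8):
% dz = z(maskee) - z(masker) in Bark, LM = masker SPL in dB.
function sf = spreading_model1(dz, LM)
    if dz >= -3 && dz < -1
        sf = 17*dz - 0.4*LM + 11;
    elseif dz >= -1 && dz < 0
        sf = (0.4*LM + 6)*dz;
    elseif dz >= 0 && dz < 1
        sf = -17*dz;
    elseif dz >= 1 && dz < 8
        sf = -17*dz + 0.15*LM*(dz - 1);
    else
        sf = -Inf;   % outside [-3, 8) Bark, masking is negligible (assumed)
    end
end
```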
[Fig. 2.17 Spreading functions in Psychoacoustic Model 1 for masker levels L_M = 20, 40, 60, 80, and 100 dB; sound pressure level (dB) versus critical band rate (Bark).]
Figure 2.17 shows the spreading functions in Model 1 for different levels of the masker. It is seen that the higher the SPL of the masker, the more asymmetric the curve looks. Specifically, higher frequencies exhibit more masking than lower frequencies when the level of the masker is high. This piecewise linear spreading function is a good approximation to the TMT masking thresholds in Fig. 2.16.
In addition, the four spreading function models described above, i.e., the two-slope SF, Schroeder SF, Psychoacoustic Model 1 SF, and Model 2 SF, are compared at a level of 80 dB in Fig. 2.18. Among these four models, the two-slope spreading function is the most conservative one, and the Model 1 spreading function allows for more upward spreading of masking than the others [51].
[Fig. 2.18 Comparison of the two-slope, Schroeder, Model 1, and Model 2 spreading functions at a masker level of 80 dB; sound pressure level (dB) versus critical band rate (Bark).]
[31, 78, 79]. Hence, Psychoacoustic Model 1 for Layer I is later employed in our
audio watermarking scheme in consideration of its higher efficiency.
In our case, the input to Psychoacoustic Model 1 is one frame of audio signal
and the corresponding output is its MMT. The whole procedure of implementation
consists of six steps [72, 73, 77, 80]:
1. FFT analysis and SPL normalization
2. Identification of tonal and nontonal maskers
3. Decimation of invalid tonal and nontonal maskers
4. Calculation of individual masking thresholds
5. Calculation of global masking threshold
6. Determination of the MMT
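Purely as an organizational sketch of these six steps (all helper names are assumptions, not the book's released functions):

```matlab
% Skeleton of Psychoacoustic Model 1 (illustrative; helper functions
% such as spl_normalize and find_maskers are assumed, not from the book).
function mmt = psymodel1(frame, Fs)
    psd      = spl_normalize(frame, Fs);      % Step 1: FFT + SPL normalization
    [tm, nm] = find_maskers(psd);             % Step 2: tonal/nontonal maskers
    [tm, nm] = decimate_maskers(tm, nm);      % Step 3: remove invalid maskers
    Lind     = individual_thresholds(tm, nm); % Step 4: individual thresholds
    Lg       = global_threshold(Lind, psd);   % Step 5: global masking threshold
    mmt      = min_masking_threshold(Lg);     % Step 6: minimum masking threshold
end
```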
• STEP 1: FFT analysis and SPL normalization

Each frame x(n) of N samples is weighted by a Hanning window,

$$w(n) = \sqrt{\frac{8}{3}}\cdot\frac{1}{2}\left[1 - \cos\left(\frac{2\pi n}{N}\right)\right],\qquad 0 \le n < N,\tag{2.9}$$

where the factor $\sqrt{8/3}$ is a gain to compensate the average power of $w(n)$, so that $\frac{1}{N}\sum_{n=0}^{N-1}[w(n)]^2 = 1$. Then, the power spectral density (PSD) of $x(n)$ is computed as

$$\mathrm{PSD}(k)/\mathrm{dB} = 10\log_{10}\left|\frac{1}{N}\sum_{n=0}^{N-1} x(n)\,w(n)\,\exp\left(-j\,\frac{2\pi nk}{N}\right)\right|^{2},\qquad 0 \le k < \frac{N}{2}.\tag{2.10}$$
After that, the PSD estimate PSD(k) is normalized to an SPL level of 96 dB, i.e., its maximum is set to 96 dB.
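A minimal MATLAB sketch of this step (the frame variable x and the 512-point frame length for Layer I are assumptions):

```matlab
% Step 1 sketch: windowed PSD per Eqs. (2.9)-(2.10), normalized so that
% the maximum corresponds to 96 dB SPL.
N   = 512;                                    % frame length (Layer I)
n   = (0:N-1)';
w   = sqrt(8/3) * 0.5 * (1 - cos(2*pi*n/N));  % power-normalized Hanning window
X   = fft(x(1:N) .* w);                       % windowed FFT of the frame x
psd = 10*log10(abs(X(1:N/2)/N).^2 + eps);     % PSD in dB, 0 <= k < N/2
psd = psd + (96 - max(psd));                  % normalize maximum to 96 dB SPL
```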
[Fig. 2.19 Initial and normalized PSD estimates together with the threshold in quiet: (a) frequency on a linear scale; (b) frequency on the Bark scale.]
• STEP 2: Identification of tonal and nontonal maskers
If the value of a local maximum is at least 7 dB greater than those of its neighboring components within a certain Bark range $D_k$, it is marked as a tonal masker. All the tonal components comprise the "tonal" set, $S_{TM}$:
$$D_k \in \begin{cases}\{\pm 2\}, & 2 < k < 63\\ \{\pm 2, \pm 3\}, & 63 \le k < 127\\ \{\pm 2, \pm 3, \ldots, \pm 6\}, & 127 \le k \le 250,\end{cases}$$

where the three index ranges correspond to the frequency ranges $\frac{2F_s}{N}$–$\frac{63F_s}{N}$, $\frac{63F_s}{N}$–$\frac{127F_s}{N}$, and $\frac{127F_s}{N}$–$\frac{250F_s}{N}$ kHz, respectively.¹¹
One point to note is that [77] did not specify the value of $D_k$ for $251 \le k \le 256$, because the maskers within this range are already dominated by the threshold in quiet (as seen in Fig. 2.19) and make no contribution to the masking threshold. In effect, they are removed by the first decimation criterion in Step 3.
As sound powers (rather than their dB values) are additive, the SPL of each tonal masker is calculated by combining the local maximum with its two neighbors:

$$P_{TM}(k)/\mathrm{dB} = 10\log_{10}\left[10^{P(k-1)/10} + 10^{P(k)/10} + 10^{P(k+1)/10}\right].\tag{2.14}$$
In addition, the remaining components within each critical band¹² are treated as nontonal. So we sum up their intensities to form the SPL of a single nontonal masker for each critical band, $P_{NM}$:

$$P_{NM}(\bar{k})/\mathrm{dB} = 10\log_{10}\sum_{j} 10^{P(j)/10},\qquad \forall\, P(j)\notin S_{TM},\tag{2.15}$$

where $\bar{k}$ is the frequency index nearest to the geometric mean¹³ of each critical band. Correspondingly, all the nontonal components are put into the "nontonal" set, $S_{NM}$.
¹¹ The frequency edges are calculated based on the sampling frequency $F_s$.
¹² Critical band boundaries vary with the Layer and sampling frequency. ISO/IEC IS 11172-3 [77] tabulates these parameters in Tables D.2a–f. In our case, Table D.2b for Layer I at a sampling frequency of 44.1 kHz is adopted.
¹³ The geometric mean of a data set $[a_1, a_2, \ldots, a_M]$ is defined as $\left(\prod_{m=1}^{M} a_m\right)^{1/M}$. It is sometimes called the log-average, i.e., $\left(\prod_{m=1}^{M} a_m\right)^{1/M} = 10^{\left[\frac{1}{M}\sum_{m=1}^{M}\log_{10}(a_m)\right]}$.
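For a local maximum at index k of the normalized PSD, Eq. (2.14) of Step 2 is a one-liner in MATLAB (illustrative; psd and k are assumed from the preceding steps):

```matlab
% Step 2 sketch: SPL of a tonal masker at local maximum k, per Eq. (2.14);
% psd is the normalized PSD (dB) from Step 1.
Ptm = 10*log10(10.^(psd(k-1)/10) + 10.^(psd(k)/10) + 10.^(psd(k+1)/10));
```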
[Fig. 2.20 Tonal maskers (pentagrams) and nontonal maskers (asterisks) identified on the normalized power spectrum, together with local maxima, the threshold in quiet, and removed components: (a) frequency on a linear scale; (b) frequency on the Bark scale, with an enlarged view of the tonal maskers TM1, TM2, and TM3 located at 17.08, 17.57, and 18.01 Barks (TM1 and TM2 are 0.49 Bark apart).]
Tonal and nontonal maskers are denoted by pentagram and asterisk symbols in
Fig. 2.20, respectively. Particularly, the associated critical band for each masker is
indicated in the graph on Bark scale.
• STEP 3: Decimation of invalid tonal and nontonal maskers
Considering their possible contributions to the masking threshold, the sets of tonal and nontonal maskers are examined according to the two criteria below.
One rule is that any tonal or nontonal masker below the threshold in quiet is removed. That is, only the maskers that satisfy Eq. (2.16) are retained, where $A_{TH}(k)$ is the SPL of the threshold in quiet at frequency index $k$:
\[
P_{TM,NM}(k) \ge A_{TH}(k). \tag{2.16}
\]
For example, one of each of the tonal and nontonal maskers between 24 and 25 Barks is discarded, as shown in Fig. 2.20b.
The other rule is to simplify any group of maskers occurring within a distance
of 0.5 Bark: only the masker with the highest SPL is preserved and the rest are
eliminated.
For example, two pairs of tonal maskers between 17 and 19 Barks, fTM1 ; TM2 g
and fTM2 ; TM3 g, are inspected. As shown in an enlarged drawing on the right of
Fig. 2.20b, the distance between fTM1 ; TM2 g is 0.49 Bark, and TM1 has a lower
SPL than TM2 . Therefore, TM2 is preserved, whereas TM1 is removed. Similarly,
we dispose of TM3 but retain TM2 for fTM2 ; TM3 g.
$\{TM_1, TM_2\} \Rightarrow TM_2$: distance $17.57 - 17.08 = 0.49$ Bark; SPL $P_{TM_1} < P_{TM_2}$.
$\{TM_2, TM_3\} \Rightarrow TM_2$: distance $18.01 - 17.57 = 0.44$ Bark; SPL $P_{TM_2} > P_{TM_3}$.
In Fig. 2.20, the invalid tonal and nontonal maskers being decimated are denoted by circles.
• STEP 4: Calculation of individual masking thresholds
After eliminating invalid maskers, an individual masking threshold is computed for each tonal and nontonal masker. An individual masking threshold $L(j, i)$ refers to the masker at frequency index $j$ contributing a masking effect to the maskee at frequency index $i$. It corresponds to $L[z(j), z(i)]$, where $z(j)$ and $z(i)$ are the masker's and maskee's frequencies on the Bark scale. In MPEG psychoacoustic models,
only a subset of samples over the whole spectrum are considered to be maskees and
involved in the calculation of global masking threshold. The number and frequencies
of maskees also depend on the Layer and sampling frequency, as tabulated from
Table D.1a–f in [77]. In our case, Table D.1b for Layer I at a sampling frequency of
44.1 kHz is adopted, where 106 maskees are taken into account.
The individual masking thresholds for tonal and nontonal maskers, $L_{TM}[z(j), z(i)]$ and $L_{NM}[z(j), z(i)]$, are calculated by
\[
L_{TM}[z(j), z(i)] = P_{TM}[z(j)] + \Delta_{TM}[z(j)] + SF[z(j), z(i)],
\]
\[
L_{NM}[z(j), z(i)] = P_{NM}[z(j)] + \Delta_{NM}[z(j)] + SF[z(j), z(i)],
\]
where $P_{TM}[z(j)]$ and $P_{NM}[z(j)]$ are the SPLs of the tonal and nontonal maskers at Bark position $z(j)$, respectively, and $SF[z(j), z(i)]$ is the spreading function. The term $\Delta_X$ is called the masking index, an offset between the excitation pattern and the actual masking threshold. As mentioned in Sect. 2.3.1, the excitation pattern needs to be shifted by an appropriate amount in order to obtain the masking curve relative to the masker. Because tonal and nontonal maskers have different masking capabilities, i.e., noise is a better masker than a pure tone, the masking indices of tonal and nontonal maskers are defined separately as follows [77]:
\[
\Delta_{TM}[z(j)] = -1.525 - 0.275\,z(j) - 4.5 \ \mathrm{dB},
\]
\[
\Delta_{NM}[z(j)] = -1.525 - 0.175\,z(j) - 0.5 \ \mathrm{dB}.
\]
Fig. 2.21 Individual masking thresholds (a) Frequency on linear scale (b) Frequency on Bark
scale
• STEP 5: Calculation of global masking threshold
The global masking threshold at each maskee index $i$ is obtained by adding the threshold in quiet and all the individual masking thresholds in the intensity domain:
\[
L_G(i)\,/\,\mathrm{dB} = 10\log_{10}\left[10^{\frac{A_{TH}(i)}{10}} + \sum_{j=1}^{N_{TM}} 10^{\frac{L_{TM}[z(j),\,z(i)]}{10}} + \sum_{j=1}^{N_{NM}} 10^{\frac{L_{NM}[z(j),\,z(i)]}{10}}\right],
\]
where $A_{TH}(i)$ is the SPL of the threshold in quiet at frequency index $i$, $N_{TM}$ and $N_{NM}$ are the numbers of tonal and nontonal maskers, and $L_{TM}[z(j), z(i)]$ and $L_{NM}[z(j), z(i)]$ are their corresponding individual masking thresholds.
Fig. 2.22 Global masking threshold and minimum masking threshold (a) Frequency on linear
scale (b) Frequency on Bark scale
[Figure: mapping of the 106 maskees (subsamples over the spectrum, $i$) onto frequency index $k$ (0–256) and subband number $n$ (1–32).]
[Figure: normalized power spectrum of one frame together with the MMTs obtained from Model 1 and Model 2; x-axis: frequency (Hz), y-axis: sound pressure level (dB).]
[Figure: post-masking threshold superimposed on the input audio waveform; x-axis: time duration (ms), y-axis: normalized amplitude.]
2.5 Summary
The ultimate aim of this chapter is to establish a psychoacoustic model that emulates
the HAS. Accordingly, audio watermarking techniques are able to analyze the host
audio signal in order to determine how the watermarks can be rendered as inaudible
as possible.
The chapter started with the physiology of the peripheral auditory system
including the outer, middle, and inner ears. The outer ear collects sound waves in
the air and channels them to interior parts of the ear; the middle ear transforms the
acoustical vibration of sound waves into mechanical vibration and passes them onto
the inner ear; the inner ear transduces mechanical energy into nerve impulses that
are transmitted to the brain. Then, some fundamental concepts of psychoacoustics
such as SPL, loudness, human hearing range, threshold in quiet, and critical
bandwidth were introduced. The notions of two types of auditory masking, i.e.,
simultaneous and nonsimultaneous masking, were also explained. In simultaneous
masking, it is noted that the masking ability of narrowband noise is superior to that of a pure tone. Based on the acquired knowledge, the ways of constructing models for simultaneous and nonsimultaneous masking effects were investigated, with particular attention to simultaneous masking. After reviewing several models for the spreading of masking, we described the details of implementing Psychoacoustic Model 1 of the ISO/MPEG standard, followed by a comparison with Model 2. On balance, the two psychoacoustic models deliver similar perceptual quality, but Model 2 requires more computation than Model 1. Consequently, we adopted Psychoacoustic Model 1 in
the audio watermarking scheme we developed in this book.
Chapter 3
Audio Watermarking Techniques
In recent years, there has been considerable interest in the development of audio
watermarking techniques. To clarify the essential principles underlying a diversity
of sophisticated algorithms, this chapter gives an overview of basic methods for
audio watermarking, such as least significant bit (LSB) modification, phase coding,
spread spectrum watermarking, cepstrum domain watermarking, wavelet domain
watermarking, echo hiding, and histogram-based watermarking.
As the first step towards a full investigation into various approaches, we start
with the details of performance evaluation undertaken in this book, including the
parameters employed during perceptual quality assessment and robustness tests.
Then, different audio watermarking techniques are separately implemented and
evaluated in order to ascertain their advantages and disadvantages. Also, possible
enhancements are exploited to further improve their capabilities. Finally, the chapter
is concluded with a summary of comparative study.
[Figure 3.1: waveforms of the left and right channels of a stereo signal; x-axis: time (s), y-axis: normalized amplitude.]
A collection of audio test files is prepared for the performance evaluations in this book. The test set contains seventeen pieces of audio in total, all in WAVE format (44.1 kHz, 16 bit, mono). For simplicity of reference, each audio signal An is marked with a subscript number n, i.e., (i) Vocal: Soprano1, Bass2, Quartet3; (ii) Percussive instruments: Hihat4, Castanets5, Glockenspiel16, Glockenspiel27; (iii) Tonal instruments: Harpsichord8, Violoncello9, Horn10, Pipes11, Trumpet12, Electronic tune13; (iv) Music: Bach14, Pop15, Rock16, Jazz17.
More details of each audio test file, such as its duration and waveform, are listed in
Appendix E.
Except for the category of music, A14–A17, the majority of the test files are selected from audio tracks on the EBU SQAM¹ disc, prepared specifically for the testing and evaluation of sound systems [79]. Originally, most audio tracks are in stereo. Since our study is not limited to stereo audio watermarking,² the left channel of each signal is always used for the watermarking. Figure 3.1 shows an example of a stereo signal; the left channel is taken as the audio test file A2. It is observed that there are silent intervals inherent in audio data. Long silence is usually intractable in watermarking, because embedding watermarks in muteness would unavoidably introduce perceptible noise. Thus, watermarking
¹ EBU: The European Broadcasting Union; SQAM: Sound Quality Assessment Material.
² In general, stereo audio watermarking depends on some kind of relation between the two channels [80], so it can only apply to stereo signals. However, mono audio watermarking can commonly treat one stereo channel as two mono channels, so it supports both mono and stereo audio signals.
regions must be carefully chosen. This issue will be further elaborated in Chap. 4.
To avoid long silence during the watermarking, we only embed the watermark into
the first half of A2 for performance evaluations in this chapter.
The general guideline for robustness tests was discussed in Sect. 1.3.2.2. On
consideration of test items in SDMI, STEP2000, and StirMark for Audio, two
robustness tests are set up in the book, i.e., a basic robustness test and an advanced
robustness test.
Recall that in robustness tests, the attacked watermarked signals should not
be degraded far beyond tolerable levels. On the basis of this premise, the attack
parameters listed below are determined accordingly.
The basic robustness test incorporates a variety of typical attacks on audio water-
marking techniques. Table F.1 in Appendix F shows the parameters used, expres-
sion, and implementation of each attack. The basic robustness test is employed for
³ Adobe Audition is a powerful digital audio recorder, editor, and mixer for Windows. It can perform a lot of operations, such as resampling, requantization, amplitude scaling, reverberation, MPEG compression, time stretching, and pitch shifting, on various formats of audio files: .au, .voc, .vox, .wav, and so on.
⁴ The reverberation time of a room is the time that it takes for sound to decay by a certain level α dB once the source of sound has stopped [30]. T60 is the case where α = 60 dB.
⁵ Based on extensive operations with Adobe Audition v3.0, it is found that an amount of 1,201 samples is added to the beginning of an audio file.
DA/AD conversion: The watermarked audio signal is played through the audio
player in a computer. Then the playback signal is recorded by connecting the
headphone jack to the line-in jack on the sound card of the computer.
Random samples cropping: A number of 25 ms intervals are cropped at randomly
selected positions in the front, middle, and rear of the watermarked audio signal.
Jittering: Jittering is an evenly performed form of random samples cropping. For
our watermarked audio signal, 0.1–0.2 ms out of every 20 ms is cropped.
Zeros inserting: A number of 25 ms silent intervals are inserted into randomly
selected positions in the front, middle, and rear of the watermarked audio signal.
Pitch-invariant time-scale modification (PITSM): The time-scale of the watermarked audio signal is stretched from ±4 % up to ±10 %, whereas the audio pitch is preserved. Positive PITSM results in a longer duration with a slower tempo, while negative PITSM results in a shorter duration with a faster tempo.
Tempo-preserved pitch-scale modification (TPPSM): The pitch-scale of the watermarked audio signal is shifted from ±4 % up to ±10 %, whereas the audio tempo is preserved. Positive TPPSM results in a higher pitch, while negative TPPSM results in a lower pitch.
The last five attacks belong to desynchronization attacks, which cause displace-
ment between the encoder and decoder. Therefore, it is difficult to retrieve a
watermark suffering from such hazardous attacks, especially PITSM and TPPSM.
The advanced robustness test involves more stringent attacks than the basic robust-
ness test and is specifically designed for rigorously evaluating our proposed audio
watermarking algorithm to be described in Chap. 4. It consists of three parts: a
test with StirMark for Audio, a test under collusion, and a test under multiple
watermarking.
Collusion: We separately embed $n$ different watermarks $w_o^{(1)}, w_o^{(2)}, \ldots, w_o^{(n)}$ into a host signal $s_o$ and obtain $n$ watermarked signals $s_w^{(1)}, s_w^{(2)}, \ldots, s_w^{(n)}$ correspondingly. Without loss of generality, these watermarked signals are further combined to create $n$ average watermarked signals $\bar{s}_w^{(i)}\ (1 \le i \le n)$ as follows:
\[
\begin{cases}
s_w^{(j)} = \mathrm{Embedding}\left(s_o,\, w_o^{(j)}\right), & 1 \le j \le n,\\[4pt]
\bar{s}_w^{(i)} = \dfrac{1}{i}\left(s_w^{(1)} + s_w^{(2)} + \cdots + s_w^{(i)}\right), & 1 \le i \le n.
\end{cases} \tag{3.1}
\]
In the detection, $i$ watermarks $w_e^{(i,j)}$ are detected from the average watermarked signal $\bar{s}_w^{(i)}$ individually:
\[
w_e^{(i,j)} = \mathrm{Detection}\left(\bar{s}_w^{(i)}\right), \quad 1 \le i \le n \ \text{and} \ 1 \le j \le i. \tag{3.2}
\]
Multiple watermarking: here $i$ watermarks are embedded successively into the same host signal, yielding multiply watermarked signals $s_w^{(i)}$. In the detection, $i$ watermarks are detected from the watermarked signal $s_w^{(i)}$ individually:
\[
w_e^{(i,j)} = \mathrm{Detection}\left(s_w^{(i)}\right), \quad 1 \le i \le n \ \text{and} \ 1 \le j \le i. \tag{3.4}
\]
Over recent years, many digital watermarking methods have been proposed for different applications. These methods can be broadly divided into two main categories:
(1) blind embedding, where the encoder does not exploit the knowledge of the host
signal, for example, spread spectrum watermarking, and (2) informed embedding,
where the knowledge of the host signal is adequately exploited by the encoder,
for example, quantization index modulation (QIM) [1, 82].6 Both prototypes have
found implementations in audio watermarking, such as LSB modification, phase
coding, spread spectrum watermarking, cepstrum domain watermarking, wavelet
domain watermarking, echo hiding, and histogram-based watermarking. In the fol-
lowing subsections, one algorithm of each technique is implemented and evaluated
separately.
Regarding the performance evaluations in this section, perceptual quality assess-
ment and robustness tests are conducted as follows. In the perceptual quality
assessment, the ODG using software PEAQ and the SNR are employed to indicate
the audio quality. Since subjective listening tests are costly and time-consuming,
only informal subjective listening tests are carried out if necessary. Informal
subjective listening tests are performed in the same environment as described in
Sect. 3.1.2; however, only a couple of listeners are involved. Without providing
SDG scores, they are merely required to ascertain whether the watermarked signal is perceptually indistinguishable from the host signal.
⁶ The two categories are named the host-interference nonrejecting method and the host-interference rejecting method, respectively, in [35, 83, 84].
For the robustness test, some or all the attacks in basic robustness test are
involved and the bit error rates (BERs) as defined in Eq. (1.4) are calculated
accordingly. Moreover, repetition coding on the watermark is adopted to enhance
the robustness. For an $(n_r, 1)$ repetition code, each watermark bit is repeated $n_r$ times and subsequently embedded. In the detection, the bits are determined using the majority vote rule. For example, repetition coding with $n_r = 3$ on the original watermark $w_o = [1\ 0\ 1]$ yields the sequence $\hat{w}_o = [111\ 000\ 111]$. Suppose that the attacked sequence becomes $\hat{w}_e = [011\ 001\ 110]$; then the final detected watermark is $w_e = [1\ 0\ 1]$ and BER = 0 %.
Earlier audio watermarking techniques embed the watermark into the host signal in
a straightforward manner. One method is to replace the LSB of each sample with the
watermark represented in a coded binary string [36]. In this way, the data payload
of the watermarking system could be very high, approximately of the same order of magnitude as the sampling frequency of the host signal. Ideally, for instance, the bit rate is 44.1 kbps for an audio signal with a sampling frequency of 44.1 kHz [31]. However, a system operating under such conditions would be quite unable to resist any attack.
In order to enhance the robustness and security, LSB modification can be performed
on some selected subsets of the samples only, such as low-frequency components
that are perceptually important. Usually, repetition coding could help increase the
detection rate in LSB watermarking.
3.2.1.1 Algorithm
Fig. 3.2 Host signal and a watermarked signal by LSB modification. Note that the watermarked signal is produced by using L = 6 and modifying the third and fourth decimal places. (a) Host audio signal. (b) Watermarked audio signal. (c) Difference between the watermarked and host audio signals
⁷ Note that two of the ODGs in Table 3.1 are slightly positive, i.e., 0.13 and 0.03. According to its definition, the ODG should normally be in the range [−4, 0]. However, if the distortion caused by watermarking is very low, then the cognitive model calculates positive values. In such cases, it is interpreted that the distortion is mostly inaudible for humans [38].
Based on the fact that the human auditory system (HAS) is unable to perceive
the absolute phase, only the relative phase [36], audio watermarking techniques
can embed watermarks into the phase of host signal, i.e., phase coding and phase
modulation [31].
3.2.2.1 Algorithm
The basic phase-coding method was presented in [36]. It splits the host signal into
frames and the first frame’s phase spectrum is modified to represent the watermark.
Then the phases of subsequent frames are changed accordingly to preserve their
relative phases. Thus, the first frame is crucial for watermark embedding and it
must not be an absolute silence. For watermark detection, the first frame of the
attacked signal is taken out, and then the value of its phase spectrum is calculated
to determine the watermark. The premise for detection is precise synchronization to
obtain the first frame accurately.
Algorithm 3.1 describes the pseudocode of phase coding. Note that the length of each frame is $N$, but the watermark has a length of only $(N_2 - 1)$, where $N_2 = \lfloor N/2 \rfloor$ and $\lfloor \cdot \rfloor$ denotes the floor operation (the largest integer not exceeding its argument). The reason is that the Fourier transform of a real-valued signal exhibits conjugate symmetry [87], so only half of the spectrum is available to embed the watermark. Moreover, the first spectral component is the DC value, whose phase is always equal to 0 or $\pi$, so it is not used for watermarking.
The MATLAB script for the phase-coding method can be found as the Phase_coding.m file under the Audio_Watermarking_Techniques folder on the attached CD.
Table 3.2 shows the results of the performance evaluation of the basic phase-coding method. As indicated by the very low SNRs, the watermarked signals have been changed greatly and their perceptual quality has degraded badly.
The cause of the deterioration is the substitution of $\Phi_1$ with a binary sequence of $\frac{\pi}{2}$ and $-\frac{\pi}{2}$. In the basic method, every component of the phase spectrum is altered to represent the watermark. Such sharp phase transitions are likely to produce audible distortion [36]. In order to smooth the variation, the modified method changes the phase spectrum only at intervals of $n_e$ and interpolates between the values. Several kinds of interpolation, such as linear interpolation and cubic spline interpolation, were tested.
Figure 3.3 illustrates an example of the watermarked signal produced by the
modified phase-coding method. From Fig. 3.3c, there is still quite a difference
between the watermarked and host signals, which also interprets very low SNRs
in Table 3.2. Informal subjective listening tests show that the perceptual quality of the watermarked signals is still not satisfactory. Moreover, as shown in Table 3.2,
the watermarked signals by phase coding are not robust and nearly all the BERs
are over 20 %. In addition, repetition coding is not helpful to phase coding. This is
because the effect of nr times repetition coding resembles that of embedding interval
ne D nr with linear interpolation.
To achieve imperceptible watermarking, Kuo et al. [88] proposed phase modula-
tion under the following constraint condition:
\[
\left|\frac{d\phi(z)}{dz}\right| < 30^{\circ}, \tag{3.5}
\]
where .z/ denotes the signal phase and z is the Bark scale [51]. More information
on phase modulation for audio watermarking can be found in [31, 80, 81].
% Watermark detection
% Note: sa is the attacked signal.
g1a   = sa(1:N);                   % the first frame
Phi1a = angle(fft(g1a));           % its phase spectrum
% The watermark we is detected based on the phase spectrum Phi1a.
for m = 1:(N2-1)
    if Phi1a(m+1) >= 0             % m+1 skips the unused DC component
        we(m) = 1;
    else
        we(m) = 0;
    end
end
Watermarking parameters
  Phase-coding method      Basic method         Modified method
  Frame length N           1024  2048  4096     2048  2048  4096  4096
  Embedding interval ne       1     1     1      128    64   128    64
  Watermark length Nw       511  1023  2047        8    16    16    32
Fig. 3.3 Host signal and a watermarked signal by the modified phase-coding method. Note that the watermarked signal is produced by watermarking with N = 2,048 and ne = 128. (a) Host audio signal. (b) Watermarked audio signal. (c) Difference between the watermarked and host audio signals
Fig. 3.4 Block diagram of basic SS watermarking scheme. (a) Embedding process. (b) Detection
process
3.2.3.1 Algorithm
There are two main forms of SS watermarking, namely direct sequence spread
spectrum (DSSS) [8, 89, 91] and frequency hopping spread spectrum (FHSS) [92].
DSSS-based audio watermarking method is more commonly used and its basic
scheme is shown in Fig. 3.4 [1,90]. In watermark embedding process, the watermark
wo is modulated by PRS rs to produce the modulated watermark wm . To keep wm
inaudible, scaling factor ˛ may be used to control the amplitude of wm . Then the
watermarked signal sw is produced by adding wm to the host signal so . In watermark
detection process, the watermark we is extracted by correlating the received signal
sa with the PRS rs used in the embedding.
Note that the watermark can be spread not only in the time domain but also in
various transformed domains. Discrete Fourier transform (DFT), discrete cosine
transform (DCT), and discrete wavelet transform (DWT) are some examples of
transforms that are frequently used. Typically, a SS watermarking scheme that
spreads the watermark into the time-domain signal is implemented as follows.
First, the host signal $s_o(n),\ 1 \le n \le N_o$, is split into $N_P$ frames $\{g_i\}$ of $N$ samples each, where $N_P = \left\lfloor \frac{N_o}{N} \right\rfloor$:
\[
s_o(n) = \{g_i(j)\}, \quad 1 \le i \le N_P \ \text{and} \ 1 \le j \le N. \tag{3.6}
\]
So at most $N_P$ watermark bits can be embedded, i.e., $N_w = N_P$. Each watermark bit $w_o(i) \in \{+1, -1\}$ is modulated by one PRS $r_s$. A sequence of random numbers uniformly distributed in the interval $(-0.5, 0.5)$ is applied in our experiment. Then, the watermarked frame $g_i^w$ is obtained by adding the modulated frame to the host frame as follows:
\[
g_i^w = g_i + \alpha\, w_o(i)\, r_s, \tag{3.7}
\]
where the factor $\alpha$ controls the strength of watermarking.
For a better perceptual quality, the adaptive factor $\alpha = \beta \max\left(\mathrm{abs}(g_i)\right)$ is adopted, where $\beta$ is a scaling factor. Finally, all the watermarked frames $\{g_i^w\},\ 1 \le i \le N_w$, are concatenated to produce the watermarked signal $s_w$.
In the detection, watermark bits are determined by using a linear correlation between the watermarked signal and the PRS. After splitting the watermarked signal $s_w(n)$ into frames $\{g_i^w\}$ in the same way as in the embedding, the linear correlation $R_c(i)$ between each frame $g_i^w$ and $r_s$ is calculated as
\[
R_c(i) = \frac{1}{N}\sum_{j=1}^{N} g_i^w(j)\, r_s(j)
       = \frac{1}{N}\sum_{j=1}^{N} \left[g_i(j) + \alpha\, w_o(i)\, r_s(j)\right] r_s(j)
       = \frac{1}{N}\sum_{j=1}^{N} g_i(j)\, r_s(j) + \alpha\, w_o(i)\, \frac{1}{N}\sum_{j=1}^{N} \left[r_s(j)\right]^2. \tag{3.8}
\]
Ideally, if the host frame $g_i$ and the PRS $r_s$ are independent, the first term in Eq. (3.8) is close to zero. Meanwhile, the second term has a large magnitude whose sign depends on the watermark bit $w_o(i)$. However, it is not always the case that $g_i$ and $r_s$ are uncorrelated. If they are not, the first term may have a similar or even larger magnitude than the second term, which would lead to incorrect detection. Thus, the watermarked signal must be preprocessed to reduce the effect of the host signal to the fullest extent [90]. To this end, different preprocessing methods have been developed, such as linear predictive coding (LPC) filtering [93], cepstrum filtering [8], Savitzky–Golay filtering [4], and decorrelation by subtracting the host signal [89] or adjacent frames [9].
Figure 3.5 shows the host signal and a watermarked signal by SS watermarking. As a
result of a small difference in Fig. 3.5c, the SNRs in Table 3.3 are higher than 30 dB.
For perceptual quality assessment, however, the values of the ODG are pretty low.
To ascertain the real perceptual quality, we carry out an informal subjective listening
test on the watermarked signal with the highest ODG in Table 3.3, i.e., the shaded
column. When the watermarked signal is played at a high volume, a constant hissing
background noise is heard. This occurs as a result of the modulated PRS added to
the host signal as white noise. Therefore, amplitude shaping by the psychoacoustic
model is very important to produce an unperceived watermark.
In the robustness test, SS watermarking with repetition coding behaves differ-
ently. In descending order of resistance, the attacks are sorted as Compression I,
noise addition, requantization, low-pass filtering, echo addition, resampling, and
Compression II. Apparently, SS watermarking is quite vulnerable to desynchroniza-
tion attacks. Thus, proper solutions for the synchronization problem are necessary
to improve the detection rate.
Fig. 3.5 Host signal and a watermarked signal by SS watermarking. Note that the watermarked signal is produced by watermarking with N = 4,096, nr = 3, and β = 0.03. (a) Host audio signal. (b) Watermarked audio signal. (c) Difference between the watermarked and host audio signals
The complex cepstrum of a signal $x(n)$ is defined as
\[
\hat{x}(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\left(X\left(e^{j\omega}\right)\right) e^{j\omega n}\, d\omega,
\]
where $\log(\cdot)$ refers to the natural logarithm. In a more compact form, $\hat{x}(n) = \mathcal{F}^{-1}\{\log(\mathcal{F}\{x(n)\})\}$, where $\mathcal{F}(\cdot)$ is the Fourier transform and $\mathcal{F}^{-1}(\cdot)$ is its inverse. The complex cepstrum $\hat{x}(n)$ preserves information about both the magnitude and the phase of the frequency spectrum of $x(n)$. Therefore, $x(n)$ can be recovered from $\hat{x}(n)$ by the inverse complex cepstrum. Figure 3.6 shows the block diagram of computing the complex cepstrum and the inverse complex cepstrum using the DFT and IDFT [87].
In MATLAB implementations, the real part of the complex cepstrum is often utilized as the common "cepstrum" $c(n)$ [95], i.e., $c(n) = \Re\{\hat{x}(n)\}$.
Note that the real part of the complex cepstrum, $c(n)$, should be distinguished from the "real cepstrum," $c_r(n)$. The real cepstrum is defined as the inverse Fourier transform of the logarithm of the magnitude of the Fourier transform of a signal $x(n)$, i.e.,
\[
c_r(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log\left|X\left(e^{j\omega}\right)\right| e^{j\omega n}\, d\omega. \tag{3.12}
\]
Table 3.3 Results of performance evaluation of SS watermarking

Watermarking parameters
  Frame length N            2048  2048  2048  2048  4096  4096  4096  4096
  Repetition coding nr         1     3     1     3     1     3     1     3
  Watermark length Nw        256    85   256    85   128    42   128    42
  Scaling factor β          0.02  0.02  0.03  0.03  0.02  0.02  0.03  0.03
(1) Perceptual quality assessment
  SNR /dB                  34.81 34.59 31.09 31.11 33.90 33.92 30.46 30.46
  ODG                      −3.68 −3.67 −3.76 −3.73 −3.70 −3.68 −3.77 −3.78
(2) Robustness test (BER: %)
  No attack                    0     0     0     0     0     0     0     0
  Noise (30 dB)             7.03  4.71  3.52  1.18  3.13  2.38  0.78     0
  Resampling (22.05 kHz)   46.88 34.12 31.25 28.24 30.47 28.57 17.97 14.29
  Requantization (8 bit)    4.69  5.88  2.34     0  4.69     0  1.56     0
  Lp filtering (8 kHz)     12.89 11.76  4.69     0  7.03     0  0.78     0
  Echo (0.3, 100 ms)       28.52 10.59 10.94  2.35  7.03  7.14  2.34     0
  Compression I (96 kbps)   1.17     0  0.39     0     0     0     0     0
  Compression II (96 kbps) 50.00 47.06 49.61 58.82 52.34 45.24 48.44 47.62
Fig. 3.6 Block diagram of computing the complex cepstrum and the inverse complex cepstrum. (a) Complex cepstrum $\hat{x}(n) = \mathcal{F}^{-1}\{\log(\mathcal{F}\{x(n)\})\}$. (b) Inverse complex cepstrum $x(n) = \mathcal{F}^{-1}\{\exp(\mathcal{F}\{\hat{x}(n)\})\}$
Different from complex cepstrum, real cepstrum has lost the phase information.
Therefore, x .n/ cannot be reconstructed perfectly from cr .n/.
3.2.4.1 Algorithm
Several audio watermarking schemes operate in the cepstrum domain [95–99]. Li et al. [95] found that the statistical mean of the cepstrum coefficients is an attack-invariant feature and accordingly developed cepstrum domain watermarking based on statistical mean manipulation (SMM). In this approach, the statistical mean of the cepstrum coefficients, $\mu$, is set to $+\alpha_w$ or $-\alpha_w$ to represent watermark bits "1" and "0," respectively. Apparently, the strength of watermarking is controlled by $\alpha_w$. For the detection, the statistical mean of the watermarked cepstrum coefficients, $\mu^w$, is calculated and compared to a predefined threshold $T_d$. If $\mu^w \ge T_d$, then the watermark bit is "1." Otherwise, it is "0." Note that instead of the mean, we use the sum for bit determination in practice. This is because the mean of the cepstrum coefficients is usually around $10^{-4}$, which is rather small for a comparison.
Algorithm 3.2 describes the pseudocode of the watermarking process.
Furthermore, Hsieh et al. [97] proposed a method of embedding based on time
energy features to solve the synchronization problem. The watermark is embedded
into the frames followed by salient points, the positions where signal energy
climbs fast to a peak [33]. As salient points are supposed to remain stable after
attacks, synchronization can be regained in the detection. Afterward, Cui et al. [98]
improved the method in [95] by employing the psychoacoustic model to control the
audibility of the introduced distortion. Apart from SMM, Gopalan [99] embedded
the watermark by altering the cepstrum in the regions that are psychoacoustically
masked, so as to ensure a better trade-off between imperceptibility and robustness.
Differently from [95, 97–99], Lee et al. [96] spread a PRS in the cepstrum domain to watermark the audio signal, which can be considered in some sense a kind of SS watermarking. In order to minimize its audibility, the PRS is weighted according to the distribution of the cepstrum coefficients and the masking threshold from the psychoacoustic model. However, the scheme is non-blind because the host signal is required in the detection.
% Watermark embedding
% Note: Host signal so(n), 1 <= n <= No, is split into NP frames {gi} with
% N samples each, where NP = floor(No/N). Also, Nw = NP.
...
% Watermark detection
% Note: sa(n) is the attacked signal.
sa(n) = {gia(j)}, 1 <= i <= NP and 1 <= j <= N;
Predefined threshold Td;
[Figure 3.7 panels, each plotting μi vs. frame index with bits "1" and "0" marked separately: No attack, T′d = 3.2716; Echo addition, T′d = 3.6284; MP3 compression (48 kbps), T′d = 3.1196; Cropping, T′d = 1.0419.]
Fig. 3.7 Distributions of Rone and Rzero under different attacks. Note that these data are produced by watermarking with N = 4,096, αw = 0.001, and nr = 3
commonly used attacks, such as noise addition, low-pass filtering, echo addition, MP3 compression, and random samples cropping. For every attacked signal, the sum of the cepstrum coefficients of each frame is written as $\mu_i = \mathrm{sum}\left(c_i^a\right)$. If bit "1" was originally embedded in the $i$th frame, $\mu_i$ is put into the "one" set, i.e., $R_{one} = \{\mu_i \mid w_o(i) = 1\}$. Otherwise, $\mu_i$ is put into the "zero" set, i.e., $R_{zero} = \{\mu_i \mid w_o(i) = 0\}$. Then, the minimum of $R_{one}$ and the maximum of $R_{zero}$ are averaged to form one sub-threshold $T_d' = \frac{\min(R_{one}) + \max(R_{zero})}{2}$. Experimental results show that the elements in $R_{one}$ are positive numbers, commonly around $N\alpha_w$, whereas the elements in $R_{zero}$ vary greatly within $[-N\alpha_w, N\alpha_w]$. Therefore, the maximum sub-threshold is a proper $T_d$ to achieve low BERs. To further reduce the BER, more delicate adjustments of $T_d$ are required. For a better illustration, the distributions of the elements in $R_{one}$ and $R_{zero}$ are plotted in Fig. 3.7, where they are denoted by blue asterisks and red circles, respectively. As shown in the graph, the maximum sub-threshold $T_d' = 3.6284$, obtained under echo addition, is an appropriate $T_d$. Furthermore, after subtle adjustment, $T_d = 3.5$ is utilized as the final detection threshold for the conditions of N = 4,096, αw = 0.001, and nr = 3 in Table 3.4.
Finally, a variable frame length in the detection is utilized to combat pitch-invariant time-scale modification (PITSM). Under the default setting for SMM detection, the attacked signal $s_a(n),\ 1 \le n \le N_a$, is assumed to be as long as the watermarked
Table 3.4 Results of performance evaluation of cepstrum domain watermarking (perceptual quality rows)
  SNR /dB   27.07  26.49  26.73  20.74  20.28  20.06  17.92  17.37  17.37
  ODG       −0.16  −0.17  −0.16  −0.46  −0.44  −0.49  −0.04  −0.08  −0.07
signal (as well as the host signal), i.e., $N_a = N_o$. Also, the detection uses the same frame length as the embedding, i.e., $\tilde{N} = N$. Then, the same splitting method adopted in the embedding is employed to divide $s_a(n)$ into
\[
\tilde{N}_w = \left\lfloor \frac{N_a - p\tilde{N}}{\tilde{N}(1-p)} \right\rfloor = \left\lfloor \frac{N_o - pN}{N(1-p)} \right\rfloor = N_w
\]
frames, where $p$ is the overlapping ratio. Correspondingly, $N_w$ watermark bits are extracted.
However, PITSM adjusts the playback speed of the audio signal, and $N_a$ changes accordingly. For example, it is lengthened by positive PITSM $(N_a > N_o)$ or shortened by negative PITSM $(N_a < N_o)$. For most audio watermarking techniques, such alteration in signal length causes severe problems for the detection. On the one hand, both positive and negative PITSM modify the time-scale of the watermarked signal, which results in a displacement between the detection and the embedding. Without the retrieval of synchronization, watermark detection cannot work properly. On the other hand, in the case of negative PITSM, we are probably unable to extract as many watermark bits as were embedded, because the attacked signal does not have enough samples for the detection. For example, given $N_o = 5.252 \times 10^5$ and $N = 2{,}048$, the host signal is split into
\[
N_w = \left\lfloor \frac{N_o - pN}{N(1-p)} \right\rfloor = \left\lfloor \frac{5.252 \times 10^5 - \frac{1}{2}\cdot 2048}{2048\left(1 - \frac{1}{2}\right)} \right\rfloor = 511
\]
frames, and hence up to 511 bits can be embedded. After being modified by $-10\,\%$ PITSM, the attacked signal has a shorter length of $N_a = 4.77455 \times 10^5$. If $\tilde{N} = N = 2{,}048$ is still in use, there would be
\[
\tilde{N}_w = \left\lfloor \frac{N_a - p\tilde{N}}{\tilde{N}(1-p)} \right\rfloor = \left\lfloor \frac{4.77455 \times 10^5 - \frac{1}{2}\cdot 2048}{2048\left(1 - \frac{1}{2}\right)} \right\rfloor = 465
\]
frames. This means that only 465 bits can be extracted at most,⁸ not to mention the low detection accuracy.
Under such circumstances, a variable frame length for re-synchronization in the detection is proposed: the frame length $\tilde{N}$ should vary with the signal length $N_a$, i.e., $\tilde{N} = \left\lfloor N \cdot \frac{N_a}{N_o} \right\rfloor$. In this way, the attacked signal is still split into approximately $N_P$ frames.
The MATLAB script for cepstrum domain watermarking can be found as the Cepstrum_watermarking.m file under the Audio_Watermarking_Techniques folder on the attached CD.
It is worth mentioning that although other desynchronization attacks such as
random samples cropping, zeros inserting, and jittering may also change signal
length, variable frame length is not very effective in these cases, especially the
former two. Under serious cropping and inserting attacks, a large amount of
removed or added samples occur locally, not uniformly along the whole water-
marked signal. Therefore, the detection with variable frame length cannot achieve
proper re-synchronization to recover the watermark. To withstand these attacks,
either the watermark is embedded on the basis of attack-invariant features, such
as the statistical mean of the cepstrum coefficients on a large scale, or the detection
can locate the positions where the attacks take place, such as the synchronization
method introduced in the next chapter.
⁸ Despite the fact that only a portion of the bits are extracted, the corresponding BER is always calculated for performance evaluations.
Fig. 3.8 Host signal and a watermarked signal by cepstrum domain watermarking. Note that the watermarked signal is produced by watermarking with N = 2,048, αw = 0.0015, and nr = 3. (a) Host audio signal. (b) Watermarked audio signal. (c) Difference between the watermarked and host audio signals
Fig. 3.9 A three-level DWT decomposition and reconstruction. (a) Wavelet decomposition. (b) Wavelet reconstruction.
Notes: 1. ↓2 and ↑2 denote downsampling and upsampling by two, respectively.
2. H0/G0 and H1/G1 are high-pass/low-pass analysis and synthesis filters, respectively
3.2.5.1 Algorithm
To improve robustness, Hwang et al. [101] embedded the watermark into post-masking regions with high energy and a low zero-crossing rate (ZCR). Instead of DWT coefficients, certain detail coefficients of the last level of the DWPT were chosen for watermarking in [105]. In a different approach, the watermark was embedded in the wavelet domain using the patchwork method [104, 106]. Moreover, to enhance the security, Cvejic et al. [7] employed a secret key to randomly select the subbands used for watermarking.
Inspired by the idea of cepstrum domain watermarking in [95], Li et al. [107]
applied SMM in the wavelet domain, where the mean of approximation coefficients
at the last level is modified to embed the watermark. Basically, the procedures of
watermark embedding and detection described in [107] are similar to Algorithm 3.2.
To further improve the performance, we also employ the Hanning window and
half overlapping for smooth transition as well as variable frame length to strive
against PITSM. However, the estimation of detection threshold Td is not necessary
in wavelet domain watermarking based on SMM. Similar pre-attack experiments as
those in cepstrum domain watermarking show that the detection threshold Td can
be specified as Td D 0.
The MATLAB script for wavelet domain watermarking can be found as the Wavelet_watermarking.m file under the Audio_Watermarking_Techniques folder on the attached CD.
Fig. 3.10 Host signal and a watermarked signal by wavelet domain watermarking. Note that the watermarked signal is produced by watermarking with N = 2,048, nr = 3, and αw = 0.01. (a) Host audio signal. (b) Watermarked audio signal. (c) Difference between the watermarked and host audio signals
against all the attacks except DA/AD conversion, cropping, and inserting. To further reduce the BERs under cropping and inserting, a longer frame (e.g., N = 4,096) would be required.
As a kind of statistical watermarking, wavelet domain watermarking based
on SMM is also troubled with security. In [95], the watermark was encrypted
before embedding for the purpose of security consideration. In this way, the
encrypted watermark remains incomprehensible to the attacker without the secret
key for decryption. However, the encryption only offers additional security on top
of watermarking, but cannot prevent deliberate alteration on the mean of DWT
coefficients from seriously destroying the embedded watermark. Therefore, it is
quite important to enhance the security of the watermarking scheme.
Echo hiding embeds the watermark into host signals by introducing different echoes.
With well-designed amplitudes and delays (offset), the echoes are perceived as
resonance to host audio signals and would not produce uncomfortable noises [36].
Table 3.5 Results of performance evaluation of wavelet domain watermarking

Watermarking parameters
  Frame length N           1024                                2048
  Embedding strength αw    0.01               0.02             0.01
  Watermark length Nw      1016   338   203   1016   338  203   504   168   100
  Repetition coding nr        1     3     5      1     3    5     1     3     5
  SNR /dB                 23.38 22.92 22.90  20.80 20.01 20.06 26.28 25.90 25.59
  ODG                     −0.15 −0.15 −0.16  −0.10 −0.13 −0.12 −0.48 −0.47 −0.50
Robustness test (BER: %)
  Compression I  96 kbps  22.05  2.66  0.49  17.03  0.30     0 19.05  0.60     0
                 64 kbps  22.34  2.66     0  17.03  0.30     0 18.65  0.60     0
                 48 kbps  21.95  2.96  0.49  16.83  0.59     0 18.85     0     0
  Cropping (4 × 25 ms)    48.03 38.76 35.96  44.19 39.94 33.50 37.70 32.14 19.00
  Jittering (0.1/20 ms)   46.85 41.72 33.00  44.49 31.36 21.67 38.69 22.02  2.00
  Inserting (4 × 25 ms)   46.36 42.90 36.45  42.62 41.42 33.99 40.67 35.71 18.00
  PITSM +10 %  N          48.62 47.04 44.33  49.70 48.82 46.31 50.40 54.76 46.00
               Ñ          37.30 24.26 12.81  37.11 13.31  6.40 28.77  4.17     0
        −10 %  N          49.52 49.03 46.77  51.45 54.52 42.47 52.04 47.10 45.16
               Ñ          38.29 21.30  7.88  34.25  9.17  2.96 30.95  6.55  1.00
  TPPSM +10 %             37.40 18.05  7.88  35.04 12.43  2.96 25.99  8.33  2.00
        −10 %             32.87 14.50  6.90  29.63  4.44  1.97 26.79  4.17  1.00
Fig. 3.11 Impulse response of echo kernels. (a) “One” kernel. (b) “Zero” kernel
3.2.6.1 Algorithm
A basic echo kernel can be expressed as the impulse response
\[
h(n) = \delta(n) + \alpha\,\delta(n - d), \tag{3.14}
\]
where $\alpha$ is the echo amplitude and $d$ is the delay. To represent bits "1" and "0," echo kernels are created with different delays ($d_1$ and $d_0$), as shown in Fig. 3.11. Usually, the allowable delay offsets for 44.1 kHz sampled audio signals are set to 100–150 samples (about 2.3–3.4 ms) [108]. Consequently, the watermarked signal is described as
\[
s_w(n) = s_o(n) * h(n) = s_o(n) + \alpha\, s_o(n - d). \tag{3.15}
\]
In order to detect the watermark, cepstrum analysis is utilized to discern the value of the delay. According to Fig. 3.6a, the complex cepstrum of the watermarked signal $\hat{s}_w(n)$ is defined as
\[
\hat{s}_w(n) = \mathcal{F}^{-1}\{\log(\mathcal{F}\{s_w(n)\})\}, \tag{3.16}
\]
where $\mathcal{F}\{\cdot\}$ and $\mathcal{F}^{-1}\{\cdot\}$ denote the Fourier transform and the inverse Fourier transform, respectively. After substituting Eq. (3.15) into Eq. (3.16), $\hat{s}_w(n)$ is written as
\[
\hat{s}_w(n) = \hat{s}_o(n) + \hat{h}(n), \tag{3.17}
\]
where $\hat{s}_o(n) = \mathcal{F}^{-1}\left\{\log\left(S_o\left(e^{j\omega}\right)\right)\right\}$ and $\hat{h}(n) = \mathcal{F}^{-1}\left\{\log\left(H\left(e^{j\omega}\right)\right)\right\}$ are the complex cepstra of $s_o(n)$ and $h(n)$, respectively. In view of Eq. (3.14), we have
$H\left(e^{j\omega}\right) = 1 + \alpha e^{-j\omega d}$. Using the Taylor series $\log(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots$ for $|x| < 1$, $\hat{h}(n)$ is calculated as
\[
\hat{h}(n) = \mathcal{F}^{-1}\left\{\log\left(1 + \alpha e^{-j\omega d}\right)\right\}
= \mathcal{F}^{-1}\left\{\alpha e^{-j\omega d} - \frac{\alpha^2}{2} e^{-j2\omega d} + \frac{\alpha^3}{3} e^{-j3\omega d} - \cdots\right\}
= \alpha\,\delta(n-d) - \frac{\alpha^2}{2}\,\delta(n-2d) + \frac{\alpha^3}{3}\,\delta(n-3d) - \cdots \tag{3.18}
\]
Hence,
\[
\hat{s}_w(n) = \hat{s}_o(n) + \alpha\,\delta(n-d) - \frac{\alpha^2}{2}\,\delta(n-2d) + \frac{\alpha^3}{3}\,\delta(n-3d) - \cdots \tag{3.19}
\]
This shows that a series of impulses with exponentially decaying amplitudes
repeatedly appear for every d samples. In particular, the dominant spike is just
located at the delay (n D d ) and its amplitude is equal to that of the embedded
echo, ˛. Then, the watermark can be decided based on the comparison between the
values of cepstrum coefficients at two delays, i.e., sOw .d1 / and sOw .d0 / .
To further increase the amplitude of cepstrum spikes representing the echoes,
the autocorrelation of the cepstrum (auto-cepstrum) $c_a(n)$ is employed to detect the delay [36, 109]:
\[
c_a(n) = \mathcal{F}^{-1}\left\{\left[\log\left(\mathcal{F}\{s_w(n)\}\right)\right]^2\right\}. \tag{3.20}
\]
Since the autocorrelation calculates the signal power at each delay, the power
spike in the cepstrum is more prominent, as illustrated in Fig. 3.12. Therefore, the
watermark bit is determined by comparing $c_a(d_1)$ and $c_a(d_0)$:
\[
w_e(i) =
\begin{cases}
1, & \text{if } c_a(d_1) \ge c_a(d_0),\\
0, & \text{if } c_a(d_1) < c_a(d_0).
\end{cases} \tag{3.21}
\]
The performance of echo hiding depends on echo kernels, and hence different
echo kernels are introduced to improve the imperceptibility and robustness of the
embedded echoes [3, 108, 110–112]. In [108], the echo kernel comprises multiple
positive and negative echoes with different delays. Typically, a dual echo kernel with one positive and one negative echo is denoted as
\[
h(n) = \delta(n) + \alpha\,\delta(n - d) - \alpha\,\delta(n - d - \Delta), \tag{3.22}
\]
where $\Delta$ is a small additional delay between the positive and negative echoes.
[Figure 3.12: complex cepstrum, real cepstrum, and auto-cepstrum of a watermarked signal (amplitude vs. samples); the close-up shows that the spike at the echo delay (sample 401) is most prominent in the auto-cepstrum.]
Therefore, the watermark bit can be determined by comparing $[c_a(d_1) - c_a(d_1 + \Delta)]$ and $[c_a(d_0) - c_a(d_0 + \Delta)]$. By virtue of the positive and negative echo kernel, high-energy echoes can be added to enhance the robustness while the audio quality is not deteriorated. This is because, by combining these closely located positive and negative echoes, the frequency response of the dual echo kernel remains flat over the lower frequencies. Thus, the perceptual quality of the watermarked signal is preserved.
Later, Kim et al. [110] proposed the backward and forward echo kernel as follows:
\[
h(n) = \delta(n) + \alpha\,\delta(n - d) + \alpha\,\delta(n + d). \tag{3.24}
\]
Its complex cepstrum is
\[
\begin{aligned}
\hat{h}(n) &= \alpha\,\{\delta(n-d) + \delta(n+d)\} - \frac{\alpha^2}{2}\{\delta(n-2d) + 2\delta(n) + \delta(n+2d)\}\\
&\quad + \frac{\alpha^3}{3}\{\delta(n-3d) + 3\delta(n-d) + 3\delta(n+d) + \delta(n+3d)\} - \cdots\\
&= \left(\alpha + \alpha^3 + \alpha^5 + \cdots\right)\delta(n-d) + \cdots = \frac{\alpha}{1-\alpha^2}\,\delta(n-d) + \cdots \tag{3.25}
\end{aligned}
\]
Therefore, the watermark bit can be determined by comparing $c_a(d_1)$ and $c_a(d_0)$. From Eq. (3.25), the amplitude of the cepstrum peak at $n = d$ is equal to $\frac{\alpha}{1-\alpha^2}$, which is larger than $\alpha$ for $0 < \alpha < 1$. As a result, the detection rate is increased.
The MATLAB script for echo hiding can be found as the Echo_hiding.m file under the Audio_Watermarking_Techniques folder on the attached CD.
In addition, a time-spread echo kernel [111] is introduced to enhance the security.
Although the large amplitude of the cepstrum peak is beneficial to robustness,
obvious spikes are against the purpose of security and unauthorized attackers might
detect the existence of echoes easily without prior knowledge. By using a PRS to
spread multiple echoes, the amplitude of each echo becomes small. It contributes
directly to the imperceptibility, while the detection ability is better maintained as
well. Furthermore, log-scaling watermark detection [112] is proposed to cope with
pitch-scale modification (PSM). Recently, Chen et al. [3] designed an advanced echo
hiding scheme based on the analysis-by-synthesis approach.
In our experiments, echo hiding schemes with a single echo kernel (Kernel 1), a positive and negative echo kernel (Kernel 2), and a backward and forward echo kernel (Kernel 3) are evaluated. For a fair comparison, the frame length and the amplitude of the different echo kernels are the same, i.e., N = 4,096 and α = 0.2. In order to enhance the security, a sequence of pseudorandom numbers is utilized as the secret key to shift between several echo delays. Each delay is denoted as $d_{xy}$, where $x$ and $y$ represent the pseudorandom number (PRN) and the watermark bit, respectively. In this way, if the PRN is 1 and bit "0" is to be embedded, then $d_{10}$ is selected. Considering both imperceptibility and robustness, the delays are set as follows: $d_{11} = 100$ and $d_{01} = 120$ are used for embedding bit "1," while $d_{10} = 110$ and $d_{00} = 130$ are used for embedding bit "0." Moreover, the additional delay $\Delta = 4$ is used in the positive and negative echo kernel. As with the previous watermarking techniques, Hanning windowing and half overlapping for smooth transitions, variable frame length to combat PITSM, and repetition coding are also employed to further improve the performance.
Figure 3.13 shows the watermarked signal produced by echo hiding using the
positive and negative echo kernel. The results of performance evaluation of three
echo kernels are summarized in Table 3.6. The positive and negative echo kernel
provides higher SNRs than the other two kernels, which means the least distortion
between the watermarked and host signals. However, the watermarked signals with
three echo kernels obtain the same ODG scores and are deemed to be similar in
perceptual quality. Informal subjective listening tests show that the added echoes do
not introduce annoying noises, rather they make the sound rich.
Regarding the robustness of the watermarked signals, the backward and forward echo kernel (Kernel 3) generally provides the best detection rate. With the help of triple repetition coding (nr = 3) and variable frame length (indicated as Ñ), the BERs of watermarks under all attacks except TPPSM are less than 10 %. After
Fig. 3.13 Host signal and a watermarked signal by echo hiding. Note that the watermarked signal is produced by watermarking with N = 4,096, α = 0.2, and nr = 3. (a) Host audio signal. (b) Watermarked audio signal. (c) Difference between the watermarked and host audio signals
quintuple repetition coding (nr = 5), the BERs under cropping and inserting are
reduced further. For the other two kernels, the positive and negative echo kernel
(Kernel 2) performs slightly better than the single echo kernel (Kernel 1), where the
BERs are decreased on average. In addition, it is worth mentioning that all three
kernels exhibit good resistance to DA/AD conversion. However, all the watermark
detections with any kernel fail completely under TPPSM. Also, echo addition might
be a hazardous attack on echo hiding watermarking. If echo delays in the attack
happen to be the same as those of echo kernels, the mistakes in watermark detection
are unavoidable.
Echo hiding is a watermarking technique specifically for audio signals. By
selecting the proper amplitude and delay of echo kernels, the echoes embedded as
the watermark can be imperceptible and robust against most attacks. However, echo
hiding may suffer from two deficiencies. One is weak security, because obvious
cepstrum peaks might be tampered with deliberately. The other is about inborn
echoes contained in natural sound, which might result in false-positive errors [110].
Table 3.6 Results of performance evaluation of echo hiding

Watermarking parameters
  Echo hiding method     Kernel 1             Kernel 2             Kernel 3
  Watermark length Nw     251    83    50      251    83    50      251    83    50
  Repetition coding nr      1     3     5        1     3     5        1     3     5
  SNR /dB               14.78 14.19 14.13    19.25 18.48 18.30    11.56 11.45 10.87
  ODG                   −2.20 −2.20 −2.20    −2.20 −2.20 −2.20    −2.20 −2.20 −2.20
Robustness test (BER: %)
  TPPSM +4 %            44.22 38.55 40.00    43.82 50.60 34.00    45.02 32.53 36.00
        −4 %            46.22 42.17 48.00    59.36 80.72 80.00    46.22 43.37 50.00
3.2.7.1 Algorithm
There are several ways in which the histogram can be modified to embed the watermark. Coltuc et al. [113] implemented robust image watermarking using exact histogram specification, where the image histogram is shape-altered (e.g., into a saw-tooth shape with 3–8 periods) to represent a watermark. Later, Meşe et al. [114] designed an optimal algorithm for histogram modification, so that the mean square error (MSE) between the modified and host images is minimized. In [115], the watermark was embedded into the image by permuting some pairs of histogram bins. Apart from being robust against geometrical attacks, this watermarking scheme is also reversible, which means that the watermarked image can be fully restored to its original status. Unlike the commonly used histogram in the time domain, Xuan et al. [116] employed histogram shifting in the integer wavelet transform domain for reversible image watermarking.
In [35], histogram modification is applied to audio watermarking. Based upon the fact that the modified audio mean and the audio histogram shape are invariant to temporal scaling, the authors designed an audio watermarking scheme resistant to time-scale modification (TSM), random cropping, and inserting attacks. The modified audio mean $\bar{A}$ is defined as the average of the absolute values of the 16-bit signed audio signal, i.e., $\bar{A} = \frac{1}{N_o}\sum_{n=1}^{N_o} |s_o(n)|$. In the embedding, $\bar{A}$ is used to decide the amplitude range $B$ of the samples for producing the histogram, i.e., $B = \left[-\lambda\bar{A},\ \lambda\bar{A}\right]$. Through extensive experiments on different audio signals, a suggested range is $\lambda \in [2,\ 2.5]$ [35]. The number of histogram bins $N_{bin}$ depends on the watermark length $N_w$, i.e., $N_{bin} = 3N_w$. The factor 3 comes from the fact that the watermark is embedded by controlling the relative relation of the numbers of samples in every three neighboring bins. Accordingly, the bin width $N_{bw}$ is calculated as $N_{bw} = \left\lfloor \frac{2\lambda\bar{A}}{N_{bin}} \right\rfloor$. The value of $N_{bw}$ affects both imperceptibility and robustness. A small $N_{bw}$ is likely to preserve the shape of the original histogram, which is beneficial to the perceptual quality. Meanwhile, each bin should contain sufficient samples to ensure the watermark robustness. Therefore, $\lambda$ must be carefully chosen to obtain a suitable $N_{bw}$.
After the histogram is constructed, its bins are divided into $N_w$ groups, each of which has three bins. For every group, the numbers of samples in the three consecutive bins are denoted as $N_{b1}$, $N_{b2}$, and $N_{b3}$, respectively. Then, one watermark bit is embedded into one group of histogram bins by applying the following rules [117]:
\[
\begin{cases}
\dfrac{N_{b1} + N_{b3}}{2N_{b2}} \ge E_h, & \text{if } w_o(i) = 1,\\[8pt]
\dfrac{2N_{b2}}{N_{b1} + N_{b3}} \ge E_h, & \text{if } w_o(i) = 0,
\end{cases} \tag{3.26}
\]
where the embedding strength $E_h$ is around 1.2–1.5. Obviously, a large $E_h$ would increase the robustness but degrade the perceptual quality.
From the embedding process, one point to note is that histogram-based water-
marking has the advantage of handling silent intervals of the host audio signal. As
mentioned in Sect. 3.1.1, it is nearly impossible to embed the watermark into zero
values. In histogram-based watermarking, all zero-value samples fall into the center
of the histogram. Therefore, these samples can be well preserved, provided that the
one or two bins right in the center are exempted from watermarking.
In the detection, the modified mean of the watermarked signal, $\bar{A}^w$, is calculated, and then the histogram is constructed with the same $\lambda$ and $N_{bin}$. Furthermore, the histogram bins are divided into groups in the same way as in the embedding. Therefore, the watermark bit is decided by comparing the numbers of samples in three consecutive bins, namely $N_{b1}^w$, $N_{b2}^w$, and $N_{b3}^w$:
\[
w_e(i) =
\begin{cases}
1, & \text{if } \dfrac{N_{b1}^w + N_{b3}^w}{2N_{b2}^w} \ge 1,\\[8pt]
0, & \text{if } \dfrac{N_{b1}^w + N_{b3}^w}{2N_{b2}^w} < 1.
\end{cases} \tag{3.27}
\]
Since the attacks might change the modified audio mean, $\bar{A}^w$ is probably not equal to $\bar{A}$. As a result, the histogram range deviates and the watermark cannot be detected correctly. Therefore, it is necessary to search for the proper modified audio mean. Experimental results show that the fluctuation of the modified audio mean is usually less than ±6 %; thus, optimal searching within this range is proposed to avoid exhaustive searching [35].
The MATLAB script for histogram-based watermarking can be found as the Histogram_watermarking.m file under the Audio_Watermarking_Techniques folder on the attached CD.
Fig. 3.14 Host signal and a watermarked signal by histogram-based watermarking. Note that the watermarked signal is produced by watermarking with Nw = 40, λ = 2.2, and Eh = 1.5. (a) Host audio signal. (b) Watermarked audio signal. (c) Difference between the watermarked and host audio signals
This is because the samples in each histogram bin are arranged in the order they are found in the host signal. During the process of reassigning samples among the three bins of each group, the samples at the beginning and the end of each bin are always chosen
with priority. Our implementation attempted to randomly select the samples in the
bins, but the perceptual quality of the watermarked signal became worse. The reason
might be that the randomly distributed modification resembles the addition of white
noise.
The results of the performance evaluation of histogram-based watermarking are
illustrated in Table 3.7. Generally, a larger N_w or a smaller λ would result in a
better perceptual quality, but a weaker robustness. This is because the bin width N_bw
is proportional to λ, but inversely proportional to N_w. Moreover, histogram-based
watermarking is indeed quite robust against most desynchronization attacks such
as TSM, cropping, jittering, and inserting. In compromise with the imperceptibility,
the watermarks are also able to survive PSM to a certain extent.
However, the BERs of the watermarks are rather high under some attacks such as
low-pass filtering, DA/AD conversion, reverberation, and MP3 compression.
Even after extending the searching range of A^w to ±10 %, the detection rate shows no
substantial improvement.
Table 3.7 Results of performance evaluation of histogram-based watermarking

Watermarking parameters
  Watermark length Nw       |            20             |            40
  Embedding range λ         |     2.2     |     2.5     |     2.2     |     2.5
  Embedding strength Eh     |  1.2 |  1.5 |  1.2 |  1.5 |  1.2 |  1.5 |  1.2 |  1.5
SNR (dB)                      41.99 38.66  40.96 37.55   47.95 44.59  46.70 43.35
BER (%)
  No attack                       0     0      0     0       0     0      0     0
  Noise (30 dB)                5.00     0      0     0   20.00 12.50  25.00 17.50
  Resampling (22.05 kHz)          0     0      0     0       0     0      0     0
  Requantization (8 bit)      30.00 30.00  35.00 35.00   35.00 35.00  27.50 25.00
  Amplitude +10 %                 0     0      0     0       0     0      0     0
  Amplitude −10 %                 0     0      0     0       0     0      0     0
  Lp filtering (8 kHz)         5.00  5.00   5.00     0   27.50 22.50  22.50 17.50
  DA/AD (line-in jack)        25.00 20.00  35.00 30.00   32.50 30.00  27.50 30.00
  Echo (0.3, 200 ms)          25.00 30.00  30.00 30.00   15.00 22.50  20.00 22.50
  Reverb (1 s)                25.00 25.00  15.00 15.00   27.50 22.50  17.50 20.00
  Compression I (96 kbps)         0     0      0     0   15.00 15.00  10.00 15.00
  Compression I (64 kbps)      5.00  5.00   5.00  5.00   15.00 17.50  22.50 20.00
  Compression I (48 kbps)     15.00  5.00  15.00 10.00   22.50 25.00  22.50 27.50
  Cropping (4 × 25 ms)            0     0      0     0       0     0      0     0
  Jittering (0.1/20 ms)           0     0      0     0       0     0      0     0
  Inserting (4 × 25 ms)           0     0      0     0       0     0      0     0
  PITSM +10 %                     0     0      0     0       0     0      0     0
  PITSM −10 %                     0     0      0     0       0     0      0     0
  TPPSM +10 %                     0     0      0     0   10.00     0   7.50  5.00
  TPPSM −10 %                     0     0      0     0    2.50     0   5.00  2.50
The reason is that the attacks have smoothed the histograms, so the relative
relations between N_{b1}^w, N_{b2}^w, and N_{b3}^w of each group are destroyed. As an exception,
the histogram changes dramatically after requantization. Since quantization rounds
off the samples, some histogram bins are eliminated, which leads to failure in
watermark detection. For an 8-bit quantizer, the quantization error is calculated as
Δ_q = 2 max(s_o(n))/2^8 = max(s_o(n))/128 [118]. Therefore, the width of the histogram bins should be
larger than Δ_q to combat the requantization attack.
Later, instead of the histogram in the time domain, Xiang et al. [2] exploited
the invariance of histogram shape in the low-frequency subband of the DWT
domain. It is reported that the improved watermarking method is more robust against
TSM, low-pass filtering, and MP3 compression. Nevertheless, the security of the
watermarking scheme is still a serious issue.
In addition to the mean, other statistical moments can be used as invariants in
histogram-based watermarking. For example, image steganalysis in [119] employed
the first four moments (i.e., the mean, variance, skewness, and kurtosis) of the
subbands that are decomposed by quadrature mirror filters (QMF). In this book, we
attempted to utilize the moments in [120] for audio watermarking, where the audio
signal has been converted into a two-dimensional square matrix as implemented in
[121]. However, most invariants (except for the first moments) are not applicable
to audio watermarking. The reason might be that the audio signal represented in a
two-dimensional form is incomparable to a two-dimensional image.
3.3 Summary
4 Proposed Audio Watermarking Scheme
The watermark bits are repeatedly embedded to increase the robustness. Also, a
synchronization bit is repeatedly embedded for synchronization purposes.
For a better understanding of the watermarking procedure, we start with some
preliminary knowledge, such as the selection of watermarking regions and the
structure of the watermarking domain.
4.1 Preliminaries

As mentioned in Sect. 3.1.1, embedding the watermarks into silent segments would
introduce unavoidable perceived noise. Therefore, a selection process is applied
on the host audio signal to determine the embedding regions. Correspondingly, it
is necessary to perform this procedure in the watermark detection stage, so as to
locate the regions for watermark detection. Since various attacks might alter the
watermarked signal, the watermarking regions should be rather stable to ensure that
they still can be identified in the attacked signals [4, 5, 122]. In this way, the process
of selection is a kind of initial synchronization between the watermark embedding
and detection.
There are several methods for selecting reliable watermarking regions which
usually follow certain distinct points, for example, salient point extraction [6,33,97],
peak point extraction [123], and envelope peak extraction [5]. Commonly, these
delicate methods have been employed as solutions to synchronization; however,
this is not necessary in the proposed scheme. The selection of our watermarking
regions mainly aims to preclude the long silences from watermarking, not to solve
the synchronization problem for watermark detection. Thus, the accuracy of locating
watermarking regions is not required to be as high as that in synchronization.
Similar to the method in [4], the proposed scheme selects the watermarking
regions according to the signal energy. The energy is calculated on a frame-by-frame
basis along the input signal, where each non-overlapping frame g(n) has a length
of N. Then, successive frames whose energy, i.e., Σ_{n=1}^{N} g^2(n), exceeds a certain
threshold E_T are concatenated to construct a high-energy segment. In our scheme,
a high-energy segment which has a duration of more than 2 s is considered to be a
long high-energy segment. Furthermore, adjacent long high-energy segments which
are located within 0.1 s of each other are concatenated into one watermarking region.
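A minimal MATLAB sketch of this selection procedure is given below; the function name and variable names are illustrative, and fs denotes the sampling frequency, assumed to be known.

function regions = select_regions_sketch(x, N, ET, fs)
% x: input signal; N: frame length; ET: energy threshold; fs: sampling rate
x      = x(:);
Nfr    = floor(numel(x) / N);                       % non-overlapping frames
E      = sum(reshape(x(1:Nfr*N), N, Nfr).^2, 1);    % per-frame energy
high   = E > ET;                                    % high-energy frames
d      = diff([0 high 0]);
starts = find(d == 1);  stops = find(d == -1) - 1;  % runs of high frames
long   = (stops - starts + 1) * N >= 2 * fs;        % keep runs longer than 2 s
starts = starts(long);  stops = stops(long);
regions = [(starts - 1) * N + 1; stops * N].';      % [start, stop] in samples
i = 1;                                              % merge segments closer than 0.1 s
while i < size(regions, 1)
    if regions(i+1, 1) - regions(i, 2) <= 0.1 * fs
        regions(i, 2) = regions(i+1, 2);  regions(i+1, :) = [];
    else
        i = i + 1;
    end
end
end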
The predefined energy threshold ET plays an important role in the selection of
watermarking regions. For different values of ET , different watermarking regions
are selected from the input signal, as shown in Fig. 4.1.
With a lower ET , more audio frames are included and hence longer segments
will be available for embedding the watermark. In this case, data payload is higher.
However, since the segments with relative low energy are susceptible to attacks, the
obtained watermarking regions might be unstable. With a higher ET , more stable
segments will be chosen for watermarking, but data payload is reduced accordingly.
Therefore, given the watermark to be embedded, ET is then determined to achieve
a better compromise between the robustness and data payload.
Fig. 4.1 Watermarking regions selected from the input signal for different values of E_T, shown
as normalized amplitude versus time; in (a), two regions (watermarking region 1 and
watermarking region 2) are marked

Fig. 4.2 Structure of the watermarking domain: a time–frequency plane from 0 to f_s/2,
divided into blocks along the time axis, with each block carrying one sub-watermark B_sub(m)
The watermarking domain, which is generated by taking the fast Fourier transform
(FFT) of adjacent audio frames with 50 % overlap, refers to the time–frequency
representation of the selected watermarking regions. Each frame has a length of N
points. As shown in Fig. 4.2, the watermarking domain is divided into Nblock blocks.
The reason for the half overlapping is to smooth the transition between frames, as
used by the previous techniques in Chap. 3.
Each block is used to embed one sub-watermark Bsub , which is a part of the
original watermark wo , i.e.,
w_o = \{B_{sub}(m) \mid m = 1, \ldots, N_{block}\}. \quad (4.1)
Fig. 4.3 Structure of one block: FFT coefficients grouped into subbands (vertical axis) and
frames (horizontal axis, frame indices 0–12); the tile T_{12,4}, located at the 4th subband of the
12th frame, is highlighted
Moreover, every sub-watermark B_sub contains N_bit watermark bits B_i ∈ {−1, 1},
i.e.,

B_{sub} = \{B_i \mid i = 1, \ldots, N_{bit}\}. \quad (4.2)
Figure 4.3 shows the details of each block as segmented into different levels of
granularity: unit, subband, slot, and tile.
Along the time axis, every block is divided into N_unit units, each of which
comprises N_c frames. Thus, one block has N_f = N_c × N_unit frames. In our scheme,
N_c = 4; the implications of this choice will become clear in Sect. 4.3.1.
Along the frequency axis, the block is divided into Nsubband nonuniform perceptually
motivated subbands based on the Gammatone filterbank (GTF) (to be described in
Sect. 4.1.3).
The intersection of a subband and a unit is called a slot. Each watermark bit
Bi 2 f1; 1g is repeatedly embedded into a number of slots for robustness purposes.
In addition to the watermark bits, a synchronization bit Bs D 1 is also repeatedly
embedded for synchronization purposes. Specifically, NB slots are randomly chosen
from the total number of slots within every block for embedding each Bi . Then the
remaining slots are used for embedding B_s, whose total number is N_s. The value of
N_s is calculated by

N_s = N_{subband} \times N_{unit} - N_{bit} \times N_B. \quad (4.3)

In this way, every slot is used for embedding a bit, as exemplified in Fig. 4.4.
Without loss of generality, it is assumed here that each sub-watermark consists of
two watermark bits, i.e., B_sub = {B_1, B_2}, each of which is embedded into five
slots, i.e., N_B = 5. Thereby, N_s = 3 × 6 − 2 × 5 = 8 slots are used for embedding B_s.
To enhance watermarking security, every bit embedded in a slot is modulated
by a pseudorandom number (PRN), P_x ∈ {−1, 1}. As mentioned above, N_B slots
Fig. 4.4 Distribution of the watermark bits and synchronization bit within one block
(6 subbands × 12 frames). Note: the three kinds of slot markers indicate the slots used for
embedding B_1, B_2, and B_s, respectively
In this example, M_P consists of P_b1 = {P_b1(1), P_b1(2), ..., P_b1(5)}, P_b2 =
{P_b2(1), ..., P_b2(5)}, and P_s = {P_s(1), P_s(2), ..., P_s(8)}, which correspond to
B_1 (in red), B_2 (in blue), and B_s (in black), respectively. Without loss of generality,
our implemented watermark detection (to be described in Sect. 4.3) searches for B_s and
each B_i embedded in the block, starting from the bottom left and proceeding column
by column. Correspondingly, the indices of P_s and P_bi in Eq. (4.4) likewise start
from the bottom left and run column by column.
Note that all blocks have the same configuration of MB , which is solely
determined by kb . But the value of MB varies from block to block, since each block
has different watermark bits fBi g, i D 1; : : : ; Nbit to be embedded. As for MP ,
it is unique and the same for all blocks to keep synchronization. The values of the M_P
components are determined by one secret key k_p, which also belongs to the confidential
information shared only between the embedder and the authorized detectors.
The smallest basic element of the slot is called a tile, as indicated in Fig. 4.3. A tile,
consisting of several FFT coefficients, is the basic module for amplitude modulation
in the watermark embedding. Since N_c = 4, each slot contains four tiles. Recall
that every slot is used for embedding one bit, which is modulated by a PRN. More
specifically, the four tiles in a slot are used for embedding a bit and they share the
PRN of that slot in common. Accordingly, the tiles used for embedding each bit are
identified in the watermark detection and used to determine the bit value.
Each tile is denoted as T_{t,b}, where t and b are its frame and subband indices,
respectively. For example, T_{12,4} shown in Fig. 4.3 stands for the tile located at the
4th subband of the 12th frame.
To get the center frequency and the bandwidth of each subband, the entire
carrying band (f_L – f_H) and the number of channels N_GTF are required. Then, the
overlapping spacing is calculated by

v = \frac{9.26}{N_{GTF}} \log \frac{f_H + 228.7}{f_L + 228.7}. \quad (4.5)
However, the channels of the GTF usually overlap, which would cause confusion
between embedding bits. In order to get a set of non-overlapping subbands,
the lower limit of each channel (V^l) is always taken as the boundary, where V^l =
f_c − B_w/2. This is because critical bandwidths are determined by the lower edges
of the band [11]. Moreover, narrow channels in the low frequencies might contain
only one FFT coefficient or even none under a sampling frequency of 44.1 kHz.
Since a single frequency coefficient is sensitive to slight modification, several
channels in the low frequencies are combined in our scheme. In this way, the tiles
are forced to contain more than five FFT coefficients, which greatly helps improve
the robustness. For example, a set of 32 nonuniform subbands (i.e., N_subband = 32)
over the frequency spectrum is illustrated in Appendix G.
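As a hedged sketch of how non-overlapping subbands could be derived from the GTF channels, the following MATLAB script applies the ERB-rate relation behind Eq. (4.5); the values of fL, fH, NGTF, N, and fs, as well as the ERB bandwidth approximation Bw = 24.7 + fc/9.26, are assumptions of this illustration.

fL = 50;  fH = 8000;  NGTF = 40;                % assumed carrying band and channels
erb  = @(f) 9.26 * log((f + 228.7) / 228.7);    % ERB-rate scale
ierb = @(e) 228.7 * (exp(e / 9.26) - 1);        % inverse of the ERB-rate scale
v    = (erb(fH) - erb(fL)) / NGTF;              % channel spacing, cf. Eq. (4.5)
fc   = ierb(erb(fL) + ((1:NGTF) - 0.5) * v);    % channel center frequencies
Bw   = 24.7 + fc / 9.26;                        % assumed ERB bandwidth per channel
Vl   = fc - Bw / 2;                             % lower limits = subband boundaries
% merge narrow low-frequency channels so each tile spans > 5 FFT coefficients
N = 4096;  fs = 44100;  df = fs / N;            % assumed FFT resolution
bounds = Vl(1);
for k = 2:NGTF
    if Vl(k) - bounds(end) > 5 * df
        bounds(end+1) = Vl(k);
    end
end
bounds(end+1) = fH;                             % close the last subband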
For a given host signal to be watermarked, the first step of the embedding algorithm
is the selection of the watermarking regions. This is followed by the construction
of the watermarking domain. Generally, the watermark bits are embedded through
amplitude modulation of the tiles in the watermarking domain.
Fig. 4.6 Block diagram of watermark embedding: the host frame is transformed by the FFT and
analyzed by Psychoacoustic Model 1; the resulting magnitude and phase are used to reconstruct
the spectrum of the watermark signal, which after the IFFT is added to the host frame to produce
the watermarked frame
%Fp(1:N/2) is the positive-frequency part of Fwm.
%Fn(1:N/2) is the negative-frequency part of Fwm.
Fp(1:N/2) = sign .* magnitude .* exp(j*phase);
Fn(1:N/2) = fliplr(conj(Fp(1:N/2)));
Fwm(1:N)  = [Fp(1:N/2), 0, Fn(1:N/2-1)];
where N is the frame length, the function fliplr(·) flips a matrix from left to right, and
the function conj(·) computes the complex conjugate.
After taking the inverse fast Fourier transform (IFFT), the frequency spectrum of
the watermark signal is transformed to the time domain and subsequently added to
the host frame to produce the watermarked frame.
Finally, all the watermarked frames in each watermarking region are windowed
by a Hanning window for smooth concatenation. By combining original samples
with the watermarked regions in order, the overall watermarked audio signal is
formed.
Figure 4.7 shows an example of the watermarked signal, where the host signal
is the audio test file A2 (Bass.wav). It is observed that the watermark signal has a
similar shape to the host signal, which might help preserve the perceptual quality of
the watermarked signal.
N_{scrambling} = P\left(N_{subband}, \tilde{N}_{subband}\right) = \frac{N_{subband}!}{\left(N_{subband} - \tilde{N}_{subband}\right)!}. \quad (4.8)
Fig. 4.7 Host signal and a watermarked signal by the proposed scheme. (a) Host signal.
(b) Watermark signal. (c) Watermarked signal
4.3 Watermark Detection

From the description of the embedding algorithm in Sect. 4.2.1, the watermark bits
in the bit matrix M_B are used to determine the sign of the spectrum of the watermark
signal. Therefore, in the detection, every watermark bit is determined by checking
¹As mentioned in Sect. 4.1.2, the tile is the basic module for amplitude modulation in the
watermark embedding. Therefore, the detection is focused directly on the tiles, not the slots.
²As defined in Sect. 1.2.1, the input to the watermark detector is generally called the attacked
signal, no matter whether it has been attacked or not. In the case that a watermarked signal has
been attacked, we specifically call it an attacked watermarked signal.
³The random stretching attack used by [9, 10], which was implemented by omitting or inserting
a random number of samples (usually called "random samples cropping/inserting"), and the pitch
shifting attack by linear interpolation are much less complicated than PITSM and TPPSM.
Fig. 4.8 Block diagram of watermark detection. Notes: (1) Basic detection works independently.
(2) Adaptive synchronization is an improvement technique for block synchronization. (3) Frequency
alignment (indicated by dashed lines) is an additional solution to excessive PITSM and TPPSM
4.3 Watermark Detection 107
• Basic detection is the basic algorithm for watermark detection, which includes a
set of steps for detecting the watermark from the attacked signal. It begins with
the selection of watermarking regions and the construction of the watermarking
domain, similar to the embedding algorithm. Then, several operations are
performed to calculate the magnitude of the tiles, which are required by block
synchronization. By use of the secret keys, block synchronization aims to find
out the synchronization position of each block. Based on the synchronization
position found, the tiles used for embedding each watermark bit are identified
and further utilized to calculate the bit strength. According to the bit strength, the
value of that watermark bit is determined. Finally, the watermark bits detected
from all the blocks comprise the whole watermark.
• Adaptive synchronization is an improvement technique for block synchroniza-
tion. Except that a threshold Tsync is introduced as an extra input, adaptive
synchronization has the same inputs (i.e., magnitude of the tiles and secret keys)
and the same output (i.e., synchronization position dsync ) as block synchroniza-
tion. Unless requested by frequency alignment, adaptive synchronization does
not output the average synchronization strength (As ).
• Frequency alignment is an additional solution to excessive PITSM and TPPSM of
up to ±10 %. Only when basic detection updated with adaptive synchronization
cannot detect the watermark from the attacked watermarked signal is frequency
alignment employed to descale the frequency spectra. The average synchronization
strength from adaptive synchronization is involved in choosing the scale factor.
As seen from Fig. 4.8, basic detection consists of the following eleven steps:
• Step 1: Selection of the watermarking regions
For a given signal, successive frames whose energy exceeds the threshold E_T are
identified and concatenated into high-energy segments as watermarking regions. Only
these segments are used for watermark detection.
• Step 2: Construction of the watermarking domain
Similar to the approach in the embedding, the watermarking domain for detection
is also generated by taking the FFT of adjacent frames with 50 % overlap. However,
Hanning windowing is applied on these frames before taking the FFT. This is
because all the watermarked frames are windowed and subsequently concatenated
into the watermarked signal, as described in Sect. 4.2.1.
The frequency spectrum of the tth windowed frame, Ft .n/, is calculated by
F_t(n) = \mathrm{FFT}\left[g_t(n) \times \mathrm{Hanning}(n)\right], \quad 1 \le n \le N, \quad (4.9)
where N is the frame length, g_t(n), 1 ≤ n ≤ N, is the tth frame, and Hanning(n),
1 ≤ n ≤ N, is the N-point Hanning window.
In view of the conjugate symmetry of the frequency spectrum, we only process
the positive-frequency part with N/2 coefficients, i.e., F_t(1 : N/2).
• Step 3: Calculation of the magnitude spectrum
The magnitude spectrum of the tth windowed frame is |F_t(1 : N/2)|,
where |·| denotes the absolute value.
• Step 4: Calculation of the power spectrum
The power spectrum of the tth windowed frame, P_t(1 : N/2), is calculated by

P_t\left(1 : \tfrac{N}{2}\right) = 20 \log_{10} \left| F_t\left(1 : \tfrac{N}{2}\right) \right|. \quad (4.10)
• Step 5: Mean removal
To whiten the power spectrum, the mean is removed from each frame:

\hat{P}_t\left(1 : \tfrac{N}{2}\right) = P_t\left(1 : \tfrac{N}{2}\right) - \bar{P}_t, \quad (4.11)

where \bar{P}_t = \frac{1}{N/2} \sum_{n=1}^{N/2} P_t(n) is the mean of P_t(1 : N/2).
• Step 6: Inter-subtraction
To amplify the effect of the watermark signal and reduce the effect of the host
signal, the difference \tilde{P}_t(1 : N/2) between each frame and the one just after the next
is calculated:

\tilde{P}_t\left(1 : \tfrac{N}{2}\right) = \hat{P}_t\left(1 : \tfrac{N}{2}\right) - \hat{P}_{t+2}\left(1 : \tfrac{N}{2}\right). \quad (4.12)

The reason is that, on the condition that adjacent frames in the blocks are half
overlapped, these two frames (t and t + 2) are considered to be non-overlapped
with each other.
• Step 7: Analysis of the tiles
To accumulate the effect of the watermark signal over the FFT coefficients in a
tile, the magnitude of each tile is calculated. Specifically, the magnitude of the tile
located at the bth subband of the tth frame, Q_{t,b}, is calculated by

Q_{t,b} = \frac{\sum_{n=V_b^l}^{V_b^h} \tilde{P}_t(n)}{V_b^h - V_b^l + 1}, \quad (4.13)

where V_b^l and V_b^h refer to the lower and upper bounds of the bth subband,
respectively.
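The chain of Steps 2–7 can be sketched in MATLAB as follows; the function name is illustrative, the small eps guarding log(0) is an addition of this sketch, and Vl/Vh are assumed to hold the subband bounds expressed as FFT-bin indices within 1..N/2.

function Q = tile_magnitudes_sketch(x, N, Vl, Vh)
% x: one watermarking region; N: frame length (50 % overlap between frames)
x    = x(:);
hop  = N / 2;
win  = 0.5 - 0.5 * cos(2 * pi * (0:N-1)' / (N - 1));   % Hanning window
Nfrm = floor((numel(x) - N) / hop) + 1;
P    = zeros(N/2, Nfrm);
for t = 1:Nfrm
    g  = x((t-1)*hop + (1:N).') .* win;        % t-th windowed frame
    F  = fft(g, N);                            % Eq. (4.9)
    Pt = 20 * log10(abs(F(1:N/2)) + eps);      % Eq. (4.10); eps avoids log(0)
    P(:, t) = Pt - mean(Pt);                   % mean removal, Eq. (4.11)
end
Pd = P(:, 1:Nfrm-2) - P(:, 3:Nfrm);            % inter-subtraction, Eq. (4.12)
Q  = zeros(numel(Vl), Nfrm - 2);
for b = 1:numel(Vl)
    Q(b, :) = mean(Pd(Vl(b):Vh(b), :), 1);     % tile magnitude, Eq. (4.13)
end
end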
• Step 8: Block synchronization
The purpose of block synchronization is to find out the beginning frame of each
block.
Since every frame in its block could be the beginning frame, it is necessary
to calculate the synchronization strength S_d, d = 1, ..., N_f, frame by frame, where
N_f = N_c × N_unit. On the assumption that the dth frame is the beginning frame, S_d
is calculated by the following normalized correlation [13]:

S_d = \frac{\sum_{k=1}^{N_s} Q_{t(d,k),\,b(k)}\, P_s(k)}{\sqrt{\sum_{k=1}^{N_s} Q_{t(d,k),\,b(k)}^2}\, \sqrt{\sum_{k=1}^{N_s} \left[P_s(k)\right]^2}}, \quad (4.14)

where \{Q_{t(d,k),\,b(k)}\} represent the N_s tiles that are used for embedding B_s under the
assumed condition and \{P_s(k)\} are their corresponding PRNs. The subscripts
t(d,k) and b(k) that locate the tiles are computed by Eqs. (4.15) and (4.16),
where N_c is the number of frames per unit (N_c = 4 in our scheme).
Rs is called the index matrix for Bs , which solely depends on the secret key kp
mentioned in Sect. 4.1.2. To indicate the location of these tiles, the first and second
columns of Rs represent the distribution of the tiles’ columns and rows, respectively.
For example, given M_B in Eq. (4.4), which corresponds to Fig. 4.4, its R_s is shown
in Eq. (4.17). As mentioned already for Eq. (4.4), the search for B_s starts from the
bottom left and proceeds column by column, without loss of generality.
(4.17)
Then, the beginning frame of this block (i.e., the synchronization position d_sync)
is the frame that provides the maximum S_d:

d_{sync} = \arg\max_{1 \le d \le N_f} S_d. \quad (4.18)

Based on d_sync, the tiles used for embedding each watermark bit B_j are identified,
and the corresponding bit strength G_j is calculated by the same normalized
correlation:

G_j = \frac{\sum_{k=1}^{N_B} Q_{t(d_{sync},k),\,b(k)}\, P_{b_j}(k)}{\sqrt{\sum_{k=1}^{N_B} Q_{t(d_{sync},k),\,b(k)}^2}\, \sqrt{\sum_{k=1}^{N_B} \left[P_{b_j}(k)\right]^2}}, \quad (4.19)

where \{P_{b_j}(k)\} are the PRNs corresponding to B_j.
Fig. 4.9 Illustration of cropped frames: three blocks along the frame index (subband versus
frame index), where several frames between the first and second blocks have been cropped
\text{If } G_j \ge 0, \text{ then } B_j = 1; \qquad \text{if } G_j < 0, \text{ then } B_j = 0. \quad (4.22)
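Assuming the tile-locating rules of Eqs. (4.15) and (4.16) are available, the synchronization and bit decision steps reduce to a few lines of MATLAB; in the sketch below, tiles_at is a hypothetical stand-in for those rules, and all names are illustrative.

function bits = detect_block_sketch(Q, Rs, Ps, Rb, Pb, Nf, Nc)
% Q: tile magnitudes (subband x frame); Rs/Rb{j}: index matrices derived
% from the secret keys; Ps/Pb{j}: the corresponding PRN sequences.
S = zeros(1, Nf);
for d = 1:Nf                                    % try every candidate start
    q = tiles_at(Q, Rs, d, Nc);                 % the Ns tiles for Bs
    S(d) = (q * Ps(:)) / (norm(q) * norm(Ps(:)) + eps);      % Eq. (4.14)
end
[~, dsync] = max(S);                            % Eq. (4.18)
bits = zeros(1, numel(Pb));
for j = 1:numel(Pb)
    q = tiles_at(Q, Rb{j}, dsync, Nc);          % the NB tiles for bit Bj
    G = (q * Pb{j}(:)) / (norm(q) * norm(Pb{j}(:)) + eps);   % Eq. (4.19)
    bits(j) = double(G >= 0);                   % Eq. (4.22)
end
end

function q = tiles_at(Q, R, d, Nc)
% Hypothetical stand-in for Eqs. (4.15)-(4.16): column 1 of R holds the
% unit index and column 2 the subband index of each slot; the frame of a
% tile follows from the assumed beginning frame d (indices assumed valid).
t = d + (R(:, 1) - 1) * Nc;                     % frame index of each tile
q = Q(sub2ind(size(Q), R(:, 2), t)).';
end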
Fig. 4.10 Flow of adaptive synchronization: the synchronization strength S_d (d = 1, ..., N_f) is
calculated from the tile magnitudes Q_{t,b} and the secret keys (k_b & k_p) by Eqs. (4.14)–(4.16),
and d̃_sync is determined by Eq. (4.18); if d̃_sync ≥ T_sync it is accepted as d_sync, otherwise
Eq. (4.15) is replaced with Eq. (4.23) and the search is repeated
detection in Eq. (4.19) will be taken from the third block, instead of the second
block. Consequently, the watermark bits embedded in the second block cannot be
correctly detected.
To find out the actual synchronization position of the second block (i.e., the
current first frame), we need to compensate for the cropped frames. According to
the structure of the block, the number of cropped frames can be calculated by
d_m = N_f − d_sync + 1 = N_c × N_unit − d_sync + 1 = 4 × 3 − 10 + 1 = 3. Therefore, when
identifying the tiles for embedding B_s, we take an offset of d_m into consideration.
Specifically, t(d,k) in Eq. (4.15) is modified accordingly into Eq. (4.23).
Note that if more than half of the frames in a block are missing or destroyed, we
do not expect to detect the watermark bits embedded in that block. Therefore,
1 ≤ d_m ≤ N_f/2 is considered. In view of d_m = N_f − d_sync + 1, then N_f/2 + 1 ≤ d_sync ≤ N_f.
This means that under different attacks, T_sync varies between N_f/2 + 1 and N_f.
Thus, we need to perform adaptive synchronization for each possible T_sync and
subsequently calculate the average synchronization strength A_sync, which is defined as

A_{sync} = \frac{1}{N_{block}} \sum_{k=1}^{N_{block}} S_{\tilde{d}_{sync}}(k), \quad (4.25)

where S_{\tilde{d}_{sync}}(k) is the kth block's S_{\tilde{d}_{sync}}. Then, the T_sync that provides the maximum
A_sync is regarded as the desired one. Experimentally, an optimal value of T_sync is
⌊0.8 N_f⌋, where ⌊·⌋ denotes the floor operation, i.e., the largest integer not exceeding
its argument.
\tilde{f} = \alpha f, \quad (4.26)

where α (> 0) is the scale factor, f is the frequency of the original signal, and f̃ is
the α-scaled frequency [112]. If α > 1, it is a positive PSM that yields a higher pitch;
if α < 1, it is a negative PSM that yields a lower pitch. Accordingly, such a frequency
fluctuation introduces desynchronization into watermark detection.
To retrieve the synchronization positions, frequency alignment attempts to reverse
the process of PSM; thus, the modified frequency spectrum is descaled by f = f̃/α.
Fig. 4.12 Illustration of frequency alignment. (a) Positive PSM: i_F = 1, 2, 3, 4, 5, 6, 7, 8, ...;
i_X = 1, 2, 3, 3, 4, 4, 4, ...; D_t = F_t(1), F_t(2), average of F_t(3, 4), average of F_t(5, 6, 7), ...
(b) Negative PSM: the unfilled elements of D_t are obtained by linear interpolation
The detection attempts are terminated provided that the copyright information can be recognized. Even without
knowing TBER , we can always perform frequency alignment on any attacked signal,
regardless of the attacks which are unknown beforehand. But frequency alignment
probably would fail to extract the watermarks modified by other attacks rather
than TSM/PSM attacks. Eventually, if the watermarks extracted by all attempts are
unrecognized, the suspected signal is claimed to be unwatermarked. Therefore, the
proposed audio watermarking scheme is undoubtedly blind.
Figure 4.12 shows the process of frequency alignment, where ˛ is a scale factor,
Ft is the frequency spectrum of the tth windowed frame in Eq. (4.9), and Dt is the
resulting vector after performing frequency alignment. Note that we consider both
positive- and negative-frequency parts of Ft during the calculation; hence, the length
of Ft is N . However, only the first N2 points of Dt , i.e., Dt 1 W N2 , will be taken as
the descaled frequency spectra for further processing.
In general, the elements of F_t are indexed by a vector i_F, i.e., i_F = 1 : N.
Another index vector i_X is calculated by rounding the result of i_F/α to the nearest
integer, i.e., i_X = round(i_F/α). The values of i_X in turn determine the resulting
vector D_t.
Specifically, for a positive PSM, α > 1 may result in repetitive values in i_X,
i.e., the nth, (n+1)th, ..., and (n+x)th elements have the same value. If the nth
element of i_X has a unique value m, then F_t(n) is transferred to D_t(m). For instance,
in our example in Fig. 4.12a, the first element of i_X has a unique value of 1, which
indicates transferring F_t(1) to D_t(1). Otherwise, the average of the nth to
(n+x)th elements (which correspond to repetitive values in i_X) is calculated and
transferred to D_t(m). For instance, in our example, the third and fourth elements of i_X
both have the value 3, which indicates transferring the average of F_t(3) and F_t(4)
to D_t(3).
As for a negative PSM (α < 1), the values of i_X are discontinuous. If the nth
element of i_X has the value n, then F_t(n) is transferred to D_t(n). For instance, in our
example in Fig. 4.12b, the first element of i_X equals 1, which indicates transferring
F_t(1) to D_t(1). If the nth element of i_X has a value m, where m ≠ n, then F_t(n) is
transferred to D_t(m); e.g., in our example, the fourth element of i_X equals 5, which
indicates transferring F_t(4) to D_t(5). As the values of i_X are discontinuous, only part
of the vector D_t is filled by this mechanism. The rest of the vector is calculated by
linear interpolation between successive known values.
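A minimal MATLAB sketch of this index-mapping mechanism for one frame is given below; the function name is illustrative, and F is assumed to be the length-N spectrum of Eq. (4.9).

function D = freq_align_sketch(F, alpha)
F  = F(:).';
N  = numel(F);
iF = 1:N;
iX = min(max(round(iF / alpha), 1), N);   % descaled target indices
D  = nan(1, N);
for m = unique(iX)                        % average all Ft values mapped to bin m
    D(m) = mean(F(iX == m));
end
known = find(~isnan(D));                  % fill remaining bins (negative PSM)
D = interp1(known, D(known), 1:N, 'linear', 'extrap');
end

For α > 1 the averaging branch dominates (compression of the spectrum); for α < 1 most bins are filled by the interpolation step (expansion), matching the two panels of Fig. 4.12.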
As shown above, the outcomes of the frequency alignment for a positive and
negative PSM are different—one being compression and the other being expansion.
Using this knowledge, we can use one positive and one negative trial value of the
scale factor, respectively, to descale the frequency spectra. The two trial values are
generally within ±10 %, e.g., +6 % and −6 %. After the frequency spectra are descaled,
adaptive synchronization is performed to output the average synchronization
strength A_sync. Let us denote the value of A_sync obtained from the positive trial value
by A_sync^+ and that from the negative trial value by A_sync^−. If A_sync^+ > A_sync^−, it can be
deduced that the watermarked signal has been attacked by a positive PSM, and vice
versa. Further, the scale factor can be delicately adjusted for a higher detection rate.
To convert PITSM to TPPSM, the length of the host signal (N_o) is required to
be information shared between the embedder and the detector. By comparing it
with the length of the attacked signal (N_a), it is ascertained whether the attack
is a positive or negative PITSM. Accordingly, the attacked signal is resampled to
the corresponding TPPSM. Then, frequency alignment is performed to improve the
accuracy of watermark detection. Although a slight deviation of Na might occur
(which happens when samples cropping or inserting attack the watermarked signal
along with PITSM), experimental results in the next chapter show that such an
amount of difference is negligible to the operation of resampling.
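A hedged sketch of this conversion in MATLAB, assuming xa holds the attacked signal of length Na and No is the shared host-signal length (resample is from the Signal Processing Toolbox):

[p, q] = rat(No / Na, 1e-6);     % rational approximation of the length ratio
y = resample(xa, p, q);          % length of y is now close to No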
As mentioned in Sect. 4.3.3.2, the BER threshold T_BER is used to determine whether
one detection is successful or not, and further to declare whether a watermark exists
or not. In practice, we are given a signal under inspection and then perform watermark
detection to extract a supposed watermark. Once one extracted watermark has
a BER less than or equal to T_BER, the suspected signal is claimed to be watermarked.
The false-positive probability of one detection, P_pd, is calculated by

P_{pd} = \sum_{k=N_w-N_e}^{N_w} C(N_w,k) \left(\frac{1}{2}\right)^k \left(\frac{1}{2}\right)^{N_w-k} = \frac{1}{2^{N_w}} \sum_{k=N_w-N_e}^{N_w} C(N_w,k). \quad (4.28)
Then, the false-positive probability of watermark existence over N_det detections is

P_{pw} = \sum_{n=1}^{N_{det}} C(N_{det},n)\, P_{pd}^{\,n} \left(1 - P_{pd}\right)^{N_{det}-n}. \quad (4.29)
Similarly, treating T_BER as the per-bit error probability, the probability of one
successful detection of the watermarked signal, P_cd, is calculated by

P_{cd} = \sum_{k=N_w-N_e}^{N_w} C(N_w,k) \left(1 - T_{BER}\right)^k \left(T_{BER}\right)^{N_w-k}. \quad (4.30)
Recall that the suspected signal is claimed to be watermarked if at least one detection
is successful. Therefore, the false-negative probability of watermark existence is
determined by [130]
P_{nw} = 1 - \sum_{n=1}^{N_{det}} C(N_{det},n) \left(P_{cd}\right)^n \left(1 - P_{cd}\right)^{N_{det}-n}. \quad (4.31)
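For illustration, the following MATLAB fragment evaluates Eqs. (4.28)–(4.31) numerically, using the closed forms of the binomial sums over n = 1, ..., N_det; the parameter values (N_w = 350, T_BER = 20 %, N_det = 5) follow the experiments reported later and are assumptions of this sketch.

Nw = 350;  TBER = 0.20;  Ndet = 5;
Ne  = floor(Nw * TBER);                            % number of tolerated wrong bits
k   = (Nw - Ne):Nw;
Cnk = arrayfun(@(kk) nchoosek(Nw, kk), k);         % nchoosek may warn about
                                                   % limited precision for large Nw
Ppd = sum(Cnk) / 2^Nw;                             % Eq. (4.28)
Ppw = 1 - (1 - Ppd)^Ndet;                          % closed form of Eq. (4.29)
Pcd = sum(Cnk .* (1 - TBER).^k .* TBER.^(Nw - k)); % Eq. (4.30)
Pnw = (1 - Pcd)^Ndet;                              % closed form of Eq. (4.31)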
4.4 Coded-Image Watermark

From the analysis of the embedding and detection algorithms, the watermark
embedded for copyrights protection is essentially represented by a series of
watermark bits.
To better function for copyrights protection, the proposed scheme adopts a coded-
image, with bits "1" and "0" (the latter mapped to "−1"),⁵ as a visual
watermark, instead of a meaningless pseudorandom or chaotic sequence. A coded-
image can be identified visually, as a kind of ownership stamp. Moreover, post-
processing on the extracted watermark can also be done to enhance the binary image
and consequently the detection accuracy will increase. Image denoising and pattern
recognition are examples of post-processing techniques for automatic character
recognition. Thus, on top of BER, coded-image provides a semantic meaning for
reliable verification [131]. In addition, as mentioned in Sect. 4.2.2, the watermarking
scheme benefits from the encryption of the coded-image to obtain extra security.
⁵By definition, a coded-image is a binary image, which has only two values for each pixel.
⁶Appcr1: Character Recognition, at http://www.mathworks.com/access/helpdesk/help/toolbox/
nnet/.

Fig. 4.14 Character recognition by the neural network. (a) Letters "C" "O" "P" "Y" "R" "I" "G"
"H" "T". (b) Noisy coded-image watermark. (c) Recovered coded-image watermark
4.5 Summary
5 Performance Evaluation of Audio Watermarking

In Chap. 4, the embedding and detection algorithms of the proposed audio watermarking
scheme were analyzed theoretically. The aim of this chapter is to examine the
system performance in terms of imperceptibility, robustness, security, data payload,
and computational complexity, as required in Sect. 1.3.1.
First, the process of determining the parameters used for watermarking is
described. Then performance measurement begins with perceptual quality assess-
ment, which consists of the subjective listening test and the objective evaluation test.
This is subsequently followed by a complete robustness test including both basic and
advanced robustness tests. After performing a security analysis, we carry out the
estimations of data payload and computational complexity. Finally, a performance
comparison is made between the proposed scheme and other existing schemes.
Some observations are discussed according to the experimental results.
These variables combine to affect the system performance in some way. Accord-
ing to the embedding algorithm previously described, larger Ns and NB could
contribute to a stronger robustness against desynchronization attacks to some extent.
Correspondingly, a larger value of Nunit and a smaller value of Nbit are desired.
However, a larger Nunit would lead to an increase in computational complexity.
As indicated in Eq. (4.18), the number of times that each block searches for its
synchronization position is equal to N_f = N_c × N_unit, where N_c = 4. Also, the
values of all the tiles in the blocks should be provided simultaneously for watermark
detection; thus more computer memory is required to store the data. Additionally,
as a result of a smaller N_bit, the data payload would inevitably be reduced.
In view of these constraints, the above variables are specified as follows, aiming for
a good compromise between the various requirements: N_unit = 10, N_bit = 4,
N_B = 30, and the resulting N_s = 160. These values are employed as constants in all
experiments.
• Determination of watermark strength α_w
The amplitude of the watermark signal in the embedding has an important influence
on the system performance. As discussed in Sect. 4.2.1, the magnitude spectrum of
the watermark signal is controlled by the watermark strength α_w. To determine a
suitable value, candidate watermark strengths are taken uniformly between 10 and
200, i.e., α_w = 10, 20, 30, ..., 200.
Given a host audio signal, watermarking with a smaller α_w would result in better
imperceptibility, but weaker robustness; on the other hand, watermarking with a
larger α_w would result in imperceptibility degradation, but stronger robustness [19].
Considering that imperceptibility is a prerequisite for the practical application of audio
watermarking, our prime concern in determining α_w is to maintain the
perceptual quality of the watermarked audio signals. Within the scope of satisfactory
perceptual quality, a value of α_w that provides adequate robustness is adopted. In our
experiments, the software "Perceptual Evaluation of Audio Quality" (PEAQ) [48]
is used to interpret the perceptual quality, and an objective difference grade (ODG)
within [−2.0, 0] is deemed to be acceptable. Also, the property of robustness is
denoted by the bit error rate (BER) of the extracted watermark under a 36 dB noise
addition attack.¹ Empirically, a reasonable BER is expected to be less than 10 %.
For example, Fig. 5.1 shows the determination of the watermark strength α_w for
Bass.wav and Pop.wav. As indicated in Fig. 5.1a, watermarking with α_w ≤ 50
is considered unperceived, since the ODGs fit within the allowable range [−2.0, 0].
Also, under the condition of α_w ≥ 50, the BERs are less than 10 % and thereby the
requirement of robustness is met. Consequently, α_w = 50 is the only appropriate
value for watermarking Bass.wav. By comparison, embedding a robust watermark
into Pop.wav while retaining the imperceptibility is more feasible. With the same
method, it is found that a proper watermark strength for watermarking Pop.wav
ranges between 60 and 140, i.e., 60 ≤ α_w ≤ 140. In this case, we use the average
value α_w = 100 for the experiments with Pop.wav below.
• Determination of the embedded watermark
Generally, the fewer watermark bits embedded, the better the imperceptibility but the
worse the robustness. To evaluate the proposed scheme fairly, our experiments always
embed the watermark bits into host signals at full capacity. For example, the number
of watermark bits (N_w) embedded into Bass.wav, Gspi.wav, Harp.wav, and Pop.wav
is 350, 210, 140, and 280, respectively. More analysis of data payload will be
presented later in Sect. 5.5.1.
As shown in Sect. 4.4, a coded-image watermark offers a great advantage over
a pseudorandom sequence (PRS). Therefore, the coded-image watermark is always
adopted in the experiments. Recall that each letter on the coded-image watermark
is represented by a matrix of 7 × 5 bits, i.e., L_w = 35 bits for one letter. Thus,
N_w watermark bits can be coded into N_L = N_w/L_w letters. For example, the
coded-image watermark embedded into Bass.wav consists of N_L = 350/35 = 10
letters, the one embedded into Gspi.wav consists of N_L = 210/35 = 6 letters, and
those embedded into Harp.wav and Pop.wav consist of N_L = 140/35 = 4 and
N_L = 280/35 = 8 letters, respectively.
¹Additive noise attack is a commonly used attack in the robustness test of audio watermarking
techniques. As indicated in Appendices A and B, the SDMI standard and STEP 2000 employ
36 dB and 40 dB additive noise attacks, respectively. Therefore, the more rigorous additive noise
attack with the lower SNR value, i.e., the 36 dB additive noise attack, is chosen for our basic
robustness test listed in Appendix E.
Fig. 5.1 Determination of watermark strength α_w, showing imperceptibility (ODG) and
robustness (BER) as functions of α_w. (a) α_w = 50 for Bass.wav (ODG = −1.986, BER = 9.14 %).
(b) α_w = 100 for Pop.wav (ODG = −1.076, BER = 3.93 %)
5.2 Perceptual Quality Assessment
The goal of perceptual quality assessment is to fairly judge the perceptual quality
of the watermarked audio signals relative to host audio signals. To this end, both
subjective and objective approaches to perceptual quality assessment are employed
in this book, as discussed in Sect. 1.3.2.1.
Subjective listening tests are carried out in two ways: the MUSHRA test and the
five-scale subjective difference grade (SDG) rating. As described in Sect. 3.1.2, the
listening tests were performed in an isolated chamber and undertaken by ten trained
participants, with all the stimuli presented through a high-fidelity headphone.
The MUSHRA test stands for MUlti Stimuli with Hidden Reference and Anchors
test, which is defined by ITU-R recommendation BS.1534 [44]. In the MUSHRA
test, the participant is exposed to three types of audio clips as test, reference (i.e.,
the original unprocessed audio), and anchor audio signals. The recommendation
specifies that one anchor must be a 3.5 kHz low-pass filtered version of the reference
audio signal [44]. Also, a hidden reference is usually adopted as another anchor.
Then, the participant is asked to grade the perceptual quality of the audio signals
under test and the anchors relative to the reference audio signal.
We developed a MATLAB GUI for the MUSHRA test to help our analysis, as
shown in Fig. 5.2. In the context of audio watermarking, the watermarked signal
is the signal under test, while the host signal is the reference signal. As required,
the host signal is always presented in the experiments. For the anchors, we use three
versions of the host signal, i.e., a hidden version, a 3.5 kHz low-pass filtered version,
Fig. 5.2 Screenshot of the MATLAB GUI for the MUSHRA test. The buttons on the GUI have
the following functions: “Load,” load the host audio signal to be evaluated. “Start,” start playing
the sound from the beginning. “Pause/Stop,” pause or stop the sound that is currently playing.
“Resume,” resume the sound from the pause position. “Save,” save the host signal name and the
participant name as well as the registered scores into a .txt file. “Reset,” reset the interface for the
next trial
and a 96 kbps MP3 compressed version.2 During each experiment, the participant is
therefore asked to grade four versions of a given host signal, i.e., the watermarked
signal (WM), the hidden reference (HOST), the low-pass filtered version (LPF), and
the compressed version (MP3).
Given a host signal, a participant launches the MUSHRA test by clicking
the "Load" button. Subsequently, the four versions of the host signal are randomly
assigned to Sources A–D. Then, the participant grades each version by moving the
slider to the location corresponding to the perception. Accordingly, a score within
[0, 100] appears in the text box below. With the buttons "Start," "Pause/Stop,"
and "Resume," the participant can switch instantly between different sound files.
Finally, the buttons "Save" and "Reset" save the test results and prepare for the next
experiment.
²The 3.5 kHz low-pass filtered version refers to a version of the host audio filtered by a 3.5 kHz
low-pass filter, and the 96 kbps MP3 compressed version refers to a version of the host audio after
MP3 compression at 96 kbps.
As mentioned in Sect. 3.1.1, the test set includes 17 pieces of audio signals,
denoted by Ai , i D 1; 2; : : : ; 17. Then, the four versions of Ai are denoted by
Aij , where j D 1; 2; 3, and 4 stands for the version of WM, HOST, LPF, and MP3,
respectively. Since there are 10 subjects participating in the tests, the score of Aij
provided by the k-th subject is denoted by GM .i; j; k/, where k D 1; 2; : : : ; 10.
After all the scores are collected, statistical analysis [44] is performed to assess
the perceptual quality of each Aij separately. First, the mean of the scores of Aij is
calculated by
\mu_{ij} = \frac{1}{K} \sum_{k=1}^{K} G_M(i,j,k), \quad (5.2)
and the corresponding 95 % confidence interval is

\left[\mu_{ij} - \delta_{ij},\ \mu_{ij} + \delta_{ij}\right], \quad (5.3)

where

\delta_{ij} = t_{0.05} \frac{\sigma_{ij}}{\sqrt{K}}. \quad (5.4)
Here, t_{0.05} is the t value for a significance level of 95 % and σ_{ij} is the standard
deviation, defined as

\sigma_{ij} = \sqrt{\frac{1}{K-1} \sum_{k=1}^{K} \left[G_M(i,j,k) - \mu_{ij}\right]^2}. \quad (5.5)
On the assumption that the mean scores follow a normal distribution, the value of
t_{0.05} is equal to

t_{0.05} = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right) = \Phi^{-1}(0.975) = 1.96. \quad (5.6)
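As a minimal sketch of Eqs. (5.2)–(5.6), the following MATLAB fragment computes the mean score and its 95 % confidence interval for one version A_ij; GM is assumed to be the K-by-1 vector of scores collected from the K = 10 listeners.

K     = numel(GM);
mu    = mean(GM);                             % Eq. (5.2)
sigma = sqrt(sum((GM - mu).^2) / (K - 1));    % Eq. (5.5), equals std(GM)
delta = 1.96 * sigma / sqrt(K);               % Eqs. (5.4) and (5.6)
ci    = [mu - delta, mu + delta];             % Eq. (5.3)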
Fig. 5.3 Results of the MUSHRA test: average MUSHRA scores (0–100) of the WM, HOST,
LPF, and MP3 versions of Bass.wav, Gspi.wav, Harp.wav, and Pop.wav

The four versions under evaluation lie along the X-axis, while the perceptual quality
scores lie along the Y-axis. The average scores over the ten listeners as well as the
related 95 % confidence intervals are displayed.
From Fig. 5.3, the hidden references are rated the highest with small 95 %
confidence intervals. On the other hand, the scores of 3.5 kHz low-pass filtered
signals are quite low. This is because a low-pass filter attenuates high frequencies,
which makes audio samples sound dull. Moreover, the scores of watermarked
signals are comparable to those of MP3-compressed signals at 96 kbps, which
means they are of similar perceptual quality. For different host audio signals,
the watermarked Harp.wav signal has the best performance and its average score
is around 93. The second is the watermarked Pop.wav signal, followed by the
watermarked Gspi.wav signal. Although the watermarked Bass.wav signal obtains
the lowest average score, the score is still above 80. Therefore, we conclude that the
perceptual quality of all the watermarked signals is well preserved. Also, the perceptual
quality depends on the music type. It is worth mentioning that these three observations
are common among all techniques, due to the complexity of audio signals [133].
In addition to the MUSHRA test, a rating based on the five-scale SDG evaluates
the perceptual quality of the watermarked signals in a straightforward manner. The
subjects are asked to rate a watermarked signal relative to its host signal according
to the descriptions in Table 1.2. Similarly, the SDG of host signal Ai from the k-th
subject is denoted as GSDG .i; k/, where i D 1; 2; : : : ; 17 and k D 1; 2; : : : ; 10.
Then, the average SDG of host signal Ai is calculated as
\mu_i = \frac{1}{K} \sum_{k=1}^{K} G_{SDG}(i,k), \quad (5.8)

where K = 10.
For example, the average SDGs for the watermarked Bass.wav, Gspi.wav,
Harp.wav, and Pop.wav signals are shown in Table 5.1. From the table, it is seen
that the average SDGs for these samples lie in the range of −1.6 to 0. In fact,
except for a few listeners who felt a slight difference between Bass.wav and its
watermarked signal, most listeners found it hard to distinguish the host and the
watermarked audio signals during the experiments.
5.3 Robustness Test

The goal of the robustness test is to investigate the capability of the watermarked
audio signals to resist various attacks. To fully evaluate the robustness of the
proposed audio watermarking scheme, we carry out both basic and advanced
robustness tests, as depicted in Sect. 3.1.3. All ODGs in this section are provided
by PEAQ.
Recall from Sect. 4.3.3.3 that the threshold T_BER influences both the false-positive
(P_pw) and false-negative (P_nw) probabilities of declaring the existence of a
watermark. Supposing that the value of T_BER is equal to 20 % [2, 5], P_pw and P_nw are
calculated using Eqs. (4.28)–(4.31).
Table 5.3 shows the results on the error probabilities of the watermarked
Bass.wav, Gspi.wav, Harp.wav, and Pop.wav signals, where N_w is the watermark
length and N_e = ⌊N_w · T_BER⌋ is the number of wrong bits. As mentioned in
Sect. 5.1, 350, 210, 140, and 280 watermark bits are separately embedded into
Bass.wav, Gspi.wav, Harp.wav, and Pop.wav, so the resulting N_e is equal to 70, 42,
28, and 56, respectively. In addition, N_det is the number of detections performed,
where two values (i.e., N_det = 5 and 10) are considered.
For the different host signals in Table 5.3, the false-positive probabilities P_pw
decrease exponentially with the watermark length N_w, but vary only slightly with
the number of detections performed, N_det. On the other hand, the false-negative
probabilities P_nw decrease with N_det, but vary only slightly with N_w.
Generally, the severity of the false-positive and false-negative probabilities is
application dependent. In our scheme, given T_BER = 20 %, the false-positive probabilities
already satisfy the requirement, being much less than 10⁻⁵. When N_det = 5
and 10, the false-negative probabilities are around 10⁻² and 10⁻⁴, respectively.
These values are sufficient for the application of copyrights protection [130].
Therefore, the threshold is set to T_BER = 20 % in our experiments. This means
that detections with BERs greater than 20 % are considered failed.
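A minimal sketch of this decision rule in MATLAB, assuming we and wo hold the extracted and embedded bits:

ber = mean(we(:) ~= wo(:));      % bit error rate of one detection
watermarked = (ber <= 0.20);     % detection succeeds when BER <= TBER = 20 %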
Table 5.4 Results of the basic robustness test on the watermarked Bass.wav signal (BER: %)

Attack                       Basic        Adaptive           Frequency
                             detection    synchronization    alignment
No attack                    0            0                  –
Noise 40 dB                  6.86         5.71               –
Noise 36 dB                  11.71        9.14               –
Noise 30 dB                  ×            18.57              –
Resampling (22.05 kHz)       0            0                  –
Requantization (8 bit)       17.71        16.00              –
Amplitude +20 %              0            0                  –
Amplitude −20 %              0            0                  –
Lp filtering 8 kHz           0            0                  –
Lp filtering 6 kHz           1.43         1.43               –
Lp filtering 5 kHz           9.71         9.71               –
DA/AD (line-in jack)         ×            0                  –
Echo (0.3, 200 ms)           0.57         0.57               –
Reverb (1 s)                 0            0                  –
Compression II 96 kbps       0            0                  –
Compression II 64 kbps       1.43         1.14               –
Compression II 48 kbps       3.71         2.86               –
Cropping (8 × 25 ms)         ×            0                  –
Jittering (0.1 ms/20 ms)     ×            0                  –
Inserting (8 × 25 ms)        0            0                  –
PITSM +4 %                   ×            0.57               –
PITSM +10 %                  ×            ×                  2.86
PITSM −4 %                   ×            1.71               –
PITSM −10 %                  ×            ×                  6.00
TPPSM +4 %                   ×            14.86              8.00
TPPSM +10 %                  ×            ×                  4.86
TPPSM −4 %                   ×            12.57              6.00
TPPSM −10 %                  ×            ×                  4.29

Notes: 1. Symbol "×": one detection with a BER of greater than 20 %
2. Symbol "–": one unexecuted advanced detection
differ in the performance. In terms of the BER, almost all the BERs of we are
less than 10 %. Moreover, the coded-image watermarks embedded in the four
host signals can always be extracted and clearly identified. Therefore, compared
to merely meaningless bits, extra assistance in confirmation can be obtained
from the coded-image watermarks. Although some pixels in the coded-image
are mistaken, we are still able to recognize the copyrights information.
• For different watermarked signals, the watermarked Harp.wav signal in Table 5.6
shows the strongest robustness. None of the BERs are greater than 10 %, in fact
most of them are equal to 0 %. This is because a higher watermark strength α_w
is used in watermarking Harp.wav, as can be seen from the lower SNR value of
the watermarked Harp.wav signal in Table 5.2. Due to the characteristics of the
Table 5.5 Results of the basic robustness test on the watermarked Gspi.wav signal (BER: %)

Attack                       Basic        Adaptive           Frequency
                             detection    synchronization    alignment
No attack                    0            0                  –
Noise 40 dB                  12.38        7.62               –
Noise 36 dB                  12.38        9.52               –
Noise 30 dB                  16.67        16.19              –
Resampling (22.05 kHz)       1.90         1.90               –
Requantization (8 bit)       19.05        19.05              –
Amplitude +20 %              0            0                  –
Amplitude −20 %              0            0                  –
Lp filtering 8 kHz           0.95         0.95               –
Lp filtering 6 kHz           6.67         4.29               –
Lp filtering 5 kHz           ×            11.43              –
DA/AD (line-in jack)         ×            0                  –
Echo (0.3, 200 ms)           0            0                  –
Reverb (1 s)                 0            0                  –
Compression II 96 kbps       2.86         2.86               –
Compression II 64 kbps       3.81         3.81               –
Compression II 48 kbps       14.76        9.05               –
Cropping (8 × 25 ms)         ×            0                  –
Jittering (0.1 ms/20 ms)     ×            3.33               –
Inserting (8 × 25 ms)        0            0                  –
PITSM +4 %                   0.48         0.48               –
PITSM +10 %                  ×            ×                  10.00
PITSM −4 %                   ×            4.29               –
PITSM −10 %                  ×            10.48              4.29
TPPSM +4 %                   ×            10.95              3.81
TPPSM +10 %                  ×            ×                  9.52
TPPSM −4 %                   ×            12.86              4.29
TPPSM −10 %                  ×            ×                  11.90

Notes: 1. Symbol "×": one detection with a BER of greater than 20 %
2. Symbol "–": one unexecuted advanced detection
Table 5.6 Results of the basic robustness test on the watermarked Harp.wav signal (BER: %)

Attack                       Basic        Adaptive           Frequency
                             detection    synchronization    alignment
No attack                    0            0                  –
Noise 40 dB                  0            0                  –
Noise 36 dB                  0            0                  –
Noise 30 dB                  0            0                  –
Resampling (22.05 kHz)       ×            0                  –
Requantization (8 bit)       0            0                  –
Amplitude +20 %              0            0                  –
Amplitude −20 %              0            0                  –
Lp filtering 8 kHz           0            0                  –
Lp filtering 6 kHz           10.00        0                  –
Lp filtering 5 kHz           15.71        10.00              –
DA/AD (line-in jack)         0            0                  –
Echo (0.3, 200 ms)           0            0                  –
Reverb (1 s)                 0            0                  –
Compression II 96 kbps       17.14        0                  –
Compression II 64 kbps       ×            0                  –
Compression II 48 kbps       ×            4.29               –
Cropping (8 × 25 ms)         ×            0                  –
Jittering (0.1 ms/20 ms)     ×            0                  –
Inserting (8 × 25 ms)        0            0                  –
PITSM +4 %                   2.86         2.86               –
PITSM +10 %                  ×            ×                  2.14
PITSM −4 %                   ×            0                  –
PITSM −10 %                  ×            ×                  0.71
TPPSM +4 %                   ×            6.43               –
TPPSM +10 %                  ×            ×                  1.43
TPPSM −4 %                   ×            12.86              4.29
TPPSM −10 %                  ×            ×                  4.29

Notes: 1. Symbol "×": one detection with a BER of greater than 20 %
2. Symbol "–": one unexecuted advanced detection
Table 5.7 Results of the basic robustness test on the watermarked Pop.wav signal (BER: %)

Attack                       Basic        Adaptive           Frequency
                             detection    synchronization    alignment
No attack                    0            0                  –
Noise 40 dB                  0.71         0.71               –
Noise 36 dB                  3.93         3.93               –
Noise 30 dB                  12.14        10.71              –
Resampling (22.05 kHz)       0            0                  –
Requantization (8 bit)       ×            19.29              –
Amplitude +20 %              0            0                  –
Amplitude −20 %              0            0                  –
Lp filtering 8 kHz           0            0                  –
Lp filtering 6 kHz           11.79        2.50               –
Lp filtering 5 kHz           16.43        9.64               –
DA/AD (line-in jack)         13.93        0                  –
Echo (0.3, 200 ms)           0            0                  –
Reverb (1 s)                 13.93        0                  –
Compression II 96 kbps       0            0                  –
Compression II 64 kbps       0            0                  –
Compression II 48 kbps       2.50         2.50               –
Cropping (8 × 25 ms)         ×            0                  –
Jittering (0.1 ms/20 ms)     ×            0                  –
Inserting (8 × 25 ms)        0            0                  –
PITSM +4 %                   ×            8.57               2.14
PITSM +10 %                  ×            ×                  9.64
PITSM −4 %                   ×            1.07               –
PITSM −10 %                  ×            15.36              4.29
TPPSM +4 %                   ×            10.00              1.43
TPPSM +10 %                  ×            ×                  12.14
TPPSM −4 %                   ×            14.29              0.36
TPPSM −10 %                  ×            ×                  8.93

Notes: 1. Symbol "×": one detection with a BER of greater than 20 %
2. Symbol "–": one unexecuted advanced detection
In cases of PITSM and TPPSM attacks, the improved detection can mostly
combat PITSM and TPPSM within ±4 %, but fails at larger distortions of
±10 %. Under such circumstances, we resort to the advanced detection with
frequency alignment to extract the severely distorted watermarks. From Tables 5.4
to 5.7, it can be seen that the BERs of w_e under ±10 % PITSM and
TPPSM attacks are reduced greatly after frequency alignment, most of them being
not greater than 10 %.
Among various attacks, it is observed that requantization is the most difficult
attack. Except for the watermarked Harp.wav signal in Table 5.6, the other three
watermarked signals are rather vulnerable to the requantization and their BERs
are not less than 16 %. Moreover, noise addition poses difficulties for watermark
detection as the power of the added noise increases. Note that a decrease in
the specified SNR value, i.e., 40 dB → 36 dB → 30 dB, indicates an increase
in the noise power. To enhance the robustness against these two attacks, higher
watermark strengths ˛w are required to amplify the magnitude of the watermark
signals in the embedding.
In addition to the common signal operations and the desynchronization attacks listed
above, some combined attacks are also applied to the watermarked signals, with the
aim of further evaluating the robustness of the proposed scheme. A combined
attack is a combination of two attacks, i.e., AT₁(·) followed by AT₂(·): the
watermarked signal is first attacked by AT₁(·) and the resulting signal is then
attacked by AT₂(·).
Two types of combined attacks are taken into consideration:
(1) Type I combined attack: AT₁(·) is random samples cropping, jittering, or zeros
inserting, while AT₂(·) is MP3 compression at 96 kbps or low-pass filtering at
8 kHz.
(2) Type II combined attack: AT₁(·) is +5 % PITSM, −5 % PITSM, +5 % TPPSM,
or −5 % TPPSM, while AT₂(·) is MP3 compression at 96 kbps, low-pass
filtering at 8 kHz, random samples cropping, jittering, or zeros inserting.
Without loss of generality, the watermarked Bass.wav signal is used as an
example, with the coded-image watermark described in Sect. 5.1 embedded.
Table 5.8 shows the results of combined attacks on the watermarked Bass.wav
signal, including the final extracted watermarks we as well as their BERs. Note
that all the we attacked by Type I combined attacks are extracted by the improved
detection. Type II combined attacks are very destructive and the improved detections
fail to extract the watermarks. In this case, the advanced detection is employed to
recover the we attacked by Type II combined attacks.
In Table 5.8, each combined attack is the combination of the attacks in the
corresponding row and column. For instance, in (1) Type I combined attacks, the
shaded BER (0.86 %) and w_e are the results under the combined attack where AT₁(·)
is random samples cropping and AT₂(·) is MP3 compression at 96 kbps. Also, in
(2) Type II combined attacks, the shaded BER (0.57 %) and w_e are the results under
the combined attack where AT₁(·) is −5 % PITSM and AT₂(·) is zeros inserting.
From Table 5.8, it can be seen that the proposed scheme is quite resistant to
these combined attacks. All the BERs of we are less than 10 % and the coded-
image watermarks can be clearly identified. It is observed that the combined attacks
involving jittering are generally more challenging than the others.
With regard to Type II combined attacks, one point to note is that the length
of the PITSM- or TPPSM-attacked signal has been altered by cropping, jittering,
and inserting. Even in these cases, the distorted watermarks can be recovered by
the advanced detection. This confirms that a slight change in the length of
the attacked signal has no influence on the efficiency of frequency alignment, as
discussed in Sect. 4.3.3.2.
The advanced robustness test is designed especially for evaluating the proposed
audio watermarking scheme. As described in Sect. 3.1.3.2, the advanced robustness
test is comprised of three parts, namely a test with StirMark for Audio, a test
under collusion, and a test under multiple watermarking. Note that in the advanced
robustness test, all the watermarks are extracted by improved detection, i.e., the
basic detection updated with adaptive synchronization.
StirMark for Audio [134] is a publicly available benchmark for robustness evalu-
ation of audio watermarking schemes. In the experiments, we utilize StirMark for
Audio v0.2 with default parameters and a suite of 50 StirMark-attacked signals are
generated accordingly. Note that the attacked signals from StirMark for Audio are
stereo signals. Similar to the method in Sect. 3.1.1, the left channel is taken as the
attacked watermarked signal in our scheme.
Based on the description of the attacks in Appendix C and the analysis of
the attacked watermarked signals, the following attacks are excluded from the
evaluation: Addfftnoise, Extrastereo_30, Extrastereo_50, Extrastereo_70, Nothing,
Resampling, and Voiceremove. Since the audio test files are monaural, the Extrastereo
attacks have no effect. Also, as its name implies, the Nothing attack does nothing
to the watermarked signals. So in these cases, the watermarked signals remain
unchanged and the watermarks can always be extracted perfectly. Moreover, most
samples of the attacked signals under Addfftnoise, Resampling, and Voiceremove are
zeros, and hence it is unnecessary to proceed with the detection. Other than these,
the remaining 43 attacks are included in our experiments. It is worth mentioning that
the Original attack returns the original (unattacked) watermarked signal, which
is effectively the same as the Nothing attack. Therefore, the resulting signals from the
Original attack can serve as the original watermarked signals for reference.
Detection results of the watermarked Bass.wav, Gspi.wav, Harp.wav, and
Pop.wav signals under StirMark for Audio are shown in Table 5.9. For simplicity,
the coded images are not illustrated in the table, and we only present the BERs of
the extracted watermarks. Apart from the BERs, the ODGs of the attacked signals
relative to their host signals are also calculated to get an insight into the amount of
distortion caused by the attacks. If the ODG of the attacked signal is comparable
to that of the original watermarked signal, it means that the watermarked signal is
less affected by this attack. Otherwise, the watermarked signal has already been
severely destroyed by the attack.
Table 5.9 shows that the proposed scheme has high resistance to most attacks in
StirMark for Audio, including common signal operations and some serious desyn-
chronization attacks, such as Copysamples, Zerolength, and Zeroremove. Although
more than half or all four watermark detections fail under the shaded attacks (i.e.,
Addnoise_500, Addnoise_700, Addnoise_900, Cutsamples, and Zerocross), these
attacks actually have strong negative impact on the fidelity of the watermarked
signals. As can be seen, in cases of failed detections, the ODGs are lower than
−3.30, while 80% of the ODGs are even lower than −3.80. Therefore, the attacked
signals are very different from the host signals, beyond the premise of the robustness
test.
For different watermarked signals, the watermarked Harp.wav signal possesses
the strongest robustness. Except for one failed detection and the detection under the
Echo attack, the rest of the attacks cannot destroy the embedded watermarks and the
resulting BERs are not more than 5 %. Next comes the watermarked Pop.wav signal,
which succeeds in 34 detections with BERs of less than 8 % and three detections
with BERs of around 16 %. For the watermarked Bass.wav signal, the BERs of all
35 surviving watermarks are less than 7 %. Similar to the conclusion in the previous
section, the watermarked Gspi.wav signal suffers more from the attacks. But even
so, the watermarked Gspi.wav signal fails in six detections only.
Finally, it should be pointed out that successful detections under the Addbrumm_1100
to Addbrumm_10100 and Addsinus attacks are on the condition that the initial
watermarking regions are known to the detector. These attacks add a high-amplitude
buzz or sinus tone throughout the watermarked signal,³ which has an influence on
the threshold $E_T$ for the selection of watermarking regions. As a result, the detector
cannot properly locate the regions for watermark detection. In this case, $E_T$ must
be set to a higher value to select more stable regions. However, as discussed in
Sect. 4.1.1, the data payload is reduced accordingly.
In the detection, $i$ watermarks $w_e^{(i,j)}$ are detected from each average watermarked
signal $\bar{s}_w^{(i)}$ individually:

$$w_e^{(i,j)} = \mathrm{Detection}\big(\bar{s}_w^{(i)}\big), \quad 1 \le i \le n \text{ and } 1 \le j \le i \qquad (5.10)$$
Such an averaging operation weakens the original watermarks and hence makes
them hard to detect. Note that the averaging collusion attack is the most common
collusion attack and nonlinear collusion attacks [135] are not taken into considera-
tion in the book.
During our experiments, four different watermarks ($w_o^{(1)}$, $w_o^{(2)}$, $w_o^{(3)}$, and $w_o^{(4)}$)
are separately embedded into host signal $s_o$ and yield four watermarked signals
($s_w^{(1)}$, $s_w^{(2)}$, $s_w^{(3)}$, and $s_w^{(4)}$). Then, four average watermarked signals are generated,
i.e., $\bar{s}_w^{(1)} = s_w^{(1)}$, $\bar{s}_w^{(2)} = \frac{1}{2}\big(s_w^{(1)} + s_w^{(2)}\big)$, $\bar{s}_w^{(3)} = \frac{1}{3}\big(s_w^{(1)} + s_w^{(2)} + s_w^{(3)}\big)$, and $\bar{s}_w^{(4)} = \frac{1}{4}\big(s_w^{(1)} + s_w^{(2)} + s_w^{(3)} + s_w^{(4)}\big)$.
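As a minimal MATLAB sketch of this averaging collusion setup (assuming four equal-length watermarked signal vectors sw1 to sw4 already produced by the embedder; the variable names are illustrative):

% Generate the four average watermarked signals by averaging collusion.
% sw is an N-by-4 matrix whose columns are sw^(1) to sw^(4).
sw = [sw1(:), sw2(:), sw3(:), sw4(:)];
sw_avg = cell(1, 4);
for i = 1:4
    sw_avg{i} = mean(sw(:, 1:i), 2);    % average of the first i signals
end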
After that, four sets of watermark detections are separately performed as follows:
(1) Detect $w_o^{(1)}$ from $\bar{s}_w^{(1)}$ to obtain the extracted watermark $w_e^{(1,1)}$.
(2) Detect $w_o^{(1)}$ and $w_o^{(2)}$ separately from $\bar{s}_w^{(2)}$ to obtain the extracted watermarks $w_e^{(2,1)}$ and $w_e^{(2,2)}$.
(3) Detect $w_o^{(1)}$, $w_o^{(2)}$, and $w_o^{(3)}$ separately from $\bar{s}_w^{(3)}$ to obtain the extracted watermarks $w_e^{(3,1)}$, $w_e^{(3,2)}$, and $w_e^{(3,3)}$.
³In fact, the noises are quite loud already, as proved by the ODGs.
• Self-watermarking
In self-watermarking, the watermarked signal is repeatedly re-watermarked by the proposed method itself, i.e.,

$$s_w^{(1)} = \mathrm{Embedding}\big(s_o,\, w_o^{(1)}\big), \qquad s_w^{(i)} = \mathrm{Embedding}\big(s_w^{(i-1)},\, w_o^{(i)}\big), \quad 2 \le i \le n \qquad (5.11)$$

In the detection, $i$ watermarks $w_e^{(i,j)}$ are extracted from the watermarked signal $s_w^{(i)}$ individually:

$$w_e^{(i,j)} = \mathrm{Detection}\big(s_w^{(i)}\big), \quad 1 \le i \le n \text{ and } 1 \le j \le i \qquad (5.12)$$
For each test signal (Bass.wav, Gspi.wav, Harp.wav, and Pop.wav), four different
coded-image watermarks are used as $w_o^{(1)}$, $w_o^{(2)}$, $w_o^{(3)}$, and $w_o^{(4)}$ (the coded
images themselves are not reproduced here).
For evaluation purposes, we calculate the SNRs and ODGs of the watermarked
signals (including $s_w^{(1)}$ to $s_w^{(4)}$) relative to the host signal, as well as the BERs of the
extracted watermarks.
It is observed from Table 5.11 that for a given host signal, the SNRs and ODGs
of $s_w^{(1)}$, $s_w^{(2)}$, $s_w^{(3)}$, and $s_w^{(4)}$ decrease gradually. This means that the perceptual quality
gets worse the more times the signal is watermarked. Take Bass.wav as an example.
The decreasing SNRs of $s_w^{(1)}$, $s_w^{(2)}$, $s_w^{(3)}$, and $s_w^{(4)}$ are 33.36 dB, 29.79 dB, 27.93 dB,
and 26.24 dB, respectively. Also, the decreasing ODGs of $s_w^{(1)}$, $s_w^{(2)}$, $s_w^{(3)}$, and $s_w^{(4)}$ are
−2.027, −3.131, −3.391, and −3.591, respectively. The reason is that more samples
of the signal are modified during the embedding of more watermarks.
Meanwhile, because of being watermarked more times, each individual watermark
becomes more difficult to extract from the multiply watermarked signals.
Remember that $w_e^{(\cdot,1)}$ (including $w_e^{(1,1)}$, $w_e^{(2,1)}$, $w_e^{(3,1)}$, and $w_e^{(4,1)}$) is the distorted $w_o^{(1)}$
extracted respectively from $s_w^{(1)}$, $s_w^{(2)}$, $s_w^{(3)}$, and $s_w^{(4)}$. It can be seen from Table 5.11 that
for a given host signal, the BERs of $w_e^{(2,1)}$, $w_e^{(3,1)}$, and $w_e^{(4,1)}$ are usually increasing.
For example, the BERs of $w_e^{(1,1)}$, $w_e^{(2,1)}$, $w_e^{(3,1)}$, and $w_e^{(4,1)}$ for Pop.wav are 0%,
0.36%, 1.79%, and 2.14%, respectively.
On the whole, the BERs of all the extracted watermarks are less than 10 %.
This indicates that the proposed scheme is quite robust against multiple self-
watermarking.
• Inter-watermarking
In inter-watermarking, the watermarked signal is separately re-watermarked by
other audio watermarking techniques.
In addition to the proposed method (“Proposed”), four watermarking techniques
in Chap. 3 are also considered, i.e., cepstrum domain watermarking (“Cepstrum”),
wavelet domain watermarking (“Wavelet”), echo hiding with kernel 3 (“Echo”), and
histogram-based watermarking (“Histogram”). Note that least significant bit (LSB)
modification, phase coding, and spread spectrum (SS) watermarking are not taken
into consideration, since these three watermarking techniques cannot preserve the
perceptual quality of the watermarked signals.
During the process of inter-watermarking, cepstrum domain watermarking,
wavelet domain watermarking, echo hiding, and histogram-based watermarking use
the parameters as specified in Figs. 3.8, 3.10, 3.13, and 3.14, respectively. The
results in Chap. 3 show that these parameters provide the best performance for each
watermarking technique. Also, instead of the coded-image watermarks, the PRSs
are directly embedded at full capacity. Moreover, to avoid perceived noise caused
by watermarking the silence, the PRSs are embedded into the watermarking regions
selected by the proposed method.
To compare the robustness against inter-watermarking of the proposed method
and the other audio watermarking techniques, two experiments are carried out.
The first experiment evaluates the ability of the proposed method to resist
inter-watermarking. The second experiment evaluates the ability of the other
watermarking techniques to resist inter-watermarking.
• In Experiment I, the watermarked signal generated by the proposed method is
separately re-watermarked by the considered watermarking techniques.
The procedure for Experiment I is described as follows.
(1) The proposed method embeds wo into so to generate sw and then detects wo
from sw to obtain the extracted watermark we .
Note: Symbol "*" marks the column identical to the one in Table 5.13 below.
From Table 5.13, it is observed that only the proposed method and echo hiding are
robust against inter-watermarking by all five watermarking techniques. The BERs
of the extracted watermarks are less than 2%. Note that the successful detection of
echo hiding's self-watermarking is conditional: the echo delays used by the attacking
technique must differ (as far as possible) from the ones used by the host
technique.
Meanwhile, the other three techniques fail in some cases of inter-watermarking.
For example, given the watermarked signal generated by cepstrum domain water-
marking, the embedded watermark cannot survive the re-watermarking by its self-
watermarking or wavelet domain watermarking. Similarly, given the watermarked
signal generated by wavelet domain watermarking, the embedded watermark cannot
survive the re-watermarking by its self-watermarking or cepstrum domain water-
marking. By contrast, histogram-based watermarking shows the weakest resistance
to inter-watermarking. Given the watermarked signal generated by histogram-based
watermarking, the embedded watermark can merely survive the re-watermarking by
the proposed method.
In summary, the proposed audio watermarking scheme performs well throughout
the robustness test.
5.4 Security Analysis

The goal of security analysis is to evaluate the security level of the proposed audio
watermarking scheme. As discussed in Sect. 3.2, cepstrum domain watermarking
and wavelet domain watermarking (both based on statistical mean manipulation
(SMM)), echo hiding, and histogram-based watermarking all suffer from security
problems to varying degrees. A theoretical analysis of watermarking security is not
the focus of this book. As introduced in Sect. 1.3.2.3, an intuitive method of security
analysis is to count the possible ways of embedding. The more possible ways of
embedding there are, the harder it becomes for an unauthorized detector without
the secret keys to identify and/or remove the embedded watermark.
In our experiments, each block is divided into 32 nonlinear subbands, where
28 subbands are randomly selected for embedding. In this case, $N_{\mathrm{subband}} = 32$
and $\tilde{N}_{\mathrm{subband}} = 28$. Accordingly, the number of possible ways due to channel
scrambling is calculated by using Eq. (4.8). Such a huge number (i.e., $1.1 \times 10^{34}$)
makes unauthorized detection nearly impossible, which means that the security of
the scheme is greatly increased. This is just one code complexity, which can be
further multiplied by the complexity introduced by the PRNs.
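As a sanity check, the quoted figure of 1.1 × 10^34 is consistent with counting the ordered selections of 28 subbands out of 32; this is an assumed reading of Eq. (4.8), which is not reproduced here. A one-line MATLAB verification:

% Number of ways to pick and order 28 of 32 subbands: 32!/(32-28)!
% (assumed interpretation of Eq. (4.8); matches the quoted 1.1e34).
Nsub  = 32;                  % total nonlinear subbands per block
Nsel  = 28;                  % subbands randomly selected for embedding
nWays = factorial(Nsub) / factorial(Nsub - Nsel);
fprintf('Possible ways: %.2g\n', nWays);    % prints about 1.1e+34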
5.5 Data Payload and Computational Complexity

Data payload and computational complexity are two criteria of minor consideration
in audio watermarking for copyrights protection.
As defined in Sect. 1.3.1.4, the data payload (or capacity) of an audio watermarking
scheme is the number of bits embedded into a one-second audio fraction. According
to the embedding algorithm, the data payload of the proposed audio watermarking
scheme, $DP_B$, is expressed as follows:

$$DP_B = \frac{2 f_s N_{\mathrm{bit}}}{N\, N_c\, N_{\mathrm{unit}}} \ \mathrm{bps} \qquad (5.14)$$
where $f_s$ is the sampling frequency of the audio signal, $N_{\mathrm{bit}}$ is the number of watermark
bits embedded per block, $N$ is the frame length, $N_c$ is the number of frames per unit,
and $N_{\mathrm{unit}}$ is the number of units per block. Note that the factor 2 in Eq. (5.14) is due to
half-overlapping between adjacent audio frames.
Furthermore, if the coded-image watermark is adopted, the data payload in terms
of letters, $DP_L$, is

$$DP_L = \frac{DP_B}{L_w} \ \mathrm{lps} \qquad (5.15)$$

where $L_w = 35$ is the number of bits comprising one letter and $DP_L$ is expressed
in letters per second (lps).
Note that the data payload discussed above refers to the theoretical data payload
of an audio watermarking scheme, which depends solely on the watermark
embedder. That is, once the embedding parameters and the embedding algorithm
used by the watermark embedder are chosen, the theoretical data payload is
determined accordingly.
The values of these experiment parameters determined in Sect. 5.1 are $N = 512$,
$N_c = 4$, $N_{\mathrm{unit}} = 10$, and $N_{\mathrm{bit}} = 4$. Moreover, all the audio test files are sampled
at 44.1 kHz, i.e., $f_s = 44.1$ kHz. Therefore, the data payload of the scheme under
evaluation is equal to

$$DP_B = \frac{2 \times 44100 \times 4}{512 \times 4 \times 10} \approx 17.2 \ \mathrm{bps} \qquad (5.16)$$

$$DP_L = \frac{17.2}{35} \approx 0.5 \ \mathrm{lps} \qquad (5.17)$$
which are sufficient for the purpose of copyrights protection.
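These two values follow directly from Eqs. (5.14) and (5.15); a minimal MATLAB check using the parameter values above:

% Theoretical data payload from Eqs. (5.14) and (5.15).
fs    = 44100;    % sampling frequency (Hz)
N     = 512;      % frame length (samples)
Nc    = 4;        % frames per unit
Nunit = 10;       % units per block
Nbit  = 4;        % watermark bits embedded per block
Lw    = 35;       % bits comprising one letter

DPB = 2 * fs * Nbit / (N * Nc * Nunit);    % about 17.2 bps, Eq. (5.16)
DPL = DPB / Lw;                            % about 0.5 lps, Eq. (5.17)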
From Table 5.14, however, it is observed that the watermarks embedded in
different host signals have quite different lengths, although the same watermark
embedder is employed. Thus, the practical data payload ($\widetilde{DP}_B$) is defined as
the watermark length divided by the duration of the audio signal. For example,
the practical data payloads are calculated as follows: Bass.wav, 350 bits/24.9 s ≈ 14.1 bps;
Gspi.wav, 210 bits/19 s ≈ 11.1 bps; Harp.wav, 140 bits/16.4 s ≈ 8.5 bps; and
Pop.wav, 280 bits/20 s = 14 bps. The durations of all the test signals are listed in
Appendix D. By averaging these four values, the average practical data payload of
the proposed scheme is taken to be 11.9 bps.
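A short MATLAB snippet reproducing these practical data payloads from the quoted bit counts and durations:

% Practical data payload: watermark length divided by signal duration.
bits     = [350 210 140 280];      % embedded watermark bits per file
duration = [24.9 19 16.4 20];      % signal durations in seconds
DPB_prac = bits ./ duration;       % [14.1 11.1 8.5 14.0] bps
DPB_avg  = mean(DPB_prac);         % about 11.9 bps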
The practical data payload of one host signal depends on the watermark embedder
as well as the selected watermarking regions of each host signal. If a host signal
contains more silences and trifle fractions, the watermarking regions are of smaller
size and hence the practical data payload is lower. Obviously, the practical data
payload can never exceed the theoretical data payload. As the practical data
payload approaches the theoretical value, more samples of the host signal are
used for watermarking.
If $N_w$ watermark bits are embedded in time $t_{\mathrm{embedding}}$, the average embedding
speed is given by

$$CC_{\mathrm{embedding}} = \frac{N_w}{t_{\mathrm{embedding}}} \ \mathrm{bps} \qquad (5.18)$$

Similarly, if $N_w$ watermark bits are detected in time $t_{\mathrm{detection}}$, the average detection
speed is given by

$$CC_{\mathrm{detection}} = \frac{N_w}{t_{\mathrm{detection}}} \ \mathrm{bps} \qquad (5.19)$$

Note that the average embedding and detection speeds are expressed in the same
unit as the data payload, i.e., bps.
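A minimal sketch of how such speeds can be measured in MATLAB using tic/toc; Embedding and Detection stand for the scheme's routines, and their exact signatures here are illustrative assumptions:

% Measure average embedding and detection speeds, Eqs. (5.18)-(5.19).
tic;  [sw, Nw] = Embedding(so, wo);   t_embed  = toc;   % Nw = embedded bits
tic;  we       = Detection(sw);       t_detect = toc;

CC_embedding = Nw / t_embed;     % average embedding speed in bps
CC_detection = Nw / t_detect;    % average detection speed in bps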
For the test platform, all the experiments are conducted on a Pentium 4 2.4 GHz
computer with 1 GB RAM. Table 5.14 shows the results of the computational
complexity estimation on Bass.wav, Gspi.wav, Harp.wav, and Pop.wav. Note that
for the detections measured in the experiments, watermark bits are detected from
the watermarked signals without being attacked.
From Table 5.14, it is observed that although different host signals differ in
embedding time, their average embedding speeds are similar. On the other hand,
different watermarked signals vary in both detection time and average detection
speed. For example, the average detection speed of the watermarked Gspi.wav
signal is almost twice that of the watermarked Bass.wav signal. This is due to
different implementation mechanisms of the embedding and detection algorithms
in MATLAB.
During the execution of the embedding algorithm, the host signal is processed
block by block to embed watermark bits. According to Fig. 4.3, the block size is
determined solely by $N$, $N_c$, and $N_{\mathrm{unit}}$, and is not related to the host signal. Therefore, different
host signals have the same utilization of computer memory in the embedding,
resulting in a similar average embedding speed.
However, during the execution of the detection algorithm, the watermark bits
are extracted from each watermarking region. The reason is that under
desynchronization attacks, the structure of the blocks is distorted.
Therefore, the magnitudes of all tiles in every watermarking region (not just in every
block) need to be provided simultaneously for block synchronization, as shown
in Sect. 4.3.1. Consequently, different host signals have different utilizations of
computer memory in the detection. Moreover, large watermarking regions demand
more computer memory, so the average detection speed is slower, and vice versa for
small watermarking regions.
On the whole, the average detection speed is much faster than the average
embedding speed, which is a desirable attribute in copyrights protection application.
5.6 Performance Comparison

Table 5.15 compares the performance of our proposed scheme ("Proposed") with
several existing audio watermarking schemes, sorted in chronological order. The
chosen schemes were not implemented in this book. Therefore, if a result is not
reported in the original publication, it is marked by a dedicated symbol in the table.
Likewise, if a published result was obtained or interpreted in a different way, it is
marked by another symbol.
The investigation is focused on imperceptibility (“Impcpty”), robustness, and data
payload (“Payload”), since security and computational complexity were not taken
into consideration by most schemes.
• For imperceptibility evaluation, the commonly used SNR is employed as the
metric. Moreover, it usually refers to the average SNR of all the watermarked
signals, since there are a number of host audio signals adopted in every scheme.
• For the robustness test, the attacks include noise addition ("NA"), resampling
("RS"), amplitude scaling ("AM"), low-pass filtering ("LP"), echo addition
("ECHO"), MP3 compression ("MP3"), and PITSM ("TSM").⁴ Other attacks
such as requantization, DA/AD conversion, reverberation, random samples
cropping, jittering, zeros inserting, and TPPSM are not listed in the table, because
they were either performed in a varying way or not conducted at all in most schemes.⁵
⁴The marked attacks in Table 5.15 are described as follows. Under the "NA" category, the
schemes in [5, 7] did not specify the value of the SNR. Under the "AM" category, the schemes
in [5, 7] compressed the amplitude with a nonlinear gain function. Under the "LP" category, the
schemes in [3, 8] tested band-pass filtering only. Under the "TSM" category, the schemes in [9, 10]
implemented random stretching (at ±4% and ±8%, respectively) merely by omitting or inserting
a random number of samples, which is considered similar to random samples cropping/inserting.
Table 5.15 Performance comparison of different audio watermarking schemes
a different way. Thus, there is no direct comparison in terms of the BER.
• For data payload estimation, the practical data payload ($\widetilde{DP}_B$) instead of the
theoretical data payload is adopted for comparison. Since most schemes give
no theoretical analysis of the data payload, the actual number of watermark bits
embedded into one host signal of a certain duration is calculated as the practical
data payload. Similar to the SNRs, the practical data payloads shown in the table
are usually the average values over all the watermarked signals in each scheme.
Table 5.15 shows that these schemes have different performance characteristics.
On average, the proposed scheme achieves the best compromise between
imperceptibility, robustness, and capacity.
• In terms of imperceptibility, the SNR of the proposed scheme is within the range
of the other schemes. Also, the average ODG of the proposed scheme is −1.33,
obtained from Table 5.2. In spite of a higher SNR, the ODGs reported for the
scheme in [2] are around −1.80, not superior to the proposed scheme. Without
addressing the SNR, the scheme in [3] has an average ODG of −0.93.
• In terms of robustness, the focus is on the performance under PITSM, as well
as noise addition, low-pass filtering, and MP3 compression, since most schemes
show high resistance to resampling, amplitude scaling, and echo addition. Under
PITSM, only the proposed scheme and the schemes in [2, 5, 6] can resist
excessive distortion of up to ±10% or greater, and hence these are chosen for further
comparison. The proposed scheme is robust against PITSM (±10%), noise
addition (30 dB), low-pass filtering (5 kHz), and MP3 compression (48 kbps).
Compared to the proposed scheme, the scheme in [2] is quite robust against
PITSM (±25%), but relatively vulnerable to noise addition (40 dB), low-pass
filtering (7 kHz), and MP3 compression (64 kbps). The scheme in [5] is slightly
more robust against low-pass filtering (4 kHz) and MP3 compression (32 kbps);
nevertheless, the SNR has no specified value with which to compare the robustness
against noise addition.⁶ The scheme in [6] is slightly more robust against low-pass
⁵These unlisted attacks were undertaken in several schemes as follows. Requantization: only
the scheme in [3] tested 8-bit requantization and the detection succeeded. DA/AD conversion:
the schemes in [4, 7–10] tested DA/AD conversion and the detections succeeded. Cropping: the
schemes in [2, 4, 5] tested different cropping operations and the detections succeeded. Jittering:
the schemes in [2, 5] tested different jittering operations and the detections succeeded. TPPSM:
the scheme in [1] tested ±1% pitch-scaling and the detection succeeded; the scheme in [3] tested
the case where the pitch is shifted up by two semitones and the detections completely failed; the
schemes in [9, 10] implemented pitch shifting (at ±4% and ±8%, respectively) merely by linear
interpolation without anti-alias filtering and the detections succeeded.
⁶It was reported as "noise addition that can be heard clearly by everybody" [5].
filtering (4 kHz), but less robust against noise addition (36 dB) and MP3 compression
(56 kbps).
• In terms of the data payload, the proposed scheme has the highest practical
data payload among these schemes, i.e., 11.9 bps. In particular, this value is
much higher than the data payloads of the schemes in [2, 5, 6], i.e., 2 bps, 4.3 bps,
and 2.3 bps, respectively. Moreover, as shown in Eq. (5.16), the theoretical data
payload of the proposed scheme is even higher, about 17.2 bps.
5.7 Summary
In this chapter, the performance of the proposed audio watermarking scheme has
been thoroughly evaluated with respect to imperceptibility, robustness, security,
data payload, and computational complexity. Specifically, the designed performance
evaluation consists of perceptual quality assessment, robustness test, security
analysis, estimations of the data payload, and computational complexity. Without
loss of generality, the performance evaluation presented in this chapter can serve as
one comprehensive benchmark of audio watermarking algorithms.
Firstly, the subjective listening test and the objective evaluation test were
employed in the perceptual quality assessment. Specifically, the subjective listening
test includes the MUSHRA test and SDG rating, while the objective evaluation test
includes the calculation of the ODG (using PEAQ) and the SNR value. Secondly,
both basic and advanced robustness tests were carried out. Basic robustness test
includes common signal operations (e.g., noise addition, resampling, requantization,
amplitude scaling, low-pass filtering, DA/AD conversion, echo addition, reverber-
ation, and MP3 compression), desynchronization attacks (e.g., random samples
cropping, jittering, zeros inserting, PITSM, and TPPSM), and combined attacks
(e.g., Type I and Type II combined attacks). The advanced robustness test includes
StirMark for Audio, averaging collusion, and multiple watermarking (e.g., self-
watermarking and two types of inter-watermarking). Thirdly, the number of possible
embedding ways due to channel scrambling was calculated in the security analysis.
Furthermore, both theoretical and practical data payloads were calculated. Finally,
computational complexity was evaluated in terms of the embedding/detection PC
computing time as well as the average embedding/detection speed.
The experimental results show that the watermarked audio signals are perceptu-
ally transparent, robust against various attacks, and self-secured from unauthorized
detection. Also, the watermarking efficiency of the proposed technique is satisfactory
with respect to data payload and computational complexity as compared to the
other methods.
Compared with other reported schemes, the proposed scheme achieves a better
compromise between imperceptibility, robustness, and data payload. Thus, it is
concluded that the proposed audio watermarking scheme performs well for the
purpose of copyrights protection.
Chapter 6
Perceptual Evaluation Using Objective
Quality Measures
As described in Sect. 1.3.2.1, the ABX test and the MUSHRA test (i.e., MUlti Stimuli with Hidden
Reference and Anchors) are two commonly used methods.
In the ABX listening test (see Appendix B), the listener has to identify an
unknown sample X as being A or B, with A (the host signal) and B (the
watermarked signal) available for reference. Initially, the ABX test was designed
for the assessment of small deterioration [43]. Note that ABX tests can also
be performed as ABC/HR tests, i.e., double blind, triple stimulus, with hidden
reference [31]. Specifically, stimulus A is the host signal for reference, whereas
stimuli B and C are the host and watermarked signals in randomized order. After
listening to the three stimuli, the listener is asked to decide whether B or C is the
hidden reference signal; the remaining one is then the watermarked signal.
Finally, the watermarked signal is evaluated relative to the host signal by using a
subjective difference grade (SDG), as described in Table 1.2.
The MUSHRA test (see Sect. 5.2.1) is developed for assessing intermediate
audio quality [44]. Since multiple stimuli including the hidden reference and a few
additional signals (anchors) are employed, the MUSHRA test is supposed to be
more reliable than the ABX test in the presence of slightly larger distortions.
Subjective listening tests are indispensable to perceptual quality assessment,
since the ultimate judgment is made by human perception. However, it is quite
time-consuming and cost-intensive to conduct such listening tests. Moreover, the test
results are subject to test environments and the participants’ preferences. Therefore,
machine-based objective evaluations are used to provide a convenient, consistent,
and fair assessment.
• Objective evaluation test
Objective evaluation test is intended to facilitate the implementation of subjective
listening test. To achieve this goal, the results of objective evaluation should
correlate well with the SDG scores.
Currently, the commonly used objective evaluation is to assess the perceptual
quality of audio data via a simulated ear, such as Evaluation of Audio Quality
(EAQUAL) [47], Perceptual Evaluation of Audio Quality (PEAQ) [48], and Percep-
tual Model-Quality Assessment (PEMO-Q) [49]. Basically, these methods establish
an auditory perception model to imitate the listening behavior of a human being, so
that the watermarked signal is graded relative to the host signal. The whole process
is depicted in Fig. 6.1 [31, 46]. After the watermark is embedded, the host and
watermarked signals are separately passed to a psychoacoustic model. As described
in Sect. 2.4, the psychoacoustic model calculates the internal representation of signal
features, such as the masking threshold. By comparing the internal representations
of the host and watermarked signals, the audible difference is determined. The
audible difference is the input to the cognitive model, which models the cognitive
processes in the human brain. After the audible difference is perceptually scaled
in the cognitive model, the final output is an objective difference grade (ODG). As
mentioned in Sect. 3.1.2, the specifications of ODG conform to those of SDG. To
guarantee the accuracy of evaluation, a large set of relevant test signals are required
to train and characterize such models [46].
Fig. 6.1 Objective evaluation by perception modelling [31, 46]: the host signal and the watermarked signal produced by the watermark embedder are each passed through a psychoacoustic model; the internal representations are compared to yield the audible difference, which is fed to the cognitive model
Among the implemented models, PEMO-Q is the latest and most advanced
predictor of audio quality. It is reported in [49] that PEMO-Q generalizes better
to unknown distortions and performs better than the other techniques.
The performance of the three evaluation tools will be examined in Sect. 6.3.3.
Besides perception modelling, the extent of dissimilarity between the water-
marked and host signals can be quantified by objective quality measures. Objec-
tive quality measures, such as the signal-to-noise ratio measure, the segmental
signal-to-noise ratio measure, the cepstral distortion measure, the log-likelihood
ratio measure, the Itakura–Saito distortion measure, the log-area ratio measure,
and the weighted spectral slope measure [50], are commonly used in speech
processing. They have been widely used in quality evaluation for speech enhance-
ment [136–138], speech intelligibility estimation [139], speech recognition in blind
source separation [140, 141], and noise reduction schemes [138]. We investigate
using these quality measures for objective assessments of the perceptual quality of
audio watermarking for the first time [142, 143].
6.2 Objective Quality Measures

Objective quality measures have been widely used in the quality evaluation of
speech signals [50]. This kind of measurement makes use of sound source infor-
mation and calculates the distance or distortion of the test signal with regard to
the original signal [140], which corresponds to the concept of perceptual quality
assessment in audio watermarking.
As discussed in Chaps. 3 and 5, the signal-to-noise ratio (SNR) has already been
employed to quantify the distortion that a watermark imposes on the host signal.
However, the SNR actually averages the distortions on the entire signal. Thus, it is
not an accurate indicator of perceptual quality, as indicated in Sects. 3.2.5 and 5.2.2.
Based on the results in the existing literature, six more quality measures are
selected to estimate the distance between the host and watermarked signals. Since
the impact of noise on signal quality is nonuniform, all the measures calculate the
level of distortion for each frame. As a convention, the subscripts $o$ and $w$ denote
the components related to the host frame and the watermarked frame, respectively.
• Segmental signal-to-noise ratio (segSNR) measure
The segSNR is a variation of the SNR, obtained by averaging the SNRs of all the
frames. Referring to the formula for the SNR in Eq. (1.3), the frame-based segSNR
is calculated by [136, 138, 140]
$$d_{\mathrm{segSNR}}(g_w, g_o) = 10 \log_{10} \frac{\sum_{n=1}^{N} [g_o(n)]^2}{\sum_{n=1}^{N} [g_w(n) - g_o(n)]^2} \qquad (6.1)$$
where $g_o$ is the host frame, $g_w$ is the watermarked frame, and $N$ is the frame length
in samples. In our experiments, $N = 512$, which corresponds to 11.6 ms at a
sampling rate of 44.1 kHz.
In fact, frames with segSNRs above 35 dB do not reflect human perceptual
differences; therefore, their segSNRs are generally replaced with 35 dB. Moreover,
silence frames have negative segSNRs because the signal energy is small. To prevent
such abnormal segSNRs, a lower threshold for the segSNR is set to −10 dB.
Thus, the segSNR values are limited to the range [−10 dB, 35 dB]
[50, 136, 138].
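A compact MATLAB sketch of Eq. (6.1) with the clamping just described (go and gw are assumed to be host and watermarked frame vectors of equal length; the function name is illustrative):

% Frame-based segmental SNR clamped to [-10, 35] dB, Eq. (6.1).
function d = segsnr_frame(gw, go)
    d = 10 * log10(sum(go.^2) / sum((gw - go).^2));
    d = min(max(d, -10), 35);    % clamp to the perceptually useful range
end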
• Cepstral distortion measure
The cepstral distortion (CD) measure provides an estimate of cepstral distance
between the watermarked frame and the host frame. Given the cepstral coefficient
vectors $\vec{c}_w$ and $\vec{c}_o$, the CD for the first $L$ coefficients is calculated by [140]

$$d_{\mathrm{CD}}(\vec{c}_w, \vec{c}_o) = \sum_{l=1}^{L} \big[\vec{c}_w(l) - \vec{c}_o(l)\big]^2 \qquad (6.2)$$
• Log-likelihood ratio measure
The log-likelihood ratio (LLR) measure is based on linear prediction (LP) analysis. Given the LP coefficient vectors $\vec{a}_o$ and $\vec{a}_w$ of the host and watermarked frames, the LLR measure is defined by [50, 138, 140]

$$d_{\mathrm{LLR}}(\vec{a}_w, \vec{a}_o) = \log_{10} \frac{\vec{a}_w R_o \vec{a}_w^T}{\vec{a}_o R_o \vec{a}_o^T} \qquad (6.3)$$

where $R_o$ is the autocorrelation matrix of the host frame and $(\cdot)^T$ refers to the transpose of a matrix.
• Itakura–Saito distortion measure
The Itakura–Saito (IS) distortion measure is slightly different from the LLR measure
and is defined by [50, 138, 140]

$$d_{\mathrm{IS}}(\vec{a}_w, \vec{a}_o) = \frac{\sigma_o^2}{\sigma_w^2} \cdot \frac{\vec{a}_w R_o \vec{a}_w^T}{\vec{a}_o R_o \vec{a}_o^T} + \log_{10}\!\left(\frac{\sigma_w^2}{\sigma_o^2}\right) - 1 \qquad (6.4)$$

where $\sigma_o^2$ and $\sigma_w^2$ are the all-pole gains for the host and watermarked frames,
respectively.
It was mentioned in [140] that LLR and IS measures perform well as predictors
of the recognition rate for the signals with additive noise in continuous speech
recognition systems.
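A minimal MATLAB sketch of Eqs. (6.3) and (6.4) for one frame pair, assuming the Signal Processing Toolbox functions lpc and xcorr; the variable names are illustrative:

% LLR and IS distortion measures for one host/watermarked frame pair.
P = 10;                                   % LP analysis order
[ao, so2] = lpc(go, P);                   % host LP coefficients and error variance
[aw, sw2] = lpc(gw, P);                   % watermarked-frame counterparts
r  = xcorr(go, P, 'biased');              % autocorrelation of the host frame
Ro = toeplitz(r(P+1:end));                % (P+1)-by-(P+1) autocorrelation matrix

ratio = (aw * Ro * aw') / (ao * Ro * ao');
dLLR  = log10(ratio);                                  % Eq. (6.3)
dIS   = (so2 / sw2) * ratio + log10(sw2 / so2) - 1;    % Eq. (6.4)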
• Log-area ratio measure
The log-area ratio (LAR) measure is also based on LP analysis, in that it depends on
the LP reflection coefficients [136–138, 140]:

$$d_{\mathrm{LAR}}(\vec{r}_w, \vec{r}_o) = \left| \frac{1}{P} \sum_{p=1}^{P} \left[ \log_{10}\frac{1 + \vec{r}_o(p)}{1 - \vec{r}_o(p)} - \log_{10}\frac{1 + \vec{r}_w(p)}{1 - \vec{r}_w(p)} \right]^2 \right|^{1/2} \qquad (6.5)$$
where $P$ is the order of LP analysis and $P = 10$ in our experiments. $\vec{r}_o$ and $\vec{r}_w$ are the
LP reflection coefficient vectors of the host and watermarked frames, respectively.
Since the reflection coefficients are closely related to power spectra, the LAR
measure is able to estimate the differences between the logarithms of the spectra of
the host and watermarked signals efficiently [137]. In [136, 137, 140], it has been
observed that the LAR is the best measure in some cases.
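A corresponding MATLAB sketch of Eq. (6.5), assuming the Signal Processing Toolbox function poly2rc to convert LP coefficients into reflection coefficients:

% LAR measure for one frame pair via LP reflection coefficients, Eq. (6.5).
P  = 10;
ro = poly2rc(lpc(go, P));                 % reflection coefficients, host frame
rw = poly2rc(lpc(gw, P));                 % reflection coefficients, watermarked frame
lar  = log10((1 + ro) ./ (1 - ro)) - log10((1 + rw) ./ (1 - rw));
dLAR = sqrt(abs(mean(lar.^2)));           % |(1/P) * sum(...)|^(1/2)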
• Weighted spectral slope measure
The weighted spectral slope (WSS) measure is based on an auditory model in
which 36 overlapping filters of progressively larger bandwidth are used to estimate
the smoothed short-time spectra [136]. Then, the weighted difference between the
spectral slopes (SL) in each band is calculated [139].
According to [50, 136, 138, 140], the WSS measure in decibels is formulated as

$$d_{\mathrm{WSS}} = K_{\mathrm{spl}}(K_o - K_w) + \sum_{k=1}^{36} w_a(k)\, \big[SL_o(k) - SL_w(k)\big]^2 \qquad (6.6)$$
where $K_o$ and $K_w$ are related to the overall sound pressure level and $K_{\mathrm{spl}}$ is a
parameter that can be varied to improve overall performance. In our experiments,
$K_{\mathrm{spl}} = 0$ is used as in [140], and the weight $w_a$ depends on the formant locations
[136]. As concluded in [50, 140], the WSS measure might outperform other
measures because it employs an auditory model.
Note that for each objective quality measure (except segSNR), the overall quality
score is obtained by using the m95% mean to reduce the influence of outliers. The
m95% mean of each quality measure is calculated in the following way. First, the
value of the quality measure is calculated for each frame. Then the values of the
quality measure for all the frames are sorted in ascending order. The m95% mean
is the average of the first 95% of the values of each quality measure [136, 138].
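This trimming is straightforward in MATLAB; a minimal sketch, where d is a vector of per-frame values of one quality measure:

% m95% mean: average of the smallest 95% of the per-frame values.
d   = sort(d(:), 'ascend');
m95 = mean(d(1:floor(0.95 * numel(d))));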
6.3 Experiments and Discussion

In this section, objective quality measures are evaluated to estimate their capabilities
for predicting the perceptual quality of the watermarked audio signals. This is
achieved by performing a correlation analysis between the SDGs and the values of
the objective quality measures.
The audio signals used are taken from the test set prepared in Sect. 3.1.1, 17
pieces of audio signals ($A_1$ to $A_{17}$) in total. However, PEMO-Q is a commercial
software tool and its demo version is strictly limited to signal lengths of up to 4 s.
Therefore, we always use the first 4 s of each original audio test signal for the
experiments here. Moreover, all the simulations are conducted on a Pentium 4
2.4 GHz computer with 1 GB RAM under the Windows XP operating system.
The performance of the objective quality measures is fully investigated under different
audio watermarking techniques: the proposed scheme in Chap. 4, along with
cepstrum domain watermarking, wavelet domain watermarking, echo hiding, and
histogram-based watermarking from Chap. 3.
In the experiments, each technique is employed to implement the process of
watermarking separately. The imperceptibility of the watermarked signal is con-
trolled by the watermark strength or a similar factor. Without loss of generality, all
Similar to the previous subjective listening tests, ten trained listeners participated
in the tests that were performed in an isolated chamber. Also, all the stimuli were
presented through a high-fidelity headphone.
During the tests, the participants were asked to evaluate the perceptual quality
of the watermarked signal relative to its host signal and subsequently provide a
SDG. In view of the difficulties in the real listening tests, only the proposed audio
watermarking scheme was considered. Moreover, for each host signal, the human
subjects were not required to evaluate all twenty watermarked signals (i.e.,
$\alpha_w = 10, 20, 30, \ldots, 200$), but just five of them, with $\alpha_w = 40, 80, 120, 160, 200$.
In addition, the host signal was also included as a watermarked signal (with $\alpha_w = 0$).
The SDGs given by the $K$ listeners are averaged for each evaluated signal:

$$\bar{G}_{\mathrm{SDG}}(i, j') = \frac{1}{K} \sum_{k=1}^{K} G_{\mathrm{SDG}}(i, j', k) \qquad (6.7)$$

where $K = 10$.
For simplicity of expression, the average SDGs for each host signal $A_i$ are denoted
as $\bar{G}_{\mathrm{SDG}}(i)$, where $\bar{G}_{\mathrm{SDG}}(i) = \{\bar{G}_{\mathrm{SDG}}(i, j')\}$, $j' = 1, 2, \ldots, 6$.
Objective evaluation tests comprise two stages: investigation of the evaluation tools
and calculation of the values of the quality measures.
• Evaluation tool analysis
In the first stage, we investigate the effectiveness of three evaluation tools using
perception modelling, namely PEMO-Q [49], EAQUAL [47], and PEAQ [48]. The
aim is to find the best quasi-subjective predictor of audio quality that would best
conform to the SDG. Its ODGs will be adopted subsequently as quasi-SDGs for
correlation analysis in the next section. The reason for using quasi-SDGs rather than
SDGs is that it would be inaccurate to perform a correlation analysis with an
insufficient number of average SDGs.
For this purpose, all the watermarked signals of each host signal are evaluated
separately using the three evaluation tools.
Take the proposed audio watermarking scheme as an example. Each host signal
has twenty watermarked signals with $\alpha_w = 10, 20, 30, \ldots, 200$. Moreover, the
host signal is also included, to correspond with its $\bar{G}_{\mathrm{SDG}}$ obtained above. So
for each host signal $A_i$, there are twenty-one watermarked signals with $\alpha_w = 0, 10, 20, 30, \ldots, 200$.
After being evaluated by the three tools, each host signal $A_i$
receives three kinds of ODGs, i.e., $G_{\mathrm{ODG1}}(i) = \{G_{\mathrm{ODG1}}(i, \hat{j})\}$ by PEMO-Q,
$G_{\mathrm{ODG2}}(i) = \{G_{\mathrm{ODG2}}(i, \hat{j})\}$ by EAQUAL, and $G_{\mathrm{ODG3}}(i) = \{G_{\mathrm{ODG3}}(i, \hat{j})\}$
by PEAQ, where $\hat{j} = 1, 2, \ldots, 21$.
Fig. 6.2 Average SDGs ($\bar{G}_{\mathrm{SDG}}$) and ODGs ($G_{\mathrm{ODG1}}$, $G_{\mathrm{ODG2}}$, $G_{\mathrm{ODG3}}$) versus watermark strength $\alpha_w$ for the watermarked Bass.wav, Gspi.wav, Harp.wav, and Pop.wav signals (difference grades range from 0 down to −4)
Figure 6.2 shows the average SDGs and the ODGs for the watermarked Bass.wav,
Gspi.wav, Harp.wav, and Pop.wav signals, respectively.
Note that a few ODGs in Fig. 6.2 are slightly positive, mostly at small watermark
strengths. As mentioned in Sect. 3.2.1, such cases are interpreted
as distortions that are mostly inaudible to humans.
According to Fig. 6.2, for each host signal, its $G_{\mathrm{ODG1}}$ is closer to $\bar{G}_{\mathrm{SDG}}$
than $G_{\mathrm{ODG2}}$ and $G_{\mathrm{ODG3}}$ are. This means that PEMO-Q provides a better correspondence
between subjective and objective difference grades. Therefore, $G_{\mathrm{ODG1}}(i)$, simplified
as $G_Q(i)$, is used for correlation analysis.
• Quality measures calculation
In the second stage, we calculate the values of the quality measures between the
host signal and all its watermarked signals. The selected objective quality measures
include the SNR, segSNR, CD, LLR, IS, LAR, and WSS measures, denoted by
$r = 1, 2, \ldots, 7$, respectively. Consequently, each host signal $A_i$ has the values of
seven quality measures, denoted by $O_r(i)$, $r = 1, 2, \ldots, 7$.
Different from the tests in the first stage, the watermarked signals with $\alpha_w = 0$
are not included in the quality measures calculation. Otherwise, the values of the SNR
and the segSNR would be infinite, while the values of the other measures would
always be equal to zero.
Take the proposed audio watermarking scheme as an example. Each host signal
has twenty watermarked signals with $\alpha_w = 10, 20, 30, \ldots, 200$. Thus, for each
host signal $A_i$, the values of the seven quality measures are calculated separately, i.e.,
$O_1(i) = \{O_1(i,j)\}$ for the SNR measure, $O_2(i) = \{O_2(i,j)\}$ for the segSNR
measure, $O_3(i) = \{O_3(i,j)\}$ for the CD measure, $O_4(i) = \{O_4(i,j)\}$ for the
LLR measure, $O_5(i) = \{O_5(i,j)\}$ for the IS measure, $O_6(i) = \{O_6(i,j)\}$
for the LAR measure, and $O_7(i) = \{O_7(i,j)\}$ for the WSS measure, where
$j = 1, 2, \ldots, 20$.
As the watermarked signal with $\alpha_w = 0$ is excluded from the above calculations,
the length of $O_r(i)$ is always one less than that of $G_Q(i)$. To conduct a correlation
analysis, the first value of $G_Q(i)$, which corresponds to the watermarked signal with
$\alpha_w = 0$, is discarded, so that $G_Q(i)$ has the same length as $O_r(i)$.
Then we repeat the above procedure of calculating the values of quality measures
for different audio watermarking techniques.
In summary, for a given watermarking technique, each host signal $A_i$ has $N_\alpha$
watermarked signals, as introduced in Sect. 6.3.1. Based on these $N_\alpha$ watermarked
signals, each $A_i$ receives the quasi-SDGs $G_Q(i) = \{G_Q(i,j)\}$ and the values
of the seven quality measures $O_r(i) = \{O_r(i,j)\}$, $r = 1, 2, \ldots, 7$, where $j = 1, 2, \ldots, N_\alpha$.
Note that computation time is also one of our concerns. Table 6.1 lists the
computation time of the quality measures on one watermarked Bass.wav signal with
$\alpha_w = 60$ in the proposed audio watermarking scheme. PEMO-Q took around 55 s
to complete the evaluation of one watermarked signal with the default settings in
[49]. In comparison, all quality measures finished in less than 4 s, much faster
than PEMO-Q. In particular, the SNR and segSNR measures took the least time,
less than 0.1 s. Also, the computation times of the LAR, LLR, and IS measures are
not more than 2.2 s. The measured response times are based on a Pentium 4 PC
with a 2.4 GHz CPU and 1 GB RAM running the Windows XP operating system.
The Pearson correlation coefficient between the values of an objective quality
measure and the corresponding quasi-SDGs is calculated by

$$\rho = \frac{\sum_{n=1}^{N_m} \big(O_r(n) - \bar{O}_r\big)\big(G_Q(n) - \bar{G}_Q\big)}{\left\{\sum_{n=1}^{N_m} \big(O_r(n) - \bar{O}_r\big)^2\right\}^{1/2} \left\{\sum_{n=1}^{N_m} \big(G_Q(n) - \bar{G}_Q\big)^2\right\}^{1/2}} \qquad (6.8)$$
In the first analysis, the correlation is performed on each host signal individually.
The individual correlation coefficient of the $r$th quality measure for host signal $A_i$,
$\rho(i, r)$, is calculated by

$$\rho(i, r) = \frac{\sum_{j=1}^{N_\alpha} \big(O_r(i,j) - \bar{O}_r(i)\big)\big(G_Q(i,j) - \bar{G}_Q(i)\big)}{\left\{\sum_{j=1}^{N_\alpha} \big(O_r(i,j) - \bar{O}_r(i)\big)^2\right\}^{1/2} \left\{\sum_{j=1}^{N_\alpha} \big(G_Q(i,j) - \bar{G}_Q(i)\big)^2\right\}^{1/2}} \qquad (6.9)$$
where $\bar{O}_r(i) = \frac{1}{N_\alpha} \sum_{j=1}^{N_\alpha} O_r(i,j)$ and $\bar{G}_Q(i) = \frac{1}{N_\alpha} \sum_{j=1}^{N_\alpha} G_Q(i,j)$.
The average correlation coefficient of each quality measure, $\rho_1(r)$, $r = 1, 2, \ldots, 7$,
is calculated by

$$\rho_1(r) = \frac{1}{N_h} \sum_{i=1}^{N_h} \rho(i, r) \qquad (6.10)$$
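Both coefficients map directly onto MATLAB; a minimal sketch of Eqs. (6.9) and (6.10), assuming the Statistics Toolbox function corr and illustrative cell arrays O_all and GQ_all holding, for each host signal, the measure values and the quasi-SDGs:

% Individual correlations rho(i,r) for one quality measure, Eq. (6.9),
% then the average correlation over all host signals, Eq. (6.10).
Nh  = 17;                                % number of host signals
rho = zeros(Nh, 1);
for i = 1:Nh
    % O_all{i} and GQ_all{i} are N_alpha-by-1 vectors for host A_i
    rho(i) = corr(O_all{i}, GQ_all{i});  % Pearson correlation coefficient
end
rho1 = mean(rho);                        % average correlation coefficient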
In the second analysis, the correlation is directly performed on all the host signals.
The overall correlation coefficient of the $r$th objective quality measure, $\rho_2(r)$, $r = 1, 2, \ldots, 7$,
is calculated by [143]
$$\rho_2(r) = \frac{\sum_{i=1}^{N_h} \sum_{j=1}^{N_\alpha} \big(O_r(i,j) - \bar{O}_r\big)\big(G_Q(i,j) - \bar{G}_Q\big)}{\left\{\sum_{i=1}^{N_h} \sum_{j=1}^{N_\alpha} \big(O_r(i,j) - \bar{O}_r\big)^2\right\}^{1/2} \left\{\sum_{i=1}^{N_h} \sum_{j=1}^{N_\alpha} \big(G_Q(i,j) - \bar{G}_Q\big)^2\right\}^{1/2}} \qquad (6.11)$$
where $\bar{O}_r = \frac{1}{N_h N_\alpha} \sum_{i=1}^{N_h} \sum_{j=1}^{N_\alpha} O_r(i,j)$ and $\bar{G}_Q = \frac{1}{N_h N_\alpha} \sum_{i=1}^{N_h} \sum_{j=1}^{N_\alpha} G_Q(i,j)$.
iD1 j D1 iD1 j D1
Note that the average correlation coefficient is widely used in studying objective
quality measures [136–141]. The overall correlation coefficient is more desirable in
some applications, but considered to be rather stringent [138, 144].
Tables 6.2, 6.3, 6.4, 6.5, and 6.6 show the Pearson correlation coefficients
under the different audio watermarking techniques. The results include the individual
correlation coefficients $\rho(i,r)$, the average correlation coefficients (absolute values)
$|\rho_1(r)|$, and the overall correlation coefficients (absolute values) $|\rho_2(r)|$, where
$i = 1, 2, \ldots, 17$ denotes the host signal $A_i$ and $r = 1, 2, \ldots, 7$ denotes
the quality measure. In each table, the highest $|\rho_1(r)|$ and $|\rho_2(r)|$ (i.e., the absolute
values closest to 1) are shaded and the second highest ones are in bold.
Some observations can be obtained from the Pearson correlation coefficients
[143].
• The overall correlation coefficients $|\rho_2(r)|$ are generally lower than the average
correlation coefficients $|\rho_1(r)|$ under the different audio watermarking techniques.
Take the proposed audio watermarking scheme in Table 6.2 as an example.
$|\rho_1(r)|$, $r = 1, 2, \ldots, 7$, are equal to 0.92, 0.87, 0.92, 0.95, 0.85, 0.95, and 0.94,
respectively, all at least 0.85. However, $|\rho_2(r)|$, $r = 1, 2, \ldots, 7$, are equal to
0.38, 0.30, 0.26, 0.58, 0.27, 0.64, and 0.59, respectively, none more than 0.64.
This is because the functional relationship between an objective quality measure
and the quasi-SDGs varies across different types of audio signals, i.e., vocal,
percussive instrument, tonal instrument, and music. Even within the same category,
different instruments or different genres of music are likely to exhibit
different time–frequency characteristics. Consequently, the overall correlation
coefficients $|\rho_2(r)|$ are less than the average correlation coefficients $|\rho_1(r)|$ in most
cases of audio watermarking techniques, although exceptions exist, due to the
intricacies of the different techniques.
If the audio signals have similar properties, the overall correlation coefficients
become better. For instance, host audio signals $A_1$ (Soprano.wav), $A_2$ (Bass.wav),
and $A_3$ (Quartet.wav) all belong to the vocal category and also have the same
lyrics. Then the overall correlation coefficients over $A_1$, $A_2$, and $A_3$ can be
calculated by Eq. (6.11), where $N_h = 3$. Figure 6.3 shows the results of $\rho_2(r)$
Fig. 6.3 Overall correlation coefficients over audio test signals $A_1$, $A_2$, and $A_3$: scatter plots of $G_Q$ against each of the seven quality measures (SNR, segSNR, CD, LLR, IS, LAR, and WSS) with the corresponding $|\rho_2(r)|$ values; for example, $|\rho_2(3)| = 0.94$ for the CD measure
($|\rho_2(4)| = 0.59$). Moreover, the LLR measure provides the highest average
correlation under the proposed scheme ($|\rho_1(4)| = 0.95$), cepstrum domain
watermarking ($|\rho_1(4)| = 0.87$), wavelet domain watermarking ($|\rho_1(4)| = 0.76$),
and echo hiding ($|\rho_1(4)| = 0.93$). However, the LLR measure receives
the second lowest average correlation under histogram-based watermarking
($|\rho_1(4)| = 0.41$).
In addition, the WSS measure ($r = 7$) shows similar performance to the IS
measure ($r = 5$), better than the SNR measure ($r = 1$) and the segSNR measure
($r = 2$) on the whole.
By comparison, the CD measure ($r = 3$) yields the worst correlation in most
cases, with especially low overall correlation. The CD measure yields the lowest
overall correlation under the proposed scheme ($|\rho_2(3)| = 0.26$), cepstrum
domain watermarking ($|\rho_2(3)| = 0.17$), and echo hiding ($|\rho_2(3)| = 0.05$).
Also, the CD measure yields the second lowest overall correlation under wavelet
domain watermarking ($|\rho_2(3)| = 0.18$) and histogram-based watermarking
($|\rho_2(3)| = 0.28$).
• By using different quality measures, every audio watermarking technique can
achieve a satisfactory overall correlation. The highest overall correlation coeffi-
cient is equal to 0.64 under the proposed scheme, equal to 0.72 under cepstrum
domain watermarking, equal to 0.71 under wavelet domain watermarking, equal
to 0.65 under echo hiding, and equal to 0.71 under histogram-based watermark-
ing. As mentioned above, except for cepstrum domain watermarking, where the
highest overall correlation is provided by the LLR measure, the
highest overall correlation under the other watermarking techniques is provided by
the LAR measure.
This demonstrates that objective quality measures are able to predict the perceptual
quality of the watermarked audio signals.
6.4 Summary
¹False-negative probability is defined as the probability of missing the detection of an existing watermark, and false-positive probability is defined as the probability of detecting a nonexistent watermark.
Appendix B
STEP 2000
STEP 2000 [40] is a joint international evaluation project for audio digital watermarking
technology, undertaken by JASRAC¹ and NRI² together with international
associations of copyright management societies, CISAC and BIEM. It is the first
work of its kind initiated by copyright management bodies.
The objective of STEP 2000 is “to certify the aptitude of digital watermark
technologies, with a view towards promoting its utilization.” Enthusiastic responses
from many technology enterprises were received, contributing to an extensive
technology evaluation.
The evaluation of the submitted digital watermark technologies was conducted
mainly in two aspects, i.e., audibility and robustness.
• Audibility—Whether the professionals can perceive if watermarks have been
embedded in music that is played back in a recording studio environment
A subjective listening test, the ABX test, was conducted for perceptual quality evaluation.
First, the listener listens to a sound recording with no watermark (A), a sound
recording with watermarks embedded (B), and a sound recording which is one of
the two (X). After that, the listener listens to A and B alternately twice for 40 s each
and listens to X for 40 s again. Then, the listener decides whether X is A or B.
There are two requirements in the ABX test to ensure its validity. One is to eliminate
contingency (chance) responses. To this end, the above tests were conducted
five times for each system, and a listener is considered to have detected
the embedded watermark only if the same listener correctly determines whether the
watermark is embedded on each of the five tests. Under this definition, the
significance of the responses is 95% or greater. The other requirement is to ensure the
typicality of the professionals from the recording industry. For this purpose, a group
comprising one recording engineer, one mastering engineer, one synthesizer manipulator,
and one audio critic was selected.
¹JASRAC: Japanese Society for Rights of Authors, Composers and Publishers.
²NRI: Nomura Research Institute, Ltd.
Appendix C
StirMark for Audio
StirMark for Audio [134] is a generic robustness test tool for audio watermarking
systems. It is derived from StirMark,¹ a fair benchmark for image watermarking. A
number of attacks as well as attack parameters are included in StirMark for Audio
v0.2, as shown in Table C.1.
¹StirMark v3.1 was the first benchmark for image watermarking, released in 1999. The latest version is StirMark Benchmark 4.0, available at http://www.petitcolas.net/fabien/watermarking/stirmark/.
Appendix E
List of Audio Test Files
All the audio test samples are 44.1 kHz, 16-bit, monaural WAVE format files, as listed
in Table E.1.
Appendix F
Basic Robustness Test
The basic robustness test is applied to the watermarked audio signal $s_w$ to inspect its
capability of resisting the different attacks listed in Table F.1.
Appendix G
Nonuniform Subbands
Audio signals used in this book are in WAVE format (44.1 kHz, 16 bit), and hence
the nonuniform subbands are designed to cover the 100–22,050 Hz frequency band.
When $N_{\mathrm{subband}} = 32$, the lower/upper limits of the subbands obtained are presented
in Table G.1. The number of FFT coefficients in each subband is calculated based
on a frame length $N = 512$.
References
18. M.D. Swanson, M. Kobayashi, A.H. Tewfik, Multimedia data: embedding and watermarking
technologies. Proc. IEEE 86(6), 1064–1087 (1998)
19. Y.Q. Lin, W.H. Abdulla, Audio watermarking for copyrights protection. Technical Report
SoE-650, School of Engineering, The University of Auckland (2007)
20. F.A.P. Petitcolas, R.J. Anderson, M.G. Kuhn, Information hiding: a survey. Proc. IEEE 87(7),
1062–1078 (1999)
21. S.A. Craver, M. Wu, B. Liu, What can we reasonably expect from watermarks? in Proceedings
of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2001, pp.
223–226
22. L.d.C.T. Gomes, P. Cano, E. Gómez, M. Bonnet, E. Batlle, Audio watermarking and
fingerprinting: for which applications? J. New Music Res. 32(1), 65–81 (2003)
23. F. Kurth, M. Muller, Efficient index-based audio matching. IEEE Trans. Audio Speech Lang.
Process. 16(2), 382–395 (2008)
24. M. Barni, F. Bartolini, Watermarking Systems Engineering: Enabling Digital Assets Security
and Other Applications (Marcel Dekker, New York, 2004)
25. J.-S. Pan, H.-C. Huang, L.C. Jain (eds.), Intelligent Watermarking Techniques (World
Scientific, River Edge, 2004)
26. J. Seitz (ed.), Digital Watermarking for Digital Media (Information Science Publishers,
Hershey, 2005)
27. B. Furht, D. Kirovski (eds.), Multimedia Watermarking Techniques and Applications (Auer-
bach Publications, Boca Raton, 2006)
28. F. Hartung, M. Kutter, Multimedia watermarking techniques. Proc. IEEE 87(7), 1079–1107
(1999)
29. T. Page, Digital watermarking as a form of copyright protection. Comput. Law Secur. Rep.
14(6), 390–392 (1998)
30. N. Cvejic, T. Seppanen (eds.), Digital Audio Watermarking Techniques and Technologies:
Applications and Benchmarks (Information Science Reference, Hershey, 2008)
31. M. Arnold, M. Schmucker, S.D. Wolthusen, Techniques and Applications of Digital Water-
marking and Content Protection (Artech House, Boston, 2003)
32. L. Boney, A.H. Tewfik, K.N. Hamdy, Digital watermarks for audio signals, in Proceedings of
IEEE International Conference on Multimedia Computing and Systems, 1996, pp. 473–480
33. C.-P. Wu, P.-C. Su, C.-C.J. Kuo, Robust and efficient digital audio watermarking using audio
content analysis, in Proceedings of SPIE Security and Watermarking of Multimedia Contents
II, vol. 3971, 2000, pp. 382–392
34. W.-N. Lie, L.-C. Chang, Robust and high-quality time-domain audio watermarking subject to
psychoacoustic masking, in Proceedings of IEEE International Symposium on Circuits and
Systems (ISCAS), 2001, pp. 45–48
35. S.J. Xiang, J.W. Huang, Histogram-based audio watermarking against time-scale modification
and cropping attacks. IEEE Trans. Multimed. 9(7), 1357–1372 (2007)
36. W. Bender, D. Gruhl, N. Morimoto, A. Lu, Techniques for data hiding. IBM Syst. J. 35(3 &
4), 313–336 (1996)
37. F.A.P. Petitcolas, Watermarking schemes evaluation. IEEE Signal Process. Mag. 17(5), 58–64
(2000)
38. A. Lang, J. Dittmann, Transparency and complexity benchmarking of audio watermark-
ing algorithms issues, in Proceedings of Workshop on Multimedia and Security, 2006,
pp. 190–201
39. M. Arnold, Subjective and objective quality evaluation of watermarked audio tracks, in
Proceedings of International Conference on Web Delivering of Music (WEDELMUSIC),
2002, pp. 161–167
40. Announcement of Evaluation Test Results for "STEP 2000". JASRAC and NRI (2000)
[Online], http://www.jasrac.or.jp/watermark/ehoukoku.htm
41. A. Garay Acevedo, Audio watermarking quality evaluation, in e-Business and Telecommuni-
cation Networks, ed. by J. Ascenso et al. (Springer, Netherlands, 2006), pp. 272–283
42. G. Stoll, F. Kozamernik, EBU listening tests on internet audio codecs. EBU Technical Review,
2000
43. ITU-R Recommendation BS.1116-1: Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems, 1997
44. ITU-R Recommendation BS.1534-1: Method for the Subjective Assessment of Intermediate Quality Level of Coding Systems, 2003
45. ITU-R Recommendation BS.1284-1: General Methods for the Subjective Assessment of Sound Quality, 2003
46. J.G. Beerends, Audio quality determination based on perceptual measurement techniques,
in Applications of Digital Signal Processing to Audio and Acoustics, ed. by M. Kahrs,
K. Brandenburg (Kluwer Academic, Boston, 1998), pp. 1–38
47. A. Lerch, Software: EAQUAL - Evaluation of Audio Quality, v.0.1.3alpha ed. (2002)
[Online], http://www.rarewares.org/others.php
48. P. Kabal, An examination and interpretation of ITU-R BS.1387: Perceptual evaluation of audio quality. Technical Report, TSP Lab, McGill University (2003) [Online], http://www-mmsp.ece.mcgill.ca/Documents
49. R. Huber, B. Kollmeier, PEMO-Q: a new method for objective audio quality assessment
using a model of auditory perception. IEEE Trans. Audio Speech Lang. Process. 14(6),
1902–1911 (2006) [Online]. http://www.hoertech.de/web-en/produkte/downloads.shtml
50. S.R. Quackenbush, T.P. Barnwell III, M.A. Clements, Objective Measures of Speech Quality
(Prentice Hall, Englewood Cliffs, 1988)
51. M. Bosi, R.E. Goldberg, Introduction to Digital Audio Coding and Standards (Kluwer
Academic, Boston, 2003)
52. W.J. Vincoli (ed.), Lewis’ Dictionary of Occupational and Environmental Safety and Health
(Lewis Publishers, Boca Raton, 2000)
53. K. Johnson, Acoustic and Auditory Phonetics (Blackwell Publisher, Malden, 2003)
54. P.H. Lindsay, D.A. Norman, Human Information Processing: An Introduction to Psychology
(Academic, New York, 1977)
55. T.S. Gunawan, Audio compression and speech enhancement using temporal masking models.
Ph.D. dissertation, The University of New South Wales, 2007
56. [Online]. Available: http://projects.cbe.ab.ca/Diefenbaker/Biology/Bio%20Website%20Final/notes/nervous_system/Image59.gif
57. M.W. Levine, Levine and Shefner’s Fundamentals of Sensation and Perception (Oxford
University Press, Oxford, 2000)
58. W.A. Yost, D.W. Nielsen, Fundamentals of Hearing: An Introduction (Holt, Rinehart and
Winston, New York, 1977)
59. E.A.G. Shaw, Ear canal pressure generated by a free sound field. J. Acoust. Soc. Am. 39(3), 465–470 (1966)
60. B.C.J. Moore, An Introduction to the Psychology of Hearing (Academic, New York, 2003)
61. [Online]. Available: http://www.chicagoear.com/images/earworks.gif
62. T.D. Rossing (ed.), Handbook of Acoustics (Springer, Heidelberg, 2007)
63. [Online]. Available: http://www2.ph.ed.ac.uk/AardvarkDeployments/Public/67158/views/workspace/dwatts1/66265/inner.node/les/MusicalAcoustics/CourseNotes/PropertiesoftheEar/web.html
64. [Online]. Available: http://www.ai.rug.nl/acg/cpsp/docs/cochleaModel.html
65. I.J. Hirsh, The Measurement of Hearing (McGraw-Hill, New York, 1952)
66. H. Fletcher, W.A. Munson, Loudness, its definition, measurement and calculation. J. Acoust.
Soc. Am. 5(2), 82–108 (1933)
67. Y.H. Kim, H.I. Kang, K.I. Kim, S.-S. Han, A digital audio watermarking using two masking effects, in Advances in Multimedia Information Processing - PCM 2002, ed. by Y.-C. Chen, L.-W. Chang, C.-T. Hsu. Lecture Notes in Computer Science, vol. 2532 (Springer, Berlin/Heidelberg, 2002), pp. 105–115
68. X.M. Quan, H.B. Zhang, Statistical audio watermarking algorithm based on perceptual
analysis, in Proceedings of the 5th ACM Workshop on Digital Rights Management, 2005,
pp. 112–118
69. E. Ambikairajah, A.G. Davis, W.T.K. Wong, Auditory masking and MPEG-1 audio compression. Electron. Comm. Eng. J. 9, 165–173 (1997)
70. A. Spanias, T. Painter, V. Atti, Audio Signal Processing and Coding (Wiley-Interscience,
Hoboken, 2007)
71. M.D. Swanson, B. Zhu, A.H. Tewfik, L. Boney, Robust audio watermarking using perceptual
masking. Signal Process. 66(3), 337–355 (1998)
72. R.A. Garcia, Digital watermarking of audio signals using a psychoacoustic auditory model
and spread spectrum theory. AES E-Library, 1999
73. S. Ratanasanya, S. Poomdaeng, S. Tachphetpiboon, T. Amornraksa, New psychoacoustic models for wavelet based audio watermarking, in IEEE International Symposium on Communications and Information Technology (ISCIT), vol. 1, 2005, pp. 602–605
74. ISO/IEC IS 11172-3, Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at Up to About 1.5 Mbit/s, Part 3: Audio (BSI, London, 1993)
75. K.C. Pohlmann, Principles of Digital Audio (McGraw-Hill, New York, 2000)
76. D. Pan, A tutorial on MPEG/audio compression. IEEE Multimed. 2, 60–74 (1995)
77. F.A.P. Petitcolas, MPEG for Matlab, v.1.2.8 ed. (2003) [Online], http://www.petitcolas.net/fabien/software/mpeg
78. C.-Y. Lin, An investigation into perceptual audio coding and the use of auditory gammatone
filterbanks. Master’s thesis, The University of Auckland, 2007
79. SQAM - Sound Quality Assessment Material, European Broadcasting Union (EBU)
[Online], http://sound.media.mit.edu/mpeg4/audio/sqam
80. A. Takahashi, R. Nishimura, Y. Suzuki, Multiple watermarks for stereo audio signals using
phase-modulation techniques. IEEE Trans. Signal Process. 53(2), 806–815 (2005)
81. P. Liew, M. Armand, Inaudible watermarking via phase manipulation of random frequencies.
Multimed. Tools Appl. 35(3), 357–377 (2007)
82. A. Piva, M. Barni, F. Bartolini, A. De Rosa, Data hiding technologies for digital radiography.
IEE Proc. Vision Image Signal Process. 152(5), 604–610 (2005)
83. B. Chen, G.W. Wornell, Quantization index modulation: a class of provably good methods
for digital watermarking and information embedding. IEEE Trans. Inform. Theory 47(4),
1423–1443 (2001)
84. A. Zaidi, R. Boyer, P. Duhamel, Audio watermarking under desynchronization and additive
noise attacks. IEEE Trans. Signal Process. 54(2), 570–584 (2006)
85. D. Lam, Audio watermarking. COMPSYS401A Project, The University of Auckland, 2003
86. S. Saito, T. Furukawa, K. Konishi, A digital watermarking for audio data using band division
based on QMF bank, in Proceedings of IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), vol. 4, 2002, pp. 3473–3476
87. A.V. Oppenheim, R.W. Schafer, Discrete-Time Signal Processing (Prentice Hall, Englewood
Cliffs, 1989)
88. S.-S. Kuo, J.D. Johnston, W. Turin, S.R. Quackenbush, Covert audio watermarking using perceptually tuned signal independent multiband phase modulation. Proc. ICASSP 2, 1753–1756 (2002)
89. I.J. Cox, J. Kilian, F.T. Leighton, T. Shamoon, Secure spread spectrum watermarking for
multimedia. IEEE Trans. Image Process. 6(12), 1673–1687 (1997)
90. H.J. Kim, Audio watermarking techniques, in Proceedings of Pacific Rim Workshop on Digital
Steganography, 2003
91. H. Malik, A. Khokhar, A. Rashid, Robust audio watermarking using frequency selective
spread spectrum theory. Proc. ICASSP 5, 385–388 (2004)
92. N. Cvejic, T. Seppanen, Spread spectrum audio watermarking using frequency hopping and
attack characterization. Signal Process. 84(1), 207–213 (2004)
93. J. Seok, J. Hong, J. Kim, A novel audio watermarking algorithm for copyright protection of
digital audio. ETRI J. 24(3), 181–189 (2002)
94. L.R. Rabiner, R.W. Schafer, Digital Processing of Speech Signals (Prentice Hall, Englewood
Cliffs, 1978)
95. X. Li, H.H. Yu, Transparent and robust audio data hiding in cepstrum domain, in Proceedings
of IEEE International Conference on Multimedia and Expo (ICME), vol. 1, 2000, pp. 397–400
96. S.-K. Lee, Y.-S. Ho, Digital audio watermarking in the cepstrum domain. IEEE Trans.
Consumer Electron. 46(3), 744–750 (2000)
97. C.-T. Hsieh, P.-Y. Sou, Blind cepstrum domain audio watermarking based on time energy
features, in Proceedings of International Conference on Digital Signal Processing (DSP),
vol. 2, 2002, pp. 705–708
98. L.L. Cui, S.X. Wang, T.F. Sun, The application of binary image in digital audio watermarking,
in Proceedings of International Conference on Neural Networks and Signal Processing,
vol. 2, 2003, pp. 1497–1500
99. K. Gopalan, Audio steganography by cepstrum modification. Proc. ICASSP 5, 481–484
(2005)
100. K.K. Parhi, T. Nishitani, Digital Signal Processing for Multimedia Systems (CRC Press, New
York, 1999)
101. W.Y. Hwang, H.I. Kang, S.S. Han, K.I. Kim, H.S. Kang, Robust audio watermarking using
both DWT and masking effect, in Digital Watermarking, LNCS 2939, ed. by T. Kalker et al.
(Springer, Berlin/Heidelberg, 2004), pp. 382–389
102. A. Prochazka, J. Uhlir, P.W.J. Rayner, N.G. Kingsbury, Signal Analysis and Prediction
(Birkhäuser, Boston, 1998)
103. X. He, M.S. Scordilis, An enhanced psychoacoustic model based on the discrete wavelet
packet transform. J. Franklin Inst. 343(7), 738–755 (2006)
104. C.-S. Ko, K.-Y. Kim, R.-W. Hwang, Y.-S. Kim, S.-B. Rhee, Robust audio watermarking
in wavelet domain using pseudorandom sequences, in Proceedings of Annual International
Conference on Computer and Information Science (ACIS), 2005, pp. 397–401
105. P. Artameeyanant, Wavelet audio watermark robust against MPEG compression, in SICE Annual Conference, 2007, pp. 1414–1417
106. H.O. Kim, B.K. Lee, N. Lee, Wavelet-based audio watermarking techniques: robustness and
fast synchronization [Online], http://amath.kaist.ac.kr/research/paper/01-11.pdf
107. W. Li, X.Y. Xue, An audio watermarking technique that is robust against random cropping.
Comput. Music J. 27(4), 58–68 (2003)
108. H.O. Oh, J.W. Seok, J.W. Hong, D.H. Youn, New echo embedding technique for robust and
imperceptible audio watermarking. Proc. ICASSP 3, 1341–1344 (2001)
109. D. Gruhl, A. Lu, W. Bender, Echo hiding, in Information Hiding, ed. by R. Anderson. Lecture Notes in Computer Science, vol. 1174 (Springer, Berlin/Heidelberg, 1996), pp. 295–315
110. H.J. Kim, Y.H. Choi, A novel echo-hiding scheme with backward and forward kernels. IEEE
Trans. Circ. Syst. Video Tech. 13(8), 885–889 (2003)
111. B.-S. Ko, R. Nishimura, Y. Suzuki, Time-spread echo method for digital audio watermarking.
IEEE Trans. Multimed. 7(2), 212–221 (2005)
112. B.-S. Ko, R. Nishimura, Y. Suzuki, Log-scaling watermark detection in digital audio
watermarking. Proc. ICASSP 3, 81–84 (2004)
113. D. Coltuc, P. Bolon, Robust watermarking by histogram specification, in Proceedings of
International Conference on Image Processing (ICIP), vol. 2, 1999, pp. 236–239
114. M. Mese, P.P. Vaidyanathan, Optimal histogram modification with MSE metric. Proc.
ICASSP 3, 1665–1668 (2001)
115. E. Chrysochos, V. Fotopoulos, A.N. Skodras, M. Xenos, Reversible image watermarking
based on histogram modification, in Proceedings of the 11th Panhellenic Conference on
Informatics (PCI), vol. B, 2007, pp. 93–104
116. G.R. Xuan, Q.M. Yao, C.Y. Yang, J.J. Gao, P.Q. Chai, Y. Shi, Z.C. Ni, Lossless data hiding using histogram shifting method based on integer wavelets, in Digital Watermarking, ed. by Y.Q. Shi, B. Jeon. Lecture Notes in Computer Science, vol. 4283 (Springer, Berlin/Heidelberg, 2006), pp. 323–332
117. S.J. Xiang, J.W. Huang, R. Yang, Time-scale invariant audio watermarking based on the statistical features in time domain, in Information Hiding, ed. by J. Camenisch et al. Lecture Notes in Computer Science, vol. 4437 (Springer, Berlin/Heidelberg, 2007), pp. 93–108. Matlab implementation available at http://cist.korea.ac.kr/xiangshijun/
118. D.R. Smith, Digital Transmission Systems (Kluwer Academic, Boston, 2004)
119. H. Farid, Detecting hidden messages using higher-order statistical models. Proc. ICIP 2,
905–908 (2002)
120. M. Alghoniemy, A.H. Tewfik, Image watermarking by moment invariants. Proc. ICIP 2,
73–76 (2000)
121. S.J. Xiang, J.W. Huang, R. Yang, C.T. Wang, H.M. Liu, Robust audio watermarking based on low-order Zernike moments, in Digital Watermarking, ed. by Y.Q. Shi, B. Jeon. Lecture Notes in Computer Science, vol. 4283 (Springer, Berlin/Heidelberg, 2006), pp. 226–240
122. P. Bas, J.-M. Chassery, B. Macq, Geometrically invariant watermarking using feature points.
IEEE Trans. Image Process. 11(9), 1014–1028 (2002)
123. F.-S. Wei, F. Xue, M.Y. Li, A blind audio watermarking scheme using peak point extraction.
Proc. ISCAS 5, 4409–4412 (2005)
124. W.H. Abdulla, Auditory based feature vectors for speech recognition systems, in Advances
in Communications and Software Technologies, ed. by N.E. Mastorakis, V.V. Kluev (WSEAS
Press, Greece, 2002), pp. 231–236
125. Y.Q. Lin, W.H. Abdulla, Robust audio watermarking technique based on Gammatone
filterbank and coded-image, in Proceedings of International Symposium on Signal Processing
and Its Applications (ISSPA), 2007
126. D. Bailey, W. Cammack, J. Guajardo, C. Paar, Cryptography in modern communication systems, in TI DSPS FEST, 1999, pp. 1–15
127. Y.Q. Lin, W.H. Abdulla, A secure and robust audio watermarking scheme using multiple
scrambling and adaptive synchronization, in Proceedings of the 6th International Conference
on Information, Communications and Signal Processing (ICICS), 2007
128. Y.Q. Lin, W.H. Abdulla, Y. Ma, Audio watermarking detection resistant to time and pitch
scale modification, in Proceedings of IEEE International Conference on Signal Processing
and Communications (ICSPC), 2007, pp. 1379–1382
129. M. Kahrs, K. Brandenburg, Applications of Digital Signal Processing to Audio and Acoustics
(Kluwer Academic, Boston, 1998)
130. C.-W. Tang, H.-M. Hang, A feature-based robust digital image watermarking scheme. IEEE
Trans. Signal Process. 51(4), 950–959 (2003)
131. Y.Q. Lin, W.H. Abdulla, Multiple scrambling and adaptive synchronization for audio
watermarking, in Digital Watermarking, ed. by Y.Q. Shi, H.-J. Kim, S. Katzenbeisser. Lecture
Notes in Computer Science, vol. 5041 (Springer, Berlin/Heidelberg, 2007), pp. 440–453
132. T. Acharya, A.K. Ray, Image Processing: Principles and Applications (Wiley, Hoboken,
2005)
133. N. Collins, Introduction to Computer Music (Wiley, New York, 2009)
134. A. Lang, Documentation for Stirmark for Audio (2002) [Online], http://amsl-smb.cs.uni-magdeburg.de/stirmark/doc/index.html
135. H. Zhao, M. Wu, Z.J. Wang, K.J.R. Liu, Nonlinear collusion attacks on independent
fingerprints for multimedia. Proc. ICASSP 5, 664–667 (2003)
136. J.H.L. Hansen, B.L. Pellom, An effective quality evaluation protocol for speech enhancement
algorithms, in Proceedings of International Conference on Spoken Language Processing
(INTERSPEECH), vol. 7, 1998, pp. 2819–2822
137. F. Mustiere, M. Bouchard, M. Bolic, Quality assessment of speech enhanced using particle
filters. Proc. ICASSP 3, 1197–1200 (2007)
138. Y. Hu, P.C. Loizou, Evaluation of objective quality measures for speech enhancement. IEEE
Trans. Audio Speech Lang. Process. 16(1), 229–238 (2008)
139. W.M. Liu, K.A. Jellyman, J.S.D. Mason, N.W.D. Evans, Assessment of objective quality
measures for speech intelligibility estimation. Proc. ICASSP 1, 1225–1228 (2006)
140. L. Di Persia, M. Yanagida, H.L. Rufiner, D. Milone, Objective quality evaluation in blind
source separation for speech recognition in a real room. Signal Process. 87(8), 1951–1965
(2007)
141. L. Di Persia, D. Milone, H.L. Rufiner, M. Yanagida, Perceptual evaluation of blind source
separation for robust speech recognition. Signal Process. 88(10), 2578–2583 (2008)
142. Y.Q. Lin, W.H. Abdulla, Perceptual evaluation of audio watermarking using objective quality
measures, in Proceedings of ICASSP, 2008, pp. 1745–1748
143. Y. Lin, W. Abdulla, Objective quality measures for perceptual evaluation in digital audio watermarking. IET Signal Process. 5(7), 623–631 (2011)
144. T. Rohdenburg, V. Hohmann, B. Kollmeier, Objective perceptual quality measures for the
evaluation of noise reduction schemes, in Proceedings of the 9th International Workshop on
Acoustic Echo and Noise Control (IWAENC), 2005, pp. 169–172
145. SDMI Portable Device Specification, Part 1 (Version 1.0). SDMI (1999) [Online]. http://ntrg.cs.tcd.ie/undergrad/4ba2.01/group10/technology.html
146. Call for Proposals for Phase II Screening Technology (Version 1.0). SDMI (2000) [Online]. http://ntrg.cs.tcd.ie/undergrad/4ba2.01/group10/technology.html