Online Hashing
Long-Kai Huang, Qiang Yang, and Wei-Shi Zheng
Abstract— Although hash function learning algorithms have achieved great success in recent years, most existing hash models are off-line, which are not suitable for processing sequential or online data. To address this problem, this paper proposes an online hash model to accommodate data coming in stream for online learning. Specifically, a new loss function is proposed to measure the similarity loss between a pair of data samples in Hamming space. Then, a structured hash model is derived and optimized in a passive-aggressive way. Theoretical analysis on the upper bound of the cumulative loss for the proposed online hash model is provided. Furthermore, we extend our online hashing (OH) from a single model to a multimodel OH that trains multiple models so as to retain diverse OH models in order to avoid biased update. The competitive efficiency and effectiveness of the proposed online hash models are verified through extensive experiments on several large-scale data sets as compared with related hashing methods.

Index Terms— Hashing, online hashing (OH).

I. INTRODUCTION

[…] information of data distribution or class labels would make significant improvement in fast search, more efforts are devoted to the data-dependent approach [10]–[16]. Data-dependent hashing methods are categorized into unsupervised [17]–[20], supervised [21]–[26], and semisupervised [27]–[29] hash models. In addition to these works, multiview hashing [30], [31], multimodal hashing [32]–[36], and active hashing [37], [38] have also been developed.

In the development of hash models, a challenge that remains unsolved is that most hash models are learned in an off-line or batch mode, that is, they assume all data are available in advance for learning the hash function. However, learning hash functions under such an assumption has the following critical limitations.

1) First, they are hard to train on very large-scale training data sets, since all learning data must be kept in memory, which is costly for processing.
[…] the updated hash model approximates the model learned in the last round as much as possible, for retaining the most historical discriminant information during the update. An upper bound on the cumulative similarity loss of the proposed online algorithm is derived, so that the performance of our online hash function learning can be guaranteed.

Since one-pass online learning only relies on the new data at the current round, the adaptation could easily be biased by the current round data. Hence, we introduce a multimodel online strategy in order to alleviate this kind of bias, where multiple OH models, not just one, are learned; they are expected to suit more diverse data pairs and are selectively updated. A theoretical bound on the cumulative similarity loss is also provided.

In summary, the contributions of this paper are as follows.

1) Developing a weakly supervised online hash function learning model. In our development, a novel similarity loss function is proposed to measure the difference of the hash codes of a pair of data samples in Hamming space. Following the similarity loss function is the prediction loss function, which penalizes violation of the given similarity between the hash codes in Hamming space. Detailed theoretical analysis is presented to give a theoretical upper loss bound for the proposed OH method.

2) Developing a multimodel OH (MMOH), in which a multimodel similarity loss function is proposed to guide the training of multiple complementary hash models.

The rest of this paper is organized as follows. In Section II, related studies are reviewed. In Section III, we present our online algorithm framework, including the optimization method. Section IV further elaborates one remaining issue from Section III, acquiring zero-loss binary codes. Then, we give analysis on the upper bound and time complexity of OH in Section V, and extend our algorithm to a multimodel one in Section VI. Experimental results are reported in Section VII, and finally we conclude this paper in Section VIII.

II. RELATED WORK

Online learning, especially one-pass online learning, plays an important role in processing large-scale data sets, as it is time and space efficient. It is able to learn a model based on streaming data, making dynamic update possible. In typical one-pass online learning algorithms [39], [40], when an instance is received, the algorithm makes a prediction, receives the feedback, and then updates the model based on this new data sample only. Generally, the performance of an online algorithm is guaranteed by the upper loss bound in the worst case.

There are many existing works on online algorithms for specific machine learning problems [39]–[43]. However, it is difficult to apply these online methods to online hash function learning, because the sign function used in hash models is nondifferentiable, which makes the optimization problem more difficult to solve. Although one can replace the sign function with sigmoid-type functions or other approximate differentiable functions and then apply gradient descent, this becomes an obstacle to deriving the loss bound. There are existing works considering active learning and online learning together [44]–[46], but they are not for hash function learning.

Although it is challenging to design hash models in an online learning mode, several hashing methods are related to online learning [22], [41], [47]–[50]. Jain et al. [41] realized an online LSH by applying an online metric learning algorithm, namely, LogDet Exact Gradient Online (LEGO), to LSH. Since [41] operates on LSH, which is a data-independent hash model, it does not directly optimize hash functions for generating compact binary codes in an online way. The other five related works can be categorized into two groups: one is the stochastic gradient descent (SGD)-based online methods, including minimal loss hashing (MLH) [22], online supervised hashing [49], and Adaptive Hashing (AdaptHash) [50]; the other is the matrix sketch-based methods, including online sketching hashing (OSH) [47] and stream spectral binary coding (SSBC) [48].

MLH [22] follows the loss-adjusted inference used in structured SVMs and deduces a convex–concave upper bound on the loss. Since MLH is a hash model relying on SGD update for optimization, it can naturally be used for online update. However, there are several limitations that make MLH unsuitable for online processing. First, the upper loss bound derived by MLH is related to the number of historical samples used from the beginning. In other words, the upper loss bound of MLH may grow as the number of samples increases and, therefore, its online performance cannot be guaranteed. Second, MLH assumes that all input data are centered (i.e., with zero mean), but such preprocessing is challenging for online learning, since all data samples are not available in advance.

Online supervised hashing [49] is an SGD version of the supervised hashing with error correcting codes (ECCs) algorithm [51]. It employs a 0-1 loss function that outputs either 1 or 0 to indicate whether the binary code generated by the existing hash model is in the codebook generated by the error correcting output codes algorithm. If it is not in the codebook, the loss is 1. After replacing the 0-1 loss with a convex loss function and dropping the nondifferentiable sign function in the hash function, SGD is applied to minimize the loss and update the hash model online. AdaptHash [50] is also an SGD-based method. It defines a loss function the same as the hinge-like loss function used in [22] and [52]. To minimize this loss, the authors approximated the hash function by a differentiable sigmoid function and then used SGD to optimize the problem in an online mode. Both online supervised hashing with error corrected codes (OECC) and AdaptHash do not assume that data samples have zero mean as MLH does. They handle the zero-mean issue by a method similar to the one in [52]. All three SGD-based hashing methods enable online update by applying SGD, but none of them can guarantee a constant loss upper bound.

OSH [47] was recently proposed to enable learning a hash model on stream data by combining PCA hashing [27] and matrix sketching [53]. It first sketches stream samples into a small matrix while guaranteeing an approximation of the data covariance, and then PCA hashing can be applied on this sketch matrix to learn a hash model. Sketching overcomes
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
HUANG et al.: OH 3
codes $h_i^t$ and $h_j^t$ is smaller than $\beta r$ for a dissimilar pair. These two measure the risk of utilizing the already learned hash projection matrix $W^t$ on a new pair of data points, i.e., $R(\mathbf{h}^t, s^t)$.

If the model learned in the last round predicts a zero-loss hash code pair on a new pair, that is, $R(\mathbf{h}^t, s^t) = 0$, our strategy is to retain the current model. If the model is unsatisfactory, i.e., it predicts inaccurate hash codes with $R(\mathbf{h}^t, s^t) > 0$, we need to update the inaccurate previous hash projection matrix $W^t$.

To update the hash model properly, we claim that the zero-loss hash code pair $\mathbf{g}^t = [g_i^t, g_j^t]$ for the current round data pair $\mathbf{x}^t$ satisfying $R(\mathbf{g}^t, s^t) = 0$ is available, and we use the zero-loss hash code pair to guide the update of the hash model, driving the updated $W^{t+1}$ toward a better prediction. We leave the details about how to obtain the zero-loss hash code pair to Section IV.

Now, we wish to obtain an updated hash projection matrix $W^{t+1}$ such that it predicts a hash code pair close to the zero-loss hash code pair $\mathbf{g}^t$ for the current input pair of data samples. Let us define

$$H^t(W) = h_i^{tT} W^T x_i^t + h_j^{tT} W^T x_j^t \quad (4)$$

$$G^t(W) = g_i^{tT} W^T x_i^t + g_j^{tT} W^T x_j^t. \quad (5)$$

Given hash function (2) with respect to $W^t$, we have $H^t(W^t) \ge G^t(W^t)$, since $h_i^t$ and $h_j^t$ are the binary solutions of the maximization for $x_i^t$ and $x_j^t$, respectively, while $g_i^t$ and $g_j^t$ are not. This also suggests that $W^t$ is not suitable for the generated binary codes to approach the zero-loss hash code pair $\mathbf{g}^t$, and thus a new projection $W^{t+1}$ has to be learned.

When updating the projection matrix from $W^t$ to $W^{t+1}$, we expect that the binary code generated for $x_i^t$ is $g_i^t$. According to the hash function in structured prediction form in (2), our expectation is to require $g_i^{tT} (W^{t+1})^T x_i^t > h_i^{tT} (W^{t+1})^T x_i^t$. Similarly, we expect the binary code generated for $x_j^t$ to be $g_j^t$, which requires $g_j^{tT} (W^{t+1})^T x_j^t > h_j^{tT} (W^{t+1})^T x_j^t$. Combining these two inequalities, the new $W^{t+1}$ should meet the condition $G^t(W^{t+1}) > H^t(W^{t+1})$. To achieve this objective, we derive the following prediction loss function $\ell^t(W)$ for our algorithm:

$$\ell^t(W) = H^t(W) - G^t(W) + \sqrt{R(\mathbf{h}^t, s^t)}. \quad (6)$$

In the above loss function, $\mathbf{h}^t$, $\mathbf{g}^t$, and $R(\mathbf{h}^t, s^t)$ are constants rather than variables dependent on $W^t$. $R(\mathbf{h}^t, s^t)$ can be treated as a loss penalization. When used in Criterion (7) later, a small $R(\mathbf{h}^t, s^t)$ means a slight update is expected, and a large $R(\mathbf{h}^t, s^t)$ means a large update is necessary. Note that the square root of the similarity loss function $R(\mathbf{h}^t, s^t)$ is utilized here, because it enables an upper bound on the cumulative loss functions, which will be shown in Section V-A.

Note that if $\ell^t(W^{t+1}) = 0$, we have $G^t(W^{t+1}) = H^t(W^{t+1}) + \sqrt{R(\mathbf{h}^t, s^t)} > H^t(W^{t+1})$. Let $\tilde{\mathbf{h}}^t$ be the hash codes of $\mathbf{x}^t$ computed using the updated $W^{t+1}$ by (2). Even though $G^t(W^{t+1}) > H^t(W^{t+1})$ cannot guarantee that $\tilde{\mathbf{h}}^t$ is exactly $\mathbf{g}^t$, it is probable that $\tilde{\mathbf{h}}^t$ is very close to $\mathbf{g}^t$ rather than $\mathbf{h}^t$. It, therefore, makes sense to force $\ell^t(W^{t+1})$ to be zero or close to zero.

Since we are formulating a one-pass learning algorithm, the previously observed data points are not available for learning in the current round, and the only information we can make use of is the current round projection matrix $W^t$. In this case, we force the newly learned $W^{t+1}$ to stay as close as possible to the projection matrix $W^t$, so as to preserve the information learned in the last round. Hence, the objective function for updating the hash projection matrix becomes

$$W^{t+1} = \arg\min_{W} \frac{1}{2}\|W - W^t\|_F^2 + C\xi \quad \text{s.t. } \ell^t(W) \le \xi \text{ and } \xi \ge 0 \quad (7)$$

where $\|\cdot\|_F$ is the Frobenius norm, $\xi$ is a nonnegative auxiliary variable that relaxes the constraint $\ell^t(W) = 0$ on the prediction loss function, and $C$ is a margin parameter to control the effect of the slack term, whose influence will be observed in Section VII. Through this objective function, the difference between the new projection matrix $W^{t+1}$ and the last one $W^t$ is minimized, and meanwhile the prediction loss $\ell^t(W)$ of the new $W^{t+1}$ is bounded by a small value. We call the above model the OH model.

Finally, we wish to provide a comment on the function $H^t(W)$ in (4) and (6). Actually, an optimal case would be to refine $H^t(W)$ as a function of both $W$ and a code pair $\mathbf{f} = [f_i, f_j] \in \{-1, 1\}^{r \times 2}$ as follows:

$$H^t(W, \mathbf{f}) = f_i^T W^T x_i^t + f_j^T W^T x_j^t \quad (8)$$

and then to refine the prediction loss function when an optimal update $W^{t+1}$ is used:

$$\ell^t(W^{t+1}) = \max_{\mathbf{f} \in \{-1,1\}^{r \times 2}} H^t(W^{t+1}, \mathbf{f}) - G^t(W^{t+1}) + \sqrt{R(\mathbf{h}^t, s^t)}. \quad (9)$$

In theory, this refinement makes $\max_{\mathbf{f}} H^t(W^{t+1}, \mathbf{f}) - G^t(W^{t+1})$ a more rigorous loss on approximating the zero-loss hash code pair $\mathbf{g}^t$. But it would be an obstacle to the optimization, since $W^{t+1}$ is unknown when $\max_{\mathbf{f}} H^t(W^{t+1}, \mathbf{f})$ is computed. Hence, we avert this problem by implicitly introducing an alternating optimization: first fixing $\mathbf{f}$ to be $\mathbf{h}^t$, then optimizing $W^{t+1}$ by Criterion (7), and finally predicting the best $\mathbf{f}$ for $\max_{\mathbf{f} \in \{-1,1\}^{r \times 2}} H^t(W^{t+1}, \mathbf{f})$. This process can be iterative. Although this may further improve our online model, we do not completely follow this implicit alternating processing to learn $W^{t+1}$ iteratively, because data arrive in sequence and a new data pair must be processed after each update of the projection matrix $W$. Hence, in our implementation, we only update $W^{t+1}$ once, and we provide the bound for $R(\mathbf{h}^t, s^t)$ under such processing in Theorem 2.

B. Optimization

When $R(\mathbf{h}^t, s^t) = 0$, $\mathbf{h}^t$ is the optimal code pair and $\mathbf{g}^t$ is the same as $\mathbf{h}^t$, and thus $\ell^t(W^t) = 0$. In this case, the solution to Criterion (7) is $W^{t+1} = W^t$. That is, when the already learned hash projection matrix $W^t$ can correctly predict the similarity label of the newly coming pair of data points $\mathbf{x}^t$, there is no
need to update the hash function. When $R(\mathbf{h}^t, s^t) > 0$, the solution is

$$W^{t+1} = W^t + \tau^t \mathbf{x}^t (\mathbf{g}^t - \mathbf{h}^t)^T, \quad \tau^t = \min\left\{C, \frac{\ell^t(W^t)}{\|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2}\right\}. \quad (10)$$

The procedure for deriving the solution (10) when $R(\mathbf{h}^t, s^t) > 0$ is detailed as follows. First, the objective function [see Criterion (7)] can be rewritten as follows when we introduce the Lagrange multipliers:

$$L(W, \tau^t, \xi, \lambda) = \frac{1}{2}\|W - W^t\|_F^2 + C\xi + \tau^t(\ell^t(W) - \xi) - \lambda\xi \quad (11)$$

where $\tau^t \ge 0$ and $\lambda \ge 0$ are the Lagrange multipliers. Then, by computing $\partial L/\partial W = 0$ and $\partial L/\partial \xi = 0$, we have

$$0 = \frac{\partial L}{\partial W} \;\Rightarrow\; W = W^t + \tau^t\left(x_i^t (g_i^t - h_i^t)^T + x_j^t (g_j^t - h_j^t)^T\right) = W^t + \tau^t \mathbf{x}^t (\mathbf{g}^t - \mathbf{h}^t)^T \quad (12)$$

$$0 = \frac{\partial L}{\partial \xi} = C - \tau^t - \lambda \;\Rightarrow\; \tau^t = C - \lambda. \quad (13)$$

Since $\lambda \ge 0$, we have $\tau^t \le C$. By putting (12) and (13) back into (6), we obtain

$$\ell^t(W) = -\tau^t \|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2 + \ell^t(W^t). \quad (14)$$

Also, by putting (12)–(14) back into (11), we have

$$L(\tau^t) = -\frac{1}{2}(\tau^t)^2 \|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2 + \tau^t \ell^t(W^t).$$

By taking the derivative of $L$ with respect to $\tau^t$ and setting it to zero, we get

$$0 = \frac{\partial L}{\partial \tau^t} = -\tau^t \|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2 + \ell^t(W^t) \;\Rightarrow\; \tau^t = \frac{\ell^t(W^t)}{\|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2}. \quad (15)$$

Since $\tau^t \le C$, we obtain

$$\tau^t = \min\left\{C, \frac{\ell^t(W^t)}{\|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2}\right\}. \quad (16)$$

In summary, the solution to the optimization problem in Criterion (7) is (10), and the whole procedure of the proposed OH is presented in Algorithm 1.

Algorithm 1 OH
  Initialize $W^1$
  for t = 1, 2, ... do
    Receive a pairwise instance $\mathbf{x}^t$ and similarity label $s^t$;
    Compute the hash code pair $\mathbf{h}^t$ of $\mathbf{x}^t$ by Eq. (1);
    Compute the similarity loss $R(\mathbf{h}^t, s^t)$ by Eq. (3);
    if $R(\mathbf{h}^t, s^t) > 0$ then
      Get the zero-loss code pair $\mathbf{g}^t$ that makes $R(\mathbf{g}^t, s^t) = 0$;
      Compute the prediction loss $\ell^t(W^t)$ by Eq. (6);
      Set $\tau^t = \min\{C, \ell^t(W^t)/\|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2\}$;
      Update $W^{t+1} = W^t + \tau^t \mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T$;
    else
      $W^{t+1} = W^t$;
    end if
  end for

C. Kernelization

The kernel trick is well known to make machine learning models better adapted to nonlinearly separable data [21]. In this context, a kernel-based OH is generated by employing an explicit kernel mapping to cope with the nonlinear modeling. In detail, we aim at mapping data in the original space $\mathbb{R}^d$ into a feature space $\mathbb{R}^m$ through a kernel function based on $m$ ($m < d$) anchor points, and, therefore, we have a new representation of $x$, which can be formulated as follows:

$$z(x) = [\kappa(x, x_{(1)}), \kappa(x, x_{(2)}), \ldots, \kappa(x, x_{(m)})]^T$$

where $x_{(1)}, x_{(2)}, \ldots, x_{(m)}$ are the $m$ anchors. For our online hash model learning, we assume that at least $m$ data points have been provided in the initial stage; otherwise, the online learning will not start until at least $m$ data points have been collected, and then these $m$ data points are used as the $m$ anchors in the kernel trick. Regarding the kernel used in this paper, we employ the Gaussian RBF kernel $\kappa(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, where we set $\sigma$ to 1 in our algorithm.

IV. ZERO-LOSS BINARY CODE PAIR INFERENCE

In Section III-A, we have mentioned that our OH algorithm relies on the zero-loss code pair $\mathbf{g}^t = [g_i^t, g_j^t]$, which satisfies $R(\mathbf{g}^t, s^t) = 0$. Now, we detail how to acquire $\mathbf{g}^t$.

A. Dissimilar Case

We first present the case for dissimilar pairs. As mentioned in Section III-A, to achieve zero similarity loss, the Hamming distance between the hash codes of nonneighbors should not be smaller than $\beta r$. Therefore, we need to seek $\mathbf{g}^t$ such that $D_h(g_i^t, g_j^t) \ge \beta r$. Denote the $k$th bit of $h_i^t$ by $h_{i[k]}^t$, and similarly we have $h_{j[k]}^t$, $g_{i[k]}^t$, and $g_{j[k]}^t$. Then, $D_h(h_i^t, h_j^t) = \sum_{k=1}^{r} D_h(h_{i[k]}^t, h_{j[k]}^t)$, where

$$D_h\left(h_{i[k]}^t, h_{j[k]}^t\right) = \begin{cases} 0, & \text{if } h_{i[k]}^t = h_{j[k]}^t \\ 1, & \text{if } h_{i[k]}^t \ne h_{j[k]}^t. \end{cases}$$

Let $K_1 = \{k \mid D_h(h_{i[k]}^t, h_{j[k]}^t) = 1\}$ and $K_0 = \{k \mid D_h(h_{i[k]}^t, h_{j[k]}^t) = 0\}$. To obtain $\mathbf{g}^t$, we first set $g_{i[k]}^t = h_{i[k]}^t$ and $g_{j[k]}^t = h_{j[k]}^t$ for $k \in K_1$, so as to retain the Hamming distance obtained through the hash model learned in the last round. Next, in order to increase the Hamming distance, we need to make $D_h(g_{i[k]}^t, g_{j[k]}^t) = 1$ for $k \in K_0$. That is, we need to set^1 either $g_{i[k]}^t = -h_{i[k]}^t$ or $g_{j[k]}^t = -h_{j[k]}^t$ for all $k \in K_0$. Hence, we can pick $p$ bits, whose indices are in set $K_0$, to change/update such that

$$D_h(g_i^t, g_j^t) = D_h(h_i^t, h_j^t) + p. \quad (17)$$

^1 The hash codes in our algorithm are $-1$ and $1$. Note that $g_{i[k]}^t = -h_{i[k]}^t$ means setting $g_{i[k]}^t$ to be different from $h_{i[k]}^t$.
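As a concrete illustration, the closed-form passive-aggressive step of Section III-B can be sketched in a few lines of NumPy. This is a minimal sketch under assumed shapes ($W$ is $d \times r$, each $x$ is a $d$-vector, codes are $\pm 1$ $r$-vectors); the function names are ours, not the paper's, and the square root of the similarity loss is passed in as a precomputed value. When $\tau$ is not clipped at $C$, the update drives the prediction loss (6) for the current pair to zero, which is exactly relation (14).

```python
import numpy as np

def prediction_loss(W, xi, xj, hi, hj, gi, gj, sqrtR):
    # l^t(W) = H^t(W) - G^t(W) + sqrt(R), from Eqs. (4)-(6)
    return float((hi - gi) @ (W.T @ xi) + (hj - gj) @ (W.T @ xj)) + sqrtR

def pa_update(W, xi, xj, hi, hj, gi, gj, sqrtR, C):
    """One passive-aggressive step, Eqs. (10)/(12):
    W^{t+1} = W^t + tau * x^t (g^t - h^t)^T,
    tau = min(C, l^t(W^t) / ||x^t (g^t - h^t)^T||_F^2)."""
    X = np.stack([xi, xj], axis=1)            # d x 2 data pair x^t
    D = np.stack([gi - hi, gj - hj], axis=1)  # r x 2 code difference g^t - h^t
    U = X @ D.T                               # d x r update direction x^t (g^t - h^t)^T
    norm2 = float(np.sum(U * U))              # squared Frobenius norm
    if norm2 == 0.0:                          # codes already optimal: passive step
        return W, 0.0
    tau = min(C, prediction_loss(W, xi, xj, hi, hj, gi, gj, sqrtR) / norm2)
    return W + tau * U, tau
```

With a large $C$, one step reduces the prediction loss of the current pair to (numerically) zero, matching (14).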
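The explicit kernel mapping $z(x)$ used for kernelization can be sketched as follows (a minimal sketch; the anchor-collection details are simplified, the function name is ours, and $\sigma = 1$ as stated in the text):

```python
import numpy as np

def kernel_map(anchors, x, sigma=1.0):
    """z(x) = [k(x, x_(1)), ..., k(x, x_(m))]^T with the Gaussian RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), mapping R^d into R^m."""
    sq_dists = np.sum((anchors - x) ** 2, axis=1)  # squared distances to the m anchors
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```

The hash model is then learned on $z(x)$ instead of $x$, so the projection matrix has $m$ rows rather than $d$.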
Now the problem is how to set $p$, namely, the number of hash bits to update. We first investigate the relationship between the update of the projection vectors and $\mathbf{g}^t$. Note that $W$ consists of $r$ projection vectors $w_k$ ($k = 1, 2, \ldots, r$). From (12), we can deduce that

$$w_k^{t+1} = w_k^t + \tau^t\left(x_i^t\left(g_{i[k]}^t - h_{i[k]}^t\right) + x_j^t\left(g_{j[k]}^t - h_{j[k]}^t\right)\right). \quad (18)$$

It can be found that $w_k^{t+1} = w_k^t$ when $g_{i[k]}^t = h_{i[k]}^t$ and $g_{j[k]}^t = h_{j[k]}^t$; otherwise, $w_k^t$ will be updated. So the more $w_k$ in $W$ we update, the more corresponding hash bits of all data points we subsequently have to update when applied to a real-world system. This takes considerable time, which cannot be ignored for online applications. Hence, we should change as few hash bits as possible; in other words, we aim to update as few $w_k$ as possible. This means that $p$ should be as small as possible while guaranteeing that $\mathbf{g}^t$ satisfies the constraint $R(\mathbf{g}^t, s^t) = 0$. Based on the earlier discussion, the minimum of $p$ is computed as $p_0 = \lceil \beta r \rceil - D_h(h_i^t, h_j^t)$ by setting $D_h(g_i^t, g_j^t) = \lceil \beta r \rceil$, as $p = D_h(g_i^t, g_j^t) - D_h(h_i^t, h_j^t)$ and $D_h(g_i^t, g_j^t) \ge \lceil \beta r \rceil \ge \beta r$. Then, $\mathbf{g}^t$ is ready by selecting $p_0$ hash bits whose indexes are in $K_0$.

After determining the number of hash bits to update, namely, $p_0$, the problem becomes which $p_0$ bits should be picked from $K_0$. To establish the rule, it is necessary to measure the potential loss for every bit of $h_i^t$ and $h_j^t$. For this purpose, the prediction loss function in (6) can be reformed as

$$\sum_{h_{i[k]}^t \ne g_{i[k]}^t} 2\, h_{i[k]}^t w_k^{tT} x_i^t + \sum_{h_{j[k]}^t \ne g_{j[k]}^t} 2\, h_{j[k]}^t w_k^{tT} x_j^t + \sqrt{R(\mathbf{h}^t, s^t)}.$$

This tells us that $h_{i[k]}^t w_k^{tT} x_i^t$ and $h_{j[k]}^t w_k^{tT} x_j^t$ are parts of the prediction loss, and thus we use them to measure the potential loss for every bit. The problem is which bit should be picked for optimization. For our one-pass online learning, a large update does not mean a good performance will be gained, since every time we update the model only based on a newly arrived pair of samples, a large change of the hash function would not suit the passed data samples very well. This also conforms to the spirit of the passive-aggressive idea that the change of an online model should be smooth. To this end, we take a conservative strategy by selecting the $p_0$ bits that correspond to the smallest potential loss, as introduced in the following. First, the potential loss of every bit with respect to $H^t(W^t)$ is calculated by

$$\delta_k = \min\left(h_{i[k]}^t w_k^{tT} x_i^t,\; h_{j[k]}^t w_k^{tT} x_j^t\right), \quad k \in K_0. \quad (19)$$

We only select the smaller one between $h_{i[k]}^t w_k^{tT} x_i^t$ and $h_{j[k]}^t w_k^{tT} x_j^t$, because we will never set $\mathbf{g}^t$ simultaneously by $g_{i[k]}^t = -h_{i[k]}^t$ and $g_{j[k]}^t = -h_{j[k]}^t$ for any $k \in K_0$. After sorting $\delta_k$, the $p_0$ smallest $\delta_k$ are picked, and their corresponding hash bits are updated by the following rule:

$$\begin{cases} g_{i[k]}^t = -h_{i[k]}^t, & \text{if } h_{i[k]}^t w_k^{tT} x_i^t \le h_{j[k]}^t w_k^{tT} x_j^t \\ g_{j[k]}^t = -h_{j[k]}^t, & \text{otherwise.} \end{cases} \quad (20)$$

The procedure of obtaining $\mathbf{g}^t$ for a dissimilar pair is summarized in Algorithm 2.

Algorithm 2 Inference of $\mathbf{g}^t$ for a Dissimilar Pair
  Calculate the Hamming distance $D_h(h_i^t, h_j^t)$ between $h_i^t$ and $h_j^t$;
  Calculate $p_0 = \lceil \beta r \rceil - D_h(h_i^t, h_j^t)$;
  Compute $\delta_k$ for $k \in K_0$ by Eq. (19);
  Sort $\delta_k$;
  Set the corresponding hash bits of the $p_0$ smallest $\delta_k$ opposite to the corresponding ones in $\mathbf{h}^t$ by following the rule in Eq. (20), and keep the others in $\mathbf{h}^t$ without change.

B. Similar Case

Regarding similar pairs, the Hamming distance of the optimal hash code pair $\mathbf{g}^t$ should be equal to or smaller than $\alpha r$. Since the Hamming distance between the predicted hash codes of similar pairs may be larger than $\alpha r$, we should pick $p_0$ bits from set $K_1$ instead of from set $K_0$, and set them opposite to the corresponding values in $\mathbf{h}^t$ so as to achieve $R(\mathbf{g}^t, s^t) = 0$. Similar to the case for dissimilar pairs discussed earlier, the number of hash bits to be updated is $p_0 = D_h(h_i^t, h_j^t) - \lfloor \alpha r \rfloor$, but these bits are selected from $K_1$. We compute $\delta_k$ for $k \in K_1$ and pick the $p_0$ bits with the smallest $\delta_k$ for update. Since the whole processing is similar to that for dissimilar pairs, we only summarize the processing for similar pairs in Algorithm 3 and skip the details.

Algorithm 3 Inference of $\mathbf{g}^t$ for a Similar Pair
  Calculate the Hamming distance $D_h(h_i^t, h_j^t)$ between $h_i^t$ and $h_j^t$;
  Calculate $p_0 = D_h(h_i^t, h_j^t) - \lfloor \alpha r \rfloor$;
  Compute $\delta_k$ for $k \in K_1$ by Eq. (19);
  Sort $\delta_k$;
  Set the corresponding hash bits of the $p_0$ smallest $\delta_k$ opposite to the corresponding values in $\mathbf{h}^t$ by following the rule in Eq. (20), and keep the others in $\mathbf{h}^t$ with no change.

Finally, when $w_k^t$ is a zero vector, $\delta_k$ is zero as well, no matter what the values of $h_{i[k]}^t$, $h_{j[k]}^t$, $x_i^t$, and $x_j^t$ are. This leads to a failure in selecting the hash bits to be updated. To avert this, we initialize $W^1$ by applying LSH. In other words, $W^1$ is sampled from a zero-mean multivariate Gaussian $\mathcal{N}(0, I)$, and we denote this matrix by $W_{LSH}$.

V. ANALYSIS

A. Bounds for Similarity Loss and Prediction Loss

In this section, we discuss the loss bounds for the proposed OH algorithm. For convenience, at step $t$, we define

$$\ell_U^t = \ell^t(U) = H^t(U) - G^t(U) + \sqrt{R(\mathbf{h}^t, s^t)} \quad (21)$$

where $U$ is an arbitrary matrix in $\mathbb{R}^{d \times r}$. Here, $\ell_U^t$ is considered as the prediction loss based on $U$ in the $t$th round.

We first present a lemma that will be utilized to prove Theorem 2.

Lemma 1: Let $(\mathbf{x}^1, s^1), \cdots, (\mathbf{x}^t, s^t)$ be a sequence of pairwise examples, each with a similarity label $s^t \in \{1, -1\}$. The data pair $\mathbf{x}^t \in \mathbb{R}^{d \times 2}$ is mapped to an $r$-bit hash code pair
$\mathbf{h}^t \in \mathbb{R}^{r \times 2}$ through the hash projection matrix $W^t \in \mathbb{R}^{d \times r}$. Let $U$ be an arbitrary matrix in $\mathbb{R}^{d \times r}$. If $\tau^t$ is defined as in (10), we then have

$$\sum_{t=1}^{\infty} \tau^t \left( 2\ell^t(W^t) - \tau^t \left\| \mathbf{x}^t (\mathbf{g}^t - \mathbf{h}^t)^T \right\|_F^2 - 2\ell_U^t \right) \le \|U - W^1\|_F^2$$

where $W^1$ is the initialized hash projection matrix, which consists of nonzero vectors.

Proof: Define

$$\Delta_t = \|W^t - U\|_F^2 - \|W^{t+1} - U\|_F^2.$$

Summing $\Delta_t$ telescopes:

$$\sum_{t=1}^{\infty} \Delta_t = \|W^1 - U\|_F^2 - \lim_{T \to \infty} \|W^{T+1} - U\|_F^2 \le \|W^1 - U\|_F^2. \quad (22)$$

From (12), we know $W^{t+1} = W^t + \tau^t \mathbf{x}^t (\mathbf{g}^t - \mathbf{h}^t)^T$, so we can rewrite $\Delta_t$ as

$$\begin{aligned}
\Delta_t &= \|W^t - U\|_F^2 - \left\| W^t - U + \tau^t \mathbf{x}^t (\mathbf{g}^t - \mathbf{h}^t)^T \right\|_F^2 \\
&= -2\tau^t \left( G^t(W^t) - H^t(W^t) - \left( G^t(U) - H^t(U) \right) \right) - (\tau^t)^2 \left\| \mathbf{x}^t (\mathbf{g}^t - \mathbf{h}^t)^T \right\|_F^2 \\
&= -2\tau^t \left( \left( \sqrt{R(\mathbf{h}^t, s^t)} - \ell^t(W^t) \right) - \left( \sqrt{R(\mathbf{h}^t, s^t)} - \ell_U^t \right) \right) - (\tau^t)^2 \left\| \mathbf{x}^t (\mathbf{g}^t - \mathbf{h}^t)^T \right\|_F^2 \\
&= \tau^t \left( 2\ell^t(W^t) - \tau^t \left\| \mathbf{x}^t (\mathbf{g}^t - \mathbf{h}^t)^T \right\|_F^2 - 2\ell_U^t \right)
\end{aligned}$$

where the second equality uses $\langle W, \mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T \rangle_F = G^t(W) - H^t(W)$, and the third uses (6) and (21). Summing the above over $t$ and putting (22) into the result proves Lemma 1.

Theorem 2: Let $(\mathbf{x}^1, s^1), \cdots, (\mathbf{x}^t, s^t)$ be a sequence of pairwise examples, each with a similarity label $s^t \in \{1, -1\}$ for all $t$. The data pair $\mathbf{x}^t \in \mathbb{R}^{d \times 2}$ is mapped to an $r$-bit hash code pair $\mathbf{h}^t \in \mathbb{R}^{r \times 2}$ through the hash projection matrix $W^t \in \mathbb{R}^{d \times r}$. If $\|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2$ is upper bounded by $F^2$ and the margin parameter $C$ is set as the upper bound of $\sqrt{R(\mathbf{h}^t, s^t)}/F^2$, then, for any matrix $U \in \mathbb{R}^{d \times r}$, the cumulative similarity loss is bounded, that is,

$$\sum_{t=1}^{\infty} R(\mathbf{h}^t, s^t) \le F^2 \left( \|U - W^1\|_F^2 + 2C \sum_{t=1}^{\infty} \ell_U^t \right).$$

Proof: By Lemma 1,

$$\sum_{t=1}^{\infty} \tau^t \left( 2\ell^t(W^t) - \tau^t \left\| \mathbf{x}^t (\mathbf{g}^t - \mathbf{h}^t)^T \right\|_F^2 - 2\ell_U^t \right) \le \|U - W^1\|_F^2. \quad (23)$$

From (16), $\tau^t \|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2 \le \ell^t(W^t)$. This deduces that

$$\tau^t \left( 2\ell^t(W^t) - \tau^t \left\| \mathbf{x}^t (\mathbf{g}^t - \mathbf{h}^t)^T \right\|_F^2 \right) \ge \tau^t \left( 2\ell^t(W^t) - \ell^t(W^t) \right) = \tau^t \ell^t(W^t). \quad (24)$$

According to the definition of the prediction loss function in (6) and the upper bound assumption $\|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2 \le F^2$, we know that for any $t$,

$$\sqrt{R(\mathbf{h}^t, s^t)} \le \ell^t(W^t) \quad \text{and} \quad \frac{\sqrt{R(\mathbf{h}^t, s^t)}}{F^2} \le C. \quad (25)$$

With these three inequalities and (16), it can be deduced that

$$\tau^t \ell^t(W^t) = \min\left\{ \frac{\ell^t(W^t)^2}{\|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2},\; C\,\ell^t(W^t) \right\} \ge \min\left\{ \frac{R(\mathbf{h}^t, s^t)}{\|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2},\; C \sqrt{R(\mathbf{h}^t, s^t)} \right\} \ge \frac{R(\mathbf{h}^t, s^t)}{F^2}. \quad (26)$$

By combining (23), (24), and (26), we obtain

$$\sum_{t=1}^{\infty} \frac{R(\mathbf{h}^t, s^t)}{F^2} \le \|U - W^1\|_F^2 + 2 \sum_{t=1}^{\infty} \tau^t \ell_U^t.$$

Since $\tau^t \le C$ for all $t$, we have

$$\sum_{t=1}^{\infty} R(\mathbf{h}^t, s^t) \le F^2 \left( \|U - W^1\|_F^2 + 2C \sum_{t=1}^{\infty} \ell_U^t \right). \quad (27)$$

The theorem is proved.

Note: In particular, if the assumption holds that there exists a projection matrix satisfying zero or negative prediction loss for any pair of data samples, then there exists a $U$ satisfying $\ell_U^t \le 0$ for any $t$. If so, the second term in inequality (27) vanishes. In other words, the similarity loss is bounded by a constant, i.e., $F^2 \|U - W^1\|_F^2$. After a certain amount of updates, the similarity loss for new data sample pairs becomes zero in such an optimistic case.
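The key step of Lemma 1's proof, rewriting $\Delta_t$ as $\tau^t(2\ell^t(W^t) - \tau^t\|\mathbf{x}^t(\mathbf{g}^t - \mathbf{h}^t)^T\|_F^2 - 2\ell_U^t)$, is a purely algebraic identity and can be checked numerically. A minimal sketch with randomly chosen quantities (all names and sample values are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 8
W = rng.normal(size=(d, r))                  # current model W^t
U = rng.normal(size=(d, r))                  # arbitrary comparison matrix
X = rng.normal(size=(d, 2))                  # data pair x^t
h = np.sign(W.T @ X); h[h == 0] = 1          # predicted code pair h^t (r x 2)
g = h.copy(); g[0, 0] = -g[0, 0]; g[2, 1] = -g[2, 1]  # a competing code pair g^t
sqrtR = 1.3                                  # sqrt of the similarity loss (any value)
tau = 0.05                                   # any step size tau^t

def ell(M):
    # l^t(M) = H^t(M) - G^t(M) + sqrt(R), from Eqs. (4)-(6) and (21)
    return float(np.sum((h - g) * (M.T @ X))) + sqrtR

Dir = X @ (g - h).T                          # x^t (g^t - h^t)^T
W_next = W + tau * Dir                       # Eq. (12)
lhs = np.sum((W - U) ** 2) - np.sum((W_next - U) ** 2)   # Delta_t
rhs = tau * (2 * ell(W) - tau * np.sum(Dir * Dir) - 2 * ell(U))
assert abs(lhs - rhs) < 1e-8                 # the identity used in Lemma 1 holds
```

The identity holds exactly for any $\tau$, $U$, and code pairs, which is why the telescoping sum in (22) immediately yields the lemma.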
VI. M ULTIMODEL O NLINE H ASHING and semantic neighbor search. First, four selected data sets are
In order to make the OH model more robust and less introduced in Section VII-A. And then, we evaluate the pro-
biased by current round update, we extend the proposed OH posed models in Section VII-B. Finally, we make comparison
from updating one single model to updating the T models. between the proposed algorithms and several related hashing
Suppose that we are going to train T models, which are models in Section VII-C.
initialized randomly by LSH. Each model is associated with
A. Data Sets
the optimization of its own similarity loss function in terms
of (3), denoted by Rm (tm , s t ) (m = 1, 2, · · · , T ), where tm is The four selected large-scale data sets are: Photo
the binary code of a new pair t predicted by the m th model Tourism [57], 22K LabelMe [58], GIST1M [59], and
at step t . At step t , if t is a similar pair, we only select CIFAR-10 [60], which are detailed in the following.
one of the T models to update. To do that, we compute the 1) Photo Tourism [57]: It is a large collection of 3-D
similarity loss function for each model Rm (tm , s t ), and then photographs, including three subsets, each of which has about
we select the model, supposed the m 0 th model that obtains the 100k patches with 64 × 64 grayscale. In the experiment,
smallest similarity loss, i.e., m 0 = arg minm Rm (tm , s t ). Note we selected one subset consisting of 104k patches taken from
that for a similar pair, it is enough that one of the models Half Dome in Yosemite. We extracted 512-D GIST feature
has positive output, and thus the selected model is the closest vector for each patch and randomly partitioned the whole data
one to suit this similar pair and is more easier to update. set into a training set with 98k patches and a testing set with
If t is a dissimilar pair, all models will be updated if the 6k patches. The pairwise label s t is generated based on the
corresponding loss is not zero, since we cannot tolerate an matching information. That is, s t is 1 if a pair of patches is
wrong prediction for a dissimilar pair. By performing OH in this way, we are able to learn diverse models that could fit different data samples locally. The update of each model follows the algorithm presented in Section III-B.

To guarantee the rationale of MMOH, we also provide the upper bound for the accumulative multimodel similarity loss in Theorem 3.

Theorem 3: Let $(\mathbf{x}^1, s^1), \ldots, (\mathbf{x}^t, s^t)$ be a sequence of pairwise examples, each with a similarity label $s^t \in \{1, -1\}$ for all $t$. The data pair $\mathbf{x}^t \in \mathbb{R}^{d \times 2}$ is mapped to an $r$-bit hash code pair $\mathbf{h}^t \in \mathbb{R}^{r \times 2}$ through the hash projection matrix $\mathbf{W}^t \in \mathbb{R}^{d \times r}$. Suppose $\|\mathbf{x}^t(\mathbf{h}_m^t - \tilde{\mathbf{h}}_m^t)^{\mathrm{T}}\|_F^2$ is upper bounded by $F^2$, and the margin parameter $C$ is set as the upper bound of $R_m^*(\mathbf{x}_m^t, s^t)/F^2$ for all $m$, where $R_m^*(\mathbf{x}_m^t, s^t)$ is an auxiliary function defined as

$$R_m^*(\mathbf{x}_m^t, s^t) = \begin{cases} R_m(\mathbf{x}_m^t, s^t), & \text{if the $m$th model is selected for update at step $t$} \\ 0, & \text{otherwise.} \end{cases} \qquad (28)$$

Then, for any matrix $\mathbf{U} \in \mathbb{R}^{d \times r}$, the cumulative similarity loss (3) is bounded, that is,

$$\sum_{t=1}^{\infty} \sum_{m=1}^{T} R_m^*(\mathbf{x}_m^t, s^t) \le T\Big(F^2\|\mathbf{U} - \mathbf{W}^1\|_F^2 + 2C\sum_{t=1}^{\infty} \ell_{\mathbf{U}}^t\Big)$$

where $C$ is the margin parameter defined in Criterion (7) and $\ell_{\mathbf{U}}^t$ denotes the loss suffered by $\mathbf{U}$ on the $t$th pair.

Proof: Based on Theorem 2, the following inequality holds for each $m = 1, 2, \ldots, T$:

$$\sum_{t=1}^{\infty} R_m^*(\mathbf{x}_m^t, s^t) \le F^2\|\mathbf{U} - \mathbf{W}^1\|_F^2 + 2C\sum_{t=1}^{\infty} \ell_{\mathbf{U}}^t.$$

Summing these multimodel similarity losses over all $T$ models proves Theorem 3.

matched; otherwise $s^t$ is $-1$.

2) 22K LabelMe [58]: It contains 22,019 images. In the experiment, each image was represented by a 512-D GIST feature vector. We randomly selected 2k images from the data set as the testing set and set the remaining images as the training set. To set the similarity label between two data samples, we followed [21] and [27]: if either one is within the top 5% nearest neighbors of the other measured by Euclidean distance, $s^t = 1$ (i.e., they are similar); otherwise $s^t = -1$ (i.e., they are dissimilar).

3) GIST1M [59]: It is a popular large-scale data set for evaluating hash models [20], [61]. It contains one million unlabeled data points, each represented by a 960-D GIST feature vector. In the experiment, we randomly picked 500,000 points for training and 1,000 nonoverlapping points for testing. Owing to the absence of label information, we utilize pseudolabel information by thresholding the top 5% of the whole data set as the true neighbors of an instance based on Euclidean distance, so every point has 50,000 neighbors.

4) CIFAR-10 and Tiny Image 80M [60]: CIFAR-10 is a labeled subset of the 80M Tiny Images collection [62]. It consists of ten classes, each containing 6k 32 × 32 color images, leading to 60k images in total. In the experiment, every image was represented by 2048-D deep features, and 59k samples were randomly selected to set up the training set, with the remaining 1k as queries to search through the whole 80M Tiny Image collection.

For measurement, the mean average precision (mAP) [63], [64] is used to evaluate the performance of the different algorithms; mAP is regarded as a better measure than precision and recall when evaluating the quality of retrieval results [63], [64]. All experiments were independently run on a server with an Intel Xeon X5650 CPU, 12-GB memory, and a 64-bit CentOS system.
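The evaluation protocol described above (top-5% Euclidean nearest neighbors taken as ground-truth "similar" items, then mAP computed over Hamming-ranked retrieval lists) can be sketched as follows. This is a minimal NumPy illustration under our own naming, not the authors' evaluation code; functions such as `top_percent_neighbors` and `hamming_rank` are hypothetical helpers for this sketch.

```python
import numpy as np

def top_percent_neighbors(train, queries, percent=0.05):
    # Ground truth: the closest `percent` of the training set (by Euclidean
    # distance) is treated as the set of true neighbors of each query.
    k = int(len(train) * percent)
    d2 = ((queries[:, None, :] - train[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]           # (n_query, k) indices

def hamming_rank(train_codes, query_codes):
    # Rank training items by Hamming distance to each query's binary code.
    dist = (query_codes[:, None, :] != train_codes[None, :, :]).sum(-1)
    return np.argsort(dist, axis=1, kind="stable")  # (n_query, n_train)

def mean_average_precision(ranked, relevant):
    # mAP: mean over queries of the average precision of each ranked list,
    # normalized by the number of relevant items per query.
    aps = []
    for order, rel in zip(ranked, relevant):
        rel = set(np.asarray(rel).tolist())
        hits, ap = 0, 0.0
        for i, idx in enumerate(order, start=1):
            if int(idx) in rel:
                hits += 1
                ap += hits / i
        aps.append(ap / max(len(rel), 1))
    return float(np.mean(aps))

# Toy run on synthetic data: 200 training points, 5 queries, 8-bit codes.
rng = np.random.default_rng(0)
train_x = rng.normal(size=(200, 16))
query_x = rng.normal(size=(5, 16))
train_c = rng.integers(0, 2, size=(200, 8))
query_c = rng.integers(0, 2, size=(5, 8))
gt = top_percent_neighbors(train_x, query_x)   # 5% of 200 -> 10 neighbors each
ranked = hamming_rank(train_c, query_c)
print(mean_average_precision(ranked, gt))      # a value in [0, 1]
```

Normalizing average precision by the total number of relevant items (rather than by the number retrieved) is the usual retrieval convention, so a query whose neighbors all appear at the top of the ranking contributes an AP of 1.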
HUANG et al.: OH 9
TABLE I. DEFAULT VALUES OF THE PARAMETERS IN MMOH
Fig. 6. mAP comparison results among different online hash models with
respect to different code lengths. (Best viewed in color.) (a) Photo Tourism.
(b) 22K LabelMe. (c) GIST1M. (d) CIFAR-10.
[5] M. Norouzi, A. Punjani, and D. Fleet, “Fast exact search in Hamming space with multi-index hashing,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 6, pp. 1107–1119, Jun. 2014.
[6] M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proc. ACM Symp. Theory Comput., 2002, pp. 380–388.
[7] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proc. ACM Symp. Comput. Geometry, 2004, pp. 253–262.
[8] O. Chum, J. Philbin, and A. Zisserman, “Near duplicate image detection: Min-hash and TF-IDF weighting,” in Proc. Brit. Mach. Vis. Conf., 2008, pp. 1–10.
[9] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in Proc. Int. Conf. Comput. Vis., 2009, pp. 2130–2137.
[10] Z. Tang, X. Zhang, and S. Zhang, “Robust perceptual image hashing based on ring partition and NMF,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 3, pp. 711–724, Mar. 2014.
[11] L. Zhang, Y. Zhang, X. Gu, J. Tang, and Q. Tian, “Scalable similarity search with topology preserving hashing,” IEEE Trans. Image Process., vol. 23, no. 7, pp. 3025–3039, Jul. 2014.
[12] A. Shrivastava and P. Li, “Densifying one permutation hashing via rotation for fast near neighbor search,” in Proc. Int. Conf. Mach. Learn., 2014, pp. 557–565.
[13] Z. Yu, F. Wu, Y. Zhang, S. Tang, J. Shao, and Y. Zhuang, “Hashing with list-wise learning to rank,” in Proc. ACM Special Interest Group Inf. Retr., 2014, pp. 999–1002.
[14] X. Liu, J. He, and B. Lang, “Multiple feature kernel hashing for large-scale visual search,” Pattern Recognit., vol. 47, no. 2, pp. 748–757, Feb. 2014.
[15] B. Wu, Q. Yang, W. Zheng, Y. Wang, and J. Wang, “Quantized correlation hashing for fast cross-modal search,” in Proc. Int. Joint Conf. Artif. Intell., 2015, pp. 3946–3952.
[16] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised discrete hashing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 37–45.
[17] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Proc. Neural Inf. Process. Syst., 2008, pp. 1753–1760.
[18] W. Liu, J. Wang, S. Kumar, and S. Chang, “Hashing with graphs,” in Proc. Int. Conf. Mach. Learn., 2011, pp. 1–8.
[19] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 817–824.
[20] J. Heo, Y. Lee, J. He, S. Chang, and S. Yoon, “Spherical hashing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2957–2964.
[21] W. Liu, J. Wang, R. Ji, Y. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2074–2081.
[22] M. Norouzi and D. Fleet, “Minimal loss hashing for compact binary codes,” in Proc. Int. Conf. Mach. Learn., 2011, pp. 353–360.
[23] Y. Mu, J. Shen, and S. Yan, “Weakly-supervised hashing in kernel space,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3344–3351.
[24] P. Zhang, W. Zhang, W.-J. Li, and M. Guo, “Supervised hashing with latent factor models,” in Proc. ACM Special Interest Group Inf. Retr., 2014, pp. 173–182.
[25] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan, “Supervised hashing for image retrieval via image representation learning,” Assoc. Adv. Artif. Intell., vol. 1, no. 1, pp. 2156–2162, 2014.
[26] G. Lin, C. Shen, Q. Shi, A. V. D. Hengel, and D. Suter, “Fast supervised hashing with decision trees for high-dimensional data,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1971–1978.
[27] J. Wang, O. Kumar, and S. Chang, “Semi-supervised hashing for scalable image retrieval,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3424–3431.
[28] J. Wang, S. Kumar, and S. Chang, “Sequential projection learning for hashing with compact codes,” in Proc. Int. Conf. Mach. Learn., 2010, pp. 1127–1134.
[29] J. Cheng, C. Leng, P. Li, M. Wang, and H. Lu, “Semi-supervised multi-graph hashing for scalable similarity search,” Comput. Vis. Image Understand., vol. 124, pp. 12–21, Jul. 2014.
[30] N. Quadrianto and C. H. Lampert, “Learning multi-view neighborhood preserving projections,” in Proc. Int. Conf. Mach. Learn., 2011, pp. 425–432.
[31] M. Rastegari, J. Choi, S. Fakhraei, D. Hal, and L. Davis, “Predictable dual-view hashing,” in Proc. Int. Conf. Mach. Learn., 2013, pp. 1328–1336.
[32] D. Zhang and W.-J. Li, “Large-scale supervised multimodal hashing with semantic correlation maximization,” Assoc. Adv. Artif. Intell., vol. 1, no. 2, pp. 2177–2183, 2014.
[33] J. Masci, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber, “Multimodal similarity-preserving hashing,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 4, pp. 824–830, Apr. 2014.
[34] Y. Wei, Y. Song, Y. Zhen, B. Liu, and Q. Yang, “Scalable heterogeneous translated hashing,” in Proc. ACM Special Interest Group Knowl. Discovery Data Mining, 2014, pp. 791–800.
[35] D. Wang, X. Gao, X. Wang, and L. He, “Semantic topic multimodal hashing for cross-media retrieval,” in Proc. Int. Joint Conf. Artif. Intell., 2015, pp. 3890–3896.
[36] D. Wang, X. Gao, X. Wang, L. He, and B. Yuan, “Multimodal discriminative binary embedding for large-scale cross-modal retrieval,” IEEE Trans. Image Process., vol. 25, no. 10, pp. 4540–4554, Oct. 2016.
[37] Y. Zhen and D.-Y. Yeung, “Active hashing and its application to image and text retrieval,” Data Mining Knowl. Discovery, vol. 26, no. 2, pp. 255–274, 2013.
[38] Q. Wang, L. Si, Z. Zhang, and N. Zhang, “Active hashing with joint data example and tag selection,” in Proc. ACM Special Interest Group Inf. Retr., 2014, pp. 405–414.
[39] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, “Online passive-aggressive algorithms,” J. Mach. Learn. Res., vol. 7, pp. 551–585, Dec. 2006.
[40] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity through ranking,” J. Mach. Learn. Res., vol. 11, pp. 1109–1135, Jan. 2010.
[41] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, “Online metric learning and fast similarity search,” in Proc. Neural Inf. Process. Syst., 2008, pp. 761–768.
[42] Y. Li and P. M. Long, “The relaxed online maximum margin algorithm,” in Proc. Neural Inf. Process. Syst., 1999, pp. 498–504.
[43] M. K. Warmuth and D. Kuzmin, “Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension,” J. Mach. Learn. Res., vol. 9, pp. 2287–2320, Oct. 2008.
[44] J. Silva and L. Carin, “Active learning for online Bayesian matrix factorization,” in Proc. ACM Conf. Knowl. Discovery Data Mining, 2012, pp. 325–333.
[45] A. Bordes, S. Ertekin, J. Weston, and L. Bottou, “Fast kernel classifiers with online and active learning,” J. Mach. Learn. Res., vol. 6, pp. 1579–1619, Sep. 2005.
[46] W. Chu, M. Zinkevich, L. Li, A. Thomas, and B. Tseng, “Unbiased online active learning in data streams,” in Proc. ACM Conf. Knowl. Discovery Data Mining, 2011, pp. 195–203.
[47] C. Leng, J. Wu, J. Cheng, X. Bai, and H. Lu, “Online sketching hashing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 2503–2511.
[48] M. Ghashami and A. Abdullah. (Mar. 2015). “Binary coding in stream.” [Online]. Available: https://arxiv.org/abs/1503.06271
[49] F. Çakir and S. Sclaroff, “Online supervised hashing,” in Proc. IEEE Int. Conf. Image Process., Jun. 2015, pp. 2606–2610.
[50] F. Çakir and S. Sclaroff, “Adaptive hashing for fast similarity search,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1044–1052.
[51] F. Çakir and S. Sclaroff, “Supervised hashing with error correcting codes,” in Proc. ACM Int. Conf. Multimedia, 2014, pp. 785–788.
[52] L.-K. Huang, Q. Yang, and W.-S. Zheng, “Online hashing,” in Proc. Int. Joint Conf. Artif. Intell., 2013, pp. 1422–1428.
[53] E. Liberty, “Simple and deterministic matrix sketching,” in Proc. ACM Conf. Knowl. Discovery Data Mining, 2013, pp. 581–588.
[54] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu, “Complementary hashing for approximate nearest neighbor search,” in Proc. Int. Conf. Comput. Vis., 2011, pp. 1631–1638.
[55] T. Finley and T. Joachims, “Training structural SVMs when exact inference is intractable,” in Proc. Int. Conf. Mach. Learn., 2008, pp. 304–311.
[56] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, “Large margin methods for structured and interdependent output variables,” J. Mach. Learn. Res., vol. 6, pp. 1453–1484, Sep. 2005.
[57] N. Snavely, S. M. Seitz, and R. Szeliski, “Photo tourism: Exploring photo collections in 3D,” ACM Trans. Graph., vol. 25, no. 3, pp. 835–846, 2006.
[58] A. Torralba, R. Fergus, and Y. Weiss, “Small codes and large image databases for recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.
[59] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.
[60] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
[61] Z. Jin, C. Li, Y. Lin, and D. Cai, “Density sensitive hashing,” IEEE Trans. Cybern., vol. 44, no. 8, pp. 1362–1371, Aug. 2014.
[62] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, Nov. 2008.
[63] A. Turpin and F. Scholer, “User performance versus precision measures for simple search tasks,” in Proc. ACM Special Interest Group Inf. Retr., 2006, pp. 11–18.
[64] C. Wu, J. Zhu, D. Cai, C. Chen, and J. Bu, “Semi-supervised nonlinear hashing using bootstrap sequential projection learning,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 6, pp. 1380–1393, Jun. 2013.
[65] M. Norouzi, A. Punjani, and D. J. Fleet, “Fast search in Hamming space with multi-index hashing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3108–3115.
[66] L. Huang and S. J. Pan, “Class-wise supervised hashing with label embedding and active bits,” in Proc. Int. Joint Conf. Artif. Intell., 2016, pp. 1585–1591.

Long-Kai Huang received the B.Eng. degree from the School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China, in 2013. He is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Nanyang Technological University, Singapore.
His current research interests include machine learning and computer vision, with a special focus on fast large-scale image search, nonconvex optimization, and its applications.

Qiang Yang received the M.S. degree from the School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China, in 2014, where he is currently pursuing the Ph.D. degree with the School of Data and Computer Science.
His current research interests include data mining algorithms, machine learning algorithms, evolutionary computation algorithms, and their applications to real-world problems.

Wei-Shi Zheng is currently a Professor with Sun Yat-sen University. He has authored over 90 papers, including over 60 publications in main journals, such as the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, the IEEE TRANSACTIONS ON IMAGE PROCESSING, and Pattern Recognition, and top conferences, such as the International Conference on Computer Vision, the Computer Vision and Pattern Recognition, and the International Joint Conference on Artificial Intelligence. His current research interests include distance metric learning, algorithms for fast search, and deep learning methods, with their particular applications to person/object association and activity understanding in visual surveillance.
He has joined the Microsoft Research Asia Young Faculty Visiting Programme. He was a recipient of the Excellent Young Scientists Fund of the NSFC and the Royal Society-Newton Advanced Fellowship.