Human Attribute Recognition by Rich Appearance Dictionary
Figure 2. Two region decomposition methods based on the image grid: (left) the spatial pyramid [14] and (right) our multiscale overlapping windows. The spatial pyramid subdivides the image region into four quadrants recursively, while we use all rectangular subregions on the grid, which is similar to [20, 18].

Figure 3. Window-specific part learning. For every window on the grid, we learn a set of part detectors from clustered image patches in the training set. Each learned detector is reapplied to the images and refined.

geometric and an appearance basis. We first need to specify what kinds of region primitives are allowed to decompose the whole image region into subregions at the part level (Sec. 3.1). Then, we discuss how to learn appearance models to explain the local appearance of each part (Sec. 3.2).

3.1. Geometric Configuration

Given no prior knowledge of human body ontology, our objective is to define a geometric basis (i.e., a region decomposition) which is expressive enough to capture parts of arbitrary shapes but is still of manageable complexity. While there exist simpler methods such as the spatial pyramid [14] or uniform partitioning, where all subregions are squares, it is difficult to represent many body parts such as arms and legs with squares, and moreover, we do not know what would be sufficient. Therefore, we examine many possible subregions from which we can learn many part candidates, some of which will be pruned in later stages. We only use rectangular subregions to limit the complexity, but allow them to be of arbitrary aspect ratios and sizes.

Specifically, our method starts by dividing the image lattice into a number of overlapping subregions. In this paper, we refer to each subregion as a window. We define a grid of size W x H (we use W = 6, H = 9 and let the unit cell have an aspect ratio of 2:3 in the experiments), and any rectangle on the grid containing one or more cells of the grid forms a window. Fig. 2 illustrates our region decomposition method in comparison with the spatial pyramid matching (SPM) structure [14]. Both methods are based on a spatial grid on the image. The SPM recursively divides the region into four quadrants, and thus all subregions are squares that do not overlap with each other at the same level. In contrast, we allow more flexibility in the shape, size, and location of part windows. Another important difference between our approach and SPM is that we treat each window as a template, represented by a set of detectors that can be deformed locally, whereas each region in SPM is used for spatial pooling.

The advantage of flexible geometric partitioning has also been advocated in the recent literature on scene modeling and object recognition [11, 20, 18, 1]. In particular, our grid decomposition method is very similar to the initial step of [20, 18], which then further attempt to pick a subset of good windows and represent each image with a small number of non-overlapping windows, reducing to a single best configuration through explicit parsing. However, we empirically found that allowing many overlapping windows leads to better performance; therefore, we only prune inferior part templates at a later stage and do not eliminate or suppress any windows. In other words, our method allows all configurations, each of which is implicitly defined and partially contributes to explaining each image.
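To make the window set concrete, the following is a minimal sketch (Python; the helper name is hypothetical) of enumerating every rectangular window on such a W x H cell grid. For W = 6 and H = 9, this enumeration yields 945 windows.

    from itertools import product

    def enumerate_windows(grid_w=6, grid_h=9):
        """Enumerate all rectangular windows on a grid_w x grid_h cell grid.

        A window is any axis-aligned rectangle covering one or more cells,
        returned as (x0, y0, x1, y1) in cell coordinates (inclusive bounds).
        """
        windows = []
        for x0, x1 in product(range(grid_w), repeat=2):
            if x1 < x0:
                continue
            for y0, y1 in product(range(grid_h), repeat=2):
                if y1 < y0:
                    continue
                windows.append((x0, y0, x1, y1))
        return windows

    if __name__ == "__main__":
        wins = enumerate_windows()
        # 6*7/2 = 21 horizontal spans times 9*10/2 = 45 vertical spans = 945 windows.
        print(len(wins))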
3.2. Part Appearance Learning

Once we define all windows, we visit each window and learn a set of part detectors that are spatially associated with that particular window. Our motivation is that the human body consists of a number of parts which are usually spatially constrained, i.e., anchored at their canonical positions.

Fig. 3 shows the general procedure and examples. For each window, w_i, we first crop all the corresponding image patches from the entire set of training images. Then each patch is represented by a feature descriptor. We use the Histogram of Oriented Gradients (HOG) [6] and a color histogram as the low-level features of image patches. On the extracted features, we perform K-means clustering and obtain K = 50 clusters, {v_1^i, ..., v_K^i}. Each obtained cluster represents a specific appearance type of a part. Since the initial clusters are noisy, we first train a local part detector for each cluster by logistic regression as an initial detector and then iteratively refine it by applying it to the entire training set again and updating the best location and scale. We mine the negative patches from the regions outside the given bounding boxes. At the initial iteration, we discard noisy part candidates by cross validation and limit the maximum number of useful parts to 30 (we discuss the choice of this quantity in the experimental section).
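As an illustration of the clustering and detector-training step above, here is a minimal sketch (Python with scikit-learn; variable names and parameter values are illustrative, not the authors' exact settings) that groups the patches of one window into K = 50 appearance types and fits an initial logistic-regression detector per cluster. The iterative refinement and the cross-validation pruning described above are omitted.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    def learn_window_parts(pos_feats, neg_feats, n_clusters=50):
        """Cluster the patches of one window into appearance types and train
        one initial logistic-regression part detector per cluster.

        pos_feats: (N_pos, D) descriptors of patches cropped at this window.
        neg_feats: (N_neg, D) descriptors mined outside the bounding boxes.
        Returns a list of (cluster_center, detector) pairs.
        """
        kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        labels = kmeans.fit_predict(pos_feats)

        detectors = []
        for k in range(n_clusters):
            members = pos_feats[labels == k]
            if len(members) == 0:          # skip empty clusters
                continue
            X = np.vstack([members, neg_feats])
            y = np.concatenate([np.ones(len(members)), np.zeros(len(neg_feats))])
            clf = LogisticRegression(max_iter=1000).fit(X, y)
            detectors.append((kmeans.cluster_centers_[k], clf))
        return detectors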
The detection score, g, of an image I for a part v_k^i can be expressed as follows:

    g(v_k^i | I^i) = log [ P(v_k^i = + | I^i) / P(v_k^i = - | I^i) ],    (1)

where I^i is the image subregion occupied by the window w_i. We transform this log posterior ratio by a logistic function, as in [2]:

it is possible that the same image can be included in multiple clusters as long as its part detection score is positive (g(v_k^i | I^i) > 0). Then, for each attribute, we have two disjoint subsets of the image patches in the cluster, the positive and the negative. Using the same image features used for detection, we train an attribute classifier for an individual attribute, a_j, by another logistic regression as follows:
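To make Eq. (1) concrete: for a logistic-regression part detector with weight vector w and bias b, the log posterior ratio is simply the linear score w.x + b on the patch descriptor x, and the logistic (sigmoid) function maps it back to a probability. A minimal numpy sketch with hypothetical values:

    import numpy as np

    def part_score(w, b, x):
        """Log posterior ratio g = log(P(+|x) / P(-|x)) of a logistic detector."""
        return float(np.dot(w, x) + b)

    def to_probability(g):
        """Logistic transform of the log-odds score back to P(+|x)."""
        return 1.0 / (1.0 + np.exp(-g))

    # A positive score means the patch is more likely to show this part type.
    w, b = np.array([0.8, -0.3, 1.2]), -0.1
    x = np.array([1.0, 0.5, 0.2])
    g = part_score(w, b, x)          # log-odds, cf. Eq. (1)
    p = to_probability(g)            # P(part present | patch), in (0, 1)
    print(g, p)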
Figure 4. (a) (top) Images from the dataset of the Poselet-based approach, copied from [2]. This dataset exhibits a huge variation of pose, viewpoint, and appearance type of people. (a) (middle) A few selected, extremely challenging images from the same set. Either the bounding box (to indicate the person of interest) is ambiguous in a cluttered scene, the resolution is too low, or occlusion and truncation are too severe. (b) Randomly selected images from the HAT Database [15]. All the bounding boxes have been obtained by a person detector [8], therefore the complexity is limited.

Datasets. For evaluation of our approach, we use two publicly available datasets of human images labeled with attributes: the dataset of Poselet [2] and the Database of Human Attributes (HAT) [15]. Fig. 4 shows examples taken from both sets. Each set has been constructed in a different way. We discuss the details in the following subsections.

5.1. Attribute Classification on Dataset of Poselet

The dataset contains 3864 training images and 3600 testing images, each of which is labeled with 9 binary attributes. Fig. 4 shows a few examples from the Poselets dataset. Each image is manually annotated with a visible bounding box of each person. This bounding box is provided at training as well as testing time, i.e., detection is given. Since these boxes, which cover the visible parts of humans, do not provide any alignment, it is very challenging to learn or detect the parts from them. Also, the evaluation may be problematic because the indication of the person of interest is sometimes ambiguous in crowded scenes (Fig. 4), and such a box is difficult to obtain in fully automated systems, which would typically deploy a person detector prior to attribute inference; such a detector would provide alignment at the level of the full body or upper body.

Therefore, we first aligned all the training images by using two keypoints on the head and middle-hip and trained upper-body detectors. By applying these detectors to the images (while penalizing deviation from the original boxes), we obtained roughly aligned boxes. These aligned boxes are simply enlarged from the outputs of the upper-body detectors and share the same aspect ratio.
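The paper does not spell out the alignment transform; as one plausible illustration (an assumption, not the authors' exact procedure), a similarity transform mapping the head and middle-hip keypoints of each image onto fixed canonical positions can be computed from the two point correspondences alone. The helper name and canonical coordinates below are hypothetical.

    import numpy as np

    def similarity_from_two_points(src, dst):
        """Similarity transform (rotation, scale, translation) mapping the two
        source keypoints onto the two destination keypoints.

        src, dst: arrays of shape (2, 2) holding [head_xy, midhip_xy].
        Returns a function that maps (N, 2) points into the aligned frame.
        """
        s = src[:, 0] + 1j * src[:, 1]        # keypoints as complex numbers
        d = dst[:, 0] + 1j * dst[:, 1]
        a = (d[1] - d[0]) / (s[1] - s[0])     # rotation + scale
        b = d[0] - a * s[0]                   # translation
        def apply(points):
            z = points[:, 0] + 1j * points[:, 1]
            w = a * z + b
            return np.stack([w.real, w.imag], axis=1)
        return apply

    # Map each image's head / middle-hip annotations to canonical positions.
    canonical = np.array([[64.0, 32.0], [64.0, 160.0]])    # illustrative frame
    keypoints = np.array([[150.0, 80.0], [170.0, 300.0]])  # head, middle-hip
    warp = similarity_from_two_points(keypoints, canonical)
    print(warp(keypoints))   # maps the keypoints onto the canonical positions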
For a fair comparison, we used the same set of image features as [2], HOG and a color histogram. While the Poselet-based approach additionally uses the human skin tone as another feature, we do not use any extra image features in this paper because our goal is weakly-supervised learning, which requires no supervision on skin. The total number of parts was set to 1000, while [2] used 1200 parts.
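A minimal sketch of a HOG-plus-color-histogram patch descriptor of the kind referred to above, using scikit-image for HOG and simple per-channel histograms for color; the parameter values are illustrative and not taken from the paper.

    import numpy as np
    from skimage.color import rgb2gray
    from skimage.feature import hog

    def patch_descriptor(patch, n_color_bins=8):
        """HOG + per-channel color histogram for one image patch.

        patch: float RGB array in [0, 1], e.g. of shape (64, 64, 3).
        """
        hog_feat = hog(rgb2gray(patch),
                       orientations=9,
                       pixels_per_cell=(8, 8),
                       cells_per_block=(2, 2),
                       feature_vector=True)
        color_feat = np.concatenate([
            np.histogram(patch[..., c], bins=n_color_bins, range=(0.0, 1.0),
                         density=True)[0]
            for c in range(3)
        ])
        return np.concatenate([hog_feat, color_feat])

    # Example on a random patch; real patches would be cropped at each window.
    desc = patch_descriptor(np.random.rand(64, 64, 3))
    print(desc.shape)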
Table 1 shows the full comparison, where our full model outperforms the Poselet-based approaches in 8 out of 9 attributes. Note that the full model denotes the approach using multiscale overlapping windows and the rich appearance part dictionary as we have discussed in this paper. In order to verify the contribution of each factor to the final performance, we conducted two additional tests as follows.

Rich Appearance Dictionary. We have argued that it is important to learn a rich appearance dictionary that can address the appearance variation of parts effectively. We validate this argument by varying the number of parts learned at each window, K, ranging from 1 to 30. However, we still perform clustering to form many clusters and then choose the K best part candidates, judged by cross-validation detection score.

Table 3 shows the performance improvement according to K, and this result supports the importance of a rich visual dictionary. In particular, having many parts per window is important for subtle attributes, such as glasses. Note that K = 1 does not mean that we only have one part template for each true part: since we have multiscale overlapping windows, we can still have many other templates learned at the other windows. This can also explain why the gender attribute, whose cues would be more distributed over many subregions as a global attribute, has the least gain from increasing K.

Multiscale Overlapping Windows. We also tested the effect of the multiscale overlapping window structure used in our approach. Table 1 (b) shows the performance when we only used a set of non-overlapping windows at a single layer, which reduces to a simple grid decomposition, and row (c) shows the result when we use the windows at two additional layers as in the spatial pyramid scheme. Neither method performed as well as the full approach.
                                      male    long    glasses  hat     t-shirt  long     shorts  jeans   long    Mean
                                              hair                              sleeves                  pants   AP
Base Frequency                        .593    .300    .220     .166    .235     .490     .179    .338    .747    .363
Ours (2)       (a) Full               .880    .801    .560     .754    .535     .752     .476    .693    .911    .707
               (b) Uniform Grid       .857    .734    .429     .631    .405     .687     .349    .560    .862    .613
               (c) Spatial Pyramid    .857    .725    .407     .641    .429     .707     .356    .565    .886    .620
Poselet (33)   (d) Full *             .824    .725    .556     .601    .512     .742     .455    .547    .903    .652
               (e) No context *       .829    .700    .489     .537    .430     .743     .392    .533    .878    .615
               (f) No skintone        .825    .732    .561     .603    .484     .663     .330    .428    .850    .608

Table 1. The attribute classification performance (AP) on the dataset of Poselet [2]. The number in parentheses is the number of keypoints used in learning of each method. * indicates the methods that use an additional image feature (skin tone).
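Assuming the per-attribute numbers in Table 1 are standard average precision (AP) and the final column is their mean, they could be computed along these lines (scikit-learn, with hypothetical score and label arrays):

    import numpy as np
    from sklearn.metrics import average_precision_score

    # Hypothetical per-attribute prediction scores and ground-truth labels.
    scores = {"male": np.array([0.9, 0.2, 0.7]), "glasses": np.array([0.1, 0.8, 0.4])}
    labels = {"male": np.array([1, 0, 1]),       "glasses": np.array([0, 1, 0])}

    aps = {a: average_precision_score(labels[a], scores[a]) for a in scores}
    mean_ap = float(np.mean(list(aps.values())))
    print(aps, mean_ap)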
from person bounding box region. However, the advantage of our method is to learn a common dictionary shared by all attribute categories, whereas the EPM uses a separate dictionary for each category.

5.3. Discriminative Parts for Localization

Finally, we discuss the issue of the discriminative parts by providing qualitative results. A quantitative evaluation is difficult because neither dataset provides the required ground-truth annotation. The most discriminative part for an attribute is the part whose contribution to the attribute prediction is the biggest. We measure this from the product of the weights of the final logistic regression classifier and the part-attribute feature of each image, φ(I). Fig. 6 shows examples in the testing set (from the Poselets dataset) which output the most positive and negative responses for five attribute categories. We denote the most contributing, most discriminative part window for each image by blue boxes. Although there are some meaningless activations (for example, NOT JEANS was inferred from the head), most parts show reasonable localization.

Figure 6. The qualitative results of attribute classification for the male, glasses, hat, t-shirt, and jeans categories. (left) The most positive and (right) the most negative images for each attribute. The red boxes denote the bounding boxes and each blue box represents a part detection whose contribution to the prediction is the biggest.

Fig. 5 shows the most discriminative parts for five selected attributes. We measure this by the correlation between attribute labels and the part-attribute feature. As one can easily see, the most discriminative Poselets are unbiased detectors which would respond to both female and male. In contrast, our discriminative parts have distinct polarities.

Figure 5. The most discriminative parts in the Poselet-based approach [2] and our learned model. Our rich dictionary distinguishes many different appearance part types, which are directly informative for attribute classification, while the selected poselets are generic parts.
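To make the part-contribution measure above concrete, here is a minimal numpy sketch (hypothetical weight and feature arrays) that picks the window whose term in the linear classifier score is largest:

    import numpy as np

    def most_discriminative_part(w, phi):
        """Index of the part contributing most to the attribute prediction.

        w:   (P,) weights of the final logistic-regression attribute classifier.
        phi: (P,) part-attribute feature of one image (one entry per part).
        """
        contributions = w * phi          # per-part contribution to the linear score
        return int(np.argmax(contributions)), contributions

    # Hypothetical example with 5 parts.
    w = np.array([1.2, -0.4, 0.0, 2.1, 0.3])
    phi = np.array([0.5, 0.9, 0.2, 0.6, 0.1])
    best, contrib = most_discriminative_part(w, phi)
    print(best, contrib)   # part 3 has the largest w_j * phi_j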
6. Conclusion

We presented an approach to the problem of human attribute recognition from human body parts. We argue that it is critical to learn a rich appearance visual dictionary to handle the appearance variation of parts, as well as to use a flexible and expressive geometric basis. While the major focus of this paper has been on appearance learning, as future work we plan to expand the current model into structured models where we can learn a more meaningful geometric representation.
7. Acknowledgement

This work was supported by NSF CNS 1028381, DARPA MSEE project FA 8650-11-1-7149, ONR MURI N00014-10-1-0933, and NSF IIS 1018751. The first author was also partially supported by the Kwanjeong Educational Foundation.
References

[1] R. Benenson, M. Mathias, T. Tuytelaars, and L. Van Gool. Seeking the strongest rigid detector. In CVPR, 2013.
[2] L. Bourdev, S. Maji, and J. Malik. Describing people: Poselet-based attribute classification. In ICCV, 2011.
[3] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human pose annotations. In ICCV, 2009.
[4] L. Cao, M. Dikmen, Y. Fu, and T. S. Huang. Gender recognition from body. In ACM MM, 2008.
[5] H. Chen, A. Gallagher, and B. Girod. Describing clothing by semantic attributes. In ECCV, 2012.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[7] I. Endres, K. J. Shih, J. Jiaa, and D. Hoiem. Learning collections of part models for object recognition. In CVPR, 2013.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32:1627-1645, 2010.
[9] B. A. Golomb, D. T. Lawrence, and T. J. Sejnowski. Sexnet: A neural network identifies sex from human faces. In NIPS, 1990.
[10] S. Gutta, H. Wechsler, and P. J. Phillips. Gender and ethnic classification of face images. In FG, 1998.
[11] Y. Jia, C. Huang, and T. Darrell. Beyond spatial pyramids: Receptive field learning for pooled image features. In CVPR, 2012.
[12] N. Kumar, A. Berg, P. N. Belhumeur, and S. Nayar. Describable visual attributes for face verification and image search. TPAMI, 33(10):1962-1977, 2011.
[13] Y. H. Kwon and N. da Vitoria Lobo. Age classification from facial images. CVIU, 1999.
[14] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[15] G. Sharma and F. Jurie. Learning discriminative spatial representation for image classification. In BMVC, 2011.
[16] G. Sharma, F. Jurie, and C. Schmid. Expanded parts model for human attribute and action recognition in still images. In CVPR, 2013.
[17] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In ECCV, 2012.
[18] X. Song, T. Wu, Y. Jia, and S.-C. Zhu. Discriminatively trained and-or tree models for object detection. In CVPR, 2013.
[19] D. A. Vaquero, R. S. Feris, D. Tran, L. M. G. Brown, A. Hampapur, and M. Turk. Attribute-based people search in surveillance environments. In WACV, 2009.
[20] S. Wang, J. Joo, Y. Wang, and S.-C. Zhu. Weakly supervised learning for attribute localization in outdoor scenes. In CVPR, 2013.
[21] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, 2011.
[22] B. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proceedings of the IEEE, pages 1485-1508, 2010.