An Improved Deep Mutual-Attention Learning Model for Person Re-Identification
Abstract
1. Introduction
2. Related Works
2.1. Siamese-Based CNN Model
2.2. Patch-Based Method
2.3. Local/Global Feature and Scale Learning Methods
2.4. Attention-Based Methods
3. Proposed Method
3.1. Model Architecture
3.2. Self- and Mutual-Attention
3.3. Self-Attention Map
3.4. Mutual Attention Layer
4. Loss Function and Similarity Learning
5. Experiment
5.1. Experimental Settings
- Input preparation: We used a ResNet-50 pre-trained on ImageNet and took the features extracted from the third and fourth residual blocks to compute the self- and mutual-attention maps. All input images were resized to a resolution of 256 × 188 and horizontally flipped, and the mean image computed over the training set was subtracted from every image. We also applied random erasing to regularize the model and make it more robust. Positive and negative pairs were randomly chosen and shuffled in each mini-batch so that the model could not benefit from a fixed input sequence, which also helps to avoid overfitting.
- Training: We implemented the proposed model in the PyTorch deep learning framework. The mini-batch size was set to 32, with the initial learning rate set to . The learning rate was gradually decayed by a factor of between the 40th and 60th epochs. We optimized the model with stochastic gradient descent and trained for 60 epochs. We applied a dropout rate of 0.75 to the fully connected layer to reduce the risk of overfitting. To keep training stable and avoid vanishing or exploding gradients, the model weights were initialized using Kaiming initialization. As the model was trained on both the classification and verification tasks, two cross-entropy losses, namely a multi-class cross-entropy loss and a binary cross-entropy loss, jointly drove the optimization. We experimentally set the weight coefficient of the verification loss to 0.5. The total training loss combines the losses from the two tasks.
- Testing: During testing, features were first extracted from the gallery and query images by feed-forwarding the 256 × 188 test images through the trained model, yielding a 512-dimensional descriptor for each person image. The final ranking was obtained by computing the Euclidean distance between each query image and all gallery images. We evaluated the model with the commonly used Cumulative Match Curve (CMC) and mean Average Precision (mAP).
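The joint objective and the distance-based evaluation described in the list above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the additive form used in `joint_loss` (classification loss plus 0.5 times the verification loss) is an assumption inferred from the description, all function names are hypothetical, and the toy descriptors and identities are invented for the example. The sketch also leaves out protocol details such as filtering gallery images by camera.

```python
import math

def cross_entropy(probs, target):
    """Multi-class cross-entropy for one sample, given class probabilities."""
    return -math.log(probs[target])

def binary_cross_entropy(p_same, is_same):
    """Binary cross-entropy for one pair; p_same is the predicted match probability."""
    return -math.log(p_same) if is_same else -math.log(1.0 - p_same)

def joint_loss(cls_loss, ver_loss, weight=0.5):
    """Assumed combination: classification loss plus weighted verification loss."""
    return cls_loss + weight * ver_loss

def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_gallery(query, gallery):
    """Gallery indices sorted by increasing Euclidean distance to the query."""
    return sorted(range(len(gallery)), key=lambda i: euclidean(query, gallery[i]))

def cmc_hit(ranked_ids, query_id, k):
    """1 if the query identity appears among the top-k ranked identities, else 0."""
    return int(query_id in ranked_ids[:k])

def average_precision(ranked_ids, query_id):
    """Average precision of one ranked list; mAP is the mean over all queries."""
    hits, precisions = 0, []
    for rank, gid in enumerate(ranked_ids, start=1):
        if gid == query_id:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy training step: one identity prediction and one positive pair (made-up values).
total = joint_loss(cross_entropy([0.7, 0.2, 0.1], 0),
                   binary_cross_entropy(0.9, True))

# Toy retrieval: 2-D descriptors stand in for the 512-D person descriptors.
gallery = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [3.0, 3.0]]
gallery_ids = [7, 3, 7, 5]
query = [0.1, 0.1]

order = rank_gallery(query, gallery)
ranked_ids = [gallery_ids[i] for i in order]
print(ranked_ids)                         # [7, 7, 3, 5]
print(cmc_hit(ranked_ids, 7, 1))          # 1
print(average_precision(ranked_ids, 7))   # 1.0
```

Averaging `cmc_hit` over all queries for a fixed k gives the rank-k accuracy, and averaging `average_precision` over all queries gives mAP.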
5.2. Datasets and Protocols
6. Result and Discussion
6.1. Result on DukeMTMC-reID
6.2. Result on Market-1501
6.3. Ablation Study
7. Conclusions and Future Work
Author Contributions
Funding
Conflicts of Interest
Abbreviations
SA | Self-Attention |
MU | Mutual-Attention |
Cls | Classification Loss |
V | Verification Loss |
References
1. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
2. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Twenty-Sixth Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–8 December 2012; pp. 1097–1105.
4. Zhang, G.P. Neural networks for classification: A survey. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2000, 30, 451–462.
5. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732.
6. Dang, L.M.; Min, K.; Lee, S.; Han, D.; Moon, H. Tampered and Computer-Generated Face Images Identification Based on Deep Learning. Appl. Sci. 2020, 10, 505.
7. Kubanek, M.; Bobulski, J.; Kulawik, J. A Method of Speech Coding for Speech Recognition Using a Convolutional Neural Network. Symmetry 2019, 11, 1185.
8. Gadde, R.; Jampani, V.; Kiefel, M.; Kappler, D.; Gehler, P.V. Superpixel convolutional networks using bilateral inceptions. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 597–613.
9. Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D.J.; Wierstra, D. Draw: A recurrent neural network for image generation. arXiv 2015, arXiv:1502.04623.
10. Zhao, B.; Wu, X.; Feng, J.; Peng, Q.; Yan, S. Diversified visual attention networks for fine-grained object classification. IEEE Trans. Multimed. 2017, 19, 1245–1256.
11. Su, C.; Li, J.; Zhang, S.; Xing, J.; Gao, W.; Tian, Q. Pose-driven deep convolutional model for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3960–3969.
12. Bai, X.; Yang, M.; Huang, T.; Dou, Z.; Yu, R.; Xu, Y. Deep-person: Learning discriminative deep features for person re-identification. Pattern Recognit. 2020, 98, 107036.
13. Liu, H.; Feng, J.; Qi, M.; Jiang, J.; Yan, S. End-to-end comparative attention networks for person re-identification. IEEE Trans. Image Process. 2017, 26, 3492–3506.
14. Wu, L.; Wang, Y.; Gao, J.; Li, X. Deep adaptive feature embedding with local sample distributions for person re-identification. Pattern Recognit. 2018, 73, 275–288.
15. Varior, R.R.; Haloi, M.; Wang, G. Gated siamese convolutional neural network architecture for human re-identification. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 791–808.
16. Wang, H.; Fan, Y.; Wang, Z.; Jiao, L.; Schiele, B. Parameter-Free Spatial Attention Network for Person Re-Identification. arXiv 2018, arXiv:1811.12150.
17. Yang, F.; Yan, K.; Lu, S.; Jia, H.; Xie, X.; Gao, W. Attention driven person re-identification. Pattern Recognit. 2019, 86, 143–155.
18. Wu, S.; Chen, Y.C.; Li, X.; Wu, A.C.; You, J.J.; Zheng, W.S. An enhanced deep feature representation for person re-identification. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–8.
19. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117.
20. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436.
21. Abdel-Hamid, O.; Deng, L.; Yu, D. Exploring convolutional neural network structures and optimization techniques for speech recognition. Interspeech 2013, 2013, 1173–1175.
22. Venugopalan, S.; Xu, H.; Donahue, J.; Rohrbach, M.; Mooney, R.; Saenko, K. Translating videos to natural language using deep recurrent neural networks. arXiv 2014, arXiv:1412.4729.
23. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
24. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057.
25. Kiros, R.; Salakhutdinov, R.; Zemel, R.S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv 2014, arXiv:1411.2539.
26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Annual Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
28. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. In Proceedings of the Advances in Neural Information Processing Systems Conference, San Francisco, CA, USA, November 1993; pp. 737–744.
29. Li, W.; Zhao, R.; Xiao, T.; Wang, X. Deepreid: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 152–159.
30. Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Deep metric learning for person re-identification. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 34–39.
31. Zheng, L.; Yang, Y.; Hauptmann, A.G. Person re-identification: Past, present and future. arXiv 2016, arXiv:1610.02984.
32. Liu, J.; Zha, Z.J.; Tian, Q.; Liu, D.; Yao, T.; Ling, Q.; Mei, T. Multi-scale triplet cnn for person re-identification. In Proceedings of the 24th ACM international conference on Multimedia, Amsterdam, The Netherlands, October 2016; pp. 192–196.
33. Chen, W.; Chen, X.; Zhang, J.; Huang, K. Beyond triplet loss: A deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 403–412.
34. Rahimpour, A.; Liu, L.; Taalimi, A.; Song, Y.; Qi, H. Person re-identification using visual attention. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 4242–4246.
35. Zhang, X.; Luo, H.; Fan, X.; Xiang, W.; Sun, Y.; Xiao, Q.; Jiang, W.; Zhang, C.; Sun, J. Alignedreid: Surpassing human-level performance in person re-identification. arXiv 2017, arXiv:1711.08184.
36. Farenzena, M.; Bazzani, L.; Perina, A.; Murino, V.; Cristani, M. Person re-identification by symmetry-driven accumulation of local features. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2360–2367.
37. Ahmed, E.; Jones, M.; Marks, T.K. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3908–3916.
38. Subramaniam, A.; Chatterjee, M.; Mittal, A. Deep neural networks with inexact matching for person re-identification. In Proceedings of the Advances in Neural Information Processing Systems Conference, Barcelona, Spain, 5–10 December 2016; pp. 2667–2675.
39. Li, W.; Zhu, X.; Gong, S. Person re-identification by deep joint learning of multi-loss classification. arXiv 2017, arXiv:1705.04724.
40. Chen, W.; Chen, X.; Zhang, J.; Huang, K. A multi-task deep network for person re-identification. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
41. Chen, Y.; Zhu, X.; Gong, S. Person re-identification by deep learning multi-scale representations. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2590–2600.
42. Chung, D.; Tahboub, K.; Delp, E.J. A two stream siamese convolutional neural network for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1983–1991.
43. Zhao, L.; Li, X.; Zhuang, Y.; Wang, J. Deeply-learned part-aligned representations for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3219–3228.
44. Li, W.; Zhu, X.; Gong, S. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2285–2294.
45. Kalayeh, M.M.; Basaran, E.; Gökmen, M.; Kamasak, M.E.; Shah, M. Human semantic parsing for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1062–1071.
46. Wu, L.; Wang, Y.; Gao, J.; Tao, D. Deep Co-attention based Comparators For Relative Representation Learning in Person Re-identification. arXiv 2018, arXiv:1804.11027.
47. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1116–1124.
48. Gou, M.; Karanam, S.; Liu, W.; Camps, O.; Radke, R.J. DukeMTMC4ReID: A large-scale multi-camera person re-identification dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017.
49. Liao, S.; Hu, Y.; Zhu, X.; Li, S.Z. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2197–2206.
50. Zheng, Z.; Zheng, L.; Yang, Y. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 3754–3762.
51. Deng, W.; Zheng, L.; Ye, Q.; Kang, G.; Yang, Y.; Jiao, J. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 994–1003.
52. Matsukawa, T.; Okabe, T.; Suzuki, E.; Sato, Y. Hierarchical gaussian descriptor for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1363–1372.
53. Bak, S.; Carr, P. One-shot metric learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 2990–2999.
54. Wang, Y.; Wu, L.; Lin, X.; Gao, J. Multiview spectral clustering via structured low-rank matrix factorization. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 4833–4843.
55. Zheng, Z.; Zheng, L.; Yang, Y. Pedestrian alignment network for large-scale person re-identification. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 3037–3045.
56. Fan, H.; Zheng, L.; Yan, C.; Yang, Y. Unsupervised person re-identification: Clustering and fine-tuning. ACM Trans. Multimed. Comput. Commun. Appl. 2018, 14, 83.
57. Li, D.; Chen, X.; Zhang, Z.; Huang, K. Learning deep context-aware features over body and latent parts for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 384–393.
58. Zhao, H.; Tian, M.; Sun, S.; Shao, J.; Yan, J.; Yi, S.; Wang, X.; Tang, X. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1077–1085.
59. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.

Setting | Rank | Market-1501 | DukeMTMC-reID |
---|---|---|---|
Single-shot | R = 1 | 90.74 | 80.83 |
Single-shot | R = 5 | 90.36 | 90.08 |
Single-shot | R = 10 | 97.86 | 93.49 |
Single-shot | R = 20 | 98.57 | 94.74 |
Single-shot | mAP | 76.92 | 64.52 |
Multi-shot | R = 1 | 93.82 | - |
Multi-shot | R = 5 | 97.86 | - |
Multi-shot | R = 10 | 98.81 | - |
Multi-shot | R = 20 | 99.34 | - |
Multi-shot | mAP | 83.55 | - |
Re-ranking | R = 1 | 91.77 | 85.18 |
Re-ranking | R = 5 | 95.39 | 91.29 |
Re-ranking | R = 10 | 96.70 | 93.49 |
Re-ranking | R = 20 | 98.07 | 95.51 |
Re-ranking | mAP | 87.30 | 80.65 |

Methods | Rank 1 | mAP |
---|---|---|
LOMO+XQDA [49] | 30.7 | 17.04 |
BoW+Kissme [47] | 25.13 | 12.17 |
GAN(R) [50] | 67.68 | 47.13 |
SPGAN [51] | 46.4 | 26.2 |
IDE [31] | 66.7 | 46.3 |
GOG [52] | 65.8 | - |
GAN(R) [22] | 67.68 | 47.13 |
LSRO [53] | 67.7 | 47.1 |
SVDNet [54] | 76.70 | 56.80 |
DCC [46] | 80.3 | 59.2 |
PAN [55] | 71.6 | 51.5 |
DPFL [41] | 79.2 | 60.6 |
HA-CNN [44] | 80.50 | 63.80 |
ResNet-50 Baseline | 75.27 | 57.13 |
Ours | 80.83 | 64.52 |

Method | Single-Query Rank 1 | Single-Query mAP | Multi-Shot Rank 1 | Multi-Shot mAP |
---|---|---|---|---|
PUL [56] | 45.5 | - | - | - |
BoW [47] | 34.4 | 14.1 | - | - |
OSML [53] | 42.6 | - | - | - |
PIE [52] | 65.7 | 41.1 | - | - |
S-CNN [15] | 76.04 | 48.45 | - | - |
MSCAN [57] | 80.3 | 57.5 | - | - |
SpindleNet [58] | 76.9 | - | - | - |
LSRO [53] | 83.9 | 66.1 | - | - |
Part-aligned [43] | 81.0 | - | - | - |
VGG16-Basel [59] | 65.02 | 38.27 | 74.14 | 52.25 |
CaffeNet-Basel [3] | 50.89 | 26.79 | 59.80 | 36.50 |
ResNet-50-Basel [27] | 73.69 | 51.48 | 81.47 | 63.95 |
DCC [14] | 86.7 | 69.4 | - | - |
ResNet-50 Base | 87.97 | 72.46 | - | - |
Ours with MA | 90.74 | 76.92 | 93.82 | 83.55 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Jamal, M.B.; Zhengang, J.; Ming, F. An Improved Deep Mutual-Attention Learning Model for Person Re-Identification. Symmetry 2020, 12, 358. https://doi.org/10.3390/sym12030358