DRNet: A Depth-Based Regression Network for 6D Object Pose Estimation
Abstract
1. Introduction
2. Related Work
3. Method
3.1. Architecture Overview
3.2. Object Segmentation
3.3. Translation Estimation Module
3.4. Pose Regression Module
3.5. Synthetic Depth Map
3.6. Training and Architecture Details
4. Experiments
4.1. Datasets
4.2. Evaluation Metrics
4.3. Ablation Study
4.4. Performance of Synthetic Depth Map
4.5. Comparison with State-of-the-Art Methods
4.6. Symmetric Object Loss
4.7. Time Efficiency
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Song, Y.; Chen, X.; Wang, X.; Zhang, Y.; Li, J. 6-DOF Image Localization from Massive Geo-tagged Reference Images. IEEE Trans. Multimed. 2016, 18, 1542–1554. [Google Scholar] [CrossRef]
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-View 3d Object Detection Network for Autonomous Driving. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are We Ready for Autonomous Driving? The Kitti Vision Benchmark Suite. In Proceedings of the CVPR, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Xu, D.; Anguelov, D.; Jain, A. Pointfusion: Deep sensor fusion for 3d bounding box estimation. arXiv 2017, arXiv:1711.10871. [Google Scholar]
- Farbiz, F.; Cheok, A.D.; Wei, L.; ZhiYing, Z.; Ke, X.; Prince, S.; Billinghurst, M.; Kato, H. Live three-dimensional content for augmented reality. IEEE Trans. Multimed. 2005, 7, 514–523. [Google Scholar] [CrossRef]
- Marder-Eppstein, E. Project tango. In ACM SIGGRAPH 2016-Real-Time Live; Association for Computing Machinery: New York, NY, USA, 2016. [Google Scholar]
- Balter, R.; Gioia, P. Scalable and Efficient Video Coding Using 3-D Modeling. IEEE Trans. Multimed. 2006, 8, 1147–1155. [Google Scholar]
- Collet, A.; Martinez, M.; Srinivasa, S.S. The moped framework: Object recognition and pose estimation for manipulation. Int. J. Robot. Res. 2011, 30, 1284–1306. [Google Scholar] [CrossRef] [Green Version]
- Zhu, M.; Derpanis, K.G.; Yang, Y.; Brahmbhatt, S.; Zhang, M.; Phillips, C.; Lecce, M.; Daniilidis, K. Single image 3D object detection and pose estimation for grasping. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 3936–3943. [Google Scholar]
- Wang, C.; Xu, D.; Zhu, Y.; Martín-Martín, R.; Lu, C.; Fei-Fei, L.; Savarese, S. DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion. arXiv 2019, arXiv:1901.04780. [Google Scholar]
- Cheng, Y.; Zhu, H.; Acar, C.; Jing, W.; Lim, J.H. 6D Pose Estimation with Correlation Fusion. arXiv 2019, arXiv:1909.12936. [Google Scholar]
- Brachmann, E.; Krull, A.; Michel, F.; Gumhold, S.; Shotton, J.; Rother, C. Learning 6d object pose estimation using 3d object coordinates. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 536–551. [Google Scholar]
- Hinterstoisser, S.; Holzer, S.; Cagniart, C.; Ilic, S.; Konolige, K.; Navab, N.; Lepetit, V. Multimodal Templates for Real-Time Detection of Texture-Less Objects in Heavily Cluttered Scenes. In Proceedings of the ICCV, Barcelona, Spain, 6–13 November 2011; pp. 858–865. [Google Scholar]
- Hinterstoisser, S.; Lepetit, V.; Ilic, S.; Holzer, S.; Bradski, G.; Konolige, K.; Navab, N. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2012; pp. 548–562. [Google Scholar]
- Kehl, W.; Milletari, F.; Tombari, F.; Ilic, S.; Navab, N. Deep learning of local rgb-d patches for 3d object detection and 6d pose estimation. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 205–220. [Google Scholar]
- Rios-Cabrera, R.; Tuytelaars, T. Discriminatively Trained Templates for 3d Object Detection: A Real Time Scalable Approach. In Proceedings of the ICCV, Sydney, Australia, 1–8 December 2013; pp. 2048–2055. [Google Scholar]
- Tejani, A.; Tang, D.; Kouskouridas, R.; Kim, T.K. Latent-class hough forests for 3d object detection and pose estimation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 462–477. [Google Scholar]
- Hinterstoisser, S.; Cagniart, C.; Ilic, S.; Sturm, P.; Navab, N.; Fua, P.; Lepetit, V. Gradient response maps for real-time detection of textureless objects. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 876–888. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
- Xiang, Y.; Schmidt, T.; Narayanan, V.; Fox, D. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. arXiv 2017, arXiv:1711.00199. [Google Scholar]
- Billings, G.; Johnson-Roberson, M. SilhoNet: An RGB Method for 6D Object Pose Estimation. IEEE Robot. Autom. Lett. 2019, 4, 3727–3734. [Google Scholar] [CrossRef] [Green Version]
- Aubry, M.; Maturana, D.; Efros, A.A.; Russell, B.C.; Sivic, J. Seeing 3d Chairs: Exemplar Part-Based 2d-3d Alignment Using a Large Dataset of Cad Models. In Proceedings of the CVPR, Columbus, OH, USA, 23–28 June 2014; pp. 3762–3769. [Google Scholar]
- Gu, C.; Ren, X. Discriminative Mixture-of-Templates for Viewpoint Classification; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
- Huttenlocher, D.P.; Klanderman, G.A.; Rucklidge, W.J. Comparing images using the Hausdorff distance. IEEE Trans. Pattern Anal. Mach. Intell. 1993, 15, 850–863. [Google Scholar] [CrossRef] [Green Version]
- Suwajanakorn, S.; Snavely, N.; Tompson, J.; Norouzi, M. Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning. arXiv 2018, arXiv:1807.03146. [Google Scholar]
- Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3D Bounding Box Estimation Using Deep Learning and Geometry. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 7074–7082. [Google Scholar]
- Pavlakos, G.; Zhou, X.; Chan, A.; Derpanis, K.G.; Daniilidis, K. 6-dof object pose from semantic keypoints. arXiv 2017, arXiv:1703.04670. [Google Scholar]
- Peng, S.; Liu, Y.; Huang, Q.; Bao, H.; Zhou, X. PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation. In Proceedings of the CVPR, Long Beach, CA, USA, 15–21 June 2019; pp. 4561–4570. [Google Scholar]
- Kendall, A.; Grimes, M.; Cipolla, R. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. arXiv 2015, arXiv:1505.07427. [Google Scholar]
- Li, Y.; Wang, G.; Ji, X.; Xiang, Y.; Fox, D. DeepIM: Deep Iterative Matching for 6D Pose Estimation; Springer: Berlin/Heidelberg, Germany, 2018; pp. 683–698. [Google Scholar]
- Flynn, J.; Neulander, I.; Philbin, J.; Snavely, N. Deepstereo: Learning to Predict New Views from the World’s Imagery. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Forsyth, D.; Ponce, J. Computer Vision: A Modern Approach; Prentice Hall: Upper Saddle River, NJ, USA, 2002. [Google Scholar]
- Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42. [Google Scholar] [CrossRef]
- Guler, R.A.; Trigeorgis, G.; Antonakos, E.; Snape, P.; Kokkinos, I. DenseReg: Fully Convolutional Dense Shape Regression In-the-Wild. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper Depth Prediction with Fully Convolutional Residual Networks. In Proceedings of the 3DV, Stanford, CA, USA, 25–28 October 2016. [Google Scholar]
- Roy, A.; Todorovic, S. Monocular Depth Estimation Using Neural Regression Forest. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Eigen, D.; Fergus, R. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture. In Proceedings of the ICCV, Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth Map Prediction from a Single Image Using a Multi-Scale Deep Network. In Proceedings of the NIPS, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
- Xie, J.; Girshick, R.; Farhadi, A. Deep3d: Fully Automatic 2d-to-3d Video Conversion with Deep Convolutional Neural Networks; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep Ordinal Regression Network for Monocular Depth Estimation. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2002–2011. [Google Scholar]
- Garg, R.; Carneiro, G.; Reid, I. Unsupervised Cnn for Single View Depth Estimation: Geometry to the Rescue; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
- Kuznietsov, Y.J.S.; Leibe, B. Semi-Supervised Deep Learning for Monocular Depth Map Prediction. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv 2016, arXiv:1606.00915. [Google Scholar] [CrossRef] [PubMed]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Xiao, J.; Hays, J.; Ehinger, K.A.; Oliva, A.; Torralba, A. Sun Database: Large-Scale Scene Recognition from Abbey to Zoo. In Proceedings of the CVPR, San Francisco, CA, USA, 13–18 June 2010. [Google Scholar]
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
- Hodan, T.; Haluza, P.; Obdržálek, Š.; Matas, J.; Lourakis, M.; Zabulis, X. T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-Less Objects. In Proceedings of the WACV, Santa Rosa, CA, USA, 24–31 March 2017. [Google Scholar]
- Tekin, B.; Sinha, S.N.; Fua, P. Real-Time Seamless Single Shot 6d Object Pose Prediction. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Sundermeyer, M.; Marton, Z.C.; Durner, M.; Brucker, M.; Triebel, R. Implicit 3d Orientation Learning for 6d Object Detection from Rgb Images; Springer: Berlin/Heidelberg, Germany, 2018; pp. 712–729. [Google Scholar]
- Besl, P.J.; McKay, N.D. A method for registration of 3-d shapes. IEEE Trans. Pattern Anal. Mach. Intell. 1992, 14, 239–256. [Google Scholar] [CrossRef]
- Li, Z.; Wang, G.; Ji, X. CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation. In Proceedings of the ICCV, Seoul, Korea, 27 October–2 November 2019; pp. 7678–7687. [Google Scholar]
- Manhardt, F.; Arroyo, D.M. Explaining the Ambiguity of Object Detection and 6D Pose From Visual Data. In Proceedings of the ICCV, Seoul, Korea, 27 October–2 November 2019; pp. 6841–6850. [Google Scholar]
- Rad, M.; Lepetit, V. BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth. In Proceedings of the ICCV, Venice, Italy, 22–29 October 2017; pp. 3828–3836. [Google Scholar]
- Park, K.; Patten, T.; Vincze, M. Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation. In Proceedings of the ICCV, Seoul, Korea, 27 October–2 November 2019; pp. 7668–7677. [Google Scholar]
- Hodan, T.; Barath, D. EPOS: Estimating 6D Pose of Objects With Symmetries. In Proceedings of the CVPR, Seattle, WA, USA, 13–19 June 2020; pp. 11703–11712. [Google Scholar]
Objects | Nondepth | NonRefine (GT) | NonRefine (SYN) | Refine (GT) | Refine (SYN) | +Gtmask (GT) | +Gtmask (SYN) | Gtmask+Gtdepth (GT) | Gtmask+Gtdepth (SYN)
---|---|---|---|---|---|---|---|---|---
002_master_chef_can | 41.08 | 44.64 | 55.43 | 58.99 | 66.99 | 61.16 | 70.33 | 68.12 | 71.79 |
003_cracker_box | 58.04 | 59.02 | 59.1 | 75.21 | 74.28 | 78.9 | 77.61 | 82.25 | 83.52 |
004_sugar_box | 80.91 | 62.44 | 66.16 | 86.84 | 89 | 88.38 | 90.68 | 94.55 | 94.49 |
005_tomato_soup_can | 65.45 | 59.77 | 60.56 | 79.84 | 83.17 | 80.61 | 85.3 | 87.4 | 91.23 |
006_mustard_bottle | 80.07 | 67.73 | 67.44 | 87.52 | 86.5 | 88.8 | 87.66 | 90.88 | 89.0 |
007_tuna_fish_can | 64.57 | 46.96 | 53.86 | 69.13 | 68.64 | 65.78 | 69.17 | 67.86 | 70.96 |
008_pudding_box | 38.98 | 65.0 | 66.07 | 83.96 | 87.97 | 91.0 | 93.34 | 93.78 | 93.8 |
009_gelatin_box | 58.67 | 64.45 | 64.16 | 86.29 | 89.36 | 90.86 | 94.44 | 97.59 | 96.58 |
010_potted_meat_can | 62.14 | 48.45 | 59.6 | 72.97 | 79.16 | 74.67 | 80.98 | 76.19 | 82.1 |
011_banana | 80.58 | 69.98 | 60.07 | 83.51 | 80.38 | 86.75 | 90.4 | 92.0 | 91.21 |
019_pitcher_base | 83.25 | 57.56 | 84.1 | 86.78 | 77.96 | 88.13 | 77.97 | 88.13 | 87.56 |
021_bleach_cleanser | 68.81 | 52.52 | 58.72 | 67.41 | 73.73 | 70.39 | 79.66 | 89.93 | 86.89 |
024_bowl | 83.8 | 80.67 | 83.89 | 80.49 | 89.73 | 82.52 | 91.37 | 90.38 | 93.82 |
025_mug | 56.76 | 44.63 | 62.22 | 71.94 | 85.5 | 63.68 | 86.88 | 76.43 | 92.88 |
035_power_drill | 73.75 | 52.61 | 65.92 | 78.37 | 88.99 | 83.03 | 90.8 | 89.85 | 92.58 |
036_wood_block | 69.54 | 63.54 | 73.03 | 66.43 | 77.5 | 70.4 | 82.71 | 86.38 | 91.04 |
037_scissors | 39.06 | 43.7 | 38.2 | 63.14 | 55.39 | 67.37 | 47.91 | 71.68 | 56.92 |
040_large_marker | 71.1 | 59.84 | 60.74 | 85.75 | 84.58 | 83.78 | 83.57 | 85.88 | 85.99 |
051_large_clamp | 15.52 | 46.29 | 57.08 | 47.94 | 58.2 | 86.48 | 89.69 | 87.61 | 92.69 |
052_extra_large_clamp | 21.4 | 46.88 | 55.65 | 54.43 | 59.86 | 71.57 | 80.36 | 71.08 | 82.66 |
061_foam_brick | 74.87 | 87.44 | 86.9 | 87.88 | 91.38 | 87.64 | 92.2 | 92.4 | 93.33 |
AVG | 61.35 | 58.29 | 63.76 | 74.99 | 78.5 | 79.14 | 83.0 | 84.78 | 86.72 |
Objects | PoseCNN [20] ADD | PoseCNN [20] ADD-S | DeepIM [30] ADD | DeepIM [30] ADD-S | SilhoNet [21] ADD | SilhoNet [21] ADD-S | Ours ADD | Ours ADD-S
---|---|---|---|---|---|---|---|---
002_master_chef_can | 50.9 | 84 | 71.2 | 93.1 | - | 84 | 66.99 | 93.35 |
003_cracker_box | 51.7 | 76.9 | 83.6 | 91 | - | 73.5 | 74.28 | 92.53 |
004_sugar_box | 68.6 | 84.3 | 94.1 | 96.2 | - | 86.6 | 89 | 96.85 |
005_tomato_soup_can | 66 | 80.9 | 86.1 | 92.4 | - | 88.7 | 83.17 | 92.51 |
006_mustard_bottle | 79.9 | 90.2 | 91.5 | 95.1 | - | 89.8 | 86.5 | 95.2 |
007_tuna_fish_can | 70.4 | 87.9 | 87.7 | 96.1 | - | 89.5 | 68.64 | 94.25 |
008_pudding_box | 62.9 | 79 | 82.7 | 90.7 | - | 60.1 | 87.97 | 93.89 |
009_gelatin_box | 75.2 | 87.1 | 71.9 | 94.3 | - | 92.7 | 89.36 | 93.81 |
010_potted_meat_can | 59.6 | 78.5 | 76.2 | 86.4 | - | 78.8 | 79.16 | 93 |
011_banana | 72.3 | 85.9 | 81.2 | 91.3 | - | 80.7 | 80.38 | 92.37 |
019_pitcher_base | 52.5 | 76.8 | 90.1 | 94.6 | - | 91.7 | 77.96 | 91.37 |
021_bleach_cleanser | 50.5 | 71.9 | 81.2 | 90.3 | - | 73.6 | 73.73 | 89.64 |
024_bowl | 6.5 | 69.7 | 8.6 | 81.4 | - | 79.6 | 30.97 | 89.73 |
025_mug | 57.7 | 78 | 81.4 | 91.3 | - | 86.8 | 85.5 | 94 |
035_power_drill | 55.1 | 72.8 | 85.5 | 92.3 | - | 56.5 | 88.99 | 94.04 |
036_wood_block | 31.8 | 65.8 | 60 | 81.9 | - | 66.2 | 65.85 | 77.5 |
037_scissors | 35.8 | 56.2 | 60.9 | 75.4 | - | 49.1 | 55.39 | 77.1 |
040_large_marker | 58 | 71.4 | 75.6 | 86.2 | - | 75 | 84.58 | 91.85 |
051_large_clamp | 25 | 49.9 | 48.4 | 74.3 | - | 69.2 | 35.97 | 58.2 |
052_extra_large_clamp | 15.8 | 47 | 31 | 73.3 | - | 72.3 | 36.82 | 59.86 |
061_foam_brick | 40.4 | 87.8 | 35.9 | 81.9 | - | 77.9 | 68.1 | 91.38 |
AVG | 53.7 | 75.9 | 71.7 | 88.1 | - | 79.6 | 71.87 | 88.21 |
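For reference, the ADD and ADD-S columns follow the definitions used in the PoseCNN [20] evaluation protocol: ADD averages the distance between model points transformed by the predicted pose and by the ground-truth pose, while ADD-S averages, for each ground-truth-transformed point, the distance to its closest predicted-transformed point, which makes the measure invariant to object symmetries. The NumPy sketch below illustrates these definitions; the function names and the `model_pts` argument (an N × 3 array of 3D model points in the object frame) are illustrative and not taken from the paper.

```python
import numpy as np

def add_metric(R_pred, t_pred, R_gt, t_gt, model_pts):
    """ADD: mean distance between corresponding model points under the
    predicted and ground-truth rigid transforms."""
    pred = model_pts @ R_pred.T + t_pred
    gt = model_pts @ R_gt.T + t_gt
    return np.linalg.norm(pred - gt, axis=1).mean()

def add_s_metric(R_pred, t_pred, R_gt, t_gt, model_pts):
    """ADD-S: mean distance from each ground-truth-transformed point to the
    closest predicted-transformed point (used for symmetric objects)."""
    pred = model_pts @ R_pred.T + t_pred
    gt = model_pts @ R_gt.T + t_gt
    # N x N pairwise distances; acceptable for a few thousand model points
    pairwise = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return pairwise.min(axis=1).mean()
```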
Objects | Refine+ (ADD-S) | Refine+ (ADD-R) | Refine+ (ADD-S) | Refine+ (ADD-R) | gtdepth+ (ADD-S) | gtdepth+ (ADD-R) | gtdepth+ (ADD-S) | gtdepth+ (ADD-R)
---|---|---|---|---|---|---|---|---
024_bowl | 85.16 | 45.12 | 91.37 | 83.2 | 86.78 | 48.09 | 93.82 | 89.55 |
036_wood_block | 83.08 | 67.03 | 82.71 | 67.89 | 91.17 | 83.2 | 91.04 | 83.2 |
051_large_clamp | 85.29 | 45.81 | 88.69 | 80.56 | 87.2 | 50.28 | 92.69 | 88.18 |
052_extra_large_clamp | 76.43 | 50.13 | 80.36 | 84.13 | 78.17 | 53.06 | 82.66 | 91.18 |
061_foam_brick | 93.34 | 43.75 | 92.2 | 72.25 | 95.54 | 44.26 | 93.33 | 73.53 |
AVG | 84.66 | 50.37 | 87.07 | 77.61 | 87.77 | 55.78 | 90.71 | 85.13 |
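The per-object scores and averages reported above correspond to area-under-the-curve (AUC) accuracies as in the PoseCNN [20] protocol: pose accuracy is plotted against the distance threshold and the resulting curve is integrated up to a fixed maximum threshold (10 cm). A minimal sketch of that aggregation, assuming the per-frame ADD or ADD-S distances (in metres) for one object have already been collected in a list `dists`:

```python
import numpy as np

def auc_of_add(dists, max_threshold=0.10):
    """Area under the accuracy-vs-threshold curve, scaled to 0-100.

    `dists` holds one ADD (or ADD-S) distance per test frame, in metres;
    frames whose distance exceeds `max_threshold` never count as correct.
    """
    d = np.sort(np.asarray(dists, dtype=float))
    acc = np.arange(1, len(d) + 1) / len(d)   # accuracy once the threshold passes d[i]
    inside = d <= max_threshold
    # exact integration of the accuracy step function from 0 to max_threshold
    x = np.concatenate(([0.0], d[inside], [max_threshold]))
    y = np.concatenate(([0.0], acc[inside]))
    return 100.0 * np.sum(np.diff(x) * y) / max_threshold
```

With the 10 cm cap, a frame whose ADD-S distance is a few millimetres contributes accuracy over nearly the whole threshold range, whereas a frame above 10 cm contributes nothing, so the score rewards both correctness and precision.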
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Jin, L.; Wang, X.; He, M.; Wang, J. DRNet: A Depth-Based Regression Network for 6D Object Pose Estimation. Sensors 2021, 21, 1692. https://doi.org/10.3390/s21051692