Document Type: Original/Review Paper

Authors

1 Faculty of Electrical Engineering, K.N.Toosi University of Technology, Tehran, Iran.

2 Faculty of Computer Engineering, K.N.Toosi University of Technology, Tehran, Iran.

3 Faculty of Informatics, University of Wollongong, Wollongong, Australia.

Abstract

Classical SFM (Structure From Motion) algorithms are widely used to estimate the three-dimensional structure of a stationary scene observed by a moving camera. However, when the scene contains moving objects whose equations of motion are unknown, the approach fails. This paper first demonstrates that when the frame rate is high enough and the object's movement is continuous in time, meaning that its acceleration is bounded, a simple linear model can be used effectively to estimate the motion. This result is first proven mathematically as a closed-form expression, and the estimate is then refined by a nonlinear optimization tailored to the problem. The algorithm is evaluated on both synthesized data and real data from the Hopkins dataset.
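
The intuition behind the linear-model claim can be illustrated with a small numerical sketch. This is not the paper's algorithm: it assumes a single point moving in one dimension under constant acceleration, and the function name `linear_prediction_error`, the acceleration value, and the frame rates are all hypothetical choices made for illustration. The sketch predicts each position by extrapolating the last two frames at constant velocity and reports the worst prediction error.

```python
def linear_prediction_error(fps, accel=2.0, duration=1.0):
    """Worst-case error of constant-velocity extrapolation.

    Assumption (illustrative only): a 1-D point with constant
    acceleration `accel`, sampled at `fps` frames per second.
    """
    dt = 1.0 / fps
    n = int(round(duration * fps)) + 1

    # True trajectory x(t) = 0.5 * a * t^2 sampled at the frame times.
    x = [0.5 * accel * (k * dt) ** 2 for k in range(n)]

    # Linear model: next position = current + (current - previous).
    errors = [
        abs(x[k + 1] - (2.0 * x[k] - x[k - 1]))
        for k in range(1, n - 1)
    ]
    return max(errors)


if __name__ == "__main__":
    for fps in (15, 30, 60):
        err = linear_prediction_error(fps)
        print(f"{fps:3d} fps: max linear-model error = {err:.6f}")
```

Under these assumptions the error equals the acceleration times the squared frame interval, so doubling the frame rate reduces it by a factor of four. This is the sense in which a bounded acceleration and a high frame rate make a linear motion model adequate.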
