Document Type: Applied Article


Computer Engineering Department, Yazd University, Yazd, Iran.


In video prediction, the goal is to predict the next frame of a video from a sequence of input frames. Although numerous studies tackle frame prediction, satisfactory performance has not yet been achieved, and the task therefore remains an open problem. This article studies multiscale processing for video prediction and presents a new network architecture based on it. The architecture belongs to the broad family of autoencoders and consists of an encoder and a decoder. A pretrained VGG serves as the encoder, processing a pyramid of input frames at multiple scales simultaneously, while the decoder is built from 3D convolutional layers. The proposed architecture is evaluated on three datasets of varying difficulty and compared against two conventional autoencoders. The results show that combining a pretrained network with multiscale processing yields a performant approach.
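The multiscale pyramid fed to the encoder can be sketched in a few lines. The following is a minimal NumPy illustration, assuming simple 2x2 average-pooling downsampling (the article does not specify the downsampling method, and the actual encoder is a pretrained VGG rather than this toy code):

```python
import numpy as np

def downsample(frame):
    """Halve spatial resolution by 2x2 average pooling.

    Hypothetical choice for illustration; bilinear or Gaussian
    downsampling would work equally well for building the pyramid.
    """
    h, w = frame.shape[:2]
    # Crop to even dimensions, then average each 2x2 block per channel.
    return frame[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def build_pyramid(frame, levels=3):
    """Return the frame at progressively coarser scales (full, 1/2, 1/4, ...)."""
    pyramid = [frame]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

frame = np.random.rand(64, 64, 3)  # one RGB input frame
pyramid = build_pyramid(frame)
print([p.shape for p in pyramid])  # [(64, 64, 3), (32, 32, 3), (16, 16, 3)]
```

In the proposed architecture, each level of such a pyramid would be processed by the shared pretrained encoder simultaneously, giving the decoder access to features at multiple spatial resolutions.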


[1] C. Zhang and J. Kim, “Modeling Long- and Short-Term Temporal Context for Video Object Detection,” in 2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 71–75, doi: 10.1109/ICIP.2019.8802920.
[2] W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection - A new baseline,” in Computer Vision and Pattern Recognition, 2018, pp. 6536–6545.
[3] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W. Wong, and W. Woo, “Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting,” in Advances in neural information processing systems, 2015, pp. 802–810.
[4] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” in International Conference on Learning Representations, 2016, pp. 1–14.
[5] W. Lotter, G. Kreiman, and D. Cox, “Unsupervised Learning of Visual Structure using Predictive Generative Networks,” in International Conference on Learning Representations - Workshop track, 2016, pp. 1–12.
[6] W. Lotter, G. Kreiman, and D. Cox, “Deep predictive coding networks for video prediction and unsupervised learning,” in International Conference on Learning Representations, 2017, pp. 1–18.
[7] S. Oprea et al., “A Review on Deep Learning Techniques for Video Prediction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–26, 2020, doi: 10.1109/TPAMI.2020.3045007.
[8] X. Jin et al., “Video Scene Parsing with Predictive Feature Learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5580–5588.
[9] J. Walker, K. Marino, A. Gupta, and M. Hebert, “The Pose Knows: Video Forecasting by Generating Pose Futures,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3352–3361, doi: 10.1109/ICCV.2017.361.
[10] M. Jamaseb Khollari, V. Derhami, and M. Yazdian Dehkordi, “Variational Generative Adversarial Networks for Preventing Mode Collapse,” Computational Intelligence in Electrical Engineering, Vol. 13, No. 3, 2021, doi: 10.22108/isee.2021.129742.1495.
[11] C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating Videos with Scene Dynamics,” in Neural Information Processing Systems (NIPS), 2016, pp. 613–621.
[12] J. van Amersfoort, A. Kannan, M. Ranzato, A. Szlam, D. Tran, and S. Chintala, “Transformation-Based Models of Video Sequences,” arXiv preprint arXiv:1701.08435, pp. 1–11, 2017.
[13] L. A. Lim and H. Yalim Keles, “Foreground segmentation using convolutional neural networks for multiscale feature encoding,” Pattern Recognition Letters, Vol. 112, pp. 256–262, 2018, doi: 10.1016/j.patrec.2018.08.002.
[14] N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised Learning of Video Representations using LSTMs,” in Proceedings of Machine Learning Research, 2015, Vol. 37, pp. 843–852, doi: citeulike-article-id:13519737.
[15] N. Mayer et al., “A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016, pp. 4040–4048, doi: 10.1109/CVPR.2016.438.
[16] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Computer Vision and Pattern Recognition, 2015, pp. 3061–3070, doi: 10.1109/CVPR.2015.7298925.
[17] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 1–8, doi: 10.1007/978-3-319-24574-4_28.
[18] M. Sabokrou, M. Fathy, Z. Moayed, and R. Klette, “Fast and accurate detection and localization of abnormal behavior in crowded scenes,” Machine Vision and Applications, Vol. 28, No. 8, pp. 965–985, 2017, doi: 10.1007/s00138-017-0869-8.
[19] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015, pp. 1–14.
[20] C. Szegedy et al., “Going Deeper with Convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
[21] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu, “PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning,” in International Conference on Machine Learning, 2018, pp. 5123–5132.
[22] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu, “PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs,” in Neural Information Processing Systems, 2017, pp. 880–889.
[23] J. Zhang, Y. Wang, M. Long, W. Jianmin, and P. S. Yu, “Z-order recurrent neural networks for video prediction,” in Proceedings - IEEE International Conference on Multimedia and Expo, 2019, pp. 230–235, doi: 10.1109/ICME.2019.00048.
[24] R. Mahjourian, M. Wicke, and A. Angelova, “Geometry-Based Next Frame Prediction from Monocular Video,” in IEEE Intelligent Vehicles Symposium, 2017, pp. 1700–1707, doi: 10.1109/IVS.2017.7995953.
[25] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, Vol. 13, No. 4, pp. 600–612, 2004, doi: 10.1109/TIP.2003.819861.
[26] V. Patraucean, A. Handa, and R. Cipolla, “Spatio-temporal video autoencoder with differentiable memory,” 2016.
[27] T. Wang et al., “MSU-Net: Multiscale Statistical U-Net for Real-Time 3D Cardiac MRI Video Segmentation,” Lecture Notes in Computer Science, Vol. 11765, pp. 614–622, 2019, doi: 10.1007/978-3-030-32245-8_68.