Document Type: Original/Review Paper

Authors

1 Electrical and Computer Engineering Faculty, Semnan University, Semnan 3513119111, Iran.

2 Department of Mathematics and Informatics, Universitat de Barcelona, and Computer Vision Center, Barcelona, Spain.

DOI: 10.22044/jadm.2025.15970.2713

Abstract

Image inpainting, the task of restoring missing or corrupted regions of an image by reconstructing pixel information, has recently seen considerable advances through deep learning. To model the complex spatial relationships within an image, this paper introduces a novel deep learning-based pre-processing methodology for image inpainting built on the Vision Transformer (ViT). Unlike CNN-based methods, our approach leverages the self-attention mechanism of the ViT to capture global contextual dependencies, improving the quality of inpainted regions. Specifically, we replace masked pixel values with values generated by the ViT, using attention to extract diverse visual patches and capture discriminative spatial features. To the best of our knowledge, this is the first pre-processing model of this kind proposed for image inpainting. Furthermore, we show that the methodology can be applied with a pre-trained ViT and a pre-defined patch size, reducing computational overhead while maintaining high reconstruction fidelity. To assess the generalization capability of the proposed methodology, we conduct extensive experiments comparing our approach with four standard inpainting models across four public datasets. The results confirm the efficacy of the pre-processing technique in enhancing inpainting performance, particularly for complex textures and large missing regions.
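As a rough illustration of the pre-processing step described above, the sketch below fills masked patches by letting every patch token attend to the visible patches of the image and projecting the attended tokens back to pixel space. This is a minimal PyTorch sketch under stated assumptions: the module name ViTPatchFiller, the single attention layer, and all dimensions are illustrative placeholders and do not reproduce the authors' implementation or a full pre-trained ViT.

import torch
import torch.nn as nn

class ViTPatchFiller(nn.Module):
    # Hypothetical pre-processing module: masked patches borrow context
    # from visible patches via self-attention and are then projected
    # back to pixel space. Sizes are illustrative, not the paper's.
    def __init__(self, patch_size=16, dim=256, heads=4):
        super().__init__()
        self.p = patch_size
        self.embed = nn.Linear(3 * patch_size ** 2, dim)       # patch -> token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_pixels = nn.Linear(dim, 3 * patch_size ** 2)   # token -> patch

    def forward(self, img, mask):
        # img: (B, 3, H, W) float; mask: (B, 1, H, W), nonzero = missing.
        B, C, H, W = img.shape
        p = self.p
        # Split image and mask into non-overlapping p x p patches.
        patches = img.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        m = mask.unfold(2, p, p).unfold(3, p, p)
        m = m.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, p * p)
        missing = (m > 0).any(dim=-1)                           # (B, N) bool
        tokens = self.embed(patches)
        # All tokens attend to visible tokens only (missing tokens are
        # excluded as keys), so holes are filled from surrounding context.
        out, _ = self.attn(tokens, tokens, tokens, key_padding_mask=missing)
        pred = self.to_pixels(out)
        # Keep known pixels; substitute predictions inside the holes.
        filled = torch.where(missing.unsqueeze(-1), pred, patches)
        # Reassemble the patch grid into an image.
        h, w = H // p, W // p
        filled = filled.reshape(B, h, w, C, p, p).permute(0, 3, 1, 4, 2, 5)
        return filled.reshape(B, C, H, W)

Under these assumptions, a standard inpainting network would receive filler(img, mask) in place of the raw masked image; in the proposed methodology the token generator is a pre-trained ViT with a pre-defined patch size rather than the single attention layer used in this sketch.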

