A Transformer-Based Approach with Contextual Position Encoding for Robust Persian Text Recognition in the wild

Raisi, Zobeir; Nazarzehi, Vali Mohammad

doi:10.22044/jadm.2024.14669.2569

Document Type: Original/Review Paper

Authors

Electrical Engineering Department, Chabahar Maritime University, Chabahar, Iran.

Abstract

The Persian language presents unique challenges for scene text recognition due to its distinctive script. Despite advances in AI, recognition of non-Latin scripts such as Persian remains difficult. In this paper, we extend the vanilla transformer architecture to recognize arbitrarily shaped Persian text instances. We apply Contextual Position Encoding (CPE) to the baseline transformer architecture to improve the recognition of Persian script in wild images, especially for oriented and spaced characters. The CPE uses position information to generate contrastive data pairs that help capture Persian characters written in different directions. Moreover, we evaluate several state-of-the-art deep-learning models on our challenging Persian scene text recognition dataset and develop a transformer-based architecture to enhance recognition accuracy. Our proposed scene text recognition architecture achieves higher word recognition accuracy than existing methods on a real-world Persian text dataset.
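The word recognition accuracy mentioned in the abstract is commonly computed as the fraction of cropped word images whose predicted string exactly matches the ground-truth label. The following is a minimal sketch under that assumption, not the authors' evaluation code: the function names, the NFC normalization step, and the sample strings are all hypothetical choices made for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and strip surrounding whitespace.

    This preprocessing is an assumption for illustration; the paper's
    actual normalization rules are not stated in the abstract.
    """
    return unicodedata.normalize("NFC", text).strip()


def word_recognition_accuracy(predictions, ground_truths):
    """Return the fraction of predictions that exactly match their labels."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have the same length")
    if not ground_truths:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(label)
        for pred, label in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)


if __name__ == "__main__":
    # Hypothetical recognizer outputs and labels (not taken from the paper's dataset).
    labels = ["تهران", "بانک", "مدرسه"]
    preds = ["تهران", "بانک", "مدرس"]
    print(word_recognition_accuracy(preds, labels))
```

Exact-match scoring is strict: a single wrong character makes the whole word count as an error, which is why benchmarks often report an edit-distance-based score alongside it.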

Main Subjects

Document and Text Processing

Volume 12, Issue 3
July 2024
Pages 455-464