Document Type: Review Article

Authors

Department of Computer Engineering, Science and Research SR.C., Islamic Azad University, Tehran, Iran.

DOI: 10.22044/jadm.2025.15584.2679

Abstract

Attention mechanisms have significantly advanced machine learning and deep learning across domains including natural language processing, computer vision, and multimodal systems. This paper presents a comprehensive survey of attention mechanisms in Transformer architectures, emphasizing their evolution, design variants, and domain-specific applications in NLP, computer vision, and multimodal learning. We categorize attention types by their design goals, such as efficiency, scalability, and interpretability, and provide a comparative analysis of their strengths, limitations, and suitable use cases. The survey also addresses the lack of visual intuition in existing overviews by offering a clearer taxonomy, including hybrid approaches such as sparse-hierarchical combinations. Beyond foundational mechanisms, we highlight theoretical underpinnings and practical trade-offs, identify current challenges in computation, robustness, and transparency, and propose future directions within a structured classification. By comparing state-of-the-art techniques, this survey aims to guide researchers in selecting and designing attention mechanisms best suited to specific AI applications, ultimately fostering the development of more efficient, interpretable, and adaptable Transformer-based models.
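
For orientation, the sketch below shows the standard scaled dot-product self-attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, which the surveyed variants build on or modify. It is a minimal NumPy illustration written for this summary, not code from any of the surveyed works; the dimensions and function name are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q and K have shape (seq_len, d_k); V has shape (seq_len, d_v).
    Illustrative only: real Transformer layers add masking, batching,
    and multiple heads on top of this core computation.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over keys
    return weights @ V                              # each output is a weighted sum of values

# Toy usage with hypothetical sizes: 5 tokens, model dimension 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (5, 8)
```

Because the score matrix is seq_len x seq_len, memory and compute grow quadratically with sequence length; this is precisely the cost that the sparse, linear, and hierarchical attention variants discussed in the survey aim to reduce.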

