[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention Is All You Need," Advances in Neural Information Processing Systems (NeurIPS), vol. 30, Long Beach, CA, USA, 2017.
[2] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd International Conference on Learning Representations (ICLR 2015), 2015.
[3] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, "Linformer: Self-attention with linear complexity," arXiv: 2006.04768, 2020.
[4] P. Shaw, J. Uszkoreit, and A. Vaswani, "Self-attention with relative position representations," in NAACL, 2018.
[5] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," in 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019). ACL, 2019, pp. 2978–2988.
[6] R. Child, S. Gray, A. Radford, and I. Sutskever, "Generating long sequences with sparse transformers," arXiv: 1904.10509, 2019.
[7] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv: 1810.04805, 2018.
[8] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[9] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, "Transformers are RNNs: Fast autoregressive transformers with linear attention," in International Conference on Machine Learning. PMLR, 2020.
[10] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed, "Big Bird: Transformers for longer sequences," Advances in Neural Information Processing Systems, vol. 33, pp. 17283–17297, 2020.
[11] W. Fedus, B. Zoph, and N. Shazeer, "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity," Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022.
[12] J. Ho, N. Kalchbrenner, D. Weissenborn, and T. Salimans, "Axial attention in multidimensional transformers," arXiv: 1912.12180, 2019.
[13] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, "Dual attention network for scene segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv: 2010.11929, 2020.
[15] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[16] H. Wu, B. Xiao, N. Codella, and M. Liu, "CvT: Introducing convolutions to vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[17] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," Advances in Neural Information Processing Systems, vol. 34, pp. 12077–12090, 2021.
[18] Z. Peng, W. Huang, S. Gu, L. Xie, Y. Wang, J. Jiao, and Q. Ye, "Conformer: Local features coupling global representations for visual recognition," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[19] L. Meng, H. Li, B. Chen, S. Lan, Z. Wu, Y. Jiang, and S. Lim, "AdaViT: Adaptive vision transformers for efficient image recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[20] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, and C. Feichtenhofer, "Multiscale vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[21] H. Lin, X. Cheng, X. Wu, F. Yang, D. Shen, Z. Wang, Q. Song, and W. Yuan, "CAT: Cross attention in vision transformer," in 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2022.
[22] Y. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. Morency, and R. Salakhutdinov, "Multimodal transformer for unaligned multimodal language sequences," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), 2019.
[23] T. Munkhdalai, M. Faruqui, and S. Gopal, "Leave no context behind: Efficient infinite context transformers with infini-attention," arXiv: 2404.07143, 2024.
[24] J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh, "Set transformer: A framework for attention-based permutation-invariant neural networks," in International Conference on Machine Learning. PMLR, 2019.
[25] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, "RoFormer: Enhanced transformer with rotary position embedding," Neurocomputing, vol. 568, p. 127063, 2024.
[26] D. Heo and H. Choi, "Generalized Probabilistic Attention Mechanism in Transformers," arXiv: 2410.15578, 2024.
[27] R. Zhang, Y. Zou, and J. Ma, "Hyper-SAGNN: a self-attention based graph neural network for hypergraphs," arXiv: 1911.02613, 2019.
[28] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D. Yeung, "GaAN: Gated attention networks for learning on large and spatiotemporal graphs," arXiv: 1803.07294, 2018.
[29] K. Choromanski, V. Likhosherstov, D. Dohan, and X. Song, "Rethinking attention with performers," arXiv: 2009.14794, 2020.
[30] N. Kitaev, Ł. Kaiser, and A. Levskaya, "Reformer: The efficient transformer," in 8th International Conference on Learning Representations (ICLR 2020), 2020.
[31] H. Chen, G. Ding, X. Liu, Z. Lin, J. Liu, and J. Han, "IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[32] A. Mohtashami and M. Jaggi, "Landmark attention: Random-access infinite context length for transformers," arXiv: 2305.16300, 2023.
[33] R. Sanovar, S. Bharadwaj, R. S. Amant, V. Rühle, and S. Rajmohan, "Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers," arXiv: 2405.10480, 2024.
[34] A. Roy, M. Saffar, A. Vaswani, and D. Grangier, "Efficient content-based sparse attention with routing transformers," Transactions of the Association for Computational Linguistics, vol. 9, pp. 53–68, 2021.
[35] I. Beltagy, M. E. Peters, and A. Cohan, "Longformer: The long-document transformer," arXiv: 2004.05150, 2020.
[36] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
[37] F. Wu, A. Fan, A. Baevski, Y. N. Dauphin, and M. Auli, "Pay less attention with lightweight and dynamic convolutions," arXiv: 1901.10430, 2019.
[38] C. Wu, Y. Li, K. Mangalam, H. Fan, B. Xiong, J. Malik, and C. Feichtenhofer, "MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[39] Y. Li, C. Wu, H. Fan, K. Mangalam, B. Xiong, J. Malik, and C. Feichtenhofer, "MViTv2: Improved multiscale vision transformers for classification and detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[40] Y. Tay, D. Bahri, D. Metzler, D. Juan, Z. Zhao, and C. Zheng, "Synthesizer: Rethinking self-attention for transformer models," in International Conference on Machine Learning. PMLR, 2021.
[41] P. Xu, X. Zhu, and D. A. Clifton, "Multimodal learning with transformers: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12113–12132, 2023.