Document Type: Original/Review Paper

Authors

Faculty of Electrical & Computer Engineering, Malek Ashtar University of Technology, Iran.

DOI: 10.22044/jadm.2025.15873.2701

Abstract

Word Sense Disambiguation (WSD) is a longstanding challenge in natural language processing, particularly in morphologically rich, low-resource languages such as Persian. The inherent ambiguity of Persian named entities, exacerbated by domain-specific contexts and limited labeled data, complicates both semantic interpretation and information extraction. In this study, we introduce the PWNC corpus, a large-scale, integrated dataset designed for both Named Entity Recognition (NER) and WSD in Persian. The corpus was constructed automatically through a semi-supervised framework that uses contextual similarity measures and clustering algorithms to annotate ambiguous entities across ten semantic categories. The proposed homograph semantic categorization method achieved a precision of 83%, a recall of 81%, and an F1-score of 82% on more than 305K annotated paragraphs. Detailed error analysis revealed difficulties in disambiguating closely related senses and weak entities, which were mitigated through contextual embedding strategies. This work provides the first publicly available dual-task corpus for Persian NER and WSD, offering a scalable approach to disambiguation in low-resource settings and establishing a baseline for future research in Persian semantic processing.
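As a rough illustration of how contextual similarity can drive semantic categorization, the sketch below assigns an ambiguous mention to the category whose seed centroid is most similar to the mention's context vector, and then checks that the reported F1-score follows from the reported precision and recall. This is not the authors' pipeline: the category labels, the three-dimensional toy vectors, the 0.5 threshold, and the function names are all assumptions made for the example, and a real system would use contextual embeddings from a pretrained language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assign_category(context_vec, seeds, threshold=0.5):
    """Return the label of the most similar seed centroid.

    Returns None when no centroid is more similar than `threshold`
    (a hypothetical cut-off), leaving the mention unlabeled.
    """
    best_label, best_score = None, threshold
    for label, vectors in seeds.items():
        score = cosine(context_vec, centroid(vectors))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical seed contexts for two of the semantic categories.
seeds = {
    "PERSON": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "LOCATION": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.2]],
}

print(assign_category([0.85, 0.15, 0.05], seeds))  # PERSON
print(assign_category([0.1, 0.1, 0.9], seeds))     # None (below threshold)
print(round(f1_score(0.83, 0.81), 2))              # 0.82
```

The last line confirms that the three reported figures are mutually consistent: the harmonic mean of 0.83 and 0.81 rounds to 0.82.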

Keywords

Main Subjects
