H.3.8. Natural Language Processing
Arash Keshtkar; Saeedeh Sadat Sadidpour; Hossien Shirazi
Abstract
Word Sense Disambiguation (WSD) is a longstanding challenge in natural language processing, particularly in morphologically rich and low-resource languages such as Persian. The inherent ambiguity of Persian named entities, exacerbated by domain-specific contexts and limited labeled data, complicates both semantic interpretation and information extraction. In this study, we introduce the PWNC corpus, a large-scale, integrated dataset designed for both Named Entity Recognition (NER) and WSD in Persian. The corpus was constructed automatically through a semi-supervised framework that uses contextual similarity measures and clustering algorithms to annotate ambiguous entities across ten semantic categories. The proposed homograph semantic categorization method achieved a precision of 83%, a recall of 81%, and an F1-score of 82% across more than 305K annotated paragraphs. Detailed error analysis revealed challenges in disambiguating closely related senses and weak entities, which were mitigated through contextual embedding strategies. This work provides the first publicly available dual-task corpus for Persian NER and WSD, offering a scalable solution for disambiguation in low-resource settings and establishing a baseline for future research in Persian semantic processing.
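As a quick consistency check on the reported metrics, the F1-score can be recomputed from the stated precision and recall using the standard harmonic-mean definition. This is only an illustrative sketch: the function and variable names are hypothetical, and it assumes the reported F1 is the harmonic mean of the reported precision and recall, which the abstract does not state explicitly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Values taken from the abstract (83% precision, 81% recall).
reported_precision = 0.83
reported_recall = 0.81

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}")  # prints "F1 = 0.82"
```

Under that assumption, the recomputed value rounds to the 82% reported above.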