Mehr: A Persian Coreference Resolution Corpus

Haji Mohammadi, Hassan; Talebpour, Alireza; Mahmoudi Aznaveh, Ahmad; Yazdani, Samaneh

doi:10.22044/jadm.2023.12641.2418

Document Type: Original/Review Paper

Authors

¹ Department of Computer Engineering, North Tehran Branch, Islamic Azad University, Tehran, Iran.

² Department of Computer Engineering, Shahid Beheshti University, Tehran, Iran.

Abstract

Coreference resolution is an essential task in natural language
processing: it identifies all expressions in a text that refer to the
same real-world entity. It supports other natural language processing
applications, such as information extraction, machine translation, and
question answering.
This article presents Mehr, a new Persian coreference resolution
corpus. Its primary goal is to address some shortcomings of previous
Persian corpora while maintaining high inter-annotator agreement. The
corpus annotates coreference relations for noun phrases, named
entities, pronouns, and nested named entities. Two baseline pronoun
resolution systems are developed, and their results are reported. The
corpus comprises 400 documents and about 170k tokens. Annotation was
carried out with the WebAnno tool.
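
As a purely illustrative sketch of what a coreference annotation encodes, the Python snippet below groups mention spans into clusters, one cluster per entity. The English example sentence, the span convention, and the helper names (`mention_text`, `build_index`, `corefer`) are assumptions made for illustration; they are not the Mehr corpus format or the authors' tooling.

```python
from typing import Dict, List, Tuple

# A mention is a token span: (start, end), with end exclusive.
Span = Tuple[int, int]

# Hypothetical example sentence (not taken from the corpus).
tokens = ["Sara", "said", "she", "lost", "her", "book", "."]

# Hypothetical annotation: each inner list holds the mentions of one entity.
clusters: List[List[Span]] = [
    [(0, 1), (2, 3), (4, 5)],  # "Sara", "she", "her"
    [(4, 6)],                  # "her book", a mention that contains "her"
]


def mention_text(span: Span) -> str:
    """Return the surface text covered by a mention span."""
    start, end = span
    return " ".join(tokens[start:end])


def build_index(cluster_list: List[List[Span]]) -> Dict[Span, int]:
    """Map every mention span to the id of the cluster it belongs to."""
    return {span: cid for cid, spans in enumerate(cluster_list) for span in spans}


def corefer(a: Span, b: Span, index: Dict[Span, int]) -> bool:
    """Two mentions corefer when both are annotated and share a cluster id."""
    return a in index and b in index and index[a] == index[b]


index = build_index(clusters)
for cid, spans in enumerate(clusters):
    print(cid, [mention_text(s) for s in spans])
```

Here the pronouns "she" and "her" share a cluster with "Sara", while the larger phrase "her book" forms its own cluster even though it contains another mention.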

Keywords

20.1001.1.23225211.2023.11.3.6.5

Main Subjects

H.3. Artificial Intelligence


Volume 11, Issue 3
July 2023
Pages 407-416