ParsNER-Social: A Corpus for Named Entity Recognition in Persian Social Media Texts

Asgari-Bidhendi, M.; Janfada, B.; Roshani Talab, O. R.; Minaei-Bidgoli, B.

doi:10.22044/jadm.2020.9949.2143

Document Type: Original/Review Paper

Authors

¹ Computer Engineering School, Iran University of Science and Technology, Tehran, Iran.

² School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran.

https://doi.org/10.22044/jadm.2020.9949.2143

Abstract

Named Entity Recognition (NER) is an essential prerequisite for many natural language processing tasks. All public corpora for Persian named entity recognition, such as ParsNERCorp and ArmanPersoNERCorpus, are based on the Bijankhan corpus, which originated from the Hamshahri newspaper in 2004. Consequently, most published Persian named entity recognition models are tuned specifically for news data and are not flexible enough to be applied to other text categories, such as social media texts. This study introduces ParsNER-Social, a corpus built from social media sources for training named entity recognition models in the Persian language. The corpus consists of 205,373 tokens and their NER tags, crawled from social media content, including 10 Telegram channels in 10 different categories. Furthermore, three supervised methods are trained and evaluated on the ParsNER-Social corpus: two conditional random field models as baselines and one state-of-the-art deep learning model with six different configurations. The experiments show that the Mono-Lingual Persian models based on Bidirectional Encoder Representations from Transformers (MLBERT) outperform the other approaches on the ParsNER-Social corpus. Among the different configurations of MLBERT models, the ParsBERT+BERT-TokenClass model obtained an F1-score of 89.65%.
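To make the reported metric concrete, the following minimal sketch shows how an entity-level F1-score can be computed from gold and predicted tag sequences. It rests on assumptions that the abstract does not confirm: that tags follow the IOB2 scheme (`B-`, `I-`, `O`) and that a predicted entity counts as correct only when both its type and its boundaries match exactly. The function names and the `PER`/`LOC` labels are hypothetical and chosen for illustration; this is not presented as the authors' evaluation script.

```python
from typing import List, Optional, Set, Tuple

# (entity type, start index, end index exclusive); hypothetical helper type
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one sentence, assuming IOB2 tags."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues_span = tag.startswith("I-") and label == tag[2:]
        if continues_span:
            continue
        if label is not None:
            spans.add((label, start, i))  # close the span that was open
        if tag.startswith("B-"):
            start, label = i, tag[2:]  # open a new span
        else:
            start, label = None, None  # "O" or a stray "I-" tag: no open span
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged entity-level F1 over a list of sentences."""
    true_pos = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = extract_spans(gold_tags)
        pred_spans = extract_spans(pred_tags)
        true_pos += len(gold_spans & pred_spans)  # exact type and boundary match
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: one sentence with two gold entities, one of them found.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5
```

Under exact-match scoring, a prediction that recovers only part of a multi-token entity earns no credit, which is one reason scores on noisy social media text tend to be sensitive to boundary errors.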

Keywords

20.1001.1.23225211.2021.9.2.5.8

Volume 9, Issue 2, April 2021, Pages 181-192
