A Transformer-based Approach for Persian Text Chunking

Kavehzadeh, P.; Abdollah Pour, M. M.; Momtazi, S.

doi:10.22044/jadm.2022.11035.2250

Document Type: Original/Review Paper

Authors

Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran.


Abstract

Over the last few years, text chunking has played a significant role among sequence labeling tasks. Although a wide variety of methods have been proposed for shallow parsing in English, most approaches proposed for text chunking in Persian rely on simple, traditional techniques. In this paper, we propose using state-of-the-art transformer-based contextualized models, namely BERT and XLM-RoBERTa, as the backbone of our models. To improve chunk label prediction, we place one of three output layers on top of the transformer-based models: a Conditional Random Field (CRF), a combination of Bidirectional Long Short-Term Memory (BiLSTM) and CRF, or a simple dense layer. Moreover, we provide a new dataset for noun phrase chunking in Persian, consisting of annotated Persian news text. Our experiments show that XLM-RoBERTa achieves the best performance among all the architectures evaluated on the proposed dataset. The results also show that a single CRF layer yields better results than a dense layer, and even than the combination of BiLSTM and CRF.
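As a rough illustration of why a CRF layer might help over a dense layer, the sketch below contrasts per-token argmax decoding with Viterbi decoding over a toy label set. The BIO labels (B-NP, I-NP, O), the emission scores, and the transition penalty are all assumptions made up for illustration. None of them are taken from the paper, where the per-token scores would come from BERT or XLM-RoBERTa and the transition scores would be learned.

```python
# Illustrative sketch only: the label set and all scores below are assumptions,
# not values or code from the paper.
LABELS = ["B-NP", "I-NP", "O"]


def greedy_decode(emissions):
    """Dense-layer style decoding: pick the best label for each token independently."""
    return [max(LABELS, key=lambda lab: scores[lab]) for scores in emissions]


def viterbi_decode(emissions, transitions):
    """CRF-style decoding: return the label sequence with the highest total score,
    where the total sums per-token emission scores and label-to-label transition scores."""
    # best[i][lab] = score of the best path that ends in `lab` at token i
    best = [{lab: emissions[0][lab] for lab in LABELS}]
    back = []  # back[i][lab] = previous label on that best path
    for scores in emissions[1:]:
        row, ptr = {}, {}
        for cur in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + transitions[(p, cur)])
            row[cur] = best[-1][prev] + transitions[(prev, cur)] + scores[cur]
            ptr[cur] = prev
        best.append(row)
        back.append(ptr)
    # Trace back from the best final label to recover the full path.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical per-token scores for a three-token phrase.
emissions = [
    {"B-NP": 2.0, "I-NP": 0.1, "O": 1.0},
    {"B-NP": 0.2, "I-NP": 1.0, "O": 1.2},
    {"B-NP": 0.3, "I-NP": 1.5, "O": 1.0},
]

# Hypothetical transition scores: everything is neutral except O -> I-NP,
# which is heavily penalized because a chunk cannot continue after "outside".
transitions = {(p, c): 0.0 for p in LABELS for c in LABELS}
transitions[("O", "I-NP")] = -10.0

print(greedy_decode(emissions))                # ['B-NP', 'O', 'I-NP']
print(viterbi_decode(emissions, transitions))  # ['B-NP', 'I-NP', 'I-NP']
```

In this toy case the independent per-token choice produces an I-NP directly after an O, which is not a well-formed chunk sequence, while the joint decoding avoids it. This is one plausible intuition for the reported advantage of the CRF layer, not a reproduction of the paper's experiments.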

Keywords

20.1001.1.23225211.2022.10.3.7.9

Main Subjects

H.3.8. Natural Language Processing



Volume 10, Issue 3, July 2022, Pages 373-383