H.3.8. Natural Language Processing
Mohammad Hadi Goldani; Saeedeh Momtazi; Reza Safabakhsh
Abstract
The widespread use of web-based forums and social media has led to an increase in news consumption. To mitigate the impact of misinformation on users' health-related decisions, it is crucial to develop machine learning models that can automatically detect and combat fake news. In this paper, we propose Hybrid CapsNet, a novel multilingual model with a dynamic transformer for Covid-19 fake news detection in English and Persian. Our model incorporates two dynamic pre-trained representation models that incrementally uptrain and update the word embeddings during the training phase, dynamic RoBERTa for English and dynamic ParsBERT for Persian, together with two parallel classifiers trained with a new loss function, namely margin loss. By combining a dynamic transformer with both Deep Convolutional Neural Networks (DCNN) and Capsule Neural Networks (CapsNet), we achieve better performance than state-of-the-art baselines. To evaluate the proposed model, we use two recent Covid-19 datasets in English and Persian. Our results, in terms of F1-score, demonstrate the effectiveness of the Hybrid CapsNet model: it outperforms existing baselines, suggesting that it can be an effective tool for detecting and combating fake news related to Covid-19 in multiple languages. Overall, our study highlights the importance of developing effective machine learning models for combating misinformation during critical events such as the Covid-19 pandemic. The proposed model has the potential to be applied to other languages and domains and can be a valuable tool for protecting public health and safety.
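The margin loss mentioned in the abstract is, in the capsule-network literature, the per-class loss introduced by Sabour et al. (2017). A minimal NumPy sketch of that standard formulation is below; the hyperparameter values (`m_pos`, `m_neg`, `lam`) follow the original capsule paper and are an assumption here, not values reported by this abstract.

```python
import numpy as np

def margin_loss(lengths, labels, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Capsule-network margin loss (Sabour et al., 2017).

    lengths: (batch, n_classes) capsule output lengths in [0, 1],
             interpreted as class-presence probabilities
    labels:  (batch, n_classes) one-hot targets
    """
    # Penalize a present class whose capsule length falls below m_pos ...
    pos = labels * np.maximum(0.0, m_pos - lengths) ** 2
    # ... and an absent class whose capsule length rises above m_neg.
    neg = lam * (1.0 - labels) * np.maximum(0.0, lengths - m_neg) ** 2
    return (pos + neg).sum(axis=1).mean()
```

A confident correct prediction (length 0.95 for the true class, 0.05 for the other) incurs zero loss, while an inverted prediction is penalized.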
Document and Text Processing
A.R. Mazochi; S. Bourbour; M. R. Ghofrani; S. Momtazi
Abstract
Converting a postal address to a coordinate, geocoding, is a helpful tool in many applications. Developing a geocoder tool is a difficult task if this tool relates to a developing country that does not follow a standard addressing format. The lack of complete reference data and the non-persistency of names are the main challenges, besides the common natural language processing challenges. In this paper, we propose a geocoder for Persian addresses. To the best of our knowledge, our system, TehranGeocode, is the first geocoder for this language. Considering the non-standard structure of Persian addresses, we need to split the address into small segments, find each segment in the reference dataset, and connect the segments to find the target of the address. We develop our system based on address parsing and dynamic programming to this end. We specify the contribution of our work compared to similar studies. We discuss the main components of the program, its data, and its results, and show that the proposed framework achieves promising results in the field by finding 83% of addresses with less than 300 meters of error.
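The split-match-connect idea can be illustrated with a toy dynamic program: choose the token segmentation that matches the most reference entries, and return the coordinate of the most specific matched segment. This is only a sketch of the general approach the abstract describes, not TehranGeocode's actual algorithm; the place names and coordinates in `REFERENCE` are hypothetical.

```python
from functools import lru_cache

# Toy reference data: known place names mapped to (lat, lon).
# Names and coordinates are hypothetical examples.
REFERENCE = {
    "tehran": (35.69, 51.39),
    "valiasr street": (35.75, 51.41),
    "azadi square": (35.70, 51.34),
}

def geocode(address):
    """Split the address into token runs, match runs against the
    reference data, and use dynamic programming to pick the
    segmentation covering the most tokens. Returns the coordinate of
    the last (most specific) matched segment, or None."""
    tokens = address.lower().replace(",", " ").split()
    n = len(tokens)

    @lru_cache(maxsize=None)
    def best(i):
        # best(i) -> (matched_token_count, coordinate of last match)
        if i == n:
            return (0, None)
        # Option 1: skip this token (unmatched filler, e.g. a plaque number).
        score, coord = best(i + 1)
        # Option 2: match a multi-token run tokens[i:j] in the reference data.
        for j in range(i + 1, n + 1):
            name = " ".join(tokens[i:j])
            if name in REFERENCE:
                sub_score, sub_coord = best(j)
                cand = (j - i) + sub_score
                if cand > score:
                    # Prefer a later segment's coordinate: Persian
                    # addresses run from general to specific.
                    score, coord = cand, sub_coord or REFERENCE[name]
        return (score, coord)

    return best(0)[1]
```

For "Tehran, Valiasr Street, No. 12" the DP matches "tehran" and "valiasr street", skips the house-number tokens, and returns the street's coordinate.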
H.3.8. Natural Language Processing
P. Kavehzadeh; M. M. Abdollah Pour; S. Momtazi
Abstract
Over the last few years, text chunking has played a significant role in sequence labeling tasks. Although a large variety of methods have been proposed for shallow parsing in English, most proposed approaches for text chunking in the Persian language are based on simple and traditional concepts. In this paper, we propose using state-of-the-art transformer-based contextualized models, namely BERT and XLM-RoBERTa, as the major structure of our models. A Conditional Random Field (CRF), the combination of Bidirectional Long Short-Term Memory (BiLSTM) and CRF, and a simple dense layer are employed on top of the transformer-based models to enhance performance in predicting chunk labels. Moreover, we provide a new dataset for noun phrase chunking in Persian, which includes annotated Persian news text. Our experiments reveal that XLM-RoBERTa achieves the best performance among all the architectures tried on the proposed dataset. The results also show that a single CRF layer yields better results than a dense layer, and even than the combination of BiLSTM and CRF.
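At inference time, a CRF layer on top of a transformer encoder selects the chunk-label sequence that maximizes the sum of per-token emission scores and label-transition scores, typically via Viterbi decoding. The sketch below shows that decoding step alone, with plain NumPy arrays standing in for the encoder outputs and learned transition matrix; it is an illustration of the standard CRF decoding, not this paper's exact implementation.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label sequence for one sentence.

    emissions:   (seq_len, n_labels) per-token scores, e.g. a
                 transformer encoder output after a linear projection
    transitions: (n_labels, n_labels) CRF transition scores, where
                 transitions[i, j] scores moving from label i to j
    """
    seq_len, n_labels = emissions.shape
    score = emissions[0].copy()   # best score ending in each label so far
    back = np.zeros((seq_len, n_labels), dtype=int)
    for t in range(1, seq_len):
        # cand[i, j] = score of ending at t-1 in i, then emitting j at t
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # Follow back-pointers from the best final label.
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With zero transition scores the decoder just follows the per-token argmax, while a strong penalty on label switches keeps the sequence in one label, which is exactly the smoothing effect a CRF layer adds over a plain dense layer.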
Document and Text Processing
S. Momtazi; A. Rahbar; D. Salami; I. Khanijazani
Abstract
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text has motivated researchers to use semantic models for document vector representations. Latent Dirichlet allocation (LDA) topic modeling and doc2vec neural document embedding are two well-known techniques for this purpose. In this paper, we first study the conceptual difference between the two models and show that they have different behavior and capture semantic features of texts from different perspectives. We then propose a hybrid approach for document vector representation that benefits from the advantages of both models. The experimental results on the 20 Newsgroups dataset show the superiority of the proposed model over each of the baselines on both text clustering and classification tasks. We achieved a 2.6% improvement in F-measure for text clustering and a 2.1% improvement in F-measure for text classification compared to the best baseline model.
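One plausible way to combine the two representations, sketched below, is to L2-normalize the LDA topic proportions and the doc2vec embedding separately and concatenate them, so that distance computations weight the two views comparably. This is an assumption about the fusion scheme for illustration; the paper's exact combination may differ.

```python
import numpy as np

def hybrid_vector(lda_topics, doc2vec_emb):
    """Build a hybrid document vector from an LDA topic-proportion
    vector and a doc2vec embedding: normalize each part to unit L2
    norm, then concatenate. (A sketch of one possible combination.)"""
    def l2(v):
        v = np.asarray(v, dtype=float)
        n = np.linalg.norm(v)
        return v / n if n > 0 else v
    return np.concatenate([l2(lda_topics), l2(doc2vec_emb)])
```

The resulting vector has the combined dimensionality of the two inputs, with each half contributing equally to cosine or Euclidean distances used by the downstream clustering or classification algorithm.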